On-Chip Intelligent Dynamic Frequency Scaling for Real-Time Systems

A benchmark kit called MiBench was used to verify the utility of the suggested technique.

Introduction
In an era where compact, mobile devices are replacing conventional wired electronics with heavy batteries, energy conservation is essential for reducing unnecessary power loss and heat dissipation and, in a nutshell, improving battery life. But in the effort to do so, the focus has shifted away from user satisfaction, i.e. away from preserving the performance a user needs for comfortable use. As the demand for high-performance CPUs increases, chip makers compete for leadership by mere 1-2 percent margins in operating frequency over their counterparts. Conscious effort is therefore needed to avoid a large trade-off between power and performance when using Dynamic Voltage and Frequency Scaling (DVFS) schemes.

In this paper, we propose an on-chip machine-learning-based frequency selection method that intelligently predicts the performance demand from the CPU utilization and the Memory Access Rate (MAR, the ratio of cache misses to the total number of instructions executed) [1], while achieving significant power savings at the same time. Most of the earlier work in this sphere focuses on either CPU-intensive or memory-intensive operations, but not both; in reality, when a user operates a device, both kinds of applications are accessed. We therefore propose a technique that addresses this more practical need: the CPU utilization tells the ANN how heavily the user engages the processor, while the MAR drives the frequency prediction for programs that access memory heavily, such as image and video processing or those with high I/O activity [12].

In previous mathematical models of DVFS, a crucial issue has been time complexity. Rangan et al. [2] first posited that the sampling intervals of the Operating System (OS) scheduler are in the range of milliseconds, while the computational requirements may change in nanoseconds due to CPU events such as cache misses. Existing DVFS techniques therefore often fail to respond to such fine-grained program behavior: the increased time complexity delays the frequency-selection decision, reducing its effectiveness. In some previous implementations, frequency scaling also relies on explicit user input to gauge user satisfaction, delaying the decision further. To address these issues, a lightweight artificial neural network, the Kohonen Self-Organizing Map, is used for quicker frequency selection. In the proposed implementation, the pre-trained network runs on-chip, reducing latency and power overhead considerably; since no data needs to be communicated to the cloud, inference happens much faster on-site. Our approach recasts frequency scaling as a classification problem: classify the user behavior into the frequency level that gives the highest performance per watt.

The proposed technique has been implemented and evaluated on an Intel i7-4720HQ Haswell processor, but can be extended to any computer or edge processor. It has been compared with the processor's existing DVFS technique (the ondemand governor in a Linux-based OS) on the MiBench CPU benchmark suite. The experimental results suggest that the proposed frequency-selection algorithm can achieve up to a 20% performance boost and save up to 16% SoC power; compared with the existing DVFS implementation, it can deliver up to a 30% improvement in performance per watt.

The paper is structured as follows. Section 2 describes related work in this field, while Section 3 explains the proposed algorithm. Section 4 focuses on the on-chip implementation of the neural network, Section 5 contains the experiments and verification of the methodology, and Section 6 concludes our analysis and hints at future work on this project.

Related Works
ANN-based DVFS techniques proposed in the past have focused mainly on power savings alone, without taking the performance impact into account, and these implementations primarily cater to CPU-intensive operations. In [3], Lahiri et al. proposed a back-propagation neural network model that predicts the future computational load to select the frequency intelligently; this method delivered 20% power savings, but no performance boost. To optimize expected energy per user-instruction, Moeng and Melhem [4] employed a decision tree algorithm. Qingchen [5] proposed a deep Q-learning model that forms a hybrid of various DVFS technologies for saving power. Jung and Pedram [6] implemented a supervised-learning-based DVFS technique that predicts the system performance state from predefined input features, which is then used to judge the optimal power policy. Tesauro et al. [7] proposed a reinforcement learning approach that saves power at the cost of minor performance loss. Further, certain implementations focus only on MAR [1], saving power for memory-intensive activities alone. In [12], a feedback-based DVFS scheme was proposed for memory-bound applications that monitors cache misses in a software feedback loop; our methodology aims to reduce the latency of such algorithmic approaches by using artificial intelligence instead. Thus, related works in this sphere primarily save power (at the cost of performance) for either CPU-intensive or memory-intensive applications, but not both.

In this paper, we propose a very lightweight unsupervised-learning-based on-chip frequency-selection technique with negligible power overhead and latency, so it boosts performance and saves power simultaneously. The selected neural network can take multiple inputs without much added complexity, and is therefore able to take both CPU utilization and Memory Access Rate (MAR) into account. Some previous work in this sphere has also relied on explicit input from the user to express user satisfaction, for example HAPPE [8] (Human and Application-Driven Frequency Scaling for Processor Power Efficiency). In the proposed implementation, we make the system more implicit: once the ANN is trained, it is independent of any future user input, which reduces human error and bias in the prediction while also reducing the delay in decision-making.

CPU and Memory-aware Frequency Scaling Approach
In the proposed implementation, we consider CPU utilization and MAR to be the impacting factors. The CPU utilization of a process or program varies not only with its complexity but also with how the user uses it; a simple Microsoft Word document, for example, is used for varying lengths of time and at varying speeds depending on the user's traits. Similarly, monitoring cache misses is important for determining the operating frequency in practical applications where memory and I/O accesses are frequent. Both of these parameters form the basis of prediction in our methodology. The raw data required to train the network is unlabeled, which makes it impossible to fit a direct input-to-output relationship; thus, to find patterns in the raw, unlabeled data and cluster them, we needed an unsupervised classification algorithm. The Kohonen Self-Organizing Map [9] fits this criterion: inputs proximal to each other are mapped to nearby or identical clusters. The network and its applications are explained further in [15]. The proposed algorithm has two stages: offline training and online inference.

Offline training
The training of the network takes place offline, i.e. not on-chip. Figure 1 describes the training algorithm in detail. In the figure, 'lr' denotes the learning rate, 'lr_limit' denotes the minimum value to which the learning rate may decay before training stops, and 'alpha' is the damping factor for the learning rate.
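The learning-rate schedule described above can be sketched as follows. The paper does not give the actual values of lr, alpha, or lr_limit, so the numbers here are purely illustrative; only the shape of the schedule (geometric decay until lr falls below lr_limit) follows the text.

```python
def lr_schedule(lr=0.5, alpha=0.9, lr_limit=0.01):
    """Yield the learning rate for each training epoch until it
    decays below lr_limit. lr, alpha, lr_limit are illustrative."""
    while lr >= lr_limit:
        yield lr
        lr *= alpha  # damp the learning rate by 'alpha' each epoch

# Training would run one pass over the data per yielded rate.
rates = list(lr_schedule())
```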
The entire range of user behavior, i.e. CPU utilization and MAR values from the lowest to the highest possible, is fed into the network as 2-dimensional input for offline training (in software). In our experiment, 500 data points covering all kinds of utilization/user behavior are used as training data. Figure 2 shows these points.
These input data points can be classified into n clusters based on their values. Here, we use 48 clusters, each assigned a 2-dimensional weight value (one dimension each for CPU utilization and MAR) at random before the commencement of training. During the training phase, each data point is taken one at a time, its Euclidean distance to all 48 2-dimensional weights is computed, the cluster whose weight is closest to the input is chosen as the winning cluster/node, and the winning weight is moved a fraction of the way toward the input data point. This process is repeated for all the data points. On completion of training, the input user behaviors (training data points) have been classified into 48 groups according to the amount of CPU and memory usage, with each group having a final weight value. Finally, after clustering, each kind of behavior (cluster) is assigned a frequency level. Note that this assignment does not involve any intelligent algorithm: 18 equally spaced frequency levels are chosen from 900 MHz to 2.6 GHz and assigned to the clusters in such a way that the cluster whose weight has the highest CPU utilization and the lowest MAR receives the highest CPU frequency, and so on. Figure 3 shows the allotment of the clusters. The rationale is that CPU utilization is directly proportional to the required CPU operating frequency, while from [1] we know that MAR is inversely proportional to the optimal CPU speed, i.e. if MAR increases, the CPU frequency must be reduced to gain the maximum performance per watt.
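The offline stage above can be sketched in software roughly as follows. The cluster count (48) and the 18 levels between 900 MHz and 2.6 GHz follow the text; the learning-rate values, the winner-take-all update rule (the standard Kohonen update w += lr * (x - w)), and the ranking score used to order clusters are illustrative assumptions, not the paper's exact procedure.

```python
import math
import random

N_CLUSTERS = 48

def train_som(data, lr=0.5, alpha=0.9, lr_limit=0.01, seed=0):
    """data: list of (cpu_util, mar) pairs, both normalized to [0, 1].
    Returns the 48 trained 2-D weight vectors."""
    rng = random.Random(seed)
    weights = [[rng.random(), rng.random()] for _ in range(N_CLUSTERS)]
    while lr >= lr_limit:
        for x in data:
            # Winning node: the cluster whose weight is closest (Euclidean).
            win = min(range(N_CLUSTERS),
                      key=lambda k: math.dist(x, weights[k]))
            # Move the winner a fraction of the way toward the input.
            weights[win] = [w + lr * (xi - w)
                            for w, xi in zip(weights[win], x)]
        lr *= alpha
    return weights

def assign_frequencies(weights):
    """Rank clusters so that high CPU utilization and low MAR map to
    the highest of 18 equally spaced levels from 900 MHz to 2.6 GHz."""
    levels = [900e6 + i * (2.6e9 - 900e6) / 17 for i in range(18)]
    # Score favors CPU-bound behavior: high utilization, low MAR.
    order = sorted(range(len(weights)),
                   key=lambda k: weights[k][0] - weights[k][1])
    freqs = [0.0] * len(weights)
    for rank, k in enumerate(order):
        # 48 clusters share 18 levels, so neighboring ranks reuse a level.
        freqs[k] = levels[rank * 18 // len(weights)]
    return freqs
```

Since there are more clusters (48) than frequency levels (18), several adjacent clusters necessarily share a level; the sketch assumes a simple rank-based many-to-one assignment.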

Online Inference
Once training is done, the inference circuit is deployed on the SoC, so the neural network requires no further communication with the cloud. The aim is now to select one of the 18 frequency levels based on user behavior. The network takes the CPU utilization and MAR of the processor as a 2-D input in real time, compares the input with the weight of each cluster, groups the input into the cluster whose weight value is closest to it (the winning node/cluster), and finally selects the frequency assigned to the winning cluster as the new CPU frequency. Thus, intelligent frequency selection is achieved for both the CPU-intensive and memory-intensive applications accessed by the user. Figure 4 further explains the inference.
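The inference step above amounts to a nearest-weight lookup. A minimal software model (the weights and frequencies below are illustrative placeholders, not the trained values, and this is not the on-chip RTL):

```python
import math

def select_frequency(sample, weights, freqs):
    """sample: (cpu_util, mar); weights: per-cluster 2-D weight vectors;
    freqs: frequency (Hz) assigned to each cluster during training."""
    # Winning node: cluster whose weight is nearest the live sample.
    win = min(range(len(weights)),
              key=lambda k: math.dist(sample, weights[k]))
    return freqs[win]

# Two toy clusters: one CPU-bound, one memory-bound.
weights = [(0.9, 0.1), (0.1, 0.9)]
freqs = [2.6e9, 900e6]
select_frequency((0.85, 0.05), weights, freqs)  # CPU-bound sample -> 2.6e9
```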

On-Chip Implementation of Inference
To measure the hardware cost, the inference circuit was implemented in Verilog and a netlist generated using Synopsys Design Compiler. Figure 5 shows the block-level circuit implementation, wherein the Euclidean distance between the input and each weight is calculated. A comparator compares the distances and delivers a 48-bit output: if the n-th cluster is the winning node, the n-th bit from the MSB is set to 1, while all others are set to 0. Also in the figure, fp_tc computes the 2's complement of the input, fp_adder is a 32-bit 2-input floating-point adder, and fp_mult is a 32-bit 2-input floating-point multiplier. Overhead cost data is given in the following section.
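The comparator's one-hot output encoding can be modelled in software as follows; this is a behavioural sketch of the encoding described above, not the Verilog netlist.

```python
def one_hot_winner(distances, width=48):
    """Return the comparator's one-hot word as an integer: for winning
    cluster n, bit n counted from the MSB of a 'width'-bit word is 1."""
    win = min(range(len(distances)), key=lambda k: distances[k])
    return 1 << (width - 1 - win)

# Cluster 1 has the smallest distance, so bit 1 from the MSB is set.
one_hot_winner([3.0, 0.5, 2.0])  # == 1 << 46
```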

Experiments and Results
The Linux kernel provides a modularized interface that aids in scaling the CPU frequency, known as the CPUfreq subsystem. In Linux, a policy manager called the governor controls the CPU frequency through the CPUfreq interface.
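As a rough illustration of this interface, a user-space prototype of such a governor could drive CPUfreq through its sysfs files. The paths below follow the standard Linux cpufreq sysfs layout; whether writing is permitted depends on the active frequency driver and governor (the 'userspace' governor must be selected, and root privileges are required), so this is a hedged sketch rather than the paper's on-chip mechanism.

```python
from pathlib import Path

CPUFREQ = "/sys/devices/system/cpu/cpu{cpu}/cpufreq/{attr}"

def cpufreq_path(cpu, attr):
    """Build the sysfs path for one CPU's cpufreq attribute."""
    return Path(CPUFREQ.format(cpu=cpu, attr=attr))

def set_frequency(cpu, freq_hz):
    """Request a target frequency for one core; the CPUfreq sysfs
    interface expects the value in kHz. Requires root and the
    'userspace' governor to be active."""
    cpufreq_path(cpu, "scaling_setspeed").write_text(str(freq_hz // 1000))
```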
The kernel-level governors supported by Linux for CPU frequency management include 'Performance', 'Ondemand', and 'Userspace'. The Performance governor always selects the highest frequency, whereas the Ondemand governor selects the frequency based on CPU usage. By default, the Ondemand governor is used for DVFS in Linux, so in this paper we compare the proposed implementation with the Ondemand governor. The proposed governor can read the CPU utilization and MAR and set the frequency directly through the CPUfreq subsystem. The experiment has been carried out on an Intel i7-4720HQ Haswell processor with a 2.6 GHz base frequency. The latency overheads of the existing and proposed frequency-selection schemes are 50 µs [10] and 0.13 µs, respectively. The measured power overhead of the proposed implementation is 0.22 W, which is meager compared with the processor's 47 W TDP.

Fig. 5: On-chip inference circuit

For measuring the performance improvement, MiBench, a popular CPU- and memory-intensive benchmark suite for Linux systems, has been used. In [18], the experiment is based on another Linux benchmark called SysBench, which was giving a 47% performance boost with that implementation; however, it does not cater to processes with high memory access times, producing unnecessary frequency fluctuations and inaccurate predictions for those programs. To verify our system on all flavours of programs, we use MiBench in this paper, which gives correct frequency estimates for both memory-intensive and processor-intensive programs. The following MiBench sub-tests are chosen: bitcount, fft, susan, patricia, qsort, typeset, crc32, sha, and stringsearch. The sub-test 'bitcount' is the most CPU-intensive workload among these, with negligible memory activity, while 'stringsearch' is the most memory-intensive [11]. For our experiment, we have simulated five real-life user behaviors, ranging from extensive use of memory-bound applications to extensive use of CPU-intensive ones; e.g. User1 executes highly memory-intensive applications, while User5 is focused on running CPU-intensive programs. Each of these behaviors is mimicked by a subset of the above-mentioned benchmarks, as listed in Table 1.
Figure 6 compares the CPU frequency of the proposed implementation with that of the existing implementation for each user.

Fig. 6: CPU Frequency per user

Since CPU frequency transitions incur a certain latency, and the proposed implementation uses unsupervised learning to group a whole range of utilization into one particular frequency, it reduces the number of p-state (CPU frequency) transitions compared with the ondemand governor, which changes frequency every time the CPU utilization value changes even by 1 percent. This reduces the latency of CPU operations and hence improves performance. Moreover, the existing DVFS technique does not take the MAR into account.
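The transition-count argument above can be made concrete with a toy comparison: binning utilization coarsely (as the clustering effectively does) triggers far fewer frequency changes than reacting to every 1% move. The trace and bin widths below are invented for illustration and are not measured data from the paper.

```python
def count_transitions(trace, quantum):
    """Count frequency changes when utilization (%) is binned by 'quantum'."""
    levels = [u // quantum for u in trace]
    return sum(1 for a, b in zip(levels, levels[1:]) if a != b)

# Utilization jitters around 51%, then jumps to a genuinely new phase ~76%.
trace = [50, 51, 52, 51, 50, 51, 75, 76, 77, 76]
fine = count_transitions(trace, 1)     # ondemand-like: reacts to 1% moves -> 9
coarse = count_transitions(trace, 25)  # cluster-like: coarse bins -> 1
```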
From [1], we know that when MAR increases, the CPU frequency must be lowered (i.e. brought to the critical speed) to obtain a higher performance per watt. In our proposed implementation, this is achieved intelligently through the two-input neural network: for memory-intensive applications, the CPU frequency is reduced relative to the existing technology to obtain better results. Finally, since the implementation is lightweight, inference is faster, leading to quicker selection of the CPU frequency and hence a further boost in performance. Figures 7 to 9 compare the two techniques in terms of power and performance for each sub-test; here RAPL stands for Running Average Power Limit, a commercial standard that provides estimated energy measurements for CPUs. Figure 10 shows the percentage impact of the proposed algorithm and implementation.

Conclusion and Future Work
Replacing complicated mathematical models, this paper has proposed and implemented a very simple intelligent frequency-selection algorithm using unsupervised clustering (Kohonen Self-Organizing Maps), which optimizes the CPU frequency for both CPU-intensive and memory-intensive applications. Moreover, because the learning-based scheme is implemented on-chip, it enables faster and more lightweight neural inferencing, and thus incurs a much lower overhead cost than the existing DVFS scheme, which employs a relatively complex algorithm. Unlike related works in this field, where either CPU-bound or memory-bound (but not both) applications are considered for DVFS, our implementation targets both simultaneously, catering to the more realistic needs of the user. Moreover, unlike other works that save power only at the cost of performance, our implementation boosts or maintains performance while saving power. Further, voltage scaling can be incorporated as well, but implementing such schemes would increase the complexity, and hence the overhead cost, significantly; the ML algorithm should therefore be chosen wisely to keep the cost as low as possible, carrying out a proper trade-off between the benefits of the algorithm and its power and performance overhead.

Table 1: Benchmarks per User