
Grouping-Based Dynamic Power Management for Multi-Threaded Programs in Chip-Multiprocessors

Mu-Kai Huang
Department of Electronic Engineering
National Taiwan University of Science and Technology
Taipei, Taiwan
[email protected]

J. Morris Chang
Department of Electrical and Computer Engineering
Iowa State University
Iowa, USA
[email protected]

Wei-Mei Chen
Department of Electronic Engineering
National Taiwan University of Science and Technology
Taipei, Taiwan
[email protected]

Abstract—In the embedded systems field, the research focus has shifted from performance alone to both performance and power consumption. Previous research has investigated methods to forecast the processing behavior of programs and adopt the Dynamic Voltage and Frequency Scaling (DVFS) technique to adjust the processor frequency to the varying phase behavior of program threads. However, little research has paid attention to the overhead of DVFS. Generally, each DVFS operation makes the processor core unavailable for 10µs to 650µs. Adjusting the frequency for every thread may therefore incur unanticipated overhead, especially for multi-threaded programs.

The objective of this study is to take performance, power consumption and overhead into consideration and provide a low-overhead power management scheme that adjusts the processor frequency for every group of threads instead of every thread. The proposed approach consists of three components: phase behavior prediction, DVFS control and workload migration. To demonstrate the effect of our approach, we implemented these components on a real Linux system and compared our approach with the system without DVFS and the system with DVFS for every thread. The results show that our approach reduces power consumption by 15-40% with a 2-10% performance penalty. Moreover, it reduces processor core unavailable time by 94-97.5% compared with the system that applies DVFS for every thread.

I. INTRODUCTION

As the system frameworks of embedded systems have developed, power consumption has become a primary design constraint. In embedded systems, one of the major power-consuming devices is the processor, due to its high-speed clock rate. For power saving, processors have been equipped with on-chip regulators that can adjust the system voltage and frequency [12]. In recent years, researchers have turned to the processor power management issue based on the Dynamic Voltage and Frequency Scaling (DVFS) technique, which drives the on-chip regulator dynamically.

Due to advances in VLSI technology, Chip-Multiprocessor (CMP) technologies are nowadays becoming attractive and cost effective in the hardware design of embedded systems [3]. Although the CMP improves system performance, it poses great challenges in power consumption. When taking both power and performance into consideration, adjusting the frequency of each processor to meet the needs of the running programs has become a challenge.

Mu-Kai Huang is a PhD student at National Taiwan University of Science and Technology. This work was done while he was a visiting student at Iowa State University.

Multi-threaded programming allows a program to be partitioned into multiple threads which can potentially run in parallel. The throughput of multi-threaded applications can be improved greatly by running multiple threads in parallel on a CMP. As the cost of CMP continues to drop, it becomes more and more attractive to have multi-threaded applications in CMP embedded systems.

In the context of multi-threaded programs, each thread exhibits various execution behaviors during its run-time. The behavior of a thread is generally categorized into a memory-intensive phase and a processor-intensive phase [13]. In the processor-intensive phase, typical instructions are arithmetic logic instructions and branch instructions. These instructions tend to operate on registers and do not depend on operands from memory. During the memory-intensive phase, memory-based instructions (e.g. load and store instructions) are commonly used. Owing to the speed gap between the processor and the memory, a number of stall cycles are generated during the memory-intensive phase. Slowing down the processor frequency during the memory-intensive phase can mask the memory latency without sacrificing performance. Moreover, it can reduce the power consumption of the processor, which is critical to embedded systems.

Much research has demonstrated different approaches to tracing the phase behavior of programs. Some researchers have explored approaches to predict future program behavior based on historical behavior [10][11][22]. A number of studies have traced phase behavior based on program execution properties [2][18][20][21]. Several researchers have utilized hardware performance monitors (PMs) to track the phase behavior of programs [4][13][14][24][25].

The DVFS technology has been used widely to regulate processor power consumption. Some research adjusts the processor frequency downward under a power constraint [9][14]. Other approaches take both power consumption and program response deadlines into consideration for weakly hard real-time systems [17][26][27]. A few studies predict the future behavior of threads and schedule the threads to a processor whose frequency matches the needs of those threads [4][23].



Fig. 1. The basic data structure of runqueue.

Several investigators forecast the future behavior of every thread and adopt per-thread DVFS, which dynamically adjusts the processor frequency to the needs of each thread [1][10][15][25].

Although much published research addresses the power consumption of per-thread DVFS, little attention has been paid to its overhead in systems with a considerable number of threads. Generally, controlling the on-chip regulator makes the processor core unavailable for 10µs to 650µs [6][15]. During this time, the processor core cannot be used for any operation. This is also referred to as the DVFS overhead. In this paper, we use the term "processor unavailable time" as it appears in [6]. When a system executes a large number of threads and adjusts the processor frequency for each thread, system performance will be constrained by a large amount of processor core unavailable time.

This paper considers the DVFS overhead, performance and power consumption in the multi-threaded environment of CMP systems, and proposes a per-group DVFS power management scheme. The per-group DVFS groups threads based on the Linux 2.6 scheduling mechanisms [16], and reduces processor core unavailable time by adjusting the processor frequency to meet the needs of every group instead of every thread. In addition, we present a phase-based workload migration, which brings threads with similar phase behavior into the same group to improve the accuracy of the per-group DVFS. The experimental results show that our approach reduces power consumption by 15-40% with a 2-10% performance penalty. Moreover, our approach reduces processor core unavailable time by 94-97.5% compared with the per-thread DVFS.

The rest of the paper is organized as follows. Section 2 gives an overview of the Linux 2.6 scheduling mechanism. Section 3 presents our dynamic power management methodology, which includes the phase predictor, the per-group DVFS and the workload migration. Section 4 presents the implementation of our prototype system. Section 5 provides the experimental results and Section 6 summarizes the conclusions.

II. THE LINUX 2.6 SCHEDULING MECHANISMS

The objective of this research is to develop a grouping-based dynamic power management scheme with less processor core unavailable time. It reduces processor core unavailable time by adjusting the processor frequency for every group of threads instead of every thread. In our approach, a group is defined as a queue used in the Linux scheduling mechanisms; Section 3 describes the details of our approach. First, however, we present the Linux 2.6 scheduling mechanisms.

The Linux 2.6 scheduling mechanism is a priority-based scheduler which gives priority to executable threads according to their worth and need for processor time [16]. Each priority is mapped to a timeslice value that represents the allowable running time. When an executable thread exhausts its timeslice, the system switches from that thread to another thread. This action is referred to as a context switch.

Figure 1 gives the basic data structure of the Linux 2.6 scheduling mechanisms. The basic data structure of the scheduler is the runqueue. The runqueue is a set of executable threads on a given processor, and each processor maintains its own runqueue. The runqueue is composed of two subqueues, the active queue and the expired queue. The active queue holds all the threads that have timeslice remaining, and the expired queue holds all the unfinished threads that have exhausted their timeslice.

When an unfinished thread has exhausted its timeslice, the scheduler updates its timeslice for the next execution term and moves the thread from the active queue to the expired queue. Once all threads in the active queue have used up their timeslices, i.e. the active queue becomes empty, the scheduler swaps the two subqueues so that the active queue becomes the expired queue and vice versa. Then, the threads in the "new" active queue are executed again.
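To make the grouping concrete, a minimal C sketch of the active/expired organization is given below. The structure and field names are illustrative simplifications, not the actual Linux 2.6 kernel definitions.

#include <stddef.h>

struct prio_queue {
    size_t nr_threads;          /* number of runnable threads queued here      */
    struct thread *head;        /* simplified linked list of queued threads    */
};

struct runqueue {
    struct prio_queue queues[2];   /* storage for the two subqueues            */
    struct prio_queue *active;     /* threads that still have timeslice left   */
    struct prio_queue *expired;    /* threads that have used up their timeslice */
};

/* When every thread in the active queue has exhausted its timeslice,
 * the scheduler simply swaps the two pointers instead of moving threads. */
static void swap_subqueues(struct runqueue *rq)
{
    struct prio_queue *tmp = rq->active;
    rq->active  = rq->expired;
    rq->expired = tmp;
}

Because the exchange is a constant-time pointer swap, it is also a convenient point at which the per-group controller of Section 3 can be driven.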

III. GROUPING-BASED DYNAMIC POWER MANAGEMENT

In this section, we present our dynamic power management scheme, which is based on the Linux 2.6 scheduling mechanisms. Our approach consists of the Phase Behavior Predictor (PBP), the Per-Group DVFS Controller (PGDC) and the Phase-Based Workload Migration (PBWM).

Figure 2 shows a system overview of grouping-based dynamic power management. The PBP utilizes the hardware performance monitors to predict the future phase behavior of the preempted thread on every context switch. When all the threads in a group have exhausted their timeslices or have completed, the system turns to execute the threads of the next group. Before the system engages the threads in the next group, the PGDC determines a suitable frequency for all threads in that group. The PBWM considers the phase behavior of threads and collects threads with the same phase behavior into a group. The following subsections discuss the details of each component.

A. The Phase Behavior Predictor

Generally, the behavior of a thread is categorized into the memory-intensive phase and the processor-intensive phase [13]. Because threads in the memory-intensive phase are dominated by cache and memory accesses and cannot make use of the full available frequency, using a lower processor frequency to execute threads in the memory-intensive phase is an effective way to reduce power consumption.

Due to usual programming practice and the characteristics of compilers, programs exhibit similar phase behavior within a span (Section 4 will demonstrate the phase behavior of typical programs). The Phase Behavior Predictor (PBP) is a statistical predictor based on the last value.


Fig. 2. The overview of Grouping-based dynamic power management.

In this predictor, the next phase behavior of a thread is assumed to be the same as its last phase behavior, i.e. phase[t+1] = phase[t].

The hardware performance monitors are a common means to measure the phase behavior of threads, e.g. the number of retired instructions, the number of memory accesses, the number of TLB misses, etc. [8]. Existing processors such as the IBM POWER4, the AMD processor families and the Intel processor families are equipped with hardware performance monitors.

Algorithm 1 The Phase Behavior Predictor
Require: An unfinished thread p
Ensure: The phase behavior of p
1. SCPI ← IFU_Mem_Stall / Instr_Ret
2. if SCPI ≤ threshold then
3.   p.phase[t] ← processor-intensive phase
4. else
5.   p.phase[t] ← memory-intensive phase
6. end if
7. p.phase[t+1] ← p.phase[t]

To perform the phase behavior prediction, we track two hardware activity events, Instr_Ret and IFU_Mem_Stall, where Instr_Ret counts the number of retired instructions and IFU_Mem_Stall counts the cycles stalled while waiting for data from memory. We then define a measure called Stall Cycles Per Instruction (SCPI) to differentiate the phase behavior of threads:

SCPI = IFU_Mem_Stall / Instr_Ret    (1)

The PBP predicts the future phase behavior of an unfinished thread on every context switch. It predicts that a memory-intensive phase is forthcoming when SCPI is higher than the threshold. Conversely, the PBP forecasts that the future phase behavior is processor-intensive when SCPI is below the threshold. In this paper, we set the threshold to 0.2; the details of the threshold are presented in Section 4. Algorithm 1 gives the procedure of the PBP.
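For reference, the following is a minimal C sketch of the last-value predictor in Algorithm 1. It assumes the two event counts have already been read from the PMs; the type and field names are illustrative, not part of the original implementation.

#include <stdint.h>

enum phase { PROCESSOR_INTENSIVE, MEMORY_INTENSIVE };

struct thread_info {
    enum phase last_phase;       /* phase[t]   */
    enum phase predicted_phase;  /* phase[t+1] */
};

#define SCPI_THRESHOLD 0.2

/* Classify the phase just observed and predict the next one (last-value). */
static void predict_phase(struct thread_info *p,
                          uint64_t ifu_mem_stall,   /* stalled cycles waiting on memory */
                          uint64_t instr_ret)       /* retired instructions             */
{
    double scpi = instr_ret ? (double)ifu_mem_stall / (double)instr_ret : 0.0;

    p->last_phase = (scpi <= SCPI_THRESHOLD) ? PROCESSOR_INTENSIVE
                                             : MEMORY_INTENSIVE;
    /* The next phase is assumed to equal the last observed phase. */
    p->predicted_phase = p->last_phase;
}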

Fig. 3. The executing behavior of bzip at different frequencies.

B. The Per-Group DVFS Controller

Adjusting the processor to an appropriate frequency for the memory-intensive phase is effective for reducing power consumption without performance degradation. For instance, Figure 3 presents the executing behavior of bzip (a compression application) at different frequencies. As shown in the figure, the executing time of each memory-intensive portion at the low frequency (see (a)) is the same as the corresponding one at the high frequency (see (b)). Therefore, lowering the processor frequency to execute the memory-intensive phase benefits power consumption.

Previous works focused on per-thread DVFS, which dynamically adjusts the processor frequency for each thread with different phase behaviors [1][10][15][25]. Usually, adjusting the processor frequency incurs processor core unavailable time of 10µs to 650µs [6][15]. When a system executes a large number of threads and adjusts the processor frequency for each thread, system performance will be constrained by appreciable processor core unavailable time. Although per-thread DVFS is an intuitive scheme for reducing power consumption, it encounters a bottleneck on systems with a large number of threads.

One way to avoid the bottleneck of per-thread DVFS on a system with a large number of threads is to adjust the frequency for each group of threads instead of each thread. This paper presents the Per-Group DVFS Controller (PGDC), which performs DVFS for every group of threads. The PGDC takes a subqueue of the runqueue as a group, i.e. the threads in the active queue are defined as a group and the threads in the expired queue are referred to as a group as well. When the Linux scheduler exchanges these queues, the PGDC is enabled to determine an adequate frequency for executing the threads that belong to the "new" active queue.

To estimate the suitable processor frequency, we add three counters, memory_phase, processor_phase and unknown_phase, to the structure of the subqueue. Figure 4 gives the structure of the subqueues. The functions of these counters are as follows:


Fig. 4. The phase behavior counters of runqueue.

• memory_phase: The total number of threads in the memory-intensive phase in this subqueue.

• processor_phase: The total number of threads in the processor-intensive phase in this subqueue.

• unknown_phase: The total number of threads whose phase behavior has not been predicted yet, i.e. threads that have been created but not yet executed.

When a thread with a specific phase behavior is inserted into a subqueue, the corresponding counter is increased. Conversely, once a thread is removed from a subqueue, the corresponding counter is decreased.

Generally, a processor with DVFS technology has various levels of frequency and voltage to select from. For instance, Table 1 shows that the Intel Pentium M 1.6 GHz processor supports six levels of frequency and voltage [6]. For a group of threads, the processor frequency should be slowed down when most of the threads in the group are in the memory-intensive phase. On the other hand, selecting a higher processor frequency is more efficient when most of the threads in the group are in the processor-intensive phase. We select the adequate frequency and voltage based on the ratio of the number of threads in the memory-intensive phase to the total number of threads. Assume the processor has m frequency levels (from level 1 to level m), where a lower level corresponds to a higher frequency. The PGDC selects the processor frequency level by Equation (3):

total_thread = memory_phase + processor_phase + unknown_phase    (2)

level = 1, if memory_phase = 0; otherwise level = ⌈(memory_phase / total_thread) × m⌉    (3)

Table 1: Supported frequency and voltage for the Intel Pentium M 1.6 GHz Processor [6].

level    frequency    voltage
1        1.6 GHz      1.484 V
2        1.4 GHz      1.420 V
3        1.2 GHz      1.276 V
4        1.0 GHz      1.164 V
5        800 MHz      1.036 V
6        600 MHz      0.956 V

When the active queue and the expired queue are exchanged, the PGDC is driven to estimate and set an adequate frequency for the threads in the active queue. Algorithm 2 gives the procedure of the PGDC. The PGDC determines the adequate frequency level by Equation (3) and subsequently adjusts the processor frequency to the corresponding frequency. For example, consider an Intel Pentium M 1.6 GHz processor system with six frequency levels that exchanges the active queue and the expired queue. If the memory_phase counter of the "new" active queue is 0, the PGDC sets the processor to the highest frequency. On the other hand, if the "new" active queue has 25 threads in total, 12 of which are in the memory-intensive phase, the PGDC uses Equation (3) to determine the adequate frequency level, which is level 3, and then adjusts the processor frequency to 1.2 GHz (refer to Table 1).

Algorithm 2 The Per-Group DVFS Controller
Require: Active queue q and supported frequency level m
Ensure: Adjust the frequency of processor to f
1. memory ← q.memory_phase
2. total ← (q.memory_phase + q.processor_phase + q.unknown_phase)
3. if memory = 0 then
4.   level ← 1
5. else
6.   ratio ← memory / total
7.   level ← ⌈m × ratio⌉
8. end if
9. if current frequency ≠ corresponding frequency of level then
10.  f ← corresponding frequency of level
11. end if

C. The Phase-Based Workload Migration

As mentioned before, each processor maintains its own runqueue and only executes the threads in its runqueue. One way to promote the cooperation between two processors is migration. The migration is a system activity which moves a thread from a processor's runqueue to another's.

Migration can not only promote the cooperation between processors but also improve the accuracy of the PGDC. For instance, the PGDC will select the lowest frequency for an active queue which has one thread in the processor-intensive phase and ten threads in the memory-intensive phase, so the processor-intensive thread may be executed at an unsuitable frequency. Migration can improve the accuracy of the PGDC by moving the processor-intensive thread to another, high-frequency processor.

Although migration has significant advantages, it brings some disadvantages. When a migration is applied, it blocks the runqueue of the source processor and the runqueue of the destination processor. If migration is triggered haphazardly, the performance of the processors will be constrained. Moreover, some threads cannot be migrated:

• running threads: The threads that are executing now.

• exclusive threads: The threads that only run on a specific processor.


Fig. 5. Example of VEQ selection.

• cache hot threads: The threads that are executed persistently and are most likely in the processor's cache (cache hot).

We present a migration mechanism called the Phase-Based Workload Migration (PBWM) and implement it using the Linux standard workload balancer. To reduce the overhead of migration, the PBWM is driven to move threads when the workload of a processor is imbalanced. The PBWM examines the workload of the expired queue of each processor whenever an expired queue is empty, or every 200 ms while the system is busy. When the system is idle, the PBWM is driven every 1 ms.

When the PBWM detects a Starveling Expired Queue (SEQ) whose workload is lower than the other expired queues', the PBWM finds a Victim Expired Queue (VEQ) and moves threads from the VEQ to the SEQ. Two types of queue are qualified to be a VEQ: the queue which has the maximum workload, and the queue in which most threads' phase behavior is similar to the major phase behavior of the SEQ.

To select an appropriate VEQ, the PBWM first scores each expired queue of processor i, except the SEQ, by Equations (4)-(9). The PBWM then picks the expired queue with the maximum score as the VEQ. Equation (4) calculates the number of threads in each expired queue; Equation (5) counts each expired queue's threads that need a high frequency; and Equation (6) gives the major phase behavior, which is the phase behavior of most threads of the SEQ. We define s_ratio and t_ratio for selecting a proper VEQ. The s_ratio[i] (7) is the ratio of the number of threads with the major phase behavior in the expired queue of processor i to the total number of such threads over all expired queues. The t_ratio[i] (8) is the ratio of the number of threads in the expired queue of processor i to the total number of threads over all expired queues. The score of each expired queue is determined by Equation (9), where α and β are the weights of the s_ratio and the t_ratio, respectively. In this paper, we determined the values of α and β by heuristic experiment, and they are set to 0.3 and 0.7, respectively.

EQ[i].total_thread = memory_phase + processor_phase + unknown_phase    (4)

EQ[i].other_thread = processor_phase + unknown_phase    (5)

major = mem, if SEQ.memory_phase > (SEQ.processor_phase + SEQ.unknown_phase); processor, otherwise    (6)

s_ratio[i] = EQ[i].memory_phase / Σ_j EQ[j].memory_phase, if major = mem; EQ[i].other_thread / Σ_j EQ[j].other_thread, otherwise    (7)

t_ratio[i] = EQ[i].total_thread / Σ_j EQ[j].total_thread    (8)

score[i] = α × s_ratio[i] + β × t_ratio[i]    (9)

where each sum Σ_j is taken over the expired queue of every core j.

Figure 5 gives an example of selecting a VEQ. The SEQ is the expired queue of processor 3, and the major phase behavior of the SEQ is mem. First, the PBWM scores the expired queues of processors 0, 1 and 2 by Equation (9). Then, the PBWM selects the expired queue with the highest score (the expired queue of processor 0) as the VEQ.

After selecting the VEQ, the PBWM moves appropriate threads from the VEQ to the SEQ. The migrated threads not only have phase behavior similar to the major phase behavior of the SEQ, but are also movable (i.e. not running threads, not exclusive threads and not cache hot threads).

The procedure used in the PBWM is presented in Algorithm 3. The PBWM first finds a SEQ whose workload is lower than the others'. Next, it determines the major phase behavior of the SEQ and scores the other expired queues by Equation (9). Then, it selects the expired queue with the maximum score as the VEQ. Finally, the PBWM blocks the runqueues of the SEQ and the VEQ, and moves adequate threads from the VEQ to the SEQ until the workload of the SEQ is balanced or all threads of the VEQ have been examined.

IV. IMPLEMENTATION

We implemented our scheme on an Intel Core 2 Quad Q6600 processor based desktop computer running Linux kernel 2.6.22.9. The behavior of executing threads is measured via hardware performance monitors (PMs), and the future phase behavior of threads is predicted according to the value of SCPI. To apply DVFS, we control the processor frequency via hardware Model-Specific Registers (MSRs). Moreover, we use a dynamic power model of CMOS circuits to estimate power consumption. The details of these aspects of our framework are discussed in the following subsections.


Algorithm 3 The Phase-Based Workload Migration
Require: A starveling expired queue SEQ
Ensure: Move and balance the workload of each processor
1. major ← the major phase behavior of SEQ
2. for each possible processor i do
3.   score[i] ← the score of EQ[i]
4. end for
5. j ← i | score[i] is maximum
6. VEQ ← EQ[j]
7. block the runqueue of SEQ and VEQ
8. T ← VEQ
9. C ← {j | j ∈ VEQ ∧ j.phase = major}
10. repeat
11.   if C = Ø then
12.     p ← t | t ∈ T
13.     T ← T − {t}
14.   else
15.     p ← t | t ∈ C
16.     C ← C − {t}
17.     T ← T − {t}
18.   end if
19.   if p is movable then
20.     VEQ ← VEQ − {p}
21.     SEQ ← SEQ + {p}
22.   end if
23. until SEQ is balance or T = Ø
24. release the runqueue of SEQ and VEQ
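For illustration, the scoring and VEQ selection of Equations (4)-(9) reduce to a short loop, sketched below in C for a four-core system. Only the selection step is shown; the locking, movability checks and actual thread movement of Algorithm 3 are omitted, and the weights 0.3 and 0.7 are the values chosen above.

#define NR_CPUS 4
#define ALPHA   0.3
#define BETA    0.7

struct expired_queue {
    unsigned memory_phase;
    unsigned processor_phase;
    unsigned unknown_phase;
};

static unsigned eq_total(const struct expired_queue *q)
{
    return q->memory_phase + q->processor_phase + q->unknown_phase;   /* Eq. (4) */
}

static unsigned eq_other(const struct expired_queue *q)
{
    return q->processor_phase + q->unknown_phase;                     /* Eq. (5) */
}

/* Return the index of the Victim Expired Queue for the starveling queue seq. */
static int pbwm_pick_veq(const struct expired_queue eq[NR_CPUS], int seq)
{
    /* Eq. (6): major phase behavior of the SEQ. */
    int major_is_mem = eq[seq].memory_phase > eq_other(&eq[seq]);

    unsigned sum_s = 0, sum_t = 0;
    for (int i = 0; i < NR_CPUS; i++) {
        sum_s += major_is_mem ? eq[i].memory_phase : eq_other(&eq[i]);
        sum_t += eq_total(&eq[i]);
    }

    int best = -1;
    double best_score = -1.0;
    for (int i = 0; i < NR_CPUS; i++) {
        if (i == seq)
            continue;                    /* the SEQ itself is never a victim */
        double s_ratio = sum_s ? (double)(major_is_mem ? eq[i].memory_phase
                                                       : eq_other(&eq[i])) / sum_s : 0.0;
        double t_ratio = sum_t ? (double)eq_total(&eq[i]) / sum_t : 0.0;   /* Eq. (8) */
        double score   = ALPHA * s_ratio + BETA * t_ratio;                /* Eq. (9) */
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;   /* index of the expired queue with the maximum score */
}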

A. Executing Behavior Monitoring

In our experiments, the behavior of executing threads is monitored at run-time via PMs. We configured two available PMs in the Intel Core 2 Quad Q6600 processor to monitor the number of retired instructions and the cycles stalled while waiting for data from memory, using the Instr_Ret and IFU_Mem_Stall event configurations.

To monitor the previous behavior of unfinished threads, we implement the PM access in the kernel context-switch routine, context_switch(previous_thread, next_thread), i.e. the system collects the information from the PMs on every context switch. Once it is called, the system collects the behavior of the previous thread via the PMs and determines the phase behavior of the unfinished thread. Afterward, the PMs are reset to zero.
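The hook can be summarized by the following C sketch. The read_pmc_*() and reset_pmcs() helpers, as well as the predict_phase() routine from the predictor sketch above, are hypothetical stand-ins for the in-kernel counter accesses; the sketch only shows the order of operations on a context switch.

#include <stdint.h>

/* Hypothetical helpers around the two configured performance counters.
 * In a real implementation these would be RDPMC/MSR accesses inside the kernel. */
extern uint64_t read_pmc_instr_ret(void);
extern uint64_t read_pmc_ifu_mem_stall(void);
extern void     reset_pmcs(void);

struct thread_info;   /* as in the predictor sketch above */
extern void predict_phase(struct thread_info *p,
                          uint64_t ifu_mem_stall, uint64_t instr_ret);

/* Called on every context switch: classify the thread being preempted,
 * then clear the counters so they only measure the next thread. */
static void on_context_switch(struct thread_info *prev, struct thread_info *next)
{
    (void)next;   /* the next thread's counters start from zero */

    uint64_t stalls  = read_pmc_ifu_mem_stall();
    uint64_t retired = read_pmc_instr_ret();

    predict_phase(prev, stalls, retired);   /* updates prev's predicted phase */
    reset_pmcs();
}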

B. The Threshold of the Phase Behavior Predictor

As mentioned before, the phase behavior of threads is distinguished by a threshold on SCPI. To determine the threshold, we observed the executing behavior of various programs and defined a suitable threshold. Figure 6 demonstrates the executing behavior of programs. Figures 6 (a)-(d) show the behavior of benchmarks from SPEC CPU2006, and Figures 6 (e)-(g) show the behavior of real programs. In addition, we present a minibenchmark which alternates between random memory accesses and arithmetic logic instructions. The results show that programs have similar phase behavior within a span.

Fig. 6. The execution behavior of programs.

According to the behavior of the minibenchmark, SCPI is higher than 0.2 when the program performs memory accesses. On the other hand, SCPI is lower than 0.2 when it executes arithmetic logic instructions. From the figure, the difference between the two phases can be clearly distinguished by an SCPI value of 0.2. In our experiments, an SCPI threshold of 0.2 is used to differentiate the program phase behavior.

C. The Implementation of DVFS and Power Measurement

The Intel Core 2 Quad Q6600 processor supports two levels of frequency and voltage for adjustment, as shown in Table 2. We adjusted the processor frequency by writing the P-state corresponding to each frequency to the IA32_PERF_CTL register in the MSRs [7]. When the P-state is written to IA32_PERF_CTL, the processor changes its frequency after the processor core unavailable time.
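For prototyping, the same write can also be issued from user space through the Linux msr module. The sketch below assumes the standard /dev/cpu/N/msr interface, where the file offset selects the MSR; 0x199 is the architectural address of IA32_PERF_CTL, while the P-state encoding itself is platform specific and is treated here as an opaque value.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_PERF_CTL 0x199   /* architectural MSR address */

/* Write a pre-computed P-state encoding to IA32_PERF_CTL of one core.
 * Requires the msr module (modprobe msr) and root privileges. */
static int set_pstate(int cpu, uint64_t pstate)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    /* The msr device interprets the file offset as the MSR address. */
    ssize_t n = pwrite(fd, &pstate, sizeof(pstate), IA32_PERF_CTL);
    close(fd);
    return n == (ssize_t)sizeof(pstate) ? 0 : -1;
}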

Table 2: Supported frequency and voltage for the Intel Core 2 Quad Q6600 Processor.

level    frequency    voltage
1        2.4 GHz      1.4375 V
2        1.6 GHz      1.1125 V


A generic dynamic power model for CMOS circuits is given in [6]. The dynamic power consumption of a CMOS circuit (P) can be expressed as:

P = C × Vdd² × f    (10)

where C is the effective switching capacitance, Vdd is the supply voltage and f is the executing frequency. In our experiments, we measured the executing time at each frequency and evaluated the power consumption by Equation (10).
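Under this model, a relative energy figure can be computed by accumulating the time spent at each frequency level. The sketch below assumes the Core 2 Quad levels of Table 2 and an arbitrary effective capacitance; since C is a constant factor, it cancels when results are normalized to the FullFrequency baseline, as in Section 5.

#include <stdio.h>

struct freq_level {
    double freq_hz;   /* f   */
    double vdd;       /* Vdd */
};

/* Core 2 Quad Q6600 levels from Table 2. */
static const struct freq_level levels[2] = {
    { 2.4e9, 1.4375 },
    { 1.6e9, 1.1125 },
};

/* Energy = sum over levels of P * t, with P = C * Vdd^2 * f  (Eq. 10). */
static double energy_joules(const double time_s[2], double capacitance)
{
    double e = 0.0;
    for (int i = 0; i < 2; i++) {
        double power = capacitance * levels[i].vdd * levels[i].vdd * levels[i].freq_hz;
        e += power * time_s[i];
    }
    return e;
}

int main(void)
{
    /* Example: 30 s at 2.4 GHz and 70 s at 1.6 GHz, normalized to 100 s at full speed. */
    double t[2] = { 30.0, 70.0 };
    printf("relative energy: %.3f\n",
           energy_joules(t, 1.0) / energy_joules((double[2]){ 100.0, 0.0 }, 1.0));
    return 0;
}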

V. EXPERIMENT RESULTS

In this section, we compare the performance and the power consumption of the system in three configurations: the system without power management, the system with the proposed per-group DVFS, and the system with per-thread DVFS. The system without power management executes the threads at the highest frequency, and the system with per-thread DVFS uses information from the PBP to adjust the frequency for every thread (including user threads and kernel threads). We evaluate the performance and power consumption of these schemes with the well-known benchmarks SPEC CPU2006 [5] and the Phoronix Test Suite [19].

A. Evaluation with the SPEC CPU2006

In our experiments, the system without power management is denoted as FullFrequency, the system with per-thread DVFS is denoted as PerThreadDVFS and the system with our approach is denoted as PerGroupDVFS. For comparison, the results of PerThreadDVFS and PerGroupDVFS are normalized to the results of FullFrequency.

The left part of Figure 7 depicts the comparison results for SPEC CPU2006. From top to bottom, the figure presents the comparisons of performance, power consumption, energy-delay product (EDP) and the number of DVFS operations. As shown in Figure 7 (a), the PerThreadDVFS encountered 23% to 50% performance degradation due to the overhead of persistent DVFS operations, while the PerGroupDVFS had at most 8% performance degradation. Figure 7 (b) shows that the PerThreadDVFS reduced power consumption by 16% to 63% and the PerGroupDVFS saved 10% to 42%. The results in Figure 7 (c) show that the EDP of the PerGroupDVFS is better than the others. Compared with the system without DVFS, the average EDP improvement of the PerGroupDVFS is 21% with an average of 2% performance degradation. Figure 7 (d) demonstrates that the PerGroupDVFS not only decreases the number of DVFS operations by an average of 97.5% but also reduces processor core unavailable time by an average of 97.5%.

Compared with the system without DVFS, applications with a similar proportion of memory-intensive to processor-intensive phases, such as bzip, hmmer, sjeng, gobmk, xalancbmk, astar and h264ref, can save an average of 15% power consumption with little performance degradation. For applications with a large number of memory-intensive phases, such as gcc, libquantum, omnetpp and mcf, our approach can reduce power consumption by an average of 32% with almost no performance degradation.

Fig. 7. Comparison results of: (a) performance; (b) power consumption; (c)Energy-Delay Product (EDP); (d) times of DVFS.

B. Evaluation with Multi-threaded Applications

We evaluated the comparison results for popular multi-threaded applications provided by the Phoronix Test Suite [19]. The multi-threaded applications of the Phoronix Test Suite include 7-Zip Compression, Java 2D, Sunflow Rendering System, MySQL, Apache Builder, ImageMagick Builder and PHP Builder.

The comparison results for multi-threaded applications are depicted in the right part of Figure 7. As shown in Figure 7 (a), the PerThreadDVFS encountered 31% to 55% performance degradation, while the PerGroupDVFS had at most 12% performance degradation. Figure 7 (b) shows that the PerThreadDVFS reduced power consumption by 19% to 67% and the PerGroupDVFS saved 7% to 55%. Figure 7 (c) shows the comparison of EDP. The average EDP improvement of the PerThreadDVFS is 13% with an average of 40% performance degradation, and the average EDP improvement of the PerGroupDVFS is 24% with an average of 5% performance degradation. Figure 7 (d) demonstrates that the PerGroupDVFS decreases the number of DVFS operations by an average of 94% and reduces processor core unavailable time by an average of 97%.

These results show that although per-thread DVFS reduces a large amount of power consumption, it encounters appreciable performance degradation from frequent DVFS operations. The per-group DVFS not only incurred far less processor core unavailable time than the per-thread DVFS but also saved energy with only a small performance penalty.

VI. CONCLUSIONS

In this paper, we introduce a grouping-based DVFS power management strategy that adjusts the processor frequency to meet the needs of every group of threads instead of every thread. The proposed scheme leads to a much lower processor unavailable time in CMP systems with multi-threaded environments. Our proposed approach not only lessens the DVFS overhead but also reduces power consumption with low performance degradation. To achieve power savings, it is important to slow down the processor frequency when executing threads with considerable memory access.

The proposed approach consists of three components: phase behavior prediction, DVFS control and workload migration. The phase behavior predictor categorizes the threads into the memory-intensive phase and the processor-intensive phase by a well-defined SCPI threshold. The DVFS controller adjusts the processor frequency to meet the needs of every group of threads with similar behaviors. The workload migration brings the threads with similar phase behavior into the same group to improve the accuracy of the per-group DVFS.

To demonstrate the performance of our approach, we implemented it on a real Linux system and compared it with the system without DVFS and the system with per-thread DVFS power management. According to the experimental results, our approach can save an average of 30% power consumption with an average of 3.5% performance loss. Moreover, it requires only negligible processor core unavailable time for power management. The results show that our scheme can reduce energy consumption efficiently with low performance degradation.

ACKNOWLEDGMENT

This work is supported by the National Science Council under Grant NSC97-2221-E-011-097-. We would like to thank Bashar M. Gharaibeh for his useful discussions, and Chun-Yuan Chang for his help during the development of our work. Moreover, we appreciate all reviewers for their reviews and helpful suggestions.

REFERENCES

[1] G. Contreras and M. Martonosi, Power Prediction for Intel XScale Processors Using Performance Monitoring Unit Events, International Symposium on Low Power Electronics and Design, 2005, pp 221-226.

[2] A. Dhodapkar and J. Smith, Managing multi-configurable hardware via dynamic working set analysis, In 29th Annual International Symposium on Computer Architecture, 2002, pp 233-244.

[3] D. Geer, Chip Makers Turn to Multicore Processors, Computer, 38:5, 2005, pp 11-13.

[4] S. Ghiasi, T. Keller and F. Rawson, Scheduling for Heterogeneous Processors in Server Systems, Proceedings of the 2nd Conference on Computing Frontiers, 2005, pp 199-210.

[5] J. L. Henning, SPEC CPU2006 benchmark descriptions, ACM SIGARCH Computer Architecture News, Volume 34, Issue 4, September 2006, pp 1-17.

[6] Intel Corporation, Enhanced Intel SpeedStep Technology for the Intel Pentium M Processor, http://www.intel.com/design/intarch/papers/301174.htm, March 2004.

[7] Intel Corporation, Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide, Part 1, http://www.intel.com/products/processor/manuals/, November 2008.

[8] Intel Corporation, Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3B: System Programming Guide, Part 2, http://www.intel.com/products/processor/manuals/, November 2008.

[9] C. Isci, A. Buyuktosunoglu, C. Cher, P. Bose and M. Martonosi, An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget, 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006, pp 347-358.

[10] C. Isci, G. Contreras and M. Martonosi, Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management, The 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006, pp 359-370.

[11] C. Isci, M. Martonosi, and A. Buyuktosunoglu, Long-term Workload Phases: Duration Predictions and Applications to DVFS, IEEE Micro: Special Issue on Energy Efficient Design, 25(5), September/October 2005, pp 39-51.

[12] W. Y. Kim, M. S. Gupta, G. Y. Wei and D. Brooks, System Level Analysis of Fast, Per-Core DVFS using On-Chip Switching Regulators, High Performance Computer Architecture, 2008, pp 123-134.

[13] R. Kotla, A. Devgan, S. Ghiasi, Characterizing the Impact of Different Memory-Intensity Levels, International Workshop on Workload Characterization, 2004, pp 3-10.

[14] R. Kotla, S. Ghiasi, T. Keller and F. Rawson, Scheduling Processor Voltage and Frequency in Server and Cluster Systems, Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, April 2005, pp 234-241.

[15] H. Kweon, Y. Do, J. Lee and B. Ahn, An Efficient Power-Aware Scheduling Algorithm in Real Time System, Pacific Rim Conference on Communications, Computers and Signal Processing, 2007, pp 350-353.

[16] R. Love, Linux Kernel Development, 2nd ed., Indianapolis, Ind.: Novell Press, 2005.

[17] L. Niu and G. Quan, System Wide Dynamic Power Management for Weakly Hard Real-Time Systems, Journal of Low Power Electronics, Volume 2, Number 3, December 2006, pp 342-355.

[18] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi, Pinpointing Representative Portions of Large Intel Itanium Programs with Dynamic Instrumentation, In Proceedings of the 37th International Symposium on Microarchitecture, 2004, pp 81-92.

[19] Phoronix Media, Phoronix Test Suite Benchmark, http://www.phoronix-test-suite.com/.

[20] T. Sherwood, E. Perelman, G. Hamerly and B. Calder, Automatically Characterizing Large Scale Program Behavior, In Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, October 2002, pp 45-57.

[21] T. Sherwood, E. Perelman, and B. Calder, Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications, In International Conference on Parallel Architectures and Compilation Techniques, September 2001, pp 3-14.

[22] T. Sherwood, S. Sair, and B. Calder, Phase tracking and prediction, In Proceedings of the 28th International Symposium on Computer Architecture, June 2003, pp 336-349.

[23] T. Sondag, V. Krishnamurthy and H. Rajan, Predictive Thread-to-Core Assignment on a Heterogeneous Multi-Core Processor, Proceedings of the 4th Workshop on Programming Languages and Operating Systems, October 2007.

[24] R. Teodorescu and J. Torrellas, Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors, Proceedings of the 35th International Symposium on Computer Architecture, 2008, pp 363-374.

[25] F. Xie, M. Martonosi and S. Malik, Efficient Behavior-driven Runtime Dynamic Voltage Scaling Policies, Proceedings of the 3rd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, September 2005, pp 19-21.

[26] D. Zhu, R. Melhem and B. Childers, Scheduling with Dynamic Voltage/Speed Adjustment Using Slack Reclamation in Multi-Processor Real-Time Systems, Proceedings of the 22nd IEEE Real-Time Systems Symposium, 2001, pp 686-700.

[27] D. Zhu, N. AbouGhazaleh, D. Mosse and R. Melhem, Power Aware Scheduling for AND/OR Graphs in Multi-Processor Real-Time Systems, Proceedings of the 2002 International Conference on Parallel Processing, 2002, pp 849-864.
