
Towards Energy Efficient Workload Placement in Data

Centers

Rania Elnaggar,

Portland State University

[email protected]

Abstract. A new era of computing is being defined by a shift to aggregate

computing resources into large-scale data centers (DCs) that are shared by a

global pool of users. In this paradigm, DCs' operational energy costs are a rising

concern as they continue an upward trend that is poised to surpass the capital

cost of equipment in a typical lifetime usage model.

A DC is a complex distributed system comprised of a hierarchy of numerous

components; thus, power and energy management can be performed at many

levels of granularity and through various techniques. We contend that the

energy efficiency problem in DCs should be addressed through a holistic, end-

to-end approach that accounts for the many, and sometimes-conflicting,

parameters. In this paper, we discuss workload placement strategies as a model

for a holistic approach.

Earlier research that addressed workload placement, capitalized on a maximum-

idle approach that seeks to maximize both spatial and temporal idleness. We

show that the underlying concept for that approach does not hold as a basis for

energy-efficient placement; we investigate current and future system power

expenditure with respect to system load and analyze the contributing factors.

We then utilize our analysis to introduce a framework for energy-efficient load

placement strategies in DCs. Comparing our approach to maximum-idle-based

placement shows gains in compute energy efficiency. Finally, we discuss how

our new approach affects DC thermals and the energy required for cooling.

1 Introduction

Energy efficiency has always been a first-class design goal in the mobile and

embedded fields due to battery-life limitations. However, until recently, it has been

less of a concern for servers and data centers. We witnessed waves of interest in DC

energy efficiency with the advent of new technologies such as the WWW, clusters, grids, and

utility compute models. The interest is now renewed with the inception of Cloud

Computing [2] [8], given the massive scale anticipated in future DCs. Cloud

Computing is characterized by a move to aggregate computing resources (in terms of

hardware, software and services) into large-scale data centers that are shared by a

global pool of users. Those data centers are typically owned and run by third-party

entities and export a wide array of services and applications ranging from individual

consumer-oriented services to enterprise-class offerings.

In this new paradigm, computing energy costs are a rising concern. This is

especially the case in DCs where energy costs continue an upward trend that is poised

to surpass the cost of the equipment itself in a typical lifetime usage model [6]. In

2005, DCs consumed an estimated total of about 50 billion kWh in the U.S., and

around 130 billion kWh for the world. These figures accounted for 1.2% and 0.8% of

the total electricity consumption of the U.S. and the world respectively [24]. The U.S.

Environmental Protection Agency (EPA) estimates that if current energy-efficiency

trends continue, DCs will consume more than 120 billion kWh in the U.S. alone [13].

While cutting down operational costs of data centers is a chief concern, the

environmental impact of this spiraling computing energy expenditure is equally

important. It is projected that improving DC efficiency beyond current trends can

reduce carbon dioxide emissions by 15 to 47 million metric tons in 2011 [13].

A data center is a complex entity that can be partitioned into two inter-dependent

systems: the IT system and the facilities system. The IT system is comprised of

compute related parts such as servers, networks and management components. The

facilities system delivers the power and cooling needed by IT equipment as well as

other facilities overhead such as lighting. As servers’ performance grows, they

generate increasing amounts of heat [3], which in turn demands progressively

more cooling [6]. In fact, cooling and power delivery energy expenditure already

surpasses compute energy use in a great percentage of data centers [6]. DCs are

usually cooled using Computer Room Air Conditioning (CRAC) units, and are

typically arranged in a hot-aisle, cold-aisle configuration [47].

In this paper we contend that achieving improved energy-efficiency for a DC

should be driven by a holistic strategy that takes into consideration all system components to

achieve maximum synergetic energy savings. This strategy governs a set of local

policies and protocols that effectively use existing power-saving features within each

system component in a way that is proportional to the workload of the component and

of the overall system.

Though ultimately, we are interested in defining an end-to-end global energy

optimization strategy over a generalized model of a multi-level distributed system, the

focus of our short-term research effort is on defining such a strategy for DCs, as a key

subset of the larger problem. We consider a DC workload that is a mix of online and

offline workloads that are mostly characterized as massively parallel, or throughput-

oriented.

In defining such strategies, we will initially examine just the compute-related part

of the DC, thus excluding networking, cooling and power delivery overheads. While

we recognize that these overheads are critically important, we also observe that, in

many cases, they are proportional to computing power expenditure in modern data

centers. Expanding our work to explicitly consider those other components is a topic

for future research.

The rest of this paper is organized as follows. In section 2 we present research

background and review related work. In section 3 we present our hypotheses and

outline an investigation methodology. In section 4 we present results and analysis. In

section 5 we introduce a framework for energy efficient workload placement. In

section 6 we conclude the paper and present directions for future research.

2 Background

A DC’s compute system is comprised of a hierarchy of components of different scale, where power management can be performed at many levels of that hierarchy, as shown in Fig. 1 below.

Fig. 1. Scale and hierarchy of power management in a DC

We recognize three power-related attributes that affect the average power consumption of each component, as well as the overall system; namely: peak power, idle power and dynamic power range. Peak power is the power consumed at the maximum workload of a component. Idle power is the power consumed when a component has no workload but is powered-on and active, and thus has low-latency response to increasing workload. The dynamic power range defines the distance between peak power and idle power, and it is desirable for it to scale proportionally to the workload of the component [3].

When addressing the power consumption problem in a DC, we identify two main cost components: the capital cost of power provisioning in the infrastructure, and the operational power cost during the life span of the DC. The capital cost component is directly related to expected maximum power consumption; it is a pressing issue as more DCs are built, and is typically amortized over an average of 15 years. Large DC operators such as Google are exploring ways to cut down on that cost through a workload mix that keeps maximum utilization within a decreased power envelope [13]. Decreasing operational power, however, is our area of interest and that of a large body of other research.

Approaches to decrease operational (also termed average) power have focused on the three power attributes we identified earlier. The solutions proposed can be broadly classified into three categories: scaling solutions that track utilization levels; sleep solutions that shut off parts of the system at low utilization; and hybrid solutions that combine the two former approaches.

On the component level, Dynamic Voltage and Frequency Scaling (DVFS) has been widely researched and applied to CPUs in particular, as they consume a large percentage of the overall system power. Also, clock gating and system sleep states have been introduced for various platform components. These techniques have been utilized in wide-spread commercial products such as those produced by Intel [22] and AMD [1]. The Advanced Configuration and Power Interface (ACPI) [20] was defined to standardize power management for different platforms. The use of heterogeneous cores and accelerators has also been explored to enhance energy efficiency [27], and the use of operating system timers has been scrutinized to enable longer periods of uninterrupted sleep [44]. Conserving disk-drive-related energy consumption has also been explored through scaled-speed and sleep modes [9].

In the computer networks domain, the concept of maximizing idleness through sleep states has also been the cornerstone of a large body of research that includes

putting network devices and interfaces to sleep during periods of low traffic [19].

Since network chatter can prevent connected workstations from going to sleep,

proxying has been explored [18], as has explicitly waking-up sleeping devices [26].

Other research efforts have focused on selecting a workload-dependent best policy to

manage a device from a pool of candidate policies [40] [45]. Buffering has also been

introduced to create bursty/idle cycles to allow enough sleep time for systems of low

to moderate utilization [53], and the use of burstiness has been investigated for mobile

devices [30].

Power management policies can be largely classified as heuristic policies, or

stochastic policies. Heuristic policies are simpler to adapt [21] [46], while stochastic

policies can offer performance guarantees only for stationary workloads [34] [45].

Application-level workload profiling has been discussed in [15] to guide power

management for meeting user-defined QoS goals such as a battery-life objective.

Addressing cluster-level power management through workload distribution is

usually attempted by powering-off some servers according to the total DC workload

[36] [33]. In [12], a Vary-On Vary-Off policy is introduced and combined with

Dynamic Voltage Scaling (DVS) and also Coordinated DVS (CVS), with the latter

policy shown not to be worthwhile. An economic resource distribution model through

bidding is introduced in [10], and utilizes a cost-based on-off model that powers on

machines if utility exceeds cost. It is well understood that the on-off model is limited by its high wake-up latency, and decreasing that latency is instrumental to maintaining QoS performance

levels [25].

In earlier research, the on-off approach suffered from a lack of means for live

workload migration in order to consolidate servers and power off lightly loaded

machines. This problem was alleviated by the emergence of virtualization

technologies such as Xen [52] and VMWare [50]. Resource provisioning and

isolation became critical in such a virtualized environment as discussed in [37]. Also,

translating power management directives from different VMs on the same hardware

becomes tricky and the need arises for a standardized method to interface and

coordinate VM power management [31].

Ensemble-level power management in data centers provides greater

opportunity to maximize energy-efficiency as outlined in [49]. However, the need for

coordination between different power management modules arises and is investigated

through control theory in [35]. Such problems are optimization problems that are

difficult to solve for large-scale data centers [23]. Additionally, optimizing power-

usage is subject to QoS requirements as problem constraints. In [43], the authors

present a QoS-based workload placement approach that reduces average power

consumption by running tasks as slowly as possible within performance bounds.

As we outlined in section 1, the magnitude of cooling costs in DCs is of the same order as IT costs and has thus received much attention. Researchers have

investigated reducing cooling costs through controlling the heat load generated in the

DC. As heat recirculation and equipment layout create distinct thermal conditions in

each DC, thermal-aware workload placement strategies were introduced. In [29],

zonal heat distribution and recirculation are factored in selecting a set of servers to

run the workload. In [42] a thermal multiplier is generated to capture DC thermal

conditions and is then used to prioritize workload placement. The same concept

of thermal profiling is used in [48] to design scheduling algorithms aimed at reducing

server inlet temperatures. The scale of the thermal-aware placement approach is

further expanded in [32] to be used for choosing a particular DC among many in a

grid environment. As exceeding thermal limits is always a concern with increased

heat densities, DC-level thermal throttling and capping is also investigated as in [51].

3 Energy Efficient Workload Placement

3.1 The case for energy efficient placement

Reviewing the body of research outlined in section 2 shows particular focus on

maximizing both temporal and spatial idleness through workload placement. In the

spatial dimension, servers’ workloads are consolidated to run a set of active servers at

their maximum utilization, while idle servers are powered off. In the time dimension,

we observe a strategy to maximize idle periods and allow a few consolidated peak

utilization periods. We refer to these approaches as maximum-idle (MI). Many

researchers have shown reductions in average power consumption by using variants

of MI.

In [43], researchers break from the pack and discuss a strategy based on running

workloads as slowly as possible within certain performance constraints for meeting

deadlines. Their hypothesis is that running slower will reduce average power

consumption through DVFS in server systems. While we agree that there is a price tag

associated with increased performance, contrary to MI, we contend that the energy consumption might not be reduced by such an approach, as we discuss below.

As we stated earlier, our research goal is to reduce energy consumption, and not average power consumption. Admittedly, energy-based measurements are inherently difficult to work with, as they are scenario-based and will track a task from start to finish. We lack quantification and benchmarking of energy efficiency, despite

some efforts such as the JouleSort benchmark presented in [41]. Researchers and

system designers have always resorted to the use of a power metric in lieu of an

energy metric, as power is an average rate that is easy to deal with. Lower average

power doesn’t necessarily translate into lower energy expenditure; the time needed to

complete a given task may be lengthened, with a slower rate corresponding to lower

average power, such that the total energy expenditure increases. This concept is

illustrated in Fig. 2, where a task is run twice at two different power rates. The energy

consumed in each case corresponds to the two areas under the curve, A1 and A2. Since

A2 is greater than A1, the total energy consumed in the second run of the task is

increased, even though the average power consumption has decreased. The necessary and sufficient condition for a lower-power strategy to translate into a lower-energy strategy is for the relative performance loss (a time-based metric) to be smaller than the relative power reduction.
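To state this condition as a worked relation, write energy as the product of average power and completion time, consistent with the areas in Fig. 2:

\[ E = \bar{P}\,t \qquad\Rightarrow\qquad E_2 < E_1 \;\Longleftrightarrow\; \frac{\bar{P}_2}{\bar{P}_1} \cdot \frac{t_2}{t_1} < 1 . \]

For example, a run that cuts average power to 80% of the original saves energy only if it takes less than 1/0.8 = 1.25 times as long.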

Fig. 2. Total energy consumed is the area under the curve

In order to investigate the underlying basis for MI success we must look at the power behavior with varying workload levels and be able to compare that across different systems. There are some inherent difficulties with this approach:

− As different kinds of workload will utilize different parts of server platforms in different manners, the power behavior observed will be workload dependent.

− It is difficult to standardize a method to specify graduated levels of platform utilization, as it is also workload dependent. CPU-bound workloads, for example, can be specified by CPU utilization levels, while memory-bound workloads can be specified by memory bandwidth utilized. Power behavior corresponding to one utilization form cannot be compared or aggregated with another.

− If we try to standardize workloads for the sake of comparison, we see that existing performance benchmarks will typically exercise underlying systems at near-peak utilization levels and hence cannot produce power traces that are valid to study graduated behavior. In fact, capturing a representative set of real data center workloads for repeatable experiments is a challenging task [28].

Addressing these issues led to the development of the SPEC Power benchmark, described and characterized in [17]. The workload used in SPEC Power is a transactional, throughput-oriented workload that is characteristic of web-based applications. Also, high performance computing applications that are massively or embarrassingly parallel fit into that category as well. Frameworks to transform compute tasks into small throughput-oriented jobs, such as Google’s MapReduce [39] [16], developed primarily for their web search application [5], are expanding the percentage of workloads that can be transformed into this category.

Choosing throughput as representative of utilization is indeed helpful as it eliminates the need to associate with workload-specific metrics such as CPU utilization. For these reasons we chose throughput, in general, and SPEC Power, in particular, to represent workloads in this paper. Although the throughput workload we are considering here is a specific kind of web-based small requests, many workloads can be equally represented by a throughput metric.

Fig. 3. SPEC Power curves for three different servers¹: Hewlett-Packard Company ProLiant DL580 G5 (1.86 GHz, Intel Xeon processor L7345), IBM Corporation System x3350, and Dell Inc. PowerEdge R300.

Fig. 3 shows SPEC Power curves for three different servers of past generations.

Using this representation to study the basis of MI, we note that each power curve is

almost linear with a slope less than 1. That slope means that larger performance gains

can be achieved with smaller power increase, and thus the most power efficient

operation is at maximum utilization, and this is the reason MI has been successful to

date. This result however is a consequence of the power behavior of the system which

can be represented by a power curve. Changes in the shape of the power curve can

easily lead to different behavior.
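To see why a near-linear curve with an idle offset favors full loading, write the curve in terms of the peak and idle attributes defined in section 2 and consider power per unit of throughput (a restatement of the figure's geometry, not new data):

\[ P(u) \approx P_{idle} + (P_{peak} - P_{idle})\,u \quad\Rightarrow\quad \frac{P(u)}{u} = \frac{P_{idle}}{u} + (P_{peak} - P_{idle}) , \]

which decreases monotonically in utilization u, so power per unit of work is lowest at u = 100%. Any convexity in the curve can reverse this conclusion at the high end, which is the crux of our argument below.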

We contend the behavior presented in Fig. 3 is a direct result of the inability of

underlying systems to efficiently scale dynamic power expenditure with workload.

Those systems achieved power savings primarily by entering sleep states, thus

minimizing idle power; the workload-proportional behavior of those systems is

enabled mostly by temporal idleness, as illustrated in [17] by showing the inversely

proportional relationship between the percentage of time spent in C1 state versus

system throughput. The best workload scaling that can be achieved via this model is

linear with utilization as shown in Fig. 3.

We argue that future servers and systems will exhibit markedly different power

behavior with respect to utilization as they pursue more aggressive dynamic scaling

with workload, since the nature of dynamic power scaling is non-linear. A simple example is the familiar relationship defining dynamic (switching) power, where C is the switched capacitance, V the supply voltage and f the clock frequency:

P = C·V²·f
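To spell out how such scaling bends the power curve upward, consider the standard CMOS proportionalities, with u denoting normalized utilization (these regimes are illustrative assumptions about how voltage and frequency track load, not measurements from the systems studied here):

\[ f \propto u,\ V \text{ fixed} \Rightarrow P \propto u; \qquad V \propto u,\ f \text{ fixed} \Rightarrow P \propto u^{2}; \qquad V \propto u,\ f \propto u \Rightarrow P \propto u^{3}. \]

Quadratic or steeper growth thus arises whenever the supply voltage is co-scaled with load.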

If we closely scaled the dynamic power of each server platform component, then according to this relationship, that would yield quadratic power behavior, with power efficiency getting worse at higher utilization. In fact, state-of-the-art systems are

starting to exhibit that behavior already. We select two systems representative of the

state-of-the-art for the evaluation discussed later in this paper. There is reason to believe this non-linear behavior will only get more pronounced in the future, as hardware system developers pursue more aggressive power savings.

¹ All SPEC Power data in this document are public postings on the SPEC website at http://www.spec.org

Given the above arguments, it is clear that the underlying principles of MI are

unlikely to hold in future systems. Accordingly, we contend that there exists a workload

placement distribution corresponding to the utilization level of a DC, which

minimizes total energy expenditure, and that does not necessarily coincide with MI

placement.

The overall energy envelope of the data center consists of both IT power, which is

used to power servers and networks, and facilities power that is used for cooling, power

provisioning and overheads. In our research we distinguish between two types of

energy-related costs: first, the cost of power, cooling and facilities provisioning in

capital expenditure; and second, the operational energy cost of powering servers,

cooling and facilities equipment in operational expenditure. The CapEx cost

component is proportional to peak power and set power budget, while the OpEx

component is proportional to average power. Our goal in this research is to focus on improving operational energy expenditure; hence we will not discuss efficiencies

related to under/over capacity provisioning. In this paradigm, the operational cooling

cost is dependent on the heat generated by IT equipment and hence, their power

expenditure. Subsequently, we will focus on discussing energy-efficient expenditure

in servers and correlate its effect on DC thermals and cooling.

To summarize our research directions, we believe that achieving maximum energy-

efficiency in data centers should be driven by a holistic strategy that takes into

consideration all system components to achieve maximum synergetic power savings.

This strategy seeks to effectively use existing power-management features within

each system component in a way that is proportional to the workload of the

component and of the overall system.

We contend that such a holistic strategy can be achieved and coordinated through

intelligent workload placement. The workload placement strategy is informed of the

energy-proportionality of each individual subsystem (a server in this case) and will

assign workloads in a manner that achieves maximum energy efficiency over the

entire DC. We are not addressing thermal-aware workload placement at this point, but rather a placement that finds the most efficient load distribution while remaining thermal-friendly, i.e., one that does not cause an undesirable rise in cooling costs due to localized

increase in heat generation (formation of hot spots).

3.2 Methodology and Limitations

As outlined in the previous section, we contend that there exists a workload

distribution strategy that will maximize energy efficiency in the data center, and that

doesn’t necessarily coincide with MI. We identified normalized throughput-based

characterizations of workload, such as achieved by SPEC Power, to be a valid

representation of a system’s power proportionality. This is an optimization problem that seeks to minimize energy consumption as workloads are being assigned. In

order to investigate our hypothesis we constructed a profiling simulator written in

C++ to experiment with different settings. In order to simplify the problem we only

consider homogeneous DCs in the simulator. We will show in the results analysis

how to easily expand our framework to incorporate heterogeneous DCs and we intend

to incorporate that in our future work.
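Formally, for a homogeneous DC of n servers with per-server power profile C(·), the placement problem we profile can be written as follows (our notation, matching the simulator's 10% chunking):

\[ \min_{u_1,\dots,u_n} \sum_{i=1}^{n} C(u_i) \quad \text{s.t.} \quad \sum_{i=1}^{n} u_i = n\,U_{DC}, \qquad u_i \in \{0, 0.1, \dots, 1.0\}, \]

where U_DC is the normalized DC utilization level being exercised and C(0) is the active idle power.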

The simulator has a cost matrix for each server type that is constructed from the

SPEC Power profiling of the server. Each cost matrix has 11 values corresponding to

power usage by the server at utilization levels ranging from 0% (active idle) to 100%

of maximum throughput. The number of servers in the DC is given as input to the

simulator for each run. Simulation runs exercise the DC in a normalized graduated

utilization manner at 10% increments of the DC’s maximum throughput workload.

Maximum DC workload is determined by multiplying the number of servers in the

DC by the maximum throughput of each server. DC throughput directly corresponds

to the arrival rate of jobs. We assume that for arrival rates higher than the total DC

capacity, jobs are queued at the workload dispatcher. The simulator finds all possible

combinations of workload distributions at each utilization level, and calculates the

aggregate corresponding cost over the entire DC. The simulator then identifies

minimum, maximum and MI cost values for each workload level, corresponding to

minimum, maximum and MI aggregate power consumption. To limit the search

space, we assume that workloads can only be distributed to servers in chunks of 10%

of the maximum server load.
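A minimal sketch of this profiling search, in the same C++ setting as our simulator, is shown below; the cost-matrix values and the server count are illustrative placeholders rather than the SPEC Power data and configurations used in our experiments:

```cpp
#include <algorithm>
#include <array>
#include <cstdio>
#include <limits>

// Per-server cost matrix: power (W) at utilization 0%, 10%, ..., 100%.
using CostMatrix = std::array<double, 11>;

// Enumerate every split of `remaining` 10%-chunks of work across servers
// [idx..n), accumulating per-server power; track the minimum aggregate.
void search(const CostMatrix& cost, int n, int idx, int remaining,
            double acc, double& best) {
    if (idx == n) {
        if (remaining == 0) best = std::min(best, acc);
        return;
    }
    int top = std::min(remaining, 10);          // one server holds at most 100%
    for (int level = 0; level <= top; ++level)  // level 0 is active idle
        search(cost, n, idx + 1, remaining - level, acc + cost[level], best);
}

// Maximum-idle (MI) placement: consolidate onto as few servers as possible,
// run those at 100%, and leave the rest at active idle (cost[0]).
double miCost(const CostMatrix& c, int n, int chunks) {
    int full = chunks / 10, rem = chunks % 10;
    int idle = n - full - (rem > 0 ? 1 : 0);
    return full * c[10] + (rem > 0 ? c[rem] : 0.0) + idle * c[0];
}

int main() {
    // Hypothetical convex profile with a super-linear tail, as in m1/m2.
    CostMatrix profile = {60, 74, 87, 99, 112, 126, 142, 160, 181, 206, 236};
    const int n = 4;                            // servers in the homogeneous DC
    for (int lvl = 0; lvl <= 10; ++lvl) {       // DC utilization 0%..100%
        int chunks = lvl * n;                   // demand in 10%-of-a-server units
        double best = std::numeric_limits<double>::infinity();
        search(profile, n, 0, chunks, 0.0, best);
        std::printf("DC %3d%%: min = %6.1f W  MI = %6.1f W\n",
                    lvl * 10, best, miCost(profile, n, chunks));
    }
}
```

The exhaustive enumeration is exactly the combinatorial search described above; its cost grows as 11 to the power n, which is what motivates the hierarchical decomposition discussed in section 4.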

Note that the slow-down factor we identified as a necessary condition to translate

power efficiency into energy efficiency, is not present in this framework; that is

because the requested workload throughput is not scaled or throttled but is kept

constant as the server scales its power consumption accordingly.

The purpose of the profiling simulation is to characterize the DC’s aggregate

power behavior versus throughput, as well as to compare and contrast different

machine configurations and varying number of machines. Profiling results can be

used to develop heuristics aiding in designing workload placement policies in DCs.

4 Results and Analysis

In this section, we consider three different servers as building blocks for a

DC system of varying number of machines. The three server systems are respectively

named: m1, m2 and m3. The SPEC Power profiles of the three systems under

investigation are presented in Fig. 4 below. Note that both m1 and m2 exhibit the non-

linear power scaling we outlined in section 3, while m3 has a mostly linear power

curve.

Fig. 4. SPEC Power profiles of the three machines under study: Hewlett-Packard Company ProLiant DL380 G6 (2.26 GHz, Intel Xeon L5520 processor), m1; Dell Inc. PowerEdge R710 (Intel Xeon X5570, 2.93 GHz), m2; Fujitsu Siemens Computers PRIMERGY TX150 S6 (Intel Xeon L3360), m3.

We run the profiling simulator on three DC systems that are homogeneous

collections of the three machines. We use the graduated workload methodology

outlined in section 3.2 to vary the DC throughput. The number of machines in

each DC is varied between 2 and 8. Nonetheless, we will show later that the results

are valid for larger configurations. The purpose of the simulations is to find a

minimum workload distribution at each DC throughput-level and study the properties

of this distribution. In Fig. 5 below we plot the minimum power distribution against

normalized DC utilization for a number of machines n ranging from 2 to 8 in a DC

system of m1. Fig. 6 shows the percentage increase of power savings above MI

placement, achieved by finding the minimum power distributions at each DC

utilization level.

Fig. 5. Minimum power versus throughput for different DC configurations, m1.


Fig. 6. Percentage power savings versus MI, m1.

Examining Fig. 5 and Fig. 6 shows that, for an m1-based DC, there is a minimum-aggregate-power workload distribution different from MI, and quantifies the savings obtained at various utilization levels and with different numbers of machines.

We obtain analogous results for m2, while for m3, MI proves to always coincide with

the minimum as suggested by the shape of its power profile in Fig. 4. The above

figures, however, do not provide insights into the properties of the minimum power

distribution and additional analysis is needed.

Taking the analysis further, we normalize the resulting minimum power curve to

the value obtained at maximum throughput. This is akin to the way we normalized

throughput and thus we can compare the aggregate minimum power of different DC

configurations. The results of this approach are presented in Fig. 7, Fig. 8 and Fig. 9,

for m1, m2 and m3 respectively.
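In symbols, the normalization applied here is simply

\[ \hat{P}_{min}(u) \;=\; \frac{P_{min}(u)}{P_{min}(100\%)} , \]

which puts DCs with different machine counts on a common scale, exactly as throughput was normalized to the DC maximum.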

Fig. 7. Normalized percentage minimum aggregate power versus DC throughput

for m1-based DC.


Fig. 8. Normalized percentage minimum aggregate power versus DC throughput

for m2-based DC.

Fig. 9. Normalized percentage minimum aggregate power versus DC throughput

for m3-based DC.

The above charts show that varying the number of machines in the DC doesn’t

change the shape of the normalized aggregate minimum power curve obtained

through workload placement. The shape of that curve does change, though, with a different server base, as we see by considering m1, m2 and m3. That means that the power

profile (or behavior) of the DC is independent of the number of machines for a

homogeneous system of a fixed workload type such as the system profiled.

We examine the power savings obtained by seeking minimum workload

distribution versus MI, and we plot that against throughput for both m1 and m2 while

fixing n. We ignore m3 because those savings always amount to zero in this case. Fig.

10, Fig. 11 and Fig. 12 show those savings for n = 4, n = 6 and n = 8 respectively.


Fig. 10. Percentage of power saved versus MI, n = 4.

Fig. 11. Percentage of power saved versus MI, n = 6.

Fig. 12. Percentage of power saved versus MI, n = 8.

From the above figures, we observe that there is a sweet spot in the DC utilization

level that will result in a workload distribution corresponding to a maximum power


savings over MI. The location of that sweet spot is independent of the machine type,

but dependent on the number of machines.

In the results above, we find a workload distribution that minimizes energy

consumption given a fixed DC configuration according to the number of machines. This

approach however does not scale well as the size of the DC grows. We need to find a

way to further constrain the search space of the problem. Accordingly, we investigate

another approach to optimization of workload placement, by aggregating machines

into a hierarchy of nodes and distributing the workload between the top-most nodes in

the hierarchy at each level until bottom nodes (individual servers) are reached. That

hierarchy is similar to a tree structure with a constant branching factor equal to the

number of nodes at each level. We use our profiling simulator to test this strategy and

produce the DC power profiles as shown earlier. Simulations seek the most efficient

placement at each node level. We set up an experiment with a DC of 8 servers organized in a node hierarchy of 3 levels with a branching factor of 2 at each level. Fig. 13 shows the resulting normalized power curve at each hierarchical level, with the DC of 8 servers at the top-most level.

Fig. 13. Normalized power curve for a DC of 8 nodes and a branching factor of 2 in

3 levels of hierarchy, m2.

Looking at the above chart we conclude that the aggregate power behavior of the

DC doesn’t change by breaking down the best placement decision into a hierarchy of

nodes. The trade-off between the branching factors and the depth of the hierarchy can

be used to reduce exhaustive search time for best placement. This observation can be

used to find best placement in a DC with a large number of servers. We also consider

using profiling to find adequate placement heuristics for a given branching factor, and

then repeatedly applying that across the hierarchy.
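One compact way to realize this level-by-level search is as a min-plus combination of node power profiles: combining two sibling profiles yields the parent node's cheapest power at every demand level, and repeating the combination up the tree reproduces the whole-DC profile. The sketch below assumes a branching factor of 2 and the 10%-chunk demand units used earlier; it illustrates the idea rather than reproducing the simulator's exact code:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

using Profile = std::vector<double>;  // power at 0, 1, ..., C chunks of demand

// Min-plus combination of two sibling profiles: for every total demand d,
// the cheapest split of d between the two subtrees.
Profile combine(const Profile& a, const Profile& b) {
    Profile out(a.size() + b.size() - 1,
                std::numeric_limits<double>::infinity());
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            out[i + j] = std::min(out[i + j], a[i] + b[j]);
    return out;
}

// DC profile for 8 servers as 3 levels with branching factor 2:
// servers -> pairs -> quads -> whole DC (mirrors the Fig. 13 experiment).
Profile dcProfile(const Profile& server) {
    Profile pairs = combine(server, server);    // level 0 -> level 1
    Profile quads = combine(pairs, pairs);      // level 1 -> level 2
    return combine(quads, quads);               // level 2 -> whole DC
}
```

Because the min-plus combination is associative, delegating the split decision to each level loses nothing relative to the flat exhaustive search, consistent with the unchanged aggregate behavior in Fig. 13.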

In fact, such an approach can be generalized to partition a heterogeneous DC into

several homogeneous clusters and treat them as end nodes at the top-most level for

workload distribution. We intend to investigate the heterogeneous structure in our

future work.


5 Recommended Framework

In section 4, we profiled the problem of finding a minimum power workload

distribution in homogeneous DCs for fixed transactional workloads. We showed that

the aggregate power behavior of the DC, when normalized, is independent of the

number of machines. On the other hand, the maximum power savings obtained from

such a distribution corresponds to a certain DC utilization level that depends on the

total number of machines. We also showed how to structure the DC into a hierarchy

of nodes and thus limit the problem to the set of nodes at each level, in a repeatable

fashion.

It is worth noting that in our simulations we considered only active idle power at

0% utilization and didn’t power-off machines to get a power cost of zero. There is no

doubt that turning machines off will save more power at low utilization levels.

However, we only consider an active set of servers among a pool of all available

servers in the DC. Machines outside the active set are considered to be shut down and thus do not contribute to the workload distribution problem. Machines are brought up as the active set expands and are turned off as it contracts. The active set should contain enough servers to guard against rapid workload growth, and

thus to stay within latency thresholds. We intend to investigate the trade-offs between

the active-set size and DC energy-efficiency in our future work.
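As a small illustration of this sizing rule (the headroom fraction below is a hypothetical tuning knob, not a value we have evaluated):

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical active-set sizing: provision enough servers for the current
// demand plus a headroom margin, so sudden workload growth stays within
// latency thresholds while servers outside the set remain powered off.
int activeSetSize(double demand, double serverCapacity,
                  double headroom /* e.g. 0.2 for 20% */, int poolSize) {
    int needed = static_cast<int>(
        std::ceil(demand * (1.0 + headroom) / serverCapacity));
    return std::clamp(needed, 1, poolSize);     // always keep one server up
}
```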

DC cooling cost is directly proportional to the maximum heat load generated. As

MI seeks to run servers at their maximum utilization through workload consolidation,

this is likely to concentrate the heat load and thus increase cooling cost.

Our approach, however, distributes workloads such that servers avoid operating at

peak utilization; this is because of the increased energy cost of peak utilization present

in the power curves of well-power-managed systems. Hence, we expect our approach

to be thermal-friendly, albeit not thermal-aware as it does not capture the thermal

conditions in the DC.

We also consider spatially rotating the active set by adding and dropping servers to

it, to decrease localized heat densities. The active set of servers should also be

spatially dispersed for the same reason.

6 Conclusions and Future Work

The problem of reducing energy consumed by DCs is a chief concern as DCs grow in

size and complexity. Previous research has typically addressed this problem through

power management techniques that capitalize on maximizing both spatial and

temporal idleness.

In this paper we investigate the contributing factors to the energy expenditure in

the DC, and analyze power behavior corresponding to different levels of workloads.

We illustrate, using power profiles, why MI was believed to offer the best workload distribution and how that is changing in future systems that pursue aggressive dynamic power management. We discuss the differences between

minimizing energy and minimizing average power and show they are not

synonymous. We use a throughput metric to characterize both DC and server

utilization. Subsequently, we employ the corresponding power curves to design a new

cost-based workload placement strategy that seeks to minimize power usage at each

DC utilization level. As the throughput is kept constant, and not scaled or throttled

down, we ensure that the minimum power placement corresponds to a minimum

energy placement. We evaluate our new strategy through a profiling simulator that we

have developed, and study the power behavior of various DC configurations

employing three different server types. We validate our strategy through simulation

and show that it finds a minimum-power workload distribution that does not

necessarily coincide with MI. We then expand the result to show that there is a sweet spot for DC utilization where the power savings are maximized. Afterwards, we show how to scale down the problem by arranging the DC into a hierarchy of nodes

with a given branching factor at each level. We expect that result to be very useful

when applied to large-scale DCs, as well as to heterogeneous configurations. We then

refine the framework by introducing the concept of a rotating, spatially-dispersed active set of servers. Although this approach is not thermally-aware, we believe it is thermal-friendly as it avoids increased localized heat loads.

Heterogeneity in both server type and workload type is a natural expansion of the

approach presented in this paper, which we plan to tackle in future work. We propose to

address heterogeneity by aggregating similar servers/workloads into groups in a

hierarchy of top-level heterogeneous nodes.

Another good area for future investigation is incorporating cost metrics

corresponding to energy consumption in other DC systems, such as: networking,

workload migration, heat generation, heat recirculation and thermal layout. We

believe, however, that the cost contribution of both heat recirculation and thermal

layout can be eliminated by careful DC design. Modern DCs already exhibit this

feature by tunneling hot or cold air in isolation, as well as through a host of other techniques.

References

1. AMD Opteron Processor Power and Thermal Data sheet, publication #30417,

http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_

9003,00.html

2. Armbrust, M., Fox, A., Griffith, R., Joseph, A., Katz, R., Konwinski, A., et al.:

Above the Clouds: A Berkeley View of Cloud Computing. University of

California, Berkeley, Technical. Report (2009)

3. Barroso, L. A.: The Price of Performance. ACM Queue , vol. 3, pp. 48-53. ACM,

New York (2005)

4. Barroso, L., & Holzle, U.: The case for energy-proportional computing.

Computer, IEEE Computer Society, vol. 40, no. 12, pp. 33-37. IEEE Press, New York

(2007)

5. Barroso, L., Dean, J., & Holzle, U.: Web search for a planet: The Google cluster

architecture. IEEE micro , vol. 23, pp. 22-28. IEEE Press, New York (2003)

6. Belady, C., & Malone, C.: Metrics and an Infrastructure Model to Evaluate Data

Center Efficiency. In proceedings of the ASME InterPACK, Vancouver, BC.

(2007)

7. Bohrer, P., Elnozahy, E., Keller, T., Kistler, M., Lefurgy, C., McDowell, C., et

al.: The case for power management in web servers. Power Aware Computing,

pp. 261. Kluwer Academic Publishers (2002)

8. Buyya, R., Yeo, C., Venugopal, S., Ltd, M., & Melbourne, A.: Market-oriented

cloud computing: Vision, hype, and reality for delivering it services as computing

utilities. In proceedings of the 10th IEEE International Conference on High

Performance Computing and Communications (HPCC). IEEE CS Press, Los

Alamitos, CA, USA (2008)

9. Carrera, E., Pinheiro, E., & Bianchini, R.: Conserving disk energy in network

servers. In proceedings of the 17th annual international conference on

Supercomputing, pp. 86-97. ACM New York, NY, USA (2003)

10. Chase, J., Anderson, D., Thakar, P., Vahdat, A., & Doyle, R.: Managing energy

and server resources in hosting centers. In proceedings of the eighteenth ACM

symposium on Operating systems principles ,pp. 103-116. ACM New York, NY,

USA (2001)

11. Dhiman, G., & Rosing, T.: Dynamic power management using machine learning.

In proceedings of the 2006 IEEE/ACM international conference on Computer-

aided design, pp. 747-754. ACM New York, NY, USA (2006).

12. Elnozahy, E., Kistler, M., & Rajamony, R.: Energy-efficient server clusters.

Lecture Notes in Computer Science, Vol. 2325, pp. 179-196. Springer, Heidelberg (2003)

Environmental Protection Agency (EPA): Report to Congress on Server and Data

Center Energy Efficiency Public Law 109-431 (2007)

14. Fan, X., Weber, W., & Barroso, L.: Power provisioning for a warehouse-sized

computer. In Proceedings of the 34th annual international symposium on

Computer architecture, pp. 13-23. ACM, New York, NY, USA (2007)

15. Flinn, J., & Satyanarayanan, M.: Managing battery lifetime with energy-aware

adaptation. ACM Transactions on Computer Systems (TOCS) , vol. 22, pp. 137-

179. ACM, New York (2004)

16. Ghemawat, S., & Dean, J.: MapReduce: Simplified Data Processing on Large

Clusters. In Proceedings of the Sixth Symposium on Operating System Design

and Implementation vol. 6, p. 10. (2004)

17. Gray, L., Kumar, A., & Li, H.: Characterization of SPECpower_ssj2008

benchmark. SPEC Benchmark Workshop. (2008)

18. Gunaratne, C., Christensen, K., & Nordman, B.: Managing energy consumption

costs in desktop PCs and LAN switches with proxying, split TCP connections,

and scaling of link speed. International Journal of Network Management, vol. 15,

p.p 297-310. (2005)

19. Gupta, M., Grover, S., & Singh, S.: A feasibility study for power management in

LAN switches. In proceedings of the 12th IEEE International Conference on

Network Protocols (ICNP), pp. 361-371. (2004)

20. HP, Intel, Microsoft, Phoenix, Toshiba: Advanced configuration and power

interface specification. ACPI Specification Document, Revision 3. (2004)

21. Hwang, C., & Allen, C.: A predictive system shutdown method for energy saving

of event-driven computation. ACM Transactions on Design Automation of

Electronic Systems (TODAES) ,vol. 5, pp. 226-241. (2000)

22. Intel Xeon Processor 7400 Series datasheet,

http://www.intel.com/products/processor/xeon7000/documentation.htm?iid=prod

ucts_xeon7000+tab_techdocs. (2008)

23. Keeton, K., Kelly, T., Merchant, A., Santos, C., Wiener, J., Zhu, X., et al.: Don’t

settle for less than the best: use optimization to make decisions. In proceedings of

the 11th USENIX workshop on Hot topics in Operating Systems. USENIX

Association Berkeley, CA, USA (2007)

24. Koomey, J.: Estimating total power consumption by servers in the US and the

world. Final report. (2007)

25. Meisner, D., Gold, B., & Wenisch, T.: PowerNap: Eliminating Server Idle Power.

SIGPLAN Notices, vol. 44, pp. 205-216. ACM, New York, USA. (2008)

26. Mishra, N., Chebrolu, K., Raman, B., & Pathak, A.: Wake-on-WLAN. In

proceedings of the 15th international conference on World Wide Web, pp. 761-

769. ACM New York, NY, USA (2006)

27. Mogul, J., Mudigonda, J., Binkert, N., Ranganathan, P., & Talwar, V.: Using

Asymmetric Single-ISA CMPs to Save Energy on Operating Systems. IEEE

Micro , vol. 28, pp. 26-41. IEEE Press (2008)

28. Moore, J., Chase, J., Farkas, K., & Ranganathan, P.: Data center workload

monitoring, analysis, and emulation. In proceedings of Eighth Workshop on

Computer Architecture Evaluation using Commercial Workloads (2005)

29. Moore, J., Chase, J., Ranganathan, P., & Sharma, R.: Making scheduling “cool”:

Temperature-aware resource assignment in data centers. In proceedings of

Proceedings of the USENIX Annual Technical Conference (2005)

30. Nathuji, R., & Schwan, K.: Reducing system level power consumption for mobile

and embedded platforms. In proceedings of the International Conference on

Architecture of Computing Systems (ARCS) (2005)

31. Nathuji, R., & Schwan, K.: Virtualpower: Coordinated power management in

virtualized enterprise systems. ACM SIGOPS Operating Systems Review, vol.

41, no. 6, pp. 265-278. ACM New York, NY, USA (2007)

32. Patel, C., Sharma, R., Bash, C., & Graupner, S.: Energy Aware Grid: Global

Workload Placement based on Energy Efficiency. In proceedings of IMECE (2003)

33. Pinheiro, E., Bianchini, R., Carrera, E., & Heath, T.: Load balancing and

unbalancing for power and performance in cluster-based systems. In proceedings

of Workshop on Compilers and Operating Systems for Low Power, vol. 180, pp.

182-195 (2001)

34. Qiu, Q., & Pedram, M.: Dynamic power management based on continuous-time

Markov decision processes. In proceedings of the 36th ACM/IEEE conference on

Design automation, pp. 555-561. ACM New York, NY, USA (1999)

Raghavendra, R., Ranganathan, P., Talwar, V., Wang, Z., & Zhu, X.: No "power" struggles: coordinated multi-level power management for the data center. In proceedings of ASPLOS (2008)

36. Rajamani, K., Lefurgy, C., Res, I., & Austin, T.: On evaluating request-

distribution schemes for saving energy in server clusters. In proceedings of the

IEEE International Symposium on Performance Analysis of Systems and

Software, pp. 111-122. (2003)

37. Ramakrishnan, L., Irwin, D., Grit, L., Yumerefendi, A., Iamnitchi, A., & Chase,

J.: Toward a doctrine of containment: grid hosting with adaptive resource control.

In proceedings of the ACM/IEEE conference on Supercomputing. ACM New

York, NY, USA (2006)

38. Ranganathan, P., Leech, P., Irwin, D., & Chase, J.: Ensemble-level power

management for dense blade servers. In proceedings of the 33rd International

Symposium on Computer Architecture (ISCA), pp. 66-77. IEEE Computer

Society Washington, DC, USA (2006)

39. Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., & Kozyrakis, C.:

Evaluating MapReduce for multi-core and multiprocessor systems. In

proceedings of IEEE 13th International Symposium on High Performance

Computer Architecture, pp. 13-24. IEEE Computer Society Washington, DC,

USA (2007)

40. Ren, Z., Krogh, B., & Marculescu, R.: Hierarchical adaptive dynamic power

management. IEEE Transactions on Computers , vol. 54, pp. 409-420. IEEE

Press (2005)

41. Rivoire, S., Shah, M., Ranganathan, P., & Kozyrakis, C.: Joulesort: a balanced

energy-efficiency benchmark. In proceedings of the 2007 ACM SIGMOD

international conference on Management of data, pp. 365-376. ACM New York,

NY, USA (2007)

42. Sharma, R., Bash, C., Patel, C., Friedrich, R., & Chase, J.: Balance of power:

Dynamic thermal management for internet data centers. IEEE Internet

Computing ,vol. 9, pp. 42-49. IEEE Press (2005)

43. Sharma, V., Thomas, A., Abdelzaher, T., Skadron, K., & Lu, Z.: Power-aware

QoS management in web servers. In proceedings of 24th IEEE Real-Time

Systems Symposium, pp. 63-72. IEEE Press (2003)

44. Siddha, S., Pallipadi, V., & De, A. V.: Getting maximum mileage out of tickless.

In proceedings of the Linux Symposium, pp. 201-208 (2007)

45. Simunic, T., Benini, L., Glynn, P., & Micheli, G. D.: Event-driven power

management. IEEE Transactions on Computer-Aided Design of Integrated

Circuits and Systems , vol. 20, pp. 840-857. IEEE Press (2001)

46. Srivastava, M., Chandrakasan, A., & Brodersen, R.: Predictive system shutdown

and other architectural techniques for energy efficient programmable

computation. IEEE Transactions on Very Large Scale Integration (VLSI)

Systems ,vol. 4, pp. 42-55. IEEE Press (1996)

47. Sullivan, R.: Alternating cold and hot aisles provides more reliable cooling for

server farms. Uptime Institute (2000)

48. Tang, Q., Gupta, S., & Varsamopoulos, G.: Energy-Efficient Thermal-Aware

Task Scheduling for Homogeneous High-Performance Computing Data Centers:

A Cyber-Physical Approach. IEEE Transactions on Parallel and Distributed

Systems , vol. 19, pp. 1458-1472. IEEE Press (2008)

49. Tolia, N., Wang, Z., Marwah, M., Bash, C., Ranganathan, P., & Zhu, X.:

Delivering Energy Proportionality with Non Energy-Proportional Systems--

Optimizing the Ensemble. In proceedings of the 1st Workshop on Power Aware

Computing and Systems (HotPower) (2008)

50. VMware, http://www.vmware.com

51. Weissel, A., & Bellosa, F.: Dynamic thermal management for distributed

systems. In proceedings of the First Workshop on Temperature-Aware Computer

Systems (2004)

52. Xen, http://www.xen.org/

53. Yavatkar, R., & Krishnamurthy, L.: Method and apparatus for managing energy

usage of processors while executing protocol state machines. US Patent App.

10/056,160 (2002)