
Power-aware Consolidation of Scientific Workflows in Virtualized Environments

Qian Zhu, Jiedan Zhu, Gagan Agrawal
Department of Computer Science and Engineering, Ohio State University, Columbus OH

{zhuq, zhujie, [email protected]}

Abstract—The recent emergence of clouds with large, virtualized pools of compute and storage resources raises the possibility of a new compute paradigm for scientific research. With virtualization technologies, consolidation of scientific workflows presents a promising opportunity for energy and resource cost optimization, while achieving high performance.

We have developed pSciMapper, a power-aware consolidation framework for scientific workflows. We view consolidation as a hierarchical clustering problem, and introduce a distance metric that is based on interference between resource requirements. A dimensionality reduction method (KCCA) is used to relate the resource requirements to performance and power consumption. We have evaluated pSciMapper with both real-world and synthetic scientific workflows, and demonstrated that it is able to reduce power consumption by up to 56%, with less than 15% slowdown. Our experiments also show that scheduling overheads of pSciMapper are low, and the algorithm can scale well for workflows with hundreds of tasks.

I. INTRODUCTION

Business and scientific computing processes can often be modeled as workflows. Formally, a “scientific workflow is a specification of a scientific process, which represents, streamlines, and automates the analytical and computational steps that a scientist needs to go through from dataset selection and integration, computation and analysis, to final data product presentation and visualization”¹. Large-scale scientific workflows are used to support research in areas like astronomy [1], bioinformatics [5], physics [12], seismic research [2], and many others. Scientific workflow management systems like Pegasus [18] and Kepler [31] have been developed with the goals of supporting specification, management, and execution of workflows, especially on large datasets and high-end resources like the TeraGrid [4].

Another recent trend has been towards cloud computing. Lately, there is a growing interest in the use of cloud computing for scientific applications, as evidenced by a large-scale funded project like Magellan², and related research efforts like the Nimbus project³. Several projects have experimented with scientific workflows on cloud environments [17], [26], [23]. One of the characteristics of cloud environments is the use of virtualization technologies, which enable applications to set up and deploy a customized virtual environment. Virtual Machines (VMs) are configured independent of the resources and can be easily migrated to other environments. Another feature of cloud computing is the pay-as-you-go model, or the support for on-demand computing. Applications have access to

¹http://www.cs.wayne.edu/~shiyong/swf/swf2010.html
²http://www.cloudbook.net/doe-gov
³http://workspace.globus.org

a seemingly unlimited set of resources, and users are charged based on the number and type of resources used.

Power management is another critical issue for clouds hosting thousands of computing servers. A single high-performance 300W server consumes 2628 kWh of energy per year, with an additional 748 kWh for cooling [10], which also causes a significant adverse impact on the environment in terms of CO2 emissions. Even in 2006, data centers accounted for 1.5% of the total U.S. electricity consumption, and this number is expected to grow to 3.0% by 2011⁴. Lately, there has been much interest in effective power management, and techniques like Dynamic Voltage and Frequency Scaling (DVFS) have been applied to reduce power consumption [22], [51], [40], [38].
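As a quick sanity check, the annual-energy figure above follows directly from the server's rated power, assuming continuous operation at 300 W:

\[
0.3\,\mathrm{kW} \times 24\,\tfrac{\mathrm{h}}{\mathrm{day}} \times 365\,\tfrac{\mathrm{days}}{\mathrm{yr}} = 2628\,\mathrm{kWh/yr}
\]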

This paper focuses on effective energy and resource cost management for scientific workflows. Our work is driven by the observation that tasks of a given workflow can have substantially different resource requirements, and even the resource requirements of a particular task can vary over time. Mapping each workflow task to a different server can be energy inefficient, as servers with a very low load still consume more than 50% of the peak power [8]. Thus, server consolidation, i.e., allowing workflow tasks to be consolidated onto a smaller number of servers, can be a promising approach for reducing resource and energy costs.

We apply consolidation to the tasks of a scientific workflow, with the goal of minimizing the total power consumption and resource costs, without a substantial degradation in performance. Particularly, the consolidation is performed on a single workflow with multiple tasks. Effective consolidation, however, poses several challenges. First, we must carefully decide which workloads can be combined together, as workload resource usage, performance, and power consumption are not additive. The interference of combined workloads, particularly those hosted in virtual machines, and the resulting power consumption and application performance need to be carefully understood. Second, due to the time-varying resource requirements, resources should be provisioned at runtime among consolidated workloads.

We have developed pSciMapper, a power-aware framework for consolidating scientific workflow tasks in virtualized environments. We first study how resource usage impacts the total power consumption, particularly taking virtualization into account. Then, we investigate the correlation between workloads with different resource usage profiles, and how power and performance are impacted by their interference. Our algorithm derives from the insights gained from these experiments. We first summarize the key temporal features of the CPU, memory, disk,

⁴http://www.energystar.gov/index.cfm?c=prod_development.server_efficiency_study

© 2010 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

SC10 November 2010, New Orleans, Louisiana, USA 978-1-4244-7558-2/10/$26.00


[Figure 1: the GLFS workflow DAG, with nodes gMM5, gWaterLevel, gTurbulentStress, gNetHeatFlux, gGridResolution, gPOM2D, gPOM3D, and gVis; the legend marks data flow, data partitioning, data aggregation, and pipeline edges.]

Fig. 1. GLFS Workflow

and network activity associated with each workflow task. Based on this profile, consolidation is viewed as a hierarchical clustering problem, with a distance metric capturing the interference between the tasks. We also consider the possibility that the servers may be heterogeneous, and an optimization method is used to map each consolidated set (cluster) onto a server. As an enhancement to our static method, we also perform time-varying resource provisioning at runtime.

We evaluated pSciMapper using several real and synthetic scientific workflows. The observations from our experiments show that our consolidation algorithm, with time-varying resource provisioning, is able to save up to 56% of the total consumed power, with only a 10-15% performance slowdown, over the case where each workflow task is mapped to a different server. This is close to, or even better than, an optimal static approach that is based on exhaustive search. In addition, we also show that the overhead of pSciMapper is negligible and it is practical even for workflows with hundreds of tasks.

The rest of the paper is organized as follows. We motivate our work by analyzing the resource requirements of a scientific workflow application in Section II. We describe our power-aware consolidation approach in Section III. Results from our experimental evaluation are reported in Section IV. We compare our work with related research efforts in Section V and conclude in Section VI.

II. MOTIVATING APPLICATIONS

The work presented in this paper is driven by the characteristics of scientific workflows (or DAG-based computations), and the opportunities for optimization they offer. We specifically study one application in detail.

A. Great Lake Forecasting System (GLFS)

We now describe one such scientific application in detail. Particularly, we first show its DAG structure. Then, we analyze the resource usage of the GLFS workflow and discuss the characteristics that make consolidation a promising solution for saving power and resource costs in any virtualized environment, including clouds.

This application monitors meteorological conditions of Lake Erie for nowcasting (for the next hour) and forecasting (for the next day). Every second, data comes into the system from sensors planted along the coastline, or from satellites observing this particular coastal district. Normally, Lake Erie is divided into multiple coarse grids, each of which is assigned available

resources for model calculation and prediction. We illustrate the GLFS workflow in Figure 1. The number of inputs processed by the workflow may increase over time as more regions of Lake Erie need to be covered. This also translates to an increase in the number of computational tasks. The initial satellite sea surface temperature and other measured data, including the wind speed, pressure, and air humidity, are fed into the gMM5 task. Then, the updated meteorological data go through a series of interaction models, including gTurbulentStress, gNetHeatFlux, and gWaterLevel, as shown in the figure. These steps prepare the data required for predicting the corresponding meteorological information using the gPOM2D or gPOM3D tasks. Before starting the POM model, the gGridResolution task decides the resolution that will be used in the model. Finally, the grid resolution, along with the output from the various interaction models, serves as input to the gPOM2D task (which applies a 2D ocean model) and/or to the gPOM3D task (which applies a 3D ocean model for more accurate prediction). These POM models predict meteorological information, including the water level, current velocity, and water temperature, for Lake Erie. The gVis task is then used to project the outputs onto images.

The runtime of most individual tasks in the GLFS workflow can be quite small (seconds to a few minutes). Yet, GLFS is clearly a compute-intensive application. On a standard PC (dual Opteron 254 (2.4 GHz) processors, with 8 GB of main memory and 500 GB of local disk space), predicting meteorological information for the next two days for an area of 1 square mile took nearly 2 hours. The execution time could grow rapidly with the need for increased spatial coverage (for example, the total area of Lake Erie is 9,940 square miles), and/or better temporal granularity. Better response time can be achieved by using different servers for different workflow tasks. Obviously, this will also increase resource and power costs, and the right tradeoff between execution time and resource and power costs needs to be maintained. In the next subsection, we perform a detailed analysis of the resource utilization of the GLFS workflow and motivate our work on using server consolidation to reduce power and resource costs.

B. Resource Usage of GLFS Workflow

We execute the GLFS application on a Linux cluster, which consists of 64 computing nodes. Each node has dual Opteron 254 (2.4 GHz) processors with 8 GB of main memory and 500 GB of local disk space, and the nodes are interconnected with switched 1 Gb/s Ethernet. During the execution of the GLFS workflow, we used the Sysstat tools [3] to collect the following performance data every 1 second: CPU utilization, memory utilization, disk I/O (including reads and writes), and network I/O (including sends and receives). Like other scientific workflows, GLFS is long-running and iterative in nature. For a given input, the steps between transferring data into the first task, gMM5, and generating the output from gVis are referred to as one invocation of the workflow. Across invocations, the behavior of the application can differ, due to different data sizes and/or dynamics exposed by the application. Figure 2 shows the CPU, memory, disk, and network utilization of two GLFS tasks, for 5 invocations of the workflow.

As illustrated in Figure 2, each resource usage trace can be expressed as a time series. The resource usage of the first task varies across invocations. In contrast, the resources consumed by the second task remain roughly the same across invocations. We further analyze the characteristics of the time series we have


[Figure 2: CPU utilization (%), memory utilization (MB), disk I/O (bytes), and network I/O (bytes) over roughly 5000 seconds of execution for two GLFS tasks, spanning five invocations with parameter settings <500, 3, 600>, <1000, 6, 600>, and <2000, 12, 1200>.]

Fig. 2. The CPU, Memory, Disk and Network Usage of Two GLFS Components (<1000, 6, 600>, <500, 3, 600> and <2000, 12, 1200> Represent Different Application Parameters)

obtained. Particularly, we applied the auto-correlation function (ACF) [27] to the series, which shows that the resource utilization data from these scientific workflows can be expressed as periodic time series. Furthermore, a time series can be either stationary, with observations fluctuating around a constant mean with constant variance, or non-stationary, where the underlying process has unfixed moments (i.e., mean and variance) that tend to increase or decrease over time. We claim that the resource utilization time series from one invocation at least meets the weak stationarity requirements, namely that the first two moments (mean and variance) do not vary with time. The resource usage pattern of GLFS, in this respect, is similar to the observations made from other scientific workflow executions [13], [50].
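To make the ACF-based periodicity check and the weak-stationarity check concrete, here is a minimal Python sketch on a synthetic periodic "CPU utilization" trace. The series, window count, and tolerance below are illustrative assumptions, not values from the paper:

```python
# Sketch: detect periodicity via the sample autocorrelation function (ACF),
# and check weak stationarity by comparing per-window mean/variance against
# the global moments. The trace is synthetic, not GLFS data.
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (r[0] == 1)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:n - k], x[k:]) / denom for k in range(max_lag + 1)])

def is_weakly_stationary(x, n_windows=5, tol=0.25):
    """Crude check: each window's mean and variance stay near the global ones."""
    windows = np.array_split(np.asarray(x, dtype=float), n_windows)
    mu, var = np.mean(x), np.var(x)
    return all(abs(w.mean() - mu) <= tol * (abs(mu) + 1e-9) and
               abs(w.var() - var) <= tol * (var + 1e-9) for w in windows)

# Synthetic trace: period-50 oscillation around 50% utilization, plus noise.
rng = np.random.default_rng(0)
t = np.arange(1000)
cpu = 50 + 30 * np.sin(2 * np.pi * t / 50) + rng.normal(0, 2, t.size)

r = acf(cpu, 120)
period = int(np.argmax(r[10:]) + 10)   # first strong ACF peak past short lags
```

For this trace the dominant ACF peak recovers the injected period of 50 samples, and the per-window moments stay within tolerance, i.e., the series passes the weak-stationarity check.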

Overall, we can make the following observations from the resource usage patterns we have seen. 1) A scientific workflow can be composed of a large number of tasks. The workflow tasks have a periodic behavior with respect to CPU, memory, disk, and network usage, with most of them incurring low resource utilization (e.g., Figure 2(b)). 2) The resource usage of a workflow task is significantly smaller than its peak value for more than 80% of the time (except for network activity, which remains constant over time). Therefore, provisioning resources based on peak requirements will likely lead to resource and power wastage. Furthermore, considering the resource utilization time series of different tasks, we can see that their respective usage of the same resource does not peak at the same time. 3) The resources consumed by a workflow task can depend on the values of the application parameters for the particular invocation (Figure 2(a)), as well as on the characteristics of the host server. Comparing the 1st (and 2nd) to the 4th invocation, the resource usage pattern remains the same, with only a change in the intensity of resource consumption. However, even the pattern itself can change because of a different set of application parameters (consider the 3rd invocation). In the 5th invocation, the same parameter values were used with a change in the resource availability of the host server.

III. DESIGN OF PSCIMAPPER

This section presents the framework, pSciMapper, that we have developed to consolidate scientific workflows in virtualized environments. We first give an overview of our approach. Next, we discuss how we extract key temporal information from the given resource utilization profiles (time series) for each task of the workflow, and further relate it with application performance and power consumption.

[Figure 3: the pSciMapper design. Offline analysis: resource usage time series from scientific workflows undergo temporal feature extraction to produce temporal signatures, which feed Kernel Canonical Correlation Analysis; the resulting KCCA model, together with a Hidden Markov Model (HMM), is stored in a knowledge base. Online consolidation: hierarchical clustering and an optimization search algorithm produce the consolidated workloads, followed by time-varying resource provisioning.]

Fig. 3. The pSciMapper Framework Design

Then, we present our power-aware consolidation algorithm, which places tasks of a scientific workflow on servers so that power consumption is minimized, without a substantial increase in the completion time. Finally, towards the end of this section, we describe how we learn a model capturing the resource usage profile's dependence on the application parameters and the characteristics of the resources on which the workflow is executed.

A. Overview

In order to reduce the total power in a computing environment, effective power management strategies have been developed [22], [51], [45], [39]. Dynamic Voltage and Frequency Scaling (DVFS) is a widely applied technique to minimize processor power by scaling down CPU frequencies. However, multi-frequency processors are required to support DVFS. Based on our observations on the workflows that have been studied, we focus on a different approach in this paper, based on server consolidation. Unlike previous work on consolidating workloads from web applications [49], [44], we not only specifically focus on scientific workflows (DAGs), but also consider a more complex consolidation problem. We investigate server heterogeneity, with respect to CPU, memory, disk, and network bandwidths, as well as the application requirements for these resources. As the power consumed by the same workload executing on different servers can vary significantly, and the total power consumption of two co-located workloads is not equal to the sum of their individually consumed power, our workload consolidation problem cannot simply be solved by minimizing the product of the number of servers and application execution time. Instead, we also study the interference caused by resource contention


when multiple workflow tasks are placed on a single server, and how it impacts power consumption and application performance.

As a motivation for our solution using consolidation, we conducted experiments to investigate the relationship between resource usage and power consumption, particularly taking virtualization into account. Then, performance and power consumption with consolidation of different types of workloads were analyzed. We summarize our observations as follows. Details of our experiments are available in an associated technical report [53].

• All resource activities impact power consumption. Variation in CPU utilization has the largest impact on total power consumption, though memory footprint and cache activity also impact the consumed power.

• Disk activity as well as network I/O add a constant overhead to the unit power consumption.

• Enabling virtualization (Xen) incurs only very small additional overheads. Furthermore, when there is CPU contention between Xen VMs, dynamic provisioning of CPU at runtime (i.e., using Non-Work-Conserving mode with the cap set in the Xen credit scheduler [6]) saves power compared to using Work-Conserving mode.

• Consolidation of workloads with dissimilar requirements can reduce the total power consumption and resource requirements significantly, without a substantial degradation in performance.

Our system, pSciMapper, is based on the observations from these experiments. The overall design of pSciMapper is illustrated in Figure 3. It consists of two major components: online consolidation and offline analysis. The online consolidation algorithm assumes that, for each task of the workflow to be scheduled, a set of time series data representing the CPU, memory, disk, and network utilization is available. In practice, such specific information may not be available for the given hardware and application parameters without executing the workflow. We train a hidden Markov model (HMM) to predict such time series, which we describe towards the end of this section (Section III-D).

Now, for consolidation, we extract key temporal features from the time series (Section III-B). This results in 52 features per workflow task, not all of which may be equally important for predicting the execution time and power consumption. We use a dimensionality reduction technique, Kernel Canonical Correlation Analysis (KCCA) [7], to relate the key features to the execution time and power consumption.

Next, the online consolidation algorithm uses hierarchical clustering [25] and the Nelder-Mead optimization algorithm [35] to search for the optimal consolidation, where power requirements are minimized without a substantial impact on performance (Section III-C). As an enhancement to our static consolidation approach, we also perform time-varying resource provisioning based on the dynamic resource requirements of the workflow tasks.
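The two online building blocks named above, hierarchical clustering and Nelder-Mead search, can be sketched with standard SciPy routines. The task feature vectors, the cluster cut, and the "power cost" function below are toy stand-ins of our own, not pSciMapper's interference metric or its KCCA-based cost model:

```python
# Sketch of the two library primitives behind the online algorithm:
# agglomerative (hierarchical) clustering over task feature vectors, and
# Nelder-Mead minimization of a smooth stand-in cost.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.optimize import minimize

rng = np.random.default_rng(1)
pts = rng.random((6, 4))            # 6 tasks x 4 toy resource features
Z = linkage(pts, method='average')  # hierarchical clustering dendrogram
labels = fcluster(Z, t=2, criterion='maxclust')  # cut into at most 2 clusters

# Nelder-Mead over a single "CPU cap" knob with a hypothetical power model
# (quadratic bowl with its minimum at cap = 0.6).
cost = lambda x: (x[0] - 0.6) ** 2 + 0.1
res = minimize(cost, x0=[0.2], method='Nelder-Mead')
```

In pSciMapper the distance passed to the clustering step would encode inter-task interference, and the Nelder-Mead objective would come from the KCCA-predicted time/power metrics rather than a closed-form bowl.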

B. Temporal Feature Extraction and Kernel Canonical Correlation Analysis

The relationships between resource usage time series and a workflow task's execution time and power consumption are complex. To facilitate our analysis, we use a specific temporal signature to capture the key characteristics of a given time series.

Temporal Signature: Given a resource usage time series, which specifies the requirements for a specific resource (CPU, memory, disk, or network) of a workflow task, we extract the following three sets of information:

1) Peak value: We define it as the maximum value of the time series data and denote it as max.

2) Relative variance: We compute the sample variance of the data, denoted as v². However, the sample variance depends on the scale of measurement of a particular resource. Therefore, we transform the v² values to a normalized [0, 1] space, with 0 being the lowest variance and 1 being the highest. This value serves as an indicator of fluctuation around the average resource utilization of the workflow.

3) Pattern: Besides peak and variance, we need to capture how the resource usage varies around an expected baseline. We take the repetitive pattern of the time series data and generate a sequence of samples. Such samples should be selected in a way that preserves the information from the original time series as much as possible. As suggested in [27], the original function in continuous time, x(t), is multiplied by the δ-function⁵ in order to generate a sequence of sampled values, {x_n}, with a sample interval of 1 second. In this work, we further group such samples into 10 groups so that similar values are within the same group. The average sample value is used to represent each group. Therefore, the pattern is represented by 10 numbers, p_1, p_2, ..., p_10. This approach can be used with any number of groups; the number 10 was chosen through our experiments, as it preserved the original time series without adding complexity to our dimensionality reduction method (KCCA).

Now the temporal signature for a resource utilization time series (T_i) can be represented as

\[
T_i = \langle \mathit{max}_i,\; v_i^2,\; p_{1,i},\; p_{2,i},\; \ldots,\; p_{10,i} \rangle \qquad (1)
\]
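A minimal sketch of building the Eq. (1) signature follows. Two simplifications are our own assumptions rather than the paper's exact procedure: the variance is normalized against the squared data range (the paper normalizes across resources), and the 10 pattern groups are taken as equal-width segments of one period rather than value-similarity groups:

```python
# Sketch: extract a <max, v^2, p1..p10> temporal signature from a resource
# usage time series (simplified grouping and normalization, see lead-in).
import numpy as np

def temporal_signature(series, period, n_groups=10):
    x = np.asarray(series, dtype=float)
    peak = x.max()
    spread = x.max() - x.min()
    v2 = x.var() / (spread ** 2) if spread > 0 else 0.0  # crude [0,1] scaling
    one_period = x[:period]                              # one repetition
    pattern = [seg.mean() for seg in np.array_split(one_period, n_groups)]
    return np.array([peak, v2, *pattern])

# Synthetic periodic "CPU utilization" trace, period 60 s, 10 repetitions.
t = np.arange(600)
cpu = 40 + 20 * np.sin(2 * np.pi * t / 60)
sig = temporal_signature(cpu, period=60)   # 12 numbers: max, v^2, p1..p10
```

Concatenating four such 12-element signatures (CPU, memory, disk, network) gives 48 of the 52 per-task input features mentioned in the text.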

As the last part of the offline analysis in pSciMapper, we need to relate the usage of resources with the workflow task's performance and power consumption when it is mapped onto a server. We train a model for this purpose. The input to this model is the temporal signatures we extracted from the CPU, memory, disk, and network utilization time series, plus resource variables representing the host server's capacity. The model outputs the time and consumed power for all tasks of the scientific workflow on this server, due to dependencies between individual tasks. Thus, we have 52 features in the input space, and the number of features in the output space is the number of tasks in the workflow. In our work, we have used Kernel Canonical Correlation Analysis (KCCA) [7] for modeling the resource-time and resource-power relationships.

[Figure 4: KCCA training. Resource usage time series are converted to temporal signature vectors (e.g., <0.12, 0.3, 0.6, ...>), and power/performance metrics form metric vectors (e.g., <34.5, 12.65>); KCCA maps the two sides into a resource usage projection and a metric projection with maximum correlation between them.]

Fig. 4. KCCA Training: From Vectors of Resource Usage Features to Vectors of Metric Features

KCCA: Canonical Correlation Analysis (CCA) is a dimensionality reduction technique used to determine the correlation between two multi-variate datasets. KCCA is a generalization of CCA that uses kernel functions to model nonlinear relationships [7].

5 http://en.wikipedia.org/wiki/Kronecker_delta


Algorithm III.1: WORKFLOWCONSOLIDATION(C, S)

INPUT   C: the set of components; S: the set of servers
OUTPUT  Θ: the consolidation plan

// initial assignment of components
InitialCluster(C, S);
// generate resource usage profile
for each Ci and Sj
    GenerateTS(Ci, Sj);
Θ = OptimalSearch(C, S);
while (true)
    for each cluster_i and cluster_j
        // calculate distance between two clusters
        linkage_{i,j} = CalculateDist(cluster_i, cluster_j);
        if linkage_{i,j} < merge_threshold
            // consolidate workloads in two clusters
            MergeCluster(cluster_i, cluster_j);
    Θ = OptimalSearch(C, S);
    if (stopping criterion is met)
        break;
TimeVaryingResourceProvisioning(Θ);

Fig. 5. Workflow Consolidation Algorithm
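The control flow of Algorithm III.1 can be sketched as follows. This is a minimal Python sketch of the agglomerative loop only; `distance` and `optimal_search` are stand-ins for the CalculateDist and Nelder-Mead placement steps described in the text, and the single-pair-per-pass merge order is our own simplification.

```python
def consolidate(tasks, servers, merge_threshold, distance, optimal_search):
    """Sketch of Algorithm III.1: repeatedly merge the cluster pairs whose
    interference-based distance falls below merge_threshold, re-running the
    placement search after each pass, until no merge is possible."""
    clusters = [[t] for t in tasks]          # level 1: singleton clusters
    plan = optimal_search(clusters, servers)
    while True:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if distance(clusters[i], clusters[j]) < merge_threshold:
                    clusters[i] += clusters.pop(j)   # merge the two clusters
                    merged = True
                    break
            if merged:
                break
        plan = optimal_search(clusters, servers)
        if not merged:                       # stopping criterion: no merge left
            return plan

# toy usage: every pair is mergeable, so all tasks end in one cluster
plan = consolidate(list(range(3)), [], 1.0,
                   lambda a, b: 0.0,
                   lambda clusters, servers: [len(c) for c in clusters])
```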

Given N temporal signatures from resource usage profiles, we form an N×N matrix K_x, where the (i, j)-th entry is the kernel evaluation k_x(x_i, x_j). Specifically, we have used the Gaussian kernel [37]. Similarly, we form an N×N matrix K_y, where the (i, j)-th element is the kernel evaluation k_y(y_i, y_j) with respect to application performance and power consumption. The KCCA algorithm takes the matrices K_x and K_y and solves the following generalized eigenvector problem:

\begin{bmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{bmatrix}
\begin{bmatrix} A \\ B \end{bmatrix}
= \lambda
\begin{bmatrix} K_x K_x & 0 \\ 0 & K_y K_y \end{bmatrix}
\begin{bmatrix} A \\ B \end{bmatrix}    (2)

This procedure finds subspaces in the linear space spanned by the eigenfunctions of the kernel functions such that projections onto these subspaces are maximally correlated [7]. We refer to these projections as the resource usage projection and the metric projection, respectively, and illustrate them in Figure 4. If the linear space associated with the Gaussian kernel can be understood as clusters in the original feature space, then KCCA finds correlated pairs of clusters in the resource usage vector space and the performance/power vector space. The details of how the KCCA model is trained are presented in Section IV-C.
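A compact numerical sketch of solving Eq. (2) is shown below. This is our own illustration with NumPy on synthetic data: the Gaussian kernel follows the text, but the small ridge term (needed to keep the right-hand-side matrices invertible) and the reduction of the generalized problem to an ordinary eigenproblem are implementation choices not specified in the paper.

```python
import numpy as np

def gaussian_kernel(Z, sigma=1.0):
    """Gaussian (RBF) kernel matrix for the rows of Z."""
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 12))         # temporal signatures
Y = X @ rng.normal(size=(12, 2))      # correlated power/time metrics
Kx, Ky = gaussian_kernel(X), gaussian_kernel(Y)

n = Kx.shape[0]
reg = 1e-3 * np.eye(n)                # ridge term for numerical stability
lhs = np.block([[np.zeros((n, n)), Kx @ Ky],
                [Ky @ Kx, np.zeros((n, n))]])
rhs = np.block([[Kx @ Kx + reg, np.zeros((n, n))],
                [np.zeros((n, n)), Ky @ Ky + reg]])
# Generalized eigenproblem lhs v = λ rhs v, solved as rhs^{-1} lhs v = λ v.
vals, vecs = np.linalg.eig(np.linalg.solve(rhs, lhs))
top = np.argsort(-vals.real)[0]
# Leading eigenvector: A (top half) and B (bottom half) give the
# maximally correlated projection directions.
A, B = vecs[:n, top].real, vecs[n:, top].real
```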

C. Power-Aware Consolidation

We now present the details of the online consolidation algorithm in our pSciMapper framework. The main algorithm is summarized in Figure 5. For generality, we assume that the available servers may be heterogeneous. We first randomly assign each of the workflow tasks onto an underlying server. We assume for now that the CPU, memory, disk, and network usage time series for executing the task on this particular server are available. In practice, this information is obtained using the HMM, which we describe later.

Our consolidation problem can be viewed as a clustering problem. Normally, clustering groups similar objects together. Here, our goal is to group workflow tasks that are dissimilar in their resource requirements. This is handled by defining a distance measure that focuses on the interference between the resource requirements of the workflow tasks.

In particular, we apply the agglomerative hierarchical clustering algorithm [25] using the extracted temporal signature vectors. The consolidation procedure is iterative and starts from the bottom of the hierarchical cluster structure, i.e., with each cluster formed by a single workflow task. We decide whether two tasks, i and j, can be consolidated by calculating the distance between them. Specifically, we define the distance metric as follows:

Dist_{i,j} = \sum_{R1_i, R2_j} \Big( aff\_score(R1_i, R2_j) \times Corr(peak_{R1,i}, peak_{R1,j}) \times Corr(peak_{R2,i}, peak_{R2,j}) \Big)    (3)

where R1

i , R2

j denotes any type of resources (thus 10 pairs intotal considering all CPU, memory, disk and network) from tasksi and j. and Corr(peakR1

i , peakR1

j ) is the Pearson’s correlationbetween two workloads with regards to the resource usage of R1.We argue that it is more important to consider the correlation of thepeak values rather than the whole temporal signature, as co-locatedtasks are more likely to incur resource contention at their peaks.The aff score is pre-specified based on our correlation analysis(details can be found in [53]). It varies between 0 and 1, with 0denoting the least interference when consuming two resources R1

i

and R2

j together. The intuition behind the distance definition is toconsolidate dissimilar tasks together. However, for tasks with similarresource usage, they could be co-located if resource contention canbe avoided. Note that the DAG structure has to be considered here.Although combining two workloads with active network activitiescould incur significant computing overhead, communication cost canbe reduced if there is a direct dependency between them. Thusaff score is small for two network-intensive workloads with aparent-child relationship. We merge two clusters if the distancebetween them is smaller than a threshold, i.e . merge threshold inthe pseudo-code.
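The distance computation of Eq. (3) can be sketched as follows. This is illustrative only: the aff_score table and the peak-usage series are hypothetical stand-ins for the values the paper derives from its correlation analysis [53].

```python
import itertools
import numpy as np

RESOURCES = ["cpu", "mem", "disk", "net"]

def corr(a, b):
    """Pearson correlation between two peak-usage series."""
    return float(np.corrcoef(a, b)[0, 1])

def distance(task_i, task_j, aff_score):
    """Eq. (3): sum over the 10 unordered resource pairs of
    aff_score(R1, R2) * Corr(peak_R1 of i, peak_R1 of j)
                      * Corr(peak_R2 of i, peak_R2 of j)."""
    total = 0.0
    for r1, r2 in itertools.combinations_with_replacement(RESOURCES, 2):
        total += (aff_score[(r1, r2)]
                  * corr(task_i[r1], task_j[r1])
                  * corr(task_i[r2], task_j[r2]))
    return total

# hypothetical usage profiles and a uniform aff_score table
rng = np.random.default_rng(1)
profile = lambda: {r: rng.random(50) for r in RESOURCES}
aff = {p: 0.5 for p in itertools.combinations_with_replacement(RESOURCES, 2)}
d = distance(profile(), profile(), aff)
```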

Next, a well-known optimization algorithm, Nelder-Mead [35], is applied to identify how the resulting clusters can be mapped to the given set of servers. The goal here is to minimize power without incurring a significant performance degradation. The consumed power and execution time for a one-to-one mapping of clusters to servers can be estimated using the KCCA model trained offline. Specifically, the input vector, which includes the temporal signature of the resource usage profile along with the server capacity, is projected into the resource subspace. We then infer the coordinates in the metric subspace using the k nearest neighbors, with k = 3 in our implementation. Finally, we map the metric projection back to the metrics, i.e., the consumed power and the execution time, using a weighted sum of the metric projections of the k nearest neighbors, with each weight being the inverse of the distance between the coordinates of the two projections in the subspace. The optimal point of this iteration, with the minimum total power consumption, is recorded.
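The inverse-distance-weighted k-nearest-neighbor inference step can be sketched as below; this is a minimal sketch under our own assumptions (random stand-in projections; the paper's actual KCCA projections and metric vectors would take their place).

```python
import numpy as np

def knn_estimate(query_proj, train_proj, train_metrics, k=3):
    """Estimate <power, time> for a new resource-subspace projection as
    the inverse-distance-weighted average over the k nearest training
    points (k = 3, as in the text)."""
    d = np.linalg.norm(train_proj - query_proj, axis=1)
    nearest = np.argsort(d)[:k]
    w = 1.0 / (d[nearest] + 1e-9)      # guard against zero distance
    w /= w.sum()
    return w @ train_metrics[nearest]

# hypothetical training projections and their <power, time> metrics
rng = np.random.default_rng(0)
train_proj = rng.normal(size=(20, 8))
train_metrics = rng.normal(size=(20, 2))
est = knn_estimate(train_proj[0], train_proj, train_metrics)
```

Querying with a point that coincides with a training projection returns (essentially) that point's own metrics, since its weight dominates.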

Then, the temporal signature of the new cluster is updated from the consolidated workloads. The consolidation iterations stop when the clusters cannot be merged anymore, because merging would incur significant interference and/or an intolerable degradation in application performance.

As an enhancement to the static consolidation algorithm, we further take into account the dynamic resource requirements of the consolidated workloads to save power. Recall that the workloads on a single server consume less power when CPU is dynamically allocated using the Non-Work-Conserving mode, compared to the Work-Conserving mode, which grants CPU to workloads whenever there are idle cycles. To support such dynamics, we propose a heuristic method for performing time-varying resource provisioning on each of the consolidated servers. Specifically, we make a reservation plan that allocates virtual CPU and memory (currently, Xen only supports runtime changes to CPU and memory) to individual workloads. Clearly, the total requirements for these two resources should not exceed what is available on the physical server. If we are unable to meet the peak requirements, resources are allocated to workloads in proportion to their peak values. Note that such resource reallocation is only triggered at time instances where there are significant changes in the resource requirements, since frequently changing the resource allocations can itself involve significant overheads. To decide when to trigger resource reallocation, for each of the co-located workloads i, we calculate the mean (m_i) and variance (v_i^2) of the CPU usage. If the current CPU utilization is larger than m_i + 3 × v_i^2, we conclude that a significant change has occurred for this workload, and a reallocation of CPU is needed. This process is repeated for memory as well.
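The reallocation trigger can be sketched in a few lines; this follows the rule as stated in the text (note that the paper's threshold uses the variance v_i^2, not the standard deviation), with the function name being our own.

```python
import numpy as np

def needs_reallocation(history, current):
    """Fire a reallocation only on a significant change: the rule in the
    text triggers when the current usage exceeds m_i + 3 * v_i^2, the mean
    plus three times the variance of the past usage of this workload."""
    m, v2 = np.mean(history), np.var(history)
    return bool(current > m + 3 * v2)
```

The same check would be applied independently to the CPU and memory usage history of each co-located workload.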

[Figure 6: a four-level hierarchical clustering of five tasks C1-C5, annotated with their CPU, memory, disk, and network intensities, together with the server assignment and estimated <power, time> at each level.]

Fig. 6. Consolidating Workflows: An Example

Example: We use an example to explain our consolidation algorithm, as illustrated in Figure 6. Assume that the workflow has 5 tasks, C1, C2, ..., C5, originally placed on 5 servers, S1, S2, ..., S5. The resource usage profile of each task is known in advance. At the beginning of the consolidation algorithm, each Ci at level 1 is randomly assigned to a server Sj. The optimization algorithm then generates the assignment whose power consumption is minimal at the current level. We then move up to level 2, where C1 and C2, and likewise C4 and C5, are merged into new clusters. This is consistent with what we expect from observing their resource utilization characteristics. The optimal search finds the servers for the consolidated workloads, and the power consumption drops significantly as we are now using only 3 servers. This procedure repeats at level 3, where C3 is merged with the cluster containing C1 and C2. Note that, due to the high memory footprint of C3 and its intensive I/O activity, serious interference would occur if it were co-located with C4 or C5. Now, the 5 tasks of the workflow are consolidated on 2 servers, and we are able to save half of the power with only a 9% performance degradation. Since no more tasks can be consolidated, our algorithm outputs the plan with C1, C2, and C3 on S2, and C4 and C5 on S1.

D. Hidden Markov Model for Resource Usage Estimate

The approach presented so far assumes that when a workflow is submitted, along with a set of application parameters, we know the time series of CPU, memory, disk I/O, and network I/O usage for each workflow task on any of the servers in our environment. In practice, however, it is impossible to know this information. We now describe how we train a hidden Markov model (HMM) based on a representative set of application parameters and hardware specifications. Using this HMM, resource usage time series can be generated for given application parameters and a hardware specification.

A hidden Markov model (HMM) is a statistical model that assumes a sequence of observations O = O1, O2, ..., OT is generated from a set of N hidden states (S = S1, S2, ..., SN), with state transition probabilities (a_ij) between them, and emission probabilities b_j(O_t), denoting the probability of an observation O_t being generated from state j. We use a first-order HMM, in which the current state depends only on the previous state.

The HMM for resource utilization prediction is constructed as follows. We present the details of the HMM implementation in our experimental evaluation (Section IV-E).

1) Hidden states: They represent equivalence classes of resource usage. Specifically, a hidden state at time step t is denoted as q_t = < cpu_t, mem_t, disk_t, net_t >, covering the CPU utilization, memory utilization, disk I/O, and network I/O. One issue we had to address is that, in practice, each resource usage variable is a continuous number, which leads to an infinite number of hidden states. We discretize the variables and thus make the number of states finite.

2) Observations: At time step t, the observation O_t is a set comprising the values of the application parameters and the current resource availability of the hosting node (i.e., CPU, memory, disk, and network bandwidth). We denote the observation at time t as O_t = < para_t, resource_t >.

3) State transition probability: It is denoted as A = {a_ij}, where

a_ij = P(q_t = S_j | q_{t-1} = S_i), 1 ≤ i, j ≤ N    (4)

The transition matrix characterizes the distribution over the states for CPU, memory, disk, and network usage. During the training of the HMM, a_ij is calculated as the ratio between the number of transitions from state S_i to S_j and the total number of transitions originating from S_i.
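The counting estimate of a_ij described above can be sketched directly (a minimal sketch; state sequences here are already discretized into integer state indices, as the text's discretization step produces):

```python
from collections import Counter

def transition_matrix(state_seq, n_states):
    """Estimate a_ij = (# transitions S_i -> S_j) /
    (# transitions originating from S_i)."""
    counts = Counter(zip(state_seq, state_seq[1:]))
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        out = sum(counts[(i, j)] for j in range(n_states))
        for j in range(n_states):
            A[i][j] = counts[(i, j)] / out if out else 0.0
    return A

# toy usage: transitions (0,1), (1,1), (1,0), (0,1)
A = transition_matrix([0, 1, 1, 0, 1], 2)
```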

4) Emission probability given a hidden state j: It is denoted as B = {b_j(O_t)}, where

b_j(O_t) = P(O_t | q_t = S_j), 1 ≤ j ≤ N    (5)

As each observation contains multiple variables, we use a multivariate Gaussian distribution for b_j(O_t), for each of the hidden states.

5) Initial state distribution π: We assume that the application starts with empty resource usage, i.e., q_1 = < 0, 0, 0, 0 > with probability one.

Given the above HMM, λ = (A, B, π), the task of predicting the CPU, memory, disk, and network usage is essentially to solve the following problem: given the observed application parameter values and server resource availability, O = O1, O2, ..., OT, and the model λ, find the most likely resource usage state sequence Q_max = q1, q2, ..., qT. That is,

Q_max = argmax_Q p(Q | O, λ) = argmax_Q p(Q, O | λ) = argmax_Q p(O | Q, λ) p(Q | λ)    (6)

Page 7: Power-aware Consolidation of Scientic Worko ws in ...korobkin/tmp/SC10/papers/pdfs/...Power-aware Consolidation of Scientic Worko ws in Virtualized Environments Qian Zhu Jiedan Zhu

where T is the total number of time steps and Q is a sequence of resource usage states. p(O | Q, λ) is defined by B, and p(Q | λ) is defined by A.

To solve this problem, we first need to train the HMM. That is, given an observation sequence O and a set of states Q, the parameters of the HMM (i.e., λ) are learned using the Forward-Backward algorithm [41]. In our problem, the HMM parameters reduce to the parameters of the multivariate Gaussian distributions used for the emission probabilities. The algorithm estimates these distribution parameters by iteratively computing the likelihood of the observation sequence given the current model parameters, and then updating them to maximize that likelihood. Then, given the trained HMM and a sequence of observations (i.e., application parameters and resource availability on the server), the best hidden state sequence can be obtained using the Viterbi algorithm [41]. Therefore, the CPU, memory, disk, and network usage of the application can be predicted without executing it on the server.
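The Viterbi decoding step, which recovers the most likely state path, can be sketched as below. This is the standard log-domain Viterbi recursion, not the paper's code; the per-step observation log-likelihoods log b_j(O_t) are taken as a precomputed matrix (in the paper they would come from the fitted multivariate Gaussians).

```python
import numpy as np

def viterbi(obs_loglik, log_A, log_pi):
    """Most likely hidden-state path given per-step observation
    log-likelihoods (T x N), transition log-probs (N x N), and the
    initial state log-distribution (N)."""
    T, N = obs_loglik.shape
    delta = log_pi + obs_loglik[0]          # best log-score ending in each state
    back = np.zeros((T, N), dtype=int)      # back-pointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] + log_A     # scores[from, to]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + obs_loglik[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):           # follow back-pointers in reverse
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy 2-state example: observations favor state 0, 0, then 1
obs_loglik = np.log([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]])
log_A = np.log([[0.8, 0.2], [0.2, 0.8]])
log_pi = np.log([0.5, 0.5])
path = viterbi(obs_loglik, log_A, log_pi)
```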

IV. EXPERIMENTAL EVALUATION

This section presents results from a number of experiments we conducted to evaluate our power-aware consolidation algorithm.

A. Algorithms Compared, Metrics and Goals

As a baseline for evaluating our proposed algorithm, we implemented an optimal approach. In this approach, the best consolidation plan is obtained with an exhaustive search, i.e., we try every combination of placing workloads together and executing them on different servers. The consolidation plan resulting in the minimum power and less than 15% performance degradation is taken as the result. Furthermore, once the workloads are consolidated, we use the work-conserving scheme during application execution, so that the resource requirements of a workflow task are always satisfied as long as the total requirements of the consolidated tasks do not exceed the available resources. The consolidation plan thus obtained, and the resulting execution, are referred to as Optimal + Work Conserving.

We also evaluated two versions of our consolidation algorithm. In the first version, only static consolidation is performed; resources are then statically assigned among the consolidated workloads in proportion to the mean of their resource requirements. This is denoted as pSciMapper + Static Allocation. In the second version, we also apply the time-varying resource provisioning part of our algorithm. We refer to this as pSciMapper + Dynamic Provisioning. Finally, the three versions thus obtained through consolidation are compared with the case where each workflow task runs on an individual server, without virtualization. This is denoted as Without Consolidation.

To evaluate the performance of our approach against the optimal approach and the case where consolidation is not used, we use the following two metrics:

• Normalized Total Power Consumption: This shows the power saved by our power-aware consolidation algorithm, as a percentage of the total power consumed by the Without Consolidation version.
• Execution Time: It is defined as the makespan of the workflows.

Using the optimal approach and the above two metrics, we designed the experiments with the following goals: 1) Demonstrate that our power-aware consolidation algorithm can reduce total power significantly without incurring substantial slowdowns. 2) Demonstrate that the overhead of our algorithm is negligible, and that the algorithm is scalable to workflows with a large number of tasks. 3) Demonstrate that our proposed resource prediction model is effective in estimating resource usage, and that models trained on one type of hardware remain effective on another type of hardware.

TABLE I
RESOURCE USAGE OF SCIENTIFIC WORKFLOWS

Application  CPU       Memory    Disk      Network
GLFS         High      Moderate  Moderate  Low
VR           Moderate  High      Moderate  Moderate
SynApp1      Low       Low       High      High
SynApp2      Moderate  High      Moderate  Low
SynApp3      High      Moderate  Low       Low

B. Experimental Setup and Applications

Our experiments were conducted using 2 Linux clusters, each of which consists of 64 computing nodes. One cluster has dual Opteron 250 (2.4 GHz) processors with 8 MB L2 cache and 8 GB main memory, while the other has Intel Xeon E5345 (2.33 GHz) nodes, each comprising two quad-core CPUs, with 8 MB L2 cache and 6 GB main memory. Computing nodes are interconnected with switched 1 Gb/s Ethernet within each cluster. We chose Xen as the virtualization technology and used a Xen-enabled 3.0 SMP Linux kernel in a stock Fedora 5 distribution. On a single consolidated server, hardware resources are shared between the virtual machines that host application service tasks and the management domain (dom0 in Xen terminology). Throughout our experiments, we restrict the management domain to one physical core, thus isolating it and avoiding performance interference. The virtual machines hosting application services share the remaining physical cores. The placement of VMs is decided at the start of application execution. To facilitate hosting the scientific workflows, the VMs are created and customized for each application.

The experiments we report were conducted using two real applications and three synthetic workflows. The two real applications are GLFS and Volume Rendering (VR). The first of these was described earlier in Section II. Volume Rendering interactively creates a 2D projection of a large time-varying 3D data set (volume data) [19]. The workflow we implemented reads 7.5 GB of image data, stores the spatial and temporal information in a tree structure, and then renders the final image using composed unit images. VR is considered a memory-intensive application, as it consumes more than 1 GB of physical memory for 76% of its execution time. In order to cover a wide range of scientific workflows with different resource requirements and scales, we also evaluated our consolidation algorithm using three synthetic scientific workflows, drawn from a set of benchmark scientific applications developed by Bharathi et al. [9]. GLFS and VR are executed on our clusters (with Xen virtual machines), whereas simulations were used for the synthetic applications. We used GridSim [43] as the grid environment simulator and its extension, CloudSim [14], to simulate a virtualized environment. Table I shows the relative resource usage of the two real workflows and the three synthetic workflows, with respect to CPU, memory, disk I/O, and network I/O.

Since we did not have physical access to these clusters to attach a power meter, we used power modeling to estimate power consumption. In particular, we apply a full-system power model based on high-level system utilization metrics [20]. We validated this power model against actual measurements on the PC in our research lab and found it to be very accurate. Furthermore, as we report only normalized power comparisons for all of our experiments, limitations of the power model have a negligible impact on our results.

C. Performance Comparison

We now evaluate our proposed power-aware consolidation algorithm against the optimal approach and the case without consolidation, with respect to the two metrics, i.e., normalized total power consumption and execution time.

As previously stated, our KCCA model is trained offline using 4000 data tuples. Each tuple includes an input vector X with 52 features and an output vector Y with 16 features for GLFS, or 24 features for VR. To determine the appropriate number of dimensions onto which we should project our data, we consider the eigenvalues associated with each canonical direction. In the first eight canonical directions, we observe a significant drop in the associated eigenvalues, leading us to believe that a projection onto these directions is sufficient to represent the data well. After identifying the directions of maximum correlation, the kernelized representations of the data's X and Y features, denoted K_X and K_Y, are projected onto these directions, so that the correlation between K_X and K_Y can be easily learned.

Normalized Total Power Consumption Comparison: We now evaluate our power-aware consolidation algorithm and demonstrate that the total power is reduced significantly compared to the case where no consolidation is applied. Moreover, we also show that the power reductions are close to, or in some cases even better than, the optimal (but static) approach, as time-varying resource provisioning is enabled through our approach.

First, we executed the GLFS workflow with four different combinations of application parameters, referred to as AppConf-1, AppConf-2, AppConf-3, and AppConf-4. For each parameter configuration, the workflow is invoked 5 times and the average is reported as the result. The normalized total power consumption, i.e., the ratio of the consumed total power to that of the Without Consolidation version, is shown in Figure 7 (a). We can make the following observations from these results. First, the total power is reduced by up to 27% by the Optimal + Work Conserving version. The CPU-bound characteristics of GLFS limit the consolidation that can be performed without significant performance degradation. In comparison, our pSciMapper + Static Allocation performs only 4% worse. Although the number of servers onto which the GLFS workflow is consolidated is the same as that obtained from the optimal approach, the tasks merged into a cluster might not always be optimal, due to our distance metric definition.

Next, we consider pSciMapper + Dynamic Provisioning, where the time-varying resource provisioning heuristic is enabled. We are able to save up to 35% of the total consumed power, which is 8% better than the optimal (static) approach. This is consistent with our power analysis, i.e., dynamically allocating CPU and memory to the consolidated tasks at runtime using the Non-Work-Conserving mode in Xen can help reduce power, compared to the work-conserving mode, particularly when the variance in resource requirements is large.

We repeated this experiment with the VR workflow and the three synthetic scientific workflows; the results are presented in Figures 7 (b) and (c), respectively. We executed VR with four different parameter configurations. As we can see, we are able to reduce the total power consumption by 58% in all cases by applying the optimal consolidation approach. pSciMapper achieves 52% savings, and when time-varying resource provisioning is enabled, pSciMapper performs only 2% worse than the optimal approach. Similar observations can be made for the three synthetic workflows. In the optimal case, more than 70% of the power is saved for SynApp1, due to its large number of tasks with low CPU and memory usage. As SynApp1's tasks have relatively constant resource requirements, dynamic resource provisioning does not help improve performance; pSciMapper, however, is still able to save 66% of the power. The other two synthetic workflows perform better with pSciMapper + Dynamic Provisioning when compared to the optimal approach, and we are able to reduce the total power consumption by 59% and 44%, respectively.

Execution Time Comparison: Next, we compare the performance of pSciMapper with the other versions with respect to workflow execution time. We first conducted the experiments using the GLFS workflow. Recall that we set the performance degradation constraint to 15%, i.e., consolidation stops while the estimated performance is still within 15% of the Without Consolidation case. As illustrated in Figure 8 (a), the execution time for the Optimal + Work Conserving case is within 15% of the case where each task of the workflow is mapped onto a single server. pSciMapper + Static Allocation also leads to performance close to the optimal case. This demonstrates the effectiveness of the temporal feature extraction and the KCCA model in mapping resource usage to power and execution time. With time-varying resource provisioning, pSciMapper leads to 3% better performance than the optimal case, and thus the performance is within 12% of the Without Consolidation case. The experiment was repeated with the VR workflow and the other three synthetic workflows; the results are shown in Figures 8 (b) and (c). Similar observations can be made here. With time-varying resource allocation, pSciMapper results in only a 7% degradation.

D. Consolidation Overhead and Scalability

We now evaluate the execution time of our consolidation algorithm and compare it to the optimal approach. We used the GLFS and VR workflows, which have 16 and 24 tasks, respectively. The results are shown in Figure 9 (a). As shown in the figure, our algorithm is able to consolidate 16 and 24 tasks onto 2 emulated grid sites, each with 64 nodes, with an algorithm execution time of 1.5 and 2.2 seconds, respectively. The overhead comes from the Nelder-Mead optimization algorithm applied at each level of the hierarchical cluster structure. In comparison, the optimal algorithm involves an exhaustive search, and completes the consolidation in 10.4 and 26.5 seconds. The overhead of our algorithm is clearly very small compared to the actual execution time of any real scientific workflow. It may appear from these results that even the optimal algorithm is an acceptable choice for scheduling real workflows. However, the execution time of the optimal search grows exponentially with the number of tasks in the workflow and/or the number of servers. Therefore, its overhead would be unacceptable for consolidating large-scale scientific workflows, where hundreds or even thousands of tasks can be involved.


Fig. 7. Normalized Total Power Consumption Comparison: (a) GLFS (b) VR (c) Synthetic Workflows (Power Consumption of Without Consolidation is 100%)

Fig. 8. Execution Time Comparison: (a) GLFS (b) VR (c) Synthetic Workflows

Fig. 9. (a) Consolidation Overhead Comparison of pSciMapper with Optimal (b) Scalability

Furthermore, we evaluate the scalability of pSciMapper in Figure 9 (b). We used synthetic workflows and simulated 640 processing nodes in a grid computing environment. The number of tasks in the synthetic workflow is varied from 100 to 200, 400, 800, and 1000. We can see that the consolidation overhead increases only linearly with the number of workflow tasks, and it takes less than 1 minute to consolidate 1000 workflow tasks on 640 nodes.

E. Model Evaluation

In this subsection, we evaluate the HMM-based model and show that it can accurately estimate the CPU, memory, disk, and network usage, given the application parameters and the resource capacity of the host server. Furthermore, we show that the model can be trained on one server and then used effectively on a server with a different type of hardware. This is an important requirement in cloud computing, as a service provider may use different types of hardware, with a user not having any control over what is made available to them.

We first consider a task from the GLFS workflow. To train the HMM, we collected 80,000 data samples. Each sample includes two sets of vectors: one representing the current values of the application parameters and the resource availability on the server (observation sequence), and the other the corresponding CPU, memory, disk, and network utilization data of this task (state sequence). We ran the task multiple times, with data collected each second. 10-fold cross-validation was performed on the training data set, which takes about 40 minutes to complete. To evaluate the model, we used a new set of vectors, with application parameter values for this task and server resource capacities that had not been seen during training. Note that our model is trained on a server with an Intel Xeon quad-core CPU (2.33 GHz), and then evaluated on a different server with an AMD Opteron model 252 (2.6 GHz). The input to the trained HMM is a sequence of 400 observations; the results of the CPU and memory predictions for this task are presented in Figures 10 (a) and (b). As we can see, the average prediction error is within 4.6% in both cases, compared to the actual usage. We repeated this experiment with the VR workflow, and very similar observations can be made. Accurate resource usage prediction serves as an important step for our consolidation algorithm.

V. RELATED WORK

We now discuss the research efforts relevant to our work from the areas of scientific workflow scheduling and power management techniques.


Fig. 10. Resource Usage Measurement and Prediction Comparison: (a) CPU (b) Memory

Scientific Workflow Scheduling: Scientific workflow management and execution, particularly in heterogeneous and distributed environments like computational grids, has been a popular research topic in recent years. Workflow systems such as Pegasus [18], Kepler [31], ASKALON [52], Taverna [36], and Triana [47] are currently used in a number of scientific domains. As part of this research, several workflow scheduling algorithms have been developed, most of which focus on makespan minimization. Yan et al. proposed a workflow scheduler using data/network-aware execution planning and advance reservation. VGrADS (previously GrADS) [32], [21], [42] first applies a ranking mechanism to assign each workflow job a rank value, and then schedules the jobs based on their rank priorities.

None of these efforts, however, has considered consolidation of workflow tasks using virtualization technologies. In the future, our work can be combined with workflow scheduling algorithms to achieve integrated scheduling and consolidation.

As scientific workflows are represented as DAGs, DAG scheduling is also related to our work. Comparisons and evaluations of different DAG scheduling algorithms can be found in survey papers [15], [29]. Topcuoglu et al. proposed the Heterogeneous Earliest-Finish-Time (HEFT) algorithm, which minimizes the earliest finish time of priority-ordered tasks with an insertion-based approach [24]. As an improvement to the HEFT heuristic, Bittencourt et al. proposed looking ahead in the schedule and estimating a task's placement by taking into account the impact on its children [11]. Again, our work is distinct in consolidating tasks in virtualized environments and in reducing total power requirements.
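As a point of reference, the core of HEFT [24] can be sketched as follows: each task receives an upward rank (its average execution cost plus the largest ranked path through its successors), and tasks are then mapped, in decreasing rank order, to the processor giving the earliest finish time. The sketch below simplifies the insertion-based policy to plain processor-ready-time EFT and ignores communication costs; all task names and cost values are hypothetical.

```python
def upward_rank(task, succ, avg_cost, avg_comm, memo=None):
    """rank_u(t) = avg cost of t + max over children of (comm + rank_u(child))."""
    if memo is None:
        memo = {}
    if task in memo:
        return memo[task]
    children = succ.get(task, [])
    tail = max((avg_comm.get((task, c), 0) +
                upward_rank(c, succ, avg_cost, avg_comm, memo)
                for c in children), default=0)
    memo[task] = avg_cost[task] + tail
    return memo[task]

def heft(tasks, succ, cost):
    """cost[task][proc] = execution time; returns {task: (proc, start, finish)}."""
    avg_cost = {t: sum(cost[t]) / len(cost[t]) for t in tasks}
    # schedule in decreasing upward-rank order (zero communication costs here)
    order = sorted(tasks, key=lambda t: -upward_rank(t, succ, avg_cost, {}))
    pred = {t: [p for p in tasks if t in succ.get(p, [])] for t in tasks}
    n_proc = len(cost[tasks[0]])
    proc_ready = [0.0] * n_proc
    sched = {}
    for t in order:
        # earliest start: all predecessors finished and the processor is free
        ready = max((sched[p][2] for p in pred[t]), default=0.0)
        best = min(range(n_proc),
                   key=lambda p: max(ready, proc_ready[p]) + cost[t][p])
        start = max(ready, proc_ready[best])
        finish = start + cost[t][best]
        proc_ready[best] = finish
        sched[t] = (best, start, finish)
    return sched
```

For a small fork of three tasks on two processors, the heuristic places the high-rank entry task on its fastest processor and overlaps the two children on separate processors.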

With the emergence of clouds, efforts have been initiated on executing scientific workflows in cloud environments, and on evaluating the tradeoff between application performance and resource costs [17], [26], [23]. Our work focuses on virtualized environments where either the users can control resource allocation for each VM, or a management layer knows the periodic resource requirements of a workflow. We expect that commercial cloud environments will have these characteristics in the near future.

Power Management: There have been two common mechanisms for power management: Dynamic Speed Scaling (DSS) and Dynamic Resource Sleeping (DRS). Dynamic Voltage and Frequency Scaling (DVFS), a typical example of DSS, is an effective technique for reducing processor power dissipation by reducing the supply voltage and/or slowing the clock frequency [22], [51], [40], [38]. Wang et al. presented a cluster-level power controller based on a multi-input-multi-output system model and predictive control theory, with the goal of shifting power among servers based on the performance needs of the applications [51]. DRS dynamically hibernates tasks to save energy and then wakes them up on demand [45], [39], [16].
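As a toy illustration of the DVFS principle (not any of the cited systems): dynamic power scales roughly with V^2 f, and when voltage is scaled proportionally to frequency, energy per cycle falls roughly as f^2, so a simple policy selects the lowest available frequency that still meets a task's deadline. All function names and numbers below are hypothetical.

```python
def pick_frequency(cycles, deadline, freqs):
    """Lowest available frequency (Hz) that finishes `cycles` within `deadline` (s).

    Under V proportional to f, energy per cycle ~ f^2, so lower is better."""
    feasible = [f for f in freqs if cycles / f <= deadline]
    if not feasible:
        raise ValueError("deadline cannot be met even at the highest frequency")
    return min(feasible)

def relative_energy(f, f_max):
    """Dynamic energy for a fixed cycle count, relative to running at f_max."""
    return (f / f_max) ** 2
```

For example, a 1.5-gigacycle task with a 1-second deadline runs at 2 GHz rather than 3 GHz, cutting dynamic energy to roughly (2/3)^2 of the peak-frequency cost under this simple model.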

Recently, power management for virtualized environments has received much attention [34], [48], [28], [46], [33]. Laszewski et al. propose a power-aware scheduling algorithm to schedule virtual machines in a DVFS-enabled cluster [30]. More closely related to our work are the efforts on power minimization using consolidation [49], [44]. Verma et al. propose two consolidation algorithms based on a detailed analysis of the workloads executed in data centers [49]. In particular, they observed that the 90th percentile of a workload's CPU usage should be used for consolidation, as it is less than half of the peak value. Furthermore, they also considered the correlation between workloads, so that SLA violations could be avoided when such workloads are placed together. Our focus has been on scientific workflows, which have more complex structures and resource usage patterns. Besides CPU, we have taken memory, disk, and network usage into account. Furthermore, we analyze the performance interference resulting from co-locating multiple workloads on the same server, so that the consolidated workflows can achieve high performance with optimal power. Srikantaiah et al. proposed a multi-dimensional bin packing algorithm for energy-aware consolidation [44]. Our method, based on KCCA, is a more powerful model. Overall, we perform global optimization, and can also consider time-varying provisioning of resources to meet the requirements of consolidated workflow tasks.
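The bin-packing view of consolidation [44] can be illustrated, in simplified form rather than as the authors' actual algorithm, by a first-fit-decreasing heuristic over multi-dimensional resource demand vectors (e.g., CPU and memory fractions), with hypothetical capacity and demand values:

```python
def consolidate(workloads, capacity):
    """First-fit decreasing over multi-dimensional demand vectors.

    workloads: {name: demand tuple}; capacity: per-server limit tuple.
    Returns a list of servers, each a list of co-located workload names."""
    def fits(used, demand):
        return all(u + d <= c for u, d, c in zip(used, demand, capacity))

    servers = []  # list of (used_vector, [workload names])
    # place the most demanding workloads first (sorted by total demand)
    for name, demand in sorted(workloads.items(), key=lambda kv: -sum(kv[1])):
        for used, names in servers:
            if fits(used, demand):
                for i, d in enumerate(demand):
                    used[i] += d
                names.append(name)
                break
        else:
            servers.append(([*demand], [name]))  # open a new server
    return [names for _, names in servers]
```

With normalized per-server capacity (1.0, 1.0), three workloads demanding (0.6, 0.2), (0.5, 0.5), and (0.3, 0.3) pack onto two servers instead of three, which is the power-saving opportunity that consolidation exploits.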

VI. CONCLUSION

Our work has been driven by two popular trends. First, scientific workflows have emerged as a key class of applications for enabling and accelerating scientific discoveries. Second, in executing any class of (high-end) applications, power consumption is becoming an increasingly important consideration.

In this paper, we consider the power management problem of executing scientific workflows in virtualized environments. We have developed pSciMapper, a power-aware consolidation algorithm based on hierarchical clustering and an optimization search method. We have evaluated pSciMapper using 2 real and 3 synthetic scientific workflows. The experimental results demonstrate that we are able to reduce the total power consumption by up to 56%, with at most a 15% slowdown for the workflows. Our consolidation algorithm also incurs low overheads, and is thus suitable for large-scale workflows.

VII. ACKNOWLEDGEMENTS

This work was supported by NSF awards OCI-0619041, CCF-0833101 and IIS-0916196 to the Ohio State University.

REFERENCES

[1] Montage: An astronomical image engine. http://montage.ipac.caltech.edu/.
[2] Southern California Earthquake Center, Community Modeling Environment (CME). http://www.scec.org/cme.
[3] Sysstat. http://pagesperso-orange.fr/sebastien.godard/.
[4] TeraGrid. http://www.teragrid.org/.
[5] USC Epigenome Center. http://epigenome.usc.edu.
[6] Xen credit-based CPU scheduler. http://wiki.xensource.com/xenwiki/CreditScheduler.
[7] F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3, 2003.
[8] L. A. Barroso and U. Holzle. The case for energy-proportional computing. IEEE Computer, 40(12), 2007.
[9] S. Bharathi, A. Chervenak, E. Deelman, G. Mehta, M. H. Su, and K. Vahi. Characterization of scientific workflows. In Proceedings of the 3rd Workshop on Workflows in Support of Large-Scale Science (WORKS08), pages 1-10, Nov. 2008.
[10] R. Bianchini and R. Rajamony. Power and energy management for server systems. IEEE Computer, 37(11), 2004.
[11] L. F. Bittencourt, R. Sakellariou, and E. Madeira. DAG scheduling using a lookahead variant of the heterogeneous earliest finish time algorithm. In Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing (PDP 2010), Feb. 2010.
[12] D. A. Brown, P. R. Brady, A. Dietz, J. Cao, B. Johnson, and J. McNabb. A case study on the use of workflow technologies for scientific analysis: Gravitational wave data analysis. Workflows for eScience, 2006.
[13] E. S. Buneci and D. A. Reed. Analysis of application heartbeats: Learning structural and temporal features in time series data for identification of performance problems. In Proceedings of the 22nd International Conference on High Performance Computing and Networking (SC08), pages 1-12, Nov. 2008.
[14] R. Buyya, R. Ranjan, and R. N. Calheiros. Modeling and simulation of scalable cloud computing environments and the CloudSim toolkit: Challenges and opportunities. In Proceedings of the 7th International Conference on High Performance Computing & Simulation (HPCS09), June 2009.
[15] L. C. Canon, E. Jeannot, R. Sakellariou, and W. Zheng. Comparative evaluation of the robustness of DAG scheduling heuristics. CoreGRID Technical Report TR-0120, 2007.
[16] J. Chase, D. Anderson, P. Thakar, A. Vahdat, and R. Doyle. Managing energy and server resources in hosting centers. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP01), pages 103-116, Oct. 2001.
[17] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good. The cost of doing science on the cloud: The Montage example. In Proceedings of the 22nd International Conference on High Performance Computing and Networking (SC08), pages 50-61, Nov. 2008.
[18] E. Deelman, G. Singh, M. H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman, J. Good, A. Laity, J. C. Jacob, and D. S. Katz. Pegasus: A framework for mapping complex scientific workflows onto distributed systems. Scientific Programming, 13(3), 2005.
[19] R. A. Drebin, L. Carpenter, and P. Hanrahan. Volume rendering. In Proceedings of the 15th Annual Conference on Computer Graphics and Interactive Techniques, pages 65-74, 1988.
[20] D. Economou, S. Rivoire, C. Kozyrakis, and P. Ranganathan. Full-system power analysis and modeling for server environments. In Proceedings of the Workshop on Modeling, Benchmarking and Simulation (MOBS06), June 2006.
[21] F. Berman, H. Casanova, A. Chien, K. Cooper, H. Dail, A. Dasgupta, W. Deng, J. Dongarra, L. Johnsson, K. Kennedy, C. Koelbel, B. Liu, X. Liu, A. Mandal, G. Marin, M. Mazina, J. Mellor-Crummey, C. Mendes, A. Olugbile, M. Patel, D. Reed, Z. Shi, O. Sievert, H. Xia, and A. YarKhan. New grid scheduling and rescheduling methods in the GrADS project. International Journal of Parallel Programming, 33(2), 2005.
[22] S. Govindan, J. Choi, B. Urgaonkar, A. Sivasubramaniam, and A. Baldini. Statistical profiling-based techniques for effective power provisioning in data centers. In Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys09), pages 317-330, April 2009.
[23] C. Hoffa, G. Mehta, T. Freeman, E. Deelman, K. Keahey, B. Berriman, and J. Good. On the use of cloud computing for scientific workflows. In Proceedings of the 3rd International Workshop on Scientific Workflows and Business Workflow Standards in e-Science (SWBES08), Dec. 2008.
[24] H. Topcuoglu, S. Hariri, and M. Y. Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13, 2002.
[25] S. C. Johnson. Hierarchical clustering schemes. Psychometrika, 32(3), 1966.
[26] G. Juve, E. Deelman, K. Vahi, G. Mehta, B. Berriman, B. P. Berman, and P. Maechling. Scientific workflow applications on Amazon EC2. In Workshop on Cloud-based Services and Applications, in conjunction with the 5th IEEE International Conference on e-Science (e-Science09), Dec. 2009.
[27] L. H. Koopmans. Sampling, aliasing and discrete-time models. The Spectral Analysis of Time Series, 1995.
[28] D. Kusic, J. O. Kephart, J. E. Hanson, N. Kandasamy, and G. F. Jiang. Power and performance management of virtualized computing environments via lookahead control. In Proceedings of the 5th International Conference on Autonomic Computing (ICAC08), pages 3-12, June 2008.
[29] Y. K. Kwok and I. Ahmad. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys, 31(4), 1999.
[30] G. V. Laszewski, L. Z. Wang, A. Younge, and X. He. Power-aware scheduling of virtual machines in DVFS-enabled clusters. In Proceedings of the IEEE International Conference on Cluster Computing (Cluster09), pages 1-10, 2009.
[31] B. Ludascher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. A. Lee, J. Tao, and Y. Zhao. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice & Experience, 18(10), 2006.
[32] A. Mandal, K. Kennedy, C. Koelbel, G. Marin, J. Mellor-Crummey, B. Liu, and L. Johnsson. Scheduling strategies for mapping application workflows onto the grid. In Proceedings of the 14th IEEE International Symposium on High Performance Distributed Computing (HPDC05), pages 125-134, June 2005.
[33] R. Nathuji and K. Schwan. VirtualPower: Coordinated power management in virtualized enterprise systems. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP07), pages 265-278, Oct. 2007.
[34] R. Nathuji and K. Schwan. VPM tokens: Virtual machine-aware power budgeting in datacenters. In Proceedings of the 17th IEEE International Symposium on High Performance Distributed Computing (HPDC08), pages 119-128, June 2008.
[35] J. A. Nelder and R. Mead. A simplex method for function minimization. Computer Journal, 1965.
[36] T. Oinn, M. Greenwood, M. Addis, M. N. Alpdemir, J. Ferris, K. Glover, C. Goble, A. Goderis, D. Hull, D. Marvin, P. Li, P. Lord, M. R. Pocock, M. Senger, R. Stevens, A. Wipat, and C. Wroe. Taverna: Lessons in creating a workflow environment for the life sciences. Concurrency and Computation: Practice and Experience, 18, 2006.
[37] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 3, 1962.
[38] P. Pillai and K. G. Shin. Real-time dynamic voltage scaling for low-power embedded operating systems. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP01), pages 89-102, Oct. 2001.
[39] E. Pinheiro, R. Bianchini, and C. Dubnicki. Exploiting redundancy to conserve energy in storage systems. ACM SIGMETRICS Performance Evaluation Review, 34(1), 2006.
[40] C. Poellabauer, T. Zhang, S. Pande, and K. Schwan. An efficient frequency scaling approach for energy-aware embedded real-time systems. In Proceedings of the 18th International Conference on Architecture of Computing Systems (ARCS05), pages 207-221, March 2005.
[41] L. R. Rabiner. A tutorial on hidden Markov models and selected applications. Proceedings of the IEEE, 77(2), 1989.
[42] L. Ramakrishnan, C. Koelbel, Y. S. Kee, R. Wolski, D. Nurmi, D. Gannon, G. Obertelli, A. YarKhan, A. Mandal, T. M. Huang, K. Thyagaraja, and D. Zagorodnov. VGrADS: Enabling e-science workflows on grids and clouds with fault tolerance. In Proceedings of the 23rd International Conference on High Performance Computing and Networking (SC09), pages 1-12, Nov. 2009.
[43] R. Buyya and M. Murshed. GridSim: A toolkit for the modeling and simulation of distributed resource management and scheduling for grid computing. Concurrency and Computation: Practice and Experience, 14, 2002.
[44] S. Srikantaiah, A. Kansal, and F. Zhao. Energy aware consolidation for cloud computing. In Workshop on Power Aware Computing and Systems (HotPower08), Dec. 2008.
[45] M. Steinder, I. Whalley, J. E. Hanson, and J. O. Kephart. Coordinated management of power usage and runtime performance. In Proceedings of the IEEE/IFIP Network Operations and Management Symposium (NOMS08), pages 387-394, April 2008.
[46] J. Stoess, C. Lang, and F. Bellosa. Energy management for hypervisor-based virtual machines. In Proceedings of the 2007 USENIX Annual Technical Conference (USENIX07), pages 1-14, June 2007.
[47] I. Taylor, M. Shields, I. Wang, and A. Harrison. The Triana workflow environment: Architecture and applications. Workflows for e-Science, 2007.
[48] A. Verma, P. Ahuja, and A. Neogi. pMapper: Power and migration cost aware application placement in virtualized systems. In Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware (Middleware08), pages 243-264, Sept. 2008.
[49] A. Verma, G. Dasgupta, T. K. Nayak, P. De, and R. Kothari. Server workload analysis for power minimization using consolidation. In Proceedings of the 2009 USENIX Annual Technical Conference (USENIX09), June 2009.
[50] F. Wang, Q. Xin, B. Hong, S. Brandt, E. Miller, D. Long, and T. McLarty. File system workload analysis for large scale scientific computing applications. In Proceedings of the 21st IEEE / 12th NASA Goddard Conference on Mass Storage Systems and Technologies, 2004.
[51] X. R. Wang and M. Chen. Cluster-level feedback power control for performance optimization. In Proceedings of the 14th IEEE International Symposium on High-Performance Computer Architecture (HPCA08), pages 101-110, Feb. 2008.
[52] M. Wieczorek, R. Prodan, and T. Fahringer. Scheduling of scientific workflows in the ASKALON grid environment. ACM SIGMOD Record, 34(3), 2005.
[53] Q. Zhu, J. Zhu, and G. Agrawal. Power-aware consolidation of scientific workflows in virtualized environments. Technical Report OSU-CISRC-6/10-TR14, The Ohio State University, 2010.