
Cost-Effective Resource Provisioning for MapReduce in a Cloud

Balaji Palanisamy, Member, IEEE, Aameek Singh, Member, IEEE, and Ling Liu, Senior Member, IEEE

Abstract—This paper presents a new MapReduce cloud service model, Cura, for provisioning cost-effective MapReduce services in a cloud. In contrast to existing MapReduce cloud services such as a generic compute cloud or a dedicated MapReduce cloud, Cura has a number of unique benefits. First, Cura is designed to provide a cost-effective solution to efficiently handle MapReduce production workloads that have a significant amount of interactive jobs. Second, unlike existing services that require customers to decide the resources to be used for the jobs, Cura leverages MapReduce profiling to automatically create the best cluster configuration for the jobs. While the existing models allow only a per-job resource optimization for the jobs, Cura implements a globally efficient resource allocation scheme that significantly reduces the resource usage cost in the cloud. Third, Cura leverages unique optimization opportunities when dealing with workloads that can withstand some slack. By effectively multiplexing the available cloud resources among the jobs based on the job requirements, Cura achieves significantly lower resource usage costs for the jobs. Cura's core resource management schemes include cost-aware resource provisioning, VM-aware scheduling and online virtual machine reconfiguration. Our experimental results using Facebook-like workload traces show that our techniques lead to more than 80 percent reduction in the cloud compute infrastructure cost with up to 65 percent reduction in job response times.

Index Terms—MapReduce, cloud computing, cost-effectiveness, scheduling


1 INTRODUCTION

CLOUD computing and its pay-as-you-go cost structure have enabled hardware infrastructure service providers, platform service providers as well as software and application service providers to offer computing services on demand, paid per use just as we use utilities today. This growing trend in cloud computing, combined with the demands for Big Data and Big Data analytics, is driving the rapid evolution of datacenter technologies towards more cost-effective, consumer-driven and technology-agnostic solutions. The most popular approach to such big data analytics is MapReduce [1] and its open-source implementation, Hadoop [15]. Offered in the cloud, a MapReduce service allows enterprises to analyze their data without dealing with the complexity of building and managing large installations of MapReduce platforms. Using virtual machines (VMs) and storage hosted by the cloud, enterprises can simply create virtual MapReduce clusters to analyze their data.

In this paper, we discuss the cost-inefficiencies of the existing cloud services for MapReduce and propose a cost-effective resource management framework called Cura that aims at a globally optimized resource allocation to minimize the infrastructure cost in the cloud datacenter. We note that the existing cloud solutions for MapReduce work primarily on a per-job or per-customer optimization approach, where the optimization and resource sharing opportunities are restricted to a single job or a single customer. For instance, in existing dedicated MapReduce cloud services such as Amazon Elastic MapReduce [13], customers buy on-demand clusters of VMs for each job or workflow; once the MapReduce job (or workflow) is submitted, the cloud provider creates VMs that execute that job, and after job completion the VMs are deprovisioned. Here the resource optimization opportunity is restricted to the per-job level. Alternately, one can lease dedicated cluster resources from a generic cloud service like Amazon Elastic Compute Cloud [14] and operate MapReduce on them as if using a private MapReduce infrastructure. While this approach enables resource optimization at the per-customer level, we argue that in such an approach the size of the leased dedicated clusters needs to be chosen based on the peak workload requirements of each customer; hence, the leased clusters are under-utilized for a large fraction of the time, leading to higher costs. Cura, on the other hand, is designed to provide a cost-effective solution to a wide range of MapReduce workloads with the following goals in mind:

First, we observe that existing solutions are not cost-effective for interactive MapReduce workloads that consist of a significant fraction of short-running jobs with low latency requirements. A recent study of the Facebook and Yahoo production workload traces [29], [8], [34] reveals that more than 95 percent of their production MapReduce jobs are short-running jobs with an average running time of 30 sec. Unlike existing per-job services that require VMs to be created afresh for each submitted job, Cura deals with such interactive workloads using a secure instant VM allocation scheme that minimizes the job

• B. Palanisamy is with the School of Information Sciences, University of Pittsburgh, Pittsburgh, PA. E-mail: [email protected].

• A. Singh is with Storage Systems, IBM Almaden Research Center, 650 Harry Road, San Jose, CA. E-mail: [email protected].

• L. Liu is with the College of Computing, Georgia Institute of Technology, Atlanta, GA. E-mail: [email protected].

Manuscript received 1 July 2013; revised 2 Apr. 2014; accepted 3 Apr. 2014. Date of publication 24 Apr. 2014; date of current version 8 Apr. 2015. Recommended for acceptance by J. Wang. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TPDS.2014.2320498

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 5, MAY 2015 1265

1045-9219 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

latency. In addition, Cura results in higher cost-effectiveness compared to an owned cluster in a generic compute cloud, which has high costs due to low utilization.

Second, as discussed earlier, existing cloud solutions are largely optimized based on per-job and per-customer optimization, which leads to poor resource utilization and higher cost. Additionally, their usage model requires users to figure out complex job configuration parameters (e.g., type of VMs, number of VMs, and MapReduce configuration such as number of mappers per VM) that have an impact on the performance and thus the cost of the job. In contrast to existing cloud systems, Cura leverages MapReduce profiling to automatically create the best cluster configuration for the jobs and tries to optimize the resource allocation in a globally cost-effective fashion. Hence, Cura requires much less cloud resource than the current models consume.

Finally, Cura's usage model and techniques achieve higher service differentiation than existing models, as Cura incorporates an intelligent multiplexing of the shared cloud resources among the jobs based on job requirements. MapReduce workloads often have a large number of jobs that do not require immediate execution, but rather feed into a scheduled flow, e.g., a MapReduce job analyzing system logs for a daily/weekly status report. By leveraging such opportunities and by accurately understanding each job's performance requirements, Cura multiplexes the resources for significant cost savings.

To the best of our knowledge, Cura is the first effort aimed at developing a novel usage model and resource management techniques for achieving global resource optimization in the cloud for MapReduce services. Cura uses a secure instant VM allocation scheme that helps reduce the response time for short jobs by up to 65 percent. By leveraging MapReduce profiling, Cura tries to optimize the global resource allocation in the cloud by automatically creating the best cluster configuration for the jobs. Cura's core resource management techniques include cost-aware resource provisioning, intelligent VM-aware scheduling and online VM reconfiguration. Overall, in addition to the response time savings, Cura results in more than 80 percent savings in the cloud infrastructure cost. The rest of the paper is organized as follows. In Section 2, we present Cura's cloud service model and system architecture. Section 3 discusses the technical challenges of Cura and presents Cura's resource management techniques, namely VM-aware scheduling and online VM reconfiguration, including a formal discussion of Cura's scheduling and reconfiguration problems. We present the experimental evaluation of Cura in Section 4. We discuss related work in Section 5 and conclude in Section 6.

2 CURA: MODEL AND ARCHITECTURE

In this section, we present the cloud service model and system architecture for Cura.

2.1 Cloud Operational Model

Table 1 shows possible cloud service models for providing MapReduce as a cloud service. The first operational model (immediate execution) is a completely customer-managed model where each job and its resources are specified by the customer on a per-job basis, and the cloud provider only ensures that the requested resources are provisioned upon job arrival. Many existing cloud services such as Amazon Elastic Compute Cloud [14] and Amazon Elastic MapReduce [13] use this model. This model has the lowest rewards since there is a lack of global optimization across jobs, as well as the other drawbacks discussed earlier. The second possible model (delayed start) [44] is a partly customer-managed and partly cloud-managed model where customers specify which resources to use for their jobs, and the cloud provider has the flexibility to schedule the jobs as long as they begin execution within a specified deadline. Here, the cloud provider takes slightly greater risk to make sure that all jobs begin execution within their deadlines and, as a reward, can potentially do better multiplexing of its resources. However, specifically with MapReduce, this model still provides low cost benefits since jobs are being optimized on a per-job basis by disparate users. In fact, customers in this model always tend to greedily choose low-cost small cluster configurations involving fewer VMs that would require the job to begin execution almost immediately. For example, consider a job that takes 180 minutes to complete on a cluster of two small instances but 20 minutes on a cluster of six large instances.1 Here, if the job's deadline allows 180 minutes or more, the per-job optimization by the customer will tend to choose the cluster of two small instances, as it has lower resource usage cost than the six-large-instance cluster. This cluster configuration, however, expects the job to be started immediately and provides no opportunity for delayed start. This observation leads us to the next model. The third model—which is the subject of this paper—is a completely cloud-managed model where the customers only submit jobs and specify job completion deadlines.

Here, the cloud provider takes greater risk and performs a globally optimized resource management to meet the job SLAs for the customers. Typically, the additional risks here include the responsibilities of meeting additional SLA requirements such as executing each job within its deadline and managing the allocation of resources for each job. While the conventional customer-optimized cloud model requires only VMs to be provisioned based on demand, a completely cloud-managed model places an additional resource management role on the cloud provider. For instance, an inefficient allocation of resources to a particular job can result in higher cost for the cloud provider. Therefore, this model brings higher risk to the cloud provider, while it has high potential cost benefits. A similar high-risk, high-reward model is the database-as-a-service model [10], [11], [12], where the cloud provider estimates the execution time of the customer queries and

TABLE 1
Cloud Operational Models

Model                 Optimization   Provider risk   Potential benefits
Immediate execution   Per-job        Limited         Low
Delayed start         Per-job        Moderate        Low – Moderate
Cloud managed         Global         High            High

1. Example adapted from the measurements in the Herodotou et al. paper [25].


performs resource provisioning and scheduling to ensure that the queries meet their response time requirements. As MapReduce also lends itself well to prediction of execution time [4], [5], [24], [27], [28], we have designed Cura on a similar model. Another recent example of this model is the Batch query model in Google's BigQuery cloud service [34], where the cloud provider manages the resources required for the SQL-like queries so as to provide a service level agreement of executing the query within 3 hours.

2.2 System Model: User Interaction

Cura's system model significantly simplifies the way users deal with the cloud service. With Cura, users simply submit their jobs (or composite job workflows) and specify the required service quality in terms of response time requirements. After that, the cloud provider has complete control over the type and schedule of resources to be devoted to that job. From the user perspective, the deadline will typically be driven by their quality of service requirements for the job. As MapReduce jobs perform repeated analytics tasks, deadlines could simply be chosen based on those tasks (e.g., 8 AM for a daily log analysis job). For ad-hoc jobs that are not run on a set schedule, the cloud provider can try to incentivize longer deadlines by offering lower costs to users willing to wait for their jobs to be executed.2 However, this model does not preclude an immediate execution mode, in which case the job is scheduled to be executed at the time of submission, similar to existing MapReduce cloud service models.

Once a job is submitted to Cura, it may take one of two paths (Fig. 1). If a job is submitted for the very first time, Cura processes it to be profiled prior to execution as part of its profile and analyze service. This develops a performance model for the job in order to be able to generate predictions for its performance later on. When making scheduling decisions, performance predictions in terms of execution time are made based on the input dataset size, VM types, cluster sizes and job parameters. This model is used by Cura in optimizing the global resource allocation. MapReduce profiling has been an active area of research [4], [5], [24], and open-source tools such as Starfish [24] are available to create such profiles. Recent work has leveraged MapReduce profiling for cloud resource management and showed that such profiles can be obtained with very high accuracy, with less than 12 percent error rate for the predicted running time [27]. Here we would like to note that though Cura's architecture and techniques may apply to other HPC scheduling problems beyond just MapReduce, we have considered dedicated MapReduce clouds as the target scenario of Cura due to the wide popularity of MapReduce and the availability of several open-source profile and analyze tools for MapReduce.

2.3 System Architecture

The profile and analyze service is used only once, when a customer's job first goes from development-and-testing into production in its software life cycle. For subsequent instances of the production job, Cura directly sends the job for scheduling. Since production jobs, including interactive or long running jobs, typically do not change frequently (only their input data may differ for each instance of their execution), profiling will most often be a one-time cost. Further, from an architectural standpoint, Cura users may even choose to skip profiling and instead provide VM type, cluster size and job parameters to the cloud service, similar to existing dedicated MapReduce cloud service models like [13]. Jobs that skip the one-time profile and analyze step will still benefit from the response time optimizations in Cura described below; however, they will fail to leverage the benefits provided by Cura's global resource optimization strategies. Jobs that are already profiled are directly submitted to the Cura resource management system.

Cura's resource management system is composed of the following components:

2.3.1 Secure Instant VM Allocation

In contrast to existing MapReduce services that create VMs on demand, Cura employs a secure instant VM allocation scheme that reduces response times for jobs, which is especially significant for short running jobs. Upon completion of a job's execution, Cura destroys only the Hadoop instance used by the job (including all local data) but retains the VM to be used for other jobs that need the same VM configuration. For the new job, only a quick Hadoop initialization phase is required, which avoids having to recreate and boot up VMs.3 Operationally, Cura creates pools of VMs of different instance types, as shown in Fig. 2, and dynamically creates Hadoop clusters

Fig. 1. Cura: System architecture.
Fig. 2. VM pool.

2. Design of a complete costing mechanism is beyond the scope of this work.

3. Even with our secure instant VM allocation technique, data still needs to be loaded for each job into its HDFS, but this is very fast for small jobs as they each process a small amount of data, typically less than 200 MB in the Facebook and Yahoo workloads [30].


on them. By default, Cura runs the maximum number of pre-created VMs in the cloud (limited by the number of servers) so that all workload can be served instantly.
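The instant-allocation scheme described above can be sketched as follows. This is a minimal illustration, not Cura's published implementation; `VMPool`, `acquire_cluster`, and `release_cluster` are hypothetical names, and real Hadoop setup/teardown is reduced to comments.

```python
from collections import deque

class VMPool:
    """Sketch of Cura-style instant VM allocation: VMs are pre-created
    per instance type and reused across jobs; only the per-job Hadoop
    instance (and its local data) is destroyed between jobs."""

    def __init__(self, capacity_per_type):
        # e.g. {"small": 100, "large": 40}: pools of pre-created, idle VMs.
        self.idle = {t: deque(f"{t}-vm{i}" for i in range(n))
                     for t, n in capacity_per_type.items()}

    def acquire_cluster(self, instance_type, n):
        """Take n idle VMs of the given type and run a quick Hadoop
        initialization on them (no VM creation or boot-up).
        Returns None if the pool cannot satisfy the request."""
        pool = self.idle[instance_type]
        if len(pool) < n:
            return None
        vms = [pool.popleft() for _ in range(n)]
        # (Here: fast Hadoop init + HDFS data load for the job.)
        return {"vms": vms, "type": instance_type}

    def release_cluster(self, cluster):
        """Destroy the job's Hadoop instance and local data, but keep
        the VMs alive for the next job needing this configuration."""
        # (Here: tear down Hadoop/HDFS, wipe local data.)
        self.idle[cluster["type"]].extend(cluster["vms"])
```

A job that needs a two-VM small-instance cluster thus skips VM boot entirely: `acquire_cluster("small", 2)` hands back warm VMs, and `release_cluster` returns them to the pool after the Hadoop instance is destroyed.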

When time-sharing a VM across jobs, it is important to ensure that an untrusted MapReduce program is not able to gain control over the data or applications of other customers. Cura's security management is based on SELinux [20] and is similar to that of the Airavat system proposed in [19], which showed that enforcing SELinux access policies in a MapReduce cloud does not lead to performance overheads. While Airavat shares multiple customer jobs across the same HDFS, Cura runs only one Hadoop instance at a time, and the HDFS and MapReduce framework is used by only one customer before it is destroyed. Therefore, enforcing Cura's SELinux policies does not require modifications to the Hadoop framework and requires the creation of only two SELinux domains, one trusted and the other untrusted. The Hadoop framework, including the HDFS, runs in the trusted domain, and the untrusted customer programs run in the untrusted domain. While the trusted domain has regular access privileges, including access to the network for network communication, the untrusted domain has very limited permissions and has no access to any trusted files or other system resources. An alternate solution for Cura's secure instant VM allocation is to take VM snapshots upon VM creation; once a customer job finishes, the VM can revert to the old snapshot. This approach is also significantly faster than destroying and recreating VMs, but it can incur noticeable delays in starting a new job before the VM gets reverted to a secure earlier snapshot.

Overall, this ability of Cura to serve short jobs better is a key distinguishing feature. However, as discussed next, Cura has many other optimizations that benefit any type of job, including long running batch jobs.

2.3.2 Job Scheduler

The job scheduler at the cloud provider forms an integral component of the Cura system. Where existing MapReduce services simply provision customer-specified VMs to execute the job, Cura's VM-aware scheduler (Section 3.1) is faced with the challenge of scheduling jobs among available VM pools while minimizing global cloud resource usage. Therefore, carefully executing jobs in the best VM type and cluster size among the available VM pools becomes a crucial factor for performance. The scheduler has knowledge of the relative performance of the jobs across different cluster configurations from the predictions obtained from the profile and analyze service, and uses it to obtain global resource optimization.

2.3.3 VM Pool Manager

The third main component in Cura is the VM Pool Manager, which deals with the challenge of dynamically managing the VM pools to help the job scheduler effectively obtain efficient resource allocations. For instance, if more jobs in the current workload require small VM instances and the cloud infrastructure has fewer small instances, the scheduler will be forced to schedule them on other instance types, leading to higher resource usage cost. The VM pool manager understands the current workload characteristics of the jobs and is responsible for online reconfiguration of VMs to adapt to changes in workload patterns (Section 3.2). In addition, this component may perform further optimizations such as power management by suitably shutting down VMs at low load conditions.
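One simple way to picture this kind of demand-driven pool reconfiguration is to resize each instance-type pool in proportion to recently observed demand. This is only an illustrative sketch under assumed inputs (`demand_history` as a per-type request count, a fixed total VM budget); the paper's actual reconfiguration algorithm is described in Section 3.2.

```python
def target_pool_sizes(demand_history, total_vm_budget):
    """Sketch of demand-driven pool sizing: split the VM budget across
    instance types in proportion to recent demand, so the scheduler is
    less often forced onto mismatched (costlier) instance types."""
    total = sum(demand_history.values())
    if total == 0:
        # No demand signal yet: split the budget evenly across types.
        share = total_vm_budget // len(demand_history)
        return {t: share for t in demand_history}
    targets = {t: (d * total_vm_budget) // total
               for t, d in demand_history.items()}
    # Hand any integer-rounding remainder to the most-demanded type.
    remainder = total_vm_budget - sum(targets.values())
    busiest = max(demand_history, key=demand_history.get)
    targets[busiest] += remainder
    return targets
```

With 60 percent of recent jobs asking for small instances, the small pool would grow to roughly 60 percent of the budget, shrinking the others correspondingly.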

2.4 Deployment Challenges and Practical Use

In this section, we discuss some basic challenges in deploying a globally optimized resource management model like Cura in a commercial cloud setting. First of all, a global optimization model such as Cura brings additional responsibility to the cloud provider in meeting the SLA requirements for the jobs and to the customers. Though the proposed model is not a must for cloud service providers to function, they can obtain significant benefits by offering such a model. While this model brings attractive cost benefits to both customers and cloud providers, we would need appropriate incentive models for both cloud providers and customers in order for it to function symbiotically. The emergence of new cloud-managed models in commercial services (such as Google BigQuery [34]) suggests that the additional management overhead on the cloud providers might be quite practical given the wide range of cost benefits such models bring. Motivated by the huge benefits of global resource management, similar models have also been proposed in the context of database-as-a-service [10], [11], [12] and have been shown to be practical. Another key challenge in globally optimized resource management is that the scheduling and resource allocation techniques need to be highly scalable and efficient enough to work even in scenarios with thousands of servers and tens of thousands of customer jobs. This calls for highly scalable scheduling techniques, and we believe there are many possible directions for future work here. Finally, we also believe that resource pricing in a globally optimized cloud can be quite a challenge and needs attention from both the business perspective and the resource management perspective. When pricing models are closely integrated with resource management techniques, there are huge opportunities for scheduling techniques where resource management decisions are influenced by intimately coupled pricing models. We believe such sophisticated models will be interesting extensions to the Cura global resource management model.

3 CURA: RESOURCE MANAGEMENT

In this section, we describe Cura's core resource management techniques. We first present Cura's VM-aware job scheduler, which intelligently schedules jobs within the available set of VM pools. We then present our reconfiguration-based VM pool manager, which dynamically manages the VM instance pools by adaptively reconfiguring VMs based on current workload requirements.

3.1 VM-Aware Scheduling

The goal of the cloud provider is to minimize the infrastructure cost by minimizing the number of servers required to handle the data center workload. Typically, the peak workload decides the infrastructure cost for the data center. The goal of Cura's VM-aware scheduling is to schedule all jobs within available VM pools to meet their


deadlines while minimizing the overall resource usage in the data center, reducing this total infrastructure cost. As jobs are incrementally submitted to the cloud, scheduling requires an online algorithm that can place one job at a time on an infrastructure already executing some jobs. To better understand the complexity of the problem, we first analyze an offline version, which leads us to the design of an online scheduler.

3.1.1 Offline VM-Aware Scheduling

In the offline VM-aware scheduling problem, we assume that information about the jobs, their arrival times and deadlines is known a priori, and the goal of the algorithm is to schedule all jobs to meet their deadlines by appropriately provisioning VM clusters while minimizing the overall resource usage in the cloud. We assume each job, J_i, is profiled when it first goes to production; based on the profile, it has a number of predictions across various cluster configurations, C_{k,n}, in terms of instance types, denoted by k, and number of VMs, denoted by n. Let t_arrival(J_i) and t_deadline(J_i) denote the arrival time and deadline of job J_i, respectively. The running time of job J_i using cluster configuration C_{k,n} is given by t_run(J_i, C_{k,n}), and it includes both the execution time and the time for loading data into the HDFS. Cost(J_i, C_{k,n}) represents the resource usage cost of scheduling job J_i using cluster configuration C_{k,n}. Precisely, the cost, Cost(J_i, C_{k,n}), is the product of the number of physical servers required to host the virtual cluster C_{k,n} and the running time of the job, t_run(J_i, C_{k,n}). If R_k represents the number of units of physical resources in VM type k, and each physical server has M units of physical resources,4 the resource usage cost can be computed as:

    Cost(J_i, C_{k,n}) = t_run(J_i, C_{k,n}) × (n × R_k) / M.

Handling prediction errors. Additionally, t_{run}(J_i, C_{k,n}) includes an error bound in the prediction to ensure that the job will complete within its deadline even when there is prediction error. This error can also account for deviations in predicted running time due to interference among multiple MapReduce jobs when they execute in parallel. If t_{actualrun}(J_i, C_{k,n}) represents the actual running time and t_{error}(J_i, C_{k,n}) represents the error bound in the predicted running time, we have

  t_{run}(J_i, C_{k,n}) = t_{actualrun}(J_i, C_{k,n}) + t_{error}(J_i, C_{k,n}).

This conservative estimate of t_{run}(J_i, C_{k,n}) guarantees that the job completes execution even when there is prediction error.

Let t_{start}(J_i) be the actual starting time of job J_i; the end time of job J_i is then given by

  t_{end}(J_i) = t_{start}(J_i) + Σ_{k,n} X_i^{k,n} × t_{run}(J_i, C_{k,n}),

where X_i^{k,n} is a Boolean variable indicating if job J_i is scheduled using the cluster configuration C_{k,n}, and

  ∀i, Σ_{k,n} X_i^{k,n} = 1.

In order to ensure that all jobs get completed within their deadlines, we have

  ∀i, t_{end}(J_i) ≤ t_{deadline}(J_i).

The sum of concurrent usage of VMs among the running jobs is also constrained by the number of VMs, V_k, in the VM pools, where k represents the VM type. If S_i^t is a Boolean variable indicating if job J_i is executing at time t, we have

  S_i^t = 1 if t_{start}(J_i) ≤ t ≤ t_{end}(J_i), and 0 otherwise;

  ∀t, ∀k, Σ_i (S_i^t × Σ_n (X_i^{k,n} × n)) ≤ V_k.
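The two constraint families above (per-job deadlines and per-type pool capacity) can be sketched as a small feasibility checker. This is an illustrative sketch, not the authors' code; the dictionary layout and discrete time ticks are assumptions for the example.

```python
# Hypothetical sketch: verify a candidate schedule against the offline
# constraints -- every job meets its deadline, and concurrent VM usage
# per type never exceeds the pool size V_k.

def schedule_is_feasible(jobs, pool_sizes):
    """jobs: list of dicts with keys
         'start', 'run', 'deadline'  (times in scheduler ticks)
         'vm_type', 'num_vms'        (the chosen config C_{k,n})
       pool_sizes: dict vm_type -> V_k"""
    # Deadline constraint: for all i, t_end(J_i) <= t_deadline(J_i)
    for j in jobs:
        if j['start'] + j['run'] > j['deadline']:
            return False
    # Capacity constraint: at every tick t, for every type k, the VMs
    # of running jobs must fit in the pool. (We use a half-open
    # [start, start+run) interval as a discretization choice.)
    horizon = max(j['start'] + j['run'] for j in jobs)
    for t in range(horizon):
        usage = {}
        for j in jobs:
            if j['start'] <= t < j['start'] + j['run']:
                usage[j['vm_type']] = usage.get(j['vm_type'], 0) + j['num_vms']
        for k, used in usage.items():
            if used > pool_sizes[k]:
                return False
    return True
```

For example, two jobs of 10 and 30 small VMs overlapping in time are feasible with a 40-VM small pool, but not with 39.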

With the above constraints ensuring that the jobs get scheduled to meet deadlines, the key optimization is to minimize the overall resource usage cost of all the jobs in the system:

  Overallcost = min Σ_{i,k,n} Cost(J_i, C_{k,n}) × X_i^{k,n}.

Finding an optimal solution to this problem is NP-hard, by a reduction from the known NP-hard multi bin-packing problem [21] with additional job moldability constraints. Therefore, we use a heuristics-based VM-aware scheduler which is designed to work in an online fashion.
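To make the cost model concrete, the following sketch evaluates Cost(J_i, C_{k,n}) together with the conservative runtime estimate. The function and variable names are our own illustrations, not taken from the paper's implementation.

```python
# Illustrative sketch of the cost model:
#   Cost(J_i, C_{k,n}) = t_run x (n x R_k) / M
# where t_run already folds in the prediction error bound,
#   t_run = t_actualrun + t_error.

def conservative_runtime(t_predicted, t_error):
    # Adding the error bound guarantees the job still finishes by its
    # deadline under worst-case prediction error.
    return t_predicted + t_error

def resource_usage_cost(t_run, n, r_k, m):
    """t_run: conservative running time of the job
       n:     number of VMs in the virtual cluster
       r_k:   resource units per VM of type k (R_k)
       m:     resource units per physical server (M)"""
    # (n * r_k) / m is the number of physical servers the virtual
    # cluster occupies; the cost weights that by running time.
    return t_run * (n * r_k) / m
```

For instance, a 20-VM cluster of a type with R_k = 4 units, on servers with M = 16 units, occupies 5 servers; at a 660-second conservative runtime the cost is 3,300 server-seconds.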

3.1.2 Online VM-Aware Scheduler

Given VM pools for each VM instance type and continually incoming jobs, the online VM-aware scheduler decides (a) when to schedule each job in the job queue, (b) which VM instance pool to use, and (c) how many VMs to use for the jobs. The scheduler also decides the best Hadoop configuration settings to be used for the job by consulting the profile and analyze service.

Depending upon the deadlines of the submitted jobs, the VM-aware scheduler typically needs to make future reservations on VM pool resources (e.g., reserving 100 small instances from time instant 100 to 150). In order to maintain the most agility in dealing with incrementally incoming jobs and to minimize the number of reservation cancellations, Cura uses a strategy of creating the minimum number of future reservations without under-utilizing any resources. To implement this strategy, the scheduler operates by identifying the highest priority job to schedule at any given time and creates a tentative reservation of resources for that job. It then uses the end time of that job's reservation as the bound for limiting the number of reservations, i.e., jobs in the job queue that are not schedulable (in terms of start time) within that reservation time window are not considered for reservation. This ensures that we do not unnecessarily create a large number of reservations which may need cancellation and rescheduling after another job with a more stringent deadline enters the queue.

4. Though we present a scalar capacity value, VM resources may have multiple dimensions like CPU, memory and disk. To handle this, our model can be extended to include a vector of resources, or compute dimensions can be captured in a scalar value, e.g., the volume metric [13].

A job J_i is said to have higher priority over job J_j if the schedule obtained by reserving job J_i after reserving job J_j would incur higher resource cost compared to the schedule obtained by reserving job J_j after reserving J_i. The highest priority job is thus the job whose deferral would incur the highest overall resource usage cost compared to deferring any other job. Concretely, if Cost(J_i, J_j) denotes the resource usage cost of the schedule obtained by reserving job J_i after reserving job J_j, and Jlist represents the job queue, then the highest priority job is chosen as the job J_i that maximizes

  Σ_{J_j ∈ Jlist} (Cost(J_j, J_i) − Cost(J_i, J_j)).
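The selection rule above can be sketched as follows. The pairwise_cost callable is a stand-in for Cost(J_i, J_j), the cost of the schedule that reserves J_i after J_j; its computation is assumed here, not implemented.

```python
# Sketch of the highest-priority-job rule (hypothetical names): pick
# the job whose deferral would be most expensive overall.

def highest_priority_job(job_queue, pairwise_cost):
    # Maximize sum over J_j of [Cost(J_j, J_i) - Cost(J_i, J_j)],
    # i.e., the aggregate penalty of reserving every other job ahead
    # of J_i versus reserving J_i first.
    def deferral_penalty(ji):
        return sum(pairwise_cost(jj, ji) - pairwise_cost(ji, jj)
                   for jj in job_queue if jj != ji)
    return max(job_queue, key=deferral_penalty)
```

Note that this performs the pairwise cost comparisons over the queue, which is the O(n²) operation the paper later identifies as the natural target for parallelization.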

For a job, the least cost cluster configuration (VM type, number of VMs) at a given time is found based on the performance predictions developed for the job during the profile and analyze phase. It is essentially the lowest cost cluster configuration that can meet the job's deadline and for which resources are available.

For each VM pool, the algorithm picks the highest priority job, J_prior, in the job queue and makes a reservation for it using the cluster configuration with the lowest possible resource cost at the earliest possible time, based on the performance predictions obtained from the profile and analyze service. Note that the lowest resource cost cluster configuration need not be the job's optimal cluster configuration (the one with the lowest per-job cost). For instance, if using the job's optimal cluster configuration at the current time cannot meet the deadline, the lowest resource cost cluster is the one that has the minimal resource usage cost among all the cluster configurations that can meet the job's deadline.

Once the highest priority job, J_prior, is reserved for all VM pools, the reservation time windows for the corresponding VM pools are fixed. Subsequently, the scheduler picks the next highest priority job in the job queue, considering priority only with respect to the reservations that are possible within the current reservation time windows of the VM pools. The scheduler keeps picking the highest priority job one by one in this manner and tries to make reservations for them on the VM pools within the reservation time window. The scheduler stops reserving either when all jobs in the queue have been considered and no more jobs are schedulable within the reservation time window, or when the reservations have filled all the resources up to the reservation time windows.

Then at each time instance, the scheduler picks the reservations for the current time and schedules them on the VM pools by creating Hadoop clusters of the sizes required by the reservations. After scheduling the set of jobs whose reservations start at the current time, the scheduler waits for one unit of time and considers scheduling for the next time unit. If no new jobs arrived within this one unit of time, the scheduler can simply look at the reservations made earlier and schedule the jobs that are reserved for the current time. However, if some new jobs arrived within the last one unit of time, the scheduler needs to check whether any of the newly arrived jobs have higher priority over the reserved jobs; in that case, the scheduler may need to cancel some existing reservations in order to reserve newly arrived jobs that have higher priority over the ones in the reserved list.

If the scheduler finds that some newly arrived jobs take priority over some jobs in the current reservation list, it first checks whether the reservation time windows of the VM pools need to be changed. They need to be changed only when some newly arrived job takes priority over the current highest priority job of a VM pool, which decides that pool's reservation time window. If such newly arrived jobs exist, the algorithm cancels all reserved jobs, moves them back to the job queue, and adds all the newly arrived jobs to the job queue. It then picks the highest priority job, J_prior, for each VM pool from the job queue, which decides the reservation time window for that VM pool. Once the new reservation time windows of the VM pools are updated, the scheduler considers the other jobs in the queue for reservation within the reservation time windows of the VM pools until either all jobs have been considered or no more resources are left for reservation. In case the newly arrived jobs do not have higher priority over the time-window-deciding jobs but have higher priority over some other reserved jobs, the scheduler does not cancel the time-window-deciding reservations. However, it cancels the other reservations, moves those jobs back to the job queue along with the new jobs, and repeats the process of reserving jobs within the reservation time windows from the job queue in decreasing order of priority. For a data center of a given size, assuming a constant number of profile and analyze predictions for each job, it can be shown that the algorithm runs in polynomial time with O(n²) complexity. We present complete pseudocode for this VM-aware scheduler in Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2014.2320498.

While even a centralized VM-aware scheduler scales well for several thousands of servers with tens of thousands of jobs, it is also straightforward to obtain a distributed implementation to scale further. As seen from the pseudocode, the main operation of the VM-aware scheduler is finding the highest priority job among the n jobs in the queue based on pairwise cost comparisons. In a distributed implementation, this operation can be distributed and parallelized so that if there are n jobs in the queue, the algorithm achieves a speedup of x with x parallel machines, each of them performing n/x pairwise cost comparisons.

Fig. 3a shows an example VM-aware schedule obtained for 15 jobs using 40 VMs in each VM type, VM-1, VM-2 and VM-3. Here we assume that jobs 1, 2, 5, 6, 7, 8, 9, 10, 11, 13 and 15 have their optimal cluster configuration using VM-1, jobs 3, 12 and 14 are optimal with VM-2, and job 4 is optimal with VM-3. Here, the VM-aware scheduler makes its best effort to minimize the overall resource usage cost by provisioning the right jobs in the right VM types and using the minimal cluster size required to meet the deadline requirements. However, when the optimal choice of resource is not available for some jobs, the scheduler considers the next best cluster configuration and schedules them in a cost-aware manner. A detailed illustration of this example with cost comparisons is presented in Appendix B, available in the online supplemental material.


3.2 Reconfiguration-Based VM Management

Although the VM-aware scheduler tries to effectively minimize the global resource usage by scheduling jobs based on resource usage cost, it may not be efficient if the underlying VM pools are not optimal for the current workload characteristics. Cura's reconfiguration-based VM manager understands the workload characteristics of the jobs as an online process and performs online reconfiguration of the underlying VM pools to better suit the current workload. For example, the VM pool allocation shown in Fig. 2 can be reconfigured as shown in Fig. 4 to have more small instances, by shutting down some large and extra large instances if the current workload pattern requires more small instances.

The reconfiguration-based VM manager considers the recent history of job executions by observing the jobs that arrived within a period of time referred to as the reconfiguration time window. For each job J_i arriving within the reconfiguration time window, the reconfiguration algorithm determines the optimal cluster configuration, C_opt(J_i). Concretely, the running time of the jobs under a given cluster configuration is predicted by the profile and analyze tool. Based on the amount of resources in the cluster configuration (i.e., number of VMs and configuration of each VM) and the running time, Cura computes the total resource cost as described in Section 3.1.1. Cura uses this information to find the optimal cluster configuration as the one which minimizes this resource usage cost while being able to complete the job within its deadline. The reconfiguration manager captures the current demand for each VM instance type in terms of the average number of VMs required for each VM type in order to provision the optimal cluster configuration to the jobs observed in the reconfiguration time window. At the end of the reconfiguration time window, the algorithm decides on the reconfiguration plan by making a suitable tradeoff between the performance enhancement obtained after reconfiguration and the cost of the reconfiguration process. If Y_i^{k,n} is a Boolean variable indicating if C_{k,n} is the optimal cluster configuration for job J_i, then the proportion of physical resources, P_k, to be allocated to each VM type k can be estimated based on the cumulative resource usage in each VM pool, computed as the product of the total running time of the jobs and the size of the cluster used:

  P_k = Σ_{i,n} (t_{run}(J_i, C_opt(J_i)) × n × Y_i^{k,n}) / Σ_{i,k,n} (t_{run}(J_i, C_opt(J_i)) × n × Y_i^{k,n}).

The total physical resources, R_total, in the cloud infrastructure can be obtained as

  R_total = Σ_k V_k × R_k,

where R_k represents the physical resources in VM type k, and V_k is the number of VMs in the existing VM pool of type k. Therefore, the number of VMs, V'_k, in the new reconfigured VM pools is given by

  V'_k = P_k × R_total / R_k.
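The pool resizing computation described above can be sketched as follows. The data layout (tuples of optimal configurations observed in the window) is an assumption for illustration; the formulas are the P_k, R_total and V'_k expressions from the text.

```python
# Sketch of the VM-pool resizing step (hypothetical data layout).

def resize_pools(window_jobs, r_per_type, current_pools):
    """window_jobs:   list of (vm_type, num_vms, run_time) giving each
                      job's optimal cluster config C_opt(J_i)
       r_per_type:    dict k -> R_k, resource units per VM of type k
       current_pools: dict k -> V_k, current pool sizes"""
    # Cumulative VM usage per type: sum of t_run x n over the window,
    # from which the demand shares P_k are derived.
    demand = {k: 0.0 for k in r_per_type}
    for vm_type, num_vms, run_time in window_jobs:
        demand[vm_type] += run_time * num_vms
    total_demand = sum(demand.values())
    # R_total = sum_k V_k x R_k, total physical resource in the cloud.
    r_total = sum(current_pools[k] * r_per_type[k] for k in current_pools)
    # New pool sizes: V'_k = P_k x R_total / R_k.
    new_pools = {}
    for k in r_per_type:
        p_k = demand[k] / total_demand if total_demand else 0.0
        new_pools[k] = int(p_k * r_total / r_per_type[k])
    return new_pools
```

For example, if the window contains only jobs optimal on small instances, the entire physical capacity is reallocated to the small-instance pool.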

Such reconfiguration has to be balanced against the cost of reconfiguration operations (shutting down some instances and starting others). For this, we compute the benefit of doing such reconfiguration. The overall observed cost represents the actual cumulative resource cost of the jobs executed during the reconfiguration time window using the existing VM pools. Here, Z_i^{k,n} is a Boolean variable indicating if job J_i used the cluster configuration C_{k,n}:

  Overallcost_observed = Σ_{i,k,n} Cost(J_i, C_{k,n}) × Z_i^{k,n}.

Next, we compute the estimated overall cost with the new VM pools, assuming that the jobs were scheduled using their optimal cluster configurations, C_opt(J_i). The reconfiguration benefit, Reconf_benefit, is then computed as the difference between the two:

  Overallcost_estimate = Σ_i Cost(J_i, C_opt(J_i)),

  Reconf_benefit = Overallcost_observed − Overallcost_estimate.

Assuming the reconfiguration process incurs an average reconfiguration overhead, Reconf_overhead, representing the resource usage spent on the reconfiguration process for each VM that undergoes reconfiguration, the total cost of reconfiguration is obtained as

  Reconf_cost = Σ_k |V'_k − V_k| × Reconf_overhead.

Fig. 3. Scheduling in Cura.

Fig. 4. Reconfiguration-based VM management.


The algorithm then triggers the reconfiguration process only if it finds that the estimated benefit exceeds the reconfiguration cost by a factor of β, i.e., if Reconf_benefit ≥ β × Reconf_cost, where β > 1. As Reconf_benefit is only an estimate of the benefit, β is chosen as a value greater than 1. When the reconfiguration process starts to execute, it shuts down some VMs of the instance types that need to decrease in number and creates new VMs of the instance types that need to be created. The rest of the process is similar to any VM reconfiguration process that focuses on the bin-packing aspect of placing VMs within the set of physical servers during reconfiguration [13], [36].
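The trigger decision can be sketched as below. This is an assumed illustration: the benefit is taken as the observed cost minus the estimated cost under optimal configurations, and β discounts the benefit because it is only an estimate.

```python
# Sketch of the reconfiguration trigger: reconfigure only when the
# estimated benefit exceeds beta times the churn cost (beta > 1).

def should_reconfigure(observed_cost, estimated_cost,
                       old_pools, new_pools,
                       reconf_overhead, beta=1.5):
    # Reconf_benefit = Overallcost_observed - Overallcost_estimate
    benefit = observed_cost - estimated_cost
    # Reconf_cost = sum_k |V'_k - V_k| x Reconf_overhead, i.e., an
    # overhead charge per VM shut down or created.
    churn = sum(abs(new_pools[k] - old_pools[k]) for k in old_pools)
    cost = churn * reconf_overhead
    return benefit >= beta * cost
```

A larger β makes the manager more conservative, reconfiguring only when the estimated savings are well clear of the churn cost.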

Continuing the example of Fig. 3, we find that the basic VM-aware scheduler in Fig. 3a, without reconfiguration support, schedules jobs 5, 6, 8, 13 and 15 using the VM-2 and VM-3 types even though they are optimal with VM-1. The reconfiguration-based VM-aware schedule in Fig. 3b provisions more VM-1 instances (notice the changed Y-axis scale) by understanding the workload characteristics, and hence, in addition to the other jobs, jobs 5, 6, 8, 13 and 15 also get scheduled with their optimal choice of VM type, namely VM-1, thereby minimizing the overall resource usage cost in the cloud data center. The detailed schedule for this case is explained in Appendix B, available in the online supplemental material, for interested readers.

4 EXPERIMENTAL EVALUATION

We divide the experimental evaluation of Cura into two parts: first, we provide a detailed analysis of the effectiveness of Cura compared to conventional MapReduce services, and then we present an extensive micro-analysis of the different techniques in Cura that contribute to the overall performance. We first start with our experimental setup.

4.1 Experimental Setup

Cluster setup. Our profiling cluster consists of 20 CentOS 5.5 physical machines (KVM as the hypervisor) with 16-core 2.53 GHz Intel processors and 16 GB RAM. The machines are organized in two racks, each rack containing 10 physical machines. The network is 1 Gbps, and the nodes within a rack are connected through a single switch. We considered six VM instance types, with the lowest configuration starting from two 2 GHz VCPUs and 2 GB RAM up to the highest configuration having twelve 2 GHz VCPUs and 12 GB RAM, with each VM configuration differing from the next higher configuration by two 2 GHz VCPUs and 2 GB RAM.

Workload. We created 50 jobs using the SWIM MapReduce workload generator [30] that richly represent the characteristics of the production MapReduce workload in the Facebook MapReduce cluster. The workload generator uses a real MapReduce trace from the Facebook production cluster and generates jobs with characteristics similar to those observed in the Facebook cluster. Using the Starfish profiling tool [24], each job is profiled on our cluster setup using clusters of VMs of all six VM types. Each profile is then analyzed using Starfish to develop predictions across various hypothetical cluster configurations and input data sizes.

Simulation setup. In order to analyze the performance and cost benefits of Cura on a datacenter-scale system, we developed a simulator in Java that uses the profiles and performance predictions developed from the real cluster. The simulator models a cloud datacenter with servers, each having a 16-core 2.53 GHz processor with 16 GB RAM. It implements both the VM-aware scheduling with instant VM allocation and the reconfiguration-based VM management techniques. The execution time for each job in the simulation is taken as the predicted execution time (based on the profiles generated from the profiling cluster) plus a prediction error, which could be either positive or negative within an assumed error bound.

Metrics. We evaluate our techniques on four key metrics with the goal of measuring their cost effectiveness and performance: (1) number of servers: techniques that require more physical servers to successfully meet the service quality requirements are less cost-effective; this metric measures the capital expense of provisioning the physical infrastructure in the data center; (2) response time: techniques with higher response time provide poorer service quality; this metric captures the service quality of the jobs; (3) per-job infrastructure cost: this metric represents the average per-job fraction of the infrastructure cost; techniques that require fewer servers will have lower per-job cost; and (4) effective utilization: techniques that result in poor utilization lead to higher cost; this metric captures both the cost-effectiveness and the performance of the techniques. It should be noted that the effective utilization captures only the useful utilization that represents job execution and does not include the time taken for creating and destroying VMs.

Before discussing the experimental results, we briefly describe the set of techniques compared in the evaluation.

Per-job cluster services. Per-job services are similar to dedicated MapReduce services such as Amazon Elastic MapReduce [14] that create clusters per job or per workflow. While this model does not automatically pick VM and Hadoop parameters, for a fair comparison we use Starfish to create the optimal VM and Hadoop configuration even in this model.

Dedicated cluster services. Dedicated clusters are similar to private cloud infrastructures where all VMs are managed by the customer enterprises and Hadoop clusters are formed on demand when jobs arrive. Here again, the VM and job parameters are chosen via Starfish.

Cura. Cura incorporates both the VM-aware scheduler and reconfiguration-based VM pool management. For the micro-analysis, we also compare the following sub-techniques to better evaluate Cura: 1) the per-job optimization technique, which uses Cura's secure instant VM allocation but always uses the per-job optimal number of VMs and the optimal VM type; 2) the VM-aware scheduler described in Section 3.1; and 3) reconfiguration-based VM management (Section 3.2).

4.2 Experimental Results

We first present the experimental evaluation of Cura by comparing it with the existing techniques under various experimental conditions determined by the distribution of the job deadlines, the size of the MapReduce jobs, the number of servers in the system, and the amount of prediction error in the profile and analyze process. By default, we use a composite workload consisting of equal proportions of jobs of three different categories: small jobs, medium jobs and large jobs. Small jobs read 100 MB of data, whereas medium jobs and large jobs read 1 and 10 GB of input data respectively. We model Poisson job arrivals with rate parameter λ = 0.5, and the jobs are uniformly distributed among 50 customers. The evaluation uses 11,500 jobs arriving within a period of 100 minutes. Each arriving job represents one of the 50 profiled jobs, with input data size ranging from 100 MB to 10 GB based on the job size category. By default, we assume that jobs run for the same amount of time predicted in the profile and analyze process; however, we dedicate a separate set of experiments to studying the performance of the techniques when such predictions are erroneous. Note that a job's complete execution includes both the data loading time from the storage infrastructure to the compute infrastructure and the Hadoop startup time for setting up the Hadoop cluster in the cluster of VMs. The data loading time is computed by assuming a network throughput of 50 MBps per VM (see footnote 5) from the storage server, and the Hadoop startup time is taken as 10 seconds.
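A workload of this shape can be sketched as follows. This is our own illustrative generator using the parameters stated in the text (λ = 0.5, three size categories, 50 customers); it is not the authors' simulator or the SWIM tool.

```python
import random

# Hypothetical sketch: a Poisson arrival stream of MapReduce jobs with
# the paper's stated parameters (rate 0.5, three size categories,
# jobs spread uniformly over 50 customers).

def generate_arrivals(duration_s, rate=0.5, seed=42):
    random.seed(seed)
    sizes = {'small': 100, 'medium': 1024, 'large': 10240}  # input MB
    jobs, t = [], 0.0
    while True:
        # Exponential inter-arrival times yield a Poisson process.
        t += random.expovariate(rate)
        if t > duration_s:
            break
        category = random.choice(list(sizes))
        jobs.append({'arrival': t,
                     'category': category,
                     'input_mb': sizes[category],
                     'customer': random.randrange(50)})
    return jobs
```

Each generated record carries just enough state (arrival time, size category, customer) to drive a scheduler simulation like the one described in the setup.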

4.2.1 Effect of Job Deadlines

In this set of experiments, we first study the effect of job deadlines on the performance of Cura compared with the other techniques (Fig. 5), and then we analyze the performance of Cura in terms of the contributions of each of its sub-techniques (Fig. 6). Fig. 5a shows the performance of the techniques for different maximum deadlines with respect to the number of servers required for the cloud provider to satisfy the workload. Here, the deadlines are uniformly distributed up to the maximum deadline value shown on the X-axis. We find that provisioning dedicated clusters for each customer requires a large number of resources, as dedicated clusters are sized based on the peak requirements of each customer and the resources are therefore under-utilized. On the other hand, per-job cluster services require fewer servers (Fig. 5a), as these resources are shared among the customers. However, the Cura approach in Fig. 5a has a much lower resource requirement, with up to 80 percent reduction in the number of servers. This is due to the global optimization capability of Cura: where per-job and dedicated cluster services always attempt to place jobs based on the per-job optimal configuration obtained from Starfish, resources for which may not be available in the cloud, Cura can schedule jobs using configurations other than their individual optima to better adapt to the available resources in the cloud.

We also compare the approaches in terms of the mean response time in Fig. 5b. To allow each compared technique to successfully schedule all jobs (and not cause failures), we use the number of servers obtained in Fig. 5a for each individual technique. As a result, in this response time comparison, Cura is using far fewer servers than the other techniques. We find that the Cura approach and the dedicated cluster approach have lower response time (up to 65 percent).

In the per-job cluster approach, the VM clusters are created for each job, and it takes additional time for the VM creation and booting process before the jobs can begin execution, leading to the increased response time of the jobs. Similar to the comparison on the number of servers, we see the same trend with respect to the per-job cost in Fig. 5c, which shows that the Cura approach can significantly reduce the per-job infrastructure cost of the jobs (up to 80 percent). The effective utilization in Fig. 5d shows that the per-job cluster services and the dedicated cluster approach have much lower effective utilization compared to the Cura approach. The per-job services spend a lot of resources in creating VMs for every job arrival. Especially with short response time jobs, the VM creation becomes a bigger overhead and reduces the effective utilization. The dedicated cluster approach does not create VMs for every job instance; however, it has poor utilization because dedicated clusters are sized based on peak utilization. But the Cura approach has a high effective utilization, with up to 7× improvement compared to the other techniques, as Cura effectively leverages global optimization and deadline-awareness to achieve better resource management.

Fig. 5. Effect of job-deadlines.

5. Here, the 50 MBps throughput is a conservative estimate of the throughput between the storage and compute infrastructures based on measurement studies on real cloud infrastructures [22].

Micro analysis. Next, we discuss the performance of the sub-techniques of Cura and illustrate how much each sub-technique contributes to the overall performance under different deadlines. Fig. 6a shows that with only per-job optimization (which only leverages instant VM allocation), up to 2.6× more servers are required compared to using the reconfiguration-based VM pool management scheme with the VM-aware scheduler. The per-job optimization scheduler always chooses the optimal VM type and the optimal number of VMs for each job; in case the optimal resources are not available when the job arrives, the scheduler keeps queuing the job until the required optimal resources become available as other jobs complete. It drops the request when it finds that the job cannot meet its deadline even if the optimal resources are provisioned. With the VM-aware approach, however, the scheduler is still able to schedule the job by provisioning additional resources in order to meet the deadline. Second, with the per-job optimization scheduler, even when some sub-optimal resources are available while the job is waiting, they remain unused, as the job expects to be scheduled only using the optimal resources. Therefore the per-job optimization results in poor performance. The number of servers required by the VM-aware approach is further reduced, by up to 45 percent, through efficient reconfiguration-based VM management that dynamically manages the VMs in each VM pool. Fig. 6b shows the mean response time of the jobs for the various sub-techniques. We find that the sub-techniques have similar response times, except for the per-job optimization case, which has up to 11 percent higher mean response time. As the per-job optimization scheduler keeps the jobs waiting until it finds their optimal resources, it leads to higher queuing time that causes this increase.

4.2.2 Effect of Prediction Error

This set of experiments evaluates the techniques by studying the effect of inaccuracies in the performance prediction. As accurate performance predictions may not always be available, it is important that the techniques tolerate inaccuracies in performance prediction and still perform efficiently. Fig. 7 shows the comparison of the techniques while varying the error rate from 0 to 70 percent. Here, the mean deadline of the jobs is taken as 800 seconds. The error rate means that the accurate running time of the jobs can be anywhere within the error range on either side of the predicted value. The comparison of the number of servers in Fig. 7a shows that all the techniques require more servers when the prediction error increases. The Cura approach on average requires 4 percent additional servers for every 10 percent increase in prediction error. Note that even the per-job cluster and dedicated cluster schemes require an increased number of servers, as they too decide the resource requirements based on the performance predictions of the jobs across different cluster configurations.

Fig. 7b shows that the response time of the techniques decreases with increase in the error rate. While the Cura and dedicated cluster approaches have a decrease of 4.2 and 3.7 percent respectively, the per-job cluster approach has a decrease of only 1.4 percent for every 10 percent increase in error rate, as the major fraction of the response time in these services is due to the VM creation process. As the error rate increases, the techniques provision more resources to ensure that even in the worst case, when the jobs run for the maximum possible time within the error range, the jobs complete within the deadline. Therefore, in cases where the job completes within the maximum possible running time, these additional resources make the job complete earlier than its deadline, speeding up the execution and resulting in lower response time. The cost trend shown in Fig. 7c also shows that the techniques requiring fewer servers result in lower per-job cost. Similarly, the effective utilization comparison in Fig. 7d shows similar relative performance as in Fig. 5d.

We compare the performance of the sub-techniques of Cura under different error rates in Fig. 8. We find that the number of servers in Fig. 8a shows a similar relative performance among the sub-techniques as in Fig. 7a. Here again, the response time, as shown in Fig. 8b, shows that the per-job optimization scheduler leads to higher response time due to queue wait times, and the response time of the sub-techniques increases with increase in error rate.

Fig. 6. Effect of job-deadlines.

4.2.3 Varying Number of Servers

We next study the performance of the techniques by varying the number of servers provisioned to handle the workload. Fig. 9a shows the success rate of the approaches for various numbers of servers. Here, the success rate represents the fraction of jobs that successfully meet their deadlines. We find that the Cura approach has a high success rate even with 250 servers, whereas the per-job cluster approach obtains close to a 100 percent success rate only with 2,000 servers. Fig. 9b shows that the response time of successful jobs in the compared approaches follows a similar trend as in Fig. 5b, where the Cura approach performs better than the per-job cluster services.

4.2.4 Varying Job Sizes

This set of experiments evaluates the performance of thetechniques for various job sizes based on the size of inputdata read. Note that small jobs process 100 MB of data,medium jobs process 1 GB of data and large and extra largejobs process 10 and 100 GB of data respectively. Also small,medium and large jobs have a mean deadline of 100 secondand the extra large jobs have a mean deadline of 1,000 sec-ond as they are long running. We find that the performancein terms of number of servers in Fig. 10a has up to 9ximprovement for the short and medium jobs with Curaapproach compared to the per-job cluster approach. It isbecause in addition to the VM-aware scheduling and recon-figuration-based VM management, these jobs benefit themost from the secure instant VM allocation as these areshort jobs. For large and extra large jobs, the Cura approachstill performs significantly better having up to 4� and 2�improvement for large and extra large jobs compared to theper-job cluster services. The dedicated cluster servicerequires significantly higher resources for large jobs as the

Fig. 7. Effect of prediction error.

Fig. 8. Effect of prediction error.

PALANISAMY ET AL.: COST-EFFECTIVE RESOURCE PROVISIONING FOR MAPREDUCE IN A CLOUD 1275

peak workload utilization becomes high (its numbers significantly cross the max Y-axis value). This set of experiments shows that the global optimization techniques in Cura are efficient not only for short jobs but also for long running batch workloads.

The response time improvements of Cura over the dedicated cluster approach in Fig. 10b also show that the improvement is very significant for short jobs, with up to 87 percent reduced response time, and up to 69 percent for medium jobs. It is reasonably significant for large jobs, with up to 60 percent lower response time, and for extra large jobs, with up to 30 percent reduced response time. The cost comparison in Fig. 10c shows a similar trend: although the Cura approach is significantly effective for both large and extra large jobs, the cost reduction is much more significant for small and medium jobs.

The sub-technique comparison of Cura for various job types in terms of number of servers is shown in Fig. 11a. We find that the sub-techniques have an impact on all kinds of jobs irrespective of the job size. While secure instant VM allocation contributes more to the performance of small jobs compared to large jobs, the sub-techniques in Cura have equal impact on the overall performance for all job categories. The response time comparison of the sub-techniques in Fig. 11b shows that the sub-techniques have similar response times; however, for large and extra large jobs, the per-job optimization increases response time by up to 24.8 percent, as large jobs in the per-job optimization incur longer waiting times in the queue since they often request more resources that may not be immediately available.

4.3 Effect of Deadline Distribution

In Fig. 12, we study the effect of different deadline distributions on the performance of Cura for different mean deadlines. In Fig. 12a, we find that both the Poisson and uniform distributions require a similar number of servers, whereas the exponential deadline distribution requires up to 30 percent additional servers, as there are more jobs with shorter deadlines in the exponential distribution. The response time comparison in Fig. 12b shows that irrespective

Fig. 9. Effect of servers.

Fig. 10. Effect of job type.

1276 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 5, MAY 2015

of the deadline distribution, the jobs have more or less similar response times.
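The distributional effect described above can be illustrated with a short sketch that samples deadlines from the three distributions at a common mean. This code is our own illustration, not part of Cura; the sample size and seed are arbitrary:

```python
import math
import random

def poisson(mean, rng=random):
    """Sample from a Poisson distribution (Knuth's algorithm);
    adequate for the moderate means used in this illustration."""
    L = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

mean = 100  # mean deadline in seconds, as in the small/medium/large jobs
rng = random.Random(42)
n = 10000
exp_deadlines = [rng.expovariate(1 / mean) for _ in range(n)]
uni_deadlines = [rng.uniform(0, 2 * mean) for _ in range(n)]
poi_deadlines = [poisson(mean, rng) for _ in range(n)]

# Fraction of "urgent" jobs, i.e., deadlines below half the mean.
short = lambda ds: sum(d < mean / 2 for d in ds) / len(ds)

# The exponential distribution yields far more short deadlines
# (roughly 0.39 vs. 0.25 for uniform and nearly 0 for Poisson),
# which is why it demands additional servers in Fig. 12a.
print(short(exp_deadlines), short(uni_deadlines), short(poi_deadlines))
```

This matches the intuition in the text: with the same mean, the exponential distribution concentrates far more probability mass on short deadlines than the uniform or Poisson distributions do.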

5 RELATED WORK

Resource allocation and job scheduling. There is a large body of work on resource allocation and job scheduling in grid and parallel computing. Some representative examples of generic schedulers include [37], [38]. The techniques proposed in [39], [40] consider the class of malleable jobs, where the number of processors provisioned can be varied at runtime. Similarly, the scheduling techniques presented in [41], [42] consider moldable jobs that can be run on different numbers of processors. These techniques do not consider a virtualized setting and hence do not deal with the challenges of dynamically managing and reconfiguring the VM pools to adapt to workload changes. Therefore, unlike Cura, they do not make scheduling decisions over dynamically managed VM pools. Chard et al. present a resource allocation framework for grid and cloud computing frameworks by employing economic principles in job scheduling [45]. Hacker and Mahadik propose techniques for allocating virtual clusters by queuing job requests to minimize the spare resources in the cloud [46]. Recently, there has been work on cloud auto scaling with the goal of minimizing customer cost while provisioning the resources required to provide the needed service quality [43]. The authors in [44] propose techniques for combining on-demand provisioning of virtual resources with batch processing to increase system utilization. Although the above mentioned systems have considered cost reduction as a primary objective of resource management, these systems are based on either per-job or per-customer optimization and hence, unlike Cura, they do not lead to a globally optimal resource management.

MapReduce task placement. There have been several efforts that investigate task placement techniques for MapReduce while considering fairness constraints [17], [32]. Mantri tries to improve job performance by minimizing outliers through network-aware task placement [3]. Similar to Yahoo's capacity scheduler and Facebook's fairness scheduler, the goal of these techniques is to appropriately place tasks for the jobs running in a given Hadoop cluster to optimize for locality, fairness and performance. Cura, on the other hand, deals with the challenges of appropriately provisioning the right Hadoop clusters for the jobs in terms of VM instance type and cluster size to globally optimize for resource cost while dynamically reconfiguring the VM pools to adapt to workload changes.

MapReduce in a cloud. Recently, motivated by MapReduce, there has been work on resource allocation for data-intensive applications in the cloud context [18], [33]. Quincy [18] is a resource allocation system for scheduling concurrent jobs on clusters, and Purlieus [33] is a MapReduce cloud system that improves job performance through locality optimizations achieved by optimizing data and compute placements in an integrated fashion. However, unlike Cura, these systems are not aimed at improving the usage model for MapReduce in a cloud to better serve modern workloads with lower cost.

MapReduce profile and analyze tools. A number of MapReduce profiling tools have been developed in the recent past with the objective of minimizing the customer's cost in the cloud [4], [5], [23], [28], [29]. Herodotou and Babu developed an automated performance prediction tool based on their profile-and-analyze tool Starfish [24] to guide customers to

Fig. 11. Effect of job type.

Fig. 12. Effect of deadline distribution.


choose the best cluster size for meeting their job requirements [26]. A similar performance prediction tool was developed by Verma et al. [29] based on a linear regression model with the goal of guiding customers to minimize cost. Popescu et al. developed a technique for predicting runtime performance for jobs running over varying input data sets [28]. Recently, a new tool called Bazaar [27] was developed to guide MapReduce customers in a cloud by predicting job performance using a gray-box approach that has very high prediction accuracy, with less than 12 percent prediction error. However, as discussed earlier, these job optimizations initiated from the customer end may lead to requiring higher resources at the cloud. Cura, while leveraging existing profiling research, addresses the challenge of optimizing the global resource allocation at the cloud provider end with the goal of minimizing customer costs. As seen in the evaluation, Cura benefits from both its cost-optimized usage model and its intelligent scheduling and online reconfiguration-based VM pool management.

6 CONCLUSIONS

This paper presents a new MapReduce cloud service model, Cura, for data analytics in the cloud. We argued that existing cloud services for MapReduce are inadequate and inefficient for production workloads. In contrast to existing services, Cura automatically creates the best cluster configuration for the jobs using MapReduce profiling and leverages deadline-awareness which, by delaying execution of certain jobs, allows the cloud provider to optimize its global resource allocation efficiently and reduce its costs. Cura also uses a unique secure instant VM allocation technique that ensures fast response time guarantees for short interactive jobs, a significant proportion of modern MapReduce workloads. Cura's resource management techniques include cost-aware resource provisioning, VM-aware scheduling and online virtual machine reconfiguration. Our experimental results using jobs profiled from realistic Facebook-like production workload traces show that Cura achieves more than 80 percent reduction in infrastructure cost with 65 percent lower job response times.

ACKNOWLEDGMENTS

The first and third authors are supported by grants from the US National Science Foundation (NSF) CISE NetSE program, SaTC program, I/UCRC, and a grant from Intel ICST on cloud computing.

REFERENCES

[1] B. Igou, "User survey analysis: Cloud-computing budgets are growing and shifting; traditional IT services providers must prepare or perish," Gartner, Inc., Stamford, CT, USA, Gartner Rep. G00205813, 2010.

[2] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proc. 6th Symp. Operating Syst. Des. Implementation, 2004, p. 10.

[3] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris, "Reining in the outliers in Map-Reduce clusters using Mantri," in Proc. 9th USENIX Conf. Operating Syst. Des. Implementation, 2010.

[4] K. Kambatla, A. Pathak, and H. Pucha, "Towards optimizing Hadoop provisioning in the cloud," in Proc. Conf. Hot Topics Cloud Comput., 2009.

[5] K. Morton, A. Friesen, M. Balazinska, and D. Grossman, "Estimating the progress of MapReduce pipelines," in Proc. IEEE 26th Int. Conf. Data Eng., 2010, pp. 681–684.

[6] Pig user guide. [Online]. Available: http://pig.apache.org/, 2014.

[7] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, "Hive - A warehousing solution over a MapReduce framework," Proc. VLDB Endowment, vol. 2, pp. 1626–1629, 2009.

[8] D. Borthakur, J. Gray, J. S. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash, R. Schmidt, and A. Aiyer, "Apache Hadoop goes real-time at Facebook," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2011, pp. 1071–1080.

[9] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, "Dremel: Interactive analysis of web-scale datasets," in Proc. 36th Int. Conf. Very Large Data Bases, 2010, pp. 330–339.

[10] C. Curino, E. P. C. Jones, R. Popa, N. Malviya, E. Wu, S. Madden, H. Balakrishnan, and N. Zeldovich, "Relational cloud: A database-as-a-service for the cloud," in Proc. 5th Biennial Conf. Innovative Data Syst. Res., 2011, pp. 235–240.

[11] S. Aulbach, T. Grust, D. Jacobs, A. Kemper, and J. Rittinger, "Multi-tenant databases for software as a service: Schema-mapping techniques," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2008, pp. 1195–1206.

[12] P. Xiong, Y. Chi, S. Zhu, J. Tatemura, C. Pu, and H. Hacigumus, "ActiveSLA: A profit-oriented admission control framework for database-as-a-service providers," in Proc. 2nd ACM Symp. Cloud Comput., 2011.

[13] T. Wood, P. Shenoy, A. Venkataramani, and M. Yousif, "Black-box and gray-box strategies for virtual machine migration," in Proc. 4th USENIX Conf. Netw. Syst. Des. Implementation, 2007, p. 17.

[14] Amazon Elastic MapReduce. [Online]. Available: http://aws.amazon.com/elasticmapreduce/, 2014.

[15] Amazon Elastic Compute Cloud. [Online]. Available: http://aws.amazon.com/ec2/, 2014.

[16] Hadoop. [Online]. Available: http://hadoop.apache.org, 2014.

[17] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proc. 8th USENIX Conf. Operating Syst. Des. Implementation, 2008, pp. 29–42.

[18] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, "Quincy: Fair scheduling for distributed computing clusters," in Proc. ACM SIGOPS 22nd Symp. Operating Syst. Principles, 2009, pp. 261–276.

[19] I. Roy, S. Setty, A. Kilzer, V. Shmatikov, and E. Witchel, "Airavat: Security and privacy for MapReduce," in Proc. 7th USENIX Conf. Netw. Syst. Des. Implementation, 2010, p. 20.

[20] SELinux user guide. [Online]. Available: http://selinuxproject.org/page/Main_Page, 2014.

[21] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco, CA, USA: Freeman, 1979.

[22] S. L. Garfinkel, "An evaluation of Amazon's grid computing services: EC2, S3 and SQS," School Eng. Appl. Sci., Harvard Univ., Cambridge, MA, USA, Tech. Rep. TR-08-07, 2007. [Online]. Available: ftp://ftp.deas.harvard.edu/techreports/tr-08-07.pdf

[23] F. Tian and K. Chen, "Towards optimal resource provisioning for running MapReduce programs in public clouds," in Proc. IEEE 4th Int. Conf. Cloud Comput., 2011, pp. 155–162.

[24] H. Herodotou and S. Babu, "On optimizing MapReduce programs/Hadoop jobs," Proc. VLDB Endowment, vol. 12, 2011.

[25] H. Herodotou, H. Li, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu, "Starfish: A self-tuning system for big data analytics," in Proc. 5th Biennial Conf. Innovative Data Syst. Res., 2011, pp. 261–272.

[26] H. Herodotou, F. Dong, and S. Babu, "No one (cluster) size fits all: Automatic cluster sizing for data-intensive analytics," in Proc. 2nd ACM Symp. Cloud Comput., 2011.

[27] V. Jalaparti, H. Ballani, P. Costa, T. Karagiannis, and A. Rowstron, "Bazaar: Enabling predictable performance in datacenters," Microsoft Res., Cambridge, U.K., Tech. Rep. MSR-TR-2012-38, 2012.

[28] A. Popescu, V. Ercegovac, A. Balmin, M. Branco, and A. Ailamaki, "Same queries, different data: Can we predict runtime performance?" in Proc. 3rd Int. Workshop Self-Manag. Database Syst., 2012.


[29] A. Verma, L. Cherkasova, and R. H. Campbell, "Resource provisioning framework for MapReduce jobs with performance goals," in Proc. ACM/IFIP/USENIX 12th Int. Middleware Conf., 2011, pp. 165–186.

[30] Y. Chen, A. Ganapathi, R. Griffith, and R. Katz, "The case for evaluating MapReduce performance using workload suites," in Proc. IEEE 19th Annu. Int. Symp. Model., Anal., Simul. Comput. Telecommun. Syst., 2011, pp. 390–399.

[31] S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan, "An analysis of traces from a production MapReduce cluster," in Proc. 10th IEEE/ACM Int. Conf. Cluster, Cloud Grid Comput., 2010, pp. 94–103.

[32] T. Sandholm and K. Lai, "MapReduce optimization using dynamic regulated prioritization," ACM SIGMETRICS Perform. Eval. Rev., vol. 37, pp. 299–310, 2009.

[33] B. Palanisamy, A. Singh, L. Liu, and B. Jain, "Purlieus: Locality-aware resource allocation for MapReduce in a cloud," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2011.

[34] Google BigQuery. [Online]. Available: https://developers.google.com/bigquery/

[35] Y. Chen, S. Alspaugh, D. Borthakur, and R. Katz, "Energy efficiency for large-scale MapReduce workloads with significant interactive analysis," in Proc. 7th ACM Eur. Conf. Comput. Syst., 2012, pp. 43–56.

[36] A. Singh, M. Korupolu, and D. Mohapatra, "Server-storage virtualization: Integration and load balancing in data centers," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2008, pp. 1–12.

[37] A. Mu'alem and D. Feitelson, "Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling," IEEE Trans. Parallel Distrib. Syst., vol. 12, no. 6, pp. 529–543, Jun. 2001.

[38] J. Skovira, W. Chan, H. Zhou, and D. Lifka, "The EASY - LoadLeveler API project," in Proc. Workshop Job Scheduling Strategies Parallel Process., 1996, pp. 41–47.

[39] S. Anastasiadis and K. Sevcik, "Parallel application scheduling on networks of workstations," J. Parallel Distrib. Comput., vol. 43, pp. 109–124, 1997.

[40] E. Rosti, E. Smirni, L. Dowdy, G. Serazzi, and B. Carlson, "Robust partitioning policies of multiprocessor systems," Perform. Eval., vol. 19, pp. 141–165, 1993.

[41] S. Srinivasan, V. Subramani, R. Kettimuthu, P. Holenarsipur, and P. Sadayappan, "Effective selection of partition sizes for moldable scheduling of parallel jobs," in Proc. 9th Int. Conf. High Perform. Comput., 2002, pp. 174–183.

[42] W. Cirne, "Using moldability to improve the performance of supercomputer jobs," Ph.D. dissertation, Dept. Comput. Sci. Eng., Univ. California, San Diego, La Jolla, CA, USA, 2001.

[43] M. Mao and M. Humphrey, "Auto-scaling to minimize cost and meet application deadlines in cloud workflows," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2011.

[44] B. Sotomayor, K. Keahey, and I. Foster, "Combining batch execution and leasing using virtual machines," in Proc. 17th Int. Symp. High Perform. Distrib. Comput., 2007, pp. 87–96.

[45] K. Chard, K. Bubendorfer, and P. Komisarczuk, "High occupancy resource allocation for grid and cloud systems, a study with DRIVE," in Proc. 19th ACM Int. Symp. High Perform. Distrib. Comput., 2010, pp. 73–84.

[46] T. Hacker and K. Mahadik, "Flexible resource allocation for reliable virtual cluster computing systems," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2011, pp. 1–12.

Balaji Palanisamy received the MS and PhD degrees in computer science from the College of Computing, Georgia Tech, in 2009 and 2013, respectively. He is an assistant professor at the School of Information Science, University of Pittsburgh. His primary research interests lie in scalable and privacy-conscious resource management for large-scale distributed and mobile systems. At the University of Pittsburgh, he co-directs research in the Laboratory of Research and Education on Security Assured Information Systems (LERSAIS), which is one of the first group of NSA/DHS designated Centers of Academic Excellence in Information Assurance Education and Research (CAE & CAE-R). He received the Best Paper Award at the Fifth International Conference on Cloud Computing, IEEE CLOUD 2012. He is a member of the IEEE and is currently the chair of the IEEE Communications Society Pittsburgh Chapter.

Aameek Singh received the bachelor's degree from IIT Bombay, India, and the PhD degree from the Georgia Institute of Technology. He is currently a research manager at IBM Research - Almaden. His research interests include enterprise systems management, cloud computing, and distributed systems. He is a member of the IEEE.

Ling Liu is a full professor in computer science at the Georgia Institute of Technology. She directs the research programs in the Distributed Data Intensive Systems Lab (DiSL), examining various aspects of large-scale data-intensive systems. She is an internationally recognized expert in the areas of cloud computing, database systems, distributed computing, Internet systems, and service-oriented computing. She has published more than 300 international journal and conference articles and received best paper awards from a number of top venues, including ICDCS 2003, WWW 2004, the 2005 Pat Goldberg Memorial Best Paper Award, IEEE Cloud 2012, and IEEE ICWS 2013. She also received the IEEE Computer Society Technical Achievement Award in 2012 and an Outstanding Doctoral Thesis Advisor award in 2012 from the Georgia Institute of Technology. In addition to serving as the general chair and PC chair of numerous IEEE and ACM conferences in the data engineering, very large databases, and distributed computing fields, she has been on the editorial boards of more than a dozen international journals. She is currently the editor in chief of IEEE Transactions on Services Computing, and is on the editorial boards of ACM Transactions on Internet Technology (TOIT), ACM Transactions on the Web (TWEB), Distributed and Parallel Databases (Springer), and the Journal of Parallel and Distributed Computing (JPDC). She is a senior member of the IEEE.

" For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.
