
This paper is included in the Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST ’14).

February 17–20, 2014 • Santa Clara, CA USA

ISBN 978-1-931971-08-9


SpringFS: Bridging Agility and Performance in Elastic Distributed Storage

Lianghong Xu, James Cipar, Elie Krevat, Alexey Tumanov, and Nitin Gupta, Carnegie Mellon University; Michael A. Kozuch, Intel Labs; Gregory R. Ganger, Carnegie Mellon University

https://www.usenix.org/conference/fast14/technical-sessions/presentation/xu


SpringFS: Bridging Agility and Performance in Elastic Distributed Storage

Lianghong Xu*, James Cipar*, Elie Krevat*, Alexey Tumanov*

Nitin Gupta*, Michael A. Kozuch†, Gregory R. Ganger*

*Carnegie Mellon University, †Intel Labs

Abstract

Elastic storage systems can be expanded or contracted to meet current demand, allowing servers to be turned off or used for other tasks. However, the usefulness of an elastic distributed storage system is limited by its agility: how quickly it can increase or decrease its number of servers. Due to the large amount of data they must migrate during elastic resizing, state-of-the-art designs usually have to make painful tradeoffs among performance, elasticity and agility.

This paper describes an elastic storage system, called SpringFS, that can quickly change its number of active servers, while retaining elasticity and performance goals. SpringFS uses a novel technique, termed bounded write offloading, that restricts the set of servers where writes to overloaded servers are redirected. This technique, combined with the read offloading and passive migration policies used in SpringFS, minimizes the work needed before deactivation or activation of servers. Analysis of real-world traces from Hadoop deployments at Facebook and various Cloudera customers and experiments with the SpringFS prototype confirm SpringFS's agility, show that it reduces the amount of data migrated for elastic resizing by up to two orders of magnitude, and show that it cuts the percentage of active servers required by 67–82%, outdoing state-of-the-art designs by 6–120%.

1 Introduction

Distributed storage can and should be elastic, just like other aspects of cloud computing. When storage is provided via single-purpose storage devices or servers, separated from compute activities, elasticity is useful for reducing energy usage, allowing temporarily unneeded storage components to be powered down. However, for storage provided via multi-purpose servers (e.g., when a server operates as both a storage node in a distributed filesystem and a compute node), such elasticity is even more valuable—providing cloud infrastructures with the freedom to use such servers for other purposes, as tenant demands and priorities dictate. This freedom may be particularly important for increasingly prevalent data-intensive computing activities (e.g., data analytics).

Data-intensive computing over big data sets is quickly becoming important in most domains and will be a major consumer of future cloud computing resources [7, 4, 3, 2]. Many of the frameworks for such computing (e.g., Hadoop [1] and Google's MapReduce [10]) achieve efficiency by distributing and storing the data on the same servers used for processing it. Usually, the data is replicated and spread evenly (via randomness) across the servers, and the entire set of servers is assumed to always be part of the data analytics cluster. Little-to-no support is provided for elastic sizing¹ of the portion of the cluster that hosts storage—only nodes that host no storage can be removed without significant effort, meaning that the storage service size can only grow.

Some recent distributed storage designs (e.g., Sierra [18], Rabbit [5]) provide for elastic sizing, originally targeted for energy savings, by distributing replicas among servers such that subsets of them can be powered down when the workload is low without affecting data availability; any server with the primary replica of data will remain active. These systems are designed mainly for performance or elasticity (how small the system size can shrink to) goals, while overlooking the importance of agility (how quickly the system can resize its footprint in response to workload variations), which we find has a significant impact on the machine-hour savings (and so the operating cost savings) one can potentially achieve. As a result, state-of-the-art elastic storage systems must make painful tradeoffs among these goals, unable to fulfill them at the same time. For example, Sierra balances load across all active servers and thus provides good performance. However, this even data layout limits elasticity—at least one third of the servers must always be active (assuming 3-way replication), wasting machine hours that could be used for other purposes when the workload is very low. Further, rebalancing the data layout when turning servers back on induces significant migration overhead, impairing system agility.

¹We use "elastic sizing" to refer to dynamic online resizing, down from the full set of servers and back up, such as to adapt to workload variations. The ability to add new servers, as an infrequent administrative action, is common but does not itself make a storage service "elastic" in this context; likewise with the ability to survive failures of individual storage servers.


In contrast, Rabbit can shrink its active footprint to a much smaller size (≈10% of the cluster size), but its reliance on Everest-style write offloading [16] induces significant cleanup overhead when shrinking the active server set, resulting in poor agility.

This paper describes a new elastic distributed storage system, called SpringFS, that provides the elasticity of Rabbit and the peak write bandwidth characteristic of Sierra, while maximizing agility at each point along a continuum between their respective best cases. The key idea is to employ a small set of servers to store all primary replicas nominally, but (when needed) offload writes that would go to overloaded servers to only the minimum set of servers that can satisfy the write throughput requirement (instead of all active servers). This technique, termed bounded write offloading, effectively restricts the distribution of primary replicas during offloading and enables SpringFS to adapt dynamically to workload variations while meeting performance targets with a minimum loss of agility—most of the servers can be extracted without needing any pre-removal cleanup. SpringFS further improves agility by minimizing the cleanup work involved in resizing with two more techniques: read offloading offloads reads from write-heavy servers to reduce the amount of write offloading needed to achieve the system's performance targets; passive migration delays migration work by a certain time threshold during server re-integration to reduce the overall amount of data migrated. With these techniques, SpringFS achieves agile elasticity while providing performance comparable to a non-elastic storage system.

Our experiments demonstrate that the SpringFS design enables significant reductions in both the fraction of servers that need to be active and the amount of migration work required. Indeed, its design for where and when to offload writes enables SpringFS to resize elastically without performing any data migration at all in most cases. Analysis of traces from six real Hadoop deployments at Facebook and various Cloudera customers shows the oft-noted workload variation and the potential of SpringFS to exploit it—SpringFS reduces the amount of data migrated for elastic resizing by up to two orders of magnitude, and cuts the percentage of active servers required by 67–82%, outdoing state-of-the-art designs like Sierra and Rabbit by 6–120%.

This paper makes three main contributions. First, to the best of our knowledge, it is the first to show the importance of agility in elastic distributed storage, highlighting the need to resize quickly (at times) rather than just hourly as in previous designs. Second, SpringFS introduces a novel write offloading policy that bounds the set of servers to which writes to overloaded primary servers are redirected. Bounded write offloading, together with read offloading and passive migration, significantly improves the system's agility by reducing the cleanup work during elastic resizing. These techniques apply generally to elastic storage with an uneven data layout. Third, we demonstrate the significant machine-hour savings that can be achieved with elastic resizing, using six real-world HDFS traces, and the effectiveness of SpringFS's policies at achieving a "close-to-ideal" machine-hour usage.

The remainder of this paper is organized as follows. Section 2 describes elastic distributed storage generally, the importance of agility in such storage, and the limitations of state-of-the-art data layout designs in fulfilling elasticity, agility and performance goals at the same time. Section 3 describes the key techniques in the SpringFS design and how they increase the agility of elastic resizing. Section 4 overviews the SpringFS implementation. Section 5 evaluates the SpringFS design.

2 Background and Motivation

This section motivates our work. First, it describes related work on elastic distributed storage, which provides different mechanisms and data layouts to allow servers to be extracted while maintaining data availability. Second, it demonstrates the significant impact of agility on the aggregate machine-hour usage of elastic storage. Third, it describes the limitations of state-of-the-art elastic storage systems and how SpringFS fills the significant gap between agility and performance.

2.1 Related Work

Most distributed storage is not elastic. For example, the cluster-based storage systems commonly used in support of cloud and data-intensive computing environments, such as the Google File System (GFS) [11] or the Hadoop Distributed File System (HDFS) [1], use data layouts that are not amenable to elasticity. HDFS, for example, uses a replication and data-layout policy wherein the first replica is placed on a node in the same rack as the writing node (preferably the writing node, if it contributes to DFS storage), and the second and third on random nodes in a randomly chosen rack different from the writing node's. In addition to load balancing, this data layout provides excellent availability properties—if the node with the primary replica fails, the other replicas maintain data availability; if an entire rack fails (e.g., through the failure of a communication link), data availability is maintained via the replica(s) in another rack. But such a data layout prevents elasticity by requiring that almost all nodes be active—no more than one node per rack can be turned off without a high likelihood of making some data unavailable.


Recent research [5, 13, 18, 19, 17] has provided new data layouts and mechanisms for enabling elasticity in distributed storage. Most notable are Rabbit [5] and Sierra [18]. Both organize replicas such that one copy of data is always on a specific subset of servers, termed primaries, so as to allow the remainder of the nodes to be powered down without affecting availability when the workload is low. When the workload increases, they can be turned back on. The same designs and data distribution schemes would allow servers to be used for other functions, rather than turned off, such as for higher-priority (or higher-paying) tenants' activities. Writes intended for servers that are inactive² are instead written to other active servers—an action called write availability offloading—and then later reorganized (when servers become active) to conform to the desired data layout.

Rabbit and Sierra build on a number of techniques from previous systems, such as write availability offloading and power gears. Narayanan, Donnelly, and Rowstron [15] described the use of write availability offloading for power management in enterprise storage workloads. The approach was used to redirect traffic from otherwise idle disks to increase periods of idleness, allowing the disks to be spun down to save power. PARAID [20] introduced a geared scheme to allow individual disks in a RAID array to be turned off, allowing the power used by the array to be proportional to its throughput.

Everest [16] is a distributed storage design that used write performance offloading³, rather than offloading to avoid turning on powered-down servers, in the context of enterprise storage. In Everest, disks are grouped into distinct volumes, and each write is directed to a particular volume. When a volume becomes overloaded, writes can be temporarily redirected to other volumes that have spare bandwidth, leaving the overloaded volume to handle only reads. Rabbit applies this same approach, when necessary, to address overload of the primaries.

SpringFS borrows the ideas of write availability and performance offloading from prior elastic storage systems. We expand on previous work by developing new offloading and migration schemes that effectively eliminate the painful tradeoff between agility and write performance in state-of-the-art elastic storage designs.

²We generally refer to a server as inactive when it is either powered down or reused for other purposes. Conversely, we call a server active when it is powered on and servicing requests as part of an elastic distributed storage system.

³Write performance offloading differs from write availability offloading in that it offloads writes from overloaded active servers to other (relatively idle) active servers for better load balancing. The Everest-style and bounded write offloading schemes are both types of write performance offloading.

2.2 Agility is important

By "agility", we mean how quickly one can change the number of servers effectively contributing to a service. For most non-storage services, such changes can often be completed quickly, as the amount of state involved is small. For distributed storage, however, the state involved may be substantial. A storage server can service reads only for data that it stores, which affects the speed of both removing and re-integrating a server. Removing a server requires first ensuring that all data is available on other servers, and re-integrating a server involves replacing data overwritten (or discarded) while it was inactive.

The time required for such migrations has a direct impact on the machine-hours consumed by elastic storage systems. Systems with better agility are able to more effectively exploit the potential of workload variation by more closely tracking workload changes. Previous elastic storage systems rely on very infrequent changes (e.g., hourly resizing in Sierra [18]), but we find that over half of the potential savings is lost with such an approach due to the burstiness of real workloads.

Figure 1: Workload variation in the Facebook trace. The shaded region represents the potential reduction in machine-hour usage with a 1-minute resizing interval.

As one concrete example, Figure 1 shows the number of active servers needed, as a function of time in the trace, to provide the required throughput in a randomly chosen 4-hour period from the Facebook trace described in Section 5. The dashed and solid curves bounding the shaded region represent the minimum number of active servers needed if using 1-hour and 1-minute resizing intervals, respectively. For each such period, the number of active servers corresponds to the number needed to provide the peak throughput in that period, as is done in Sierra to avoid significant latency increases. The area under each curve represents the machine time used for that resizing interval, and the shaded region represents the increased server usage (more than double) for the 1-hour interval. We observe similar burstiness and consequences of it across all of the traces.
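To make the machine-hour comparison concrete, the following sketch (hypothetical names and a synthetic trace; not the paper's analysis code) computes the active-server count and machine-hour usage for a given resizing interval, mirroring the area-under-the-curve comparison behind Figure 1.

```python
# Sketch: machine-hour usage for a given resizing interval.
# Assumptions (not from the paper's code): throughput_per_min is a list of
# aggregate throughput samples (one per minute) and server_tput is the
# throughput a single server can deliver.
import math

def machine_hours(throughput_per_min, server_tput, interval_min):
    """Sum active-server minutes when resizing every interval_min minutes.

    Within each interval the system must be provisioned for the peak
    demand of that interval (as Sierra does to avoid latency increases).
    """
    total_server_minutes = 0
    for start in range(0, len(throughput_per_min), interval_min):
        window = throughput_per_min[start:start + interval_min]
        peak = max(window)
        servers = max(1, math.ceil(peak / server_tput))
        total_server_minutes += servers * len(window)
    return total_server_minutes / 60.0

# Example: bursty demand loses most of the savings with coarse (hourly) resizing.
trace = [120, 80, 40, 400, 60, 50] * 40   # synthetic per-minute throughput
print(machine_hours(trace, server_tput=100, interval_min=1))
print(machine_hours(trace, server_tput=100, interval_min=60))
```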


2.3 Bridging Agility and Performance

Previous elastic storage systems overlook the importance of agility, focusing on performance and elasticity. This section describes the data layouts of state-of-the-art elastic storage systems, specifically Sierra and Rabbit, and how their layouts represent two specific points in the tradeoff space among elasticity, agility and performance. Doing so highlights the need for a more flexible elastic storage design that fills the void between them, providing greater agility and matching the best of each.

We focus on elastic storage systems that ensure data availability at all times. When servers are extracted from the system, at least one copy of all data must remain active to serve read requests. To do so, state-of-the-art elastic storage designs exploit data replicas (originally for fault tolerance) to ensure that all blocks are available at any power setting. For example, with 3-way replication⁴, Sierra stores the first replica of every block (termed the primary replica) in one third of the servers, and writes the other two replicas to the other two thirds of the servers. This data layout allows Sierra to achieve full peak performance due to balanced load across all active servers, but it limits the elasticity of the system by not allowing the system footprint to go below one third of the cluster size. We show in Section 5.2 that this limitation can have a significant impact on the machine-hour savings that Sierra can potentially achieve, especially during periods of low workload.

⁴We assume 3-way replication for all data blocks throughout this paper, which remains the default policy for HDFS. The data layout designs apply to other replication levels as well. Different approaches than Sierra, Rabbit and SpringFS are needed when erasure codes are used for fault tolerance instead of replication.

Figure 2: Primary data distribution for Rabbit without offloading (grey) and Rabbit with offloading (light grey). With offloading, primary replicas are spread across all active servers during writes, incurring significant cleanup overhead when the system shrinks its size.

Rabbit, on the other hand, is able to reduce its system footprint to a much smaller size (≈10% of the cluster size). It does so by storing the replicas according to an equal-work data layout, so that it achieves power proportionality for read requests. That is, read performance scales linearly with the number of active servers: if 50% of the servers are active, the read performance of Rabbit should be at least 50% of its maximum read performance. The equal-work data layout ensures that, with any number of active servers, each server is able to perform an equal share of the read workload. In a system storing B blocks, with p primary servers and x active servers, each active server must store at least B/x blocks so that reads can be distributed equally, with the exception of the primary servers. Since a copy of all blocks must be stored on the p primary servers, they each store B/p blocks. This ensures (probabilistically) that when a large quantity of data is read, no server must read more than the others and become a bottleneck. This data layout allows Rabbit to keep the number of primary servers (p = N/e², where e is Euler's number) very small. The small number of primary servers provides great agility—Rabbit is able to shrink its system size down to p without any cleanup work—but it can create bottlenecks for writes. Since the primary servers must store the primary replicas for all blocks, the maximum write throughput of Rabbit is limited by the maximum aggregate write throughput of the p primary servers, even when all servers are active. In contrast, Sierra is able to achieve the same maximum write throughput as that of HDFS, that is, the aggregate write throughput of N/3 servers (recall: N servers write 3 replicas for every data block).
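As a concrete illustration of the equal-work layout described above, the following sketch (hypothetical helper names; a simplification that ignores the multi-volume refinement and the exact rounding of p) computes the minimum number of blocks each server must hold so that p primaries keep a full copy and any set of active servers can serve an equal share of reads.

```python
# Sketch of the equal-work layout's per-server block targets (assumed
# simplification: single volume, B blocks, N servers, p ~ N / e^2).
import math

def equal_work_targets(B, N):
    p = max(1, round(N / math.e ** 2))      # number of primary servers (rounding is an assumption)
    targets = {}
    for x in range(1, N + 1):
        if x <= p:
            targets[x] = math.ceil(B / p)   # primaries hold a full copy: B/p blocks each
        else:
            targets[x] = math.ceil(B / x)   # server x holds at least B/x blocks
    return p, targets

p, targets = equal_work_targets(B=90000, N=30)
print(p)                        # 4 primaries for a 30-server cluster
print(targets[5], targets[30])  # higher-numbered servers hold fewer blocks
```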

Rabbit borrows write offloading from the Everest system [16] to solve this problem. When primary servers become the write performance bottleneck, Rabbit simply offloads writes that would go to heavily loaded servers across all active servers. While such write offloading allows Rabbit to achieve good peak write performance, comparable to unmodified HDFS due to balanced load, it significantly impairs system agility by spreading primary replicas across all active servers, as depicted in Figure 2. Consequently, before Rabbit shrinks the system size, cleanup work is required to migrate some primary replicas to the remaining active servers so that at least one complete copy of the data is still available after the resizing action. As a result, the improved performance from Everest-style write offloading comes at a high cost in system agility.

Figure 3: Elastic storage system comparison in terms of agility and performance. N is the total size of the cluster. p is the number of primary servers in the equal-work data layout. Servers with at least some primary replicas cannot be deactivated without first moving those primary replicas. SpringFS provides a continuum between Sierra's and Rabbit's (when no offload) single points in this tradeoff space. When Rabbit requires offload, SpringFS is superior at all points. Note that the y-axis is discontinuous.

Figure 3 illustrates the very different design points represented by Sierra and Rabbit, in terms of the tradeoffs among agility, elasticity and peak write performance. Read performance is the same for all of these systems, given the same number of active servers. The number of servers that store primary replicas indicates the minimal system footprint one can shrink to without any cleanup work. As described above, state-of-the-art elastic storage systems such as Sierra and Rabbit suffer from a painful tradeoff between agility and performance due to their use of a rigid data layout. SpringFS provides a more flexible design that offers the best-case elasticity of Rabbit, the best-case write performance of Sierra, and much better agility than either. To achieve the range of options shown, SpringFS uses an explicit bound on the offload set, where writes of primary replicas to overloaded servers are offloaded to only the minimum set of servers (instead of all active servers) that can satisfy the current write throughput requirement. This additional degree of freedom allows SpringFS to adapt dynamically to workload changes, providing the desired performance while maintaining system agility.

3 SpringFS Design and Policies

This section describes SpringFS's data layout, as well as the bounded write offloading and read offloading policies that minimize the cleanup work needed before deactivation of servers. It also describes the passive migration policy used during a server's re-integration to address data that was written during the server's absence.

3.1 Data Layout and Offloading Policies

Data layout. Regardless of write performance, the equal-work data layout proposed in Rabbit enables the smallest number of primary servers and thus provides the best elasticity among state-of-the-art designs.⁵ SpringFS retains such elasticity using a variant of the equal-work data layout, but addresses the agility issue incurred by Everest-style offloading when write performance bottlenecks arise. The key idea is to bound the distribution of primary replicas to a minimal set of servers (instead of offloading them to all active servers), given a target maximum write performance, so that the cleanup work during server extraction can be minimized. This bounded write offloading technique introduces a parameter called the offload set: the set of servers to which primary replicas are offloaded (and which, as a consequence, receive the most write requests). The offload set provides an adjustable tradeoff between maximum write performance and cleanup work. With a small offload set, few writes will be offloaded, and little cleanup work will be subsequently required, but the maximum write performance will be limited. Conversely, a larger offload set will offload more writes, enabling higher maximum write performance at the cost of more cleanup work to be done later. Figure 4 shows the SpringFS data layout and its relationship with the state-of-the-art elastic data layout designs. We denote the size of the offload set as m, the number of primary servers in the equal-work layout as p, and the total size of the cluster as N. When m equals p, SpringFS behaves like Rabbit and writes all data according to the equal-work layout (no offload); when m equals N/3, SpringFS behaves like Sierra and load balances all writes (maximum performance). As illustrated in Figure 3, the use of the tunable offload set allows SpringFS to achieve both end points and points in between.

⁵Theoretically, no other data layout can achieve a smaller number of primary servers while maintaining power proportionality for read performance.

Choosing the offload set. The offload set is not a rigid setting, but is determined on the fly to adapt to workload changes. Essentially, it is chosen according to the target maximum write performance identified for each resizing interval. Because servers in the offload set write one complete copy of the primary replicas, the size of the offload set is simply the maximum write throughput in the workload divided by the write throughput a single server can provide. Section 5.2 gives a more detailed description of how SpringFS chooses the offload set (and the number of active servers) given the target workload performance.
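A minimal sketch of that sizing rule, with hypothetical parameter names (the paper gives no code for this step): the offload set is at least the p primaries and never larger than the N/3 servers that suffice for full write bandwidth.

```python
# Sketch: size the offload set for one resizing interval.
# Assumed inputs: target_write_tput is the interval's peak write demand and
# server_write_tput is what one server can sustain; p and N are as in the text.
import math

def offload_set_size(target_write_tput, server_write_tput, p, N):
    m = math.ceil(target_write_tput / server_write_tput)
    # m ranges from p (pure equal-work layout, Rabbit-like)
    # up to N/3 (fully load-balanced writes, Sierra-like).
    return max(p, min(m, N // 3))
```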

Read offloading. One way to reduce the amount of cleanup work is to simply reduce the amount of write offloading that needs to be done to achieve the system's performance targets. When applications simultaneously read and write data, SpringFS can coordinate the read and write requests so that reads are preferentially sent to higher numbered servers, which naturally handle fewer write requests. We call this technique read offloading.

Despite its simplicity, read offloading allows SpringFS to increase write throughput without changing the offload set by taking read work away from the low numbered servers (which are the bottleneck for writes). When a read occurs, instead of randomly picking one among the servers storing the replicas, SpringFS chooses the server that has received the least number of total requests recently. (The one exception is when the client requesting the read has a local copy of the data. In this case, SpringFS reads the replica directly from that server to exploit machine locality.) As a result, lower numbered servers receive more writes while higher numbered servers handle more reads. Such a read/write distribution balances the overall load across all the active servers while reducing the need for write offloading.

Replica placement. When a block write occurs, SpringFS chooses target servers for the 3 replicas in the following steps. The primary replica is load balanced across (and thus bounded in) the m servers in the current offload set. (The one exception is when the client requesting the write is in the offload set. In this case, SpringFS writes the primary copy to that server, instead of the server with the least load in the offload set, to exploit machine locality.) For non-primary replicas, SpringFS first determines their target servers according to the equal-work layout. For example, the target server for the secondary replica would be a server numbered between p+1 and ep, and that for the tertiary replica would be a server numbered between ep+1 and e²p, both following the probability distribution indicated by the equal-work layout (lower numbered servers have higher probability of receiving the non-primary replicas). If the target server number is higher than m, the replica is written to that server. However, if the target server number is between p+1 and m (a subset of the offload set), the replica is instead redirected and load balanced across servers outside the offload set, as shown in the shaded regions in Figure 4. Such redirection of non-primary replicas reduces the write requests going to the servers in the offload set and ensures that these servers store only primary replicas.
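To make the placement steps concrete, here is a minimal sketch of the replica-placement rule just described (hypothetical helper names and a simplified uniform draw for the equal-work ranges; the real agent also handles client locality, inactive servers, and fault domains).

```python
# Sketch: choose target servers for one block's 3 replicas, following the
# bounded-write-offloading rule described above. Servers are numbered 1..N;
# servers 1..m form the current offload set. "least_loaded" stands in for
# SpringFS's request-count-based load estimate.
import math
import random

def place_replicas(p, m, N, least_loaded):
    e = math.e
    targets = []

    # Primary replica: load balanced across (and bounded to) the offload set.
    targets.append(least_loaded(range(1, m + 1)))

    # Non-primary replicas: draw from the equal-work ranges, then redirect
    # any draw that lands inside the offload set to a server outside it.
    for lo, hi in ((p + 1, int(e * p)), (int(e * p) + 1, int(e * e * p))):
        hi = min(hi, N)
        candidate = random.randint(lo, hi)      # simplified: uniform draw
        if candidate <= m:                      # would land in the offload set
            candidate = least_loaded(range(m + 1, N + 1))
        targets.append(candidate)
    return targets

# Example with p=4 primaries, an offload set of 8, and a 30-server cluster.
print(place_replicas(p=4, m=8, N=30, least_loaded=lambda servers: min(servers)))
```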

Figure 4: SpringFS data layout and its relationship with previous designs. The offload set allows SpringFS to achieve a dynamic tradeoff between the maximum write performance and the cleanup work needed before extracting servers. In SpringFS, all primary replicas are stored in the m servers of the offload set. The shaded regions indicate that writes of non-primary replicas that would have gone to the offload set (in SpringFS) are instead redirected and load balanced outside the set.

Fault tolerance and multi-volume support. The use of an uneven data layout creates new problems for fault tolerance and capacity utilization. For example, when a primary server fails, the system may need to re-integrate some non-primary servers to restore the primary replicas onto a new server. SpringFS includes the data layout refinements from Rabbit that minimize the number of additional servers that must be re-activated if such a failure happens. Writes that would have gone to the failed primary server are instead redirected to other servers in the offload set to preserve system agility. Like Rabbit, SpringFS also accommodates multi-volume data layouts in which independent volumes use distinct servers as primaries in order to allow small values of p without limiting storage capacity utilization to 3p/N.

3.2 Passive Migration for Re-integration

When SpringFS tries to write a replica according to its target data layout but the chosen server happens to be inactive, it must still maintain the specified replication factor for the block. To do this, another host must be selected to receive the write. Availability offloading is used to redirect writes that would have gone to inactive servers (which are unavailable to receive requests) to the active servers. As illustrated in Figure 5, SpringFS load balances availability-offloaded writes together with the other writes to the system. This results in the availability-offloaded writes going to the less-loaded active servers rather than adding to existing write bottlenecks on other servers.

Because of availability offloading, re-integrating a previously deactivated server is more than simply restarting its software. While the server can begin servicing its share of the write workload immediately, it can only service reads for blocks that it stores. Thus, filling it according to its place in the target equal-work layout is part of full re-integration.

Figure 5: Availability offloading. When SpringFS works in power-saving mode, some servers (n+1 to N) are deactivated. The shaded regions show that writes that would have gone to these inactive servers are offloaded to higher numbered active servers for load balancing.

When a server is reintegrated to address a workload increase, the system needs to make sure that the active servers will be able to satisfy the read performance requirement. One option is to aggressively restore the equal-work data layout before reintegrated servers begin servicing reads. We call this approach aggressive migration. Before anticipated workload increases, the migration agent would activate the right number of servers and migrate some data to the newly activated servers so that they store enough data to contribute their full share of read performance. The migration time is determined by the number of blocks that need to be migrated, the number of servers that are newly activated, and the I/O throughput of a single server. With aggressive migration, cleanup work is never delayed. Whenever a resizing action takes place, the property of the equal-work layout is obeyed—server x stores no less than B/x blocks.

SpringFS takes an alternate approach called passive migration, based on the observation that cleanup work when re-integrating a server is not as important as when deactivating a server (for which it preserves data availability), and that the total amount of cleanup work can be reduced by delaying some fraction of the migration work while performance goals are still maintained (which makes this approach better than aggressive migration). Instead of aggressively fixing the data layout (by activating the target number of servers in advance for a longer period of time), SpringFS temporarily activates more servers than would minimally be needed to satisfy the read throughput requirement and utilizes the extra bandwidth for migration work and to address the reduced number of blocks initially on each reactivated server. The number of extra servers that need to be activated is determined in two steps. First, an initial number is chosen to ensure that the number of valid data blocks still stored on the activated servers is more than the fraction of the read workload they need to satisfy, so that the performance requirement is met. Second, the number may be increased so that the extra servers provide enough I/O bandwidth to finish a fraction (1/T, where T is the migration threshold described below) of the migration work. To avoid migration work building up indefinitely, the migration agent sets a time threshold so that whenever a migration action takes place, it is guaranteed to finish within T minutes. With T > 1 (the default resizing interval), SpringFS delays part of the migration work while satisfying the throughput requirement. Because higher numbered servers receive more writes than their equal-work share, due to write offloading, some delayed migration work can be replaced by future writes, which reduces the overall amount of data migration. If T is too large, however, the cleanup work can build up so quickly that even activating all the servers cannot satisfy the throughput requirement. In practice, we find a migration threshold of T = 10 to be a good choice and use this setting for the trace analysis in Section 5. Exploring automatic setting of T is interesting future work.
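The following sketch (hypothetical names; the paper gives no code for this policy) captures the two-step sizing of the reactivated set under passive migration, under the simplifying assumption that a server's useful read share is proportional to the fraction of its target blocks it currently stores.

```python
# Sketch: how many servers to activate during re-integration under passive
# migration. Assumed inputs (all in blocks per resizing interval where
# applicable): required_read_tput, server_tput, per-candidate blocks_stored
# and blocks_target lists, pending_migration_blocks, and the threshold T.
import math

def servers_to_activate(required_read_tput, server_tput,
                        blocks_stored, blocks_target,
                        pending_migration_blocks, T):
    n = 0
    read_capacity = 0.0
    # Step 1: activate servers until the blocks they already store can carry
    # the required read share.
    while read_capacity < required_read_tput and n < len(blocks_stored):
        frac = min(1.0, blocks_stored[n] / max(1, blocks_target[n]))
        read_capacity += frac * server_tput
        n += 1
    # Step 2: possibly add servers so spare bandwidth can finish 1/T of the
    # pending migration work within this interval.
    spare = max(0.0, read_capacity - required_read_tput)
    shortfall = pending_migration_blocks / T - spare
    extra = max(0, math.ceil(shortfall / server_tput))
    return n + extra
```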

4 Implementation

SpringFS is implemented as a modified instance of the Hadoop Distributed File System (HDFS), version 0.19.1⁶. We build on a Scriptable Hadoop interface that we built into Hadoop to allow experimenters to implement policies in external programs that are called by the modified Hadoop. This enables rapid prototyping of new policies for data placement, read load balancing, task scheduling, and re-balancing. It also enables us to emulate both Rabbit and SpringFS in the same system, for better comparison. SpringFS mainly consists of four components: a data placement agent, a load balancer, a resizing agent and a migration agent, all implemented as Python programs called by the Scriptable Hadoop interface.

⁶0.19.1 was the latest Hadoop version when our work started. We have done a set of experiments to verify that HDFS performance differs little, on our experimental setup, between version 0.19.1 and the latest stable version (1.2.1). We believe our results and findings are not significantly affected by still using this older version of HDFS.

Data placement agent. The data placement agent determines where to place blocks according to the SpringFS data layout. Ordinarily, when an HDFS client wishes to write a block, it contacts the HDFS NameNode and asks where the block should be placed. The NameNode returns a list of pseudo-randomly chosen DataNodes to the client, and the client writes the data directly to these DataNodes. The data placement agent starts together with the NameNode, and communicates with the NameNode using a simple text-based protocol over stdin and stdout. To obtain a placement decision for the R replicas of a block, the NameNode writes the name of the client machine as well as a list of candidate DataNodes to the placement agent's stdin. The placement agent can then filter and reorder the candidates, returning a prioritized list of targets for the write operation. The NameNode then instructs the client to write to the first R candidates returned.
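A minimal sketch of such a stdin/stdout placement agent follows (the line format shown is an assumption for illustration, not the actual Scriptable Hadoop protocol): it reads the client name and candidate DataNodes, reorders them with a policy function, and prints the prioritized list back.

```python
# Sketch: skeleton of a stdin/stdout placement agent. The message format
# (client name followed by space-separated candidates, one request per line)
# is an assumption; the real Scriptable Hadoop protocol may differ.
import sys

def prioritize(client, candidates):
    # Placeholder policy: keep any client-local node first, then the rest.
    local = [c for c in candidates if c == client]
    rest = [c for c in candidates if c != client]
    return local + rest

def main():
    for line in sys.stdin:
        fields = line.split()
        if not fields:
            continue
        client, candidates = fields[0], fields[1:]
        ordered = prioritize(client, candidates)
        sys.stdout.write(" ".join(ordered) + "\n")
        sys.stdout.flush()          # the NameNode waits for each reply

if __name__ == "__main__":
    main()
```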

Load balancer. The load balancer implements the read offloading policy and preferentially sends reads to higher numbered servers, which handle fewer write requests, whenever possible. It keeps an estimate of the load on each server by counting the number of requests recently sent to each server. Every time SpringFS assigns a block to a server, it increments a counter for that server. To ensure that recent activity takes precedence, these counters are decayed by a factor of 0.95 every 5 seconds. While this does not give the exact load on each server, we find its estimates good enough (within 3% of optimal) for load balancing among relatively homogeneous servers.
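A small sketch of that decayed-counter load estimate (hypothetical class and method names; thread safety and persistence are ignored):

```python
# Sketch: request-count load estimator with periodic exponential decay,
# as used for read offloading. Names are illustrative, not SpringFS's API.
import time

class LoadEstimator:
    def __init__(self, decay=0.95, period=5.0):
        self.counts = {}            # server -> decayed request count
        self.decay = decay
        self.period = period
        self.last_decay = time.time()

    def record(self, server):
        """Called each time a read or write is assigned to a server."""
        self._maybe_decay()
        self.counts[server] = self.counts.get(server, 0.0) + 1.0

    def least_loaded(self, servers):
        """Pick the candidate with the smallest recent request count."""
        self._maybe_decay()
        return min(servers, key=lambda s: self.counts.get(s, 0.0))

    def _maybe_decay(self):
        now = time.time()
        while now - self.last_decay >= self.period:
            for s in self.counts:
                self.counts[s] *= self.decay
            self.last_decay += self.period
```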

Resizing agent. The resizing agent changes SpringFS's footprint by setting an activity state for each DataNode. On every read and write, the data placement agent and load balancer check these states and remove all "INACTIVE" DataNodes from the candidate list. Only "ACTIVE" DataNodes are able to service reads or writes. By setting the activity state for DataNodes, we allow the resources (e.g., CPU and network) of "INACTIVE" nodes to be used for other activities with no interference from SpringFS activities. We also modified the HDFS mechanisms for detecting and repairing under-replication to assume that "INACTIVE" nodes are not failed, so as to avoid undesired re-replication.

Migration agent. The migration agent crawls the entire HDFS block distribution (once) when the NameNode starts, and it keeps this information up to date by modifying HDFS to provide an interface to get and change the current data layout. It exports two metadata tables from the NameNode, mapping file names to block lists and blocks to DataNode lists, and loads them into a SQLite database. Any changes to the metadata (e.g., creating a file, creating or migrating a block) are then reflected in the database on the fly. When data migration is scheduled, the SpringFS migration agent executes a series of SQL queries to detect layout problems, such as blocks with no primary replica or hosts storing too little data. It then constructs a list of migration actions to repair these problems. After constructing the full list of actions, the migration agent executes them in the background. To allow block-level migration, we modified the HDFS client utility to have a "relocate" operation that copies a block to a new server. The migration agent uses GNU Parallel to execute many relocates simultaneously.
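As an illustration of the kind of layout check the migration agent runs, here is a sketch against a hypothetical two-table schema (the paper does not specify its table or column names): it finds blocks with no replica on any server in the offload set, i.e., blocks lacking a primary replica.

```python
# Sketch: detect blocks lacking a primary replica using SQLite.
# Schema and column names are assumptions for illustration:
#   blocks(block_id, file_name)
#   replicas(block_id, datanode)
import sqlite3

def blocks_without_primary(db_path, offload_set):
    conn = sqlite3.connect(db_path)
    placeholders = ",".join("?" for _ in offload_set)
    query = f"""
        SELECT b.block_id
        FROM blocks b
        WHERE NOT EXISTS (
            SELECT 1 FROM replicas r
            WHERE r.block_id = b.block_id
              AND r.datanode IN ({placeholders})
        )
    """
    rows = conn.execute(query, list(offload_set)).fetchall()
    conn.close()
    return [block_id for (block_id,) in rows]
```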

5 Evaluation

This section evaluates SpringFS and its offloading policies. Measurements of the SpringFS implementation show that it provides performance comparable to unmodified HDFS, that its policies improve agility by reducing the cleanup required, and that it can agilely adapt its number of active servers to provide required performance levels. In addition, analysis of six traces from real Hadoop deployments shows that SpringFS's agility enables significantly reduced commitment of active servers for the highly dynamic demands commonly seen in practice.

5.1 SpringFS prototype experiments

Experimental setup: Our experiments were run on a cluster of 31 machines. The modified Hadoop software is run within KVM virtual machines, for software management purposes, but each VM gets its entire machine and is configured to use all 8 CPU cores, all 8 GB of RAM, and 100 GB of local hard disk space. One machine was configured as the Hadoop master, hosting both the NameNode and the JobTracker. The other 30 machines were configured as slaves, each serving as an HDFS DataNode and a Hadoop TaskTracker. Unless otherwise noted, SpringFS was configured for 3-way replication (R = 3) and 4 primary servers (p = 4).

To simulate periods of high I/O activity, and to effectively evaluate SpringFS under different mixes of I/O operations, we used a modified version of the standard Hadoop TestDFSIO storage system benchmark, called TestDFSIO2. Our modifications allow each node to generate a mix of block-size (128 MB) reads and writes, distributed randomly across the block ID space, with a user-specified write ratio.

Except where otherwise noted, we specify a file size of 2 GB per node in our experiments, such that the single Hadoop map task per node reads or writes 16 blocks. The total time taken to transfer all blocks is aggregated and used to determine a global throughput. In some cases, we break down the throughput results into the average aggregate throughput of just the block reads or just the block writes. This enables comparison of SpringFS's performance to the unmodified HDFS setup with the same resources. Our experiments are focused primarily on the relative performance changes as agility-specific parameters and policies are modified. Because the original Hadoop implementation is unable to deliver the full performance of the underlying hardware, our system can only be compared reasonably with it and not with the capability of the raw storage devices.

Figure 6: Performance comparison of Rabbit with no offload, original HDFS, and SpringFS with varied offload set.

Effect of offloading policies: Our evaluation focuses on how SpringFS's offloading policies affect performance and agility. We also measure the cleanup work created by offloading and demonstrate that SpringFS's number of active servers can be adapted agilely to changes in workload intensity, allowing machines to be extracted and used for other activities.

Figure 6 presents the peak sustained I/O bandwidth measured for HDFS, Rabbit and SpringFS at different offload settings. (Rabbit and SpringFS are identical when no offloading is used.) In this experiment, the write ratio is varied to demonstrate different mixes of read and write requests. SpringFS, Rabbit and HDFS achieve similar performance for a read-only workload, because in all cases there is a good distribution of blocks and replicas across the cluster over which to balance the load. SpringFS's read performance slightly exceeds that of the original HDFS due to its explicit load tracking for balancing.

When no offloading is needed, both Rabbit and SpringFS are highly elastic and able to shrink by 87% (26 non-primary servers out of 30) with no cleanup work. However, as the write workload increases, the equal-work layout's requirement that one replica be written to the primary set creates a bottleneck and eventually a slowdown of around 50% relative to HDFS for a maximum-speed write-only workload. SpringFS provides the flexibility to trade off some amount of agility for better write throughput under periods of high write load. As the write ratio increases, the effect of SpringFS's offloading policies becomes more visible. Using only a small number of offload servers, SpringFS significantly reduces the amount of data written to the primary servers and, as a result, significantly improves performance over Rabbit. For example, increasing the offload set from four (i.e., just the four primaries) to eight doubles maximum throughput for the write-only workload, while remaining agile—the cluster is still able to shrink by 74% with no cleanup work.

Figure 7: Cleanup work (in blocks) needed to reduce the active server count from 30 to X, for different offload settings. The "(offload=6)", "(offload=8)" and "(offload=10)" lines correspond to SpringFS with bounded write offloading. The "(offload=30)" line corresponds to Rabbit using Everest-style write offloading. Deactivating only non-offload servers requires no block migration. The amount of cleanup work is linear in the number of target active servers.

Figure 7 shows the number of blocks that need to be relocated to preserve data availability when reducing the number of active servers. As desired, SpringFS's data placements are highly amenable to fast extraction of servers. Shrinking the number of nodes to a count exceeding the cardinality of the offload set requires no cleanup work. Decreasing the count into the write offload set is also possible, but comes at some cost. As expected, for a specified target, the cleanup work grows with an increase in the offload set size. SpringFS with no offload reduces to the base equal-work layout, which needs no cleanup work when extracting servers but suffers from write performance bottlenecks. The most interesting comparison is Rabbit's full offload (offload=30) against SpringFS's full offload (offload=10). Both provide the cluster's full aggregate write bandwidth, but SpringFS's offloading scheme does so with much greater agility—66% of the cluster could still be extracted with no cleanup work, and more with small amounts of cleanup. We also measured actual cleanup times, finding (not surprisingly) that they correlate strongly with the number of blocks that must be moved.

SpringFS's read offloading policy is simple and reduces the cleanup work resulting from write offloading. To ensure that its simplicity does not result in lost opportunity, we compare it to an optimal, oracular scheduling policy with complete knowledge of the HDFS layout. We use an Integer Linear Programming (ILP) model that minimizes the number of reads sent to primary servers from which primary replica writes are offloaded. The SpringFS read offloading policy, despite its simple realization, compares favorably, falling within 3% of optimal on average.

Figure 8: Agile resizing in a 3-phase workload.

Agile resizing in SpringFS: Figure 8 illustrates SpringFS's ability to resize quickly and deliver required performance levels. It uses a sequence of three benchmarks to create phases of workload intensity and measures performance for two cases: "SpringFS (no resizing)", where the full cluster stays active throughout the experiment, and "SpringFS (resizing)", where the system size is changed with workload intensity. As expected, the performance is essentially the same for the two cases, with a small delay observed when SpringFS re-integrates servers for the third phase. But the number of machine hours used is very different, as SpringFS extracts machines during the middle phase.

This experiment uses a smaller setup, with only 7 DataNodes, 2 primaries, 3 servers in the offload set, and 2-way replication. The workload consists of 3 consecutive benchmarks. The first benchmark is a TestDFSIO2 benchmark that writes 7 files, each 2 GB in size, for a total of 14 GB written. The second benchmark is one SWIM job [9] randomly picked from a series of SWIM jobs synthesized from a Facebook trace, which reads 4.2 GB and writes 8.4 GB of data. The third benchmark is also a TestDFSIO2 benchmark, but with a write ratio of 20%. The TestDFSIO2 benchmarks are I/O intensive, whereas the SWIM job consumes only a small amount of the full I/O throughput. For the resizing case, 4 servers are extracted after the first write-only TestDFSIO2 benchmark finishes (shrinking the active set to 3), and those servers are reintegrated when the second TestDFSIO2 job starts. In this experiment, the resizing points are set manually at the phase switches. Automatic resizing could build on previous work on workload prediction [6, 12, 14].

The results in Figure 8 are an average of 10 runs for both cases, shown with a moving average over 3 seconds. The I/O throughput is calculated by summing the read throughput and the write throughput multiplied by the replication factor. Decreasing the number of active SpringFS servers from 7 to 3 does not have an impact on its performance, since no cleanup work is needed. As expected, resizing the cluster from 3 nodes to 7 imposes a small performance overhead due to background block migration, but the number of blocks to be migrated is very small—about 200 blocks are written to SpringFS with only 3 active servers, but only 4 blocks need to be migrated to restore the equal-work layout. SpringFS's offloading policies keep the cleanup work small in both directions. As a result, SpringFS extracts and re-integrates servers very quickly.

5.2 Policy analysis with real-world traces

This subsection evaluates SpringFS in terms of machine-hour usage with real-world traces from six industry Hadoop deployments and compares it against three other storage systems: Rabbit, Sierra, and the default HDFS. We evaluate each system's layout policies with each trace, calculate the amount of cleanup work and the estimated cleaning time for each resizing action, and summarize the aggregate machine-hour usage consumed by each system for each trace. The results show that SpringFS significantly reduces machine-hour usage even compared to the state-of-the-art elastic storage systems, especially for write-intensive workloads.

Trace overview: We use traces from six real Hadoop deployments representing a broad range of business activities, one from Facebook and five from different Cloudera customers. The six traces are described and analyzed in detail by Chen et al. [8]. Table 1 summarizes key statistics of the traces. The Facebook trace (FB) comes from Hadoop DataNode logs, with each record containing a timestamp, an operation type (HDFS read or HDFS write), and the number of bytes processed. From this information, we calculate the aggregate HDFS read/write throughput as well as the total throughput, which is the sum of the read throughput and the write throughput multiplied by the replication factor (3 for all the traces). The five Cloudera customer traces (CC-a through CC-e, using the terminology from [8]) all come from Hadoop job history logs, which contain per-job records of job duration, HDFS input/output size, etc. Assuming the amount of HDFS data read or written by each job is distributed evenly within the job's duration, we obtain the aggregate HDFS throughput at any given point in time, which is then used as input to the analysis program.
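A small sketch of that preprocessing step for the job-history traces (hypothetical record fields; the paper does not include its analysis scripts): each job's HDFS bytes are spread evenly over its duration and accumulated into a per-minute throughput timeline.

```python
# Sketch: build a per-minute aggregate throughput timeline from job-history
# records. Field names (start_min, duration_min, hdfs_bytes) are assumptions.
from collections import defaultdict

def throughput_timeline(jobs):
    timeline = defaultdict(float)   # minute index -> bytes transferred
    for job in jobs:
        per_min = job["hdfs_bytes"] / max(1, job["duration_min"])
        for t in range(job["start_min"], job["start_min"] + job["duration_min"]):
            timeline[t] += per_min
    if not timeline:
        return []
    end = max(timeline)
    return [timeline[t] for t in range(end + 1)]

jobs = [{"start_min": 0, "duration_min": 3, "hdfs_bytes": 300},
        {"start_min": 2, "duration_min": 2, "hdfs_bytes": 200}]
print(throughput_timeline(jobs))   # [100.0, 100.0, 200.0, 100.0]
```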

Table 1: Trace summary. CC is "Cloudera Customer" and FB is "Facebook". HDFS bytes processed is the sum of HDFS bytes read and HDFS bytes written.

Trace  Machines  Date  Length      Bytes processed
CC-a   <100      2011  1 month     69 TB
CC-b   300       2011  9 days      473 TB
CC-c   700       2011  1 month     13 PB
CC-d   400-500   2011  2.8 months  5 PB
CC-e   100       2011  9 days      446 TB
FB     3000      2010  10 days     10.5 PB

Trace analysis and results: To simplify the calculation, we make several assumptions. First, the maximum measured total throughput in each trace corresponds to the maximum aggregate performance across all the machines in the cluster. Second, the maximum throughput a single machine can deliver, not differentiating reads and writes, is derived from the maximum measured total throughput divided by the number of machines in the cluster. In order to calculate the machine-hour usage for each storage system, the analysis program needs to determine the number of active servers needed at any given point in time. It does this in the following steps. First, it determines the number of active servers needed in the imaginary "ideal" case, where no cleanup work is required at all, by dividing the total HDFS throughput by the maximum throughput a single machine can deliver. Second, it iterates through the number of active servers as a function of time. For each decrease in the active set of servers, it checks for any cleanup work that must be done by analyzing the data layout at that point. If any cleanup is required, it delays resizing until the work is done or the performance requirement demands an increase of the active set, to allow additional bandwidth for the necessary cleanup work. For increases in the active set of servers, it turns on some extra servers to satisfy the read throughput and uses the extra bandwidth to do a fraction of the migration work, using the passive migration policy (for all the systems) with the migration threshold set to T = 10.
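As a rough sketch of the baseline step in that analysis (hypothetical function names; the cleanup and migration modeling for Rabbit, Sierra and SpringFS is omitted), the ideal active-server count for each interval is simply the demanded throughput divided by a single machine's throughput:

```python
# Sketch: the "ideal" active-server count used as the baseline in the trace
# analysis; it assumes no cleanup work is ever required.
import math

def ideal_active_servers(total_tput_per_interval, machine_tput, min_servers=1):
    return [max(min_servers, math.ceil(t / machine_tput))
            for t in total_tput_per_interval]
```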

Figure 9: Facebook trace.

Figure 10: Traces CC-a, CC-b, CC-c, CC-d, and CC-e.

Figures 9 and 10 show the number of active servers needed, as a function of time, for the six traces. Each graph has four lines, corresponding to the "ideal" storage system, SpringFS, Rabbit and Sierra, respectively. We do not show a line for the default HDFS, but since it is not elastic, its curve would be a horizontal line with the number of active servers always being the full cluster size (the highest value on the y-axis). While the original trace durations range from 9 days to 2.8 months, we only show a 4-hour period for each trace for clarity. We start trace replay more than 3 days before the 4-hour period, to make sure it represents the systems in a steady state and includes the effect of delaying migration work.

As expected, SpringFS exhibits better agility than Rabbit, especially when shrinking the size of the cluster, since it needs no cleanup work until resizing down into the offload set. This agility difference between SpringFS and Rabbit is visible in Figure 9 at various points in time (e.g., at minutes 110, 140, and 160). The gap between the two lines indicates the number of machine hours saved due to the agility-aware read and bounded write offloading policies used in SpringFS. SpringFS also achieves lower machine-hour usage than Sierra, as confirmed in all the analysis graphs. While a Sierra cluster can shrink down to 1/3 of its total size without any cleanup work, it is not able to decrease the cluster size further. In contrast, SpringFS can shrink the cluster size down to approximately 10% of the original footprint. When I/O activity is low, the difference in minimal system footprint can have a significant impact on machine-hour usage (e.g., as illustrated in Figures 10(b), 10(c) and 10(e)). In addition, when expanding the cluster size, Sierra incurs more cleanup overhead than SpringFS, because its re-integrated servers need more data migrated to restore the even data layout. These results are summarized in Figure 11, which shows the extra number of machine hours used by each storage system, compared and normalized to the ideal system. In these traces, SpringFS outperforms the other systems by 6% to 120%. For the traces with a relatively high write ratio, such as the FB, CC-d and CC-e traces, SpringFS is able to achieve a "close-to-ideal" (within 5%) machine-hour usage. SpringFS is less close to ideal for the other three traces because they frequently need even fewer servers than the roughly 13% that are primaries, which SpringFS cannot deactivate.

Figure 12 summarizes the total amount of data migrated by Rabbit, Sierra and SpringFS while running each trace. With bounded write offloading and read offloading, SpringFS is able to reduce the amount of data migration by a factor of 9–208, as compared to Rabbit. SpringFS migrates significantly less data than Sierra as well, because the data migrated to restore the equal-work data layout is much less than that needed to restore an even data layout.

All of the trace analysis above assumes passive migration during server re-integration for all three systems compared, since it is useful to all of them. To evaluate the advantage of passive migration specifically, we repeated the same trace analysis using the aggressive migration policy. The results show that passive migration reduces the amount of data migrated, relative to aggressive migration, by 1.5–7× (across the six traces) for SpringFS, 1.2–5.6× for Sierra, and 1.2–3× for Rabbit. The benefit for Sierra and SpringFS is more significant, because their data migration occurs primarily during server re-integration.

Figure 11: Number of machine hours needed to execute each trace for each system, normalized to the "Ideal" system (1 on the y-axis, not shown).

Figure 12: Total data migrated for Rabbit, Sierra and SpringFS, normalized to the results for Rabbit.

6 Conclusion

SpringFS is a new elastic storage system that fills the space between state-of-the-art designs in the tradeoff among agility, elasticity, and performance. SpringFS's data layout and offloading/migration policies adapt to workload demands and minimize the data-redistribution cleanup work needed for elastic resizing, greatly increasing agility relative to the best previous elastic storage designs. As a result, SpringFS can satisfy the time-varying performance demands of real environments with many fewer machine hours. Such agility provides an important building block for resource-efficient data-intensive computing (a.k.a. Big Data) in multi-purpose clouds with competing demands for server resources.

There are several directions for interesting future work. First, the SpringFS data layout assumes that servers are approximately homogeneous, as HDFS does, but some real-world deployments end up with heterogeneous servers (in terms of I/O throughput and capacity) as servers are added and replaced over time. The data layout could be refined to exploit such heterogeneity, such as by using more powerful servers as primaries. Second, SpringFS's design assumes relatively even popularity of data within a given dataset, as exists for Hadoop jobs processing that dataset, so it will be interesting to explore what aspects change when addressing the unbalanced access patterns (e.g., Zipf distributions) common in servers hosting large numbers of relatively independent files.

Acknowledgements: We thank Cloudera and Facebook for sharing the traces and Yanpei Chen for releasing SWIM. We thank the members and companies of the PDL Consortium (including Actifio, APC, EMC, Facebook, Fusion-io, Google, HP Labs, Hitachi, Huawei, Intel, Microsoft Research, NEC Labs, NetApp, Oracle, Panasas, Samsung, Seagate, Symantec, VMware, and Western Digital) for their interest, insights, feedback, and support. This research was sponsored in part by Intel as part of the Intel Science and Technology Center for Cloud Computing (ISTC-CC). Experiments were enabled by generous hardware donations from Intel, NetApp, and APC.

References

[1] Hadoop, 2012. http://hadoop.apache.org.

[2] MIT, Intel unveil new initiatives addressing 'Big Data', 2012. http://web.mit.edu/newsoffice/2012/big-data-csail-intel-center-0531.html.

[3] AMPLab, 2013. http://amplab.cs.berkeley.edu.

[4] ISTC-CC Research, 2013. www.istc-cc.cmu.edu.

[5] H. Amur, J. Cipar, V. Gupta, G. R. Ganger, M. A. Kozuch, and K. Schwan. Robust and flexible power-proportional storage. In ACM Symposium on Cloud Computing, pages 217–228, 2010.

[6] P. Bodik, M. Armbrust, K. Canini, A. Fox, M. Jordan, and D. Patterson. A case for adaptive datacenters to conserve energy and improve reliability. Technical Report UCB/EECS-2008-127, University of California at Berkeley, 2008.

[7] R. E. Bryant. Data-Intensive Supercomputing: The Case for DISC. Technical report, Carnegie Mellon University, 2007.

[8] Y. Chen, S. Alspaugh, and R. Katz. Interactive analytical processing in big data systems: A cross-industry study of MapReduce workloads. VLDB, 2012.

[9] Y. Chen, A. Ganapathi, R. Griffith, and R. Katz. The case for evaluating MapReduce performance using workload suites. MASCOTS, 2011.

[10] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51:107–108, January 2008.

[11] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In SOSP, pages 29–43, 2003.

[12] D. Gmach, J. Rolia, L. Cherkasova, and A. Kemper. Workload analysis and demand prediction of enterprise data center applications. IISWC, 2007.

[13] J. Leverich and C. Kozyrakis. On the Energy (In)efficiency of Hadoop Clusters. HotPower, 2009.

[14] M. Lin, A. Wierman, L. L. Andrew, and E. Thereska. Dynamic right-sizing for power-proportional data centers. INFOCOM, 2011.

[15] D. Narayanan, A. Donnelly, and A. Rowstron. Write Off-Loading: Practical Power Management for Enterprise Storage. In USENIX Conference on File and Storage Technologies, pages 1–15, Berkeley, CA, USA, 2008. USENIX Association.

[16] D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, and A. Rowstron. Everest: Scaling down peak loads through I/O off-loading. In OSDI, 2008.

[17] Y. Saito, S. Frølund, A. Veitch, A. Merchant, and S. Spence. FAB: Building Distributed Enterprise Disk Arrays from Commodity Components. In ASPLOS, pages 48–58, 2004.

[18] E. Thereska, A. Donnelly, and D. Narayanan. Sierra: practical power-proportionality for data center storage. In EuroSys, pages 169–182, 2011.

[19] N. Vasic, M. Barisits, V. Salzgeber, and D. Kostic. Making Cluster Applications Energy-Aware. In Workshop on Automated Control for Datacenters and Clouds, pages 37–42, New York, 2009.

[20] C. Weddle, M. Oldham, J. Qian, A.-I. A. Wang, P. L. Reiher, and G. H. Kuenning. PARAID: A Gear-Shifting Power-Aware RAID. TOS, 2007.
