Probability-based Cloud Storage Providers Selection Algorithms with Maximum Availability

Chia-Wei Chang
Department of Computer Science and Information Engineering
National Taiwan University
Taipei, Taiwan
[email protected]

Pangfeng Liu
Department of Computer Science and Information Engineering,
Graduate Institute of Networking and Multimedia,
National Taiwan University
Taipei, Taiwan
[email protected]

Jan-Jan Wu
Institute of Information Science,
Research Center for Information Technology Innovation,
Academia Sinica
Taipei, Taiwan
[email protected]

Abstract—In recent years cloud service providers have successfully provided reliable and flexible resources to cloud users. For example, Amazon Elastic Block Store (Amazon EBS) and Simple Storage Service (Amazon S3) provide users storage in the cloud. Despite the tremendous efforts cloud service providers have devoted to the availability of their services, service interruption is still inevitable. Therefore, just as an Internet service provider will not count on a single network provider, a cloud user should not depend on a single cloud service provider either. However, cloud service providers provide different levels of service, and a more costly service is usually more reliable. As a result it is an important and challenging problem to choose among a set of service providers to fit one's needs, which could be budget, failure probability, or the amount of data that can survive failure. The goal of this paper is to select cloud service providers in order to maximize the benefits within a given budget. The contributions of this paper include a mathematical formulation of the cloud service provider selection problem in which both the objective functions and cost measurements are clearly defined, algorithms that select among cloud storage providers to maximize the data survival probability or the amount of surviving data subject to a fixed budget, and a series of experiments demonstrating that the proposed algorithms are efficient enough to find optimal solutions in a reasonable amount of time, using prices and failure probabilities taken from real cloud providers.

Keywords-Cloud Computing, service provider selection, failure probability, data replication, dynamic programming.

I. INTRODUCTION

In recent years cloud service providers have successfully provided reliable and flexible resources to cloud users. Amazon Elastic Compute Cloud (Amazon EC2) [7] provides users computation capacity in the cloud. Users can rent “computers” tailored to their needs, specifying the number of cores, the amount of memory, the amount of disk storage, etc. Similarly, Amazon Elastic Block Store (Amazon EBS) [6], Simple Storage Service (Amazon S3) [9], and Cloud Files [3] (Rackspace) provide users storage in the cloud. Users can rent storage tailored to their needs, specifying the amount of storage, the number of replicas, whether the replicas should be placed in the same data center, etc.

The storage infrastructure service is provided over the Internet; therefore, along with other resources like CPU and network bandwidth, this kind of service is called Infrastructure as a Service, or IaaS.

Scalable storage is listed as one of the top ten obstacles in the development of cloud computing in [12]. Tremendous amounts of data must be stored in the cloud in a reliable and scalable manner, and many cloud service providers offer cloud storage services. For example, Amazon Elastic Block Store (EBS) [6] provides block-level storage volumes for use with Amazon EC2 instances. Users can perform computation with EC2 on their data stored in EBS. This “linkage” between computation and data storage is crucial for computation models that do not provide persistent state between usages. Amazon also has a Simple Storage Service that provides a simple web services interface for storing and retrieving data [9]. As a result more and more companies and organizations are moving their data to cloud storage providers [38].

Availability of service is listed as another of the top ten obstacles in the development of cloud computing in [12]. Most cloud providers, including Amazon Simple Storage Service, Google Apps, GoGrid, and Rackspace, follow strict service-level agreements (SLAs) [10], [27], [24], [35] to assure their users that their services are highly available (HA). Despite the tremendous efforts cloud service providers have devoted to the availability of their services, service interruption is inevitable. For example, there were four major cloud service outages reported in 2008 alone. These services, including AWS, Google App Engine, and Gmail, were unavailable for 1.5 to 8 hours [12], [2]. The reasons for these outages include authentication service overload, software protocol errors, programming errors, and outages in other contact systems [12], [1]. In 2009 Rackspace reported a major outage in June [4]. Despite the efforts in data replication and the removal of possible single points of failure, disasters can still take out an entire data center and make a service unavailable.

It is believed that just as an Internet service provider will not count on a single network provider, a cloud user should not depend on a single cloud service provider [5], [12]. The reason is that using only one cloud service provider is itself a single point of failure, even though the failure probability is very small. In order to overcome this single point of failure, one should employ multiple cloud service providers [5].

Data lock-in is also one of the major obstacles in cloud computing [12]. When one places a huge amount of data into a particular cloud provider, it becomes very difficult to switch to another one. The reason is that although some cloud providers (e.g., Amazon S3) do not charge for data upload [9], they do charge for data download, which is decidedly the most costly part of data storage services. The storage providers encourage the use of data storage with free upload; once data is uploaded it becomes very costly to retrieve it or move it to another provider, and we have “data lock-in”. Therefore it would be wise to spread the data storage requirement among several providers, so that one will not be subject to sudden price increases or a single point of failure [5].

Cloud service providers provide different levels of service, and a more costly service is usually more reliable. For example, Amazon Simple Storage Service [9] provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. The same company also provides Reduced Redundancy Storage (RRS), a storage option within Amazon S3 that enables customers to reduce their costs by storing non-critical, reproducible data at lower levels of redundancy than Amazon S3's standard storage [8]. The price of RRS is therefore lower than that of standard S3: the price per GB of the first TB is $0.140 for S3 and $0.093 for RRS [9]. This is a good example of cloud storage providers offering different levels of reliability at different prices to suit different needs.

We will use the number of nines in the availability probability as the metric of availability [31]. For example, if the availability is 99.999% then we refer to it as five nines. Sripanidkulchai et al. [37] suggested that we can infer acceptable user expectations of service availability by observing the websites of popular web services, like Amazon, CNN, Ebay, Walmart, and Yahoo. For example, Ebay had five nines in its service availability in 2008, and all the other websites mentioned above had a service level of at least three nines in their availability.

The cloud storage broker is an emerging business in cloud computing [29], [39]. A cloud storage broker, just like a real estate broker, provides brokerage services to cloud users and cloud providers. A cloud storage broker aggregates clouds from public and private networks and delivers capacity and services in an automated, on-demand fashion [29]. A cloud storage broker must handle the heterogeneity of cloud storage providers and present its integration service as a single “image” to end users. Therefore a cloud storage broker must package different storage services, which are provided under different SLAs, into a single service that suits the specific service level agreement of a particular user.

The goal of this paper is to help cloud storage brokers, or any large enterprise cloud users, select cloud storage providers to suit their own SLAs, to avoid data lock-in, or to improve data availability. In particular, this paper focuses on selecting cloud service providers in order to maximize the benefits within a given budget. The benefits could be the probability that one can retrieve the data, or the expected amount of data one can still salvage, given that some providers may fail. Due to the diversity of cloud service providers, selecting them in order to achieve the maximum benefit under a given budget is a challenging and important problem in cloud computing.

The contributions of this paper are summarized as follows.
• A mathematical formulation of the cloud service provider selection problem in which both the objective functions and cost measurements are clearly defined.
• An algorithm that selects among cloud storage providers to maximize the data survival probability when some providers may fail, subject to a fixed budget.
• A dynamic programming algorithm that selects among cloud storage providers to maximize the expected number of surviving data blocks when some providers may fail, subject to a fixed budget.
• A series of experiments that demonstrate that the proposed algorithms are efficient enough to find optimal solutions in a reasonable amount of time, using prices and failure probabilities taken from real cloud providers.

We also would like to point out that since this paper focuses on the trade-off between storage costs and reliability, the model does not include other factors, e.g., synchronization, security, and responsiveness.

The rest of the paper is organized as follows. Section II describes related work. Sections III and IV formally formulate two cloud provider selection problems and propose optimal solutions. Section V describes the implementation of our algorithms and demonstrates their execution efficiency. Section VI concludes with a summary and possible future work.

II. RELATED WORKS

Abu-Libdeh et al. proposed RACS (Redundant Array of Cloud Storage), a cloud storage proxy that transparently stripes data across multiple cloud storage providers [5] to provide better availability and avoid data lock-in. The idea is to spread data among cloud service providers, just as disk arrays spread data among individual disks. They built and evaluated a prototype, estimated the costs using trace-driven simulations, and concluded that RACS can reduce the cost of switching storage vendors for a large organization by seven-fold [5].

Bowers et al. proposed HAIL, a cryptographic system that distributes redundant blocks of a file across multiple servers and allows a client to make sure that the file is not corrupted even when an attacker gains access to all servers [14]. Other algorithms in which a pool of servers themselves provide an “integrity proof” include [21], [25], [15], [28].

Chun et al. considered replication strategies for storage systems that aggregate the disks of many nodes spread over the Internet in a P2P manner [16]. The advantage of a P2P approach is to provide durability through the availability of individual storage nodes on the Internet; a system should be able to create new copies of data faster than permanent disk failures destroy the data. However, maintaining replication in a P2P system can be very expensive, because a network or host failure could lead to copying all data on a server over the Internet to guarantee a sufficient number of replicas. Other P2P storage systems include [19], [36], [18].

Li et al. proposed CloudCmp, a system that compares public cloud providers [30]. CloudCmp helps users pick a cloud that fits their needs by systematically comparing the performance and cost of cloud providers. CloudCmp also measures the elastic computing, persistent storage, and networking services offered by a cloud along metrics that directly reflect their impact on the performance of customer applications.

Zeng et al. described a cloud service architecture and key technologies for a service selection algorithm with adaptive performance and minimum cost [40]. The service selection algorithm running on the service proxy finds the most suitable service provider based on the service cost and gains, which include service unit price, distance, response time, traffic volume, storage space, etc. The goal is to maximize gain and minimize cost [40].

Ford et al. analyzed data availability at the distributed file system level by applying two mathematical models [20]. The first is a correlated failure model and the second is a Markov chain. The correlated failure model captures the observation that failures usually occur in bursts within a short time window. They used correlated failures to predict overall availability, and formulated a Markov chain that can predict data availability. This probability prediction scales up to file system instances of arbitrary size, and captures the interaction of failures with replication policies and recovery times [20]. Their focus is to predict the availability within a single data center. In contrast we are interested in multiple data center availability prediction. As a result we can leverage their work to obtain the availability of any single data center, and use this information as the input to our multiple data center availability prediction.

Sripanidkulchai et al. discussed three key requirements of cloud computing – large-scale deployment, high availability, and problem resolution [37]. In particular they proposed three directions for delivering cloud service with high availability. The first direction is to explore the technical differences between individual and enterprise sites that result in the observed gap in service availability. The second direction is to extend the architecture across different cloud service providers, which is the focus of this paper. The third direction is to develop new virtualization technologies. Although they pointed out the importance of using multiple service providers, no mathematical analysis of how to achieve high availability was given. In contrast, this paper focuses on optimizing the probability of successfully retrieving data spread among multiple providers via duplication.

The cloud provider selection problem is strongly related to the knapsack problem, in which we want to select a set of objects so that the sum of the benefits of the chosen objects is maximized, while the sum of the weights of the chosen objects does not exceed a given bound. It is well known that the general problem is NP-complete [22], but there is a pseudo-polynomial-time algorithm using dynamic programming [17], [13]. There are freely available programs that solve knapsack problems with dynamic programming or a branch-and-bound approach [11], [33], [32].

The unique contribution of this paper is a probability-based quantitative analysis of storage availability when one can replicate data among multiple providers. We focus on the analysis of the selection algorithms and demonstrate their efficiency by implementing them on a laptop computer. This work can be integrated into a cloud service provider selection system like RACS [5], which provides transparent data access among multiple service providers.

III. MINIMUM FAILURE PROBABILITY WITH GIVEN BUDGET

A. Problem Description

We want to replicate a fixed amount of data into n data centers, where the replication is subject to various cost and performance requirements. We assume that every data center has a sufficient amount of storage to store the entire data set, so it is possible to replicate all data in all data centers.

A data center has two parameters – a price and a failure probability. In order to store data at a data center we need to pay the price, and a data center can fail with the given failure probability. We also assume that data centers are independent because they are operated by different cloud vendors: the event that one data center fails is independent of the event that another data center fails. Although a large-scale catastrophe may disrupt more than one data center simultaneously, the major outages reported in [12] all involve only a single provider, so we assume that failures of independent providers are probabilistically independent.

We want to choose a subset S of data centers in which to replicate data. The total cost of S is the sum of the costs of using the data centers in S. Since we replicate data in every data center in S, we lose data only when all data centers in S fail. Now given a fixed budget B, how do we choose a subset S of data centers such that the total cost of using these data centers does not exceed B and the probability of losing data is minimized?

B. Problem Definition

Before we formally define the problem we first define the following terminology. Let D = {d1, . . . , dn} be the set of all data centers, and let B be the given budget.

1) Cost function: We use c(d) to denote the cost of placing data in a data center d. The cost of using all data centers in S to replicate the data, denoted by c(S), is the summation of the costs of placing data in the data centers of S, as in Equation (1).

c(S) = \sum_{d \in S} c(d) \quad (1)

2) Failure probability function: We use p(d) to denote the failure probability of data center d. Let p(S) denote the probability that we fail to retrieve data stored in S. Since we will be unable to retrieve the data only when all data centers in S fail, the failure probability under S is the product of the failure probabilities of all data centers in S, as in Equation (2). For example, if S consists of two data centers with failure probability 0.001 each, then p(S) = 10^{-6}.

p(S) = \begin{cases} 1 & \text{if } S = \emptyset \\ \prod_{d \in S} p(d) & \text{otherwise} \end{cases} \quad (2)

3) Feasible: A subset S of D is feasible if the cost of S is no more than the given budget B, that is, c(S) ≤ B.

4) Minimum-failure-fixed-budget problem: We now formally define the minimum-failure-fixed-budget problem. We would like to find the feasible subset S* of D that minimizes the failure probability.

C. Solution

We solve the minimum-failure-fixed-budget problem by transforming it into a knapsack problem. It is difficult to formulate the failure probability in Equation (2) as a summation, which is how the objective function of a knapsack problem is formulated. Nevertheless we can transform the product in Equation (2) into a summation as follows.

We first define a function l(d) to be the negative logarithm of the failure probability of a data center d, as in Equation (3). Since lg(x) is strictly increasing and lg(x) + lg(y) = lg(xy), l(S), the negative logarithm of the failure probability of S, equals the sum of l(d) over the data centers d in S, as in Equation (4).

l(d) = -\lg(p(d)) \quad (3)

l(S) = -\lg(p(S)) = \sum_{d \in S} l(d) \quad (4)

For example, with base-10 logarithms, two data centers with failure probabilities 10^{-2} and 10^{-3} have l values 2 and 3, and l(S) = 5 corresponds to a combined failure probability p(S) = 10^{-5}.

Now we can transform our minimum-failure-fixed-budget problem into a 0-1 knapsack problem as follows. Each data center in the original problem is an item in the knapsack problem, the cost c(d) is the weight of the item, the negative logarithm l(d) is the value of the item, and the budget B is the capacity of the sack.

D. Algorithm

Bellman [13] proposed an algorithm that solves the 0-1 knapsack problem with simple dynamic programming. The time complexity is O(nB), where n is the number of items and B is the capacity of the sack. After transforming our minimum-failure-fixed-budget problem into a 0-1 knapsack problem, we apply Bellman's algorithm to obtain the minimum failure probability.
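To make the transformation concrete, the following is a minimal C++ sketch of this solution. It is a sketch under the assumption of integer costs, using the standard space-optimized variant of Bellman's dynamic programming; the provider parameters are the Table I values used later in Section V.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    struct DataCenter { int cost; double failProb; };

    // Minimum achievable failure probability with budget B. Each data
    // center is a knapsack item with weight c(d) and value
    // l(d) = -log10(p(d)), as in Equations (3) and (4).
    double minFailureProbability(const std::vector<DataCenter>& dc, int B) {
        std::vector<double> value(B + 1, 0.0);  // value[b]: max sum of l(d) with cost <= b
        for (const DataCenter& d : dc) {
            double l = -std::log10(d.failProb);
            for (int b = B; b >= d.cost; --b)   // 0-1 knapsack: sweep budgets downward
                value[b] = std::max(value[b], value[b - d.cost] + l);
        }
        return std::pow(10.0, -value[B]);       // undo the logarithmic transform
    }

    int main() {
        std::vector<DataCenter> dc = {          // Table I parameters (Section V)
            {24, 0.0001}, {12, 0.0002}, {6, 0.0004}, {2, 0.0012}, {1, 0.0024}};
        printf("minimum failure probability, budget 45: %g\n",
               minFailureProbability(dc, 45)); // budget 45 affords all five centers
        return 0;
    }

With budget 45 all five data centers can be selected, and the returned failure probability is the product of the five failure probabilities, about 2.3 × 10^{-17}.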

IV. MAXIMUM VALIDNESS WITH GIVEN BUDGET

The previous minimum-failure-fixed-budget problem demands a solution that minimizes the overall failure probability under a given budget. This section describes a similar problem in which we would like to replicate chunks of data among data centers so that the expected number of surviving data chunks is maximized under a given budget.

A. Problem description

We would like to replicate m chunks of data of equal size into n data centers, and the assignment is subject to various cost and performance considerations. Similar to the previous minimum-failure-fixed-budget problem, let the set of data centers be D = {d1, . . . , dn}, and assume that every data center has a sufficient amount of storage to store all m chunks of data if necessary. A data center again has two parameters – a price per chunk and a failure probability that we will not be able to retrieve any data from it. Following the previous notation, we use c and p to denote the cost and failure probability functions of the data centers.

For each chunk of data we make r replicas and choose r data centers to store them. Note that each data chunk can choose a different set of r data centers. Now given a budget B, how do we replicate data chunks among data centers so that the total cost of using these data centers does not exceed B, and the expected amount of valid data, i.e., the chunks for which at least one replica resides in a data center that has not failed, is maximized?

B. Problem definition

1) Chunk assignment: A chunk assignment S is an m-tuple (S1, . . . , Sm), where Si indicates the set of data centers in which the i-th data chunk is stored. As a result |Si| = r for all i, and Si ⊆ D.

202

2) Cost function: Let c(d) denote the cost of placing one chunk of data in a data center d. Similarly let c(S) denote the cost of using all data centers in a set of data centers S to replicate one chunk of data; by definition c(S) is the summation of the costs of placing one chunk of data in each data center of S. Similarly we can define the cost of a chunk assignment S, denoted by c(S), to be the cost to replicate the m chunks of data according to S. By definition c(S) is the sum of the costs to replicate each data chunk according to S, as indicated in Equation (5).

c(S) = \sum_{i=1}^{m} c(S_i) \quad (5)

3) Failure probability function: Let p(d) denote the failure probability of data center d. We would like to derive the relation between p(d) and the expected number of surviving data chunks under a chunk assignment S.

4) Expected valid function: Let S be a set of data centers and let v be a random variable that denotes the number of surviving copies (0 or 1) of a data chunk if we place its replicas on all data centers in S. We refer to v as the validness of a data chunk under S. The value of v is 0 only when all data centers in S fail; otherwise v is 1. As a result the expected value of v, or expected validness, is as follows.

E(v) = 0 \times \prod_{d \in S} p(d) + 1 \times \left(1 - \prod_{d \in S} p(d)\right) \quad (6)
     = 1 - \prod_{d \in S} p(d) \quad (7)

For ease of notation let E(S) denote the expected validness of a data chunk when we replicate it at every data center in S. We want to emphasize that the function E is defined on a data center set S, so different sets S provide different levels of expected validness. Note that when S is an empty set we define the expected validness to be 0, since no data was stored; otherwise we define E(S) according to Equation (7). Please refer to Equation (8) for the complete definition of E(S).

E(S) = \begin{cases} 0 & \text{if } S = \emptyset \\ 1 - \prod_{d \in S} p(d) & \text{otherwise} \end{cases} \quad (8)

Now we can define the expected validness of placing m chunks of data under the chunk assignment S, denoted by E(S), to be the expected number of surviving data chunks. This is simply the summation of all E(Si), because each E(Si) gives the expected validness of one chunk, and the expected value of a sum of (not necessarily independent) random variables is the sum of the expected values of those random variables.

E(S) = \sum_{i=1}^{m} E(S_i) \quad (9)

5) Feasible: A chunk assignment S is feasible if the cost of S is no more than the budget B, that is, c(S) ≤ B.

6) Max-validness-fixed-budget: We now formally define the max-validness-fixed-budget problem. We would like to find a feasible chunk assignment S* that maximizes the expected number of valid chunks.
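As a small illustration of these definitions (a hypothetical example, not taken from the paper), the following C++ fragment computes the per-chunk cost c(S) and expected validness E(S) of Equations (5) and (8) for one set S of data centers.

    #include <cstdio>
    #include <vector>

    struct DataCenter { int costPerChunk; double failProb; };

    // E(S) of Equation (8): a chunk replicated in every center of S is
    // lost only when all of them fail.
    double expectedValidness(const std::vector<DataCenter>& S) {
        if (S.empty()) return 0.0;
        double allFail = 1.0;
        for (const DataCenter& d : S) allFail *= d.failProb;
        return 1.0 - allFail;
    }

    int main() {
        std::vector<DataCenter> S = {{2, 0.0012}, {1, 0.0024}};  // hypothetical pair
        int costS = 0;
        for (const DataCenter& d : S) costS += d.costPerChunk;   // c(S) = 3
        printf("c(S) = %d, E(S) = %.10f\n", costS, expectedValidness(S));
        // If an assignment uses this same S for every one of m chunks,
        // Equations (5) and (9) give cost m * c(S) and validness m * E(S).
        return 0;
    }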

C. Solution

We propose a dynamic programming algorithm to solve the max-validness-fixed-budget problem. The key to the dynamic programming is the definition of a function f. Let f(k, b) be the maximum expected number of valid data chunks from the first k data chunks when the cost is exactly b. We use dynamic programming to compute f(k, b) recursively.

1) Recursive Formula: We first derive a general recursive formula for f. Recall that f(k, b) is the maximum expected number of valid chunks for the first k chunks of data when the cost is exactly b. First we consider the possible ways to place the k-th chunk: we enumerate all possible sets of r data centers (denoted by S) that could store the k-th chunk of data, as if we were placing the k-th chunk.

Second, we consider all possible placements of the first k − 1 chunks that could lead to the current k and b. Since the cost of placing the k-th chunk is assumed to be c(S), we have cost b − c(S) left for the first k − 1 chunks. That is, we should consider all cases with cost b − c(S) for the first k − 1 chunks, where S is the data center set in which we intend to place the k-th chunk of data.

Finally we finish the recursive definition of f by deriving the function value of f. If we place the k-th chunk into S, the expected data validness will be the sum of the maximum expected validness from the first k − 1 chunks with cost b − c(S), plus the expected validness of the k-th chunk when we use S. This gives the recursive definition for f in Equation (10).

2) Terminal Cases: The final step is to consider two terminal conditions of f. First, the sum of the cost for the first k − 1 chunks and the cost of using S for the k-th chunk may exceed the budget b, i.e., b − c(S) may become negative. In this case we should not consider S for the k-th chunk, and simply set f to negative infinity, which means there is no solution for the given budget. Second, the value of f should be 0 when k and b are both zero, meaning that there is no chunk to store and no cost to pay. Combining the discussion above we derive the recursive formula for f in Equation (10).

203

f(k, b) = \begin{cases} -\infty & \text{if } b < 0 \vee k < 0 \\ 0 & \text{if } b = 0 \wedge k = 0 \\ \max_{|S| = r} \left( f(k - 1, b - c(S)) + E(S) \right) & \text{otherwise} \end{cases} \quad (10)

Let g(k, b) be the maximum expected number of valid data chunks from the first k chunks with a cost of no more than the given budget b. Note that for ease of computation we defined f(k, b) as the maximum expected validness for the first k chunks with an exact budget b; the function g differs from f in that it relaxes the exact budget constraint. As a result, g(m, B) is the E(S*) that we are looking for. We can compute g(m, B) from the function f by definition, as in Equation (11).

g(k, b) = \max_{0 \le b' \le b} f(k, b') \quad (11)

D. Algorithm

The following dynamic programming uses Equation (10) to solve the max-validness-fixed-budget problem. The pseudo code is in Algorithm 1. Note that since the cost of any additional chunk is at least 1, we can safely initialize f[k][b] to f[k][b − 1] when b is positive.

It is easy to see that the time complexity of Algorithm 1 is O(mB \binom{n}{r}). We can safely assume that, due to the economic advantage of large data centers [12], the number of cloud service providers will be limited to a small constant. The replication factor r is also a small constant, e.g., 2 or 3 in practice [23]. As a result the final time complexity of the dynamic programming is O(mB). It is interesting to note that the number of data chunks m can easily be adjusted to various circumstances. For example, if we divide the data into larger chunks, we can trade data validness for the speed of the dynamic programming, since m is reduced. Nevertheless, as we will see in the next section, Algorithm 1 is already very fast in computing the optimal selection for real-world parameters, even on a laptop computer.
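The following is a minimal C++ sketch of Algorithm 1 (a sketch assuming integer per-chunk costs; the r-subsets are enumerated with bitmasks using __builtin_popcount, a GCC builtin). It applies the f[k][b] ← f[k][b − 1] initialization discussed above, so the table entry f[m][B] directly yields g(m, B) = E(S*).

    #include <algorithm>
    #include <vector>

    struct DataCenter { int costPerChunk; double failProb; };

    double maxValidness(const std::vector<DataCenter>& dc, int m, int r, int B) {
        const double NEG_INF = -1e18;
        const int n = (int)dc.size();
        std::vector<int> cost;        // c(S) for each r-subset S
        std::vector<double> valid;    // E(S) for each r-subset S, Equation (8)
        for (int mask = 0; mask < (1 << n); ++mask) {
            if (__builtin_popcount(mask) != r) continue;
            int c = 0; double allFail = 1.0;
            for (int j = 0; j < n; ++j)
                if (mask & (1 << j)) { c += dc[j].costPerChunk; allFail *= dc[j].failProb; }
            cost.push_back(c);
            valid.push_back(1.0 - allFail);
        }
        // f[k][b]: max expected number of valid chunks among the first k
        // chunks with cost at most b (thanks to the f[k][b-1] relaxation).
        std::vector<std::vector<double>> f(m + 1, std::vector<double>(B + 1, NEG_INF));
        for (int b = 0; b <= B; ++b) f[0][b] = 0.0;
        for (int k = 1; k <= m; ++k)
            for (int b = 0; b <= B; ++b) {
                if (b > 0) f[k][b] = f[k][b - 1];   // relax the exact-cost constraint
                for (size_t s = 0; s < cost.size(); ++s)
                    if (b - cost[s] >= 0)
                        f[k][b] = std::max(f[k][b], f[k - 1][b - cost[s]] + valid[s]);
            }
        return f[m][B];                             // g(m, B) = E(S*)
    }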

V. EXPERIMENT

We conduct a series of experiments to verify the performance of our algorithms. Our algorithms provide optimal solutions, but we still need to verify that their execution time is fast enough to enable timely policy decisions for cloud systems in the real world.

We conduct our experiments on an Intel Core i5 laptop, to demonstrate that our algorithms are efficient enough to make timely decisions even on a laptop computer. The laptop has 4 GB of memory, and the C++ code was compiled with the MinGW32 g++ 4.5.2 compiler with optimization flag -O2.

We assume that we have 100 data chunks to store in five possible data centers, and that the duplication factor is two. Due to the economic advantage in favor of large data centers, the number of cloud service providers will be limited to a small constant, because a small provider will not be able to receive the same discounts in electricity, IT equipment, and network costs [12]. Also the number of chunks will be limited, since one can always aggregate data into larger chunks to speed up the decision process.

Algorithm 1: Max validness fixed budget
Input: the number of data centers n, the number of chunks m, the duplication factor r, the cost function c, the data surviving probability function E, a budget B
Output: the maximum expected value of non-lost data R

for b = 0 to B do
    f[0][b] ← 0
end
for k = 1 to m do
    f[k][0] ← −∞
    for b = 0 to B do
        if b > 0 then
            f[k][b] ← f[k][b − 1]
        end
        foreach combination S ∈ C(n, r) do
            if b − c(S) ≥ 0 then
                Z ← f[k − 1][b − c(S)] + E(S)
                f[k][b] ← max(f[k][b], Z)
            end
        end
    end
end
R ← f[m][B]

We then determine the prices and failure probabilities of cloud providers for our simulations. Amazon S3 provides two levels of service – standard storage with durability 99.999999999% and availability 99.99%, and Reduced Redundancy Storage with durability 99.99% and availability 99.99% [9]. Since this paper focuses on the probability of failing to provide the data, we use the availability probability 99.99% as the baseline in our simulations. The monthly usage fee per GB is $0.14 for the first TB, $0.125 for the next 49 TB, and $0.110 for the next 450 TB [9].

Google does not disclose its failure probability and only describes the service as “robust, scalable storage for your web application” [26]. The price is similar to Amazon S3 – the monthly usage fee per GB is $0.13 for the first TB, $0.12 for the next 9 TB, and $0.105 for the next 900 TB [26]. Microsoft Azure does not disclose the failure probability of its storage service either, but mentions that the replication factor is three [34]. The price is $0.14 per GB stored per month based on the daily average [34].

Based on the description above, we summarize the failure probabilities and the costs per chunk used in our simulation in Table I. Since we only have the failure probability of Amazon S3, we use it as the baseline and add four other providers, based on the assumption that a more reliable service is more expensive. We make this assumption because this paper focuses on the trade-off between reliability and cost; any other factors, like brand-name effects, are not considered in this paper. Therefore we assume that the product of the failure probability and the unit cost is a constant in the simulations; every row of Table I has cost × failure probability = 0.0024.

Table I
THE PRICE AND FAILURE PROBABILITY OF FIVE DATA CENTERS.

cost per chunk    failure probability
            24    0.0001
            12    0.0002
             6    0.0004
             2    0.0012
             1    0.0024

A. Minimum Failure Probability with Given Budget

We implemented Bellman's algorithm for the minimum failure probability with a given budget problem described in Section III. We ran the algorithm with the failure probabilities in Table I, and the results are illustrated in Figure 1, where the failure probability (in logarithmic scale) is plotted against the given budget. As the budget increases the failure probability decreases. Since the replication factor is set to two, the simulation easily achieves 99.99% (four nines) availability with a very small budget. That is, by replicating data in the two cheapest storage providers, whose failure probabilities are as high as 0.0024 and 0.0012 (a combined failure probability of 0.0024 × 0.0012 = 2.88 × 10^{-6}), we are able to achieve more than 99.99% availability.

Recall that we use the number of nines in the availability probability as the metric of availability [31]; for example, if the availability is 99.999% then we refer to it as five nines. If we increase the budget so that we can afford to use the most expensive providers, we can achieve more than 17 nines of availability.
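This metric is straightforward to compute from a failure probability; a one-line helper (an illustrative fragment, not from the paper) is:

    #include <cmath>

    // Number of nines of availability for a given failure probability,
    // e.g., nines(1e-5) returns 5 for 99.999% availability.
    double nines(double failProb) { return -std::log10(failProb); }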

Figure 1 also illustrates the 99.9% (three nines), 99.99% (four nines), 99.999% (five nines), and 99.9999% (six nines) availability levels. In practice six nines is sufficient for a highly available system [31]. We observe that, assuming the data centers have the failure probabilities in Table I, one can easily achieve six nines without using the most expensive data storage service.

[Figure 1: The failure probability with a given budget by Bellman's algorithm. Expected failure probability (logarithmic scale) versus budget for the five data centers, with three-, four-, five-, and six-nines reference lines.]

We also report the execution time of Bellman's algorithm in Figure 2. We tested our implementation of Bellman's algorithm for our minimum-failure-fixed-budget problem with the number of data centers ranging from 1 to 10 and budgets up to 10 million. The execution time increases as the budget increases, since the number of data center choices increases. Nevertheless our implementation is efficient enough to derive the answers for all cases within 0.5 seconds on an Intel Core i5 CPU.

[Figure 2: The execution time for computing the optimal selection under different numbers of data centers and different budgets. Time versus budget curves for 1 to 10 data centers.]

B. Maximum Validness With Given Budget

We implemented Algorithm 1 for the maximum expected validness with a given budget problem described in Section IV. We calculate the expected number of valid data chunks for a given budget using Algorithm 1, and the expected number of valid data chunks is plotted against the budget in Figure 3. If we use only the two cheapest providers in Table I, we can expect to recover 99.999712% of the data (1 − 0.0024 × 0.0012). On the other hand, if we use the two most expensive data centers, we can expect to recover 99.999998% of the data (1 − 0.0001 × 0.0002).

We can interpret the expected validness as the probability that a randomly picked data chunk is available. That is, if the expected validness is 99%, then a randomly picked chunk has a 99% probability of being among the 99% of chunks we can retrieve. Accordingly we can define availability as the expected fraction of valid data, i.e., validness. As a result we plot the six-nines line in Figure 3 and observe that using the least expensive data centers already achieves six nines. Note that in this aspect we are considering the availability of a random chunk, not the availability of all chunks.

[Figure 3: The expected number of data chunks that can be retrieved by Algorithm 1. Expected valid data (percent) versus budget, for 100 chunks, duplication factor 2, and 5 centers.]

We also observe that there are four almost linear “segments” in Figure 3. Within a segment the algorithm uses a fixed set of data centers, which we refer to as the basic set, plus some other more expensive and more reliable data centers, which we refer to as the extra set. As the budget increases the expected percentage of valid data chunks increases, and the extra validness is contributed by the more expensive extra set, which changes dynamically with the increasing budget. When the budget increases to an extent that allows the use of a more expensive basic set, the algorithm “switches” to the new basic set, and the validness then increases with a smaller slope. The reason is that the marginal benefit of using a more expensive extra set is decreasing, so we get less validness for the same amount of extra budget.

We calculate the maximum expected percentage of valid data with a given budget to understand the impact of the budget. The budget ranges from 300 to 4000 in increments of 50, so there are 75 different budgets to process. In addition, we compute the best placement of replicas in order to maximize the expected number of valid data chunks. Although there are 75 different cases to compute, the maximum expected number of valid data chunks and the chunk placements for all budgets can be calculated in about a second. This indicates that the algorithm can select cloud providers efficiently.

We also plot the expected number of lost data chunks (in logarithmic scale) against the budget in Figure 4. Since the expected number of valid data chunks increases as the budget increases, the logarithm of the expected number of lost data chunks decreases as the budget increases. We also plot the six-nines line in Figure 4.

[Figure 4: The number of lost data chunks, in logarithmic scale, by Algorithm 1. Expected lost data versus budget for 100 chunks, duplication factor 2, and 5 centers.]

We plot the probability of successfully retrieving the entire file with Algorithm 1 in Figure 5. Note that the objective function of Algorithm 1 is the expected number of data chunks that can be retrieved successfully, not the probability of recovering the entire file. Nevertheless we observe that the probability of recovering all chunks increases with the budget, except at a few singular points.

[Figure 5: The probability of fetching all data successfully by Algorithm 1, versus budget, with the six-nines reference line, for 100 chunks, duplication factor 2, and 5 centers.]

Now we relate the probability of successfully retrieving the entire file, denoted p, to the expected number of data chunks that can be retrieved successfully by Algorithm 1, denoted e. Note that the objective function of Algorithm 1 is e, which can be interpreted as the probability of successfully retrieving a random chunk of the file. We want to demonstrate that e is a good approximation of p, which is difficult to compute because the events of recovering two different chunks are dependent. To illustrate the effectiveness of this approximation we plot p (with values on the left) and e (with values on the right) together while varying the budget in Figure 6. Because the two quantities have very different ranges, we plot them on different scales. We observe that the two more or less increase with the same trend as the budget increases. This approximation is useful in practice because p is much harder to compute.
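To make the distinction between e and p concrete, the following sketch (an illustration under our own assumptions, not part of the paper's implementation) estimates p for a given chunk assignment by Monte Carlo simulation, sampling independent data center failures and checking whether every chunk keeps at least one live replica.

    #include <random>
    #include <vector>

    // assignment[i] lists the indices of the data centers holding chunk i.
    double estimateWholeFileProbability(const std::vector<double>& failProb,
                                        const std::vector<std::vector<int>>& assignment,
                                        int trials) {
        std::mt19937 rng(42);
        std::uniform_real_distribution<double> u(0.0, 1.0);
        std::vector<char> failed(failProb.size());
        int success = 0;
        for (int t = 0; t < trials; ++t) {
            for (size_t d = 0; d < failProb.size(); ++d)
                failed[d] = (u(rng) < failProb[d]);        // independent failures
            bool allSurvive = true;
            for (const std::vector<int>& S : assignment) {
                bool chunkSurvives = false;
                for (int d : S)
                    if (!failed[d]) { chunkSurvives = true; break; }
                if (!chunkSurvives) { allSurvive = false; break; }
            }
            success += allSurvive;
        }
        return (double)success / trials;
    }

Since the number of data centers n is a small constant, p can also be computed exactly by enumerating all 2^n failure patterns of the data centers and summing the probabilities of the patterns in which every chunk keeps a live replica; the Monte Carlo version above merely illustrates the dependence between chunks that share data centers.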

[Figure 6: Comparison of the probability p of retrieving all data with the expected percentage of recoverable data e, plotted together against budget on different scales.]

We report the execution time of Algorithm 1 in Figure 7. We set the number of chunks to 64 and the duplication factor to 3, and plot the execution time against the budget for up to 8 data centers. The execution time increases as the budget increases, as predicted by the O(mB) time complexity, where B is the budget and m is the number of chunks. Again our dynamic programming solution is efficient enough to solve all problems within 1.8 seconds on an Intel Core i5 CPU.

[Figure 7: The execution time for computing the optimal selection under different numbers of data centers and different budgets. Time versus budget for choosing from 3 to 8 data centers.]

VI. CONCLUSION AND FUTURE WORKS

This paper addresses the issues of selecting multiple cloud providers. We describe the motivation for having multiple cloud providers, and formally define a mathematical model for evaluating the quality of algorithms that select service providers. This model addresses both the objective functions and the cost measurements, on which optimization problems over their trade-off can be defined.

Based on the model, we derive two algorithms for selecting service providers with a given budget. One algorithm maximizes the data survival probability, and the other maximizes the expected number of surviving data blocks. We also conduct experiments to demonstrate that the proposed algorithms are efficient enough to find optimal solutions in a reasonable amount of time.

From the experiments we observe that replication is extremely effective in improving data availability. Using multiple data providers with much higher failure probabilities than the leading provider is sufficient to guarantee high availability. However, we must have a transparent data access mechanism to retrieve data among multiple providers; we believe that such a transparent mechanism is an enabling technology for leveraging the high availability of multiple providers. We also observe from the experiments that the expected number of valid data chunks is a good approximation of the probability of being able to retrieve all chunks.

We would like to extend this work in the following directions. First, we do not consider the costs of data transfer. Inevitably, once data is uploaded to a data center it will be downloaded for various purposes, and the data transfer may be expensive. We would like to extend our model to account for these extra costs during the selection procedure. Second, we want to evaluate the cost of switching service providers. The current algorithms are purely price- and availability-driven, and do not consider the possible scenario that we may need to switch providers in an emergency. This should also be taken into consideration during the selection procedure. Finally, we would like to extend the model to include the effects of replication itself. The positive effect is parallel access to data, and the negative effects include extra data communication for synchronizing replicas. These effects should be included in the final objective function that we want to optimize.

REFERENCES

[1] Amazon S3 July 2008 outage. http://www.networkworld.com/news/2008/072108-amazon-outages.html.

[2] Cloud services outage report. http://bit.ly/cloud_outage.

[3] Rackspace Cloud Files. http://www.rackspace.com/cloud/cloud_hosting_products/files/.

[4] Rackspace June 2009 outage. http://www.bbc.co.uk/blogs/technology/2009/10/the_sidekick_cloud_disaster.html, 2009.

[5] H. Abu-Libdeh, L. Princehouse, and H. Weatherspoon. RACS: a case for cloud storage diversity. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, pages 229–240, New York, NY, USA, 2010. ACM.

[6] Amazon. Amazon Elastic Block Store. http://aws.amazon.com/ebs/.

[7] Amazon. Amazon Elastic Compute Cloud. http://aws.amazon.com/ec2/.

[8] Amazon. Amazon Reduced Redundancy Storage. http://aws.amazon.com/s3/.

[9] Amazon. Amazon Simple Storage Service. http://aws.amazon.com/s3/.

[10] Amazon. Amazon Simple Storage Service service level agreement. http://aws.amazon.com/s3-sla/.

[11] R. Andonov, V. Poirriez, and S. V. Rajopadhye. Unbounded knapsack problem: Dynamic programming revisited. European Journal of Operational Research, 123(2):394–407, 2000.

[12] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. Above the clouds: A Berkeley view of cloud computing. Technical Report UCB/EECS-2009-28, EECS Department, University of California, Berkeley, Feb 2009.

[13] R. E. Bellman. Dynamic Programming. Dover Publications, Incorporated, 2003.

[14] K. D. Bowers, A. Juels, and A. Oprea. HAIL: A high-availability and integrity layer for cloud storage. Cryptology ePrint Archive, Report 2008/489, 2008. http://eprint.iacr.org/.

[15] C. Cachin and S. Tessaro. Asynchronous verifiable information dispersal. In Proceedings of the 24th IEEE Symposium on Reliable Distributed Systems, pages 191–202. IEEE Press, 2005.

[16] B.-G. Chun, F. Dabek, A. Haeberlen, E. Sit, H. Weatherspoon, M. F. Kaashoek, J. Kubiatowicz, and R. Morris. Efficient replica maintenance for distributed storage systems. In Proceedings of the 3rd Symposium on Networked Systems Design and Implementation (NSDI '06), May 2006.

[17] T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson. Introduction to Algorithms. McGraw-Hill Higher Education, 2nd edition, 2001.

[18] F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica. Wide-area cooperative storage with CFS. In SOSP '01: Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, pages 202–215, New York, NY, USA, 2001. ACM.

[19] F. Dabek, J. Li, E. Sit, J. Robertson, M. F. Kaashoek, and R. Morris. Designing a DHT for low latency and high throughput. In NSDI, pages 85–98. USENIX, 2004.

[20] J. A. Garay, R. Gennaro, C. Jutla, and T. Rabin. Secure distributed storage and retrieval. Theoretical Computer Science, 243:363–389, July 2000.

[21] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.

[22] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37:29–43, Oct. 2003.

[23] GoGrid. GoGrid service level agreement. http://www.gogrid.com/legal/sla.php.

[24] G. R. Goodson, J. J. Wylie, G. R. Ganger, and M. K. Reiter. Efficient Byzantine-tolerant erasure-coded storage. In Proceedings of the International Conference on Dependable Systems and Networks, June 2004, pages 135–144, 2004.

[25] Google. Google App Engine datastore. http://code.google.com/intl/en/appengine/docs/python/datastore/.

[26] Google. Google Apps service level agreement. http://www.google.com/apps/intl/en/terms/sla.html.

[27] J. Hendricks. Verifying distributed erasure-coded data. In Proceedings of the 26th ACM Symposium on Principles of Distributed Computing, pages 163–168. ACM Press, 2007.

[28] S. Higginbotham. Future of cloud computing – more clouds. Seriously. http://gigaom.com/cloud/future-of-cloud-computing-more-clouds-seriously/, 2011.

[29] A. Li, X. Yang, S. Kandula, and M. Zhang. CloudCmp: comparing public cloud providers. In Proceedings of the 10th Annual Conference on Internet Measurement, IMC '10, pages 1–14, New York, NY, USA, 2010. ACM.

[30] E. L. Marcus. The myth of the nines. http://searchstorage.techtarget.com/tip/The-myth-of-the-nines.

[31] S. Martello, D. Pisinger, and P. Toth. Dynamic programming and strong bounds for the 0-1 knapsack problem. Manage. Sci., 45:414–424, March 1999.

[32] S. Martello and P. Toth. Knapsack Problems: Algorithms and Computer Implementations. John Wiley & Sons, Inc., New York, NY, USA, 1990.

[33] Microsoft. Microsoft Azure. http://www.microsoft.com/taiwan/windowsazure/.

[34] Rackspace. Rackspace service level agreement. http://www.rackspacecloud.com/legal/cloudfilessla.

[35] S. C. Rhea, P. R. Eaton, D. Geels, H. Weatherspoon, B. Y. Zhao, and J. Kubiatowicz. Pond: The OceanStore prototype. In Proceedings of the FAST '03 Conference on File and Storage Technologies, March 31 – April 2, 2003, San Francisco, California, USA. USENIX, 2003.

[36] K. Sripanidkulchai, S. Sahu, Y. Ruan, A. Shaikh, and C. Dorai. Are clouds ready for large distributed applications? SIGOPS Oper. Syst. Rev., 44:18–23, April 2010.

[37] H. Stevens and C. Pettey. Gartner says cloud computing will be as influential as e-business. Gartner Newsroom, Online Edition. http://www.gartner.com/it/page.jsp?id=707508, 2008.

[38] N. Vekiarides. Is there a need for a cloud storage broker? http://cloudcomputing.sys-con.com/node/1974574, 2011.

[39] W. Zeng, Y. Zhao, and J. Zeng. Cloud service and service selection algorithm research. In Proceedings of the First ACM/SIGEVO Summit on Genetic and Evolutionary Computation, GEC '09, pages 1045–1048, New York, NY, USA, 2009. ACM.