DAX: A Widely Distributed Multi-tenant Storage Service for DBMS Hosting

Rui Liu
University of Waterloo
[email protected]

Ashraf Aboulnaga
University of Waterloo
[email protected]

Kenneth Salem
University of Waterloo
[email protected]

ABSTRACT

Many applications hosted on the cloud have sophisticated data management needs that are best served by a SQL-based relational DBMS. It is not difficult to run a DBMS in the cloud, and in many cases one DBMS instance is enough to support an application’s workload. However, a DBMS running in the cloud (or even on a local server) still needs a way to persistently store its data and protect it against failures. One way to achieve this is to provide a scalable and reliable storage service that the DBMS can access over a network. This paper describes such a service, which we call DAX. DAX relies on multi-master replication and Dynamo-style flexible consistency, which enables it to run in multiple data centers and hence be disaster tolerant. Flexible consistency allows DAX to control the consistency level of each read or write operation, choosing between strong consistency at the cost of high latency or weak consistency with low latency. DAX makes this choice for each read or write operation by applying protocols that we designed based on the storage tier usage characteristics of database systems. With these protocols, DAX provides a storage service that can host multiple DBMS tenants, scaling with the number of tenants and the required storage capacity and bandwidth. DAX also provides high availability and disaster tolerance for the DBMS storage tier. Experiments using the TPC-C benchmark show that DAX provides up to a factor of 4 performance improvement over baseline solutions that do not exploit flexible consistency.

1. INTRODUCTION

In this paper we present a storage service that is intended to support cloud-hosted, data-centric applications. There is a wide variety of cloud data management systems that such applications can use to manage their persistent data. At one extreme are so-called NoSQL database systems, such as BigTable [7] and Cassandra [20]. These systems are scalable and highly available, but their functionality is limited.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 39th International Conference on Very Large Data Bases, August 26th - 30th 2013, Riva del Garda, Trento, Italy.
Proceedings of the VLDB Endowment, Vol. 6, No. 4
Copyright 2013 VLDB Endowment 2150-8097/13/02... $ 10.00.

Figure 1: DAX architecture. (Diagram: applications connect to DBMS tenants; each DBMS tenant accesses the shared DAX storage tier, which spans multiple data centers, through a DAX client library.)

They typically provide simple key-based, record-level, read/write interfaces and limited (or no) support for application-defined transactions. Applications that use such systems directly must either tolerate or work around these limitations.

At the other extreme, applications can store their data using cloud-hosted relational database management systems (DBMS). For example, clients of infrastructure-as-a-service providers, such as Amazon, can deploy DBMS in virtual machines and use these to provide database management services for their applications. Alternatively, applications can use services such as Amazon’s RDS or Microsoft SQL Azure [4] in a similar way. This approach is best suited to applications that can be supported by a single DBMS instance, or that can be sharded across multiple independent DBMS instances. High availability is also an issue, as the DBMS represents a single point of failure – a problem typically addressed using DBMS-level high availability techniques. Despite these limitations, this approach is widely used because it puts all of the well-understood benefits of relational DBMS, such as SQL query processing and transaction support, in service of the application. This is the approach we focus on in this paper.

A cloud-hosted DBMS must have some means of persistently storing its database. One approach is to use a persistent storage service provided within the cloud and accessed over the network by the DBMS. An example of this is Amazon’s Elastic Block Store (EBS), which provides network-accessible persistent storage volumes that can be attached to virtual machines.

In this paper, we present a back-end storage service called DAX, for Distributed Application-controlled Consistent Store, that is intended to provide network-accessible persistent storage for hosted DBMS tenants. Each DBMS tenant, in turn, supports one or more applications. This architecture is illustrated in Figure 1.

We have several objectives for DAX’s design that set it apart from other distributed storage services:

• scalable tenancy: DAX must accommodate the aggregate storage demand (space and bandwidth) of all of its DBMS tenants, and it should be able to scale out to accommodate additional tenants or to meet growing demand from existing tenants. Our focus is on scaling out the storage tier to accommodate more tenants, not on scaling out individual DBMS tenants. A substantial amount of previous research has considered techniques for scaling out DBMS [12, 14, 22] and techniques like sharding [15] are widely used in practice. These techniques can also be used to scale individual hosted DBMS running on DAX. Furthermore, while DBMS scale-out is clearly an important issue, current trends in server hardware and software are making it more likely that an application can be supported by a single DBMS instance, with no need for scale-out. For example, it is possible at the time of this writing to rent a virtual machine in Amazon’s EC2 with 68 GB of main memory and 8 CPU cores, and physical servers with 1 TB of main memory and 32 or more cores are becoming commonplace. While such powerful servers reduce the need for elastic DBMS scale-out, we still need a scalable and resilient storage tier, which we provide with DAX.

• high availability and disaster tolerance: We expect the storage service provided by DAX to remain available despite failures of DAX servers (high availability), and even in the face of the loss of entire data centers (disaster tolerance). To achieve this, DAX replicates data across multiple geographically distributed data centers. If a DBMS tenant fails, DAX ensures that the DBMS can be restarted either in the same data center or in a different data center. The tenant may be unavailable to its applications while it recovers, but DAX will ensure that no durable updates are lost. Thus, DAX offloads part of the work of protecting a tenant DBMS from the effects of failures. It provides highly available and widely distributed access to the stored database, but leaves the task of ensuring that the database service remains available to the hosted DBMS.

• consistency: DAX should be able to host legacy DBMS tenants, which expect to be provided with a consistent view of the underlying stored database. Thus, DAX must provide sufficiently strong consistency guarantees to its tenants.

• DBMS specialization: DAX assumes that its clients have DBMS-like properties. We discuss these assumptions further in Section 2.

A common way to build strongly consistent, highly available data stores is through the use of synchronous master-slave replication. However, such systems are difficult to distribute over wide areas. Therefore, we have instead based DAX on multi-master replication and Dynamo-style flexible consistency [16]. Flexible consistency means that clients (in our case, the DBMS tenants) can perform either fast read operations for which the storage system can provide only weak consistency guarantees or slower read operations with strong consistency guarantees. Similarly, clients can perform fast writes with weak durability guarantees, or relatively slow writes with stronger guarantees.

As we show in Section 3, these performance/consistency and performance/durability tradeoffs can be substantial, especially in widely distributed systems. DAX controls these tradeoffs automatically on behalf of its DBMS tenants. Its goal is to approach the performance of the fast weakly consistent (or weakly durable) operations, while still providing strong guarantees to the tenants. It does this in part by taking advantage of the specific nature of the workload for which it is targeted, e.g., the fact that each piece of stored data is used by only one tenant.

This paper makes several research contributions. First, we present a technique called optimistic I/O, which controls the consistency of the storage operations issued by the DBMS tenants. Optimistic I/O aims to achieve performance approaching that of fast weakly consistent operations while ensuring that the DBMS tenant sees a sufficiently consistent view of storage.

Second, we describe DAX’s use of client-controlled synchronization. Client-controlled synchronization allows DAX to defer making guarantees about the durability of updates until the DBMS tenant indicates that a guarantee is required. This allows DAX to hide some of the latency associated with updating replicated data. While client-controlled synchronization could potentially be used by other types of tenants, it is a natural fit with DBMS tenants, which carefully control update durability in order to implement database transactions.

Third, we define a consistency model for DAX that accounts for the presence of explicit, DBMS-controlled synchronization points. The model defines the consistency requirements that a storage system needs to guarantee to ensure correct operation when a DBMS fails and is restarted after the failure (possibly in a different data center).

Finally, we present an evaluation of the performance of DAX running in Amazon’s EC2 cloud. We demonstrate its scalability and availability using TPC-C workloads, and we also measure the effectiveness of optimistic I/O and client-controlled synchronization.

2. SYSTEM OVERVIEW

The system architecture illustrated in Figure 1 includes a client library for each DBMS tenant and a shared storage tier. The DAX storage tier provides a reliable, widely distributed, shared block storage service. Each tenant keeps all of its persistent data, including its database and logs, in DAX. We assume each DBMS tenant views persistent storage as a set of files which are divided into fixed-size blocks. The tenants issue requests to read and write blocks. The DAX client library intercepts these requests and redirects them to the DAX storage tier.

Since the tenants are relational DBMS, there are some constraints on the workload that is expected by DAX. First, since each tenant manages an independent database, the storage tier assumes that, at any time, only a single client can update each stored block. The single writer may change over time. For example, a tenant DBMS may fail and be restarted, perhaps in a different data center. However, at any time, each block is owned by a single tenant. Second, because each DBMS tenant manages its own block buffer cache, there are no concurrent requests for a single block. A DBMS may request different blocks concurrently, but all requests for any particular block are totally ordered.

Since each DBMS tenant reads and writes a set of non-overlapping blocks, one way to implement the DAX storage tier is to use a distributed, replicated key/value store.

To do this, we can use block identifiers (file identifier plus block offset) as unique keys, and the block’s contents as the value. In the remainder of this section, we describe a baseline implementation of the DAX storage tier based on this idea.
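As a concrete illustration of this key/value mapping, the following minimal Python sketch (our own illustration, not code from the DAX prototype) shows one way a client library might derive a block key from a file identifier and a byte offset, assuming fixed-size blocks:

BLOCK_SIZE = 4096  # assumed block size in bytes

def block_key(file_id: str, byte_offset: int) -> str:
    """Map (file identifier, byte offset) to a unique block key."""
    block_no = byte_offset // BLOCK_SIZE
    return f"{file_id}:{block_no}"

# Example: the third block of a (hypothetical) tenant log file.
key = block_key("tenant42/redo.log", 2 * BLOCK_SIZE)
# key == "tenant42/redo.log:2"; the block's contents are stored as the value.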

We will use as our baseline a distributed, replicated, multi-master, flexible consistency, key/value system. Examples of such systems include Dynamo [16], Cassandra [20], and Voldemort [21]. The multi-master design of these systems enables them to run in geographically distributed data centers, which we require for disaster tolerance in DAX. Next, we briefly describe how such a system would handle DBMS block read and write requests, using Cassandra as our model.

In Cassandra, each stored value (a block, in our case) is replicated N times in the system, with placement determined by the key (block identifier). A client connects to any Cassandra server and submits a read or write request. The server to which the client connects acts as the coordinator for that request. If the request is a write, the coordinator determines which servers hold copies of the block, sends the new value of the block to those servers, and waits for acknowledgements from at least a write quorum W of the servers before sending an acknowledgement to the client. Similarly, if the request is a read, the coordinator sends a request for the block to those servers that have it and waits for at least a read quorum R of servers to respond. The coordinator then sends to the client the most recent copy of the requested block, as determined by timestamps (details in the next paragraph). The Cassandra client can balance consistency, durability, and performance by controlling the values of R and W. As we will illustrate in the next section, these tradeoffs can be particularly significant when the copies are distributed across remote data centers, as message latencies may be long.

Since a DBMS normally stores data in a file system or directly on raw storage devices, it expects that when it reads a block it will obtain the most recent version of that block, and not a stale version. In classic quorum-based multi-master systems [17], this is ensured by requiring that R + W > N. Global write-ordering can be achieved by requiring that W > N/2, with the write order determined by the order in which write operations obtain their quorums. Cassandra takes a slightly different approach to ordering writes: clients are expected to supply a timestamp with each write operation, and these timestamps are stored with the data in Cassandra. The global write-ordering is determined by these client-defined timestamps. That is, the most recent write is defined to be the write with the largest timestamp. Thus, in our baseline DAX implementation, we can ensure that each read sees the most recent version of the block as follows. First, the client library chooses monotonically increasing timestamps for each block write operation issued by its DBMS tenant. This is easy to do since all updates of a given block originate from one tenant and are not concurrent. Second, the client submits read and write requests on behalf of its tenant, choosing any R and W such that R + W > N. This ensures that each read of a block will see the most recent preceding update.
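The following Python sketch illustrates this baseline scheme. It is our own simplified illustration, not DAX code: `store` stands for any quorum-based key/value store that accepts a per-request replica count (its `read`/`write` calls are hypothetical), and the counter stands in for the client library's monotonic timestamps.

import itertools

N = 6                      # replicas per block (assumed)
R, W = 4, 4                # any choice with R + W > N works for the single-writer blocks
assert R + W > N

_ts = itertools.count(1)   # monotonically increasing timestamps; one writer per block

def baseline_write(store, block_id, data):
    ts = next(_ts)
    # Coordinator forwards to all replicas and waits for W acknowledgements.
    store.write(block_id, data, timestamp=ts, replicas_to_ack=W)

def baseline_read(store, block_id):
    # Wait for R replicas; the copy with the largest timestamp is the latest,
    # because R + W > N guarantees overlap with the last acknowledged write.
    replies = store.read(block_id, replicas_to_wait=R)
    return max(replies, key=lambda r: r.timestamp).data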

There are two potentially significant drawbacks to this baseline. First, it may perform poorly, especially in a geographically distributed setting, because of the requirement that R + W > N. Second, failures may impact the performance and availability of such a system. In the remainder of the paper, we describe how DAX addresses these drawbacks.

3. OPTIMISTIC I/O

We conducted some simple experiments to quantify the performance tradeoffs in widely distributed, multi-master, flexible consistency systems, like the baseline system described in the previous section. We again used Cassandra as a representative system and deployed it in Amazon’s EC2 cloud. In each experiment we used a small 6-server Cassandra cluster deployed in one of three different configurations:

1 zone: All six servers are in the same EC2 availability zone, which is roughly analogous to a data center.

1 region: The servers are split evenly among three availability zones, all of which are located in the same geographic region (EC2’s US East Region, in Virginia).

3 regions: The servers are split evenly among three availability zones, with one zone in each of three geographically distributed EC2 regions: US East, US West-1 (Northern California), and US West-2 (Oregon).

We used the Yahoo! Cloud Serving Benchmark [10] to provide a Cassandra client application which reads or writes values from randomly selected rows in a billion-key Cassandra column family (a collection of tuples), at a rate of 1000 requests/second. The client runs in the US East region. In these experiments, the degree of replication (N) is 3, and we ensure that one copy of each row is located in each zone in the multi-zone configurations. The client can be configured to generate read operations with R = 1, R = N/2 + 1, or R = N, henceforth referred to as Read(1), Read(QUORUM), and Read(ALL) respectively. Similarly, writes can be configured with W = 1 (Write(1)), W = N/2 + 1 (Write(QUORUM)), or W = N (Write(ALL)). We measured the latency of these operations in each Cassandra configuration.

Figure 2(a) illustrates the read and write latencies for the 1-zone configuration, in which we expect the servers to have low-latency interconnectivity. In this configuration, Read(ALL) and Write(ALL) requests take roughly 60% longer than Read(1) and Write(1), with the latency of Read(QUORUM) and Write(QUORUM) lying in between. The situation is similar in the 1-region configuration (Figure 2(b)), although the latency penalty for reading or writing a quorum of copies or all copies is higher than in the 1-zone configuration.

In the 3-region configuration (Figure 2(c)), latencies for Read(ALL) and Write(ALL) operations, as well as Read(QUORUM) and Write(QUORUM), are significantly higher, almost ten times higher than latencies in the 1-region configuration. The performance tradeoff between Read(1) and Read(QUORUM) or Read(ALL), and between Write(1) and Write(QUORUM) or Write(ALL), is now much steeper – about a factor of ten. Furthermore, note that Read(1) and Write(1) are almost as fast in this configuration as they were in the two single-region configurations. Although Cassandra sends Read(1) and Write(1) operations to all replicas, including those in the remote regions, it can respond to the client as soon as the operation has completed at a single replica, which will typically be local.

In summary, operations which require acknowledgements from many replicas are slower than those that require only one acknowledgement, and this “consistency penalty” is significantly larger when the replicas are geographically distributed. In our baseline system, either read operations or write operations (or both) would suffer this penalty.

Figure 2: Latency of Read and Write operations in three Cassandra configurations. (Bar charts: latency in ms for consistency levels One, Quorum, and All; panels (a) 1 zone, (b) 1 region (3 zones), (c) 3 regions.)

3.1 Basic Optimistic I/O

Optimistic I/O is a technique for ensuring consistent reads for DBMS tenants, while avoiding most of the consistency penalty shown in the previous section. In this section we describe an initial, basic version of optimistic I/O, for which we will assume that there are no server failures. In the following sections, we consider the impact of server failures and describe how to refine and extend the basic optimistic I/O technique so that failures can be tolerated.

Optimistic I/O is based on the following observation: using Read(1) and Write(1) for reading and writing blocks does not guarantee consistency, but the Read(1) operation will usually return the most recent version. This assumes that the storage tier updates all replicas of a block in response to a Write(1) request, though it only waits for the first replica to acknowledge the update before acknowledging the operation to the client. The remaining replica updates complete asynchronously. Since all replicas are updated, whichever copy the subsequent Read(1) returns is likely to be current. Thus, to write data quickly, the storage tier can use fast Write(1) operations, and to read data quickly it can optimistically perform fast Read(1) operations and hope that they return the latest version. If they do not, then we must fall back to some alternative to obtain the latest version. To the extent that our optimism is justified and Read(1) returns the latest value, we will be able to obtain consistent I/O using fast Read(1) and Write(1) operations. Note that using Write(1) provides only a weak durability guarantee, since only one replica is known to have been updated when the write request is acknowledged. We will return to this issue in Section 4.

To implement optimistic I/O, we need a mechanism by which the storage tier can determine whether a Read(1) has returned the latest version. DAX does this using per-block version numbers, which are managed by the client library. The library assigns monotonically increasing version numbers to blocks as they are written by the tenant DBMS. The storage tier stores a version number with each replica and uses the version numbers to determine which replica is the most recent. The client library also maintains an in-memory version list in which it records the most recent version numbers of different blocks. When the client library writes a block, it records the version number associated with this write in the version list. When a block is read using Read(1), the storage tier returns the first block replica that it is able to obtain, along with that replica’s stored version number. The client library compares the returned version number with the most recent version number from the version list to determine whether the returned block replica is stale.

Version lists have a configurable maximum size. If a client’s version list grows too large, it evicts entries from the list using a least-recently-used (LRU) policy.

Since the version list is maintained in memory by the client library, it does not persist across tenant failures. When a DBMS tenant first starts, or when it restarts after a failure, its version list will be empty. The version list is populated gradually as the DBMS writes blocks. If the DBMS reads a block for which there is no entry in the version list, DAX cannot use Read(1), since it will not be able to check the returned version for staleness. The same situation can occur if a block’s latest version number is evicted from the version list by the LRU policy. In these situations, the client library falls back to Read(ALL), which is guaranteed to return the latest version, and it records this latest version in the version list.

When the client library detects a stale block read, it must try again to obtain the latest version. One option is to perform the second read using Read(ALL), which will ensure that the latest version is returned. An alternative, which DAX uses, is to simply retry the Read(1) operation. As described in Section 6, DAX uses an asynchronous mechanism to bring stale replicas up to date. Because of this asynchronous activity, a retried Read(1) operation will often succeed in returning the latest version even if it accesses the same replica as the original, stale Read(1).
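The following Python sketch pulls the pieces of basic optimistic I/O together: a bounded LRU version list, writes that record version numbers, and a read path that retries stale Read(1) results and falls back to Read(ALL) when no version is known. It is an illustrative sketch under assumed APIs (`store.read_one`, `store.read_all`, `store.write_one` are hypothetical), not the DAX client library itself.

from collections import OrderedDict
import itertools

class BasicOptimisticClient:
    def __init__(self, store, max_versions=512 * 1024, max_retries=10):
        self.store = store                      # hypothetical flexible-consistency store
        self.versions = OrderedDict()           # block_id -> latest version, in LRU order
        self.max_versions = max_versions
        self.max_retries = max_retries
        self._next_version = itertools.count(1)

    def write(self, block_id, data):
        v = next(self._next_version)            # monotonically increasing per client
        # Acknowledged after one replica; the rest are updated asynchronously.
        self.store.write_one(block_id, data, version=v)
        self._remember(block_id, v)

    def read(self, block_id):
        latest = self.versions.get(block_id)
        if latest is None:
            # No known version: fall back to reading all replicas.
            data, v = self.store.read_all(block_id)
            self._remember(block_id, v)
            return data
        for _ in range(self.max_retries):
            data, v = self.store.read_one(block_id)   # optimistic fast read
            if v >= latest:
                return data                           # replica is current
        raise IOError("stale replicas persisted after retries")

    def _remember(self, block_id, v):
        self.versions[block_id] = v
        self.versions.move_to_end(block_id)
        if len(self.versions) > self.max_versions:
            self.versions.popitem(last=False)         # LRU eviction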

4. HANDLING FAILURES

The basic optimistic I/O protocol in Section 3.1 is not tolerant of failures of DAX servers. Since it uses Write(1), loss of even a single replica may destroy the only copy of an update in the storage tier, compromising durability. (Other copies may have been successfully updated, but there is no guarantee of this.) Conversely, since it uses Read(ALL) when Read(1) cannot be used, failure of even a single replica will prevent read operations from completing until that replica is recovered, leading to a loss of availability.

Both of these problems can be resolved by increasing W and decreasing R such that R + W > N is maintained. In particular, we could use Write(QUORUM) and Read(QUORUM) in the basic protocol in place of Write(1) and Read(ALL). The drawback of this simple approach is that Write(QUORUM) may be significantly slower than Write(1). This is particularly true, as shown in Figure 2(c), if we insist that the write quorum include replicas in multiple, geographically distributed data centers. As discussed in Section 4.2, this is exactly what is needed to tolerate data center failures.

To achieve fault tolerance without the full expense of Write(QUORUM), DAX makes use of client-controlled synchronization.

The idea is to weaken the durability guarantee slightly, so that the storage tier need not immediately guarantee the durability of every write. Instead, we allow each DBMS tenant to define explicit synchronization points by which preceding updates must be reflected in a quorum (k) of replicas. Until a DBMS has explicitly synchronized an update, it cannot assume that this update is durable.

This approach is very natural because many DBMS are already designed to use explicit update synchronization points. For example, a DBMS that implements the write-ahead logging protocol will need to ensure that the transaction log record describing a database update is safely in the log before the database itself can be updated. To do this, the DBMS will write the log page and then issue a write synchronization operation, such as a POSIX fsync call, to obtain a guarantee from the underlying file system that the log record has been forced all the way to the underlying persistent storage, and will therefore be durable. Similarly, a DBMS may write a batch of dirty blocks from its buffer to the storage system and then perform a final synchronization to ensure that the whole batch is durable. Any delay between a write operation and the subsequent synchronization point provides an opportunity for the storage tier to hide some or all of the latency associated with that write.

When DAX receives a write request from a DBMS tenant, it uses a new type of write operation that we have defined, called Write(CSYNC). Like Write(1), Write(CSYNC) immediately sends update requests to all replicas and returns to the client after a single replica has acknowledged the update. In addition, Write(CSYNC) records the key (i.e., the block identifier) in a per-client sync pending list at the DAX server to which the client library is connected. The key remains in the sync pending list until all remaining replicas have, asynchronously, acknowledged the update, at which point it is removed. Write(CSYNC) operations experience about the same latency as Write(1), since both return as soon as one replica acknowledges the update.

When the DBMS needs to ensure that a write is durable (at least k replicas updated) it issues an explicit synchronization request. To implement these requests, DAX provides an operation called CSync, which is analogous to the POSIX fsync. On a CSync, the DAX server to which the client is connected makes a snapshot of the client’s sync pending list and blocks until at least k replica update acknowledgements have been received for each update on the list. This ensures that all preceding writes from this DBMS tenant are durable, i.e., replicated at least k times. The synchronization request is then acknowledged to the client.
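A minimal sketch of the server-side bookkeeping behind Write(CSYNC) and CSync is shown below. It assumes a hypothetical per-client session object on a DAX server; replica acknowledgements arrive asynchronously via `on_replica_ack`, and CSync blocks until every write pending at the time of the call has reached k replicas. (In the real protocol, entries are also removed once all N replicas have acknowledged; that cleanup is omitted here.)

import threading

class ClientSyncState:
    """Per-client state on a DAX server for Write(CSYNC)/CSync (illustrative only)."""

    def __init__(self, k):
        self.k = k                          # replicas required for a synchronized update
        self.acks = {}                      # sync pending list: block key -> acks so far
        self.cond = threading.Condition()

    def on_write_csync(self, key):
        # Called when the coordinator forwards a Write(CSYNC) to all replicas.
        with self.cond:
            self.acks[key] = 0              # key joins the sync pending list

    def on_replica_ack(self, key):
        # Called for each replica acknowledgement, possibly long after the write returned.
        with self.cond:
            if key in self.acks:
                self.acks[key] += 1
                self.cond.notify_all()

    def csync(self):
        # Snapshot the pending list, then block until each entry has >= k acks.
        with self.cond:
            snapshot = list(self.acks.keys())
            self.cond.wait_for(
                lambda: all(self.acks.get(key, self.k) >= self.k for key in snapshot)
            )
        # All writes preceding this CSync are now replicated at least k times.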

Figure 3 summarizes the refined version of the optimistic I/O protocol that uses these new operations. We will refer to this version as CSync-aware optimistic I/O to distinguish it from the basic protocol that was presented in Section 3. In the CSync-aware protocol, Write(CSYNC) is used in place of Write(1) and Read(QUORUM) is used in place of Read(ALL). In addition, the client library executes the CSync procedure when the DBMS tenant performs a POSIX fsync operation.

Read procedure:
(1) proc Read(blockId) ≡
(2) begin
(3)   if blockId ∈ VersionList
(4)   then
(5)     (data, v) := Read(1);
(6)     comment: v is the stored version number
(7)     while v < VersionList.get(blockId) do
(8)       comment: retry a stale read;
(9)       comment: give up if too many retries
(10)      comment: and return an error
(11)      (data, v) := Read(1); od
(12)  else
(13)    comment: fall back to reading a quorum
(14)    (data, v) := Read(QUORUM);
(15)    VersionList.put(blockId, v); fi
(16)  return data;
(17) end

Write procedure:
(1) proc Write(blockId, data) ≡
(2) begin
(3)   v := GenerateNewVersionNum();
(4)   comment: write the data using blockId as the key
(5)   Write(CSYNC);
(6)   VersionList.put(blockId, v);
(7) end

Fsync procedure:
(1) proc Fsync() ≡
(2) begin
(3)   comment: blocks until writes on sync pending
(4)   comment: list are durable
(5)   CSync();
(6) end

Figure 3: CSync-aware optimistic I/O protocol.

There are two details of the CSync-aware optimistic protocol that are not shown in Figure 3. First, DAX ensures that every block that is on the sync pending list is also on the version list. (The version list replacement policy, implemented by VersionList.put in Figure 3, simply skips blocks with unsynchronized updates.) This ensures that the DAX client library always knows the current version number for blocks with unsynchronized updates, and need not rely on Read(QUORUM) to read the current version. Read(QUORUM) cannot be relied on to do so, since the unsynchronized update may not yet be present at a full write quorum of replicas. Second, DAX maintains a copy of the sync pending list at the client library, in case the DAX server to which it is connected should fail. If such a failure occurs, the client library connects to any other DAX server and initializes the sync pending list there using its copy.

Two parameters control the availability and durability of the CSync-aware optimistic I/O protocol: N, the total number of copies of each block, and k, the number of copies that must acknowledge a synchronized update. Any synchronized update will survive up to k − 1 concurrent DAX server failures. Unsynchronized updates may not, but the DBMS does not depend on the durability of such updates. DAX will normally be available (for read, write, and CSync operations), as long as no more than min(k − 1, N − k) servers have failed. For example, with N = 6 and k = 4, a synchronized update survives up to three concurrent server failures, and DAX remains available as long as no more than min(3, 2) = 2 servers have failed. However, client-controlled synchronization does introduce a small risk that a smaller number of failures may compromise read availability. This may occur if all DAX servers holding an unsynchronized update (which is not yet guaranteed to be replicated k times) should fail. Since Write(CSYNC) sends updates to all N copies of an object immediately (though it does not wait for acknowledgements), such a situation is unlikely.

Should it occur, it will be manifested as a failed read operation (lines 7-11 in Figure 3), which would require a restart of the DBMS server (see Section 4.1), and hence a temporary loss of DBMS availability. Since the DBMS does not depend on the durability of unsynchronized writes, it will be able to recover committed database updates during restart using its normal recovery procedure (e.g., from the transaction log). We never encountered this type of read failure with our DAX prototype, although we do sometimes need to retry Read(1) operations. The maximum number of retries we saw in all of our experiments is 10.

4.1 Failure of a DBMS Tenant

Recovery from a failure of a DBMS tenant is accomplished by starting a fresh instance of that DBMS, on a new server if necessary. The new tenant instance goes through the normal DBMS recovery process using the transaction log and database stored persistently in the DAX storage tier. This approach to recovering failed DBMS instances is also used by existing services like the Amazon Relational Database Service (RDS). However, because DAX can be geographically distributed, a DAX tenant can be restarted, if desired, in a remote data center. The DBMS recovery process may result in some tenant downtime. If downtime cannot be tolerated, a DBMS-level high availability mechanism can be used to reduce or eliminate downtime. However, a discussion of such mechanisms is outside the scope of this paper.

4.2 Loss of a Data Center

If the DAX storage tier spans multiple data centers, it can be used to ensure that stored data survives the loss of all servers in a data center, as can happen, for example, due to a natural disaster. Stored blocks will remain available provided that a quorum of block copies is present at the surviving data center(s). DBMS tenants hosted in the failed data center need to be restarted in a new data center and recovered as discussed in Section 4.1.

To ensure that DAX can survive a data center failure, the only additional mechanism we need is a data-center-aware policy for placement of replicas. The purpose of such a policy is to ensure that the storage tier distributes replicas of every stored block across multiple data centers. Our DAX prototype, which is based on Cassandra, is able to leverage Cassandra’s replica placement policies to achieve this.

4.3 Consistency

In this section, we discuss the consistency guarantees made by the CSync-aware optimistic I/O protocol shown in Figure 3. We will use the example timeline shown in Figure 4 to illustrate the consistency guarantees. The timeline represents a series of reads and writes of a single stored block. Because these requests are generated by a single DBMS, they occur sequentially as discussed in Section 2. Writes W0 and W1 in Figure 4 are synchronized writes, because they are followed by a CSync operation. The remaining writes are unsynchronized. The timeline is divided into epochs. Each failure of the hosted DBMS defines an epoch boundary. Three different epochs are shown in the figure.

The CSync-aware optimistic I/O protocol ensures that each read sees the block as written by the most recent preceding write in its epoch, if such a write exists. Thus, R1 will see the block as written by W3 and R4 will see the block as written by W4.

We will refer to this property as epoch-bounded strong consistency. It is identical to the usual notion of strong consistency, but it applies only to reads for which there is at least one preceding write (of the same block) in the same epoch.

We refer to reads for which there is no preceding write in the same epoch as initial reads. In Figure 4, R2, R3, and R5 are initial reads. For initial reads, the CSync-aware optimistic I/O protocol guarantees only that the read will return a version at least as recent as that written by the latest preceding synchronized write. In our example, this means that R2 and R3 will see what is written by either W1, W2, or W3, and R5 will see one of those versions or the version written by W4. This may lead to problematic anomalies. For example, R2 may return the version written by W3, while R3 returns the earlier version written by W1. If R2 occurs during log-based recovery, the DBMS may decide that the block is up-to-date and thus will not update it from the log. If R3 occurs during normal operation of the new DBMS instance, those committed updates will be missing from the block. Thus, the DBMS will have failed to preserve the atomicity or durability of one or more of its transactions from the first epoch. In the following section, we present strong optimistic I/O, an extension of the protocol shown in Figure 3 that provides stronger guarantees for initial reads.

5. STRONG OPTIMISTIC I/O

One option for handling initial reads is to try to extend the optimistic I/O protocol so that it provides a strong consistency guarantee across epochs as well as within them. However, such a guarantee would be both difficult to enforce and unnecessarily strong. Instead, we argue that the DAX storage tier should behave like a non-replicated, locally attached storage device or file system that has an explicit synchronization operation (i.e., like a local disk). Most DBMS are designed to work with such a storage system. In such a system, unsynchronized writes may survive a failure, but they are not guaranteed to do so. However, the key point is that only one version of each block will survive. The survival candidates include (a) the version written by the most recent synchronized update preceding the failure and (b) the versions written by any (unsynchronized) updates that occur after the most recent synchronized update. Thus, at R2 the storage tier would be free to return either the version written by W1 or the version written by W2 or the version written by W3. However, once it has shown a particular version to R2, it must show the same version to R3, since only one version should survive across a failure, i.e., across an epoch boundary. Similarly, at R5 the storage tier can return either the same version that was returned to R2 (and R3) or the version that was written by W4. We can summarize these additional consistency requirements as follows:

• A read operation with no preceding operations (on the same block) in the same epoch (such as R2 and R5 in Figure 4) must return either (a) the version written by the most recent synchronized update in the preceding epoch (or the version that exists at the beginning of the epoch if there is no synchronized update) or (b) a version written by one of the (unsynchronized) updates that occurs after the most recent synchronized update. We refer to this as Non-deterministic One-Copy (NOC) consistency.

• Consecutive read operations, with no intervening writes, must return the same version. Thus, in Figure 4, R3 must return the same version as R2. We refer to this property as read stability.

Figure 4: Timeline example showing reads and writes of a single block in the presence of failures. (Timeline: writes W0–W4 and reads R1–R5 span three epochs separated by two DBMS failures; W0 and W1 are followed by a CSync and are therefore synchronized.)

Next, we present strong optimistic I/O, a refinement of optimistic I/O that provides NOC consistency and read stability, in addition to epoch-bounded strong consistency.

The key additional idea in the strong optimistic I/O protocol is version promotion. When a block is initially read in a new epoch, one candidate version of the block has its version number “promoted” from the previous epoch into the current epoch and is returned by the read call. Promotion selects which version will survive from the previous epoch and ensures that other versions from that earlier epoch will never be read. Through careful management of version numbers, the version promotion protocol ensures that only one version of the block can be promoted. In addition, the promotion protocol performs both the read of a block and its promotion using a single round of messages in the storage tier, so promotion does not add significant overhead.

The strong optimistic I/O protocol is identical to the CSync-aware protocol presented in Figure 3, except that the Read(QUORUM) at line 14 is replaced by a new DAX operation called ReadPromote (Figure 5). Note that ReadPromote is used only when the current version number of the block is not found in the version list. This will always be the case for an initial read of a block in a new epoch, since the version list does not survive failure of the DBMS tenant.

To implement ReadPromote, we rely on two-part composite version numbers V = (e, t), where e is an epoch number and t is a generated timestamp. We assume that each epoch is associated by the client library with an epoch number, and that these epoch numbers increase monotonically with each failover (i.e., with each failure and restart of the DBMS). We also rely on each client library being able to generate monotonically increasing timestamps, even across failures. That is, all version numbers generated during an epoch must have a timestamp component larger than the timestamps in all version numbers generated during previous epochs. We say that V2 > V1 iff t2 > t1. (Because of our timestamp monotonicity assumption, t2 > t1 implies e2 ≥ e1.) Whenever the client library needs to generate a new version number, it uses the current epoch number along with a newly generated t that is larger than all previously generated t’s.

ReadPromote takes a single parameter, which is a version number V = (e, t) generated in the current epoch. It contacts a quorum of servers holding copies of the block to be read and waits for them to return a response (lines 3-4). Each contacted server promotes its replica’s version into epoch e (the current epoch) and returns both its copy of the block and its original, unpromoted version number. Promoting a version number means replacing its epoch number while leaving its timestamp unchanged.
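A small Python sketch of these composite version numbers and of promotion may help make the ordering rule concrete. It is our own illustration of the rules stated above, not code from the prototype:

from dataclasses import dataclass

@dataclass(frozen=True)
class Version:
    epoch: int       # e: increases with each DBMS failover
    timestamp: int   # t: monotonically increasing, even across failovers

    def __lt__(self, other):
        # Ordering is by timestamp alone; monotonic timestamps ensure that a
        # larger t never comes from an older epoch.
        return self.timestamp < other.timestamp

def promote(stored: Version, current_epoch: int) -> Version:
    # Promotion replaces the epoch number but keeps the timestamp, so a promoted
    # old version still orders below any new write generated in the current epoch.
    return Version(epoch=current_epoch, timestamp=stored.timestamp)

# Example: a version written in epoch 1 is promoted into epoch 2, and a fresh
# write in epoch 2 (with a larger timestamp) still orders above it.
old = Version(epoch=1, timestamp=17)
promoted = promote(old, current_epoch=2)
new_write = Version(epoch=2, timestamp=25)
assert promoted < new_write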

ReadPromote procedure:

(1) proc ReadPromote(V) ≡
(2) begin
(3)   {(data1, V1), (data2, V2), ..., (dataq, Vq)}
(4)     := ReadAndPromote(QUORUM, epoch(V));
(6)   if ∃ 1 ≤ i ≤ q : epoch(Vi) = epoch(V)
(7)   then comment: choose the latest Vi
(8)     return (datai, Vi)
(9)   else if (V1 = V2 = ... = Vq)
(10)  then return (data1, V1)
(11)  else
(12)    data := datai with largest Vi
(13)    Write(QUORUM, data, V);
(14)    return (data, V)
(15)  fi
(16) fi
(18) end

Figure 5: ReadPromote operation.

If at least one server returns a version from the current epoch (lines 6-8), the latest such version is returned to the client. Otherwise, all of the returned versions are from previous epochs. If all of the returned versions are identical (lines 9-10), then that version will be returned to the client. That version will already have been promoted into the current epoch by the quorum of servers that received the ReadPromote request. Otherwise, ReadPromote must choose one of the returned versions to survive from the previous epoch. This version must be made to exist on a quorum of the replicas. ReadPromote ensures this by using a Write(QUORUM) operation to write the selected version of the block back to the replica servers, using the new version number V (lines 12-13).

In the (hopefully) common case in which the returned versions from the previous epoch are the same, ReadPromote requires only a single message exchange, like Read(QUORUM). In the worst case, which can occur, for example, if a write was in progress at the time of the DBMS failover, ReadPromote will require a second round of messages for the Write(QUORUM).

Theorem 1. The full optimistic I/O protocol, with ReadPromote, guarantees epoch-bounded strong consistency, NOC consistency, and read stability.

Proof: Due to space constraints we only provide a sketch of the proof.

First, ReadPromote is only used for block b when there are no unsynchronized updates of b in the current epoch, since a block with unsynchronized updates is guaranteed to remain on the version list and be read using Read(1). If there has been a synchronized update in the epoch and the block is subsequently evicted from the version list, ReadPromote’s initial quorum read must see it and will return it, thereby maintaining intra-epoch strong consistency. Since timestamps increase monotonically even across epochs, updates from the current epoch always have larger version numbers than updates from previous epochs, even if those earlier updates have been promoted. If ReadPromote is being used for an initial read in an epoch, it will return a version from a previous epoch and will have ensured that a quorum of replicas with that version exists, either by discovering such a quorum or creating it with Write(QUORUM). Replicas in this quorum will have version numbers larger than any remaining replicas, so this version will be returned by any subsequent reads until a new version is written in the current epoch. □

6. DAX PROTOTYPE

To build a prototype of DAX, we decided to adapt an existing multi-master, flexible consistency, key/value storage system rather than building from scratch. Starting with an existing system allowed us to avoid re-implementing common aspects of the functionality of such systems. For example, we re-used mechanisms for detecting failures and for bringing failed servers back on-line, and mechanisms for storing and retrieving values at each server.

Candidate starting points include Cassandra and Voldemort, both of which provide open-source implementations of Dynamo-style quorum-based flexible consistency. We chose Cassandra because it allows clients to choose read and write quorums on a per-request basis, as required by the optimistic I/O approach. Voldemort also supports flexible consistency, but not on a per-request basis. Cassandra also expects a client-supplied timestamp to be associated with each write operation, and it uses these timestamps to impose a global total ordering on each key’s writes. This also fits well with DAX’s optimistic I/O approach, which requires per-request version numbers. In our Cassandra-based prototype, we use these version numbers as Cassandra timestamps. Cassandra also implements data-center-aware replica placement policies, which we take advantage of. Finally, Cassandra provides various mechanisms (e.g., hinted-handoff and read-repair) for asynchronously updating replicas that may have missed writes. Read-repair, for example, is triggered automatically when Cassandra discovers version mismatches on a read. These mechanisms help to improve the performance of optimistic I/O by increasing the likelihood that Read(1) will find an up-to-date replica.
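To illustrate the per-request consistency levels that motivated the choice of Cassandra, here is a short sketch using the modern DataStax Python driver and CQL. The prototype itself was built on Cassandra 1.0.7 and its older client interface, so this is only an analogy, and the keyspace and table names are made up:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("dax_demo")   # hypothetical keyspace

# Fast, weakly consistent write: acknowledged after one replica (like Write(1)).
fast_write = SimpleStatement(
    "INSERT INTO blocks (block_id, content) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ONE)
session.execute(fast_write, ("tenant42/data.ibd:7", b"...block bytes..."))

# Strongly consistent read on the same table: waits for a quorum of replicas.
quorum_read = SimpleStatement(
    "SELECT content FROM blocks WHERE block_id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
row = session.execute(quorum_read, ("tenant42/data.ibd:7",)).one()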

To build the DAX prototype, we had to modify and extend Cassandra in several ways. First, we created the DAX client library, which implements optimistic I/O and serves as the interface between the DBMS tenants and the storage tier, controlling the consistency level of each read and write operation. Second, we added support for the CSync operation, which involved implementing the sync pending list at each Cassandra server. Finally, we added support for the ReadPromote operation described in Figure 5. This involved the creation of a new Cassandra message type and the addition of version promotion logic at each server.

7. EXPERIMENTAL EVALUATION

We conducted an empirical evaluation of DAX, with the goals of measuring the performance and effectiveness of optimistic I/O and client-controlled synchronization, verifying the system’s scalability, and characterizing the system’s behavior under various failure scenarios. We conducted all of our experiments in Amazon’s Elastic Compute Cloud (EC2). EC2 is widely used to host data-intensive services of the kind that we are interested in supporting, so it is a natural setting for our experiments. Another reason for using EC2 is that it allowed us to run experiments that use multiple data centers. All of our experiments used EC2 large instances (virtual machines), which have 7.5 GB of memory and a single, locally attached 414 GB storage volume holding an ext3 file system. The virtual machines ran Ubuntu Linux with the 2.6.38-11-virtual kernel. Our DAX prototype is based on Cassandra version 1.0.7. MySQL version 5.5.8 with the InnoDB storage manager was used as the DBMS tenant.

All of the experiments ran a TPC-C transaction processing workload generated by the Percona TPC-C toolkit for MySQL [26]. We used a 100-warehouse TPC-C database instance with an initial size of 10 GB, and the MySQL buffer pool size was set to 5 GB. The database grows during a benchmark run due to record insertions. For each DBMS, 50 concurrent TPC-C clients were used to generate the DBMS workload. The performance metric of interest in our experiments is throughput, measured in TPC-C NewOrder transactions per minute (TpmC). The maximum size of the version list maintained by the client library was set to 512K entries, sufficiently large to hold entries for most of the database blocks.

7.1 Performance of Optimistic I/O

In this experiment, we consider three DAX configurations. In each configuration, there is one EC2 server running a single MySQL tenant, and nine EC2 servers implementing the DAX storage tier. (DAX stores its data on those servers’ locally attached storage volumes.) Each database block is replicated six times for a total data size of approximately 60 GB across the nine DAX servers. The three configurations are as follows:

• 1 zone: All ten servers are located in the same EC2 availability zone (roughly analogous to a data center).

• 3 zones: The nine DAX servers are distributed evenly over three EC2 availability zones, one of which also houses the MySQL tenant server. All three zones are in the same geographic region (US East). Two replicas of each block are placed in each zone.

• 3 regions: Like the 3-zone configuration, except that the three zones are located in geographically distributed EC2 regions: US East, US West-1 and US West-2.

In each EC2 configuration, we tested four versions of DAX, for a total of twelve experiments. The DAX versions we tested are as follows:

• Baseline 1: In this configuration, DAX performs I/O operations using Write(ALL) and Read(1). This ensures that all database and log updates are replicated 6 times before being acknowledged to the DBMS, and all I/O is strongly consistent, i.e., reads are guaranteed to see the most recently written version of the data. Neither optimistic I/O nor client-controlled synchronization is used.

1 zone 3 zones 3 regions0

200

400

600

800

1000

1200

1400

1600T

pm

CBasic Optimistic I/O

Strong Optimistic I/O

Baseline 1

Baseline 2

Figure 6: Performance of four DAX variants in threeEC2 configurations.

• Baseline 2:Like Baseline 1, except that Write(QUORUM)and Read(QUORUM) are used instead of Write(ALL) andRead(1). A quorum consists of N/2+1 replicas, i.e., 4 of6 replicas. All I/O is strongly consistent, and neither op-timistic I/O nor client-controlled synchronization is used.

• Basic Optimistic I/O: In this configuration, DAXuses the optimistic I/O technique described in Section 3.This ensures strongly consistent I/O, but only within anepoch. Furthermore, database updates are not guaranteedto be replicated at the time the DBMS write operationcompletes. The database is, therefore, not safe in the faceof failures of DAX servers.

• Strong Optimistic I/O: In this configuration, DAXuses the strong optimistic I/O protocol (Section 5), in-cluding CSync and ReadPromote. I/O operations haveepoch-bounded strong consistency, NOC consistency, andread stability, and updates are guaranteed to be replicatedat the synchronization points chosen by the DBMS.Results from this experiment are presented in Figure 6. In

Results from this experiment are presented in Figure 6. In the two single-region configurations, optimistic I/O provides up to a factor of two improvement in transaction throughput relative to the baseline systems. In the 3-region configuration, the improvement relative to the baseline is approximately a factor of four. Unlike the baselines, optimistic I/O is able to handle most read and write requests using Read(1) and Write(1). In contrast, the Write(ALL) or Read(QUORUM)/Write(QUORUM) operations performed by the baseline cases experience higher latencies. Optimistic I/O is not completely immune to the effects of inter-region latency: the performance of the optimistic I/O variants is lower in the 3-region configuration than in the single-region configurations. However, the drop in throughput is much smaller than for the baselines.

Strong optimistic I/O results in only a small performance penalty relative to basic optimistic I/O, which is not tolerant to failures. Most importantly, the penalty is small even in the 3-region configuration. Thus, strong optimistic I/O is able to guarantee that updates are replicated consistently across three geographically distributed data centers (thereby tolerating failures up to the loss of an entire data center) while achieving only slightly lower throughput than basic optimistic I/O, which makes no such guarantee.

7.2 Scalability

In this experiment, we investigate the tenant scalability of DAX with strong optimistic I/O by varying the number of DBMS tenants from 6 to 30 and increasing the number of DAX servers proportionally, from 9 servers to 45 servers.

Figure 7: Average, min, and max TpmC per DBMS tenant (single-tenant TpmC vs. number of DBMS tenants, for the 1-region and 3-region configurations). The number of DAX servers is 3/2 the number of tenants.

We use two DAX configurations. In the first, all servers are located in a single availability zone. In the second, the DAX servers and DBMS tenants are evenly distributed in three EC2 regions. Each DBMS tenant manages a separate TPC-C database with 100 warehouses and serves 50 concurrent clients. There are three replicas of each database in the storage tier. Figure 7 shows, for both DAX configurations, the average, minimum, and maximum TpmC per tenant as the number of tenants increases. With the ratio of three DAX storage tier servers to two tenants that we maintained in this experiment, the TPC-C throughput of the tenants is limited by the I/O bandwidth at the DAX servers. We observed nearly linear scale-out up to 30 tenant databases, the largest number we tested. The per-tenant throughput is slightly higher when there are fewer DAX servers. This is because at smaller scales a higher portion of each tenant's database will be located locally at the DAX server to which the tenant connects. As expected, throughput is lower in the 3-region configuration due to the higher latency between DAX servers, but we still observe nearly linear scale-out.

7.3 Fault Tolerance

Next, we present the results of several experiments that illustrate the behavior of DAX under various failure conditions. We present results for DAX with strong optimistic I/O, and for Baseline 2 (Write(QUORUM)/Read(QUORUM)). Baseline 1 (Write(ALL)/Read(1)) cannot be used in the presence of failures since Write(ALL) requires all replicas to be available. In each case, the experiment begins with the 3-region EC2 configuration: nine DAX servers, three in the US East region (primary), three in the US West-1 region, and three in the US West-2 region, with two replicas of each block in each region. A single MySQL tenant runs in the primary region. For each experiment, we measure the tenant's TPC-C throughput (TpmC) as a function of time. Average throughput is reported every 10 seconds.

7.3.1 Failures of DAX Servers

Figure 8 illustrates a scenario in which one DAX server in the primary data center fails at around the 300th second, and then re-joins the storage tier at around the 900th second. After the server fails, it takes the other DAX servers 30 to 90 seconds to detect the failure.



Figure 8: Failure and recovery of a DAX server in the primary data center. Failure occurs at 300 seconds and recovery occurs at 900 seconds.

During this period, read requests that are directed at the failed server will time out and will have to be retried by the tenant. CSync requests may also time out. This results in a dip in throughput immediately after the failure for both strong optimistic I/O and Baseline 2. The drop in performance relative to the no-failure performance is about the same for both DAX configurations. Eventually, all DAX servers are made aware of the failure and no longer direct requests to the failed server. The tenant's throughput returns to its original level.

After the failed server re-joins the storage tier, the data stored on that server will be stale. Read(1) requests that the strong optimistic I/O protocol directs to the re-joined server are likely to retrieve stale data and thus need to be retried by the tenant, according to the optimistic I/O protocol. The performance impact of these retries gradually diminishes as the stale data on the re-joined server is brought into synchronization, which Cassandra does automatically. In our test scenario, the system required about 10 minutes after the re-join to return to its full pre-failure performance level. However, the system remains available throughout this recovery period. Furthermore, even when it temporarily dips, optimistic I/O's performance is always better than the best performance of Baseline 2.
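The retry behaviour just described can be pictured as follows, again under the hypothetical client interface and VersionList used in the earlier sketches. The read_promote call stands in for DAX's ReadPromote, and the retry bound is our own choice rather than a detail of the protocol.

def optimistic_read(store, versions, block_id, max_retries=5):
    expected = versions.lookup(block_id)
    if expected is None:
        # The block is not in the version list (e.g., after a failover or an
        # eviction): fall back to ReadPromote, which returns a consistent
        # copy and a version the client can track from now on.
        version, payload = store.read_promote(block_id)
        versions.record(block_id, version)
        return payload

    for _ in range(max_retries):
        version, payload = store.read(block_id, consistency="ONE")
        if version >= expected:
            return payload        # fresh data: the common, low-latency case
        # A stale replica answered, e.g., a server that has just re-joined;
        # retry, most likely reaching an up-to-date replica this time.

    # Persistently stale answers: escalate instead of looping forever.
    version, payload = store.read_promote(block_id)
    versions.record(block_id, version)
    return payload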

The (poor) performance of Baseline 2 remains unchanged after the failed server re-joins. That baseline routinely uses relatively heavyweight Read(QUORUM) operations, even in the common case of no failure, so it does not experience any additional performance overhead when a failure occurs. Its performance remains low in all cases.

Note that the storage tier has only three servers in the primary region in these experiments, and those servers are heavily utilized. Thus, the loss of one server represents a loss of a third of the storage tier's peak bandwidth at the primary site. In a larger DAX system, or one that is less heavily loaded, we would expect the impact of a single storage server failure to be smaller than what is shown in Figure 8. Specifically, we would expect a shallower performance dip after the server failure for strong optimistic I/O, and a faster performance ramp-up after rejoin.

Figure 9 shows the results of an experiment similar to the previous one, except that the failed DAX server is in one of the secondary data centers. As in the previous scenario, there is a brief dip in performance for both configurations immediately after the failure due to request timeouts.

Figure 9: Failure and recovery of a DAX server in a secondary data center. Failure occurs at 300 seconds and recovery occurs at 900 seconds.

Performance returns to normal once all DAX servers are aware of the failure. This time there is little impact on tenant performance for both configurations when the failed server re-joins the storage tier. Since the re-joined server is remote, Read(1) requests issued by the strong optimistic I/O protocol are unlikely to be handled by that server. Instead, they will be handled by servers in the primary region, which are closer to the DBMS tenant. Since the data on those servers is up to date, retries are rarely needed.

In both configurations, the re-joined server will gradually be brought into synchronization in the background by Cassandra, without affecting DBMS performance. As before, the worst performance of strong optimistic I/O is better than the best performance of Baseline 2.

7.3.2 Loss of a Data Center

Figure 10 shows the result of an experiment in which the entire primary EC2 data center fails, resulting in the loss of the DBMS tenant and all three DAX servers running there. After the failure, a new MySQL tenant is launched in one of the secondary data centers to take over from the failed primary. The new tenant goes through its log-based transactional recovery procedure using the database and transaction log replicas that it is able to access through the DAX storage servers in the secondary regions. Thus, DAX allows the tenant DBMS to restore its availability even after the entire primary data center is lost. No special DBMS-level mechanisms are required to support this.

Figure 10 shows throughput for strong optimistic I/O and Baseline 2. For both configurations, there is an unavailability window of about 270 seconds while the new DBMS instance goes through its recovery procedure. All of the downtime is due to the DBMS recovery procedure. Performance then ramps up gradually to its pre-failure level. The gradual ramp-up under strong optimistic I/O is due to a combination of several factors: the cold start of the new DBMS tenant, the fact that DAX's workload is more read-heavy during restart, and the fact that these reads are more expensive than normal for DAX. The initial read of each database or log block requires ReadPromote since the block's identifier will not yet be present in the version list. Baseline 2 incurs the penalty for cold start but not for ReadPromote. However, as in previous experiments, the performance of Baseline 2 is always lower than that of strong optimistic I/O.



Figure 10: Failure of the primary data center and recovery of MySQL.

8. RELATED WORK

Cloud Storage Systems: There are now many cloud storage systems, all of which sacrifice functionality or consistency (as compared to relational database systems) to achieve scalability and fault tolerance. Different systems make different design choices in the way they manage updates, leading to different scalability, availability, and consistency characteristics. Google's BigTable [7] and its open source cousin, HBase, use master-slave replication and provide strong consistency for reads and writes of single rows. In contrast, systems such as Dynamo [16] and Cassandra [20] use multi-master replication as described in Section 2. PNUTS [9] provides per-record timeline consistency: all replicas of a given record apply all updates to the record in the same order. Spinnaker [27] enables client applications to choose between strong consistency and eventual consistency for every read. The system only runs in one data center, and it uses heavyweight Paxos-based replication, leading to high read and write latencies. All of the systems mentioned thus far provide consistency guarantees at the granularity of one row. Other cloud storage systems offer multi-row consistency guarantees [3, 13, 24, 25, 29], but such multi-row guarantees are not required by DAX.

The importance of consistent transactions on storage systems distributed across data centers is highlighted by recent work in this area. Patterson et al. [24] present a protocol that provides consistent multi-row transactions across multiple data centers, even in the presence of multiple writers for every row. The protocol is fairly heavyweight, and it assumes that the storage system provides strong consistency for reads and writes of individual rows. In contrast, DAX uses a lightweight protocol that takes advantage of the properties of DBMS transactions and the ability of flexible consistency systems to hide latency across data centers. DAX's latency hiding relies on the assumption that reads in eventually consistent storage systems usually return fresh data. This assumption has been verified and quantified in [2].

Database Systems on Cloud Storage: DAX is not the first cloud storage tier for relational DBMS. ElasTraS [12] is perhaps closest to the work presented here. Like DAX, ElasTraS is intended to host database systems on a shared storage tier. However, ElasTraS assumes that the storage tier provides strongly consistent read and write operations, so it is not concerned with questions of managing request consistency, which is the focus of this paper. Instead, ElasTraS is concerned with issues that are orthogonal to this paper, such as allowing co-located hosted database systems to share resources. Commercial services, such as Amazon's EBS, are also used to provide storage services for DBMS tenants. EBS provides strongly consistent reads and writes, but we are not aware of any public information about its implementation. Unlike DAX, these storage services are not intended to support geographically distributed replication.

Brantner et al. [6] considered the question of how to host database systems on Amazon's Simple Storage Service (S3), which provides only weak eventual consistency guarantees. As in our work, they use the underlying storage tier (S3) as a block storage device. However, their task is more challenging because of S3's weak eventual consistency guarantees, and because they allow multiple hosted database systems to access a single stored database. As a result, it is difficult to offer transactional guarantees to the applications built on top of the hosted database systems. In contrast, our work takes advantage of a flexible consistency model to offer transactional guarantees to the hosted DBMS applications without sacrificing performance.

Recent work [23] has looked at hosting relational database services on a storage tier that provides tuple-level access as well as indexing, concurrency control, and buffering. The storage tier in that work is based on Sinfonia [1]. Since DAX relies on a storage tier that provides a block-level interface, it resides in a very different part of the design spectrum.

Other Approaches to Database Scalability: DAX scales out to accommodate additional database tenants and additional aggregate demand for storage capacity and bandwidth. However, by itself it does not enable scale-out of individual DBMS tenants. There are other approaches to database scalability that do enable scale-out of individual database systems. One option is to use DBMS-managed (or middleware-managed) database replication. Kemme et al. [19] provide a recent overview of the substantial body of work on this topic. In such systems, updates are typically performed at all replicas, while reads can be performed at any replica. In DAX, each update is executed at only one DBMS, but the effect of the update is replicated in the storage tier. In some replicated DBMS, a master DBMS or middleware server is responsible for determining the order in which update operations will occur at all replicas. Such systems are closest to DAX, in which each hosted DBMS tenant determines the ordering of its write operations and the storage tier adheres to the DBMS-prescribed order.

Database partitioning or sharding [15], which is widely used in practice, is another approach to DBMS scale-out. HStore [18] and its commercial spin-off, VoltDB, store all the data of a database in main memory, partitioned among many servers and replicated for durability. MemcacheSQL [8] is another system that uses main memory, integrating buffer pool management in a relational DBMS with the Memcached distributed caching platform to enable one DBMS instance to use buffer pools that are distributed among many servers. Hyder [5] promises to scale a transactional database across multiple servers by exploiting a fast flash-based shared log. Some of these approaches may be used in conjunction with DAX if there is a need to scale a DBMS tenant beyond one server.

Spanner [11] is a recent multi-tenant database service that supports (a subset of) SQL and runs on a storage system distributed across multiple data centers. As such, Spanner has the same high-level goal as DAX.



Spanner is a completely new system built from the ground up, with a new replicated storage layer, transaction manager, and query processor. It is a highly engineered system that supports a rich set of features, some of which (e.g., multiple writers for the same database) are not supported by DAX. However, Spanner relies on a service called TrueTime that minimizes clock skew among data centers and reports the maximum possible clock skew. The implementation of TrueTime is distributed among the data centers in which Spanner runs, and it relies on special hardware. Further, Spanner uses the standard, heavyweight Paxos protocol (with some optimizations) for replication among data centers, which makes its latency high [28]. In contrast, DAX provides a simpler service and focuses on minimizing latency between data centers while maintaining consistency. It can support existing relational database systems such as MySQL, and it does not rely on any special hardware.

9. CONCLUSIONS

We have presented DAX, a distributed system intended to support the storage requirements of multiple DBMS tenants. DAX is scalable in the number of tenants it supports, and offers tenants scalable storage capacity and bandwidth. In addition, DAX tolerates failures up to the loss of an entire data center. To provide these benefits, DAX relies on a flexible consistency key/value storage system, and it judiciously manages the consistency level of different read and write operations to enable low-latency, geographically distributed replication while satisfying the consistency requirements of the DBMS tenants. Our experiments with a prototype of DAX based on Cassandra show that it provides up to a factor of 4 performance improvement over baseline solutions.

10. ACKNOWLEDGMENTS

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and by an Amazon Web Services Research Grant.

11. REFERENCES

[1] M. K. Aguilera, A. Merchant, M. Shah, A. Veitch, and C. Karamanolis. Sinfonia: a new paradigm for building scalable distributed systems. In SOSP, 2007.
[2] P. Bailis, S. Venkataraman, M. J. Franklin, J. M. Hellerstein, and I. Stoica. Probabilistically bounded staleness for practical partial quorums. PVLDB, 2012.
[3] J. Baker et al. Megastore: Providing scalable, highly available storage for interactive services. In CIDR, 2011.
[4] P. A. Bernstein et al. Adapting Microsoft SQL Server for cloud computing. In ICDE, 2011.
[5] P. A. Bernstein, C. Reid, and S. Das. Hyder - a transactional record manager for shared flash. In CIDR, 2011.
[6] M. Brantner, D. Florescu, D. Graf, D. Kossmann, and T. Kraska. Building a database on S3. In SIGMOD, 2008.
[7] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. In OSDI, 2006.
[8] Q. Chen, M. Hsu, and R. Wu. MemcacheSQL - a scale-out SQL cache engine. In BIRTE, 2011.
[9] B. F. Cooper et al. PNUTS: Yahoo!'s hosted data serving platform. PVLDB, 2008.
[10] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In SoCC, 2010.
[11] J. C. Corbett et al. Spanner: Google's globally-distributed database. In OSDI, 2012.
[12] S. Das, D. Agrawal, and A. El Abbadi. ElasTraS: An elastic transactional data store in the cloud. In HotCloud, 2009.
[13] S. Das, D. Agrawal, and A. El Abbadi. G-Store: A scalable data store for transactional multi key access in the cloud. In SoCC, 2010.
[14] S. Das, S. Nishimura, D. Agrawal, and A. El Abbadi. Albatross: Lightweight elasticity in shared storage databases for the cloud using live data migration. PVLDB, 2011.
[15] Database sharding whitepaper. http://www.dbshards.com/articles/database-sharding-whitepapers/.
[16] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, and A. Lakshman. Dynamo: Amazon's highly available key-value store. In SOSP, 2007.
[17] D. K. Gifford. Weighted voting for replicated data. In SOSP, 1979.
[18] S. Harizopoulos, D. J. Abadi, S. Madden, and M. Stonebraker. OLTP through the looking glass, and what we found there. In SIGMOD, 2008.
[19] B. Kemme, R. Jimenez-Peris, and M. Patiño-Martínez. Database Replication. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2010.
[20] A. Lakshman and P. Malik. Cassandra - a decentralized structured storage system. In LADIS, 2009.
[21] LinkedIn Data Infrastructure Team. Data infrastructure at LinkedIn. In ICDE, 2012.
[22] U. F. Minhas, R. Liu, A. Aboulnaga, K. Salem, J. Ng, and S. Robertson. Elastic scale-out for partition-based database systems. In SMDB, 2012.
[23] M. T. Najaran, P. Wijesekera, A. Warfield, and N. B. Hutchinson. Distributed indexing and locking: In search of scalable consistency. In LADIS, 2011.
[24] S. Patterson, A. J. Elmore, F. Nawab, D. Agrawal, and A. El Abbadi. Serializability, not serial: Concurrency control and availability in multi-datacenter datastores. PVLDB, 2012.
[25] D. Peng and F. Dabek. Large-scale incremental processing using distributed transactions and notifications. In OSDI, 2010.
[26] Percona TPC-C toolkit for MySQL. https://code.launchpad.net/percona-dev/perconatools/tpcc-mysql.
[27] J. Rao, E. J. Shekita, and S. Tata. Using Paxos to build a scalable, consistent, and highly available datastore. PVLDB, 2011.
[28] J. Shute et al. F1: the fault-tolerant distributed RDBMS supporting Google's ad business. In SIGMOD, 2012.
[29] Y. Sovran et al. Transactional storage for geo-replicated systems. In SOSP, 2011.
