Chain Replication for Supporting High Throughput and Availability

Robbert van Renesse
[email protected]

Fred B. Schneider
[email protected]

FAST Search & Transfer ASA
Tromsø, Norway

and

Department of Computer Science
Cornell University
Ithaca, New York 14853

Abstract

Chain replication is a new approach to coordinating clusters of fail-stop storage servers. The approach is intended for supporting large-scale storage services that exhibit high throughput and availability without sacrificing strong consistency guarantees. Besides outlining the chain replication protocols themselves, simulation experiments explore the performance characteristics of a prototype implementation. Throughput, availability, and several object-placement strategies (including schemes based on distributed hash table routing) are discussed.

1 Introduction

A storage system typically implements operations so that clients can store, retrieve, and/or change data. File systems and database systems are perhaps the best known examples. With a file system, operations (read and write) access a single file and are idempotent; with a database system, operations (transactions) may each access multiple objects and are serializable.

This paper is concerned with storage systems that sit somewhere between file systems and database systems. In particular, we are concerned with storage systems, henceforth called storage services, that

• store objects (of an unspecified nature),

• support query operations to return a value derived from a single object, and

• support update operations to atomically change the state of a single object according to some pre-programmed, possibly non-deterministic, computation involving the prior state of that object.

A file system write is thus a special case of our storage service update which, in turn, is a special case of a database transaction.

Increasingly, we see on-line vendors (like Amazon.com), search engines (like Google's and FAST's), and a host of other information-intensive services provide value by connecting large-scale storage systems to networks. A storage service is the appropriate compromise for such applications, when a database system would be too expensive and a file system lacks rich enough semantics.

One challenge when building a large-scale storage service is maintaining high availability and high throughput despite failures and concomitant changes to the storage service's configuration, as faulty components are detected and replaced.

Consistency guarantees also can be crucial. But even when they are not, the construction of an application that fronts a storage service is often simplified given strong consistency guarantees, which assert that (i) operations to query and update individual objects are executed in some sequential order and (ii) the effects of update operations are necessarily reflected in results returned by subsequent query operations.

Strong consistency guarantees are often thought to be in tension with achieving high throughput and high availability. So system designers, reluctant to sacrifice system throughput or availability, regularly decline to support strong consistency guarantees. The Google File System (GFS) illustrates this thinking [11]. In fact, strong consistency guarantees in a large-scale storage service are not incompatible with high throughput and availability. And the new chain replication approach to coordinating fail-stop servers, which is the subject of this paper, simultaneously supports high throughput, availability, and strong consistency.

We proceed as follows. The interface to a generic storage service is specified in §2. In §3, we explain how query and update operations are implemented using chain replication. Chain replication can be viewed as an instance of the primary/backup approach, so §4 compares them. Then, §5 summarizes experiments to analyze throughput and availability using our prototype implementation of chain replication and a simulated network. Some of these simulations compare chain replication with storage systems (like CFS [7] and PAST [19]) based on distributed hash table (DHT) routing; other simulations reveal surprising behaviors when a system employing chain replication recovers from server failures. Chain replication is compared in §6 to other work on scalable storage systems, trading consistency for availability, and replica placement. Concluding remarks appear in §7, followed by endnotes.

2 A Storage Service Interface

Clients of a storage service issue requests for query and update operations. While it would be possible to ensure that each request reaching the storage service is guaranteed to be performed, the end-to-end argument [20] suggests there is little point in doing so. Clients are better off if the storage service simply generates a reply for each request it receives and completes, because this allows lost requests and lost replies to be handled as well: a client re-issues a request if too much time has elapsed without receiving a reply.

• The reply for query(objId, opts) is derived from the value of object objId; options opts characterizes what parts of objId are returned. The value of objId remains unchanged.

• The reply for update(objId, newVal, opts) depends on options opts and, in the general case, can be a value V produced in some nondeterministic pre-programmed way involving the current value of objId and/or value newVal; V then becomes the new value of objId.¹

Query operations are idempotent, but update operations need not be. A client that re-issues a non-idempotent update request must therefore take precautions to ensure the update has not already been performed. The client might, for example, first issue a query to determine whether the current value of the object already reflects the update.

State is:
    Hist_objID : update request sequence
    Pending_objID : request set

Transitions are:
    T1: Client request r arrives:
        Pending_objID := Pending_objID ∪ {r}
    T2: Client request r ∈ Pending_objID ignored:
        Pending_objID := Pending_objID − {r}
    T3: Client request r ∈ Pending_objID processed:
        Pending_objID := Pending_objID − {r}
        if r = query(objId, opts) then
            reply according options opts based on Hist_objID
        else if r = update(objId, newVal, opts) then
            Hist_objID := Hist_objID · r
            reply according options opts based on Hist_objID

Figure 1: Client's View of an Object.

A client request that is lost before reaching the storage service is indistinguishable to that client from one that is ignored by the storage service. This means that clients would not be exposed to a new failure mode when a storage server exhibits transient outages during which client requests are ignored. Of course, acceptable client performance likely would depend on limiting the frequency and duration of transient outages.

With chain replication, the duration of each transient outage is far shorter than the time required to remove a faulty host or to add a new host. So, client request processing proceeds with minimal disruption in the face of failure, recovery, and other reconfiguration. Most other replica-management protocols either block some operations or sacrifice consistency guarantees following failures and during reconfigurations.

We specify the functionality of our storage service by giving the client view of an object's state and of that object's state transitions in response to query and update requests. Figure 1 uses pseudo-code to give such a specification for an object objID.

The figure defines the state of objID in terms of two variables: the sequence² Hist_objID of updates that have been performed on objID and a set Pending_objID of unprocessed requests.

[Figure 2 depicts a chain of servers: update requests enter at the HEAD, flow down the chain, and the TAIL receives query requests and sends all replies.]

Figure 2: A chain.

Then, the figure lists possible state transitions. Transition T1 asserts that an arriving client request is added to Pending_objID. That some pending requests are ignored is specified by transition T2—this transition is presumably not taken too frequently. Transition T3 gives a high-level view of request processing: the request r is first removed from Pending_objID; query then causes a suitable reply to be produced whereas update also appends r (denoted by ·) to Hist_objID.³
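The Figure 1 specification translates almost directly into executable form. The following sketch (our own illustration, not code from the paper; the class and method names are ours) models Hist_objID as a Python list and Pending_objID as a set, with transitions T1 through T3 as methods:

    class ObjectSpec:
        """Client-visible state of one object, per Figure 1 (illustrative only)."""

        def __init__(self):
            self.hist = []        # Hist_objID: sequence of update requests applied so far
            self.pending = set()  # Pending_objID: received but unprocessed requests

        def t1_arrive(self, request):
            # T1: an arriving client request is added to Pending_objID.
            self.pending.add(request)

        def t2_ignore(self, request):
            # T2: a pending request may simply be ignored (e.g., during an outage).
            self.pending.discard(request)

        def t3_process(self, request):
            # T3: remove the request from Pending_objID; an update is additionally
            # appended to Hist_objID; the reply is derived from Hist_objID.
            self.pending.discard(request)
            kind, _payload = request
            if kind == "update":
                self.hist.append(request)
            return self.hist

    spec = ObjectSpec()
    for r in (("update", "newVal"), ("query", None)):
        spec.t1_arrive(r)
        spec.t3_process(r)
    print(spec.hist)  # [('update', 'newVal')]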

3 Chain Replication Protocol

Servers are assumed to be fail-stop [21]:

• each server halts in response to a failure rather than making erroneous state transitions, and

• a server's halted state can be detected by the environment.

With an object replicated on t servers, as many as t − 1 of the servers can fail without compromising the object's availability. The object becomes unavailable only if all servers hosting it have failed; simulations in §5.4 explore this probability for typical storage systems. Henceforth, we assume that at most t − 1 of the servers replicating an object fail concurrently.
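For intuition about that failure probability, suppose (our simplifying assumption, not the paper's) that each of the t servers is down independently with probability p; the object is then unavailable only when all t are down, i.e., with probability p^t:

    # Illustrative only: object unavailability under independent server failures.
    def unavailability(p: float, t: int) -> float:
        return p ** t

    for t in (1, 2, 3, 10):
        print(t, unavailability(0.01, t))  # approximately 1e-2, 1e-4, 1e-6, 1e-20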

In chain replication, the servers replicating a given object objID are linearly ordered to form a chain. (See Figure 2.) The first server in the chain is called the head, the last server is called the tail, and request processing is implemented by the servers roughly as follows:

Reply Generation. The reply for every request is generated and sent by the tail.

Query Processing. Each query request is directed to the tail of the chain and processed there atomically using the replica of objID stored at the tail.

Update Processing. Each update request is directed to the head of the chain. The request is processed there atomically using the replica of objID at the head, then state changes are forwarded along a reliable FIFO link to the next element of the chain (where it is handled and forwarded), and so on until the request is handled by the tail.

Strong consistency thus follows because query requests and update requests are all processed serially at a single server (the tail).

Processing a query request involves only a single server, and that means query is a relatively cheap operation. But when an update request is processed, computation done at t − 1 of the t servers does not contribute to producing the reply and, arguably, is redundant. The redundant servers do increase the fault-tolerance, though.

Note that some redundant computation associated with the t − 1 servers is avoided in chain replication because the new value is computed once by the head and then forwarded down the chain, so each replica has only to perform a write. This forwarding of state changes also means update can be a non-deterministic operation—the non-deterministic choice is made once, by the head.
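A minimal sketch of the request paths just described, for the failure-free case: the head applies an update once and forwards the resulting state change down the chain, each successor simply writes it, and the tail alone serves queries and generates replies. The class and helper names are ours, and the reliable FIFO links are modeled as direct method calls:

    class ChainServer:
        """One replica in a chain (illustrative sketch, no failures)."""

        def __init__(self, successor=None):
            self.value = None           # replica of objID
            self.successor = successor

        def handle_update(self, new_value):
            # The head computes the new value once; every other replica just writes it.
            self.value = new_value
            if self.successor is not None:
                return self.successor.handle_update(new_value)  # forward along FIFO link
            return self.value           # the tail generates the reply

        def handle_query(self):
            return self.value           # queries are served only at the tail

    def make_chain(t):
        """Build a chain of length t; returns (head, tail)."""
        tail = ChainServer()
        head = tail
        for _ in range(t - 1):
            head = ChainServer(successor=head)
        return head, tail

    head, tail = make_chain(3)
    head.handle_update("v1")    # update enters at the head
    print(tail.handle_query())  # query answered at the tail: v1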

3.1 Protocol Details

Clients do not directly read or write variables Hist_objID and Pending_objID of Figure 1, so we are free to implement them in any way that is convenient. When chain replication is used to implement the specification of Figure 1:

• Hist_objID is defined to be Hist^T_objID, the value of Hist_objID stored by tail T of the chain, and

• Pending_objID is defined to be the set of client requests received by any server in the chain and not yet processed by the tail.

The chain replication protocols for query processing and update processing are then shown to satisfy the specification of Figure 1 by demonstrating how each state transition made by any server in the chain is equivalent either to a no-op or to allowed transitions T1, T2, or T3.

Given the descriptions above for how Hist_objID and Pending_objID are implemented by a chain (and assuming for the moment that failures do not occur), we observe that the only server transitions affecting Hist_objID and Pending_objID are: (i) a server in the chain receiving a request from a client (which affects Pending_objID), and (ii) the tail processing a client request (which affects Hist_objID). Since other server transitions are equivalent to no-ops, it suffices to show that transitions (i) and (ii) are consistent with T1 through T3.

Client Request Arrives at Chain. Clients send requests to either the head (update) or the tail (query). Receipt of a request r by either adds r to the set of requests received by a server but not yet processed by the tail. Thus, receipt of r by either adds r to Pending_objID (as defined above for a chain), and this is consistent with T1.

Request Processed by Tail. Execution causes the request to be removed from the set of requests received by any replica that have not yet been processed by the tail, and therefore it deletes the request from Pending_objID (as defined above for a chain)—the first step of T3. Moreover, the processing of that request by tail T uses replica Hist^T_objID which, as defined above, implements Hist_objID—and this is exactly what the remaining steps of T3 specify.

Coping with Server Failures

In response to detecting the failure of a server that is part of a chain (and, by the fail-stop assumption, all such failures are detected), the chain is reconfigured to eliminate the failed server. For this purpose, we employ a service, called the master, that

• detects failures of servers,

• informs each server in the chain of its new predecessor or new successor in the new chain obtained by deleting the failed server,

• informs clients which server is the head and which is the tail of the chain.

In what follows, we assume the master is a single process that never fails. This simplifies the exposition but is not a realistic assumption; our prototype implementation of chain replication actually replicates a master process on multiple hosts, using Paxos [16] to coordinate those replicas so they behave in aggregate like a single process that does not fail.

The master distinguishes three cases: (i) failure of the head, (ii) failure of the tail, and (iii) failure of some other server in the chain. The handling of each, however, depends on the following insight about how updates are propagated in a chain.

Let the server at the head of the chain be labeled H, the next server be labeled H + 1, etc., through the tail, which is given label T. Define

    Hist^i_objID ⪯ Hist^j_objID

to hold if the sequence⁴ of requests Hist^i_objID at the server with label i is a prefix of sequence Hist^j_objID at the server with label j. Because updates are sent between elements of a chain over reliable FIFO links, the sequence of updates received by each server is a prefix of those received by its successor. So we have:

Update Propagation Invariant. For servers labeled i and j such that i ≤ j holds (i.e., i is a predecessor of j in the chain) then:

    Hist^j_objID ⪯ Hist^i_objID.

Failure of the Head. This case is handled by the master removing H from the chain and making the successor to H the new head of the chain. Such a successor must exist if our assumption holds that at most t − 1 servers are faulty.

Changing the chain by deleting H is a transition and, as such, must be shown to be either a no-op or consistent with T1, T2, and/or T3 of Figure 1. This is easily done. Altering the set of servers in the chain could change the contents of Pending_objID—recall, Pending_objID is defined as the set of requests received by any server in the chain and not yet processed by the tail, so deleting server H from the chain has the effect of removing from Pending_objID those requests received by H but not yet forwarded to a successor. Removing a request from Pending_objID is consistent with transition T2, so deleting H from the chain is consistent with the specification in Figure 1.

Failure of the Tail. This case is handled by removing tail T from the chain and making predecessor T− of T the new tail of the chain. As before, such a predecessor must exist given our assumption that at most t − 1 server replicas are faulty.

This change to the chain alters the values of both Pending_objID and Hist_objID, but does so in a manner consistent with repeated T3 transitions: Pending_objID decreases in size because Hist^T_objID ⪯ Hist^{T−}_objID (due to the Update Propagation Invariant, since T− < T holds), so changing the tail from T to T− potentially increases the set of requests completed by the tail which, by definition, decreases the set of requests in Pending_objID. Moreover, as required by T3, those update requests completed by T− but not completed by T do now appear in Hist_objID because with T− now the tail, Hist_objID is defined as Hist^{T−}_objID.

Failure of Other Servers. Failure of a server S internal to the chain is handled by deleting S from the chain. The master first informs S's successor S+ of the new chain configuration and then informs S's predecessor S−. This, however, could cause the Update Propagation Invariant to be invalidated unless some means is employed to ensure update requests that S received before failing will still be forwarded along the chain (since those update requests already do appear in Hist^i_objID for any predecessor i of S). The obvious candidate to perform this forwarding is S−, but some bookkeeping and coordination are now required.

Let U be a set of requests and let <_U be a total ordering on requests in that set. Define a request sequence r to be consistent with (U, <_U) if (i) all requests in r appear in U and (ii) requests are arranged in r in ascending order according to <_U. Finally, for request sequences r and r′ consistent with (U, <_U), define r ⊕ r′ to be a sequence of all requests appearing in r or in r′ such that r ⊕ r′ is consistent with (U, <_U) (and therefore requests in sequence r ⊕ r′ are ordered according to <_U).

The Update Propagation Invariant is preserved by requiring that the first thing a replica S− connecting to a new successor S+ does is: send to S+ (using the FIFO link that connects them) those requests in Hist^{S−}_objID that might not have reached S+; only after those have been sent may S− process and forward requests that it receives subsequent to assuming its new chain position.

To this end, each server i maintains a list Sent_i of update requests that i has forwarded to some successor but that might not have been processed by the tail. The rules for adding and deleting elements on this list are straightforward: Whenever server i forwards an update request r to its successor, server i also appends r to Sent_i. The tail sends an acknowledgement ack(r) to its predecessor when it completes the processing of update request r. And upon receipt of ack(r), a server i deletes r from Sent_i and forwards ack(r) to its predecessor.
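The Sent_i bookkeeping amounts to a few lines of state per server. The following sketch (our own code, not the prototype) pushes one update down a three-server chain; the tail's acknowledgement then flows back up and prunes each Sent list:

    class Server:
        """Sent_i bookkeeping for one chain element (illustrative only)."""

        def __init__(self, name):
            self.name = name
            self.hist = []       # updates processed at this replica
            self.sent = []       # forwarded but not yet acknowledged by the tail
            self.successor = None
            self.predecessor = None

        def handle_update(self, r):
            self.hist.append(r)
            if self.successor is None:       # this server is the tail
                self.send_ack(r)             # tail acknowledges to its predecessor
            else:
                self.sent.append(r)          # remember r until ack(r) arrives
                self.successor.handle_update(r)

        def send_ack(self, r):
            if self.predecessor is not None:
                self.predecessor.receive_ack(r)

        def receive_ack(self, r):
            self.sent.remove(r)              # r has been processed by the tail
            self.send_ack(r)                 # propagate ack(r) toward the head

    h, m, t = Server("H"), Server("M"), Server("T")
    h.successor, m.successor = m, t
    m.predecessor, t.predecessor = h, m
    h.handle_update("u1")
    print(h.sent, m.sent, t.hist)  # [] [] ['u1'] -- Sent lists drained by the acks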

A request received by the tail must have been received by all of its predecessors in the chain, so we can conclude:

Inprocess Requests Invariant. If i ≤ j then

    Hist^i_objID = Hist^j_objID ⊕ Sent_i.

[Figure 3 depicts the master and servers S−, S, and S+, with the four numbered messages exchanged when internal replica S is deleted.]

Figure 3: Space-time diagram for deletion of internal replica.

Thus, the Update Propagation Invariant will be maintained if S−, upon receiving notification from the master that S+ is its new successor, first forwards the sequence of requests in Sent_{S−} to S+. Moreover, there is no need for S− to forward the prefix of Sent_{S−} that already appears in Hist^{S+}_objID.

The protocol whose execution is depicted in Figure 3 embodies this approach (including the optimization of not sending more of the prefix than necessary). Message 1 informs S+ of its new role; message 2 acknowledges and informs the master what is the sequence number sn of the last update request S+ has received; message 3 informs S− of its new role and of sn so S− can compute the suffix of Sent_{S−} to send to S+; and message 4 carries that suffix.

Extending a Chain. Failed servers are removed from chains. But shorter chains tolerate fewer failures, and object availability ultimately could be compromised if ever there are too many server failures. The solution is to add new servers when chains get short. Provided the rate at which servers fail is not too high and adding a new server does not take too long, then chain length can be kept close to the desired t servers (so t − 1 further failures are needed to compromise object availability).

A new server could, in theory, be added anywhere in a chain. In practice, adding a server T+ to the very end of a chain seems simplest. For a tail T+, the value of Sent_{T+} is always the empty list, so initializing Sent_{T+} is trivial. All that remains is to initialize local object replica Hist^{T+}_objID in a way that satisfies the Update Propagation Invariant.

The initialization of Hist^{T+}_objID can be accomplished by having the chain's current tail T forward the object replica Hist^T_objID it stores to T+. The forwarding (which may take some time if the object is large) can be concurrent with T's processing query requests from clients and processing updates from its predecessor, provided each update is also appended to Sent_T. Since Hist^{T+}_objID ⪯ Hist^T_objID holds throughout this forwarding, the Update Propagation Invariant holds. Therefore, once

    Hist^T_objID = Hist^{T+}_objID ⊕ Sent_T

holds, the Inprocess Requests Invariant is established and T+ can begin serving as the chain's tail:

• T is notified that it no longer is the tail. T is thereafter free to discard query requests it receives from clients, but a more sensible policy is for T to forward such requests to new tail T+.

• Requests in Sent_T are sent (in sequence) to T+.

• The master is notified that T+ is the new tail.

• Clients are notified that query requests should be directed to T+.
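A compact sketch of this handoff, under the simplifying assumptions that the object fits in memory and that no new updates arrive after the copy of Hist^T_objID is taken (the dictionary-based representation and names are ours):

    def extend_chain(old_tail, new_tail, master, clients):
        # T forwards its object replica to T+; Sent_{T+} starts out empty.
        new_tail["hist"] = list(old_tail["hist"])
        new_tail["sent"] = []
        # Updates processed by T during the copy would be appended to old_tail["sent"].
        # Once Hist^T = Hist^{T+} ⊕ Sent_T holds, complete the handoff:
        old_tail["is_tail"] = False            # T no longer answers queries itself
        old_tail["successor"] = new_tail       # and forwards requests to T+
        new_tail["hist"] += old_tail["sent"]   # requests in Sent_T are sent to T+
        master["tail"] = new_tail              # the master learns the new tail
        clients["tail"] = new_tail             # clients direct queries to T+

    old_tail = {"hist": ["u1", "u2"], "sent": ["u3"], "is_tail": True}  # "u3" arrived mid-copy
    new_tail, master, clients = {}, {}, {}
    extend_chain(old_tail, new_tail, master, clients)
    print(new_tail["hist"])  # ['u1', 'u2', 'u3']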

4 Primary/Backup Protocols

Chain replication is a form of primary/backup approach [3], which itself is an instance of the state machine approach [22] to replica management. In the primary/backup approach, one server, designated the primary

• imposes a sequencing on client requests (and thereby ensures strong consistency holds),

• distributes (in sequence) to other servers, known as backups, the client requests or resulting updates,

• awaits acknowledgements from all non-faulty backups, and

• after receiving those acknowledgements then sends a reply to the client.

If the primary fails, one of the backups is promoted into that role.

With chain replication, the primary's role in sequencing requests is shared by two replicas. The head sequences update requests; the tail extends that sequence by interleaving query requests. This sharing of responsibility not only partitions the sequencing task but also enables lower-latency and lower-overhead processing for query requests, because only a single server (the tail) is involved in processing a query and that processing is never delayed by activity elsewhere in the chain. Compare that to the primary/backup approach, where the primary, before responding to a query, must await acknowledgements from backups for prior updates.

In both chain replication and in the primary/backup approach, update requests must be disseminated to all servers replicating an object or else the replicas will diverge. Chain replication does this dissemination serially, resulting in higher latency than the primary/backup approach, where requests are distributed to backups in parallel. With parallel dissemination, the time needed to generate a reply is proportional to the maximum latency of any non-faulty backup; with serial dissemination, it is proportional to the sum of those latencies.

Simulations reported in §5 quantify all of these performance differences, including variants of chain replication and the primary/backup approach in which query requests are sent to any server (with expectations of trading increased performance for the strong consistency guarantee).

Simulations are not necessary for understanding the differences in how server failures are handled by the two approaches, though. The central concern here is the duration of any transient outage experienced by clients when the service reconfigures in response to a server failure; a second concern is the added latency that server failures introduce.

The delay to detect a server failure is by far the dominant cost, and this cost is identical for both chain replication and the primary/backup approach. What follows, then, is an analysis of the recovery costs for each approach assuming that a server failure has been detected; message delays are presumed to be the dominant source of protocol latency.

For chain replication, there are three cases to consider: failure of the head, failure of a middle server, and failure of the tail.

• Head Failure. Query processing continues uninterrupted. Update processing is unavailable for 2 message delivery delays while the master broadcasts a message to the new head and its successor, and then it notifies all clients of the new head using a broadcast.

• Middle Server Failure. Query processing continues uninterrupted. Update processing can be delayed but update requests are not lost, hence no transient outage is experienced, provided some server in a prefix of the chain that has received the request remains operating. Failure of a middle server can lead to a delay in processing an update request—the protocol of Figure 3 involves 4 message delivery delays.

• Tail Failure. Query and update processing are both unavailable for 2 message delivery delays while the master sends a message to the new tail and then notifies all clients of the new tail using a broadcast.

With the primary/backup approach, there are two cases to consider: failure of the primary and failure of a backup. Query and update requests are affected the same way for each.

• Primary Failure. A transient outage of 5 message delays is experienced, as follows. The master detects the failure and broadcasts a message to all backups, requesting the number of updates each has processed and telling them to suspend processing requests. Each backup replies to the master. The master then broadcasts the identity of the new primary to all backups. The new primary is the one having processed the largest number of updates, and it must then forward to the backups any updates that they are missing. Finally, the master broadcasts a message notifying all clients of the new primary.

• Backup Failure. Query processing continues uninterrupted provided no update requests are in progress. If an update request is in progress then a transient outage of at most 1 message delay is experienced while the master sends a message to the primary indicating that acknowledgements will not be forthcoming from the faulty backup and requests should not subsequently be sent there.

So the worst case outage for chain replication (tail failure) is never as long as the worst case outage for primary/backup (primary failure); and the best case for chain replication (middle server failure) is shorter than the best case outage for primary/backup (backup failure). Still, if duration of transient outage is the dominant consideration in designing a storage service then choosing between chain replication and the primary/backup approach requires information about the mix of request types and about the chances of various servers failing.

5 Simulation Experiments

To better understand throughput and availability for chain replication, we performed a series of experiments in a simulated network. These involve prototype implementations of chain replication as well as some of the alternatives. Because we are mostly interested in delays intrinsic to the processing and communications that chain replication entails, we simulated a network with infinite bandwidth but with latencies of 1 ms per message.

5.1 Single Chain, No Failures

First, we consider the simple case when there is only one chain, no failures, and replication factor t is 2, 3, and 10. We compare throughput for four different replication management alternatives:

• chain: Chain replication.

• p/b: Primary/backup.

• weak-chain: Chain replication modified so query requests go to any random server.

• weak-p/b: Primary/backup modified so query requests go to any random server.

Note, weak-chain and weak-p/b do not implement the strong consistency guarantees that chain and p/b do.

We fix the query latency at a server to be 5 ms and fix the update latency to be 50 ms. (These numbers are based on actual values for querying or updating a web search index.) We assume each update entails some initial processing involving a disk read, and that it is cheaper to forward object-differences for storage than to repeat the update processing anew at each replica; we expect that the latency for a replica to process an object-difference message would be 20 ms (corresponding to a couple of disk accesses and a modest computation).

So, for example, if a chain comprises three servers, the total latency to perform an update is 94 ms: 1 ms for the message from the client to the head, 50 ms for an update latency at the head, 20 ms to process the object-difference message at each of the two other servers, and three additional 1 ms forwarding latencies. Query latency is only 7 ms, however.
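The 94 ms and 7 ms figures follow directly from the stated parameters; a small check (our own arithmetic helper, not code from the prototype) generalizes them to any chain length:

    # Latencies in ms from this section: 1 ms per message, 50 ms to apply an update at
    # the head, 20 ms per object-difference at each other replica, 5 ms per query.
    MSG, UPDATE, DIFF, QUERY = 1, 50, 20, 5

    def update_latency(t):
        # client -> head, update at head, (t - 1) forwarding hops each followed by
        # object-difference processing, then the reply from the tail to the client.
        return MSG + UPDATE + (t - 1) * (MSG + DIFF) + MSG

    def query_latency():
        # client -> tail, query processing at the tail, reply to the client.
        return MSG + QUERY + MSG

    print(update_latency(3), query_latency())  # 94 7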

In Figure 4 we graph total throughput as a function of the percentage of requests that are updates for t = 2, t = 3, and t = 10. There are 25 clients, each doing a mix of requests split between queries and updates consistent with the given percentage. Each client submits one request at a time, delaying between requests only long enough to receive the response for the previous request. So the clients together can have as many as 25 concurrent requests outstanding. Throughput for weak-chain and weak-p/b was found to be virtually identical, so Figure 4 has only a single curve—labeled weak—rather than separate curves for weak-chain and weak-p/b.

[Figure 4 has three panels, (a) t = 2, (b) t = 3, and (c) t = 10, each plotting total throughput against the percentage of updates (0 to 50%) for the chain, p/b, and weak alternatives.]

Figure 4: Request throughput as a function of the percentage of updates for various replication management alternatives chain, p/b, and weak (denoting weak-chain and weak-p/b) and for replication factors t.

Observe that chain replication (chain) has equal or superior performance to primary/backup (p/b) for all percentages of updates and each replication factor investigated. This is consistent with our expectations, because the head and the tail in chain replication share a load that, with the primary/backup approach, is handled solely by the primary.

The curves for the weak variant of chain replication are perhaps surprising, as these weak variants are seen to perform worse than chain replication (with its strong consistency) when there are more than 15% update requests. Two factors are involved:

• The weak variants of chain replication and primary/backup outperform pure chain replication for query-heavy loads by distributing the query load over all servers, an advantage that increases with replication factor.

• Once the percentage of update requests increases, ordinary chain replication outperforms its weak variant—since all updates are done at the head. In particular, under pure chain replication (i) queries are not delayed at the head awaiting completion of update requests (which are relatively time consuming) and (ii) there is more capacity available at the head for update request processing if query requests are not also being handled there.

Since weak-chain and weak-p/b do not implement strong consistency guarantees, there would seem to be surprisingly few settings where these replication management schemes would be preferred.

Finally, note that the throughput of both chain replication and primary/backup is not affected by replication factor provided there are sufficient concurrent requests so that multiple requests can be pipelined.

5.2 Multiple Chains, No Failures

If each object is managed by a separate chain and objects are large, then adding a new replica could involve considerable delay because of the time required for transferring an object's state to that new replica. If, on the other hand, objects are small, then a large storage service will involve many objects. Each processor in the system is now likely to host servers from multiple chains—the costs of multiplexing the processors and communications channels may become prohibitive. Moreover, the failure of a single processor now affects multiple chains.

A set of objects can always be grouped into a single volume, itself something that could be considered an object for purposes of chain replication, so a designer has considerable latitude in deciding object size.

For the next set of experiments, we assume

• a constant number of volumes,

• a hash function maps each object to a volume, hence to a unique chain, and

• each chain comprises servers hosted by processors selected from among those implementing the storage service.

[Figure 5 plots average throughput against the number of servers (up to about 140), with separate curves for queries only, 5%, 10%, 25%, and 50% updates, and updates only.]

Figure 5: Average request throughput per client as a function of the number of servers for various percentages of updates.

Clients are assumed to send their requests to a dispatcher which (i) computes the hash to determine the volume, hence chain, storing the object of concern and then (ii) forwards that request to the corresponding chain. (The master sends configuration information for each volume to the dispatcher, avoiding the need for the master to communicate directly with clients. Interposing a dispatcher adds a 1 ms delay to updates and queries, but doesn't affect throughput.) The reply produced by the chain is sent directly to the client and not by way of the dispatcher.
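The dispatcher's first step is an ordinary hash from object identifier to volume. A hedged sketch (the hash choice and names are ours, not necessarily what the prototype uses):

    import hashlib

    NUM_VOLUMES = 5000  # as in the experiments of this section

    def volume_of(obj_id: str) -> int:
        """Map an object to a volume, hence to a unique chain (illustrative only)."""
        digest = hashlib.sha1(obj_id.encode()).digest()
        return int.from_bytes(digest[:8], "big") % NUM_VOLUMES

    # The dispatcher would then look up the chain (head and tail) currently
    # configured for this volume and forward the request accordingly.
    print(volume_of("object-42"))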

There are 25 clients in our experiments, each submitting queries and updates at random, uniformly distributed over the chains. The clients send requests as fast as they can, subject to the restriction that each client can have only one request outstanding at a time.

To facilitate comparisons with the GFS experiments [11], we assume 5000 volumes each replicated three times, and we vary the number of servers. We found little or no difference among the chain, p/b, weak-chain, and weak-p/b alternatives, so Figure 5 shows the average request throughput per client for one—chain replication—as a function of the number of servers, for varying percentages of update requests.

5.3 Effects of Failures on Throughput

With chain replication, each server failure causes a three-stage process to start:

1. Some time (we conservatively assume 10 seconds in our experiments) elapses before the master detects the server failure.

2. The offending server is then deleted from the chain.

3. The master ultimately adds a new server to that chain and initiates a data recovery process, which takes time proportional to (i) how much data was being stored on the faulty server and (ii) the available network bandwidth.

Delays in detecting a failure or in deleting a faulty server from a chain can increase request processing latency and can increase transient outage duration. The experiments in this section explore this.

We assume a storage service characterized by the parameters in Table 1; these values are inspired by what is reported for GFS [11]. The assumption about network bandwidth is based on reserving for data recovery at most half the bandwidth in a 100 Mbit/second network; the time to copy the 150 Gigabytes stored on one server is now 6 hours and 40 minutes.
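The 6 hours and 40 minutes figure is just the per-server data volume divided by the reserved recovery bandwidth (using decimal gigabytes and megabytes, which is our reading of the units):

    data_mb = 150 * 1000        # 150 Gigabytes stored per server
    rate_mb_per_s = 6.25        # half of a 100 Mbit/sec link
    hours = data_mb / rate_mb_per_s / 3600
    print(hours)                # 6.666..., i.e., 6 hours and 40 minutes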

In order to measure the effects of failures on the storage service, we apply a load. The exact details of the load do not matter greatly. Our experiments use eleven clients. Each client repeatedly chooses a random object, performs an operation, and awaits a reply; a watchdog timer causes the client to start the next loop iteration if 3 seconds elapse and no reply has been received. Ten of the clients exclusively submit query operations; the eleventh client exclusively submits update operations.

parameter                                              value
number of servers (N)                                  24
number of volumes                                      5000
chain length (t)                                       3
data stored per server                                 150 Gigabytes
maximum network bandwidth devoted to
  data recovery to/from any server                     6.25 Megabytes/sec
server reboot time after a failure                     10 minutes

Table 1: Simulated Storage Service Characteristics.

[Figure 6 shows, for (a) one failure and (b) two failures, query throughput (roughly 90 to 110) and update throughput (roughly 9 to 11) over the two simulated hours.]

Figure 6: Query and update throughput with one or two failures at time 00:30.

Each experiment described executes for 2 simulated hours. Thirty minutes into the experiment, the failure of one or two servers is simulated (as in the GFS experiments). The master detects that failure and deletes the failed server from all of the chains involving that server. For each chain that was shortened by the failure, the master then selects a new server to add. Data recovery to those servers is started.

Figure 6(a) shows aggregate query and update throughputs as a function of time in the case a single server F fails. Note the sudden drop in throughput when the simulated failure occurs 30 minutes into the experiment. The resolution of the x-axis is too coarse to see that the throughput is actually zero for about 10 seconds after the failure, since the master requires a bit more than 10 seconds to detect the server failure and then delete the failed server from all chains.

With the failed server deleted from all chains, processing now can proceed, albeit at a somewhat lower rate because fewer servers are operational (and the same request processing load must be shared among them) and because data recovery is consuming resources at various servers. Lower curves on the graph reflect this. After 10 minutes, failed server F becomes operational again, and it becomes a possible target for data recovery. Every time data recovery of some volume successfully completes at F, query throughput improves (as seen on the graph). This is because F, now the tail for another chain, is handling a growing proportion of the query load.

One might expect that after all data recovery concludes, the query throughput would be what it was at the start of the experiment. The reality is more subtle, because volumes are no longer uniformly distributed among the servers. In particular, server F will now participate in fewer chains than other servers but will be the tail of every chain in which it does participate. So the load is no longer well balanced over the servers, and aggregate query throughput is lower.

Update throughput decreases to 0 at the time of the server failure and then, once the master deletes the failed server from all chains, throughput is actually better than it was initially. This throughput improvement occurs because the server failure causes some chains to be length 2 (rather than 3), reducing the amount of work involved in performing an update.

The GFS experiments [11] consider the case where two servers fail, too, so Figure 6(b) depicts this for our chain replication protocol. Recovery is still smooth, although it takes additional time.

5.4 Large Scale Replication of Critical Data

As the number of servers increases, so should the aggregate rate of server failures. If too many servers fail, then a volume might become unavailable. The probability of this depends on how volumes are placed on servers and, in particular, the extent to which parallelism is possible during data recovery.

[Figure 7 has three log-scale panels, (a) ring, (b) rndseq, and (c) rndpar, each plotting MTBU in days against the number of servers (1 to 100) with separate curves for t = 1, 2, 3, 4.]

Figure 7: The MTBU and 99% confidence intervals as a function of the number of servers and replication factor for three different placement strategies: (a) DHT-based placement with maximum possible parallel recovery; (b) random placement, but with parallel recovery limited to the same degree as is possible with DHTs; (c) random placement with maximum possible parallel recovery.

We have investigated three volume placement strategies (the ring and random placements are sketched in code after the list):

• ring: Replicas of a volume are placed at consecutive servers on a ring, determined by a consistent hash of the volume identifier. This is the strategy used in CFS [7] and PAST [19]. The number of parallel data recoveries possible is limited by the chain length t.

• rndpar: Replicas of a volume are placed randomly on servers. This is essentially the strategy used in GFS.⁵ Notice that, given enough servers, there is no limit on the number of parallel data recoveries possible.

• rndseq: Replicas of a volume are placed randomly on servers (as in rndpar), but the maximum number of parallel data recoveries is limited by t (as in ring). This strategy is not used in any system known to us but is a useful benchmark for quantifying the impacts of placement and parallel recovery.
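The ring and random placements can be sketched in a few lines; rndpar and rndseq place replicas the same way and differ only in how much recovery parallelism is permitted. This is our own illustration: the hash function, ring size, and names are assumptions, not details taken from CFS, PAST, or GFS:

    import hashlib
    import random

    def _point(key: str) -> int:
        # Hash a key onto a 2^32 ring (illustrative choice of hash and ring size).
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:4], "big")

    def ring_placement(volume_id: str, servers: list, t: int) -> list:
        # ring: t consecutive servers on a consistent-hash ring of server identifiers.
        ordered = sorted(servers, key=_point)
        points = [_point(s) for s in ordered]
        start = next((i for i, p in enumerate(points) if p >= _point(volume_id)), 0)
        return [ordered[(start + k) % len(ordered)] for k in range(t)]

    def random_placement(volume_id: str, servers: list, t: int) -> list:
        # rndpar and rndseq: t servers chosen uniformly at random
        # (volume_id does not influence the choice here).
        return random.sample(servers, t)

    servers = ["srv%d" % i for i in range(24)]
    print(ring_placement("volume-17", servers, 3))
    print(random_placement("volume-17", servers, 3))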

To understand the advantages of parallel data recovery, consider a server F that fails and was participating in chains C_1, C_2, ..., C_n. For each chain C_i, data recovery requires a source from which the volume data is fetched and a host that will become the new element of chain C_i. Given enough processors and no constraints on the placement of volumes, it is easy to ensure that the new elements are all disjoint. And with random placement of volumes, it is likely that the sources will be disjoint as well. With disjoint sources and new elements, data recovery for chains C_1, C_2, ..., C_n can occur in parallel. And a shorter interval for data recovery of C_1, C_2, ..., C_n implies that there is a shorter window of vulnerability during which a small number of concurrent failures would render some volume unavailable.

We seek to quantify the mean time between unavailability (MTBU) of any object as a function of the number of servers and the placement strategy. Each server is assumed to exhibit exponentially distributed failures with an MTBF (Mean Time Between Failures) of 24 hours.⁶ As the number of servers in a storage system increases, so would the number of volumes (otherwise, why add servers). In our experiments, the number of volumes is defined to be 100 times the initial number of servers, with each server storing 100 volumes at time 0.

We postulate that the time it takes to copy all the data from one server to another is four hours, which corresponds to copying 100 Gigabytes across a 100 Mbit/sec network restricted so that only half the bandwidth can be used for data recovery. As in the GFS experiments, the maximum number of parallel data recoveries on the network is limited to 40% of the servers, and the minimum transfer time is set to 10 seconds (the time it takes to copy an individual GFS object, which is 64 MBytes).

Figure 7(a) shows that the MTBU for the ring strategy appears to have an approximately Zipfian distribution as a function of the number of servers.

Thus, in order to maintain a particular MTBU, it is necessary to grow chain length t when increasing the number of servers. From the graph, it seems as though chain length needs to be increased as the logarithm of the number of servers.

Figure 7(b) shows the MTBU for rndseq. For t > 1, rndseq has lower MTBU than ring. Compared to ring, random placement is inferior because with random placement there are more sets of t servers that together store a copy of a chain, and therefore there is a higher probability of a chain getting lost due to failures.

However, random placement makes additional opportunities for parallel recovery possible if there are enough servers. Figure 7(c) shows the MTBU for rndpar. For few servers, rndpar performs the same as rndseq, but the increasing opportunity for parallel recovery with the number of servers improves the MTBU, and eventually rndpar outperforms rndseq and, more importantly, it outperforms ring.

6 Related Work

Scalability. Chain replication is an example of what Jiménez-Peris and Patiño-Martínez [14] call a ROWAA (read one, write all available) approach. They report that ROWAA approaches provide superior scaling of availability to quorum techniques, claiming that availability of ROWAA approaches improves exponentially with the number of replicas. They also argue that non-ROWAA approaches to replication will necessarily be inferior. Because ROWAA approaches also exhibit better throughput than the best known quorum systems (except for nearly write-only applications) [14], ROWAA would seem to be the better choice for replication in most real settings.

Many file services trade consistency for performance and scalability. Examples include Bayou [17], Ficus [13], Coda [15], and Sprite [5]. Typically, these systems allow continued operation when a network partitions by offering tools to fix inconsistencies semi-automatically. Our chain replication does not offer graceful handling of partitioned operation, trading that instead for supporting all three of: high performance, scalability, and strong consistency.

Large-scale peer-to-peer reliable file systems are a relatively recent avenue of inquiry. OceanStore [6], FARSITE [2], and PAST [19] are examples. Of these, only OceanStore provides strong (in fact, transactional) consistency guarantees.

Google's File System (GFS) [11] is a large-scale cluster-based reliable file system intended for applications similar to those motivating the invention of chain replication. But in GFS, concurrent overwrites are not serialized and read operations are not synchronized with write operations. Consequently, different replicas can be left in different states, and content returned by read operations may appear to vanish spontaneously from GFS. Such weak semantics imposes a burden on programmers of applications that use GFS.

Availability versus Consistency. Yu and Vahdat [25] explore the trade-off between consistency and availability. They argue that even in relaxed consistency models, it is important to stay as close to strong consistency as possible if availability is to be maintained in the long run. On the other hand, Gray et al. [12] argue that systems with strong consistency have unstable behavior when scaled up, and they propose the tentative update transaction for circumventing these scalability problems.

Amza et al. [4] present a one-copy serializable transaction protocol that is optimized for replication. As in chain replication, updates are sent to all replicas whereas queries are processed only by replicas known to store all completed updates. (In chain replication, the tail is the one replica known to store all completed updates.) The protocol of [4] performs as well as replication protocols that provide weak consistency, and it scales well in the number of replicas. No analysis is given for behavior in the face of failures.

Replica Placement. Previous work on replica placement has focussed on achieving high throughput and/or low latency rather than on supporting high availability. Acharya and Zdonik [1] advocate locating replicas according to predictions of future accesses (basing those predictions on past accesses). In the Mariposa project [23], a set of rules allows users to specify where to create replicas, whether to move data to the query or the query to the data, where to cache data, and more. Consistency is transactional, but no consideration is given to availability. Wolfson et al. consider strategies to optimize database replica placement in order to optimize performance [24]. The OceanStore project also considers replica placement [10, 6] but from the CDN (Content Distribution Network, such as Akamai) perspective of creating as few replicas as possible while supporting certain quality of service guarantees. There is a significant body of work (e.g., [18]) concerned with placement of web page replicas as well, all from the perspective of reducing latency and network load.

Douceur and Wattenhofer investigate how to maximize the worst-case availability of files in FARSITE [2], while spreading the storage load evenly across all servers [8, 9]. Servers are assumed to have varying availabilities. The algorithms they consider repeatedly swap files between machines if doing so improves file availability. The results are of a theoretical nature for simple scenarios; it is unclear how well these algorithms will work in a realistic storage system.

7 Concluding Remarks

Chain replication supports high throughput for query and update requests, high availability of data objects, and strong consistency guarantees. This is possible, in part, because storage services built using chain replication can and do exhibit transient outages but clients cannot distinguish such outages from lost messages. Thus, the transient outages that chain replication introduces do not expose clients to new failure modes—chain replication represents an interesting balance between what failures it hides from clients and what failures it doesn't.

When chain replication is employed, high availability of data objects comes from carefully selecting a strategy for placement of volume replicas on servers. Our experiments demonstrated that with DHT-based placement strategies, availability is unlikely to scale with increases in the numbers of servers; but we also demonstrated that random placement of volumes does permit availability to scale with the number of servers if this placement strategy is used in concert with parallel data recovery, as introduced for GFS.

Our current prototype is intended primarily for use in relatively homogeneous LAN clusters. Were our prototype to be deployed in a heterogeneous wide-area setting, then uniform random placement of volume replicas would no longer make sense. Instead, replica placement would have to depend on access patterns, network proximity, and observed host reliability. Protocols to re-order the elements of a chain would likely become crucial in order to control load imbalances.

Our prototype chain replication implementation consists of 1500 lines of Java code, plus another 2300 lines of Java code for a Paxos library. The chain replication protocols are structured as a library that makes upcalls to a storage service (or other application). The experiments in this paper assumed a "null service" on a simulated network. But the library also runs over the Java socket library, so it could be used to support a variety of storage service-like applications.

Acknowledgements. Thanks to our colleagues Hakon Brugard, Kjetil Jacobsen, and Knut Omang at FAST who first brought this problem to our attention. Discussions with Mark Linderman and Sarah Chung were helpful in revising an earlier version of this paper. We are also grateful for the comments of the OSDI reviewers and shepherd Margo Seltzer. A grant from the Research Council of Norway to FAST ASA is noted and acknowledged.

Van Renesse and Schneider are supported, in part, by AFOSR grant F49620–03–1–0156 and DARPA/AFRL-IFGA grant F30602–99–1–0532, although the views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of these organizations or the U.S. Government.

Notes

1. The case where V = newVal yields a semantics for update that is simply a file system write operation; the case where V = F(newVal, objID) amounts to support for atomic read-modify-write operations on objects. Though powerful, this semantics falls short of supporting transactions, which would allow a request to query and/or update multiple objects indivisibly.

2. An actual implementation would probably store the current value of the object rather than storing the sequence of updates that produces this current value. We employ a sequence of updates representation here because it simplifies the task of arguing that strong consistency guarantees hold.

3. If Hist_objID stores the current value of objID rather than its entire history then "Hist_objID · r" should be interpreted to denote applying the update to the object.

4. If Hist^i_objID is the current state rather than a sequence of updates, then ⪯ is defined to be the "prior value" relation rather than the "prefix of" relation.

5. Actually, the placement strategy is not discussed in [11]. GFS does some load balancing that results in an approximately even load across the servers, and in our simulations we expect that random placement is a good approximation of this strategy.

6. An unrealistically short MTBF was selected here to facilitate running long-duration simulations.

References

[1] S. Acharya and S.B. Zdonik. An efficient scheme for dynamic data replication. Technical Report CS-93-43, Brown University, September 1993.

[2] A. Adya, W.J. Bolosky, M. Castro, G. Cermak, R. Chaiken, J.R. Douceur, J. Howell, J.R. Lorch, M. Theimer, and R.P. Wattenhofer. FARSITE: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment. In Proc. of the 5th Symp. on Operating Systems Design and Implementation, Boston, MA, December 2002. USENIX.

[3] P.A. Alsberg and J.D. Day. A principle for resilient sharing of distributed resources. In Proc. of the 2nd Int. Conf. on Software Engineering, pages 627–644, October 1976.

[4] C. Amza, A.L. Cox, and W. Zwaenepoel. Distributed Versioning: Consistent replication for scaling back-end databases of dynamic content web sites. In Proc. of Middleware'03, pages 282–304, Rio de Janeiro, Brazil, June 2003.

[5] M.G. Baker and J.K. Ousterhout. Availability in the Sprite distributed file system. Operating Systems Review, 25(2):95–98, April 1991. Also appeared in the 4th ACM SIGOPS European Workshop – Fault Tolerance Support in Distributed Systems.

[6] Y. Chen, R.H. Katz, and J. Kubiatowicz. Dynamic replica placement for scalable content delivery. In Proc. of the 1st Int. Workshop on Peer-To-Peer Systems, Cambridge, MA, March 2002.

[7] F. Dabek, M.F. Kaashoek, D. Karger, R. Morris, and I. Stoica. Wide-area cooperative storage with CFS. In Proc. of the 18th ACM Symp. on Operating Systems Principles, Banff, Canada, October 2001.

[8] J.R. Douceur and R.P. Wattenhofer. Competitive hill-climbing strategies for replica placement in a distributed file system. In Proc. of the 15th International Symposium on DIStributed Computing, Lisbon, Portugal, October 2001.

[9] J.R. Douceur and R.P. Wattenhofer. Optimizing file availability in a secure serverless distributed file system. In Proc. of the 20th Symp. on Reliable Distributed Systems. IEEE, 2001.

[10] D. Geels and J. Kubiatowicz. Replica management should be a game. In Proc. of the 10th European SIGOPS Workshop, Saint-Emilion, France, September 2002. ACM.

[11] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In Proc. of the 19th ACM Symp. on Operating Systems Principles, Bolton Landing, NY, October 2003.

[12] J. Gray, P. Helland, P. O'Neil, and D. Shasha. The dangers of replication and a solution. In Proc. of the International Conference on Management of Data (SIGMOD), pages 173–182. ACM, June 1996.

[13] J.S. Heidemann and G.J. Popek. File system development with stackable layers. ACM Transactions on Computer Systems, 12(1):58–89, February 1994.

[14] R. Jiménez-Peris and M. Patiño-Martínez. Are quorums an alternative for data replication? ACM Transactions on Database Systems, 28(3):257–294, September 2003.

[15] J. Kistler and M. Satyanarayanan. Disconnected operation in the Coda file system (preliminary version). ACM Transactions on Computer Systems, 10(1):3–25, February 1992.

[16] L. Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133–169, 1998.

[17] K. Petersen, M.J. Spreitzer, D.B. Terry, M.M. Theimer, and A.J. Demers. Flexible update propagation for weakly consistent replication. In Proc. of the 16th ACM Symp. on Operating Systems Principles, pages 288–301, Saint-Malo, France, October 1997.

[18] L. Qiu, V.N. Padmanabhan, and G.M. Voelker. On the placement of web server replicas. In Proc. of the 20th INFOCOM, Anchorage, AK, March 2001. IEEE.

[19] A. Rowstron and P. Druschel. Storage management and caching in PAST, a large scale, persistent peer-to-peer storage utility. In Proc. of the 18th ACM Symp. on Operating Systems Principles, Banff, Canada, October 2001.

[20] J. Saltzer, D. Reed, and D. Clark. End-to-end arguments in system design. ACM Transactions on Computer Systems, 2(4):277–288, November 1984.

[21] F.B. Schneider. Byzantine generals in action: Implementing fail-stop processors. ACM Transactions on Computer Systems, 2(2):145–154, May 1984.

[22] F.B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys, 22(4):299–319, December 1990.

[23] M. Stonebraker, P.M. Aoki, R. Devine, W. Litwin, and M. Olson. Mariposa: A new architecture for distributed data. In Proc. of the 10th Int. Conf. on Data Engineering, Houston, TX, 1994.

[24] O. Wolfson, S. Jajodia, and Y. Huang. An adaptive data replication algorithm. ACM Transactions on Database Systems, 22(2):255–314, June 1997.

[25] H. Yu and A. Vahdat. The cost and limits of availability for replicated services. In Proc. of the 18th ACM Symp. on Operating Systems Principles, Banff, Canada, October 2001.
