Distributed Systems: Consistency & Replication
Alberto Montresor
University of Trento, Italy
2011/11/02
“Redundancy is our main avenue of survival” (Robert Silverberg, “Shadrach in the Furnace”)
Alberto Montresor (UniTN) DS - Replication 2011/11/02 1 / 82
Contents
1 Introduction to replicated systems
2 Consistency models
   Introduction
   Strict consistency
   Linearizability
   Sequential consistency
   Eventual Consistency
   Client-centric consistency
3 Replication architectures
   Overview
   Primary-Backup
   Quorum protocols
   State machines
   Client-centric consistency
4 CAP Theorem
5 Bibliography
Introduction to replicated systems
Introduction
Definition (Availability)
The probability that a system will provide its required service, or the ratio of the total time a system is capable of being used during a given interval to the length of the interval:
$A = \frac{E[\text{uptime}]}{E[\text{uptime} + \text{downtime}]}$
Example
One single server
On average, crashes once per week (mtbf: 10,080 minutes)
Two minutes to reboot (mtbr: 2 minutes)
$A = \frac{10080}{10080 + 2} \approx 0.9998$
Example
Ten servers
mtbf, mtbr as before
All needed at the same time to perform the service
$p_f = \frac{2}{10082}$
$A = (1 - p_f)^{10} \approx 0.998$
Example
Ten servers
mtbf, mtbr as before
One replica needed to perform the service
$p_f = \frac{2}{10082}$
$A = 1 - p_f^{10} \approx 1 - 9.4 \times 10^{-38}$
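The three availability examples can be reproduced with a few lines of Python. This is a minimal sketch that assumes independent server failures; the function names are illustrative and not part of the slides. Exact fractions are used because the unavailability of the "one replica needed" case is far below floating-point resolution near 1.

```python
from fractions import Fraction

MTBF = 10080  # minutes between failures (one week), from the example
MTBR = 2      # minutes needed to reboot, from the example


def availability(mtbf, mtbr):
    """Availability of a single server: E[uptime] / E[uptime + downtime]."""
    return Fraction(mtbf, mtbf + mtbr)


def all_needed(a, n):
    """Availability when all n servers must be up (assumes independence)."""
    return a ** n


def one_needed_unavailability(a, n):
    """Probability that all n replicas are down at once (assumes independence)."""
    return (1 - a) ** n


a = availability(MTBF, MTBR)
print(float(a))                                 # single server
print(float(all_needed(a, 10)))                 # ten servers, all needed
print(float(one_needed_unavailability(a, 10)))  # ten replicas, one needed
```

Reporting the unavailability $p_f^{10}$ rather than $A$ itself avoids the value being rounded to exactly 1.0.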
Replication
How to increase availability:
Avoid single points of failure
Use replication (time/space)
Replication in space:
Run parallel copies
Vote on replica output
High-availability, high-cost
Replication in time:
When a replica fails, restart it (or replace it)
Lower maintenance, lower availability
Replication advantages:
Replicating a service increases its availability
Performance benefits:
  - Geographical co-location
  - Load-balancing
  - No bottlenecks
Replication drawbacks:
Trade-off between availability and consistency
Transparent replication is difficult
Consistency problem
The consistency problem:
Whenever a copy is modified, that copy becomes different from the rest
Modifications have to be carried out on all copies to ensure consistency
Conflicting operations - from the world of transactions:
Read–write conflict: concurrent read operation and write operation
Write–write conflict: two concurrent write operations
The goal
We generally need to ensure that all conflicting operations are done in the same order everywhere
The problem
Guaranteeing global ordering on conflicting operations may be costly, reducing scalability
The solution
Weaken consistency requirements so that hopefully global synchronization can be avoided
Consistency example
Example (Flight reservation database)
At 9.36, all seats of flight 48 are booked
At 9.37, Jane cancels her reservation on flight 48
At 9.38, Michael tries to reserve a seat on flight 48
  - the answer is "fully booked"
At 9.39, George tries to reserve a seat on flight 48
  - the seat is granted
What do you think?
Consistency models Introduction
Consistency models
Definition (Consistency model)
A contract between a distributed data store and a set of processes, which specifies what the results of read/write operations are in the presence of concurrency
Definition (Distributed data store)
A distributed collection of storage entities accessible to clients
Distributed database, file system
Shared memory in a parallel system
Werner Vogels
Whether or not inconsistencies are acceptable depends on the client application. In all cases the developer must be aware that consistency guarantees are provided by the storage systems and must be taken into account when developing applications.
W. Vogels. Eventually consistent. Comm. of the ACM, 52(1):40–44, 2009
Amazon’s vice-president and Chief Technology Officer
Strong consistency models
Strict consistency
Linearizability
Sequential consistency
Weak consistency models
Eventual consistency
Causal consistency
Client-centric consistency models
Read-after-read (monotonic read)
Write-after-write (monotonic write)
Read-after-write (read your writes)
Write-after-read (write follows read)
Notation
Write operation: w_i(s, a). Process i has written a to variable s
Read operation: r_i(s) → a. Process i has read a from variable s
[Figure: space-time diagram, one timeline per process]
p1: w1(s, 100), w1(s, 99)
p2: r2(s) → 100
p3: r3(s) → 99, w3(s, 100)
Consistency models Strict consistency
Strict consistency
Definition (Strict consistency)
A read operation must return the result of the latest write operation which occurred on the data item
Implementation:
Only possible with a global, perfectly synchronized clock
Only possible if all writes are instantaneously visible to all processes
It makes sense, though:
it is the model of uniprocessor systems!
Consistency models Linearizability
Linearizability
Definition (Linearizability, Herlihy and Wing, 1991)
1 The result of any execution is the same as if the operations by all processes on the data store were executed in some sequential order
2 The operations of each process appear in this sequence in the order specified by its program
3 If t1 < t2 are the times at which two distinct processes perform operations o1 and o2, then o1 must appear before o2 in the sequence
Example
Is the example below linearizable? (Read: given a replication protocol that produces these actions, is this protocol linearizable?)
Are the following linear sequences possible linearizations?
  - w1(s, 100) w1(s, 99) r2(s) → 100 r3(s) → 99 w3(s, 100)
  - w1(s, 100) r2(s) → 100 w1(s, 99) r3(s) → 99 w3(s, 100)
[Figure: space-time diagram, one timeline per process]
p1: w1(s, 100), w1(s, 99)
p2: r2(s) → 100
p3: r3(s) → 99, w3(s, 100)
Consistency models Sequential consistency
Sequential Consistency
Definition (Sequential Consistency, Lamport, 1978)
1 The result of any execution is the same as if the operations by all processes on the data store were executed in some sequential order
2 The operations of each process appear in this sequence in the order specified by its program
Condition 3 of linearizability (if t1 < t2, then o1 must appear before o2 in the sequence) is not required
Comments:
Much more common
Example
Is the example below sequentially consistent?
Is the following sequence a sequentially consistent one?
  - w1(s, 100) r2(s) → 100 w1(s, 99) r3(s) → 99 w3(s, 100)
[Figure: space-time diagram, one timeline per process]
p1: w1(s, 100), w1(s, 99)
p2: r2(s) → 100
p3: r3(s) → 99, w3(s, 100)
Example
Is the example below sequentially consistent?
Is the following sequence a sequentially consistent one?
  - From 1,2: w1(s, 99) r2(s) → 99 w2(s, 100)
  - From 3: r3(s) → 100 r3(s) → 99
[Figure: space-time diagram, one timeline per process]
p1: w1(s, 98), w1(s, 99)
p2: r2(s) → 99, w2(s, 100)
p3: r3(s) → 100, r3(s) → 99
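Questions of this kind can be checked mechanically for small histories. The sketch below is a brute-force search, not an efficient algorithm: it tries every interleaving that respects program order and accepts if, in some interleaving, every read returns the most recently written value. The history encoding (tuples of kind, variable, value) and the reading of the two diagrams are assumptions made for illustration.

```python
def sequentially_consistent(history):
    """history: dict mapping process name -> list of (kind, var, value),
    with kind "w" (write) or "r" (read), in program order."""
    procs = list(history)

    def search(pos, store):
        # All operations placed: a valid sequential order exists.
        if all(pos[p] == len(history[p]) for p in procs):
            return True
        for p in procs:
            i = pos[p]
            if i == len(history[p]):
                continue
            kind, var, val = history[p][i]
            # A read is legal only if it returns the latest written value.
            if kind == "r" and store.get(var) != val:
                continue
            new_store = dict(store)
            if kind == "w":
                new_store[var] = val
            new_pos = dict(pos)
            new_pos[p] += 1
            if search(new_pos, new_store):
                return True
        return False

    return search({p: 0 for p in procs}, {})


# First example: p3 reads 99 and then writes 100.
first = {
    "p1": [("w", "s", 100), ("w", "s", 99)],
    "p2": [("r", "s", 100)],
    "p3": [("r", "s", 99), ("w", "s", 100)],
}
# Second example: p3 reads 100 and then 99.
second = {
    "p1": [("w", "s", 98), ("w", "s", 99)],
    "p2": [("r", "s", 99), ("w", "s", 100)],
    "p3": [("r", "s", 100), ("r", "s", 99)],
}
print(sequentially_consistent(first), sequentially_consistent(second))
```

In the second history, p3 can read 100 only after w2(s, 100), which program order places after both writes of p1, so no later read can return 99. The search does not take real time into account, so it checks sequential consistency only, not linearizability.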
Process p1: x ← 1; print y, z
Process p2: y ← 1; print x, z
Process p3: z ← 1; print x, y
Initially, all variables have value 0
How many “potential executions” (without conditions 1,2)? 6! = 720
How many “valid executions” (without condition 1)? (5!/4) · 3 = 90
How many “potential outputs” (signatures ordered by p1, p2, p3)? 2^6 = 64
How many “sequentially consistent outputs”? < 64
Example: Is 000000 sequentially consistent? No
  - All print operations would have to “happen before” the updates, which is impossible
Example: Is 001001 sequentially consistent? No
  - print yz = 00 must come after x ← 1 and before y ← 1, z ← 1
  - x ← 1, print yz = 00, y ← 1, print xz = 10, z ← 1, print xy = 11: no, the last two digits would be 11 instead of 01
  - x ← 1, print yz = 00, z ← 1: no, p2 would then print z = 1, but z is never printed as 1 in this output
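The counts and the two examples can be verified by enumeration. The following sketch lists every interleaving of the six operations that respects program order, executes it on a shared memory, and collects the resulting signatures (outputs concatenated in the order p1, p2, p3). The encoding of the program is an assumption made for illustration.

```python
from itertools import permutations

# Each process first sets its variable to 1, then prints the other two.
PROGRAM = {
    "p1": [("set", "x"), ("print", ("y", "z"))],
    "p2": [("set", "y"), ("print", ("x", "z"))],
    "p3": [("set", "z"), ("print", ("x", "y"))],
}


def valid_executions():
    """Yield all orderings of the 6 operations that respect program order."""
    ops = [(p, i) for p in PROGRAM for i in range(2)]
    for order in permutations(ops):
        if all(order.index((p, 0)) < order.index((p, 1)) for p in PROGRAM):
            yield order


def signature(order):
    """Run one ordering on a single shared memory and return its 6-bit output."""
    mem = {"x": 0, "y": 0, "z": 0}
    out = {}
    for p, i in order:
        kind, arg = PROGRAM[p][i]
        if kind == "set":
            mem[arg] = 1
        else:
            out[p] = "".join(str(mem[v]) for v in arg)
    return out["p1"] + out["p2"] + out["p3"]


executions = list(valid_executions())
outputs = {signature(e) for e in executions}
print(len(executions), len(outputs))
print("000000" in outputs, "001001" in outputs, "001011" in outputs)
```

An output is sequentially consistent exactly when it appears in `outputs`, since every member of that set is produced by some legal sequential order.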
Causal Consistency – (Hutto and Ahamad, 1990)
Definition (Causal Consistency)
All writes that are (potentially) causally related must be seen by every process in causal order
Define “causally related”:
a read followed by a write, on the same process:
  - the write is (potentially) causally related to the read
a write followed by a read of the same value, on a different process:
  - the read is (potentially) causally related to the write
Example of use:
Bulletin board
Example
Is the following example causally consistent?
Is the following example sequentially consistent?
[Figure: space-time diagram, one timeline per process]
p1: w1(s, 99)
p2: w2(s, 100)
p3: r3(s) → 100, r3(s) → 99
p4: r4(s) → 99, r4(s) → 100
Example
Is the following example causally consistent?
Is the following example sequentially consistent?
[Figure: space-time diagram, one timeline per process]
p1: w1(msg1, 99)
p2: r2(msg1) → 99, w2(msg2, 100)
p3: r3(msg1) → 99, r3(msg2) → 100
p4: r4(msg1) → 99, r4(msg2) → 100
Several other models
FIFO/PRAM Consistency (Lipton and Sandberg, 1988)
Release Consistency (Gharachorloo et al, 1990)
Entry Consistency (Bershad et al, 1993)
. . .
Consistency models Eventual Consistency
Eventual Consistency
Scenario: consider a system where
updates are rare
concurrent updates are absent, or can be easily resolved in an automatic way
Example: DNS
Updates are rare w.r.t. reads!
Only a centralized authority can update the system; no concurrent updates.
Do we need sequential consistency in this case?
Definition (Eventual consistency)
If no updates take place for a long time, all replicas will gradually become consistent (i.e., the same)
Comment:
The consistency policy of epidemic protocols
This is a liveness property, not a safety property
What happens in our three-process example with prints?
Process p1: x ← 1; print y, z
Process p2: y ← 1; print x, z
Process p3: z ← 1; print x, y
Example: Is 000000 eventually consistent? Yes
In general, all the potential 64 outputs are possible
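The definition of eventual consistency can be illustrated with a small anti-entropy simulation. This is a sketch under stated assumptions, not a protocol from the slides: each replica stores (timestamp, value) pairs, conflicts are resolved by "last writer wins", timestamps are assumed to be unique, and a deterministic round-robin pass stands in for the random peer selection of a real epidemic protocol.

```python
class Replica:
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        self.merge_entry(key, (ts, value))

    def merge_entry(self, key, entry):
        # Last writer wins: keep the entry with the larger timestamp.
        if key not in self.data or entry[0] > self.data[key][0]:
            self.data[key] = entry

    def sync_with(self, other):
        # Bidirectional exchange of everything each side knows.
        for key, entry in list(other.data.items()):
            self.merge_entry(key, entry)
        for key, entry in list(self.data.items()):
            other.merge_entry(key, entry)


def consistent(replicas):
    return all(r.data == replicas[0].data for r in replicas)


def anti_entropy(replicas):
    """Repeat ring-shaped sync passes until all replicas agree.
    Returns the number of passes that were needed."""
    passes = 0
    while not consistent(replicas):
        for i, r in enumerate(replicas):
            r.sync_with(replicas[(i + 1) % len(replicas)])
        passes += 1
    return passes


replicas = [Replica() for _ in range(5)]
replicas[0].write("x", "old", ts=1)
replicas[3].write("x", "new", ts=2)
replicas[4].write("y", "only-here", ts=3)
print(consistent(replicas))        # replicas disagree while updates spread
passes = anti_entropy(replicas)    # no new updates: replicas converge
print(passes, replicas[0].data)
```

While updates are still propagating, a reader may observe any mix of old and new values, which is why every output of the three-process example is allowed.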
Consistency for mobile users
Consider a replicated database that you access through your notebook. The notebook acts as a front-end to the database
Consistency models Client-centric consistency
Consistency for mobile users
Problem: Eventual Consistency is not sufficient
You move from location A to location B
Unless you use the same server, you may detect inconsistencies:
  - your updates at A may not have yet been propagated to B
  - you may be reading newer entries than the ones available at A
  - your updates at B may eventually conflict with those at A
What can we do?
The only thing you really care about is that the entries you updated and/or read at A are in B the way you left them in A. In that case, the database will appear to be consistent to you
Client-centric consistency
Idea
In some cases, we can avoid system-wide consistency, by concentrating on what specific clients want, instead of what should be provided by servers
Models:
Read-after-read / Monotonic reads
Write-after-write / Monotonic writes
Read-after-write / Read-your-writes
Write-after-read / Write-follows-reads
Notation
xi denotes the version of data item x at local copy Li
WS(xi) denotes the set of the write/update operations that have caused xi to assume such value on Li
WS(xi; xj) denotes the fact that the operations in WS(xi) have been also performed at local copy Lj
Time specifications should be added to this notation; in the next slides we will use a space-time diagram, instead
Monotonic reads – Read-after-read
Definition (Monotonic reads)
If a process reads the value of a data item x, any successive read operation on x by that process will always return that same or a more recent value
Example
Reading incoming mail on a web server. Each time you connect to a different e-mail server, that server fetches (at least) all the updates from the server you previously visited
[Figure: two space-time diagrams over local copies L1 and L2, with operations r(x1) at L1 and r(x2) at L2 and write-set labels WS(x1) and WS(x1; x2)]
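The write-set notation suggests a simple way to enforce monotonic reads: before a local copy answers a read, it must contain every write the client has already observed, i.e. WS(x1; x2) must hold. The sketch below assumes that write identifiers are totally ordered integers and that the client session itself carries the missing writes; in a real system the server would fetch them from other servers. Class and method names are hypothetical.

```python
class Replica:
    """A local copy L_i of data item x."""

    def __init__(self):
        self.writes = {}  # write id -> value; this is WS(x_i)

    def apply(self, wid, value):
        self.writes[wid] = value

    def value(self):
        # Assumption: the write with the largest id is the most recent one.
        return self.writes[max(self.writes)] if self.writes else None


class Session:
    """Client-side state that enforces monotonic reads."""

    def __init__(self):
        self.read_set = {}  # writes behind the values read so far

    def read(self, replica):
        # Bring the replica up to date with what this client has seen.
        for wid, value in self.read_set.items():
            if wid not in replica.writes:
                replica.apply(wid, value)
        self.read_set.update(replica.writes)
        return replica.value()


l1, l2, l3 = Replica(), Replica(), Replica()
l1.apply(1, "first mail")
l1.apply(2, "second mail")
l2.apply(1, "first mail")   # L2 has not yet received write 2
l3.apply(1, "first mail")   # neither has L3

session = Session()
print(session.read(l1))     # reads at L1
print(session.read(l2))     # L2 is updated first, so the read is not older
print(l3.value())           # a read outside the session may still be stale
```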
Monotonic writes – Write-after-write
Definition (Monotonic writes)
A write operation by a process on a data item x is completed beforeany successive write operation on x by the same process
Example
Maintaining versions of replicated files in the correct order everywhere (propagate the previous version to the server where the newest version is installed)
[Figure: two space-time diagrams over local copies L1 and L2, with operations w(x1) at L1 and w(x2) at L2 and write-set labels WS(x1) and WS(x1; x2)]
Read your writes – Read-after-write
Definition (Read your writes)
The effect of a write operation by a process on data item x will always be seen by a successive read operation on x by the same process
Example
Updating your Web page and guaranteeing that your Web browser shows the newest version instead of its cached copy
[Figure: two space-time diagrams over local copies L1 and L2, with operations w(x1) at L1 and r(x2) at L2 and write-set labels WS(x1) and WS(x1; x2)]
Writes follow reads – Write-after-read
Definition (Writes follow reads)
A write operation by a process P on a data item x following a previous read operation on x by P is guaranteed to take place on the same value of x that was read, or on a more recent one
Example (Newsgroup)
To guarantee that users of a network newsgroup see a posting of a reaction to an article only after they have seen the original article
[Figure: two space-time diagrams over local copies L1 and L2, with operations r(x1) at L1 and w(x2) at L2 and write-set labels WS(x1) and WS(x1; x2)]
Session consistency
Definition (Session consistency)
A practical version of read-your-writes, where processes access a data storage in the context of a session
As long as the session exists, the system guarantees read-your-writes
If the session terminates because of a failure, a new session must be created
Guarantees are limited to sessions
Client-centric consistency
Relevant bibliography
A. S. Tanenbaum and M. van Steen. Distributed Systems: Principles and Paradigms. Prentice-Hall, 2nd edition, 2007. [Chapter 7]
D. Terry, M. Theimer, K. Petersen, A. Demers, M. Spreitzer, and C. Hauser. Managing update conflicts in Bayou, a weakly connected replicated storage system. In Proc. of the 15th ACM Symposium on Operating Systems Principles, SOSP’95, pages 172–182. ACM, 1995. http://www.disi.unitn.it/~montreso/ds/papers/bayou.pdf
Reality Check
Amazon S3
Amazon S3 (Simple Storage Service) is an online storage web service offered by Amazon Web Services. S3 is designed to provide 99.99% availability and 99.999999999% durability of objects over a given year.
From Amazon S3’s FAQ
Q: What data consistency model does Amazon S3 employ?
Amazon S3 buckets in the US West (Northern California), EU (Ireland), Asia Pacific (Singapore), and Asia Pacific (Tokyo) Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES. Amazon S3 buckets in the US Standard Region provide eventual consistency.
Berkeley DB
Oracle’s Berkeley DB is a computer software library that provides a high-performance embedded database for key/value data. Used in Postfix, Subversion, SpamAssassin, Bitcoin.
From the Berkeley DB manual
In a distributed system, the changes made at the master are not always instantaneously available at every replica, although they eventually will be. In general, replicas not directly involved in contributing to a transaction commit will lag behind other replicas because they do not synchronize their commits with the master. For this reason, you might want to make use of the read-your-writes consistency feature.
Apache ZooKeeper
Apache ZooKeeper is a software project of the Apache Software Foundation, providing an open source centralized configuration service and naming registry for large distributed systems. ZooKeeper is a subproject of Hadoop.
From ZooKeeper
Sequential Consistency: Updates from a client will be applied in the order that they were sent.
What?
Relevant bibliography
H. Wada, A. Fekete, L. Zhao, K. Lee, and A. Liu. Data consistency properties and the trade-offs in commercial cloud storage: the consumers’ perspective. In Proc. of 5th Biennial Conference on Innovative Data Systems Research (CIDR’11), pages 134–143, Asilomar, CA, USA, Jan. 2011. http://www.disi.unitn.it/~montreso/ds/papers/ConsistencyCloud.pdf
Replication architectures Overview
Passive replication
Clients communicate with primary server
Updates are forwarded from primary to backups
Queries are answered by the primary
[Figure: clients communicating with the primary server, which forwards updates to two backup servers]
Active replication
Several (all) replicas handle the invocation and send the response
Updates must be applied in the same order – total order broadcast
[Figure: two groups of clients communicating with three replicas]
Passive vs Active
Passive replication
Computation is performed only at primary
If state updates are large, can waste network bandwidth
Can handle non-determinism
Active replication
Small recovery delay after failures
If operations are compute intensive, can waste computational resources
Requires deterministic operations
Consistency protocols
Primary-based protocols
  - Definition
  - Lower bounds
Replicated-write protocols
  - Majority, quorum-based
  - State machine approach
Client-centric protocols
  - Monotonic reads
  - Read-your-writes
Replication architectures Primary-Backup
Primary-Backup
The idea
Clients communicate with a single replica (the primary)
The primary updates the other replicas (the backups)
Backups detect the failure of the primary using a timeout mechanism
Clients learn from the service when the primary fails and the service “fails over” to a backup
Note: non-deterministic events are executed only at the primary
How to evaluate a primary-backup protocol
Definition (Degree of replication)
Number of servers used to implement the service; the smaller, the better
Definition (Blocking time)
The worst-case period between a request and its response in any failure-free execution
Definition (Failover time)
The worst-case period during which requests can be lost because there is no primary
Definitions
Definition (Service outage)
The service has a server outage at t if some correct client sends a request at time t to the service, but does not receive a response
Definition ((k,∆)-bofo service - “bounded outage, finitely often”)
A service in which all server outages can be grouped into at most k intervals of time, each of at most length ∆
Specification
PB1 At any time, there is at most one server pi that acts as a primary
PB2 If a client request arrives at a server that is not the current primary, then the request is ignored
PB3 There exist fixed values k and ∆ such that the service behaves like a single (k,∆)-bofo service
Primary-backup – Simple protocol
System model:
point-to-point communication
no communication failures
upper bound δ on message delivery time
FIFO channels
at most one server crashes
Two servers:
The primary p1
The backup p2
Variables:
At server pi, primary = true if pi acts as the current primary
At clients, primary is equal to the identifier of the current primary
Protocol executed by the primary p1
upon initialization do
    primary ← true

upon receive 〈req, r〉 from c do
    state ← update(state, r)        % Update local state
    send 〈state, state〉 to p2       % Send update to backup
    send 〈rep, reply(r)〉 to c       % Reply to client

repeat every τ seconds
    send 〈hb〉 to p2                 % Heartbeat message

upon recovery after a failure do
    { start behaving like a backup }
Protocol executed by the backup p2
upon initialization do
    primary ← false

upon receive 〈state, s〉 do
    state ← s                       % Update local state

upon not receiving a heartbeat for τ + δ seconds do
    primary ← true                  % Becomes new primary
    send 〈newp〉 to c                % Inform the client of new primary
    { start behaving like a primary }
Primary-backup – Client code
Protocol executed by client c
upon initialization do
    primary ← p1                    % Initial primary

upon receive 〈newp〉 from p2 do
    primary ← p2                    % Backup

upon operation(r) do
    while not received a reply do
        send 〈req, r〉 to primary
        wait receive 〈rep, v〉 or receive 〈newp〉
    return v
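The three roles of the simple protocol can be exercised in a small in-memory simulation. This is a sketch under strong simplifying assumptions, not the protocol itself: message passing is replaced by direct method calls, timers are collapsed (a missing reply stands in for both the heartbeat timeout and the 〈newp〉 message), at most one server crashes, and the reply to a request is simply its position in the log. All class and method names are hypothetical.

```python
class Server:
    def __init__(self, name, is_primary):
        self.name = name
        self.primary = is_primary
        self.crashed = False
        self.state = []       # log of applied requests
        self.backup = None    # set on the primary only

    def handle_request(self, r):
        # PB2: a server that is not the current primary ignores requests;
        # a crashed server never answers.
        if self.crashed or not self.primary:
            return None
        self.state = self.state + [r]                  # update(state, r)
        if self.backup is not None and not self.backup.crashed:
            self.backup.receive_state(self.state)      # send <state, state>
        return len(self.state)                         # reply(r)

    def receive_state(self, s):
        self.state = list(s)

    def heartbeat_timeout(self):
        # Stands in for "no heartbeat received for tau + delta seconds".
        self.primary = True


class Client:
    def __init__(self, p1, p2):
        self.p1, self.p2 = p1, p2
        self.primary = p1     # initial primary

    def operation(self, r):
        while True:
            reply = self.primary.handle_request(r)
            if reply is not None:
                return reply
            # No reply: simulate the backup timing out and sending <newp>.
            self.p2.heartbeat_timeout()
            self.primary = self.p2


p1 = Server("p1", is_primary=True)
p2 = Server("p2", is_primary=False)
p1.backup = p2
client = Client(p1, p2)

print(client.operation("a"))   # served by p1 and copied to p2
p1.crashed = True
print(client.operation("b"))   # retried at p2 after failover
print(p2.state)
```

Because the state is forwarded to the backup before the reply is sent, the backup already holds every acknowledged update when it takes over.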
Simple protocol – Proof of correctness
PB1 At any time, there is at most one server pi that acts as a primary
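Proof
  - primary1 = true ∧ primary2 = false until the failure of p1
  - primary2 = false until the expiration of the timeout
  - primary2 = true after the expiration of the timeout
  - Failover time: τ + 2δ
[Figure: space-time diagram for c, p1, p2 showing the REQ, STATE and REP messages, the heartbeats HB sent every τ with delay at most δ, and the final NEWP message]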
PB2 If a client request arrives at a server that is not the current primary, then the request is ignored
Proof Trivially follows from the protocol
PB3 There exist fixed values k and ∆ such that the service behaves like a single (k,∆)-bofo service
Proof Find k, ∆
  - At most one process can fail: k = 1
  - ∆ = τ + 4δ:
      - assume p1 crashes at tc
      - any client request sent to p1 at time tc − δ or later may be lost
      - p2 may not become the new primary until tc + τ + 2δ
      - the client may not learn that p2 is the new primary for another δ
[Figure: space-time diagram for c, p1, p2 showing a lost REQ, the heartbeats HB, the NEWP message, the retried REQ and its REP, with the outage interval of length τ + 4δ]
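The bound on ∆ follows from adding up the delays listed in the proof. As a worked derivation, with $t_c$ the crash time of p1 and the intermediate symbols $t_{\text{start}}$, $t_{\text{primary}}$, $t_{\text{end}}$ introduced here only for illustration:

```latex
\begin{align*}
t_{\text{start}}   &= t_c - \delta
  && \text{earliest request that may reach } p_1 \text{ after the crash} \\
t_{\text{primary}} &\le t_c + \delta + (\tau + \delta) = t_c + \tau + 2\delta
  && \text{last heartbeat arrives by } t_c + \delta \text{, then the timeout expires} \\
t_{\text{end}}     &\le t_{\text{primary}} + \delta = t_c + \tau + 3\delta
  && \text{the client receives } \langle \textit{newp} \rangle \\
\Delta &= t_{\text{end}} - t_{\text{start}} \le \tau + 4\delta
\end{align*}
```

The second line is also the failover time τ + 2δ used in the proof of PB1.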
Simple protocol – Questions
Question
What kind of consistency model is provided by this simple protocol?
Answer: Linearizability
1 The result of any execution is the same as if the operations by all processes on the data store were executed in some sequential order
2 The operations of each process appear in this sequence in the order specified by its program
3 If t1 < t2 are the times at which two distinct processes perform operations o1 and o2, then o1 must appear before o2 in the sequence
Primary-backup – Multiple backups
System model:
point-to-point communication
Perfect Channels
perfect failure detector P
FIFO channels
at most f < n servers crash
n servers:
p1, . . . , pn
Protocol executed by process pi
upon receive 〈req, id, r〉 from c do
    servers ← servers − {pj : pj ∈ servers ∧ j < i}
    if id ∉ state then
        state ← update(state, r)
    send 〈state, id, state〉 to servers
    wait receive 〈state, id〉 from servers
    send 〈rep, id, reply(r)〉 to c

upon suspect(pj) do
    servers ← servers − {pj}

upon receive 〈state, id, s〉 from pk do
    servers ← servers − {pj : pj ∈ servers ∧ j < k}
    if pk ∈ servers then
        state ← s
        send 〈state, id〉 to pk
Primary-Backup – Client code
Protocol executed by client c
upon initialization do
    Map response ← new Map()

upon receive 〈rep, id, v〉 do
    response[id] ← v

upon suspect(pj) do
    servers ← servers − {pj}

upon operation(r) do
    id ← newId()
    while servers ≠ ∅ and response[id] = nil do
        pk ← min(servers)
        send 〈req, id, r〉 to pk
        wait response[id] ≠ nil or pk ∉ servers
    return response[id]
How large is the failover time?
  - τ + 2δ, as before (hidden in the Failure Detector)
How large is the outage period ∆?
  - (τ + 2δ)(n − 1)
What kind of consistency model do we obtain if all operations are handled by the primary?
  - Linearizability
What kind of consistency model do we obtain if only write operations are handled by the primary?
  - Sequential consistency
Lower bounds
Assuming that no more than f components can fail, what are the smallest possible values (lower bounds) of
  - the degree of replication?
  - the failover time?
  - the blocking time?
Knowing the lower bounds for a problem makes it possible to evaluate the quality of a protocol
Tight lower bounds → optimal protocols
Components:
  - Processes
  - Point-to-point links
  - Up to f crash+link failures → at most f processes may crash, or f links may crash, or f1 links + f2 processes may crash with f1 + f2 = f components
Failure model    Degree of replication   Blocking time   Failover time
crash            n > f                   0               fδ
crash+link       n > f + 1               0               2fδ
rec-omission     n > ⌊3f/2⌋              2δ              2fδ
send-omission    n > f                   2δ              2fδ
omission         n > 2f                  2δ              2fδ
Crash+link
To tolerate up to f crash+link failures, more than f + 1 servers are needed
Proof – by contradiction
Suppose n = f + 1 servers are sufficient
divide the n servers into two subsets, A and B = {B1, . . . , Bf}
if all servers in B crash, A must become primary
if A crashes, one of the servers Bi must become primary
what if all f links between A and the servers Bi fail?
[Figure: server A connected by one link to each of B1, B2, B3]
Multiple primaries
Replication architectures Quorum protocols
Quorum protocols (Gifford, 1979)
Definition
Quorum-based protocols guarantee that each operation is carried out in such a way that a majority vote (a quorum) is established.
Write quorum nW: the number of replicas that need to acknowledge the receipt of the update to complete the update
Read quorum nR: the number of replicas that are contacted when a data object is accessed through a read operation
Constraints
nR + nW > n (prevent R-W conflicts)
nW > n/2 (prevent W-W conflicts)
The algorithm
To read, the most up-to-date entry is taken
Quorums guarantee that the last written entry will be present
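The two constraints and the read rule can be illustrated with a small versioned store. This is a sketch under stated assumptions, not Gifford's full algorithm: replicas are in-memory (version, value) pairs, quorums are chosen at random, operations are sequential, and a writer obtains the next version number from its own write quorum (which works because any two write quorums intersect when nW > n/2). Names are hypothetical.

```python
import random


class QuorumStore:
    def __init__(self, n, n_r, n_w, seed=0):
        if not n_r + n_w > n:
            raise ValueError("need nR + nW > n to prevent read-write conflicts")
        if not 2 * n_w > n:
            raise ValueError("need nW > n/2 to prevent write-write conflicts")
        self.n, self.n_r, self.n_w = n, n_r, n_w
        self.replicas = [(0, None)] * n      # (version, value) per replica
        self.rng = random.Random(seed)

    def write(self, value):
        quorum = self.rng.sample(range(self.n), self.n_w)
        # Any write quorum overlaps the previous one, so it sees the
        # highest version written so far.
        version = max(self.replicas[i][0] for i in quorum) + 1
        for i in quorum:
            self.replicas[i] = (version, value)

    def read(self):
        quorum = self.rng.sample(range(self.n), self.n_r)
        # Take the most up-to-date entry among the contacted replicas.
        return max((self.replicas[i] for i in quorum), key=lambda e: e[0])[1]


store = QuorumStore(n=5, n_r=2, n_w=4)
for v in range(20):
    store.write(v)
    assert store.read() == v   # every read quorum meets the last write quorum
print(store.read())
```

With n = 5, nR = 2 and nW = 4, each read quorum shares at least one replica with the latest write quorum, so the highest version it sees is always the last one written.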
Replication architectures State machines
State machine
Definition (State machine)
A state machine consists of:
State variables
Commands which transform its state
  - Implemented by deterministic programs
  - Atomic with respect to other commands
Specification
Agreement: every correct replica receives the same set of commands
Order: every non-faulty state machine processes the commands it receives in the same order
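Why both agreement and order are needed can be seen with a toy state machine. The sketch below uses an integer as state and two deterministic commands, "add" and "mul"; these commands and function names are assumptions chosen for illustration.

```python
def apply_command(state, command):
    """Deterministic transition function of the state machine."""
    op, arg = command
    if op == "add":
        return state + arg
    if op == "mul":
        return state * arg
    raise ValueError(f"unknown command: {op}")


def run_replica(commands, initial=0):
    """Apply commands one at a time (atomically) in the given order."""
    state = initial
    for command in commands:
        state = apply_command(state, command)
    return state


commands = [("add", 2), ("mul", 3), ("add", 1)]

# Agreement + order: same commands, same order, hence the same state.
replica_a = run_replica(commands)
replica_b = run_replica(list(commands))

# Agreement without order: same set of commands, different order.
replica_c = run_replica([("mul", 3), ("add", 2), ("add", 1)])

print(replica_a, replica_b, replica_c)
```

Replicas that receive the same commands in a different order may diverge, which is why active replication relies on total order broadcast.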
Implementing linearizability – General scheme
Implementation
The initiator A-broadcasts all read, write requests to all servers
When the message is A-delivered at the initiator, it replies to the client
Correctness
All replicas execute read, write in the same order
Assumptions
Synchronous system
Asynchronous system with ◇S failure detector
Implementing sequential consistency – General scheme
Implementation
The initiator A-broadcasts write requests to all servers
When the message is A-delivered, the replica updates its local copy
Read requests are answered immediately by the initiator
Correctness
Writes are executed in the same order everywhere
Reads are consistent with local order
Assumptions
Synchronous system
Asynchronous system with ◇S failure detector
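A sketch of the difference with the linearizable scheme, under the assumption that atomic broadcast is simulated by appending each write to every replica's inbox in the same order. Delivery is triggered by hand (deliver_pending) to make a stale local read visible. Names are illustrative.

```python
from collections import deque

class Replica:
    def __init__(self):
        self.store = {}
        self.inbox = deque()     # A-broadcast writes not yet A-delivered

    def deliver_pending(self):
        while self.inbox:
            key, value = self.inbox.popleft()
            self.store[key] = value

    def read(self, key):
        # Reads are served from the local copy, without any broadcast.
        return self.store.get(key)

def a_broadcast(replicas, key, value):
    # Stand-in for atomic broadcast: all inboxes see writes in the same order.
    for r in replicas:
        r.inbox.append((key, value))

a, b = Replica(), Replica()
a_broadcast([a, b], "x", 1)
a_broadcast([a, b], "x", 2)

a.deliver_pending()
assert a.read("x") == 2
assert b.read("x") is None   # stale, but consistent with b's local order
b.deliver_pending()
assert b.read("x") == 2      # same write order everywhere
```

The stale read at b would violate linearizability, but it is allowed by sequential consistency: b's view is still a prefix of the single global write order.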
Implementing causal consistency – General scheme
Implementation
The initiator C-broadcasts write requests to all servers
When the message is C-delivered, the replica updates its local copy
Read requests are answered immediately by the initiator, from its local copy
Correctness
Writes are executed in a causal order
Reads are consistent with local (and causal) order
Assumptions
Asynchronous system
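One common way to realise C-broadcast is with vector clocks; the slides do not prescribe this, so the sketch below is an assumption. A write from sender s with timestamp ts is C-delivered only when it is the next write from s and every write it causally depends on has already been delivered. CausalReplica and its methods are hypothetical names.

```python
class CausalReplica:
    def __init__(self, pid, n):
        self.pid = pid
        self.clock = [0] * n    # clock[j] = number of writes from j delivered here
        self.store = {}
        self.pending = []       # received but not yet C-delivered

    def c_broadcast(self, key, value):
        """Create a write message; the sender applies it locally at once."""
        self.clock[self.pid] += 1
        self.store[key] = value
        return (self.pid, list(self.clock), key, value)

    def _deliverable(self, msg):
        sender, ts, _, _ = msg
        return ts[sender] == self.clock[sender] + 1 and all(
            ts[j] <= self.clock[j] for j in range(len(ts)) if j != sender
        )

    def receive(self, msg):
        self.pending.append(msg)
        progress = True
        while progress:             # keep delivering until nothing is unblocked
            progress = False
            for m in list(self.pending):
                if self._deliverable(m):
                    sender, ts, key, value = m
                    self.store[key] = value
                    self.clock[sender] = ts[sender]
                    self.pending.remove(m)
                    progress = True

p0, p1, p2 = (CausalReplica(i, 3) for i in range(3))
m1 = p0.c_broadcast("x", 1)     # timestamp [1, 0, 0]
p1.receive(m1)
m2 = p1.c_broadcast("x", 2)     # timestamp [1, 1, 0], causally after m1

p2.receive(m2)                  # arrives first, but depends on m1: buffered
assert p2.store == {}
p2.receive(m1)                  # m1 is delivered, which unblocks m2
assert p2.store == {"x": 2}
```

No agreement on a total order is needed, which is consistent with the scheme working in a plain asynchronous system.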
Hypervisor-based fault tolerance
Implement the state machine on virtual machines running on the same instruction set as the underlying hardware
Undetectable by higher layers of software
One of the great come-backs in systems research!
  - CP-67 for IBM System/360 [1970]
  - Xen [SOSP 2003], VMware
State transition should be deterministic
...but some VM instructions are not (e.g. time-of-day)!
Two types of commands
  - Virtual-machine instructions
  - Virtual-machine interrupts (with DMA input)
Interrupts must be delivered at the same point in the command sequence at every replica
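A toy model of the last point, not the mechanism of any real hypervisor. The hypothetical VirtualMachine below counts executed instructions; the primary logs the instruction count at which each interrupt was delivered, and a backup replays the interrupt at exactly that count. The log format (instruction count, payload) is an assumption made for illustration.

```python
class VirtualMachine:
    """Toy deterministic VM: an instruction adds its operand to an accumulator;
    an interrupt doubles the accumulator and adds its payload."""
    def __init__(self):
        self.acc = 0
        self.executed = 0       # number of instructions executed so far

    def step(self, operand):
        self.acc += operand
        self.executed += 1

    def interrupt(self, payload):
        self.acc = self.acc * 2 + payload

program = [1, 2, 3, 4]

# Primary: an interrupt with payload 10 happens to arrive after 2 instructions.
primary, log = VirtualMachine(), []
for operand in program:
    if primary.executed == 2:
        log.append((primary.executed, 10))   # record the delivery point
        primary.interrupt(10)
    primary.step(operand)

def replay(program, log):
    """Backup: deliver each logged interrupt at the recorded instruction count."""
    vm = VirtualMachine()
    pending = dict(log)
    for operand in program:
        if vm.executed in pending:
            vm.interrupt(pending[vm.executed])
        vm.step(operand)
    return vm

assert replay(program, log).acc == primary.acc          # same point: same state
assert replay(program, [(1, 10)]).acc != primary.acc    # other point: divergence
```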
Hypervisor-based fault tolerance
Thomas C. Bressoud and Fred B. Schneider. Hypervisor-based Fault Tolerance. ACM TOCS, 14(1):80-107, 1996
John R. Douceur and Jon Howell. Replicated Virtual Machines. Microsoft Research TR-2005-119
  - Technical paper associated with a patent
Brendan Cully et al. Remus: High Availability via Asynchronous Virtual Machine Replication. NSDI’08.
  - Best paper award
  - Real implementation for Xen
Replication architectures Client-centric consistency
Client-centric consistency - Naive implementations
Each write operation is assigned a unique identifier
  - Done by the server where the operation is requested
For each client c, we keep track of:
  - Read set WS_r: contains the write operations relevant to the read operations performed by c
  - Write set WS_w: contains the write operations relevant to the write operations performed by c
For each server, we keep track of:
  - Write set WS: contains the write operations executed so far
Monotonic reads - Naive implementation
To perform a read operation o_r, a client:
  - Sends o_r and its read set WS_r to the server
The server:
  - Checks whether all the writes in WS_r have been executed locally (WS_r ⊆ WS?)
  - If not, asks the appropriate servers for the missing operations O
  - Applies the operations O and adds them to WS
  - Returns the requested value and the WS set to the client
The client:
  - Adds WS to its local read set: WS_r = WS_r ∪ WS
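The steps above can be sketched as follows, assuming write identifiers of the form (server name, sequence number) and direct access to peer servers in place of messages. Server, Client and the peers parameter are hypothetical. Like the naive scheme itself, the sketch ignores conflicts between concurrent writes.

```python
class Server:
    def __init__(self, name):
        self.name, self.seq = name, 0
        self.ws = {}        # write set WS: write id -> (key, value)
        self.store = {}

    def write(self, key, value):
        wid = (self.name, self.seq)     # unique id assigned by this server
        self.seq += 1
        self.ws[wid] = (key, value)
        self.store[key] = value
        return wid

    def read(self, key, ws_r, peers):
        # WS_r ⊆ WS? If not, fetch and apply the missing operations O.
        for wid in sorted(w for w in ws_r if w not in self.ws):
            source = next(p for p in peers if wid in p.ws)
            k, v = source.ws[wid]
            self.ws[wid] = (k, v)
            self.store[k] = v
        return self.store.get(key), set(self.ws)

class Client:
    def __init__(self):
        self.ws_r = set()   # read set WS_r

    def read(self, server, key, peers):
        value, ws = server.read(key, self.ws_r, peers)
        self.ws_r |= ws     # WS_r = WS_r ∪ WS
        return value

s1, s2 = Server("s1"), Server("s2")
s1.write("x", 1)
c = Client()
assert c.read(s1, "x", peers=[s2]) == 1
# Moving to s2, which never saw the write: it must catch up before answering.
assert c.read(s2, "x", peers=[s1]) == 1
```

Without the read set, the second read would return nothing, and the client would see the value of x go backwards.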
Read your writes - Naive implementation
To perform a read operation o_r, a client:
  - Sends o_r and its write set WS_w to the server
The server:
  - Checks whether all the writes in WS_w have been executed locally (WS_w ⊆ WS?)
  - If not, asks the appropriate servers for the missing operations O
  - Applies the operations O and adds them to WS
  - Returns the requested value to the client
To perform a write operation o_w, a client c:
  - Sends o_w to the server
  - Adds o_w to the write set WS_w
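A matching sketch for read-your-writes, under the same assumptions as before: write identifiers are (server name, sequence number) pairs, peers are reached by direct calls, and all class and parameter names are hypothetical. The only change with respect to monotonic reads is that the client ships its write set instead of its read set.

```python
class Server:
    def __init__(self, name):
        self.name, self.seq = name, 0
        self.ws = {}        # write set WS: write id -> (key, value)
        self.store = {}

    def write(self, key, value):
        wid = (self.name, self.seq)     # unique id assigned by this server
        self.seq += 1
        self.ws[wid] = (key, value)
        self.store[key] = value
        return wid

    def read(self, key, ws_w, peers):
        # WS_w ⊆ WS? If not, fetch and apply the missing operations O.
        for wid in sorted(w for w in ws_w if w not in self.ws):
            source = next(p for p in peers if wid in p.ws)
            k, v = source.ws[wid]
            self.ws[wid] = (k, v)
            self.store[k] = v
        return self.store.get(key)

class Client:
    def __init__(self):
        self.ws_w = set()   # write set WS_w

    def write(self, server, key, value):
        self.ws_w.add(server.write(key, value))

    def read(self, server, key, peers):
        return server.read(key, self.ws_w, peers)

s1, s2 = Server("s1"), Server("s2")
writer, other = Client(), Client()
writer.write(s1, "x", 1)

# Another client gets no guarantee at s2: the write has not propagated yet.
assert other.read(s2, "x", peers=[s1]) is None
# The writer always sees its own write, even at a different server.
assert writer.read(s2, "x", peers=[s1]) == 1
```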
CAP Theorem
CAP theorem
Theorem (Impossibility of CAP)
It is impossible for a web service to provide more than two of the following three guarantees:
Consistency
Availability
Partition-tolerance
This is the reason why Amazon Web Services only provide eventual consistency
  - W. Vogels. Eventually consistent. Comm. of the ACM, 52(1):40–44, 2009
Similar stances have been taken, for example, by HP
  - HP. There is no free lunch with distributed systems.
    http://www.disi.unitn.it/~montreso/ds/papers/NoFreeLunchDS.pdf
CAP theorem
History:
First introduced by Eric Brewer in a keynote at PODC’00
  - E. A. Brewer. Towards robust distributed systems (abstract). In Proc. of the 19th ACM symposium on Principles of distributed computing, PODC’00, page 7. ACM, 2000
Formally proved by Gilbert and Lynch two years later
  - S. Gilbert and N. Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 33:51–59, June 2002.
    http://www.disi.unitn.it/~montreso/ds/papers/CapProof.pdf
Bibliography
Reading material
W. Vogels. Eventually consistent. Comm. of the ACM, 52(1):40–44, 2009. http://www.disi.unitn.it/~montreso/ds/papers/EventualConsistent.pdf
N. Budhiraja, K. Marzullo, F. Schneider, and S. Toueg. The primary-backup approach. In S. Mullender, editor, Distributed Systems (2nd ed.). Addison-Wesley, 1993. http://www.disi.unitn.it/~montreso/ds/papers/PrimaryBackup.pdf
F. Schneider. Replication management using the state machine approach. In S. Mullender, editor, Distributed Systems (2nd ed.). Addison-Wesley, 1993. http://www.disi.unitn.it/~montreso/ds/papers/StateMachine.pdf
E. A. Brewer. Towards robust distributed systems (abstract). In Proc. of the 19th ACM symposium on Principles of distributed computing, PODC’00, page 7. ACM, 2000.
N. Budhiraja, K. Marzullo, F. Schneider, and S. Toueg. The primary-backup approach. In S. Mullender, editor, Distributed Systems (2nd ed.). Addison-Wesley, 1993. http://www.disi.unitn.it/~montreso/ds/papers/PrimaryBackup.pdf
S. Gilbert and N. Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 33:51–59, June 2002. http://www.disi.unitn.it/~montreso/ds/papers/CapProof.pdf
HP. There is no free lunch with distributed systems. http://www.disi.unitn.it/~montreso/ds/papers/NoFreeLunchDS.pdf
F. Schneider. Replication management using the state machine approach. In S. Mullender, editor, Distributed Systems (2nd ed.). Addison-Wesley, 1993. http://www.disi.unitn.it/~montreso/ds/papers/StateMachine.pdf
A. S. Tanenbaum and M. van Steen. Distributed Systems: Principles and Paradigms. Prentice-Hall, 2nd edition, 2007.
D. Terry, M. Theimer, K. Petersen, A. Demers, M. Spreitzer, and C. Hauser. Managing update conflicts in Bayou, a weakly connected replicated storage system. In Proc. of the 15th ACM symposium on Operating systems principles, SOSP’95, pages 172–182. ACM, 1995. http://www.disi.unitn.it/~montreso/ds/papers/bayou.pdf
W. Vogels. Eventually consistent. Comm. of the ACM, 52(1):40–44, 2009. http://www.disi.unitn.it/~montreso/ds/papers/EventualConsistent.pdf
H. Wada, A. Fekete, L. Zhao, K. Lee, and A. Liu. Data consistency properties and the trade-offs in commercial cloud storage: the consumers’ perspective. In Proc. of 5th Biennial Conference on Innovative Data Systems Research (CIDR’11), pages 134–143, Asilomar, CA, USA, Jan. 2011. http://www.disi.unitn.it/~montreso/ds/papers/ConsistencyCloud.pdf