Ch12 (continued): Replicated Data Management
Outline: Serializability: Quorum-based protocol, Replicated Servers, Dynamic Quorum Changes, One-copy serializability, View-based quorums
Atomic Multicast: Virtual synchrony, Ordered Multicast, Reliable and Causal Multicast, Trans Algorithm, Group Membership, Transis Algorithm
Update Propagation
Data replication
Data replication: Why? Make data available in spite of failures of some processors. Enable transactions (user-defined actions) to complete successfully even if some failures occur in the system,
i.e. actions are resilient to failures of some nodes
Problems to be solved: Consistency Management of replicas
Data replication: intuitive representation of a replicated data system
[Figure: transactions issue logical operations on a logical data item d; the system maps each logical operation onto the replicas of d.]
Transactions see logical data. The underlying system maps each operation on the logical data to operations on multiple copies.
Data replication Correctness criteria of the underlying system
To be correct, the mapping performed by the underlying system must ensure one-copy serializability. One-copy serializability property: the concurrent execution of transactions on replicated data should be equivalent to some serial execution of the transactions on non-replicated data
Data replication Quorum-based protocol
Ensure that any pair of conflicting accesses to the same data item access overlapping sites
Here we discuss read/write quorums. A data item d is stored at every processor p in P(d)
Every processor p in P(d) has a vote weight vp(d). R(d): read threshold. W(d): write threshold. Read quorum of d: a subset P' of P(d) such that Σ( vp(d), p ∈ P' ) ≥ R(d)
Data replication Quorum-based protocol
Write quorum of d: a subset P' of P(d) such that Σ( vp(d), p ∈ P' ) ≥ W(d)
The total number of votes for d: V(d) = Σ( vp(d), p ∈ P(d) )
Quorums must satisfy the following two conditions. Condition 1: R(d) + W(d) > V(d). Intuitively, every read quorum of d intersects every write quorum of d. Hence,
a read and a write cannot be performed concurrently on d
every read quorum can access the copy that reflects the latest update
Data replication Quorum-based protocol
Condition 2: 2*W(d) > V(d). Intuitively, any two write quorums of d intersect. Hence,
a write can be performed in at most one group
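The two conditions can be checked mechanically. Below is a minimal sketch (the helper name is ours, not from the text) that validates a vote assignment against both conditions:

```python
def valid_quorum_thresholds(votes, R, W):
    """Check the two static quorum conditions for a data item.

    votes: dict mapping processor id -> vote weight v_p(d)
    R, W:  read and write thresholds R(d), W(d)
    (illustrative helper, not part of the protocol text)
    """
    V = sum(votes.values())   # total number of votes V(d)
    cond1 = R + W > V         # condition 1: read and write quorums intersect
    cond2 = 2 * W > V         # condition 2: any two write quorums intersect
    return cond1 and cond2

# Three replicas with one vote each: majority read/write thresholds are valid,
# while R=1, W=1 is not (a reader could miss the latest write).
assert valid_quorum_thresholds({"A": 1, "B": 1, "C": 1}, R=2, W=2)
assert not valid_quorum_thresholds({"A": 1, "B": 1, "C": 1}, R=1, W=1)
```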
How do read and write operations work?
Data replication Quorum-based protocol
Read operation: Each replica of d has a version number. np(d): version number of the replica at processor p; initially, np(d) is zero. When a transaction T wants to read d, the following steps are performed:
1. Broadcast a request for votes to P(d) (a remote processor q replies by sending a message with nq(d), vq(d))
2. Collect replies and construct P' until Σ( vp(d), p ∈ P' ) ≥ R(d)
3. Lock all copies in P'
4. Read the replica dp, p in P', with the highest version number
5. Unlock the copies of d
Data replication Quorum-based protocol
Write operation: When a transaction T wants to write d, the following steps are performed:
1. Broadcast a request for votes to P(d) (a remote processor q replies by sending a message with nq(d), vq(d))
2. Collect replies and construct P' until Σ( vp(d), p ∈ P' ) ≥ W(d)
3. Lock all copies in P'
4. Compute the new value d' of d
5. Let max_n(d) be the highest version number read in step 2. For all p in P', write d' to d with np(d') = max_n(d) + 1
6. Unlock the copies of d
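The read and write steps can be sketched as follows. This is a minimal single-process illustration (the names `Replica`, `collect_quorum`, etc. are ours); it omits the broadcast, locking, and failure handling that the real protocol requires:

```python
class Replica:
    """One copy of data item d at a processor (illustrative sketch)."""
    def __init__(self, vote):
        self.vote = vote      # vote weight v_p(d)
        self.version = 0      # version number n_p(d), initially zero
        self.value = None

def collect_quorum(replicas, threshold):
    """Collect votes until the accumulated weight reaches the threshold."""
    group, weight = [], 0
    for r in replicas:
        group.append(r)
        weight += r.vote
        if weight >= threshold:
            return group
    raise RuntimeError("quorum unavailable")

def quorum_read(replicas, R):
    """Read from a read quorum: return the value with the highest version."""
    group = collect_quorum(replicas, R)
    newest = max(group, key=lambda r: r.version)
    return newest.value, newest.version

def quorum_write(replicas, W, value):
    """Write to a write quorum with version = highest version seen + 1."""
    group = collect_quorum(replicas, W)
    new_version = max(r.version for r in group) + 1
    for r in group:
        r.value, r.version = value, new_version

replicas = [Replica(1) for _ in range(3)]
quorum_write(replicas, W=2, value="x=5")   # reaches two of the three copies
value, version = quorum_read(replicas, R=2)
assert (value, version) == ("x=5", 1)      # a read quorum sees the latest write
```

Because R(d) + W(d) > V(d), every read quorum contains at least one replica written by the last write quorum, so the `max`-by-version step always finds the current value.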
Data replication Replicated Servers
[Figure: clients issue logical requests on a logical server S; the system maps each request onto the replicas of S, which return a reply.]
Clients see logical servers. The underlying system maps each operation on the logical server to operations on multiple copies and returns a reply.
Data replication Replicated Servers
In the context of replicated data, one might consider that the system consists of servers and clients.
Servers are processors holding a copy of the data item; clients are processors requesting operations on the data item
Some approaches for replicating servers: Active replication Primary site approach
Data replication Replicated Servers
A copy of S is at every processor p in P(S)
Active replication: All the replicas are simultaneously active All replicas are equivalent
When a client C requests a service from S, C contacts any one of the replicas Sp, for p in P(S) Sp acts as the coordinator for the transaction
To be fault tolerant, the client must contact all the replicas (plus other restrictions e.g. same set of requests; same order of requests at all replicas )
In general, suitable for processor failures
Data replication Replicated Servers
Primary site approach: One replica is the primary copy: coordinator for all transactions
All other replicas are backups (passive in general)
When a client C requests a service from S, C contacts the primary copy If the primary fails, a new primary is elected
If a network partitioning occurs, only the partition having the primary can be serviced
Data replication Replicated Servers
Primary site approach: Read operation: if the requested operation is a read, the primary performs the operation and sends the result to the requester. Write operation: if the requested operation is a write, the primary server makes sure that all the backups maintain the most recent, up-to-date value of the data item. The primary processor might periodically checkpoint the state of the data item on the backups to reduce the computation overhead at the backups
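The primary-site read/write paths can be sketched as follows. This is a minimal illustration under our own naming (`PrimarySite`, plain dicts as backups); real systems add checkpointing, failover, and election, which are omitted here:

```python
class PrimarySite:
    """Primary-site replication sketch (illustrative; names are ours).

    The primary answers reads directly and forwards every write to all
    backups, so the backups hold the most recent value of the data item.
    """
    def __init__(self, backups):
        self.value = None
        self.backups = backups        # list of passive backup states

    def read(self):
        return self.value             # primary performs the read itself

    def write(self, value):
        self.value = value
        for b in self.backups:        # propagate the update to every backup
            b["value"] = value

backups = [{"value": None}, {"value": None}]
primary = PrimarySite(backups)
primary.write("x=1")
assert primary.read() == "x=1"
assert all(b["value"] == "x=1" for b in backups)
```

Since only the primary coordinates, the backups can remain passive; if the primary fails, one backup is elected as the new primary, and only the partition containing the primary can be serviced.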
Data replication Dynamic Quorum changes
The quorum-based protocol we have seen is a static method: a single processor failure can make a data item unavailable
If the system breaks into small groups, it might be the case that no group can perform the write operation
The dynamic quorum change algorithm avoids this (within certain limits). Idea: for a data item d, quorums are defined on the set of alive replicas of d. This introduces the notion of view: each transaction executes in a single view, and views are changed sequentially
Data replication Dynamic Quorum changes
d: data item. P(d): processors at which a copy of d is stored. Some processors in P(d) can fail
View: we can regard a view of d as consisting of : alive processors of P(d): AR(d) a read quorum defined on AR(d) a write quorum defined on AR(d) a unique name v(d) (view names are assumed to be totally ordered)
Data replication Dynamic Quorum changes
For a transaction Ti, v(Ti) denotes the view in which Ti executes
The idea behind view-based quorum is to ensure that If v(Ti) < v(Tj) then, Ti comes before Tj in an equivalent serial execution
Problem: ensure serializability within a view and serializability between views. New conditions are necessary for quorums to satisfy the above requirements
Data replication Dynamic Quorum changes
New conditions for quorums:
• d: data item
• v: a view of d
• P(d,v): alive processors that store d in view v, with |P(d,v)| = n(d,v)
• R(d,v): read threshold for d in view v
• W(d,v): write threshold for d in view v
• Ar(d): read accessibility threshold for d in all views (availability: d can be read in a view v as long as there are Ar(d) alive processors in view v)
• Aw(d): write accessibility threshold for d in all views (availability)
Data replication Dynamic Quorum changes
New conditions for quorums (cont.). The thresholds must satisfy the following conditions:
DQC1. R(d,v) + W(d,v) > n(d,v) /* in a view, read and write quorums intersect */
DQC2. 2*W(d,v) > n(d,v) /* in a view, write quorums intersect; nodes participating in an update form a majority of the view */
DQC3. Ar(d) + Aw(d) > |P(d)| /* read accessibility and write accessibility intersect across all views */
DQC4. Aw(d) ≤ W(d,v) ≤ n(d,v) /* ensures consistency of views (we'll see later) */
DQC5. 1 ≤ R(d,v) ≤ n(d,v) /* the minimum size of a read quorum is 1 */
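The five conditions can be bundled into one predicate over a proposed view. A minimal sketch (the helper name and the example numbers are ours):

```python
def valid_view_quorums(P_size, n, R, W, Ar, Aw):
    """Check conditions DQC1-DQC5 for a view of data item d.

    P_size: |P(d)|, total number of replicas
    n:      n(d,v), number of alive replicas in view v
    R, W:   R(d,v), W(d,v), read/write thresholds in view v
    Ar, Aw: read/write accessibility thresholds (same in all views)
    (illustrative helper, not from the text)
    """
    return (R + W > n              # DQC1: read/write quorums intersect in a view
            and 2 * W > n          # DQC2: write quorums intersect in a view
            and Ar + Aw > P_size   # DQC3: accessibility thresholds intersect
            and Aw <= W <= n       # DQC4: consistent view changes
            and 1 <= R <= n)       # DQC5: a read quorum has at least one member

# Five replicas, all alive: read-one/write-all thresholds pass all conditions.
assert valid_view_quorums(P_size=5, n=5, R=1, W=5, Ar=3, Aw=3)
# Dropping W to 2 violates DQC1, DQC2 and DQC4.
assert not valid_view_quorums(P_size=5, n=5, R=1, W=2, Ar=3, Aw=3)
```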
Data replication Dynamic Quorum changes
Restrictions on read and write operations: A data item d can be
read in view v only if n(d,v) ≥ Ar(d), i.e. the number of alive replicas of d must be greater than or equal to the read accessibility threshold
written in view v only if n(d,v) ≥ Aw(d), i.e. the number of alive replicas of d must be greater than or equal to the write accessibility threshold. These restrictions are imposed to ensure consistent changes of quorums
Data replication Dynamic Quorum changes
How read and write operations work: Similar to the static quorum-based protocol except that:
Only processors in P(d,v) are contacted for votes (hence, for constructing the quorum i.e. P’)
The version number of each replica becomes : (view_number, in_view_sequence_number)
If a processor p receives a request from a transaction Ti and v(Ti) is not the view p has for d, then p rejects the request
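The two-part version number orders updates across views: any update from a later view dominates every update from an earlier view. In Python this is exactly lexicographic tuple comparison (a small illustration of ours, not from the text):

```python
# Version numbers are (view_number, in_view_sequence_number) pairs.
# Python tuples compare lexicographically, matching the intended order.
v_old = (1, 9)   # view 1, ninth update in that view
v_new = (2, 1)   # view 2, first update in that view
assert v_new > v_old                         # later view wins, whatever the sequence
assert max([(0, 3), (1, 9), (2, 1)]) == (2, 1)
```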
Data replication Dynamic Quorum changes
Installation of a new view: We have claimed that views are changed sequentially
How is this achieved?
A processor p in P(d) can initiate an update of the view for d due to recovery, failure of a member of the view, or because its version number for d is not current.
Data replication Dynamic Quorum changes
Installation of a new view (cont.). The idea: assume that processor p is the one that wants to change the view.
1. p determines if the view (the set of nodes with which p can communicate) it belongs to satisfies the new conditions for quorums (n(d,v) ≥ Ar(d) and n(d,v) ≥ Aw(d), ...). If this is not the case, p cannot change the view
2. p reads all copies of d in P(d,v)
3. p gets the new copy from a replica with the highest version number
4. p increments the view number
5. p broadcasts the latest version to all members of P(d,v)
Data replication Dynamic Quorum changes
Installation of a new view (cont.): Let v be the old view and v' the view after the change. We have that W(d,v) ≥ Aw(d), n(d,v') ≥ Ar(d), and Ar(d) + Aw(d) > |P(d)|,
which implies that W(d,v) + n(d,v') > |P(d)|. That is, every write quorum of view v intersects the alive replicas of view v' when changing to v': a "consistent" change of view.
Data replication Dynamic Quorum changes
View changes handle network partitions. Assume a data item d is replicated at five processors A,B,C,D,E, with Ar(d)=2, Aw(d)=3. Initial view 0: P(d,0) = {A,B,C,D,E}, W(d,0)=5, R(d,0)=1
Assume that the system partitions: A B C D || E /* node E cannot communicate with the others */. If an update request for d arrives at any processor while the view is not updated, the operation cannot be performed
Let view 1 be: P(d,1) = {A,B,C,D}, W(d,1)=4, R(d,1)=1. In this view, partition {E} can still read d but cannot update d;
partition {A,B,C,D} can read and write d. Assume that D fails, giving partitions {E} and {A,B,C}: d can be read by both partitions, but to enable write operations the view must be updated again, e.g. P(d,2) = {A,B,C}, W(d,2)=3, R(d,2)=1
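The accessibility restrictions that drive this example can be written as two predicates (a small sketch of ours; function names are not from the text):

```python
def can_read(n_alive, Ar):
    """d may be read in a view only if n(d,v) >= Ar(d)."""
    return n_alive >= Ar

def can_write(n_alive, Aw):
    """d may be written in a view only if n(d,v) >= Aw(d)."""
    return n_alive >= Aw

Ar, Aw = 2, 3                  # thresholds from the five-replica example
assert can_write(4, Aw)        # a 4-member partition can install a writable view
assert can_read(3, Ar) and can_write(3, Aw)   # so can {A,B,C} after D fails
assert not can_write(2, Aw)    # two alive replicas no longer suffice for writing
assert not can_read(1, Ar)     # a lone replica cannot anchor a new readable view
```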
Data replication: view change illustrated
[Figure: processors A, B, C, D, E. Transaction T1 (r1(d), w1(d)) executes in view 0 (W(d,0)=5, R(d,0)=1) on all five replicas; T2 (r2(d)) and T3 (r3(d), w3(d)) execute in view 1 (W(d,1)=4, R(d,1)=1); T4 (r4(d), w4(d)) executes in view 2 (W(d,2), R(d,2)=1).]
The quorum-based algorithm serializes T2 before T3
Is there any notion of majority behind view changes?
Outline: Serializability: Quorum-based protocol, Replicated Servers, Dynamic Quorum Changes, One-copy serializability, View-based quorums
Atomic Multicast: Virtual synchrony, Ordered Multicast, Reliable and Causal Multicast, Trans Algorithm, Group Membership, Transis Algorithm
Update Propagation
Atomic Multicast
In many situations, a one-to-many form of communication is useful, e.g. for maintaining replicated servers, etc.
Two forms of one-to-many communication are possible: Broadcast and Multicast
Broadcast: the sender sends to all the nodes in the system
Multicast: the sender sends to a subset L of the nodes in the system
We are interested in multicast, and we assume the sender sends a message m
Atomic Multicast: a naïve algorithm for multicast: for each processor p in L, send m to p
Problem: the sender may fail after sending m to only some of the processors
Some members of the list L receive m while others do not. This is not acceptable in fault-tolerant systems. Multicast must be reliable: if one processor in L receives m, every alive processor in L must receive m
Atomic Multicast The naïve algorithm for Multicast + 2PC technique The 2PC technique can improve reliability of the naïve algorithm
Idea: regard a multicast as a transaction ("all-or-nothing" property) and distinguish the delivery of a message to the application from the reception of that message
[Figure: message m is received by the multicast layer at each node and only later delivered to the application (APP).]
Atomic Multicast The naïve algorithm for Multicast + 2PC technique The 2PC technique can improve reliability of the naïve algorithm
Idea (cont.): Rule for delivery: deliver a message only when you know that the message will be delivered everywhere
Algorithm for the sender: 1. send m to every processor in L; 2. when you have received all acknowledgements, deliver m locally and tell every processor in L that it can deliver
Atomic Multicast: the naïve algorithm for multicast + 2PC technique. This technique might require a significant amount of work when a failed processor recovers. In addition, it inherits the blocking vulnerability of 2PC
The main difficulty comes from the correctness criteria: how can a processor determine which nodes in L are up? Virtual synchrony takes this into account
Atomic Multicast Virtual synchrony Accounts for the fact that it is difficult to determine exactly which are the non-failed processors
Processors are organized into groups that cooperate to perform a reliable multicast
Each group corresponds to a multicast list: multicast in a group
Group view : The current list of processors to receive a multicast message (+ some global properties)
Consistency of group view: common view on the members
Atomic Multicast: Virtual synchrony. An algorithm for performing reliable multicast is virtually synchronous if: 1. In any consistent state, there is a unique group view on which all members of the group agree
2. If a message m is multicast in group view v before view change c, then either: 2.1. No processor in v participating in c can ever receive m, or 2.2. All processors in v participating in c receive m before performing c
Atomic Multicast: virtual synchrony illustrated
[Figure: two executions over processors A, B, C, D with a view change c in which {A,B,C} participate. Case 2.1: no processor participating in c receives m (all end with empty message sets). Case 2.2: all processors participating in c receive m before performing c (all end holding {m}).]
Atomic Multicast Virtual synchrony View changes can be considered as checkpoints
Delivery list in virtual synchrony: between two consecutive "checkpoints" v and v', a set G of messages is multicast. A sender of a message in G must be in v. Hence, if p is removed from the view, the remaining processors can consider that p has failed
There is a guarantee that, from v' on, no message from p will be delivered in the future
Atomic Multicast Ordered multicasts One might want a multicast that satisfies a specific order: e.g. Causal order, total order
Causal order (for causal multicast): if processor p receives m1 and then multicasts m2, then every processor that receives {m1,m2} should receive m1 before m2
Total order (for atomic multicast): if p receives m1 before m2, then every processor that receives {m1,m2} should receive m1 before m2
(i.e. the same order of reception everywhere)
Atomic Multicast: why causal multicast? Assume that the data item x is replicated and consider the following scenario:
[Figure: processor p multicasts m1: "set x to zero"; processor q, after receiving m1, multicasts m2: "increment x by 1"; both messages reach processor r.]
m1 must be delivered before m2 at every replica. Otherwise, inconsistency!
Atomic Multicast: why total order for multicast? Assume that p sends m and after that p crashes; then, by some mechanism, q and r are informed about the crash of p, but q receives m before crash(p) while r receives crash(p) before m
[Figure: p multicasts m and then crashes; q and r observe m and crash(p) in different orders.]
Total order is necessary; otherwise, q and r might take different decisions
Atomic Multicast: why total order for multicast (cont.)? Assume that the data item x is a replicated queue and consider the following scenario:
[Figure: p multicasts m1: "insert a into x"; q multicasts m2: "delete a from x"; both messages reach r.]
m1 and m2 must be delivered in the same order everywhere. Otherwise, inconsistency!
Atomic Multicast The Trans algorithm
Executes between two view changes, exploiting the guarantee provided by virtual synchrony
Hence, the algorithm works within one view
Mechanisms: a combination of positive and negative acknowledgements for reliability. Piggybacking acknowledgements on messages being multicast simplifies the detection of missed messages and minimizes the need for explicit acknowledgements
Atomic Multicast
By piggybacking positive and negative acknowledgements, when a processor p receives a multicast message, p learns: which messages it does not need to acknowledge, and which messages it has missed and must request a retransmission of
Atomic Multicast: the idea behind Trans is illustrated by the following scenario. Let L = [P,Q,R] be a delivery list for multicast
Step 1. P multicasts m1. Step 2. Q receives m1 and piggybacks a positive acknowledgement on the next message m2 that it multicasts (we write m2:ack(m1) to mean m2 contains an ack for m1). Step 3. R receives m2 (i.e. m2:ack(m1))
Two cases are possible for R upon receipt of m2: Case 1: if R had received m1, it realizes that it does not need to send an acknowledgement for it, as Q has acknowledged it
Case 2: if R had not received m1, then R learns (because of the ack(m1) attached to m2) that m1 is missing, and requests a retransmission of m1 by attaching a negative acknowledgement for m1 to the next message it multicasts
Atomic Multicast Trans: an invariant The protocol maintains the following invariant
A processor p multicasts an acknowledgement of message m only if processor p has received m and all messages that causally precede m.
[Figure: the causal order of messages leading up to m.]
If you acknowledge m, you do not need to acknowledge the unacknowledged messages that causally precede m
Atomic Multicast: Trans: stable messages. A message is said to be stable if it has reached all the processors in the group view
This is detectable because each receiver of a message multicasts an acknowledgement for it
Some assumptions:
All messages are assumed to be uniquely identified by (processor_id, message_seq_number)
Each sender sequentially numbers its messages
A virtual synchrony layer is assumed
Atomic Multicast Trans: Variables used Each processor maintains:
ack_list : the list of identifiers of messages for which that node has to send a positive acknowledgement
nack_list : the list of identifiers of messages for which that node has to send a negative acknowledgement
G : the causal DAG contains all messages that the processor has received but that are not yet stable (m, m’) is in G if message m acknowledges message m’
Atomic Multicast Trans: retransmission Using information given by the local DAG, a processor can determine which messages it should have received
For such a message, a negative acknowledgement is multicast to request a retransmission
Atomic Multicast Trans: variables (cont.)
m : message container (serves as id of message here) m.message : application-level message (to be delivered at the app.) m.nacks : list of negative acknowledgments m.acks : list of positive acknowledgments
L : destinations list (maintained by an underlying algorithm)
Atomic Multicast Trans: Causal DAG functions used
add_to_DAG(m,G) : insert m into G
not_duplicate(m,G) : True if m has never been received before
causal(m,G) : True if all messages that m causally follows have been received
stable(m,G) : True if all (alive) processors have acknowledged m
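The four DAG functions can be sketched as methods on a small class. This is our own minimal illustration (the text names the functions but not a data structure); it tracks direct acknowledgements only, without the transitive acknowledgement that the real invariant provides:

```python
class CausalDAG:
    """Minimal sketch of the causal DAG G used by Trans.

    nodes:    message id -> set of message ids it directly acknowledges
    acked_by: message id -> set of processors that have (directly) acked it
    """
    def __init__(self, group):
        self.group = set(group)   # current view membership
        self.nodes = {}
        self.acked_by = {}

    def not_duplicate(self, m):
        """True if m has never been received before."""
        return m not in self.nodes

    def add_to_DAG(self, m, sender, acks):
        """Insert m into G; sending m counts as acknowledging its predecessors."""
        self.nodes[m] = set(acks)
        self.acked_by.setdefault(m, set()).add(sender)
        for prior in acks:
            self.acked_by.setdefault(prior, set()).add(sender)

    def causal(self, m):
        """True if every message that m causally follows has been received."""
        return all(prior in self.nodes for prior in self.nodes.get(m, ()))

    def stable(self, m):
        """True if all processors in the view have acknowledged m."""
        return self.acked_by.get(m, set()) >= self.group

g = CausalDAG(group={"P", "Q", "R"})
g.add_to_DAG("m1", sender="P", acks=[])
g.add_to_DAG("m2", sender="Q", acks=["m1"])    # m2 piggybacks ack(m1)
assert g.causal("m2")
assert not g.stable("m1")                      # R has not acknowledged m1 yet
```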
Atomic Multicast: Trans: sending a message
Trans_send(message):
  create a container m
  m.message := message
  m.acks := ack_list      /* attach positive acknowledgements to m */
  m.nacks := nack_list    /* attach negative acknowledgements to m */
  put m in ack_list
  add_to_DAG(m, G)
  send m to every processor in L
Atomic Multicast: Trans: receiving a message
Trans_receive(m):
  for every nack(x) in m.nacks      /* note: for a retransmission, m.nacks is empty */
    if x in G then multicast x
  if not_duplicate(m,G) then        /* m has never been received before */
    for every ack(m') in m.acks do  /* update of nack_list */
      if not_duplicate(m',G) then add m' to nack_list
    if m is in nack_list then       /* m is a retransmission */
      remove m from nack_list
    add m to undelivered_list
    remove m.nacks from m
    add_to_DAG(m,G)
    while there is m' in undelivered_list such that causal(m',G) do
      remove m' from undelivered_list
      deliver m'.message to the application
  compute ack_list to be all m in G such that causal(m,G) and there is no m' such that causal(m',G) and m acknowledges m'
  for every m' in G do
    if stable(m',G) then remove m' from G and reclaim the buffer
Atomic Multicast: Trans: the causal DAG. G contains all messages that the processor has received but that are not yet stable; (m, m') is in G if message m acknowledges message m'
[Figure: DAG fragment with nodes b1, d1 and c1. The message (c1, [ack(b1), nack(d1)]) represents a piece of the DAG.]
Processor C, which sends c1, acknowledges message b1 but also requests a retransmission of message d1
Atomic Multicast Trans: retransmission
Can a retransmitted message be different from the original message? The retransmitted message must contain all the positive acknowledgements that the original message had;
the list of negative acknowledgements is not useful in the retransmission
Atomic Multicast: Trans illustrated
[Figure: processors A, B, C, D. A multicasts (a,[]); D does not get a. B multicasts (b,[ack(a)]); D receives b and learns about the unreceived message a. All nodes know that B has got a.]
Atomic Multicast: Trans illustrated (cont.)
[Figure: C multicasts (c,[ack(b)]) and acknowledges only b (an implicit acknowledgement for a). All nodes know that C has got b and a (there is no nack(a) in the message C broadcasts).]
Atomic Multicast: Trans illustrated (cont.)
[Figure: D multicasts (d,[nack(a)]); D's message carries a nack(a), requesting a retransmission of a. A does not get D's message. D must piggyback only ack(c) on its next multicast, i.e. an implicit ack for a and b.]
Atomic Multicast: Trans illustrated (cont.)
[Figure: C re-multicasts (a,[]) with no attached ack; note that the representation of a at D changes. In their next messages, C and B acknowledge d.]
Atomic Multicast: Trans: properties. If a message is received by a processor that does not fail, eventually every non-failed processor receives that message
Messages are delivered in causal order
G forms a tight description of the causal ordering among messages
If a processor fails, the storage requirement of the algorithm grows without bound
The Trans algorithm therefore needs to be composed with a distributed algorithm that maintains consistency of views, to avoid unbounded storage requirements
Atomic Multicast: group membership. Goal: to maintain the membership (delivery) list -- itself a replicated data item
Properties of the membership list: the value of Li is the same at all processors in Li
Processors install new versions of the membership list in exactly the same order
Atomic Multicast Group membership Determination of group view:
A group view is determined by the set of alive processors that are involved in the computation of that view
Notion of agreed view
Atomic Multicast: correctness criteria for a reliable group membership distributed algorithm: 1. There is an initial agreed group view in any execution 2. Processors change their local views based on information about failures and new processors
3. The agreed view is unique in any consistent state 4. If p and q are members of an agreed view that goes through a series of changes, p and q see the same sequence of changes
5. The algorithm responds to notifications that processors are faulty or operating
Atomic Multicast The Transis Algorithm
An "asynchronous agreement distributed algorithm". Properties (to make things simple!):
Paranoid (one sacrifices accurate identification of failed processors): if any non-suspected processor p suspects a processor q of failing, q is declared faulty and all processors in the group will remove q from their local view, even if they can still communicate with q
Unidirectional: once a processor is removed from the membership list, that processor is never readmitted to the group, except as a "new" processor
Atomic Multicast: the paranoid and unidirectional properties lead to monotone agreement: the set of suspected processors monotonically increases at every non-suspected processor;
eventually, every non-suspected processor holds the same set of suspected processors (agreement!)
If everyone suspects everyone, the view collapses
Virtual synchrony: views are changed in a manner that ensures virtual synchrony; every processor must be able to agree on what the last message from every other processor was
Atomic Multicast: interactions with Trans. Transis interacts with Trans through the causal DAG (C-DAG)
e.g. Transis might query the C-DAG, modify the C-DAG, block some messages, or require the immediate sending of some messages
Fault detection: through a consistent line in the C-DAG, i.e. a global state in which all the members agree on the new group view
Atomic Multicast Achievement of a consistent line Message F(q) means “q is suspected faulty”
Algorithm for new group view computation
When processor p suspects that processor q is faulty, p multicasts F(q) using Trans
When message F(q) becomes causal in the DAG at a processor r, processor r removes q from its local view and multicasts F(q)
When all non-suspected processors have received F(q) from each other, agreement on the new view is reached
Atomic Multicast: achievement of a consistent line. Message F(q) means "q is suspected faulty"
[Figure: processors A, B, C, D, E over time; A-D each multicast F(E). The F(q) messages are multicast using Trans; all the processors receive the same DAG, and the consistent line separating view Lx from view Lx+1 can be computed from the DAG.]
Atomic Multicast: messages partitioned by a consistent line: regular messages that precede the consistent line are in view Lx;
regular messages that follow the consistent line are in view Lx+1
[Figure: processors A-E; regular messages m1 and m2 fall on either side of the consistent line between Lx and Lx+1.]
A regular message is a message different from F(q)
Atomic Multicast: message delivery with respect to a consistent line (virtual synchrony):
regular messages that follow the consistent line are in view Lx+1 and should be delivered in view Lx+1
[Figure: message m1 follows the consistent line and should be delivered after installing Lx+1 at C.]
Atomic Multicast: message delivery with respect to a consistent line (virtual synchrony):
regular messages that precede the consistent line are in view Lx and should be delivered before installing the new view Lx+1
[Figure: message m2 precedes the consistent line and is delivered before Lx+1 is installed. A regular message is a message different from F(q).]
Atomic Multicast: handling concurrent failures: the first processor that learns both F(q) and F(r) proposes F(q,r)
[Figure: processors A, B, C, D, E, G; suspicions F(C) and F(G) are merged into F(C,G), which all surviving processors multicast. New view: A, B, D, E.]
Atomic Multicast Handling messages from suspected processors
Assume q is suspected of being faulty. At some processors, messages from q might causally precede or follow F(q)
To ensure virtual synchrony: all messages from q that precede the first F(q) must be delivered before installing the new view
No message from q that causally follows any F(q) can be delivered (such messages are discarded)
Messages from q that are concurrent with all F(q) are discarded
Atomic Multicast: handling messages from a suspected processor, illustrated
[Figure: processors A-E multicast F(E). A message from E that causally follows an F(E) is discarded; a message concurrent with all F(E) is discarded; a message that precedes the first F(E) is delivered.]
Atomic Multicast Preventing regular messages from straddling view change
When a processor p receives F(q), processor p multicasts F(q) before it multicasts any regular message, i.e. F(q) receives "high priority"
The causal layer guarantees that any message that p sends after F(q) causally follows F(q)
Hence the following situation cannot occur:
[Figure: a regular message m from p straddling the view change, concurrent with F(q).]
Atomic Multicast Preventing empty views
Suspected processors are removed forever; this might lead to empty views
When a non-faulty processor learns that it has been removed from the view, it fails and then rejoins as a new processor (incarnation numbers are used for this)
An algorithm for adding new processors is needed
Atomic Multicast: adding a new processor. The algorithm is similar to the algorithm for fault detection
Progressive construction of the join set (to account for concurrent join propositions)
When all processors in the current view multicast the same join set, a consistent line is achieved
When a consistent line is achieved, joining processors must be properly initialized to start participating in the multicast (Trans)
Atomic Multicast: update propagation. Deals with relaxed consistency constraints on replicated data
The main requirement: updates on data items must be propagated among all the replicas in a timely manner.
Example: routing tables
Useful for large networks
Algorithms are often based on gossip: processors contact each other to bring themselves up to date by exchanging "news"
Atomic Multicast Gossip-type Algorithms
Based on the meaning of “update”
1) update = “overwrite the old value of an object”
2) update = “modify the old value of an object” (e.g. increment a counter, change an entry of an array)
Atomic Multicast Epidemic Algorithms
A method for gossiping updates to replicated data.
Assume: update = "overwrite the old value";
d: a data item replicated on M servers;
the computation of an update of d assigns d a version number (timestamp)
When "gossiping": if a newer version is proposed, take it
Difficulty: how to spread new updates without too many messages
Atomic Multicast Epidemic Algorithms
The simplest method for distributing the news of a new update is direct mail: when a server performs an update, it informs all other servers directly
M-1 messages are sent. Properties: simple; all servers are reached in a fault-free system
Problems: the sender might fail, so some destinations will not hear of the update; a large communication burden falls on the sender; not suitable for dynamic topologies
Atomic Multicast: epidemics. Idea: when a server performs an update, it informs its direct neighbors;
a neighbor informs its neighbors, and the propagation continues
- a large number of messages;
- no guarantee that the update will reach all sites
Atomic Multicast: randomized epidemics. Idea: when a server performs an update, it informs some direct neighbors
Details and definitions: let u(d) be an update of data item d;
1. A susceptible server for u(d) is one that has never heard of u(d);
2. An infectious server for u(d) is one that has heard of u(d) and is actively propagating u(d);
3. A removed server for u(d) is one that knows of u(d) but is no longer actively propagating it
Atomic Multicast: randomized epidemics. The algorithm: let k be a parameter;
1. When a susceptible server learns u(d), it becomes infectious for u(d)
2. An infectious server repeatedly contacts a random server and informs it about u(d)
3. If an infectious server p for u(d) contacts a server p' that is infectious or removed for u(d), then with probability 1/k, p becomes removed
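The three rules can be simulated directly. Below is a minimal sketch of ours (the function name, the synchronous rounds, and the seeded RNG are our assumptions, not part of the text) that measures what fraction of servers never hear the update:

```python
import random

def epidemic(M, k, seed=0):
    """Simulate the randomized epidemic rules above (illustrative sketch).

    States per server: "S" susceptible -> "I" infectious -> "R" removed.
    Returns the fraction of servers that never hear the update.
    """
    rng = random.Random(seed)
    state = ["S"] * M
    state[0] = "I"                       # the updating server starts infectious
    while "I" in state:
        for p in range(M):
            if state[p] != "I":
                continue
            q = rng.randrange(M)         # rule 2: contact a random server
            if state[q] == "S":
                state[q] = "I"           # rule 1: susceptible becomes infectious
            elif rng.random() < 1.0 / k:
                state[p] = "R"           # rule 3: lose interest with prob. 1/k
    return state.count("S") / M

residual = epidemic(M=1000, k=3)
# The analysis below predicts a residual of roughly exp(-(k+1)),
# i.e. about 2% of servers staying uninfected for k=3.
```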
Atomic Multicast: analysis of the randomized epidemics. In: fraction of infectious servers; s: fraction of susceptible servers
Assume that in every time unit (dt), every infectious server contacts one other server
Let ds be the variation of s during dt, and dIn the variation of In during dt. Then, on average, s·In servers become infectious during dt and (1-s)·In/k become removed during dt;
thus ds = -s·In (1) and dIn = s·In - (1-s)·In/k (2)
(2)/(1) gives: dIn/ds = 1/(k·s) - (k+1)/k : a differential equation
with solution: In(s) = [(k+1)/k]·(1-s) + ln(s)/k
Solve In(s) = 0 to determine the fraction of sites that do not hear about the update when the epidemic has terminated (In = 0)
Atomic Multicast: analysis of the randomized epidemics. A trivial solution is s = 1;
the interesting solution satisfies s0(k) = exp(-(k+1)(1-s)) ≈ exp(-(k+1)) when s << 1
An infectious site becomes removed at its i-th contact with probability (1-1/k)^(i-1)·(1/k)
Thus the expected number of messages is mtot = M·(1-s)·Σ{ i·(1-1/k)^(i-1)·(1/k), 1 ≤ i < ∞ } ≈ M·(1-s)·k
Atomic Multicast: analysis of the randomized epidemics. Consider increasing the parameter from k to k+1:
the additional fraction of sites reached is s0(k) - s0(k+1) ≈ exp(-(k+1)) - exp(-(k+2)) = (1 - 1/e)·exp(-(k+1)), i.e. the number of new sites infected per unit of extra effort decreases exponentially with k
Conclusion: epidemic algorithms are good for the initial distribution (high probability of contacting susceptible servers), but some other mechanism is needed to infect the last few sites
Atomic Multicast: anti-entropy. Idea: one site contacts another to exchange recent updates. A processor p initiates a contact by executing Gossip(): pick a random processor s; exchange(s)
The contacted processor sends the list of its timestamps for the data item d
Methods for accomplishing exchange(s): Pull: p pulls the most recent updates from s (take); Push: p pushes the most recent updates to s (give); Pull-push: p takes the most recent updates from s and gives its most recent updates to s
Atomic Multicast: anti-entropy.
If most processors are infectious, Pull is better than Push
Let pi be the probability that, after i contacts, a random processor is still uninfected
Pull: pi+1 = pi^2
(a processor remains uninfected only if it contacts an uninfected processor)
Push: pi+1 ≈ pi/e (from s0(k) ≈ exp(-(k+1)))
If pi ≈ 1 then Pull is slow; if pi << 1 then Pull is much better than Push
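The two recurrences can be iterated to compare convergence speed. A small sketch of ours (function names are not from the text; the push step uses the p/e approximation, which holds when the infected population is already large):

```python
import math

def pull_steps(p0, target):
    """Rounds of pull anti-entropy until the uninfected probability < target."""
    p, steps = p0, 0
    while p >= target:
        p, steps = p * p, steps + 1        # pull: p_{i+1} = p_i^2
    return steps

def push_steps(p0, target):
    """Rounds of push anti-entropy under the approximation p_{i+1} ~ p_i / e."""
    p, steps = p0, 0
    while p >= target:
        p, steps = p / math.e, steps + 1   # push: p_{i+1} ~ p_i / e
    return steps

# Once p is small, pull's quadratic convergence beats push's constant factor.
assert pull_steps(0.1, 1e-12) < push_steps(0.1, 1e-12)
```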
Atomic Multicast: update logs. Here we assume that update = "modify the old value"
The value of an object is considered to consist of: an initial value and the history of updates applied to the object. For consistency, the same history of updates must be applied to all copies
A new update is then a function of the whole history of updates
Atomic Multicast Update logs Each processor keeps a log: an ordered listing containing all the updates that the processor has processed
Processors distribute their logs to each other
Causal log propagation: events are added to the log in an order that is consistent with causality
Notations: L: log; L[i]: the first i elements of L; e: an element of L; index(e): the position of e in L; L[e]: shorthand for L[index(e)]
Atomic Multicast: update logs. Consistency of logs: let e be an event that is first executed at processor p. Then for every processor j = 1,..,M and every event f, f is in p.L[index(e)] if and only if f is in j.L[index(e)]
When a processor p propagates its log, p propagates all events in its log: event propagation is transitive
Log propagation transmits the context of events: the context of an event is described by a vector timestamp
Contexts are merged upon reception (the vector timestamp technique)
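The merge step is the standard component-wise maximum of vector timestamps. A minimal sketch of ours (the text only says contexts are merged; the function names are not from the slides):

```python
def merge_contexts(vt_a, vt_b):
    """Merge two vector timestamps component-wise (standard technique)."""
    return [max(a, b) for a, b in zip(vt_a, vt_b)]

def happened_before(vt_a, vt_b):
    """vt_a < vt_b in the causal order: <= in every component, < in some."""
    return all(a <= b for a, b in zip(vt_a, vt_b)) and vt_a != vt_b

# p1 receives another processor's log context and merges it into its own.
p1_context = [5, 2, 0, 3]
p3_context = [1, 2, 3, 3]
merged = merge_contexts(p1_context, p3_context)
assert merged == [5, 2, 3, 3]
assert happened_before(p3_context, merged)
```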
Atomic Multicast: causal log propagation
[Figure: processors p1, p2, p3, p4. When event 6 occurs at p1, the first 5 events from p1, 2 events from p2, 3 events from p3, and 3 events from p4 are in p1's log.]