Post on 04-Jan-2016
November 2005 Distributed systems: distributed algorithms
1
Distributed Systems:
Distributed algorithms
November 2005 Distributed systems: distributed algorithms
2
Overview of chapters
• Introduction• Co-ordination models and languages• General services• Distributed algorithms
– Ch 10 Time and global states, 11.4-11.5– Ch 11 Coordination and agreement, 12.1-12.5
• Shared data
• Building distributed services
November 2005 Distributed systems: distributed algorithms
3
This chapter: overview• Introduction• Logical clocks• Global states• Failure detectors• Mutual exclusion• Elections• Multicast communication• Consensus and related problems
November 2005 Distributed systems: distributed algorithms
4
Logical clocks• Problem: ordering of events
– requirement for many algorithms– physical clocks cannot be used
• use causality:– within a single process: observation– between different processes: sending of a
message happens before receiving the same message
November 2005 Distributed systems: distributed algorithms
5
Logical clocks (cont.)
• Formalization: happens before relation
• Rules: – if x happens before y in any process p
then x y – for any message m: send (m) receive (m)– if x y and y z
then x z
• Implementation: logical clocks
x y
November 2005 Distributed systems: distributed algorithms
6
Logical clocks (cont.)
• Logical clock– counter appropriately incremented– one counter per process
• Physical clock– counts oscillations occurring in a crystal at a
definitive frequency
November 2005 Distributed systems: distributed algorithms
7
Logical clocks (cont.)
• Rules for incrementing local logical clock1 for each event (including send) in process p:
Cp := Cp + 1
2 when a process sends a message m, it piggybacks on m the value of Cp
3 on receiving (m, t), a process q
• computes Cq := max (Cq, t)
• applies rule 1: Cq := Cq +1
Cq is logical time for event receive(m)
November 2005 Distributed systems: distributed algorithms
8
•1
a
Logical clocks (cont.)
• Logical timestamps: example
P1
P2
P3 •1
b
•2
c
•3
d•4
e
•5
f
•3
g
•0
November 2005 Distributed systems: distributed algorithms
9
Logical clocks (cont.)
• C(x) logical clock value for event x
• Correct usage: if x y then C(x) < C(y)
• Incorrect usage: if C(x) < C(y) then x y
• Solution: Logical vector clocks
November 2005 Distributed systems: distributed algorithms
10
Logical clocks (cont.)
• Vector clocks for N processes:– at process Pi: Vi[j] for j = 1, 2,…,N
– Properties:
if x y then V(x) < V(y)
if V(x) < V(y) then x y
November 2005 Distributed systems: distributed algorithms
11
Logical clocks (cont.)
• Rules for incrementing logical vector clock1 for each event (including send) in process Pi:
Vi[i] := Vi[i] + 1
2 when a process Pi sends a message m, it piggybacks on m the value of Vi
3 on receiving (m, t), a process Pi
• apply rule 1
• Vi[j] := max(Vi[j] , t[j]) for j = 1, 2,…, N
November 2005 Distributed systems: distributed algorithms
12
Logical clocks (cont.)
• Logical vector clocks : example
a b
c d
e f
m1
m2
(2,0,0)(1,0,0)
(2,1,0) (2,2,0)
(2,2,2)(0,0,1)
p1
p2
p3
Physical time
November 2005 Distributed systems: distributed algorithms
13
This chapter: overview• Introduction• Logical clocks• Global states• Failure detectors• Mutual exclusion• Elections• Multicast communication• Consensus and related problems
November 2005 Distributed systems: distributed algorithms
14
Global states• Detect global properties
p2p1
message
garbage object
objectreference
a. Garbage collection
p2p1 wait-for
wait-forb. Deadlock
p2p1
activatepassive passivec. Termination
November 2005 Distributed systems: distributed algorithms
15
Global states (cont.)
• Local states & events– Process Pi : ei
k events
sik state, before event k
– History of Pi :
hi = < ei0, ei
1, ei2,…>
– Finite prefix of history of Pi :
hik = < ei
0, ei1, ei
2,…, eik >
November 2005 Distributed systems: distributed algorithms
16
Global states (cont.)
• Global states & events– Global history
H = h1 h2 h3 … hn
– Global state (when?)
S = ( s1p, s2
q, …, snu)
consistent?– Cut of the systems execution
C = h1c1 h1
c2 … h1cn
November 2005 Distributed systems: distributed algorithms
17
Global states (cont.)
• Example of cuts:
m1 m2
p1
p2Physical
time
e10
Consistent cutInconsistent cut
e11
e12
e13
e20
e 21
e 22
November 2005 Distributed systems: distributed algorithms
18
Global states (cont.)
• Finite prefix of history of Pi :
hik = < ei
0, ei1, ei
2,…, eik >
• Cut of the systems executionC = h1
c1 h1c2 … h1
cn
• Consistent cut C e C, f e f C
• Consistent global statecorresponds to consistent cut
November 2005 Distributed systems: distributed algorithms
19
Global states (cont.)
• Model execution of a (distributed) system
S0 S1 S2 S3 …
– Series of transitions between consistent states
– Each transition corresponds to one single event• Internal event
• Sending message
• Receiving message
– Simultaneous events order events
November 2005 Distributed systems: distributed algorithms
20
Global states (cont.)
• Definitions:– Run = ordering of all events (in a global history)
consistent with each local history’s ordering
– Linearization =consistent run +consistent with
– S’ reachable from S linearization: … S … S’ …
November 2005 Distributed systems: distributed algorithms
21
Global states (cont.)
• Kinds of global state predicates:– Stable
– Safety
– Liveness
= true in SS’, S … S’ = true in S’
= undesirable propertyS0 = initial state of systemS, S0 … S = false in S
= desirable propertyS0 = initial state of systemS, S0 … S = true in S
November 2005 Distributed systems: distributed algorithms
22
Global states (cont.)
• Snapshot algorithm of Chandy & Lamport– Record consistent global state– Assumptions:
• Neither channels nor processes fail• Channels are unidirectional and provide FIFO-
ordered message delivery• Graph of channels and processes is strongly
connected• Any process may initiate a global snapshot• Process may continue their execution during the
snapshot
November 2005 Distributed systems: distributed algorithms
23
Global states (cont.)
• Snapshot algorithm of Chandy & Lamport– Elements of algorithm
• Players: processes Pi with– Incoming channels– Outgoing channels
• Marker messages• 2 rules
– Marker receiving rule– Marker sending rule
– Start of algorithm• A process acts as it received a marker message
November 2005 Distributed systems: distributed algorithms
24
Global states (cont.)
Marker receiving rule for process pi
On pi’s receipt of a marker message over channel c:if (pi has not yet recorded its state) it
records its process state now;records the state of c as the empty set;turns on recording of messages arriving over other incoming channels;
else pi records the state of c as the set of messages it has received over c since it saved its state.
end if
Marker sending rule for process pi
After pi has recorded its state, for each outgoing channel c: pi sends one marker message over c (before it sends any other message over c).
November 2005 Distributed systems: distributed algorithms
25
Global states (cont.)
• Example:
p1 p2c2
c1
account widgets
$1000 (none)
account widgets
$50 2000
November 2005 Distributed systems: distributed algorithms
26
Global states (cont.)
p1 p2(empty)<$1000, 0> <$50, 2000>
(empty)
c2
c1
1. Global state S 0
2. Global state S 1 p1 p2(Order 10, $100), M<$900, 0> <$50, 2000>
(empty)
c2
c1
3. Global state S 2 p1 p2(Order 10, $100), M<$900, 0> <$50, 1995>
(five widgets)
c2
c1
(M = marker message)
4. Global state S 3 p1 p2(Order 10, $100)<$900, 5> <$50, 1995>
(empty)
c2
c1 C2 = <>C1=<(five widgets)>
November 2005 Distributed systems: distributed algorithms
27
Global states (cont.)
p1 p2(empty)<$1000, 0> <$50, 2000>
(empty)
c2
c1
1. Global state S 0
(M = marker message)
4. Global state S 3 p1 p2(Order 10, $100)<$900, 5> <$50, 1995>
(empty)
c2
c1 C2 = <>C1=<(five widgets)>
5. Global state S 4 p1 p2(Order 10, $100)<$900, 5> <$50, 1995>
M
c2
c1 C2 = <>C1=<(five widgets)>
6. Global state S 5 p1 p2(Order 10, $100)<$900, 5> <$50, 1995>
(empty)
c2
c1 C2 = <>C1=<(five widgets)>
November 2005 Distributed systems: distributed algorithms
28
Global states (cont.)
• Observed state– Corresponds to consistent cut– Reachable!
Sinit Sfinal
Ssnap
actual execution e 0,e1,...
recording recording begins ends
pre-snap: e '0,e'1,...e'R-1 post-snap: e'R,e'R+1,...
November 2005 Distributed systems: distributed algorithms
29
This chapter: overview• Introduction• Logical clocks• Global states• Failure detectors• Mutual exclusion• Elections• Multicast communication• Consensus and related problems
November 2005 Distributed systems: distributed algorithms
30
Failure detectors• Properties
– Unreliable failure detector: answers with
• Suspected• Unsuspected
– Reliable failure detector: answers with
• Failed• Unsuspected
• Implementation– Every T sec: multicast by P of “P is here”– Maximum on message transmission time:
• Asynchronous system: estimate E• Synchronous system: absolute bound A
No “P is here” within T + E sec
No “P is here” within T + A sec
November 2005 Distributed systems: distributed algorithms
31
This chapter: overview• Introduction• Logical clocks• Global states• Failure detectors• Mutual exclusion• Elections• Multicast communication• Consensus and related problems
November 2005 Distributed systems: distributed algorithms
32
Mutual exclusion• Problem: how to give a single process
temporarily a privilege?– Privilege = the right to access a (shared)
resource– resource = file, device, window,…
• Assumptions– clients execute the mutual exclusion algorithm– the resource itself might be managed by a
server– Reliable communication
November 2005 Distributed systems: distributed algorithms
33
Mutual exclusion (cont.)
• Basic requirements:– ME1: at most one process might execute
in the shared resource at any time
(Safety)– ME2: a process requesting access to the
shared resource is eventually granted it (Liveness)
– ME3: Access to the shared resource should be granted in happened-before order
(Ordering or fairness)
November 2005 Distributed systems: distributed algorithms
34
Mutual exclusion (cont.)
• Solutions:– central server algorithm
– distributed algorithm using logical clocks
– ring-based algorithm
– voting algorithm
• Evaluation– Bandwidth (= #messages to enter and exit)
– Client delay (incurred by a process at enter and exit)
– Synchronization delay (delay between exit and enter)
November 2005 Distributed systems: distributed algorithms
35
Mutual exclusion (cont.)
central server algorithm• Central server offering 2 operations:
– enter()• if resource free
then operation returns without delayelse request is queued and return from operation is delayed
– exit()• if request queue is empty
then resource is marked freeelse return for a selected request is executed
November 2005 Distributed systems: distributed algorithms
36
Mutual exclusion (cont.)
central server algorithm• Example:
Server
Queue:
P1
P2P3
P4
User
Enter()
November 2005 Distributed systems: distributed algorithms
37
Mutual exclusion (cont.)
central server algorithm• Example:
Server
Queue:
P1
P2P3
P4
User3
Enter()
November 2005 Distributed systems: distributed algorithms
38
Mutual exclusion (cont.)
central server algorithm• Example:
Server
Queue:
P1
P2P3
P4
User3
November 2005 Distributed systems: distributed algorithms
39
Mutual exclusion (cont.)
central server algorithm• Example:
Server
Queue:
P1
P2P3
P4
User3
enter()
November 2005 Distributed systems: distributed algorithms
40
Mutual exclusion (cont.)
central server algorithm• Example:
Server
Queue:
4
P1
P2P3
P4
User3
enter()
November 2005 Distributed systems: distributed algorithms
41
Mutual exclusion (cont.)
central server algorithm• Example:
Server
Queue:
4
P1
P2P3
P4
User3
enter()
enter()
November 2005 Distributed systems: distributed algorithms
42
Mutual exclusion (cont.)
central server algorithm• Example:
Server
Queue:
4, 2
P1
P2P3
P4
User3
enter()
enter()
November 2005 Distributed systems: distributed algorithms
43
Mutual exclusion (cont.)
central server algorithm• Example:
Server
Queue:
4, 2
P1
P2P3
P4
User3
enter()
enter() exit()
November 2005 Distributed systems: distributed algorithms
44
Mutual exclusion (cont.)
central server algorithm• Example:
Server
Queue:
4, 2
P1
P2P3
P4
User
enter()
enter()
November 2005 Distributed systems: distributed algorithms
45
Mutual exclusion (cont.)
central server algorithm• Example:
Server
Queue:
2
P1
P2P3
P4
User4
enter()
November 2005 Distributed systems: distributed algorithms
46
Mutual exclusion (cont.)
central server algorithm• Evaluation:
– ME3 not satisfied!
– Performance:• single server is performance bottleneck
• Enter critical section: 2 messages
• Synchronization: 2 messages between exit of one process and enter of next
– Failure: • Central server is single point of failure
• what if a client, holding the resource, fails?
• Reliable communication required
November 2005 Distributed systems: distributed algorithms
47
Mutual exclusion (cont.)
ring-based algorithm • All processes arranged in a
– unidirectional
– logical
ring
• token passed in ring
• process with token has access to resource
November 2005 Distributed systems: distributed algorithms
48
Mutual exclusion (cont.)
ring-based algorithm
P1
P4
P3
P2
P5
P6
November 2005 Distributed systems: distributed algorithms
49
Mutual exclusion (cont.)
ring-based algorithm
P1
P4
P3
P2
P5
P6
P2 can use resource
November 2005 Distributed systems: distributed algorithms
50
Mutual exclusion (cont.)
ring-based algorithm
P1
P4
P3
P2
P5
P6
P2 stopped using resource and forwarded token
November 2005 Distributed systems: distributed algorithms
51
Mutual exclusion (cont.)
ring-based algorithm
P1
P4
P3
P2
P5
P6
P3 doesn’t need resource and forwards token
November 2005 Distributed systems: distributed algorithms
52
Mutual exclusion (cont.)
ring-based algorithm
P1
P4
P3
P2
P5
P6
November 2005 Distributed systems: distributed algorithms
53
Mutual exclusion (cont.)
ring-based algorithm • Evaluation:
– ME3 not satisfied
– efficiency
• high when high usage of resource
• high overhead when very low usage
– failure
• Process failure: loss of ring!
• Reliable communication required
November 2005 Distributed systems: distributed algorithms
54
Mutual exclusion (cont.)
distributed algorithm using logical clocks• Distributed agreement algorithm
– multicast requests to all participating processes– use resource when all other participants agree
(= reply received)• Processes
– keep logical clock; included in all request messages
– behave as finite state machine:• released• wanted• held
November 2005 Distributed systems: distributed algorithms
55
Mutual exclusion (cont.)
distributed algorithm using logical clocks• Ricart and Agrawala’s algorithm: process Pj
– on initialization:• state := released;
– to obtain resource:• state := wanted;
• T = logical clock value for next event;
• multicast request to other processes <T, Pj>;
• wait for n-1 replies;
• state := held;
November 2005 Distributed systems: distributed algorithms
56
Mutual exclusion (cont.)
distributed algorithm using logical clocks• Ricart and Agrawala’s algorithm: process Pj
– on receipt of request <Ti, Pi> :
• if (state = held) or (state = wanted and (T,Pj) < (Ti,Pi) )
then queue request from Pi
else reply immediately to Pi
– to release resource:• state := released;
• reply to any queued requests;
November 2005 Distributed systems: distributed algorithms
57
Mutual exclusion (cont.)
distributed algorithm using logical clocks• Ricart and Agrawala’s algorithm: example
– 3 processes
– P1 and P2 will request it concurrently
– P3 not interested in using resource
November 2005 Distributed systems: distributed algorithms
58
Mutual exclusion (cont.)
distributed algorithm using logical clocks• Ricart and Agrawala’s algorithm: example
P1
releasedQueue:
P2
releasedQueue:
P3
releasedQueue:
November 2005 Distributed systems: distributed algorithms
59
Mutual exclusion (cont.)
distributed algorithm using logical clocks
P1
wantedQueue:
P2
releasedQueue:
P3
releasedQueue:
<41,P1>
<41,P1>0
November 2005 Distributed systems: distributed algorithms
60
Mutual exclusion (cont.)
distributed algorithm using logical clocks
P1
wantedQueue:
P2
wantedQueue:
P3
releasedQueue:
<41,P1>
<41,P1>
<34,P2>
<34,P2>
0
0
November 2005 Distributed systems: distributed algorithms
61
Mutual exclusion (cont.)
distributed algorithm using logical clocks
P1
wantedQueue:
P2
wantedQueue:
P3
releasedQueue:
<41,P1>
<41,P1>
<34,P2>
<34,P2>
<43,P3>
0
0
November 2005 Distributed systems: distributed algorithms
62
Mutual exclusion (cont.)
distributed algorithm using logical clocks
P1
wantedQueue:
P2
wantedQueue:
P3
releasedQueue:
<41,P1>
<41,P1>
<34,P2>
<34,P2>
<43,P3>
<45,P3>
0
0
November 2005 Distributed systems: distributed algorithms
63
Mutual exclusion (cont.)
distributed algorithm using logical clocks
P1
wantedQueue:
P2
wantedQueue:
P3
releasedQueue:
<41,P1>
<41,P1>
<34,P2>
<34,P2>
<43,P3>
<45,P3>
1
1
November 2005 Distributed systems: distributed algorithms
64
Mutual exclusion (cont.)
distributed algorithm using logical clocks
P1
wantedQueue:
P2
wantedQueue:P1
P3
releasedQueue:
<41,P1>
<41,P1>
<34,P2>
<34,P2>
<43,P3>
<45,P3>
1
1
November 2005 Distributed systems: distributed algorithms
65
Mutual exclusion (cont.)
distributed algorithm using logical clocks
P1
wantedQueue:
P2
wantedQueue:P1
P3
releasedQueue:
<41,P1>
<41,P1>
<34,P2>
<34,P2>
<43,P3>
<45,P3>
1
1
November 2005 Distributed systems: distributed algorithms
66
Mutual exclusion (cont.)
distributed algorithm using logical clocks
P1
wantedQueue:
P2
wantedQueue:P1
P3
releasedQueue:
<41,P1>
<41,P1>
<34,P2>
<34,P2>
<43,P3>
<45,P3>
1
1
November 2005 Distributed systems: distributed algorithms
67
Mutual exclusion (cont.)
distributed algorithm using logical clocks
P1
wantedQueue:
P2
heldQueue:P1
P3
releasedQueue:
<41,P1>
<41,P1>
<34,P2>
<34,P2>
<43,P3>
<45,P3>
1
2
November 2005 Distributed systems: distributed algorithms
68
Mutual exclusion (cont.)
distributed algorithm using logical clocks
P1
wantedQueue:
P2
heldQueue:P1
P3
releasedQueue:
1
2
November 2005 Distributed systems: distributed algorithms
69
Mutual exclusion (cont.)
distributed algorithm using logical clocks
P1
wantedQueue:
P2
releasedQueue:
P3
releasedQueue:
1
0
November 2005 Distributed systems: distributed algorithms
70
Mutual exclusion (cont.)
distributed algorithm using logical clocks• Evaluation
– Performance:
• expensive algorithm:
2 * ( n - 1 ) messages to get resource
• Client delay: round trip
• Synchronization delay: 1 message to pass section to another process
• does not solve the performance bottleneck
– Failure
• each process must know all other processes
• Reliable communication required
November 2005 Distributed systems: distributed algorithms
71
Mutual exclusion (cont.)
voting algorithm• Approach of Maekawa:
– Communication with subset of partners should suffice– Candidate collects sufficient votes
• Voting set: Vi = voting set for pi
i, j: Vi Vj – | Vi | = K
– Pj contained in M voting sets– Optimal solution
• K ~ N• M = K
November 2005 Distributed systems: distributed algorithms
72
Mutual exclusion (cont.)
voting algorithmOn initialization
state := RELEASED;voted := FALSE;
For pi to enter the critical sectionstate := WANTED;Multicast request to all processes in Vi – {pi};Wait until (number of replies received = (K – 1));state := HELD;
On receipt of a request from pi at pj (i ≠ j)if (state = HELD or voted = TRUE)then
queue request from pi without replying; else
send reply to pi;voted := TRUE;
end if
Maekawa’s algorithm
November 2005 Distributed systems: distributed algorithms
73
Mutual exclusion (cont.)
voting algorithmMaekawa’s algorithm (cont.)For pi to exit the critical section
state := RELEASED;Multicast release to all processes in Vi – {pi};
On receipt of a release from pi at pj (i ≠ j)if (queue of requests is non-empty)then
remove head of queue – from pk, say; send reply to pk;voted := TRUE;
else voted := FALSE;
end if
November 2005 Distributed systems: distributed algorithms
74
Mutual exclusion (cont.)
voting algorithm• Evaluation
– Properties• ME1: OK
• ME2: NOK, deadlock possiblesolution: process requests in order
• ME3: Ok– Performance
• Bandwidth: on enter 2 N messages + on exit N messages
• Client delay: round trip
• Synchronization delay: round trip
– Failure:• Crash of process on another voting set can be tolerated
• Reliable communication required
November 2005 Distributed systems: distributed algorithms
75
Mutual exclusion (cont.)
• Discussion
– algorithms are expensive and not practical
– algorithms are extremely complex in the
presence of failures
– better solution in most cases:
• let the server, managing the resource, perform
concurrency control
• gives more transparency for the clients
November 2005 Distributed systems: distributed algorithms
76
This chapter: overview• Introduction• Logical clocks• Global states• Failure detectors• Mutual exclusion• Elections• Multicast communication• Consensus and related problems
November 2005 Distributed systems: distributed algorithms
77
Elections• Problem statement:
– select a process from a group of processes– several processes may start election
concurrently
• Main requirement:– unique choice– Select process with highest id
• Process id
• <1/load, process id>
November 2005 Distributed systems: distributed algorithms
78
Elections (cont.)
• Basic requirements:– E1: participant pi set electedi = or
electedi = P; P process with highest id (Safety)
– E2: all processes pi participate and setelectedi or crash (Liveness)
not yet defined
November 2005 Distributed systems: distributed algorithms
79
Elections (cont.)
• Solutions:– Bully election algorithm
– Ring based election algorithm
• Evaluation
– Bandwidth ( ~ total number of messages)
– Turnaround time (the number of serialized message
transmission times between initiation and termination
of a single run)
November 2005 Distributed systems: distributed algorithms
80
Elections (cont.)
Bully election• Assumptions:
– each process has identifier– processes can fail during an election– communication is reliable
• Goal: – surviving member with the largest identifier is
elected as coordinator
November 2005 Distributed systems: distributed algorithms
81
Elections (cont.)
Bully election
• Roles for processes:
– coordinator
• elected process
• has highest identifier, at the time of election
– initiator
• process starting the election for some reason
November 2005 Distributed systems: distributed algorithms
82
Elections (cont.)
Bully election• Three types of messages:
– election message• sent by an initiator of the election
to all other processes with a higher identifier
– answer message• a reply message sent by the receiver of an election
message
– coordinator message• sent by the process becoming the coordinator
to all other processes with lower identifiers
November 2005 Distributed systems: distributed algorithms
83
Elections (cont.)
Bully election• Algorithm:
– send election message:• process doing it is called initiator
• any process may do it at any time
• when a failed process is restarted, it starts an election, even though the current coordinator is functioning (bully)
– a process receiving an election message• replies with an answer message
• will start an election itself (why?)
November 2005 Distributed systems: distributed algorithms
84
Elections (cont.)
Bully election
• Algorithm:
– actions of an initiator
• when not receiving an answer message within a
certain time (2Ttrans +Tprocess) becomes coordinator
• when having received an answer message
( a process with a higher identifier is active)
and not receiving a coordinator message (after x
time units)
will restart elections
November 2005 Distributed systems: distributed algorithms
85
Elections (cont.)
Bully election
• Example: election of P2 after failure of P3 and P4
P1 P3
P2 P4
P1 initiator
election
November 2005 Distributed systems: distributed algorithms
86
Elections (cont.)
Bully election
• Example: election of P2 after failure of P3 and P4
P1 P3
P2 P4
P2 and P3 reply and start election
election
answer
November 2005 Distributed systems: distributed algorithms
87
Elections (cont.)
Bully election
• Example: election of P2 after failure of P3 and P4
P1 P3
P2 P4
Election messages of P2 arrive
election
answer
November 2005 Distributed systems: distributed algorithms
88
Elections (cont.)
Bully election
• Example: election of P2 after failure of P3 and P4
P1 P3
P2 P4
P3 replies
election
answer
November 2005 Distributed systems: distributed algorithms
89
Elections (cont.)
Bully election
• Example: election of P2 after failure of P3 and P4
P1 P3
P2 P4
P3 fails
election
answer
November 2005 Distributed systems: distributed algorithms
90
Elections (cont.)
Bully election
• Example: election of P2 after failure of P3 and P4
P1 P3
P2 P4
Timeout at P1 : election starts again
election
answer
November 2005 Distributed systems: distributed algorithms
91
Elections (cont.)
Bully election
• Example: election of P2 after failure of P3 and P4
P1 P3
P2 P4
Timeout at P1 : election starts again
election
answer
November 2005 Distributed systems: distributed algorithms
92
Elections (cont.)
Bully election
• Example: election of P2 after failure of P3 and P4
P1 P3
P2 P4
P2 replies and starts election
election
answer
November 2005 Distributed systems: distributed algorithms
93
Elections (cont.)
Bully election
• Example: election of P2 after failure of P3 and P4
P1 P3
P2 P4
P2 receives no replies coordinator
election
answer
November 2005 Distributed systems: distributed algorithms
94
Elections (cont.)
Bully election• Evaluation
– Correctness: E1 & E2 ok, if• Reliable communication• No process replaces crashed process
– Correctness: no guarantee for E1, if• Crashed process is replaced by process with same id• Assumed timeout values are inaccurate (= unreliable failure
detector)
– Performance• Worst case: O(n2)• Optimal: bandwidth: n-2 messages
turnaround: 1 message
November 2005 Distributed systems: distributed algorithms
95
Elections (cont.) Ring based election
• Assumptions:
– processes arranged in a logical ring
– each process has an identifier: i for Pi
– processes remain functional and reachable during the
algorithm
November 2005 Distributed systems: distributed algorithms
96
Elections (cont.) Ring based election
• Messages:– forwarded over logical ring– 2 types:
• election: used during electioncontains identifier
• elected: used to announce new coordinator
• Process States:– participant– non-participant
November 2005 Distributed systems: distributed algorithms
97
Elections (cont.) Ring based election
• Algorithm
– process initiating an election
• becomes participant
• sends election message to its neighbour
November 2005 Distributed systems: distributed algorithms
98
Elections (cont.) Ring based election
P1
P21P8
P11
P14
P5
initiator
11
November 2005 Distributed systems: distributed algorithms
99
Elections (cont.) Ring based election
• Algorithm– upon receiving an election message, a process
compares identifiers:• Received: identifier in message
• own: identifier of process
– 3 cases:• Received > own
• Received < own
• Received = own
November 2005 Distributed systems: distributed algorithms
100
Elections (cont.) Ring based election
• Algorithm– receive election message
• Received > own– message forwarded
– process becomes participant
November 2005 Distributed systems: distributed algorithms
101
Elections (cont.) Ring based election
P1
P21P8
P11
P14
P5
initiator
11
November 2005 Distributed systems: distributed algorithms
102
Elections (cont.) Ring based election
P1
P21P8
P11
P14
P5
initiator
11
November 2005 Distributed systems: distributed algorithms
103
Elections (cont.) Ring based election
• Algorithm– receive election message
• Received > own– message forwarded
– process becomes participant
• Received < own and process is non-participant– substitutes own identifier in message
– message forwarded
– process becomes participant
November 2005 Distributed systems: distributed algorithms
104
Elections (cont.) Ring based election
P1
P21P8
P11
P14
P5
initiator
11
November 2005 Distributed systems: distributed algorithms
105
Elections (cont.) Ring based election
P1
P21P8
P11
P14
P5
initiator
21
November 2005 Distributed systems: distributed algorithms
106
Elections (cont.) Ring based election
P1
P21P8
P11
P14
P5
initiator
21
November 2005 Distributed systems: distributed algorithms
107
Elections (cont.) Ring based election
P1
P21P8
P11
P14
P5
initiator
21
November 2005 Distributed systems: distributed algorithms
108
Elections (cont.) Ring based election
• Algorithm: – receive election message
• Received > own
– ...
• Received < own and process is non-participant– ...
• Received = own– identifier must be greatest
– process becomes coordinator
– new state: non-participant
– sends elected message to neighbour
November 2005 Distributed systems: distributed algorithms
109
Elections (cont.) Ring based election
P1
P21P8
P11
P14
P5 coordinator
21
November 2005 Distributed systems: distributed algorithms
110
Elections (cont.) Ring based election
• Algorithm receive election message• Received > own
– message forwarded– process becomes participant
• Received < own and process is non-participant– substitutes own identifier in message– message forwarded– process becomes participant
• Received = own– identifier must be greatest– process becomes coordinator– new state: non-participant– sends elected message to neighbour
November 2005 Distributed systems: distributed algorithms
111
Elections (cont.) Ring based election
• Algorithm – receive elected message
• participant:– new state: non-participant
– forwards message
• coordinator:– election process completed
November 2005 Distributed systems: distributed algorithms
112
Elections (cont.) Ring based election
P1
P21P8
P11
P14
P5 coordinator
21
November 2005 Distributed systems: distributed algorithms
113
Elections (cont.) Ring based election
P1
P21P8
P11
P14
P5 coordinator
21
November 2005 Distributed systems: distributed algorithms
114
Elections (cont.) Ring based election
P1
P21P8
P11
P14
P5 coordinator
21
November 2005 Distributed systems: distributed algorithms
115
Elections (cont.) Ring based election
P1
P21P8
P11
P14
P5 coordinator
21
November 2005 Distributed systems: distributed algorithms
116
Elections (cont.) Ring based election
P1
P21P8
P11
P14
P5 coordinator 21
November 2005 Distributed systems: distributed algorithms
117
Elections (cont.) Ring based election
P1
P21P8
P11
P14
P5 coordinator
21
November 2005 Distributed systems: distributed algorithms
118
Elections (cont.) Ring based election
P1
P21P8
P11
P14
P5 coordinator
November 2005 Distributed systems: distributed algorithms
119
Elections (cont.) Ring based election
• Evaluation– Why is condition
Received < own and process is non-participantnecessary? (see next slide for full algorithm)
– Number of messages:• worst case: 3 * n - 1• best case: 2 * n
– concurrent elections: messages are extinguished • as soon as possible• before winning result is announced
November 2005 Distributed systems: distributed algorithms
120
Elections (cont.) Ring based election
• Algorithm receive election message• Received > own
– message forwarded– process becomes participant
• Received < own and process is non-participant– substitutes own identifier in message– message forwarded– process becomes participant
• Received = own– identifier must be greatest– process becomes coordinator– new state: non-participant– sends elected message to neighbour
November 2005 Distributed systems: distributed algorithms
121
This chapter: overview• Introduction• Logical clocks• Global states• Failure detectors• Mutual exclusion• Elections• Multicast communication• Consensus and related problems
November 2005 Distributed systems: distributed algorithms
122
Multicast communication• Essential property:
– 1 multicast operation <> multiple sendsHigher efficiencyStronger delivery guarantees
• Operations: g = group, m = message
– X-multicast(g, m)– X-deliver(m)
• <> receive(m)
– X additional property Basic, Reliable, FifO,….
November 2005 Distributed systems: distributed algorithms
123
Multicast communication(cont.) IP multicast
• Datagram operations – with multicast IP address
• Failure model cfr UDP
– Omission failures– No ordering or reliability guarantees
November 2005 Distributed systems: distributed algorithms
124
Multicast communication(cont.) Basic multicast
• = IP multicast + delivery guarantee if multicasting process does not crash
• Straightforward algorithm: (with reliable send)
• Ex. practical algorithm using IP-multicast
To B-multicast(g, m):
p g: send(p, m)
On receive(m) at p:
B-deliver(m)
November 2005 Distributed systems: distributed algorithms
125
Multicast communication(cont.) Reliable multicast
• Properties:– Integrity (safety)
• A correct process delivers a message at most once
– Validity (liveness)
• Correct process p multicasts m p delivers m
– Agreement (liveness)
correct process p delivering m all correct processes will deliver m
– Uniform agreement (liveness)
process p (correct or failing) delivering m all correct processes will deliver m
November 2005 Distributed systems: distributed algorithms
126
Multicast communication(cont.) Reliable multicast
• 2 algorithms:
1. Using B-multicast
2. Using IP-multicast + piggy backed acks
November 2005 Distributed systems: distributed algorithms
127
Multicast communication(cont.) Reliable multicast
Algorithm 1 with B-multicast
November 2005 Distributed systems: distributed algorithms
128
Multicast communication(cont.) Reliable multicast
• Correct?
– Integrity
– Validity
– Agreement
• Efficient?
– NO: each message transmitted g times
Algorithm 1 with B-multicast
November 2005 Distributed systems: distributed algorithms
129
Multicast communication(cont.) Reliable multicast
Messageprocessing
Delivery queueHold-back
queue
deliver
Incomingmessages
When delivery guarantees aremet
Algorithm 2 with IP-multicast
November 2005 Distributed systems: distributed algorithms
130
Multicast communication(cont.) Reliable multicast
Algorithm 2 with IP-multicast
Data structures at process p:
Sgp : sequence number
Rgq : sequence number of the latest message it has delivered from q
On initialization:
Sgp = 0
For process p to R-multicast message m to group g
IP-multicast (g, <m, Sgp , <q, Rg
q > >)
Sgp ++
On IP-deliver (<m, S, <q, R>>) at q from p
November 2005 Distributed systems: distributed algorithms
131
Multicast communication(cont.) Reliable multicast
Algorithm 2 with IP-multicastOn IP-deliver (<m, S, <q, R>>) at q from p
if S = Rgp + 1
then R-deliver (m)
Rgp ++
check hold-back queue
else if S > Rgp + 1
then store m in hold-back queue
request missing messages endif
endif
if R > Rg then request missing messages endif
November 2005 Distributed systems: distributed algorithms
132
Multicast communication(cont.) Reliable multicast
• Correct?– Integrity: seq numbers + checksums
– Validity: if missing messages are detected
– Agreement: if copy of message remains available
Algorithm 2 with IP-multicast
November 2005 Distributed systems: distributed algorithms
133
Multicast communication(cont.) Reliable multicast
• 3 processes in group: P, Q, R
• State of process:– S: Next_sequence_number
– Rq: Already_delivered from Q
– Stored messages
• Presentation:
Algorithm 2 with IP-multicast: example
P: 2Q: 3 R: 5< >
November 2005 Distributed systems: distributed algorithms
134
Multicast communication(cont.) Reliable multicast
• Initial state:
Algorithm 2 with IP-multicast: example
P: 0Q: -1 R: -1< >
Q: 0P: -1 R: -1< >
R: 0P: -1 Q: -1< >
November 2005 Distributed systems: distributed algorithms
135
Multicast communication(cont.) Reliable multicast
• First multicast by P:
Algorithm 2 with IP-multicast: example
P: 1Q: -1 R: -1< mp0 >
Q: 0P: -1 R: -1< >
R: 0P: -1 Q: -1< >
P: mp0, 0, <Q:-1, R:-1>
November 2005 Distributed systems: distributed algorithms
136
Multicast communication(cont.) Reliable multicast
• Arrival multicast by P at Q:
Algorithm 2 with IP-multicast: example
P: 1Q: -1 R: -1< mp0 >
Q: 0P: 0 R: -1< mp0 >
R: 0P: -1 Q: -1< >
P: mp0, 0, <Q:-1, R:-1>
November 2005 Distributed systems: distributed algorithms
137
Multicast communication(cont.) Reliable multicast
• New state:
Algorithm 2 with IP-multicast: example
P: 1Q: -1 R: -1< mp0 >
Q: 0P: 0 R: -1< mp0 >
R: 0P: -1 Q: -1< >
November 2005 Distributed systems: distributed algorithms
138
Multicast communication(cont.) Reliable multicast
• Multicast by Q:
Algorithm 2 with IP-multicast: example
P: 1Q: -1 R: -1< mp0 >
Q: 1P: 0 R: -1< mp0 ,mq0 >
R: 0P: -1 Q: -1< >
Q: mq0, 0, <P:0, R:-1>
November 2005 Distributed systems: distributed algorithms
139
Multicast communication(cont.) Reliable multicast
• Arrival of multicast by Q:
Algorithm 2 with IP-multicast: example
P: 1Q: 0 R: -1< mp0 ,mq0 >
Q: 1P: 0 R: -1< mp0 , ,mq0 >
R: 0P: -1 Q: 0< mq0 >
Q: mq0, 0, <P:0, R:-1>
November 2005 Distributed systems: distributed algorithms
140
Multicast communication(cont.) Reliable multicast
• When to delete stored messages?
Algorithm 2 with IP-multicast: example
P: 1Q: 0 R: -1< mp0 ,mq0 >
Q: 1P: 0 R: -1< mp0 , ,mq0 >
R: 0P: -1 Q: 0< mq0 >
November 2005 Distributed systems: distributed algorithms
141
Multicast communication(cont.) Ordered multicast
• FIFO
• Causal
• Total
if a correct process P:multicast(g, m);multicast(g, m’);
then for all correct processes:deliver(m’) deliver(m) before deliver(m’)
if multicast(g, m) multicast(g, m’)then for all correct processes:
deliver(m’) deliver(m) before deliver(m’)
November 2005 Distributed systems: distributed algorithms
142
Multicast communication(cont.) Ordered multicast
• Total
• FIFO-Total = FIFO + Total
• Causal-Total = Causal + Total
• Atomic = reliable + Total
if p: deliver(m) deliver( m’)then for all correct processes:
deliver(m’) deliver(m) before deliver(m’)
November 2005 Distributed systems: distributed algorithms
143
Multicast communication(cont.) Ordered multicast
F3
F1
F2
T2
T1
P1 P2 P3
Time
C3
C1
C2
Notice the consistent ordering of totally ordered messages T1 and T2,
the FIFO-related messages F1 and F2 and the causally
related messages C1 and C3 – and the otherwise arbitrary delivery ordering of messages.
November 2005 Distributed systems: distributed algorithms
144
Multicast communication(cont.) FIFO multicast
• Alg. 1: R-multicast using IP-multicast
– Correct?• Sender assigns Sg
p
• Receivers deliver in this order
• Alg. 2 on top of any B-multicast
November 2005 Distributed systems: distributed algorithms
145
Multicast communication(cont.) FIFO multicast
Algorithm 2 on top of any B-multicast
Data structures at process p:
Sgp : sequence number
Rgq : sequence number of the latest message it has delivered from q
On initialization:
Sgp = 0; Rg
q = -1
For process p to FO-multicast message m to group g
B-multicast ( g, <m, Sgp >)
Sgp ++
On B-deliver (<m, S >) at q from p
November 2005 Distributed systems: distributed algorithms
146
Multicast communication(cont.) FIFO multicast
Algorithm 2 on top of any B-multicast
On B-deliver (<m, S >) at q from p
if S = Rgp + 1
then FO-deliver (m)
Rgp ++
check hold-back queue
else if S > Rgp + 1
then store m in hold-back queue endif
endif
November 2005 Distributed systems: distributed algorithms
147
Multicast communication(cont.) TOTAL multicast
• Basic approach:– Sender: assign totally ordered identifiers iso process
ids– Receiver: deliver as for FIFO ordering
• Alg. 1: use a (single) sequencer process
• Alg. 2: participants collectively agree on the assignment of sequence numbers
November 2005 Distributed systems: distributed algorithms
148
Multicast communication(cont.) TOTAL multicast: sequencer process
November 2005 Distributed systems: distributed algorithms
149
Multicast communication(cont.) TOTAL multicast: sequencer process
• Correct?
• Problems?– A single sequencer process
• bottleneck• single point of failure
November 2005 Distributed systems: distributed algorithms
150
Multicast communication(cont.) TOTAL multicast: ISIS algorithm
• Approach:– Sender:
• B-multicasts message
– Receivers:• Propose sequence numbers to sender
– Sender: • uses returned sequence numbers to
generate agreed sequence number
November 2005 Distributed systems: distributed algorithms
151
Multicast communication(cont.) TOTAL multicast: ISIS algorithm
2
1
1
2
2
1 Message
2 Proposed Seq
P2
P3
P1
P4
3 Agreed Seq
3
3
November 2005 Distributed systems: distributed algorithms
152
Multicast communication(cont.) TOTAL multicast: ISIS algorithm
Data structures at process p:
Agp : largest agreed sequence number
Pgp : largest proposed sequence number by P
On initialization:
Pgp = 0
For process p to TO-multicast message m to group g
B-multicast ( g, <m, i >) /i = unique id for m
On B-deliver (<m, i> ) at q from p
Pgq = max (Ag
q, Pgq) + 1
send (p, < i, Pgq >)
store <m, i, Pgq > in hold-back queue
November 2005 Distributed systems: distributed algorithms
153
Multicast communication(cont.) TOTAL multicast: ISIS algorithm
On receive( i, P) at p from q
wait for all replies; a is the largest reply
B-multicast(g, < “order”, i, a >)
On B-deliver (<“order”, i, a > ) at q from p
Agq = max (Ag
q, a)
attach a to message i in hold-back queue
reorder messages in hold-back queue (increasing sequence numbers)
while message m in front of hold-back queue has been assigned
an agreed sequence number
do remove m from hold-back queue
TO-deliver (m)
November 2005 Distributed systems: distributed algorithms
154
Multicast communication(cont.) TOTAL multicast: ISIS algorithm
• Correct?– Processes will agree on sequence number for
a message– Sequence numbers are monotonically
increasing– No process can prematurely deliver a
message• Performance
– 3 serial messages!• Total ordering
– <> FIFO– <> causal
November 2005 Distributed systems: distributed algorithms
155
Multicast communication(cont.) Causal multicast
• Limitations:– Causal order only by multicast operations– Non-overlapping, closed groups
• Approach:– Use vector timestamps– Timestamp = count number of multicast
messages
November 2005 Distributed systems: distributed algorithms
156
Multicast communication(cont.) Causal multicast: vector timestamps
Meaning?
November 2005 Distributed systems: distributed algorithms
157
Multicast communication(cont.) Causal multicast: vector timestamps
• Correct?– Message timestamp
m V m’ V’
– Given multicast(g,m) multicast(g,m’)proof V < V’
November 2005 Distributed systems: distributed algorithms
158
This chapter: overview• Introduction• Logical clocks• Global states• Failure detectors• Mutual exclusion• Elections• Multicast communication• Consensus and related problems
November 2005 Distributed systems: distributed algorithms
159
Consensus & related problems• System model
– N processes pi
– Communication is reliable– Processes mail fail
• Crash• Byzantine
– No message signing• Message signing limits the harm a faulty process can do
• Problems– Consensus– Byzantine generals– Interactive consistency
November 2005 Distributed systems: distributed algorithms
160
Consensus
• Problem statement pi: undecided state & pi proposes vi
– Message exchanges
– Finally: each process pi sets decision variable di, enters decided state and may not change di
• Requirements:– Termination: eventually each correct process pi sets di
– Agreement: pi and pj correct & in decided state di = dj
– Integrity: correct processes all propose same value d any process in decided state has
chosen d
November 2005 Distributed systems: distributed algorithms
161
Consensus
• Simple algorithm (no process failures)
– Collect processes in group g– For each process pi:
• R-multicast(g, vi)• Collect N values• d = majority (v1,v2,...,vN)
• Problems with failures:– Crash: detected? not in asynchronous systems– Byzantine? Faulty process can send around
different values
November 2005 Distributed systems: distributed algorithms
162
Byzantine generals• Problem statement
– Informal: • agree to attack or to retreat• commander issues the order• lieutenants are to decide to attack or retreat• all can be ‘treacherous’
– Formal• One process proposes value• Others are to agree
• Requirements:– Termination: each process eventually decides– Agreement: all correct processes select the same value– Integrity: commander correct other correct processes
select value of commander
November 2005 Distributed systems: distributed algorithms
163
Interactive consistency
• Problem statement– Correct processes agree upon a vector of values (one
value for each process)
• Requirements:– Termination: each process eventually decides
– Agreement: all correct processes select the same value
– Integrity: if pi correct
then all correct processes decide on vi as
the i-the component of their vector
November 2005 Distributed systems: distributed algorithms
164
Related problems & solutions
• Basic solutions:– Consensus:
• Ci(v1, v2, ..., vN) = decision value of pi
– Byzantine generals• j is commander, proposing v
• BGi (j, v) = decision value of pi
– Interactive consensus• ICi(v1, v2, ..., vN)[j] = jth value in the decision vector of pi
with v1, v2, ..., vN values proposed by processes
November 2005 Distributed systems: distributed algorithms
165
Related problems & solutions
• Derived solutions:– IC from BG:
• Run BG N times once with each process as commander
• ICi(v1, v2, ..., vN)[j] = BG(j, vj)
– C from IC:• Run IC and produce vector of values
• Derive single value with appropriate function
– BG from C:
November 2005 Distributed systems: distributed algorithms
166
Related problems & solutions
• Derived solutions:– IC from BG:
– C from IC:
– BG from C:
• Commander pj sends value v to itself & other
processes
• Each process runs C with the values v1, v2, ...vN
they received
• BGi(j,v) = Ci(v1, v2, ...vN)
November 2005 Distributed systems: distributed algorithms
167
Consensus in a synchronous system
• Assumptions– Use B-multicast
– f of N processes may fail
• Approach– Proceed in f + 1 rounds
– In each round correct processes B-multicast values
• Variables
– Valuesir = set of proposed values known to process pi
at the beginning of round r
November 2005 Distributed systems: distributed algorithms
168
Consensus in a synchronous system
November 2005 Distributed systems: distributed algorithms
169
Consensus in a synchronous system
• Termination?– Synchronous system
• Correct?– Each process arrives at the same set of values at the end of the
final round
– Proof? • Assume sets are different ...
• Pi has value v & Pj doesn’t have value v
Pk : managed to send v to Pi and not to Pj
• Agreement & Integrity?– Processes apply the same function
November 2005 Distributed systems: distributed algorithms
170
Byzantine generals in a synchronous system
• Assumptions– Arbitrary failures
– f of N processes may be faulty
– Channels are reliable: no message injections
– Unsigned messages
November 2005 Distributed systems: distributed algorithms
171
Byzantine generals in a synchronous system
• Impossibility with 3 processes
• Impossibility with N <= 3f processes
p1 (Commander)
p2 p3
1:v1:v
2:1:v
3:1:u
p1 (Commander)
p2 p3
1:x1:w
2:1:w
3:1:x
Faulty processes are shown shaded
November 2005 Distributed systems: distributed algorithms
172
Byzantine generals in a synchronous system
• Solution with one faulty process– N = 4, f = 1
– 2 rounds of messages:• Commander sends its value to each of the lieutenants
• Each of the lieutenants sends the value it received to its peers
– Lieutenant receives• Value of commander
• N-2 values of peers
Use majority function
November 2005 Distributed systems: distributed algorithms
173
Byzantine generals in a synchronous system
• Solution with one faulty process
p1 (Commander)
p2 p3
1:v1:v
2:1:v
3:1:u
Faulty processes are shown shaded
p4
1:v
4:1:v
2:1:v 3:1:w
4:1:v
p1 (Commander)
p2 p3
1:w1:u
2:1:u
3:1:w
p4
1:v
4:1:v
2:1:u 3:1:w
4:1:v
November 2005 Distributed systems: distributed algorithms
174
Consensus & related problems
• Impossibility in asynchronous systems proof: No algorithm exists to reach consensus
– No guaranteed solution to• Byzantine generals problem
• Interactive consistency
• Reliable & totally ordered multicast
• Approaches for work around:– Masking faults: restart crashed process and use persistent storage
– Use failure detectors: make failure fail-silent by discarding messages
November 2005 Distributed systems: distributed algorithms
175
This chapter: overview• Introduction• Logical clocks• Global states• Failure detectors• Mutual exclusion• Elections• Multicast communication• Consensus and related problems
November 2005 Distributed systems: distributed algorithms
176
Distributed Systems:
Distributed algorithms