CS6601 DISTRIBUTED SYSTEMS
UNIT – IV
Dr.A.Kathirvel, Professor, Computer Science and Engg.
M N M Jain Engineering College, Chennai
SYNCHRONIZATION & REPLICATION
Introduction – Clocks, events and process states – Synchronizing
physical clocks – Logical time and logical clocks – Global states –
Coordination and Agreement – Introduction – Distributed
mutual exclusion – Elections – Transactions and Concurrency
Control – Transactions – Nested transactions – Locks – Optimistic
concurrency control – Timestamp ordering – Atomic Commit
protocols – Distributed deadlocks – Replication – Case study – Coda.
Coulouris, Dollimore, Kindberg and Blair
Distributed Systems: Concepts and Design, Edition 5, Addison-Wesley 2012
Time and Global States
Fig14.1: Skew between computer clocks in a distributed system
Fig14.2: Clock synchronization using a time server
[Figure: process p exchanges messages mr and mt with the time server, S.]
Fig14.3: An example synchronization subnet in an NTP implementation
[Figure: servers arranged in strata 1, 2 and 3.]
Note: Arrows denote synchronization control, numbers denote strata.
Fig14.4: Messages exchanged between a pair of NTP peers
[Figure: Server A and Server B exchange messages m and m' along their time axes; the timestamps are Ti-3, Ti-2, Ti-1 and Ti.]
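Fig14.9: Cuts
[Figure: processes p1 and p2 against physical time, exchanging messages m1 and m2; events e1^0, e1^1, e1^2, e1^3 occur at p1 and e2^0, e2^1, e2^2 at p2; an inconsistent cut and a consistent cut are marked.]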
Fig14.10: Chandy and Lamport’s ‘snapshot’ algorithm
Marker receiving rule for process pi
On pi’s receipt of a marker message over channel c:
    if (pi has not yet recorded its state)
        it records its process state now;
        records the state of c as the empty set;
        turns on recording of messages arriving over other incoming channels;
    else
        pi records the state of c as the set of messages it has received over c
        since it saved its state.
    end if
Marker sending rule for process pi
After pi has recorded its state, for each outgoing channel c:
    pi sends one marker message over c
    (before it sends any other message over c).
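The marker receiving rule can be sketched in Java as follows. This is only an illustration under simple assumptions: the class name SnapshotProcess, the use of an int as the process state and of strings as channel names and messages are all hypothetical, and the case of the process that initiates the snapshot without receiving a marker is not modelled. onMarker returns true when the marker sending rule must be applied next.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of the marker receiving rule for one process (all names are illustrative). */
class SnapshotProcess {
    int state;                                    // application state, here just a number
    Integer recordedState = null;                 // null until the process records its state
    final Set<String> incoming;                   // names of the incoming channels
    final Set<String> recording = new HashSet<>();                  // channels still being recorded
    final Map<String, List<String>> channelState = new HashMap<>(); // recorded channel states

    SnapshotProcess(int state, Set<String> incoming) {
        this.state = state;
        this.incoming = incoming;
    }

    /** An ordinary message is remembered only while its channel is being recorded. */
    void onMessage(String channel, String message) {
        if (recording.contains(channel)) {
            channelState.get(channel).add(message);
        }
    }

    /** Marker receiving rule; returns true if markers must now be sent on all outgoing channels. */
    boolean onMarker(String channel) {
        if (recordedState == null) {
            recordedState = state;                           // record the process state now
            for (String c : incoming) {
                channelState.put(c, new ArrayList<>());      // the marker's channel stays empty
                if (!c.equals(channel)) recording.add(c);    // record the other incoming channels
            }
            return true;
        }
        recording.remove(channel);   // state of c = messages received over c since the state was saved
        return false;
    }
}
```

Once a marker has arrived on every incoming channel, recording is empty and channelState holds the recorded state of each channel.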
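Fig14.13: Reachability between states in the snapshot algorithm
[Figure: the actual execution e0, e1, ... leads from Sinit to Sfinal, with the points where recording begins and ends marked; Ssnap is reached from Sinit by the pre-snap events e'0, e'1, ... e'R-1, and Sfinal is reached from Ssnap by the post-snap events e'R, e'R+1, ...]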
Fig14.14: Vector timestamps and variable values for the execution of Fig14.9
[Figure: processes p1 and p2 against physical time, exchanging messages m1 and m2, with cuts C1 and C2 marked.
Events at p1: (1,0) x1 = 1; (2,0) x1 = 100; (3,0) x1 = 105; (4,3) x1 = 90.
Events at p2: (2,1) x2 = 100; (2,2) x2 = 95; (2,3) x2 = 90.]
Fig14.15: The lattice of global states for the execution of Fig14.14
Sij = global state after i events at process 1 and j events at process 2
Level 0: S00
Level 1: S10
Level 2: S20
Level 3: S21, S30
Level 4: S22, S31
Level 5: S23, S32
Level 6: S33
Level 7: S43
Fig15.2: Server managing a mutual exclusion token for a set of processes
[Figure: processes p1, p2, p3 and p4 and a server holding a queue of requests (4, 2). Message steps: 1. Request token; 2. Release token; 3. Grant token.]
Fig15.4: Ricart and Agrawala’s algorithm
On initialization
    state := RELEASED;
To enter the section
    state := WANTED;
    Multicast request to all processes;        (request processing deferred here)
    T := request’s timestamp;
    Wait until (number of replies received = (N – 1));
    state := HELD;
On receipt of a request <Ti, pi> at pj (i ≠ j)
    if (state = HELD or (state = WANTED and (T, pj) < (Ti, pi)))
    then
        queue request from pi without replying;
    else
        reply immediately to pi;
    end if
To exit the critical section
    state := RELEASED;
    reply to any queued requests;
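The test applied on receipt of a request in Ricart and Agrawala's algorithm can be written as a small predicate. The sketch below assumes that timestamps are long values and process identifiers are ints, and that (T, pj) < (Ti, pi) means "lower timestamp first, ties broken by lower process identifier"; the class and method names are hypothetical.

```java
/** Sketch of the request-handling test in Ricart and Agrawala's algorithm (illustrative names). */
class RicartAgrawalaRule {
    enum State { RELEASED, WANTED, HELD }

    /** True if (t1, p1) < (t2, p2): lower timestamp first, ties broken by process identifier. */
    static boolean precedes(long t1, int p1, long t2, int p2) {
        return t1 < t2 || (t1 == t2 && p1 < p2);
    }

    /** True if pj must queue the request <ti, pi> instead of replying immediately. */
    static boolean mustDefer(State state, long ownT, int pj, long ti, int pi) {
        return state == State.HELD
            || (state == State.WANTED && precedes(ownT, pj, ti, pi));
    }
}
```

For example, a process in state WANTED whose own request carries timestamp 3 defers an incoming request with timestamp 5, while a process in state RELEASED always replies immediately.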
Fig15.6: Maekawa’s algorithm – part 1
On initialization
    state := RELEASED;
    voted := FALSE;
For pi to enter the critical section
    state := WANTED;
    Multicast request to all processes in Vi;
    Wait until (number of replies received = K);
    state := HELD;
On receipt of a request from pi at pj
    if (state = HELD or voted = TRUE)
    then
        queue request from pi without replying;
    else
        send reply to pi;
        voted := TRUE;
    end if
For pi to exit the critical section
    state := RELEASED;
    Multicast release to all processes in Vi;
On receipt of a release from pi at pj
    if (queue of requests is non-empty)
    then
        remove head of queue – from pk, say;
        send reply to pk;
        voted := TRUE;
    else
        voted := FALSE;
    end if
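The voting behaviour of a single process pj in Maekawa's algorithm can be sketched as below. This is a minimal illustration, not a full implementation: message passing is replaced by return values, process identifiers are assumed to be non-negative ints, and the class name MaekawaVoter and the sentinel -1 (meaning "no reply sent") are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of the voting side of Maekawa's algorithm at one process pj (illustrative names). */
class MaekawaVoter {
    boolean held = false;                       // true while pj itself is in state HELD
    boolean voted = false;
    final Deque<Integer> queue = new ArrayDeque<>();   // queued requests, oldest first

    /** On receipt of a request from pi: returns the process replied to, or -1 if the request is queued. */
    int onRequest(int pi) {
        if (held || voted) {
            queue.addLast(pi);                  // queue request without replying
            return -1;
        }
        voted = true;
        return pi;                              // send reply to pi
    }

    /** On receipt of a release: returns the process replied to, or -1 if the queue was empty. */
    int onRelease() {
        if (!queue.isEmpty()) {
            voted = true;
            return queue.removeFirst();         // send reply to the head of the queue
        }
        voted = false;
        return -1;
    }
}
```

A voter that has replied to process 1 queues a later request from process 2 and grants it only when the release from process 1 arrives.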
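Fig15.7: A ring-based election in progress
[Figure: processes with identifiers 1, 3, 4, 9, 15, 17, 24 and 28 arranged in a ring, with an election message carrying the identifier 24.]
Note: The election was started by process 17. The highest process identifier encountered so far is 24. Participant processes are shown in a darker colour.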
Fig15.8: The bully algorithm
The election of coordinator p2, after the failure of p4 and then p3
[Figure: four stages among processes p1, p2, p3 and p4, where C marks the coordinator. Stage 1: election and answer messages; Stage 2: election and answer messages; Stage 3: timeout; eventually, Stage 4: coordinator message.]
Fig15.10: The hold-back queue for arriving multicast messages
[Figure: incoming messages enter the hold-back queue; when delivery guarantees are met they move to the delivery queue, from which they are delivered for message processing.]
Fig15.11: Total, FIFO and causal ordering of multicast messages
[Figure: processes P1, P2 and P3 against a time axis, with messages T1, T2, F1, F2, F3, C1, C2 and C3.]
Notice the consistent ordering of totally ordered messages T1 and T2, the FIFO-related messages F1 and F2 and the causally related messages C1 and C3 – and the otherwise arbitrary delivery ordering of messages.
Fig15.12: Display from bulletin board program
Bulletin board: os.interesting
Item  From          Subject
23    A.Hanlon      Mach
24    G.Joseph      Microkernels
25    A.Hanlon      Re: Microkernels
26    T.L’Heureux   RPC performance
27    M.Walker      Re: Mach
end
Fig15.14: The ISIS algorithm for total ordering
[Figure: processes P1, P2, P3 and P4 exchange messages in three numbered steps: 1 Message, 2, and 3 Agreed Seq.]
Fig15.16: Consensus for three processes
[Figure: P1 proposes v1=proceed, P2 proposes v2=proceed and P3, which crashes, proposes v3=abort; the consensus algorithm yields d1:=proceed and d2:=proceed.]
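Fig15.18: Three Byzantine generals
[Figure: two scenarios with commander p1 and lieutenants p2 and p3.
First scenario: p1 sends 1:v to p2 and to p3; p2 sends 2:1:v; p3 sends 3:1:u.
Second scenario: p1 sends 1:w to p2 and 1:x to p3; p2 sends 2:1:w; p3 sends 3:1:x.]
Faulty processes are shown coloured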
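Fig15.19: Four Byzantine generals
[Figure: two scenarios with commander p1 and lieutenants p2, p3 and p4.
First scenario: p1 sends 1:v to p2, p3 and p4; p2 sends 2:1:v; p3 sends 3:1:u and 3:1:w; p4 sends 4:1:v.
Second scenario: p1 sends 1:u to p2, 1:w to p3 and 1:v to p4; p2 sends 2:1:u; p3 sends 3:1:w; p4 sends 4:1:v.]
Faulty processes are shown coloured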
Transactions and Concurrency Control
Fig16.1: Operations of the Account interface
deposit(amount)
deposit amount in the account
withdraw(amount)
withdraw amount from the account
getBalance() -> amount
return the balance of the account
setBalance(amount)
set the balance of the account to amount
Operations of the Branch interface
create(name) -> account
create a new account with a given name
lookUp(name) -> account
return a reference to the account with the given name
branchTotal() -> amount
return the total of all the balances at the branch
Fig16.2:A client’s banking transaction
Transaction T:
a.withdraw(100);
b.deposit(100);
c.withdraw(200);
b.deposit(200);
Fig16.3: Operations in Coordinator interface
openTransaction() -> trans;
starts a new transaction and delivers a unique TID trans. This identifier will be
used in the other operations in the transaction.
closeTransaction(trans) -> (commit, abort);
ends a transaction: a commit return value indicates that the transaction has
committed; an abort return value indicates that it has aborted.
abortTransaction(trans);
aborts the transaction.
Fig16.4: Transaction life histories
Successful          Aborted by client    Aborted by server
openTransaction     openTransaction      openTransaction
operation           operation            operation
operation           operation            operation
                                         server aborts transaction
operation           operation            operation ERROR reported to client
closeTransaction    abortTransaction
Fig16.5: The lost update problem
Transaction T:
balance = b.getBalance();
b.setBalance(balance*1.1);
a.withdraw(balance/10)
Transaction U:
balance = b.getBalance();
b.setBalance(balance*1.1);
c.withdraw(balance/10)
Interleaved execution (resulting value on the right):
T: balance = b.getBalance();     $200
U: balance = b.getBalance();     $200
U: b.setBalance(balance*1.1);    $220
T: b.setBalance(balance*1.1);    $220
T: a.withdraw(balance/10)        $80
U: c.withdraw(balance/10)        $280
Fig16.6: The inconsistent retrievals problem
Transaction V:
a.withdraw(100)
b.deposit(100)
Transaction W:
aBranch.branchTotal()
Interleaved execution (resulting value on the right):
V: a.withdraw(100);                 $100
W: total = a.getBalance()           $100
W: total = total+b.getBalance()     $300
W: total = total+c.getBalance()
V: b.deposit(100)                   $300
Fig16.7: A serially equivalent interleaving of T and U
Transaction T:
balance = b.getBalance()
b.setBalance(balance*1.1)
a.withdraw(balance/10)
Transaction U:
balance = b.getBalance()
b.setBalance(balance*1.1)
c.withdraw(balance/10)
Interleaved execution (resulting value on the right):
T: balance = b.getBalance()      $200
T: b.setBalance(balance*1.1)     $220
U: balance = b.getBalance()      $220
U: b.setBalance(balance*1.1)     $242
T: a.withdraw(balance/10)        $80
U: c.withdraw(balance/10)        $278
Fig16.8: A serially equivalent interleaving of V and W
Transaction V:
a.withdraw(100);
b.deposit(100)
Transaction W:
aBranch.branchTotal()
Interleaved execution (resulting value on the right):
V: a.withdraw(100);                 $100
V: b.deposit(100)                   $300
W: total = a.getBalance()           $100
W: total = total+b.getBalance()     $400
W: total = total+c.getBalance()
...
Fig16.9: Read and write operation conflict rules
Operations of different   Conflict   Reason
transactions
read     read             No         Because the effect of a pair of read operations does not depend on the order in which they are executed
read     write            Yes        Because the effect of a read and a write operation depends on the order of their execution
write    write            Yes        Because the effect of a pair of write operations depends on the order of their execution
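The conflict rules reduce to a one-line test: two operations of different transactions on the same object conflict unless both are reads. A minimal sketch, with hypothetical names ConflictRules and Op:

```java
/** Sketch of the read/write conflict rules (illustrative names). */
class ConflictRules {
    enum Op { READ, WRITE }

    /** True if two operations of different transactions on the same object conflict. */
    static boolean conflict(Op first, Op second) {
        return first == Op.WRITE || second == Op.WRITE;
    }
}
```

Only the read/read pair returns false, matching the single "No" row of the table.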
Fig16.10: A non-serially equivalent interleaving of operations of transactions T and U
T: x = read(i)
T: write(i, 10)
U: y = read(j)
U: write(j, 30)
T: write(j, 20)
U: z = read(i)
Fig16.11: A dirty read when transaction T aborts
Transaction T:
a.getBalance()
a.setBalance(balance + 10)
Transaction U:
a.getBalance()
a.setBalance(balance + 20)
Interleaved execution (resulting value on the right):
T: balance = a.getBalance()       $100
T: a.setBalance(balance + 10)     $110
U: balance = a.getBalance()       $110
U: a.setBalance(balance + 20)     $130
U: commit transaction
T: abort transaction
Fig16.12: Overwriting uncommitted values
Transaction T:
a.setBalance(105)
Transaction U:
a.setBalance(110)
Interleaved execution (resulting value on the right):
   initial balance          $100
T: a.setBalance(105)        $105
U: a.setBalance(110)        $110
Fig16.13: Nested transactions
[Figure: T is the top-level transaction, with T1 = openSubTransaction and T2 = openSubTransaction; T1 opens T11 and T12; T2 opens T21, which opens T211. The subtransactions end with provisional commits (prov. commit) or, in one case, abort; T ends with commit.]
Fig16.14: Transactions T and U with exclusive locks
Transaction T:
balance = b.getBalance()
b.setBalance(bal*1.1)
a.withdraw(bal/10)
Transaction U:
balance = b.getBalance()
b.setBalance(bal*1.1)
c.withdraw(bal/10)
Transaction T                           Transaction U
Operations              Locks           Operations              Locks
openTransaction
bal = b.getBalance()    lock B
b.setBalance(bal*1.1)                   openTransaction
a.withdraw(bal/10)      lock A          bal = b.getBalance()    waits for T’s lock on B
closeTransaction        unlock A, B
                                                                lock B
                                        b.setBalance(bal*1.1)
                                        c.withdraw(bal/10)      lock C
                                        closeTransaction        unlock B, C
Fig16.15: Lock compatibility
For one object              Lock requested
                            read    write
Lock already set   none     OK      OK
                   read     OK      wait
                   write    wait    wait
Fig16.16: Use of locks in strict two-phase locking
1. When an operation accesses an object within a transaction:
(a) If the object is not already locked, it is locked and the operation proceeds.
(b) If the object has a conflicting lock set by another transaction, the
transaction must wait until it is unlocked.
(c) If the object has a non-conflicting lock set by another transaction, the lock
is shared and the operation proceeds.
(d) If the object has already been locked in the same transaction, the lock will
be promoted if necessary and the operation proceeds. (Where promotion is
prevented by a conflicting lock, rule (b) is used.)
2. When a transaction is committed or aborted, the server unlocks all objects it
locked for the transaction.
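Fig16.17: Lock class
import java.util.Vector;

public class Lock {
    private Object object;                              // the object being protected by the lock
    private Vector<TransID> holders = new Vector<>();   // the TIDs of current holders
    private LockType lockType;                          // the current type (null when unlocked)

    public synchronized void acquire(TransID trans, LockType aLockType) {
        while (conflicts(trans, aLockType)) {           // another transaction holds the lock in conflicting mode
            try {
                wait();
            } catch (InterruptedException e) { /*...*/ }
        }
        if (holders.isEmpty()) {                        // no TIDs hold lock
            holders.addElement(trans);
            lockType = aLockType;
        } else if (!holders.contains(trans)) {          // another transaction holds the lock, share it
            holders.addElement(trans);
        } else if (lockType == LockType.READ && aLockType == LockType.WRITE) {
            lockType = LockType.WRITE;                  // this transaction is a holder but needs a more exclusive lock
        }
    }

    public synchronized void release(TransID trans) {
        holders.removeElement(trans);                   // remove this holder
        if (holders.isEmpty()) lockType = null;         // set locktype to none
        notifyAll();
    }

    // Assumed conflict test (left as a comment in the figure): a different holder, and a write lock on either side.
    private boolean conflicts(TransID trans, LockType requested) {
        boolean otherHolder = holders.stream().anyMatch(h -> !h.equals(trans));
        return otherHolder && (lockType == LockType.WRITE || requested == LockType.WRITE);
    }
}
// Assumed helper types; the figure uses them without defining them.
class TransID { }
enum LockType { READ, WRITE }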
Fig16.18: LockManager class
public class LockManager {
    private Hashtable theLocks;

    public void setLock(Object object, TransID trans, LockType lockType) {
        Lock foundLock;
        synchronized (this) {
            // find the lock associated with object
            // if there isn’t one, create it and add to the hashtable
        }
        foundLock.acquire(trans, lockType);
    }

    // synchronize this one because we want to remove all entries
    public synchronized void unLock(TransID trans) {
        Enumeration e = theLocks.elements();
        while (e.hasMoreElements()) {
            Lock aLock = (Lock)(e.nextElement());
            if (/* trans is a holder of this lock */) aLock.release(trans);
        }
    }
}
Fig16.19: Deadlock with write locks
Transaction T                          Transaction U
Operations         Locks               Operations          Locks
a.deposit(100);    write lock A
                                       b.deposit(200)      write lock B
b.withdraw(100)    waits for U’s       a.withdraw(200);    waits for T’s
                   lock on B                               lock on A
Fig16.23: Resolution of the deadlock in Fig16.19
Transaction T                          Transaction U
Operations         Locks               Operations          Locks
a.deposit(100);    write lock A
                                       b.deposit(200)      write lock B
b.withdraw(100)    waits for U’s       a.withdraw(200);    waits for T’s
                   lock on B                               lock on A
                   (timeout elapses)
T’s lock on A becomes vulnerable,
unlock A, abort T
                                       a.withdraw(200);    write locks A
                                                           unlock A, B
Fig16.24: Lock compatibility (read, write and commit locks)
For one object              Lock to be set
                            read    write   commit
Lock already set   none     OK      OK      OK
                   read     OK      OK      wait
                   write    OK      wait
                   commit   wait    wait
Fig16.25: Lock hierarchy for the banking example
[Figure: Branch at the top level, with Accounts A, B and C beneath it.]
Fig16.26: Lock hierarchy for a diary
[Figure: Week at the top level; Monday, Tuesday, Wednesday, Thursday and Friday beneath it; time slots 9:00–10:00, 10:00–11:00, 11:00–12:00, 12:00–13:00, 13:00–14:00, 14:00–15:00 and 15:00–16:00 at the lowest level.]
Fig16.27: Lock compatibility table for hierarchic locks
For one object               Lock to be set
                             read    write   I-read   I-write
Lock already set   none      OK      OK      OK       OK
                   read      OK      wait    OK       wait
                   write     wait    wait    wait     wait
                   I-read    OK      wait    OK       OK
                   I-write   wait    wait    OK       OK
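The hierarchic lock compatibility table can be held as a small lookup matrix. The sketch below is one possible encoding; the names HierarchicLocks, Mode and compatible are hypothetical, and true stands for "OK" while false stands for "wait".

```java
/** Sketch: the hierarchic lock compatibility table as a lookup matrix (illustrative names). */
class HierarchicLocks {
    enum Mode { NONE, READ, WRITE, I_READ, I_WRITE }

    // OK[alreadySet][requested]; columns are READ, WRITE, I_READ, I_WRITE in that order.
    private static final boolean[][] OK = {
        /* none    */ { true,  true,  true,  true  },
        /* read    */ { true,  false, true,  false },
        /* write   */ { false, false, false, false },
        /* I-read  */ { true,  false, true,  true  },
        /* I-write */ { false, false, true,  true  },
    };

    /** True if the requested lock can be set at once, false if the requester must wait. */
    static boolean compatible(Mode alreadySet, Mode requested) {
        if (requested == Mode.NONE) {
            throw new IllegalArgumentException("a lock mode must be requested");
        }
        return OK[alreadySet.ordinal()][requested.ordinal() - 1];
    }
}
```

For instance, compatible(Mode.I_WRITE, Mode.I_READ) is true, whereas compatible(Mode.I_WRITE, Mode.READ) is false, as in the last row of the table.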
Serializability of transaction Tv with respect to transaction Ti
Tv      Ti      Rule
write   read    1. Ti must not read objects written by Tv
read    write   2. Tv must not read objects written by Ti
write   write   3. Ti must not write objects written by Tv and
                   Tv must not write objects written by Ti
Validation of Transactions
Backward validation of transaction Tv
boolean valid = true;
for (int Ti = startTn+1; Ti <= finishTn; Ti++){
    if (read set of Tv intersects write set of Ti) valid = false;
}
Forward validation of transaction Tv
boolean valid = true;
for (int Tid = active1; Tid <= activeN; Tid++){
    if (write set of Tv intersects read set of Tid) valid = false;
}
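The two validation loops can be made runnable by representing read and write sets as sets of object names. This is a sketch under that assumption; the class name Validation and the use of strings as object identifiers are hypothetical, and the caller is expected to supply the write sets of the overlapping committed transactions (startTn+1 to finishTn) or the read sets of the active transactions.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Sketch of backward and forward validation over explicit read and write sets. */
class Validation {
    /** Backward validation: Tv's read set must not intersect the write set of any overlapping Ti. */
    static boolean backward(Set<String> readSetTv, List<Set<String>> overlappingWriteSets) {
        for (Set<String> writeSetTi : overlappingWriteSets) {
            if (!Collections.disjoint(readSetTv, writeSetTi)) return false;
        }
        return true;
    }

    /** Forward validation: Tv's write set must not intersect the read set of any active transaction. */
    static boolean forward(Set<String> writeSetTv, List<Set<String>> activeReadSets) {
        for (Set<String> readSetTid : activeReadSets) {
            if (!Collections.disjoint(writeSetTv, readSetTid)) return false;
        }
        return true;
    }
}
```

A transaction that read object a fails backward validation if an overlapping committed transaction wrote a, and a transaction with an empty write set always passes forward validation.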
Fig16.28: Validation of transactions
[Figure: a timeline showing the Working, Validation and Update phases of each transaction; T1, T2 and T3 are earlier committed transactions, Tv is the transaction being validated, and active1 and active2 are later active transactions.]
Fig16.29: Operation conflicts for timestamp ordering
Rule   Tc      Ti
1.     write   read    Tc must not write an object that has been read by any Ti where Ti > Tc;
                       this requires that Tc ≥ the maximum read timestamp of the object.
2.     write   write   Tc must not write an object that has been written by any Ti where Ti > Tc;
                       this requires that Tc > write timestamp of the committed object.
3.     read    write   Tc must not read an object that has been written by any Ti where Ti > Tc;
                       this requires that Tc > write timestamp of the committed object.
Timestamp ordering write rule
if (Tc ≥ maximum read timestamp on D &&
    Tc > write timestamp on committed version of D)
    perform write operation on tentative version of D with write timestamp Tc
else /* write is too late */
    Abort transaction Tc
Fig16.30: Write operations & timestamps
[Figure: four cases of a T3 write, with T1<T2<T3<T4, showing the versions of the object before and after the write.
(a) T3 write. Before: T2. After: T2, T3.
(b) T3 write. Before: T1, T2. After: T1, T2, T3.
(c) T3 write. Before: T1, T4. After: T1, T3, T4.
(d) T3 write. Before: T4. After: T4; the transaction aborts.
Key: each version is the object produced by transaction Ti (with write timestamp Ti), drawn as either tentative or committed.]
Timestamp ordering read rule
if ( Tc > write timestamp on committed version of D) {
let Dselected be the version of D with the maximum write timestamp ≤ Tc
if (Dselected is committed)
perform read operation on the version Dselected
else
Wait until the transaction that made version Dselected commits or aborts
then reapply the read rule
} else Abort transaction Tc
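The read and write rules for timestamp ordering can be sketched together on a single object D. The model below is a simplification with hypothetical names: timestamps are long values, only the write timestamps of versions are kept (not their values), commit and abort processing is omitted, and a transaction that reads its own tentative version is assumed to proceed, which the rule as quoted does not spell out.

```java
import java.util.TreeSet;

/** Sketch of the timestamp ordering read and write rules for one object D (illustrative names). */
class TimestampedObject {
    enum Outcome { PROCEED, WAIT, ABORT }

    long committedWriteTs;                                    // write timestamp of the committed version
    long maxReadTs;                                           // maximum read timestamp on D
    final TreeSet<Long> tentativeWriteTs = new TreeSet<>();   // write timestamps of tentative versions

    TimestampedObject(long committedWriteTs) {
        this.committedWriteTs = committedWriteTs;
    }

    /** Write rule: perform the write on a tentative version unless it arrives too late. */
    Outcome write(long tc) {
        if (tc >= maxReadTs && tc > committedWriteTs) {
            tentativeWriteTs.add(tc);                         // tentative version with write timestamp Tc
            return Outcome.PROCEED;
        }
        return Outcome.ABORT;                                 // write is too late
    }

    /** Read rule: select the version with the maximum write timestamp <= Tc. */
    Outcome read(long tc) {
        if (tc <= committedWriteTs) return Outcome.ABORT;
        Long selected = tentativeWriteTs.floor(tc);           // latest tentative version not after Tc
        if (selected != null && selected != tc) {
            return Outcome.WAIT;                              // Dselected is tentative, made by another transaction
        }
        maxReadTs = Math.max(maxReadTs, tc);                  // the read proceeds
        return Outcome.PROCEED;
    }
}
```

With T1 < T2 < T3 < T4 encoded as 1 to 4, a T3 read on an object whose committed version was written by T1 and which has a tentative version from T2 returns WAIT, while a T3 read on an object committed by T4 returns ABORT.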
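[Figure: four cases of a T3 read, with T1 < T2 < T3 < T4.
(a) T3 read. Versions: T2. T2 is selected and the read proceeds.
(b) T3 read. Versions: T2, T4. T2 is selected and the read proceeds.
(c) T3 read. Versions: T1, T2. T2 is selected and the read waits.
(d) T3 read. Versions: T4. The transaction aborts.
Key: each version is the object produced by transaction Ti (with write timestamp Ti), drawn as either tentative or committed.]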
Fig16.32: Timestamps in transactions T & U
                                                 Timestamps and versions of objects
T                       U                        A             B             C
                                                 RTS   WTS     RTS   WTS     RTS   WTS
                                                 {}    S       {}    S       {}    S
openTransaction
bal = b.getBalance()                                           {T}
                        openTransaction
b.setBalance(bal*1.1)                                                S, T
                        bal = b.getBalance()
                        wait for T
a.withdraw(bal/10)                                     S, T
commit                                                 T             T
                        bal = b.getBalance()                   {U}
                        b.setBalance(bal*1.1)                        T, U
                        c.withdraw(bal/10)                                         S, U
Fig16.33: Late write operation would invalidate a read
[Figure: a time axis with the operations T3 read; T3 write; T5 read; T4 write; on an object whose versions carry the timestamps T1, T2, T3 and T5, with T1 < T2 < T3 < T4 < T5.
Key: each version is the object produced by transaction Ti (with write timestamp Ti and read timestamp Tk), drawn as either tentative or committed.]
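Distributed transactions
Fig17.1: Distributed transactions
[Figure: (a) Flat transaction: a client's transaction T accesses servers X, Y and Z. (b) Nested transactions: a client's transaction T opens subtransactions T1 and T2, which in turn open T11, T12, T21 and T22, spread over servers X, Y, M, N and P.]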
Fig17.2: Nested banking transaction
T = openTransaction
openSubTransaction a.withdraw(10);
openSubTransaction b.withdraw(20);
openSubTransaction c.deposit(10);
openSubTransaction d.deposit(20);
closeTransaction
[Figure: the client's transaction T runs subtransactions T1 to T4 for a.withdraw(10), b.withdraw(20), c.deposit(10) and d.deposit(20) on accounts A, B, C and D held at servers X, Y and Z.]
Fig17.3: A distributed banking transaction
T = openTransaction
a.withdraw(4);
c.deposit(4);
b.withdraw(3);
d.deposit(3);
closeTransaction
[Figure: the client issues openTransaction and closeTransaction; the servers BranchX, BranchY and BranchZ, which hold accounts A, B, C and D, are participants and each sends a join; the client's operations are forwarded with the transaction identifier, e.g. b.withdraw(T, 3).]
Note: the coordinator is in one of the servers, e.g. BranchX
Fig17.4: Operations for two-phase commit protocol
canCommit?(trans) -> Yes / No
Call from coordinator to participant to ask whether it can commit a transaction.
Participant replies with its vote.
doCommit(trans)
Call from coordinator to participant to tell participant to commit its part of a
transaction.
doAbort(trans)
Call from coordinator to participant to tell participant to abort its part of a
transaction.
haveCommitted(trans, participant)
Call from participant to coordinator to confirm that it has committed the
transaction.
getDecision(trans) -> Yes / No
Call from participant to coordinator to ask for the decision on a transaction after
it has voted Yes but has still had no reply after some delay. Used to recover from
server crash or delayed messages.
Fig17.5: The two-phase commit protocol
Phase 1 (voting phase):
1. The coordinator sends a canCommit? request to each of the participants in the
transaction.
2. When a participant receives a canCommit? request it replies with its vote (Yes or No) to
the coordinator. Before voting Yes, it prepares to commit by saving objects in permanent
storage. If the vote is No the participant aborts immediately.
Phase 2 (completion according to outcome of vote):
3. The coordinator collects the votes (including its own).
(a) If there are no failures and all the votes are Yes the coordinator decides to commit the
transaction and sends a doCommit request to each of the participants.
(b) Otherwise the coordinator decides to abort the transaction and sends doAbort requests to
all participants that voted Yes.
4. Participants that voted Yes are waiting for a doCommit or doAbort request from the
coordinator. When a participant receives one of these messages it acts accordingly and in
the case of commit, makes a haveCommitted call as confirmation to the coordinator.
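The coordinator's decision in step 3 can be sketched as a function of the collected votes. The names TwoPhaseCommit, Vote and Decision are hypothetical, and a missing vote (for example after a timeout) is represented here by null and treated as a failure; that representation is an assumption, not part of the protocol text.

```java
import java.util.List;

/** Sketch of the coordinator's decision in phase 2 of two-phase commit (illustrative names). */
class TwoPhaseCommit {
    enum Vote { YES, NO }
    enum Decision { COMMIT, ABORT }

    /** Commit only if there are no failures and every vote, including the coordinator's own, is Yes. */
    static Decision decide(List<Vote> votes) {
        for (Vote v : votes) {
            if (v != Vote.YES) return Decision.ABORT;   // a No vote or a missing (null) vote
        }
        return Decision.COMMIT;
    }
}
```

After deciding, the coordinator would send doCommit to every participant, or doAbort to those that voted Yes.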
Fig17.6: Communication in two-phase commit protocol
Step 1, Coordinator: status prepared to commit (waiting for votes); sends canCommit?
Step 2, Participant: status prepared to commit (uncertain); replies Yes
Step 3, Coordinator: status committed; sends doCommit
Step 4, Participant: status committed; sends haveCommitted
Coordinator: status done
Fig17.7:Operations in coordinator for nested transactions
openSubTransaction(trans) -> subTrans
Opens a new subtransaction whose parent is trans and returns a unique subtransaction
identifier.
getStatus(trans)-> committed, aborted, provisional
Asks the coordinator to report on the status of the transaction trans. Returns values
representing one of the following: committed, aborted, provisional.
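Fig17.8: Transaction T decides whether to commit
[Figure: top-level transaction T with subtransactions T1 and T2; T1 has children T11 and T12, and T2 has children T21 and T22.
T1: provisional commit (at X)
T11: abort (at M)
T12: provisional commit (at N)
T2: aborted (at Y)
T21: provisional commit (at N)
T22: provisional commit (at P)]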
Fig17.9: Information held by coordinators of nested transactions
Coordinator of   Child          Participant            Provisional    Abort list
transaction      transactions                          commit list
T                T1, T2         yes                    T1, T12        T11, T2
T1               T11, T12       yes                    T1, T12        T11
T2               T21, T22       no (aborted)                          T2
T11                             no (aborted)                          T11
T12, T21                        T12 but not T21*       T21, T12
T22                             no (parent aborted)    T22
* T21’s parent has aborted
Fig17.10: canCommit? for hierarchic two-phase commit protocol
canCommit?(trans, subTrans) -> Yes / No
Call a coordinator to ask coordinator of child subtransaction whether it can commit
a subtransaction subTrans. The first argument trans is the transaction identifier
of top-level transaction. Participant replies with its vote Yes / No.
Fig17.11: canCommit? for flat two-phase commit protocol
canCommit?(trans, abortList) -> Yes / No
Call from coordinator to participant to ask whether it can commit a transaction.
Participant replies with its vote Yes / No.
Fig17.12: Interleavings of transactions U, V and W
U: d.deposit(10)     lock D
V: b.deposit(10)     lock B at Y
U: a.deposit(20)     lock A at X
W: c.deposit(30)     lock C at Z
U: b.withdraw(30)    wait at Y
V: c.withdraw(20)    wait at Z
W: a.withdraw(20)    wait at X
Fig17.13: Distributed deadlock
[Figure: (a) transactions U, V and W and objects A, B, C and D at servers X, Y and Z, connected by Held by and Waits for edges; (b) the resulting cycle among W, V and U.]
Fig17.14: Local and global wait-for graphs
[Figure: the local wait-for graph at server X, the local wait-for graph at server Y, and the graph over transactions T, U and V built by the global deadlock detector.]
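Fig17.15: Probes transmitted to detect deadlock
[Figure: transactions U, V and W and objects A, B and C at servers X, Y and Z, connected by Held by and Waits for edges. Initiation: probe W → U; it is forwarded as W → U → V and then W → U → V → W, at which point deadlock is detected.]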
Fig17.16: Two probes initiated
[Figure: (a) initial situation: transactions T, U, V and W connected by Waits for edges; (b) detection initiated at object requested by T, with probes T → U, T → U → W and T → U → W → V; (c) detection initiated at object requested by W, with probes W → V, W → V → T and W → V → T → U.]
Fig17.17: Probes travel downhill
[Figure: (a) V stores probe when U starts waiting: U waits for B, and V keeps the probe U → V in its probe queue; (b) probe is forwarded when V starts waiting: V waits for C, and the probe is passed on to W, whose probe queue then holds it.]
Fig17.18: Types of entry in a recovery file
Type of entry        Description of contents of entry
Object               A value of an object.
Transaction status   Transaction identifier, transaction status (prepared, committed, aborted) and other status values used for the two-phase commit protocol.
Intentions list      Transaction identifier and a sequence of intentions, each of which consists of <objectID, Pi>, where Pi is the position in the recovery file of the value of the object.
Fig17.19: Log for banking service
P0 (checkpoint): Object: A 100; Object: B 200; Object: C 300
P1: Object: A 80
P2: Object: B 220
P3: Trans: T prepared; intentions <A, P1>, <B, P2>; pointer to P0
P4: Trans: T committed; pointer to P3
P5: Object: C 278
P6: Object: B 242
P7: Trans: U prepared; intentions <C, P5>, <B, P6>; pointer to P4
End of log
Fig17.20: Shadow versions
Map at start           Map when T commits
A → P0                 A → P1
B → P0'                B → P2
C → P0"                C → P0"
Version store:   P0    P0'   P0"   P1    P2    P3    P4
                 100   200   300   80    220   278   242
Checkpoint: P0, P0' and P0"
Fig17.21: Log with entries relating to two-phase commit protocol
Trans: T prepared, intentions list
Coord’r: T, part’pant list: . . .
Trans: T committed
Trans: U prepared, intentions list
Part’pant: U, Coord’r: . .
Trans: U uncertain
Trans: U committed
Fig17.22: Recovery of the two-phase commit protocol
Role, Status: Action of recovery manager
Coordinator, prepared: No decision had been reached before the server failed. It sends abortTransaction to all the servers in the participant list and adds the transaction status aborted in its recovery file. Same action for state aborted. If there is no participant list, the participants will eventually time out and abort the transaction.
Coordinator, committed: A decision to commit had been reached before the server failed. It sends a doCommit to all the participants in its participant list (in case it had not done so before) and resumes the two-phase protocol at step 4 (Fig17.5).
Participant, committed: The participant sends a haveCommitted message to the coordinator (in case this was not done before it failed). This will allow the coordinator to discard information about this transaction at the next checkpoint.
Participant, uncertain: The participant failed before it knew the outcome of the transaction. It cannot determine the status of the transaction until the coordinator informs it of the decision. It will send a getDecision to the coordinator to determine the status of the transaction. When it receives the reply it will commit or abort accordingly.
Participant, prepared: The participant has not yet voted and can abort the transaction.
Coordinator, done: No action is required.
Fig17.23: Nested transactions
[Figure: top-level transaction T with subtransactions T1 (containing T11 and T12) and T2, shown alongside a stack of entries A, A1, A11, A12 and A2 with the top of stack marked.]
Replication
Fig18.1: A basic architectural model for the management of replicated data
[Figure: clients (C) exchange requests and replies with front ends (FE), which communicate with the replica managers (RM) that make up the service.]
Fig18.2: View-synchronous group communication
[Figure: four scenarios, each with processes p, q and r, in which p crashes and the view changes from view (p, q, r) to view (q, r). Scenarios a and b are allowed; scenarios c and d are disallowed.]
Fig18.3: The passive (primary-backup) model for fault tolerance
[Figure: clients (C) with front ends (FE) communicate with the primary replica manager (RM), which is connected to two backup RMs.]
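Fig18.5: Query and update operations in a gossip service
[Figure: a client's query goes through its FE to an RM as (Query, prev) and returns (Val, new), giving Val to the client; a client's update goes through its FE as (Update, prev) and returns Update id; the RMs of the service exchange gossip.]
Fig18.6: Front ends propagate their timestamps whenever clients communicate directly
[Figure: clients' FEs hold vector timestamps and exchange them when the clients communicate; the RMs of the service exchange gossip.]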
Fig18.7: A gossip replica manager, showing its main state components
[Figure: the replica manager holds a value, value timestamp, update log, replica timestamp, executed operation table and timestamp table. Updates arrive from FEs as (OperationID, Update, Prev) and enter the update log; stable updates are applied to the value; gossip messages carrying a replica timestamp and replica log are exchanged with other replica managers.]
Fig18.8: Committed and tentative updates in Bayou
Committed: c0 c1 c2 ... cN
Tentative: t0 t1 t2 ... ti ti+1 ...
Tentative update ti becomes the next committed update
and is inserted after the last committed update cN.
Fig18.9: Transactions on replicated data
[Figure: transactions T and U each run at a client + front end; the operations shown are getBalance(A) and deposit(B,3); copies of A and B are held at several replica managers.]
Fig18.10: Available copies
[Figure: transactions T and U each run at a client + front end; the operations shown are getBalance(A), deposit(B,3), getBalance(B) and deposit(A,3); copies of A and B are held at replica managers X, Y, M, N and P.]
Fig18.11: Network partition
[Figure: transactions T and U each run at a client + front end; the operations shown are deposit(B,3) and withdraw(B, 4); a network partition separates the replica managers holding copies of B into two groups.]
Gifford’s quorum consensus examples
                                             Example 1   Example 2   Example 3
Latency          Replica 1                   75          75          75
(milliseconds)   Replica 2                   65          100         750
                 Replica 3                   65          750         750
Voting           Replica 1                   1           2           1
configuration    Replica 2                   0           1           1
                 Replica 3                   0           1           1
Quorum sizes     R                           1           2           1
                 W                           1           3           3
Derived performance of file suite:
Read             Latency                     65          75          75
                 Blocking probability        0.01        0.0002      0.000001
Write            Latency                     75          100         750
                 Blocking probability        0.01        0.0101      0.03
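The three voting configurations can be checked against the constraints usually stated for Gifford's scheme, which are not spelled out on the slide: the write quorum W must exceed half the total votes, and R + W must exceed the total votes. The sketch below assumes those two constraints; the names QuorumConfig and valid are hypothetical.

```java
/** Sketch: checking a voting configuration against Gifford's quorum constraints (illustrative names). */
class QuorumConfig {
    /** True if W > half the total votes and R + W > the total votes. */
    static boolean valid(int[] votes, int r, int w) {
        int total = 0;
        for (int v : votes) total += v;
        return 2 * w > total && r + w > total;
    }
}
```

All three examples in the table pass: votes (1, 0, 0) with R = 1, W = 1; votes (2, 1, 1) with R = 2, W = 3; and votes (1, 1, 1) with R = 1, W = 3. By contrast, votes (1, 1, 1) with R = 1, W = 2 fails because R + W does not exceed the three votes in total.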
Fig18.12: Two network partitions
[Figure: replica managers V, X, Y and Z divided by a network partition, with transaction T shown.]
Fig18.13: Virtual partition
[Figure: replica managers V, X, Y and Z, showing a virtual partition alongside the network partition.]
Creating a virtual partition
Phase 1:
The initiator sends a Join request to each potential member. The argument
of Join is a proposed logical timestamp for the new virtual partition.
When a replica manager receives a Join request, it compares the proposed
logical timestamp with that of its current virtual partition.
If the proposed logical timestamp is greater it agrees to join and replies Yes;
If it is less, it refuses to join and replies No.
Phase 2:
If the initiator has received sufficient Yes replies to have read and write
quora, it may complete the creation of the new virtual partition by sending
a Confirmation message to the sites that agreed to join. The creation
timestamp and list of actual members are sent as arguments.
Replica managers receiving the Confirmation message join the new virtual
partition and record its creation timestamp and list of actual members.
Fig18.14: Two overlapping virtual partitions
[Figure: replica managers Y, X, V and Z, with virtual partition V1 and virtual partition V2 overlapping.]