CS6601 DISTRIBUTED SYSTEMS

98
CS6601 DISTRIBUTED SYSTEMS UNIT IV Dr.A.Kathirvel, Professor, Computer Science and Engg. M N M Jain Engineering College, Chennai

Transcript of CS6601 DISTRIBUTED SYSTEMS

CS6601 DISTRIBUTED SYSTEMS

UNIT – IV

Dr.A.Kathirvel, Professor, Computer Science and Engg.

M N M Jain Engineering College, Chennai

Unit - IV

SYNCHRONIZATION & REPLICATION

Introduction – Clocks, events and process states – Synchronizing

physical clocks- Logical time and logical clocks – Global states –

Coordination and Agreement – Introduction – Distributed

mutual exclusion – Elections – Transactions and Concurrency

Control– Transactions -Nested transactions – Locks – Optimistic

concurrency control – Timestamp ordering – Atomic Commit

protocols -Distributed deadlocks – Replication – Case study – Coda.

Coulouris, Dollimore, Kindberg and Blair

Distributed Systems:Concepts and Design Edition 5, Addison-Wesley 2012

Fig14.1:Skew between computer clocks in a distributed system

Time and Global States

3

Fig14.2: Clock synchronization using a time server

m r

m t p Time server,S

Fig14.3:An example synchronization

subnet in an NTP implementation

1

2

3

2

3 3

Note: Arrows denote synchronization control,

numbers denote strata.

4

Fig14.4:Messages exchanged

between a pair of NTP peers

T i

T i-1 T i -2

T i - 3

Server B

Server A

Time

m m'

Time

Fig14.5:Events occurring at three

processes

5

Fig14.6:Lamport timestamps for the

events shown in Figure 14.5

6

Fig14.7:Vector timestamps for the

events shown in Fig 14.5

7

Fig14.8: Detecting global properties

8

Fig 14.9: Cuts

m 1 m 2

p 1

p 2 Physical

time

e 1

0

Consistent cut

Inconsistent cut

e 1

1 e 1

2 e 1

3

e 2 0

e 2 1

e 2 2

9

Fig14.10: Chandy and Lamport’s ‘snapshot’

algorithm

Marker receiving rule for process pi

On pi’s receipt of a marker message over channel c:

if (pi has not yet recorded its state) it

records its process state now;

records the state of c as the empty set;

turns on recording of messages arriving over other incoming channels;

else

pi records the state of c as the set of messages it has received over c

since it saved its state.

end if

Marker sending rule for process pi

After pi has recorded its state, for each outgoing channel c:

pi sends one marker message over c

(before it sends any other message over c). 10

Fig14.11:Two processes and their

initial states

11

Fig14.12: The execution of the

processes in Fig14.11

12

Fig14.13:Reachability between states

in the snapshot algorithm

S init S final

S snap

actual execution e 0 ,e 1 ,...

recording recording begins ends

pre-snap: e' 0 ,e ' 1 ,...e ' R-1 post-snap: e ' R ,e ' R+1 ,...

'

13

Fig14.14:Vector timestamps and variable

values for the execution of Fig14.9

m 1 m 2

p 1

p 2 Physical

time

Cut C 1

(1,0) (2,0) (4,3)

(2,1) (2,2) (2,3)

(3,0)

x 1 = 1 x 1 = 100 x 1 = 105

x 2 = 100 x 2 = 95 x 2 = 90

x 1 = 90

Cut C 2

14

Fig14.15:The lattice of global states for

the execution of Fig14.14

Sij = global state after i events at process 1 and j events at process 2

S 00

S 10

S 20

S 21 S 30

S 31

S 32

S 22

S 23

S 33

S 43

Level 0

1

2

3

4

5

6

7

15

Fig14.16:Algorithms to evaluate

possibly φ and definitely φ

16

Fig14.17:Evaluating definitely φ

17

Fig 15.1A network partition

Crashedrouter

Coordination and Agreement

18

Fig15.2:Server managing a mutual

exclusion token for a set of processes Server

1. Request token

Queue of requests

2. Release token

3. Grant token

4

2

p 4

p 3 p

2

p 1

19

Fig15.3:A ring of processes

transferring a mutual exclusion token

pn

p2

p3

p4

Token

p1

20

Fig15.4:Ricart and Agrawala’s algorithm On initialization state := RELEASED; To enter the section state := WANTED; Multicast request to all processes; request processing deferred here T := request’s timestamp; Wait until (number of replies received = (N – 1)); state := HELD; On receipt of a request <Ti, pi> at pj (i ≠ j) if (state = HELD or (state = WANTED and (T, pj) < (Ti, pi))) then queue request from pi without replying; else reply immediately to pi; end if To exit the critical section state := RELEASED; reply to any queued requests;

21

Fig15.5:Multicast synchronization

p 3

34

Reply

34

41

41 41

34

p 1

p 2

Reply Reply

22

Fig15.6: Maekawa’s algorithm – part 1

On initialization

state := RELEASED;

voted := FALSE;

For pi to enter the critical section

state := WANTED;

Multicast request to all processes in Vi;

Wait until (number of replies received = K);

state := HELD;

On receipt of a request from pi at pj

if (state = HELD or voted = TRUE)

then

queue request from pi without replying;

else

send reply to pi;

voted := TRUE;

end if

For pi to exit the critical section

state := RELEASED;

Multicast release to all processes in Vi;

On receipt of a release from pi at pj

if (queue of requests is non-empty)

then

remove head of queue – from pk, say;

send reply to pk;

voted := TRUE;

else

voted := FALSE;

end if

23

Fig15.7:A ring-based election in progress

24

15

9

4

3

28

17

24

1

Note: The election was

started by process 17.

The highest process

identifier encountered

so far is 24.

Participant processes

are shown in a darker

colour

24

Fig15.8:The bully algorithm

p1 p2

p3

p4

p1

p2

p3

p4

C

coordinator

Stage 4

C

election

election

Stage 2

p1

p2

p3

p4

C

election

answ er

answ er

electionStage 1

timeout

Stage 3

Eventually.....

p1

p2

p3

p4

election

answ er

The election of

coordinator p2,

after the failure

of p4 and then p3

25

Fig15.9:Reliable multicast algorithm

26

Fig15.10:The hold-back queue for

arriving multicast messages

Messageprocess ing

Delivery queueHold-back

queue

deliver

Incom ing

messages

When delivery guarantees aremet

27

Fig15.11:Total, FIFO and causal

ordering of multicast messages

F3

F1

F2

T2

T1

P1 P2 P3

Time

C3

C1

C2

Notice the consistent

ordering of totally ordered

messages T1 and T2,

the FIFO-related messages

F1 and F2 and the causally

related messages C1 and C3

– and the otherwise

arbitrary delivery ordering

of messages.

28

Fig15.12:Display from bulletin

board program

Bulletin board: os.interesting

Item From Subject

23 A.Hanlon Mach

24 G.Joseph Microkernels

25 A.Hanlon Re: Microkernels

26 T.L’Heureux RPC performance

27 M.Walker Re: Mach

end

29

Fig15.13:Total ordering using a

sequencer

30

Fig15.14:The ISIS algorithm for

total ordering

2

1

1

2

2

1 Message

P 2

P 3

P 1

P 4

3 Agreed Seq

3

3

31

Fig15.15:Causal ordering using vector

timestamps

32

Fig15.16:Consensus for three processes

1

P2

P3 (crashes)

P1

Consensus algorithm

v1=proceed

v3=abort

v2=proceed

d1:=proceed d2:=proceed

33

Fig15.17:Consensus in a

synchronous system

34

Fig15.18:Three Byzantine generals

p 1 (Commander)

p 2 p 3

1:v 1:v

2:1:v

3:1:u

p 1 (Commander)

p 2 p 3

1:x 1:w

2:1:w

3:1:x

Faulty processes are shown coloured

35

Fig15.19:Four Byzantine generals

p 1 (Commander)

p 2 p 3

1:v 1:v

2:1:v

3:1:u

Faulty processes are shown coloured

p 4

1:v

4:1:v

2:1:v 3:1:w

4:1:v

p 1 (Commander)

p 2 p 3

1:w 1:u

2:1:u

3:1:w

p 4

1:v

4:1:v

2:1:u 3:1:w

4:1:v

36

Fig16.1:Operations of the Account

interface

deposit(amount)

deposit amount in the account

withdraw(amount)

withdraw amount from the account

getBalance() -> amount

return the balance of the account

setBalance(amount)

set the balance of the account to amount

create(name) -> account

create a new account with a given name

lookUp(name) -> account

return a reference to the account with the given

name

branchTotal() -> amount

return the total of all the balances at the branch

Operations of the Branch interface

Transactions and Concurrency Control

37

Fig16.2:A client’s banking transaction

Transaction T:

a.withdraw(100);

b.deposit(100);

c.withdraw(200);

b.deposit(200);

Figure 16.3:Operations in Coordinator interface openTransaction() -> trans;

starts a new transaction and delivers a unique TID trans. This identifier will be

used in the other operations in the transaction.

closeTransaction(trans) -> (commit, abort);

ends a transaction: a commit return value indicates that the transaction has

committed; an abort return value indicates that it has aborted.

abortTransaction(trans);

aborts the transaction.

38

Fig16.4:Transaction life histories

Successful Aborted by client Aborted by server

openTransaction openTransaction openTransaction

operation operation operation

operation operation operation

server aborts

transaction

operation operation operation ERROR

reported to client

closeTransaction abortTransaction

39

Fig16.5:The lost update problem

Transaction T :

balance = b.getBalance();

b.setBalance(balance*1.1);

a.withdraw(balance/10)

Transaction U :

balance = b.getBalance();

b.setBalance(balance*1.1);

c.withdraw(balance/10)

balance = b.getBalance(); $200

balance = b.getBalance(); $200

b.setBalance(balance*1.1); $220

b.setBalance(balance*1.1); $220

a.withdraw(balance/10) $80

c.withdraw(balance/10) $280

40

Fig16.6:The inconsistent retrievals

problem

Transaction V :

a.withdraw(100)

b.deposit(100)

Transaction W :

aBranch.branchTotal()

a.withdraw(100); $100

total = a.getBalance() $100

total = total+b.getBalance() $300

total = total+c.getBalance()

b.deposit(100) $300

41

Fig16.7:A serially equivalent

interleaving of T and U

Transaction T :

balance = b.getBalance()

b.setBalance(balance*1.1)

a.withdraw(balance/10)

Transaction U :

balance = b.getBalance()

b.setBalance(balance*1.1)

c.withdraw(balance/10)

balance = b.getBalance() $200

b.setBalance(balance*1.1) $220

balance = b.getBalance() $220

b.setBalance(balance*1.1) $242

a.withdraw(balance/10) $80

c.withdraw(balance/10) $278

42

Fig16.8:A serially equivalent

interleaving of V and W

Transaction V :

a.withdraw(100);

b.deposit(100)

Transaction W :

aBranch.branchTotal()

a.withdraw(100); $100

b.deposit(100) $300

total = a.getBalance() $100

total = total+b.getBalance() $400

total = total+c.getBalance()

...

43

Fig16.9:Read and write operation

conflict rules

Operations of different transactions

Conflict Reason

read read No Because the effect of a pair of read operations

does not depend on the order in which they are

executed

read write Yes Because the effect of a read and a write operation

depends on the order of their execution

write write Yes Because the effect of a pair of write operations

depends on the order of their execution

44

Fig16.10:A non-serially equivalent interleaving of

operations of transactions T and U

Transaction T : Transaction U :

x = read(i)

write(i, 10) y = read(j)

write(j, 30)

write(j, 20) z = read (i)

45

Fig16.11:A dirty read when transaction

T aborts

Transaction T :

a.getBalance()

a.setBalance(balance + 10)

Transaction U :

a.getBalance()

a.setBalance(balance + 20)

balance = a.getBalance() $100

a.setBalance(balance + 10) $110

balance = a.getBalance() $110

a.setBalance(balance + 20) $130

commit transaction

abort transaction

46

Fig16.12:Overwriting uncommitted

values

Transaction T :

a.setBalance(105)

Transaction U :

a.setBalance(110)

$100

a.setBalance(105) $105

a.setBalance(110) $110

47

Fig16.13:Nested transactions

T : top-level transaction

T 1 = openSubTransaction T

2 = openSubTransaction

openSubTransaction openSubTransaction openSubTransaction

openSubTransaction

T 1 : T

2 :

T 11

: T 12

:

T 211

:

T 21 :

prov.commit

prov. commit

abort

prov. commit prov. commit

prov. commit

commit

48

Fig16.14:Transactions T and U with

exclusive locks Transaction T :

balance = b.getBalance()

b.setBalance(bal*1.1)

a.withdraw(bal/10)

Transaction U :

balance = b.getBalance() b.setBalance(bal*1.1)

c.withdraw(bal/10)

Operations Locks Operations Locks

openTransaction

bal = b.getBalance() lock B

b.setBalance(bal*1.1) openTransaction

a.withdraw(bal/10) lock A bal = b.getBalance() waits for T ’s

lock on B

closeTransaction unlock A , B

lock B

b.setBalance(bal*1.1)

c.withdraw(bal/10) lock C

closeTransaction unlock B

, C

49

Fig16.15:Lock compatibility

For one object Lock requested

read write

Lock already set none OK OK

read OK wait

write wait wait

50

Fig16.16:Use of locks in strict two-

phase locking

1. When an operation accesses an object within a transaction:

(a) If the object is not already locked, it is locked and the operation proceeds.

(b) If the object has a conflicting lock set by another transaction, the

transaction must wait until it is unlocked.

(c) If the object has a non-conflicting lock set by another transaction, the lock

is shared and the operation proceeds.

(d) If the object has already been locked in the same transaction, the lock will

be promoted if necessary and the operation proceeds. (Where promotion is

prevented by a conflicting lock, rule (b) is used.)

2. When a transaction is committed or aborted, the server unlocks all objects it

locked for the transaction.

51

Fig16.17:Lock class

public class Lock {

private Object object; // the object being protected by the lock

private Vector holders; // the TIDs of current holders

private LockType lockType; // the current type

public synchronized void acquire(TransID trans, LockType aLockType ){

while(/*another transaction holds the lock in conflicing mode*/) {

try {

wait();

}catch ( InterruptedException e){/*...*/ } }

if(holders.isEmpty()) { // no TIDs hold lock

holders.addElement(trans);

lockType = aLockType;

} else if(/*another transaction holds the lock, share it*/ ) ){

if(/* this transaction not a holder*/) holders.addElement(trans);

} else if (/* this transaction is a holder but needs a more exclusive lock*/)

lockType.promote(); } }

public synchronized void release(TransID trans ){

holders.removeElement(trans); // remove this holder

// set locktype to none

notifyAll(); } }

52

Fig16.18: LockManager class

public class LockManager { private Hashtable theLocks; public void setLock(Object object, TransID trans, LockType lockType){ Lock foundLock; synchronized(this){ // find the lock associated with object // if there isn’t one, create it and add to the hashtable } foundLock.acquire(trans, lockType); }

// synchronize this one because we want to remove all entries public synchronized void unLock(TransID trans) { Enumeration e = theLocks.elements(); while(e.hasMoreElements()){ Lock aLock = (Lock)(e.nextElement()); if(/* trans is a holder of this lock*/ ) aLock.release(trans); } }}

53

Fig6.19:Deadlock with write locks

Transaction T Transaction U

Operations Locks Operations Locks

a.deposit(100); write lock A

b.deposit(200) write lock B

b.withdraw(100)

waits for U ’s a.withdraw(200); waits for T ’s

lock on B lock on A

54

Fig16.20:The wait-for graph for

Fig16.19

B

A

Waits for

Held by

Held by

T U U T

Waits for

55

Fig16.21:A cycle in a wait-for graph

U

V

T

56

Fig16.22:Another wait-for graph

C

T

U V

Held by

Held by

Held by

T

U

V

W

W

B

Held by

Waits for

57

Fig16.23:Resolution of the deadlock in

Fig:15.19

Transaction T Transaction U

Operations Locks Operations Locks

a.deposit(100); write lock A

b.deposit(200) write lock B

b.withdraw(100)

waits for U ’s a.withdraw(200); waits for T’s

lock on B lock on A

(timeout elapses)

T’s lock on A becomes vulnerable,

unlock A , abort T

a.withdraw(200); write locks A

unlock A , B

58

Fig16.24:Lock compatibility (read, write and commit locks)

For one object Lock to be set

read write commit

Lock already set none OK OK OK

read OK OK wait

write OK wait

commit wait wait

Fig16.25:Lock hierarchy for the banking example

Branch

Account A B C

59

Fig16.26:Lock hierarchy for a diary Week

Monday Tuesday Wednesday Thursday Friday

9:00–10:00

time slots

10:00–11:00 11:00–12:00 12:00–13:00 13:00–14:00 14:00–15:00 15:00–16:00

Fig16.27:Lock compatibility table for hierarchic locks

For one object Lock to be set

read write I-read I-write

Lock already set none OK OK OK OK

read OK wait OK wait

write wait wait wait wait

I-read OK wait OK OK

I-write wait wait OK OK

60

Serializability of transaction T with respect to transaction Ti

Tv Ti Rule

write read 1. Ti must not read objects written by Tv

read write 2. Tv must not read objects written by Ti

write write 3. Ti must not write objects written by Tv and

Tv must not write objects written by Ti

Validation of Transactions Backward validation of transaction Tv

boolean valid = true;

for (int Ti = startTn+1; Ti <= finishTn; Ti++){

if (read set of Tv intersects write set of Ti) valid = false;

}

Forward validation of transaction Tv

boolean valid = true;

for (int Tid = active1; Tid <= activeN; Tid++){

if (write set of Tv intersects read set of Tid) valid = false;

}

61

Fig16.28:Validation of transactions

Earlier committed transactions

Working Validation Update

T 1

T v

Transaction

being validated

T 2

T 3

Later active

transactions

active 1

active 2

Fig16.29:Operation conflicts for timestamp ordering

Rule Tc Ti

1. write read Tc must not write an object that has been read by any Ti where this requires that Tc ≥ the maximum read timestamp of the object.

2. write write Tc must not write an object that has been written by any Ti where

Ti > Tc

this requires that Tc > write timestamp of the committed object.

3. read write Tc must not read an object that has been written by any Ti where this requires that Tc > write timestamp of the committed object.

Ti > Tc

Ti > Tc

62

write (b) T3 (a) write

(c) T3 write object produced

by transaction Ti

(with write timestamp Ti)

T3

write (d) T3

T1<T2<T3<T4

Time

Before

After

T 2

T 2 T 3

Time

Before

After

T 2

T 2 T 3

T 1

T 1

Time

Before

After

T 1

T 1

T 4

T 3 T 4

Time

Transaction

aborts Before

After

T 4

T 4

Tentative

Committed

T i

T i

Key:

Timestamp ordering write rule

if (Tc ≥ maximum read timestamp on D &&

Tc > write timestamp on committed version of D)

perform write operation on tentative version of D with write timestamp Tc

else /* write is too late */

Abort transaction Tc

Fig16.30:Writ

e operations &

timestamps

63

Timestamp ordering read rule

if ( Tc > write timestamp on committed version of D) {

let Dselected be the version of D with the maximum write timestamp ≤ Tc

if (Dselected is committed)

perform read operation on the version Dselected

else

Wait until the transaction that made version Dselected commits or aborts

then reapply the read rule

} else Abort transaction Tc

(b) T3 read

Time

read proceeds

Selected

T 2

Time

read proceeds

Selected

T 2 T 4

Time

read

waits

Selected

T 1 T 2

Time

Transaction aborts T 4

Key:

Tentative

Committed

T i

T i

object produced

by transaction Ti

(with write timestamp

Ti)

T1 < T2 < T3 < T4

(a) T3 read

(c) T3 read (d) T3 read

64

Fig16.32:Timestamps in transactions T &U

Timestamps and versions of objects

T U A B C

RTS WTS RTS WTS RTS WTS {} S {} S {} S

openTransaction bal = b.getBalance() {T}

openTransaction b.setBalance(bal*1.1)

bal = b.getBalance()

wait for T

a.withdraw(bal/10)

commit T T

bal = b.getBalance()

b.setBalance(bal*1.1)

c.withdraw(bal/10) S, U

T, U

S, T

S, T

{U}

65

Fig16.33:Late write operation

would invalidate a read

Time

T 4 write; T 5 read; T 3 write; T 3 read;

T 2

T 3 T 5

T 1 T 3

T 1 < T 2 < T 3 < T 4 < T 5

Key:

Tentative Committed

T i T i

T k T k

object produced by transaction

Ti (with write timestamp Ti and

read timestamp Tk)

66

Fig17.1:Distributed transactions

(a) Flat transaction (b) Nested transactions

Client

X

Y

Z

X

Y

M

N T 1

T 2

T 11

Client

P

T

T 12

T 21

T 22

T

T

Distributed transactions

67

Fig17.2:Nested banking transaction

a.withdraw(10)

c . deposit(10)

b.withdraw(20)

d.deposit(20)

Client A

B

C

T 1

T 2

T 3

T 4

T

D

X

Y

Z

T = openTransaction

openSubTransaction a.withdraw(10);

closeTransaction

openSubTransaction b.withdraw(20);

openSubTransaction c.deposit(10);

openSubTransaction d.deposit(20);

68

Fig17.3:A distributed banking transaction

. .

BranchZ

BranchX

participant

participant

C

D

Client

BranchY

B

A

participant join

join

join

T

a.withdraw(4);

c.deposit(4);

b.withdraw(3);

d.deposit(3);

openTransaction

b.withdraw(T, 3);

closeTransaction

T = openTransaction

a.withdraw(4);

c.deposit(4); b.withdraw(3); d.deposit(3);

closeTransaction

Note: the coordinator is in one of the servers, e.g. BranchX

69

Fig17.4:Operations for two-phase

commit protocol canCommit?(trans)-> Yes / No

Call from coordinator to participant to ask whether it can commit a transaction.

Participant replies with its vote.

doCommit(trans)

Call from coordinator to participant to tell participant to commit its part of a

transaction.

doAbort(trans)

Call from coordinator to participant to tell participant to abort its part of a

transaction.

haveCommitted(trans, participant)

Call from participant to coordinator to confirm that it has committed the

transaction.

getDecision(trans) -> Yes / No

Call from participant to coordinator to ask for the decision on a transaction after

it has voted Yes but has still had no reply after some delay. Used to recover from

server crash or delayed messages.

70

Fig17.5:The two-phase commit

protocol

Phase 1 (voting phase):

1. The coordinator sends a canCommit? request to each of the participants in the

transaction.

2. When a participant receives a canCommit? request it replies with its vote (Yes or No) to

the coordinator. Before voting Yes, it prepares to commit by saving objects in permanent

storage. If the vote is No the participant aborts immediately.

Phase 2 (completion according to outcome of vote):

3. The coordinator collects the votes (including its own).

(a) If there are no failures and all the votes are Yes the coordinator decides to commit the

transaction and sends a doCommit request to each of the participants.

(b) Otherwise the coordinator decides to abort the transaction and sends doAbort requests to

all participants that voted Yes.

4. Participants that voted Yes are waiting for a doCommit or doAbort request from the

coordinator. When a participant receives one of these messages it acts accordingly and in

the case of commit, makes a haveCommitted call as confirmation to the coordinator.

71

Fig17.6: Communication in two-phase commit protocol

canCommit?

Yes

doCommit

haveCommitted

Coordinator

1

3

(waiting for votes)

committed

done

prepared to commit

step

Participant

2

4

(uncertain)

prepared to commit

committed

status step status

72

Fig17.7:Operations in coordinator for nested transactions

openSubTransaction(trans) -> subTrans

Opens a new subtransaction whose parent is trans and returns a unique subtransaction

identifier.

getStatus(trans)-> committed, aborted, provisional

Asks the coordinator to report on the status of the transaction trans. Returns values

representing one of the following: committed, aborted, provisional.

1

2

T 11

T 12

T 22

T 21

abort (at M)

provisional commit (at N)

provisional commit (at X)

aborted (at Y)

provisional commit (at N)

provisional commit (at P)

T

T

T

Fig17.8: Transaction T decides whether to commit

73

Fig17.9:Information held by coordinators of nested transactions

Coordinator of

transaction

Child

transactions

Participant Provisional

commit list

Abort list

T T 1 , T 2 yes T 1 , T 12 T 11 , T 2

T 1 T 11 , T 12 yes T 1 , T 12 T 11

T 2 T 21 , T 22 no (aborted) T 2

T 11 no (aborted) T 11

T 12 , T 21 T 12 but not T 21 T 21 , T 12

T 22 no (parent aborted) T 22

*T 21’s parent has aborted

*

canCommit?(trans, subTrans) -> Yes / No

Call a coordinator to ask coordinator of

child subtransaction whether it can commit

a subtransaction subTrans. The first

argument trans is the transaction identifier

of top-level transaction. Participant replies

with its vote Yes / No.

Fig17.10:canCommit?for hierarchic

two-phase commit protocol

Fig17.11:canCommit? for flat

two-phase commit protocol

canCommit?(trans, abortList) -> Yes / No

Call from coordinator to participant to

ask whether it can commit a transaction.

Participant replies with its vote Yes / No.

74

Fig17.12: Interleavings of transactions

U, V and W

U V W

d.deposit(10) lock D

b.deposit(10) lock B

a.deposit(20) lock A

at Y

at X

c.deposit(30) lock C

b.withdraw(30) wait at Y

at Z

c.withdraw(20) wait at Z

a.withdraw(20) wait at X

75

Fig17.13:Distributed deadlock

D

Waits for

Waits

for

Held by

Held

by

B Waits for

Held

by

X

Y

Z

Held by

W

U V

A C

W

V

U

(a) (b)

76

Fig17.14:Local and global wait-for graphs

X

T U

Y

V T T

U V

local wait-for graph local wait-for graph global deadlock detector

77

Fig17.15:Probes transmitted to detect

deadlock

V

Held by

W

Waits for Held by

Waits for

Waits for

Deadlock

detected

U

C

A

B

Initiation

W U V W

W U

W U V

Z

Y

X

→ → →

→ →

78

Fig17.16:Two probes initiated

(a) initial situation (b) detection initiated at object

requested by T

(c) detection initiated at object

requested by W

U

T

V

W

Waits for

Waits for

V

W

U

T

T U W V T U W

T U Waits for

U

V

T

W

W V T W V T U

W V

Waits for

→ → → →

→ →

→ → →

79

Fig17.17:Probes travel downhill .

(b) Probe is forwarded when V starts waiting (a) V stores probe when U starts waiting

U

W

V

probe queue

U V

Waits for

B Waits for B

Waits

for C

V W

U V

V

U V

U V U

W probe queue

→ →

80

Fig17.8:Types of entry in a recovery file

Type of entry Description of contents of entry

Object A value of an object.

Transaction status

Transaction identifier, transaction status ( prepared , committed

aborted ) and other status values used for the two-phase

commit protocol.

Intentions list Transaction identifier and a sequence of intentions, each of

which consists of <objectID, Pi>, where Pi is the position in the

recovery file of the value of the object.

Fig17.19:Log for banking service P 0 P 1 P 2 P 3 P 4 P 5 P 6 P 7

Object: A Object: B Object: C Object: A Object: B Trans: T Trans: T Object: C Object: B Trans: U

100 200 300 80 220 prepared committed 278 242 prepared

< A , P 1 > < C , P 5 >

< B , P 2 > < B , P 6 >

P 0 P 3 P 4

Checkpoint End

of log

81

Fig17.20: Shadow versions

P 0 P 0 ' P 0 " P 1 P 2 P 3 P 4

Version store 100 200 300 80 220 278 242

Checkpoint

Map at start Map when T commits

A P 0 A P 1

B P 0 ' B P 2

C P 0 " C P 0 "

Fig17.21:Log with entries relating to two-phase commit protocol

Trans: T Coord’r: T Trans: T Trans: U Part’pant: U Trans: U Trans: U

prepared part’pant list: . . .

committed prepared Coord’r: . . uncertain committed

intentions list

intentions list

82

Fig17.22: Recovery of the two-phase

commit protocol Role Status Action of recovery manager

Coordinator prepared No decision had been reached before the server failed. It sends abortTransaction to all the servers in the participant list and adds the transaction status aborted in its recovery file. Same action for state aborted . If there is no participant list, the participants will eventually timeout and abort the transaction.

Coordinator committed A decision to commit had been reached before the server failed. It sends a doCommit to all the participants in its participant list (in case it had not done so before) and resumes the two-phase protocol at step 4 (Fig 13.5).

Participant committed The participant sends a haveCommitted message to the coordinator (in case this was not done before it failed). This will allow the coordinator to discard information about this transaction at the next checkpoint.

Participant uncertain The participant failed before it knew the outcome of the transaction. It cannot determine the status of the transaction until the coordinator informs it of the decision. It will send a getDecision to the coordinator to determine the status of the transaction. When it receives the reply it will commit or abort accordingly.

Participant prepared The participant has not yet voted and can abort the transaction.

Coordinator done No action is required.

83

Fig17.23:Nested transactions

A T

A 1

A 11

A 12

A 2

A 1

T 1

T 11

T 12

T 2

A 11

A 11

12

A 12

A 2

top of stack

T 1

T 2

T11

T 12

84

Fig18.1:A basic architectural model for the management of

replicated data

FE

Requests and

replies

C

Replica C

Service Clients Front ends

managers

RM

RM FE

RM

Replication

85

Fig18.2:View-synchronous group

communication

p

q

r

p crashes

view (q, r) view (p, q, r)

p

q

r

p crashes

view (q, r) view (p, q, r)

a (allowed). b (allowed).

p

q

r

view (p, q, r)

p

q

r

p crashes

view (q, r) view (p, q, r)

c (disallowed). d (disallowed).

p crashes

view (q, r)

86

Fig18.3:The passive (primary-backup)

model for fault tolerance

FE C

FE C

RM

Primary

Backup

Backup

RM

RM

87

Fig18.4:Active replication

FE C FE C RM

RM

RM

88

Fig18.6:Front ends propagate

their timestamps whenever

clients communicate directly

FE

Clients

FE

Service

Vector

timestamps

RM RM

RM

gossip

Fig18.5:Query and update

operations in a gossip service

Query Val

FE

RM RM

RM

Query, prev Val, new

Update

FE

Update, prev Update id

Clients

gossip

89

Fig18.7:A gossip replica manager,

showing its main state components Other replica managers

Replica timestamp

Update log

Value timestamp

Value

Executed operation table

Stable

updates

Updates

Gossip

messages

FE

Replica timestamp

Replica log

OperationID Update Prev

FE

Replica manager

Timestamp table

90

Fig18.8:Committed and tentative

updates in Bayou

c0 c1 c2 cN t0 t1 ti

Committed Tentative

t2

Tentative update ti becomes the next committed update

and is inserted after the last committed update cN.

ti+1

91

Fig18.9:Transactions on replicated data

B

A

Client + front end

B B B A A

getBalance(A)

Client + front end

Replica managers Replica managers

deposit(B,3);

U T

92

Fig18.10:Available copies

A

X

Client + front end

P

B

Client + front end

Replica managers

deposit(A,3);

U T

deposit(B,3);

getBalance(B)

getBalance(A)

Replica managers

Y

M

B

N

A

B

93

Fig18.11:Network partition

Client + front end

B

withdraw(B, 4)

Client + front end

Replica managers

deposit(B,3);

U T Network

partition

B

B B

94

Gifford’s quorum concensus examples

Example 1 Example 2 Example 3

Latency Replica 1 75 75 75

(milliseconds) Replica 2 65 100 750

Replica 3 65 750 750

Voting Replica 1 1 2 1

configuration Replica 2 0 1 1

Replica 3 0 1 1

Quorum R 1 2 1

sizes W 1 3 3

Derived performance of file suite:

Read Latency 65 75 75

Blocking probability 0.01 0.0002 0.000001

Write Latency 75 100 750

Blocking probability 0.01 0.0101 0.03

95

Fig18.12:Two network partitions

Replica managers

Network partition

V X Y Z

T Transaction

Fig18.13:Virtual partition

X V Y Z

Replica managers

Virtual partition Network partition

96

Creating a virtual partition Phase 1:

The initiator sends a Join request to each potential member. The argument

of Join is a proposed logical timestamp for the new virtual partition.

When a replica manager receives a Join request, it compares the proposed

logical timestamp with that of its current virtual partition.

If the proposed logical timestamp is greater it agrees to join and replies Yes;

If it is less, it refuses to join and replies No.

Phase 2:

If the initiator has received sufficient Yes replies to have read and write

quora, it may complete the creation of the new virtual partition by sending

a Confirmation message to the sites that

agreed to join. The creation

timestamp and list of actual members

are sent as arguments.

Replica managers receiving the

Confirmation message join the new virtual

partition and record its creation

timestamp and list of actual members.

Fig18.14:Two overlapping virtual partitions

Virtual partition V 1

Virtual partition V 2

Y X V Z

97

Questions?