Chapter 13

Copyright © George Coulouris, Jean Dollimore, Tim Kindberg 2001 email: [email protected]
This material is made available for private study and for direct use by individual teachers. It may not be included in any product or employed in any service without the written permission of the authors.

Viewing: These slides must be viewed in slide show mode.

Teaching material based on Distributed Systems: Concepts and Design, Edition 3, Addison-Wesley 2001.

Distributed Systems Course
Distributed transactions

13.1 Introduction
13.2 Flat and nested distributed transactions
13.3 Atomic commit protocols
13.4 Concurrency control in distributed transactions
13.5 Distributed deadlocks
13.6 Transaction recovery


Page 2: Chapter 13

2

Commitment of distributed transactions - introduction

A distributed transaction is a flat or nested transaction that accesses objects managed by multiple servers.

When a distributed transaction comes to an end, either all of the servers commit the transaction or all of them abort it.

One of the servers acts as coordinator; it must ensure the same outcome at all of the servers.

The ‘two-phase commit protocol’ is the most commonly used protocol for achieving this.

Page 3: Chapter 13

3

Distributed transactions

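[Figure 13.1: Distributed transactions. (a) Flat transaction: a client's transaction T invokes operations at servers X, Y and Z. (b) Nested transactions: a client's transaction T opens subtransactions T1 (at X) and T2 (at Y); T1 opens T11 (at M) and T12 (at N); T2 opens T21 (at N) and T22 (at P).]

Figure 13.1
A flat client transaction completes each of its requests before going on to the next one. Therefore, each transaction accesses servers’ objects sequentially.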

In a nested transaction, the top-level transaction can open subtransactions, and each subtransaction can open further subtransactions down to any depth of nesting

In the nested case, subtransactions at the same level can run concurrently, so T1 and T2 are concurrent, and as they invoke objects in different servers, they can run in parallel.

Page 4: Chapter 13

4

Nested banking transaction

The client transfers $10 from A to C and then transfers $20 from B to D.

[Figure 13.2: Nested banking transaction. The client's top-level transaction T opens subtransactions T1 to T4, one per request, on accounts A, B, C and D held at servers X, Y and Z.]

T = openTransaction
    openSubTransaction
        a.withdraw(10);
    openSubTransaction
        b.withdraw(20);
    openSubTransaction
        c.deposit(10);
    openSubTransaction
        d.deposit(20);
closeTransaction

Figure 13.2
Requests can be run in parallel. With several servers, the nested transaction is more efficient.

Page 5: Chapter 13

5

The coordinator of a flat distributed transaction

Servers execute requests in a distributed transaction
– when it commits they must communicate with one another to coordinate their actions
– a client starts a transaction by sending an openTransaction request to a coordinator in any server (next slide)
  • it returns a TID unique in the distributed system (e.g. server ID + local transaction number)
  • at the end, it will be responsible for committing or aborting it
– each server managing an object accessed by the transaction is a participant - it joins the transaction (next slide)
  • a participant keeps track of objects involved in the transaction
  • at the end it cooperates with the coordinator in carrying out the commit protocol
– note that a participant can call abortTransaction in coordinator

Why might a participant abort a transaction?

Page 6: Chapter 13

6

A flat distributed banking transaction

Note that the TID (T) is passed with each request e.g. withdraw(T,3)

[Figure: the client's transaction T uses accounts A, B, C and D; the participants at BranchX, BranchY and BranchZ each join T at the coordinator, and requests carry the TID, e.g. b.withdraw(T, 3).]

T = openTransaction
    a.withdraw(4);
    c.deposit(4);
    b.withdraw(3);
    d.deposit(3);
closeTransaction

Note: the coordinator is in one of the servers, e.g. BranchX

Figure 13.3

a client’s (flat) banking transaction involves accounts A, B, C and D at servers BranchX, BranchY and BranchZ

openTransaction goes to the coordinator

Each server is shown with a participant, which joins the transaction by invoking the join method in the coordinator

Page 7: Chapter 13

7

The join operation

The interface for Coordinator is shown in Figure 12.3
– it has openTransaction, closeTransaction and abortTransaction
– openTransaction returns a TID which is passed with each operation so that servers know which transaction is accessing their objects

The Coordinator interface provides an additional method, join, which is used whenever a new participant joins the transaction:
– join(Trans, reference to participant)
– informs a coordinator that a new participant has joined the transaction Trans.
– the coordinator records the new participant in its participant list.
– the fact that the coordinator knows all the participants and each participant knows the coordinator will enable them to collect the information that will be needed at commit time.
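The bookkeeping behind openTransaction and join can be sketched in a few lines. This is only an illustration: the class name, the method names and the tuple form of the TID are assumptions made for the sketch, not the interface of Figure 12.3.

```python
import itertools


class Coordinator:
    """Hypothetical coordinator that issues TIDs and records participants."""

    def __init__(self, server_id):
        self.server_id = server_id
        self._local_numbers = itertools.count(1)
        self.participants = {}  # TID -> list of participant references

    def open_transaction(self):
        # TID = server ID + local transaction number (assumed tuple form)
        tid = (self.server_id, next(self._local_numbers))
        self.participants[tid] = []
        return tid

    def join(self, tid, participant):
        # Record each participant once in the participant list
        if participant not in self.participants[tid]:
            self.participants[tid].append(participant)


coordinator = Coordinator("BranchX")
t = coordinator.open_transaction()
for branch in ("BranchX", "BranchY", "BranchZ", "BranchY"):
    coordinator.join(t, branch)
print(t, coordinator.participants[t])
```

Participants are plain strings here; in a real system they would be remote object references.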

Page 8: Chapter 13

8

Atomic commit protocols

transaction atomicity requires that at the end,
– either all of its operations are carried out or none of them.

in a distributed transaction, the client has requested the operations at more than one server

one-phase atomic commit protocol
– the coordinator tells the participants whether to commit or abort
– what is the problem with that?
– this does not allow one of the servers to decide to abort – it may have discovered a deadlock or it may have crashed and been restarted

two-phase atomic commit protocol
– is designed to allow any participant to choose to abort a transaction
– phase 1 - each participant votes. If it votes to commit, it is prepared. It cannot change its mind. In case it crashes, it must save updates in permanent store
– phase 2 - the participants carry out the joint decision
  • The decision could be commit or abort - participants record it in permanent store

Page 9: Chapter 13

9

Failure model for the commit protocols

Recall the failure model for transactions in Chapter 12
– this applies to the two-phase commit protocol

Commit protocols are designed to work in
– asynchronous system (e.g. messages may take a very long time)
– servers may crash
– messages may be lost.
– assume corrupt and duplicated messages are removed.
– no byzantine faults – servers either crash or they obey their requests

2PC is an example of a protocol for reaching a consensus.
– Chapter 11 says consensus cannot be reached in an asynchronous system if processes sometimes fail.
– however, 2PC does reach consensus under those conditions.
– because crash failures of processes are masked by replacing a crashed process with a new process whose state is set from information saved in permanent storage and information held by other processes.

Page 10: Chapter 13

10

The two-phase commit protocol

During the progress of a transaction, the only communication between coordinator and participant is the join request
– The client request to commit or abort goes to the coordinator
  • if client or participant request abort, the coordinator informs the participants immediately
  • if the client asks to commit, the 2PC comes into use

2PC
– voting phase: coordinator asks all participants if they can commit
  • if yes, participant records updates in permanent storage and then votes
– completion phase: coordinator tells all participants to commit or abort
– the next slide shows the operations used in carrying out the protocol

How many messages are sent between the coordinator and each participant?
Why does participant record updates in permanent storage at this stage?

Page 11: Chapter 13

11

Operations for two-phase commit protocol

participant interface - canCommit?, doCommit, doAbort

coordinator interface - haveCommitted, getDecision

canCommit?(trans) -> Yes / No
    Call from coordinator to participant to ask whether it can commit a transaction. Participant replies with its vote.

doCommit(trans)
    Call from coordinator to participant to tell participant to commit its part of a transaction.

doAbort(trans)
    Call from coordinator to participant to tell participant to abort its part of a transaction.

haveCommitted(trans, participant)
    Call from participant to coordinator to confirm that it has committed the transaction.

getDecision(trans) -> Yes / No
    Call from participant to coordinator to ask for the decision on a transaction after it has voted Yes but has still had no reply after some delay. Used to recover from server crash or delayed messages.

Figure 13.4

This is a request with a reply

These are asynchronous requests to avoid delays

Asynchronous request

Page 12: Chapter 13

12

The two-phase commit protocol

Figure 13.5

Phase 1 (voting phase):
1. The coordinator sends a canCommit? request to each of the participants in the transaction.
2. When a participant receives a canCommit? request it replies with its vote (Yes or No) to the coordinator. Before voting Yes, it prepares to commit by saving objects in permanent storage. If the vote is No the participant aborts immediately.

Phase 2 (completion according to outcome of vote):
3. The coordinator collects the votes (including its own).
   (a) If there are no failures and all the votes are Yes the coordinator decides to commit the transaction and sends a doCommit request to each of the participants.
   (b) Otherwise the coordinator decides to abort the transaction and sends doAbort requests to all participants that voted Yes.
4. Participants that voted Yes are waiting for a doCommit or doAbort request from the coordinator. When a participant receives one of these messages it acts accordingly and in the case of commit, makes a haveCommitted call as confirmation to the coordinator.
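The four steps can be traced in a small in-process simulation. This is a minimal sketch under simplifying assumptions: the class and function names are invented for illustration, calls are local rather than remote, the coordinator's own vote is left out, and failures, time-outs and the haveCommitted confirmation are not modelled.

```python
class Participant:
    """Hypothetical participant; a dict stands in for permanent storage."""

    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "active"
        self.permanent_store = {}

    def can_commit(self, trans):
        if self.will_vote_yes:
            # Step 2: save to permanent storage before voting Yes
            self.permanent_store[trans] = "prepared"
            self.state = "prepared"
            return True
        self.state = "aborted"  # a No vote aborts immediately
        return False

    def do_commit(self, trans):
        self.permanent_store[trans] = "committed"
        self.state = "committed"

    def do_abort(self, trans):
        self.permanent_store[trans] = "aborted"
        self.state = "aborted"


def two_phase_commit(trans, participants):
    # Phase 1: steps 1 and 2, ask every participant and collect the votes
    votes = [(p, p.can_commit(trans)) for p in participants]
    # Phase 2: step 3(a), commit only if every vote is Yes
    if all(vote for _, vote in votes):
        for p, _ in votes:
            p.do_commit(trans)
        return "committed"
    # Step 3(b): doAbort goes only to participants that voted Yes
    for p, vote in votes:
        if vote:
            p.do_abort(trans)
    return "aborted"


all_yes = [Participant("X"), Participant("Y")]
one_no = [Participant("X"), Participant("Y", will_vote_yes=False)]
print(two_phase_commit("T", all_yes))
print(two_phase_commit("T", one_no))
```

In the second run X has already prepared when Y votes No, so X is told to abort and ends in the aborted state like Y.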

Page 13: Chapter 13

13

Communication in two-phase commit protocol

Time-out actions in the 2PC to avoid blocking forever when a process crashes or a message is lost

– uncertain participant (step 2) has voted yes. It can’t decide on its own; it uses the getDecision method to ask the coordinator about the outcome

– participant has carried out client requests, but has not had a canCommit? from the coordinator. It can abort unilaterally

– coordinator delayed in waiting for votes (step 1). It can abort and send doAbort to participants.

Figure 13.6

Coordinator                                      Participant
step  status                                     step  status
1     prepared to commit (waiting for votes)
          -- canCommit? -->
                                                 2     prepared to commit (uncertain)
          <-- Yes --
3     committed
          -- doCommit -->
                                                 4     committed
          <-- haveCommitted --
      done

Think about step 2 - what is the problem for the participant?
Think about participant before step 2 - what is the problem?
Think about the coordinator in step 1 - what is the problem?

Page 14: Chapter 13

14

Performance of the two-phase commit protocol

if there are no failures, the 2PC involving N participants requires
– N canCommit? messages and replies, followed by N doCommit messages.
  • the cost in messages is proportional to 3N, and the cost in time is three rounds of messages.
  • The haveCommitted messages are not counted

– there may be arbitrarily many server and communication failures
– 2PC is guaranteed to complete eventually, but it is not possible to specify a time limit within which it will be completed
  • delays to participants in uncertain state
  • some three-phase commit protocols (3PC) are designed to alleviate such delays
    • they require more messages and more rounds for the normal case
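The failure-free cost can be written out as simple arithmetic. The function below is a hypothetical helper that follows the counting used on this slide, leaving out the haveCommitted messages.

```python
def two_pc_cost(n_participants):
    """Messages and rounds for a failure-free 2PC run with N participants."""
    can_commit = n_participants  # coordinator -> each participant
    votes = n_participants       # each participant -> coordinator
    do_commit = n_participants   # coordinator -> each participant
    return {"messages": can_commit + votes + do_commit, "rounds": 3}


print(two_pc_cost(4))  # {'messages': 12, 'rounds': 3}
```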

Page 15: Chapter 13

15

13.3.2 Two-phase commit protocol for nested transactions

Recall Fig 13.1b, top-level transaction T and subtransactions T1, T2, T11, T12, T21, T22

A subtransaction starts after its parent and finishes before it
When a subtransaction completes, it makes an independent decision either to commit provisionally or to abort.
– A provisional commit is not the same as being prepared: it is a local decision and is not backed up on permanent storage.
– If the server crashes subsequently, its replacement will not be able to carry out a provisional commit.

A two-phase commit protocol is needed for nested transactions
– it allows servers of provisionally committed transactions that have crashed to abort them when they recover.

Page 16: Chapter 13

16

Figure 13.7 Operations in coordinator for nested transactions

openSubTransaction(trans) -> subTrans
    Opens a new subtransaction whose parent is trans and returns a unique subtransaction identifier.

getStatus(trans) -> committed, aborted, provisional
    Asks the coordinator to report on the status of the transaction trans. Returns values representing one of the following: committed, aborted, provisional.

This is the interface of the coordinator of a subtransaction.
– It allows it to open further subtransactions
– It allows its subtransactions to enquire about its status

Client starts by using openTransaction to open a top-level transaction.
– This returns a TID for the top-level transaction
– The TID can be used to open a subtransaction
  • The subtransaction automatically joins the parent and a TID is returned.

The TID of a subtransaction is an extension of its parent's TID, so that a subtransaction can work out the TID of the top-level transaction.
The client finishes a set of nested transactions by calling closeTransaction or abortTransaction in the top-level transaction.

Page 17: Chapter 13

17

Transaction T decides whether to commit

Figure 13.8

[Figure: T1 provisional commit (at X); T11 abort (at M); T12 provisional commit (at N); T2 aborted (at Y); T21 provisional commit (at N); T22 provisional commit (at P).]

Recall that
1. A parent can commit even if a subtransaction aborts
2. If a parent aborts, then its subtransactions must abort
– In the figure, each subtransaction has either provisionally committed or aborted

T12 has provisionally committed and T11 has aborted, but the fate of T12 depends on its parent T1 and eventually on the top-level transaction, T. Although T21 and T22 have both provisionally committed, T2 has aborted and this means that T21 and T22 must also abort.

Suppose that T decides to commit although T2 has aborted, also that T1 decides to commit although T11 has aborted

Page 18: Chapter 13

18

Information held by coordinators of nested transactions

Coordinator of   Child          Participant            Provisional    Abort list
transaction      transactions                          commit list
T                T1, T2         yes                    T1, T12        T11, T2
T1               T11, T12       yes                    T1, T12        T11
T2               T21, T22       no (aborted)                          T2
T11                             no (aborted)                          T11
T12, T21                        T12 but not T21        T21, T12
T22                             no (parent aborted)    T22

When a top-level transaction commits it carries out a 2PC
Each coordinator has a list of its subtransactions
At provisional commit, a subtransaction reports its status and the status of its descendants to its parent
If a subtransaction aborts, it tells its parent

Figure 13.9

T12 and T21 share a coordinator as they both run at server N
When T2 is aborted it tells T (no information about descendants)
A subtransaction (e.g. T21 and T22) is called an orphan if one of its ancestors aborts
An orphan uses getStatus to ask its parent about the outcome. It should abort if its parent has.

Page 19: Chapter 13

19

canCommit? for hierarchic two-phase commit protocol

canCommit?(trans, subTrans) -> Yes / No
    Call from a coordinator to the coordinator of a child subtransaction to ask whether it can commit the subtransaction subTrans. The first argument trans is the transaction identifier of the top-level transaction. Participant replies with its vote Yes / No.

Top-level transaction is coordinator of 2PC. Participant list:
– the coordinators of all the subtransactions that have provisionally committed
– but do not have an aborted ancestor
– E.g. T, T1 and T12 in Figure 13.8
– if they vote yes, they prepare to commit by saving state in permanent store
  • The state is marked as belonging to the top-level transaction

The 2PC may be performed in a hierarchic or a flat manner

Figure 13.10

Hierarchic 2PC - T asks canCommit? to T1 and T1 asks canCommit? to T12
The subTrans argument is used to find the subtransaction to vote on. If absent, vote no.
The trans argument is used when saving the objects in permanent storage

Page 20: Chapter 13

20

canCommit? for flat two-phase commit protocol

canCommit?(trans, abortList) -> Yes / No
    Call from coordinator to participant to ask whether it can commit a transaction. Participant replies with its vote Yes / No.

Flat 2PC
– the coordinator of the top-level transaction sends canCommit? messages to the coordinators of all of the subtransactions in the provisional commit list.
– in our example, T sends to the coordinators of T1 and T12.
– the trans argument is the TID of the top-level transaction
– the abortList argument gives all aborted subtransactions
  • e.g. server N has T12 prov committed and T21 aborted
– On receiving canCommit?, participant looks in list of transactions for any that match trans (e.g. T12 and T21 at N)
  • it prepares any that have provisionally committed and are not in abortList and votes yes
  • if it can't find any it votes no

Figure 13.11

Compare the advantages and disadvantages of the flat and nested approaches
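The participant's side of the flat canCommit? can be sketched as follows. The representation is an assumption made for illustration: TIDs are tuples in which a subtransaction's TID extends its parent's, so T is ("T",), T1 is ("T", 1) and T12 is ("T", 1, 2). The sketch also treats a subtransaction as aborted when any of its ancestors is in abortList, which is how T21 at server N is excluded in the example.

```python
def has_aborted_ancestor(sub_tid, abort_list):
    # Every prefix longer than the top-level TID names the subtransaction
    # itself or one of its ancestors.
    return any(sub_tid[:n] in abort_list for n in range(2, len(sub_tid) + 1))


def can_commit_flat(trans, abort_list, local_status):
    """Return the vote and the local subtransactions that were prepared.

    local_status maps each subtransaction TID known at this server to
    "provisional" or "aborted" (a hypothetical local table).
    """
    prepared = [
        tid
        for tid, status in local_status.items()
        if tid[: len(trans)] == trans          # belongs to top-level trans
        and status == "provisional"
        and not has_aborted_ancestor(tid, abort_list)
    ]
    return ("Yes" if prepared else "No"), prepared


T = ("T",)
abort_list = {("T", 1, 1), ("T", 2)}           # T11 and T2
server_n = {("T", 1, 2): "provisional", ("T", 2, 1): "provisional"}
server_p = {("T", 2, 2): "provisional"}        # T22, whose parent aborted

print(can_commit_flat(T, abort_list, server_n))
print(can_commit_flat(T, abort_list, server_p))
```

Server N prepares T12 only and votes Yes; server P finds nothing to prepare and votes No.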

Page 21: Chapter 13

21

Time-out actions in nested 2PC

With nested transactions delays can occur in the same three places as before
– when a participant is prepared to commit
– when a participant has finished but has not yet received canCommit?
– when a coordinator is waiting for votes

Fourth place:
– provisionally committed subtransactions of aborted subtransactions, e.g. T22 whose parent T2 has aborted
– use getStatus on parent, whose coordinator should remain active for a while
– If parent does not reply, then abort

Page 22: Chapter 13

22

Summary of 2PC

a distributed transaction involves several different servers.
– A nested transaction structure allows additional concurrency and independent committing by the servers in a distributed transaction.

atomicity requires that the servers participating in a distributed transaction either all commit it or all abort it.

atomic commit protocols are designed to achieve this effect, even if servers crash during their execution.

the 2PC protocol allows a server to abort unilaterally.
– it includes timeout actions to deal with delays due to servers crashing.
– 2PC protocol can take an unbounded amount of time to complete but is guaranteed to complete eventually.

Page 23: Chapter 13

23

13.4 Concurrency control in distributed transactions

Each server manages a set of objects and is responsible for ensuring that they remain consistent when accessed by concurrent transactions
– therefore, each server is responsible for applying concurrency control to its own objects.
– the members of a collection of servers of distributed transactions are jointly responsible for ensuring that they are performed in a serially equivalent manner
– therefore if transaction T is before transaction U in their conflicting access to objects at one of the servers then they must be in that order at all of the servers whose objects are accessed in a conflicting manner by both T and U

Page 24: Chapter 13

24

13.4.1 Locking

In a distributed transaction, the locks on an object are held by the server that manages it.
– The local lock manager decides whether to grant a lock or make the requesting transaction wait.
– it cannot release any locks until it knows that the transaction has been committed or aborted at all the servers involved in the transaction.
– the objects remain locked and are unavailable for other transactions during the atomic commit protocol
  • an aborted transaction releases its locks after phase 1 of the protocol.

Page 25: Chapter 13

25

Interleaving of transactions T and U at servers X and Y

T                                   U
Write(A) at X    locks A
                                    Write(B) at Y    locks B
Read(B) at Y     waits for U
                                    Read(A) at X     waits for T

in the example on page 529, we have
– T before U at server X and U before T at server Y

different orderings lead to cyclic dependencies and distributed deadlock
– detection and resolution of distributed deadlock in next section

Page 26: Chapter 13

26

13.4.2 Timestamp ordering concurrency control

Single server transactions
– coordinator issues a unique timestamp to each transaction before it starts
– serial equivalence ensured by committing objects in order of timestamps

Distributed transactions
– the first coordinator accessed by a transaction issues a globally unique timestamp
– as before the timestamp is passed with each object access
– the servers are jointly responsible for ensuring serial equivalence
  • that is if T accesses an object before U, then T is before U at all objects
– coordinators agree on timestamp ordering
  • a timestamp consists of a pair <local timestamp, server-id>.
  • the agreed ordering of pairs of timestamps is based on a comparison in which the server-id part is less significant – they should relate to time
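The agreed ordering on <local timestamp, server-id> pairs can be illustrated with Python tuples, which compare element by element from the left, so the server-id only breaks ties between equal local timestamps. The timestamp values and server names below are made up for the example.

```python
def make_timestamp(local_timestamp, server_id):
    # The local timestamp is the more significant part of the pair
    return (local_timestamp, server_id)


timestamps = [
    make_timestamp(5, "Y"),
    make_timestamp(5, "X"),
    make_timestamp(3, "Z"),
]

# Every server that sorts with the same rule obtains the same order
print(sorted(timestamps))  # [(3, 'Z'), (5, 'X'), (5, 'Y')]
```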

Page 27: Chapter 13

27

Timestamp ordering concurrency control (continued)

The same ordering can be achieved at all servers even if their clocks are not synchronized
– for efficiency it is better if local clocks are roughly synchronized
– then the ordering of transactions corresponds roughly to the real time order in which they were started

Timestamp ordering
– conflicts are resolved as each operation is performed
– if this leads to an abort, the coordinator will be informed
  • it will abort the transaction at the participants
– any transaction that reaches the client request to commit should always be able to do so
  • participant will normally vote yes
  • unless it has crashed and recovered during the transaction

Can the same ordering be achieved at all servers without clock synchronization?

Why is it better to have roughly synchronized clocks?

Page 28: Chapter 13

28

Optimistic concurrency control

each transaction is validated before it is allowed to commit
– transaction numbers assigned at start of validation
– transactions serialized according to transaction numbers
– validation takes place in phase 1 of 2PC protocol

consider the following interleavings of T and U
– T before U at X and U before T at Y

T                    U
Read(A) at X         Read(B) at Y
Write(A)             Write(B)
Read(B) at Y         Read(A) at X
Write(B)             Write(A)

Use backward validation
– rules: 1. write/read, 2. read/write, 3. write/write
– 1. satisfied, 2. checked, 3. parallel

Suppose T & U start validation at about the same time
– X does T first
– Y does U first
– No parallel validation - commitment deadlock

Page 29: Chapter 13

29

Commitment deadlock in optimistic concurrency control

servers of distributed transactions do parallel validation
– therefore rule 3 must be validated as well as rule 2
  • the write set of Tv is checked for overlaps with write sets of earlier transactions
– this prevents commitment deadlock
– it also avoids delaying the 2PC protocol

another problem - independent servers may schedule transactions in different orders
– e.g. T before U at X and U before T at Y
– this must be prevented - some hints as to how on page 531

Page 30: Chapter 13

30

13.5 Distributed deadlocks

Single server transactions can experience deadlocks
– prevent or detect and resolve
– use of timeouts is clumsy, detection is preferable.
  • it uses wait-for graphs.

Distributed transactions lead to distributed deadlocks
– in theory can construct global wait-for graph from local ones
– a cycle in a global wait-for graph that is not in local ones is a distributed deadlock

Page 31: Chapter 13

31

Figure 13.12 Interleavings of transactions U, V and W

U                               V                               W
d.deposit(10)   lock D
                                b.deposit(10)   lock B at Y
a.deposit(20)   lock A at X
                                                                c.deposit(30)   lock C at Z
b.withdraw(30)  wait at Y
                                c.withdraw(20)  wait at Z
                                                                a.withdraw(20)  wait at X

objects A, B managed by X and Y; C and D by Z
– next slide has global wait-for graph

U → V at Y
V → W at Z
W → U at X

Page 32: Chapter 13

32

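Figure 13.13 Distributed deadlock

[Figure: (a) objects A (at X), B (at Y), C and D (at Z) with held-by and waits-for edges: U waits for B, held by V; V waits for C, held by W; W waits for A, held by U; D is held by U. (b) the resulting wait-for cycle among U, V and W.]

a deadlock cycle has alternate edges showing wait-for and held-by

wait-for added in order: U → V at Y; V → W at Z and W → U at X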

Page 33: Chapter 13

33

Deadlock detection - local wait-for graphs

Local wait-for graphs can be built, e.g.
– server Y: U → V added when U requests b.withdraw(30)
– server Z: V → W added when V requests c.withdraw(20)
– server X: W → U added when W requests a.withdraw(20)

to find a global cycle, communication between the servers is needed

centralized deadlock detection
– one server takes on role of global deadlock detector
– the other servers send it their local graphs from time to time
– it detects deadlocks, makes decisions about which transactions to abort and informs the other servers
– usual problems of a centralized service - poor availability, lack of fault tolerance and no ability to scale
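What a global deadlock detector does with the local graphs it receives can be sketched as a merge followed by a depth-first search for a cycle. The function names and the edge-list representation are assumptions for illustration; the local graphs are the three edges from the example above.

```python
def merge_graphs(local_graphs):
    """Combine local wait-for edges (waiter, holder) into one global graph."""
    merged = {}
    for edges in local_graphs.values():
        for waiter, holder in edges:
            merged.setdefault(waiter, set()).add(holder)
    return merged


def find_cycle(graph):
    """Return one cycle as a list of transactions, or None if there is none."""
    path, finished = [], set()

    def visit(node):
        if node in path:
            return path[path.index(node):] + [node]
        if node in finished:
            return None
        path.append(node)
        for holder in sorted(graph.get(node, ())):
            cycle = visit(holder)
            if cycle:
                return cycle
        path.pop()
        finished.add(node)
        return None

    for start in sorted(graph):
        cycle = visit(start)
        if cycle:
            return cycle
    return None


local_graphs = {
    "X": [("W", "U")],
    "Y": [("U", "V")],
    "Z": [("V", "W")],
}
print(find_cycle(merge_graphs(local_graphs)))  # ['U', 'V', 'W', 'U']
```

No single local graph contains a cycle; the cycle only appears once the three graphs are merged.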

Page 34: Chapter 13

34

Figure 13.14 Local and global wait-for graphs

[Figure: local wait-for graph at server X: T → U; local wait-for graph at server Y: V → T; global deadlock detector: T → U → V → T.]

Phantom deadlocks
– a ‘deadlock’ that is detected, but is not really one
– happens when there appears to be a cycle, but one of the transactions has released a lock, due to time lags in distributing graphs
– in the figure suppose U releases the object at X then waits for V at Y and the global detector gets Y’s graph before X’s (T → U → V → T)

Page 35: Chapter 13

35

Edge chasing - a distributed approach to deadlock detection

a global graph is not constructed, but each server knows about some of the edges
– servers try to find cycles by sending probes which follow the edges of the graph through the distributed system
– when should a server send a probe? (go back to Fig 13.13)
– edges were added in order U → V at Y; V → W at Z and W → U at X
  • when W → U at X was added, U was waiting, but when V → W at Z, W was not waiting
– send a probe when an edge T1 → T2 is added and T2 is waiting
– each coordinator records whether its transactions are active or waiting
  • the local lock manager tells coordinators if transactions start/stop waiting
  • when a transaction is aborted to break a deadlock, the coordinator tells the participants, locks are removed and edges taken from wait-for graphs

Page 36: Chapter 13

36

Edge-chasing algorithms

Three steps
– Initiation:
  • When a server notes that T starts waiting for U, where U is waiting at another server, it initiates detection by sending a probe containing the edge < T → U > to the server where U is blocked.
  • If U is sharing a lock, probes are sent to all the holders of the lock.
– Detection:
  • Detection consists of receiving probes and deciding whether deadlock has occurred and whether to forward the probes.
    • e.g. when server receives probe < T → U > it checks if U is waiting, e.g. U → V, if so it forwards < T → U → V > to server where V waits
    • when a server adds a new edge, it checks whether a cycle is there
– Resolution:
  • When a cycle is detected, a transaction in the cycle is aborted to break the deadlock.
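The initiation and detection steps can be traced with a small simulation. It is a sketch under stated assumptions: the function name and both lookup tables are invented for illustration, probes travel directly from server to server (coordinators are not modelled), each transaction waits for a single holder, and resolution is left out.

```python
def chase_edges(local_edges, blocked_at, first_edge):
    """Follow a probe through the servers.

    local_edges: server -> {waiting transaction: transaction it waits for}
    blocked_at:  waiting transaction -> server where it is blocked
    first_edge:  the edge (T, U) whose addition starts detection
    Returns (cycle or None, list of (destination server, probe) hops).
    """
    probe = list(first_edge)
    hops = []
    while True:
        last = probe[-1]
        server = blocked_at.get(last)
        if server is None:
            return None, hops            # last transaction is not waiting
        hops.append((server, list(probe)))
        holder = local_edges.get(server, {}).get(last)
        if holder is None:
            return None, hops            # edge has gone, nothing to extend
        probe.append(holder)
        if holder in probe[:-1]:
            # The probe has come back to a transaction it already contains
            return probe[probe.index(holder):], hops


local_edges = {"X": {"W": "U"}, "Y": {"U": "V"}, "Z": {"V": "W"}}
blocked_at = {"W": "X", "U": "Y", "V": "Z"}

cycle, hops = chase_edges(local_edges, blocked_at, ("W", "U"))
print(cycle)  # ['W', 'U', 'V', 'W']
print(hops)   # [('Y', ['W', 'U']), ('Z', ['W', 'U', 'V'])]
```

Server X initiates with the probe for W → U, Y extends it with U → V, and Z closes the cycle when it adds V → W.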

Page 37: Chapter 13

37

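Figure 13.15 Probes transmitted to detect deadlock

[Figure: servers X, Y and Z holding objects A, B and C. W waits for A, held by U (at X); U waits for B, held by V (at Y); V waits for C, held by W (at Z). Probes: initiation at X with < W → U >; Y forwards < W → U → V >; at Z the probe becomes < W → U → V → W > and deadlock is detected.]

example of edge chasing starts with X sending < W → U >, then Y sends < W → U → V >, then Z sends < W → U → V → W >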

Page 38: Chapter 13

38

Edge chasing conclusion

probe to detect a cycle with N transactions will require 2(N-1) messages.
– Studies of databases show that the average deadlock involves 2 transactions.

the above algorithm detects deadlock provided that
– waiting transactions do not abort
– no process crashes, no lost messages
– to be realistic it would need to allow for the above failures

refinements of the algorithm (p 536-7)
– to avoid more than one transaction causing detection to start and then more than one being aborted
– not time to study these now

Page 39: Chapter 13

41

Summary of concurrency control for distributed transactions

each server is responsible for the serializability of transactions that access its own objects.

additional protocols are required to ensure that transactions are serializable globally.
– timestamp ordering requires a globally agreed timestamp ordering
– optimistic concurrency control requires global validation or a means of forcing a global ordering on transactions.
– two-phase locking can lead to distributed deadlocks.

distributed deadlock detection looks for cycles in the global wait-for graph.
edge chasing is a non-centralized approach to the detection of distributed deadlocks.

Page 40: Chapter 13

42

13.6 Transaction recovery

Atomicity property of transactions
– durability and failure atomicity
– durability requires that objects are saved in permanent storage and will be available indefinitely
– failure atomicity requires that effects of transactions are atomic even when the server crashes

Recovery is concerned with
– ensuring that a server’s objects are durable and
– that the service provides failure atomicity.
– for simplicity we assume that when a server is running, all of its objects are in volatile memory
– and all of its committed objects are in a recovery file in permanent storage
– recovery consists of restoring the server with the latest committed versions of all of its objects from its recovery file

What is meant by durability?
What is meant by failure atomicity?

Page 41: Chapter 13

43

Recovery manager

The task of the Recovery Manager (RM) is:
– to save objects in permanent storage (in a recovery file) for committed transactions;
– to restore the server’s objects after a crash;
– to reorganize the recovery file to improve the performance of recovery;
– to reclaim storage space (in the recovery file).

media failures
– i.e. disk failures affecting the recovery file
– need another copy of the recovery file on an independent disk, e.g. implemented as stable storage or using mirrored disks

we deal with recovery of 2PC separately (at the end)
– we study logging (13.6.1) but not shadow versions (13.6.2)

Page 42: Chapter 13

44

Recovery - intentions lists

Each server records an intentions list for each of its currently active transactions
– an intentions list contains a list of the object references and the values of all the objects that are altered by a transaction
– when a transaction commits, the intentions list is used to identify the objects affected
  • the committed version of each object is replaced by the tentative one
  • the new value is written to the server’s recovery file
– in 2PC, when a participant says it is ready to commit, its RM must record its intentions list and its objects in the recovery file
  • it will be able to commit later on even if it crashes
  • when a client has been told a transaction has committed, the recovery files of all participating servers must show that the transaction is committed,
    • even if they crash between prepare to commit and commit

Page 43: Chapter 13

45

Types of entry in a recovery file

For distributed transactions we need information relating to the 2PC as well as object values, that is:

– transaction status (committed, prepared or aborted)
– intentions list

Type of entry        Description of contents of entry
Object               A value of an object.
Transaction status   Transaction identifier, transaction status (prepared, committed, aborted) and other status values used for the two-phase commit protocol.
Intentions list      Transaction identifier and a sequence of intentions, each of which consists of <identifier of object>, <position in recovery file of value of object>.

Figure 13.18

Why is that a good idea?

Object state flattened to bytes

first entry says prepared

Note that the objects need not be next to one another in the recovery file
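
One possible way to represent the three entry types of Figure 13.18 is sketched below. The class names, the use of `pickle` to flatten object state to bytes and the list standing in for the recovery file are assumptions made for illustration only. Because each intention stores a position rather than a value, the object values can sit anywhere in the file.

```python
import pickle
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ObjectEntry:
    value: bytes                              # object state flattened to bytes


@dataclass(frozen=True)
class StatusEntry:
    tid: str
    status: str                               # "prepared", "committed" or "aborted"


@dataclass(frozen=True)
class IntentionsEntry:
    tid: str
    intentions: Tuple[Tuple[str, int], ...]   # (object id, position of its value in the file)


# Hypothetical recovery file: list indices play the role of file positions.
recovery_file = [
    ObjectEntry(pickle.dumps(80)),                # position 0: tentative value of A
    ObjectEntry(pickle.dumps(220)),               # position 1: tentative value of B
    StatusEntry("T", "prepared"),
    IntentionsEntry("T", (("A", 0), ("B", 1))),
]


def tentative_values(log, tid):
    """Follow a transaction's intentions list to the values it refers to."""
    for entry in log:
        if isinstance(entry, IntentionsEntry) and entry.tid == tid:
            return {obj: pickle.loads(log[pos].value) for obj, pos in entry.intentions}
    return {}
```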


Logging - a technique for the recovery file

the recovery file represents a log of the history of all the transactions at a server
– it includes objects, intentions lists and transaction status
– in the order that transactions prepared, committed and aborted
– a recent snapshot + a history of transactions after the snapshot
– during normal operation the RM is called whenever a transaction prepares, commits or aborts
• prepare - RM appends to recovery file all the objects in the intentions list followed by status (prepared) and the intentions list
• commit/abort - RM appends to recovery file the corresponding status
• assume append operation is atomic, if server fails only the last write will be incomplete
• to make efficient use of disk, buffer writes. Note: sequential writes are more efficient than those to random locations
• committed status is forced to the log - in case server crashes
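
The append-only behaviour of the RM during normal operation might look like the sketch below. It is an assumption-laden model: `RecoveryManager`, its method names and the tuple layout of the entries are invented for the example, the log is an in-memory list whose indices stand for file positions, and buffering and forcing writes to disk are not modelled. Each status entry also carries the position of the previous status entry, as in the banking log on the next slide.

```python
class RecoveryManager:
    """Hypothetical append-only log; list indices stand in for file positions."""

    def __init__(self, checkpoint):
        # Position 0 holds the checkpoint: a snapshot of the committed values.
        self.log = [("checkpoint", dict(checkpoint))]
        self.last_status = 0   # position of the most recent status entry (or checkpoint)

    def _append(self, entry):
        self.log.append(entry)
        return len(self.log) - 1

    def prepare(self, tid, tentative):
        # Append every object in the intentions list, then the prepared status
        # together with the intentions list (object id, position of value).
        intentions = [(obj, self._append(("object", obj, value)))
                      for obj, value in tentative.items()]
        self.last_status = self._append(
            ("status", tid, "prepared", intentions, self.last_status))

    def finish(self, tid, status):
        # Commit or abort: append only the corresponding status entry.
        assert status in ("committed", "aborted")
        self.last_status = self._append(
            ("status", tid, status, None, self.last_status))


rm = RecoveryManager({"A": 100, "B": 200, "C": 300})
rm.prepare("T", {"A": 80, "B": 220})
rm.finish("T", "committed")
rm.prepare("U", {"C": 278, "B": 242})
```

Running these calls produces eight entries whose positions and pointers correspond to P0 to P7 in the banking example.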


Log for banking service

Logging mechanism for Fig 12.7 (there would really be other objects in log file)
– initial balances of A, B and C $100, $200, $300
– T sets A and B to $80 and $220. U sets B and C to $242 and $278
– entries to the left of the checkpoint line (up to P0) represent a snapshot (checkpoint) of values of A, B and C before T started. T has committed, but U is prepared.
– the RM gives each object a unique identifier (A, B, C in diagram)
– each status entry contains a pointer to the previous status entry (the earliest points to the checkpoint), so the RM can follow transactions backwards through the file

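Position | Entry | Contents
P0 | Object:A, Object:B, Object:C | 100, 200, 300 (checkpoint)
P1 | Object:A | 80
P2 | Object:B | 220
P3 | Trans:T | prepared; intentions list <A, P1> <B, P2>; previous status entry P0 (prepared status and intentions list)
P4 | Trans:T | committed; previous status entry P3 (committed status)
P5 | Object:C | 278
P6 | Object:B | 242
P7 | Trans:U | prepared; intentions list <C, P5> <B, P6>; previous status entry P4
End of log

Figure 13.19.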


Recovery of objects - with logging

When a server is replaced after a crash
– it first sets default initial values for its objects
– and then hands over to its recovery manager.

The RM restores the server’s objects to include
– all the effects of all the committed transactions in the correct order and
– none of the effects of incomplete or aborted transactions
– it ‘reads the recovery file backwards’ (by following the pointers)
• restores values of objects with values from committed transactions
• continuing until all of the objects have been restored
– if it started at the beginning, there would generally be more work to do
– to recover the effects of a transaction use the intentions list to find the value of the objects
• e.g. look at previous slide (assuming the server crashed before T committed)
– the recovery procedure must be idempotent
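
A backwards pass over the banking log could be sketched like this. The tuple layout of the entries and the function name `recover` are assumptions for the example; list indices stand in for the positions P0 to P7. Unlike the procedure described above, this sketch does not stop early once every object has been restored; it simply walks the whole pointer chain back to the checkpoint.

```python
LOG = [
    ("checkpoint", {"A": 100, "B": 200, "C": 300}),          # P0
    ("object", "A", 80),                                      # P1
    ("object", "B", 220),                                     # P2
    ("status", "T", "prepared", [("A", 1), ("B", 2)], 0),     # P3
    ("status", "T", "committed", None, 3),                    # P4
    ("object", "C", 278),                                     # P5
    ("object", "B", 242),                                     # P6
    ("status", "U", "prepared", [("C", 5), ("B", 6)], 4),     # P7
]


def recover(log):
    """Rebuild committed object values by reading the log backwards."""
    restored = {}
    committed = set()
    # Start at the most recent status entry (or the checkpoint at position 0).
    pos = len(log) - 1
    while pos > 0 and log[pos][0] != "status":
        pos -= 1
    # Follow the pointers from status entry to previous status entry.
    while pos > 0:
        _, tid, status, intentions, prev = log[pos]
        if status == "committed":
            committed.add(tid)
        elif status == "prepared" and tid in committed:
            # Use the intentions list to find the values; setdefault keeps the
            # most recent committed value because we are moving backwards.
            for obj, value_pos in intentions:
                restored.setdefault(obj, log[value_pos][2])
        pos = prev
    # Anything not written by a committed transaction comes from the checkpoint.
    for obj, value in log[0][1].items():
        restored.setdefault(obj, value)
    return restored
```

With the full log, T's values for A and B are restored and U's tentative values are ignored. Truncating the log to its first four entries models a crash before T committed, in which case only the checkpoint values survive. The function only reads the log, so running it again gives the same result.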


Logging - reorganising the recovery file

RM is responsible for reorganizing its recovery file
– so as to make the process of recovery faster and
– to reduce its use of space

checkpointing
– the process of writing the following to a new recovery file
• the current committed values of a server’s objects,
• transaction status entries and intentions lists of transactions that have not yet been fully resolved
• including information related to the two-phase commit protocol (see later)
– checkpointing makes recovery faster and saves disk space
• done after recovery and from time to time
• can use old recovery file until new one is ready, add a ‘mark’ to old file
• do as above and then copy items after the mark to new recovery file
• replace old recovery file by new recovery file
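
The first part of checkpointing, writing a fresh recovery file, might be sketched as below. The function name, its arguments and the tuple layout of the entries are hypothetical. The sketch covers only the snapshot plus the unresolved transactions; adding a mark to the old file, copying the entries that follow the mark and swapping the files are not modelled.

```python
def write_checkpoint(committed_values, unresolved):
    """Build a new recovery file (a list whose indices stand for positions).

    committed_values: object id -> current committed value
    unresolved: transaction id -> (status, {object id: tentative value})
    """
    new_log = [("checkpoint", dict(committed_values))]   # snapshot at position 0
    last_status = 0
    for tid, (status, tentative) in unresolved.items():
        intentions = []
        for obj, value in tentative.items():
            new_log.append(("object", obj, value))
            intentions.append((obj, len(new_log) - 1))
        # Re-write the status entry and intentions list with the new positions.
        new_log.append(("status", tid, status, intentions, last_status))
        last_status = len(new_log) - 1
    return new_log


# State at the end of the banking log: T is committed, U is still prepared.
new_file = write_checkpoint(
    {"A": 80, "B": 220, "C": 300},
    {"U": ("prepared", {"C": 278, "B": 242})},
)
```

For the banking example the eight-entry log shrinks to four entries, since T's entries are folded into the snapshot and only U's unresolved entries are carried over.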


Recovery of the two-phase commit protocol

The above recovery scheme is extended to deal with transactions doing the 2PC protocol when a server fails

– it uses new transaction status values done, uncertain (see Fig 13.6)
• the coordinator uses committed when result is Yes; done when 2PC complete (if a transaction is done its information may be removed when reorganising the recovery file)
• the participant uses uncertain when it has voted Yes; committed when told the result (uncertain entries must not be removed from recovery file)
– It also requires two additional types of entry:

Type of entry | Description of contents of entry
Coordinator | Transaction identifier, list of participants (added by RM when coordinator prepared)
Participant | Transaction identifier, coordinator (added by RM when participant votes yes)


Log with entries relating to two-phase commit protocol

entries in log for
– T where server is coordinator (prepared comes first, followed by the coordinator entry, then committed – done is not shown)
– and U where server is participant (prepared comes first followed by the participant entry, then uncertain and finally committed)
– these entries will be interspersed with values of objects

recovery must deal with 2PC entries as well as restoring objects
– where server was coordinator find coordinator entry and status entries.
– where server was participant find participant entry and status entries

Entry | Contents
Trans:T | prepared; intentions list
Coord’r:T | part’pant list: . . . (coordinator entry)
Trans:T | committed
Trans:U | prepared; intentions list
Part’pant:U | Coord’r: . . (participant entry)
Trans:U | uncertain
Trans:U | committed

Figure 13.21

Start at end, for U find it is committed and a participant
We have T committed and coordinator
But if the server has crashed before the last entry we have U uncertain and participant
or if the server crashed earlier we have U prepared and participant


Recovery of the two-phase commit protocol

Role | Status | Action of recovery manager
Coordinator | prepared | No decision had been reached before the server failed. It sends abortTransaction to all the servers in the participant list and adds the transaction status aborted in its recovery file. Same action for state aborted. If there is no participant list, the participants will eventually timeout and abort the transaction.
Coordinator | committed | A decision to commit had been reached before the server failed. It sends a doCommit to all the participants in its participant list (in case it had not done so before) and resumes the two-phase protocol at step 4 (Fig 13.5).
Participant | committed | The participant sends a haveCommitted message to the coordinator (in case this was not done before it failed). This will allow the coordinator to discard information about this transaction at the next checkpoint.
Participant | uncertain | The participant failed before it knew the outcome of the transaction. It cannot determine the status of the transaction until the coordinator informs it of the decision. It will send a getDecision to the coordinator to determine the status of the transaction. When it receives the reply it will commit or abort accordingly.
Participant | prepared | The participant has not yet voted and can abort the transaction.
Coordinator | done | No action is required.

Figure 13.22

the most recent entry in the recovery file determines the status of the transaction at the time of failure

the RM action for each transaction depends on whether server was coordinator or participant and the status
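
The lookup described above, role plus most recent status giving the recovery action, can be sketched as a table-driven function. The entry layout, the names `ACTIONS` and `plan_recovery`, and the short action strings are assumptions for the example; the message names are those of Figure 13.22. Transactions that have a status entry but no coordinator or participant entry are skipped here.

```python
# Entries of Figure 13.21 in log order (object values and lists omitted).
LOG = [
    ("status", "T", "prepared"),
    ("coordinator", "T"),
    ("status", "T", "committed"),
    ("status", "U", "prepared"),
    ("participant", "U"),
    ("status", "U", "uncertain"),
    ("status", "U", "committed"),
]

# (role, status) -> action of the recovery manager, summarising Figure 13.22.
ACTIONS = {
    ("coordinator", "prepared"): "send abortTransaction, log aborted",
    ("coordinator", "aborted"): "send abortTransaction, log aborted",
    ("coordinator", "committed"): "send doCommit, resume 2PC",
    ("coordinator", "done"): "no action",
    ("participant", "committed"): "send haveCommitted",
    ("participant", "uncertain"): "send getDecision, then commit or abort",
    ("participant", "prepared"): "abort (not yet voted)",
}


def plan_recovery(log):
    """Return transaction id -> (role, status at time of failure, action)."""
    role, status = {}, {}
    for entry in log:
        if entry[0] == "status":
            status[entry[1]] = entry[2]   # the most recent status entry wins
        else:
            role[entry[1]] = entry[0]     # coordinator or participant entry
    return {tid: (role[tid], status[tid], ACTIONS[(role[tid], status[tid])])
            for tid in status if tid in role}
```

Calling `plan_recovery(LOG[:6])` models a crash before the last entry was written, so U is found uncertain; `plan_recovery(LOG[:5])` models an earlier crash, where U is only prepared and can be aborted.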


Summary of transaction recovery

Transaction-based applications have strong requirements for the long life and integrity of the information stored.

Transactions are made durable by performing checkpoints and logging in a recovery file, which is used for recovery when a server is replaced after a crash.

Users of a transaction service would experience some delay during recovery.

It is assumed that the servers of distributed transactions exhibit crash failures and run in an asynchronous system,

– but they can reach consensus about the outcome of transactions because crashed servers are replaced with new processes that can acquire all the relevant information from permanent storage or from other servers