The Paxos Commit Algorithm


description

This is the presentation I used to give a seminar about the "Paxos Commit" algorithm. It is one of Leslie Lamport's works (in this case, a joint work between him and Jim Gray). You can find the original paper here: http://research.microsoft.com/users/lamport/pubs/pubs.html#paxos-commit. Feel free to post comments ;) Enjoy.

Transcript of the Paxos Commit algorithm

Page 1: the Paxos Commit algorithm

Databases 2

The Paxos Commit Algorithm

Page 2: the Paxos Commit algorithm

Agenda

Paxos Commit Algorithm: Overview
The participating processes
  The resource managers
  The leader
  The acceptors
Paxos Commit Algorithm: the base version
Failure scenarios
Optimizations for Paxos Commit
Performance
Paxos Commit vs. Two-Phase Commit
Using a dynamic set of resource managers

Page 3: the Paxos Commit algorithm


Paxos Commit Algorithm: Overview

Paxos was applied to Transaction Commit by L. Lamport and Jim Gray in "Consensus on Transaction Commit"

One instance of Paxos (consensus algorithm) is executed for each resource manager, in order to agree upon a value (Prepared/Aborted) proposed by it

“Not-synchronous” commit algorithm
Fault-tolerant (unlike 2PC)

Intended to be used in systems where failures are fail-stop only, for both processes and network

Safety is guaranteed (unlike 3PC)
Formally specified and checked
Can be optimized to the theoretically best performance
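To make the overview concrete, here is a minimal Python sketch (my own illustration, not from the slides; run_paxos_instance is a hypothetical stand-in for a full Paxos run) of the structure just described: one consensus instance per resource manager, with the transaction committing only if every instance chooses Prepared.

    # Minimal sketch: one Paxos instance per resource manager (RM).
    def run_paxos_instance(rm_id, proposed_value):
        # Hypothetical stand-in for a full Paxos run; in the failure-free case
        # the value chosen for an RM's instance is simply the value it proposed.
        return proposed_value

    def transaction_outcome(locally_chosen_values):
        """locally_chosen_values: dict RM id -> LCV, 'p' (Prepared) or 'a' (Aborted)."""
        chosen = {rm: run_paxos_instance(rm, v) for rm, v in locally_chosen_values.items()}
        # The transaction commits iff every instance chose Prepared.
        return "commit" if all(v == "p" for v in chosen.values()) else "abort"

    print(transaction_outcome({"RM1": "p", "RM2": "p", "RM3": "a"}))  # -> abort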

Page 4: the Paxos Commit algorithm


Participants: the resource managers

N resource managers (“RM”) execute the distributed transaction, then choose a value (“locally chosen value” or “LCV”; ‘p’ for prepared iff it is willing to commit)

Every RM tries to get its LCV accepted by a majority set of acceptors (“MS”: any subset with a cardinality strictly greater than half of the total).

Each RM is the first proposer in its own instance of Paxos
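The majority-set requirement above can be read as a one-line check; a small sketch (my own illustration) with the five acceptors used in the later figures:

    # "MS": any subset of the acceptors whose cardinality is strictly greater
    # than half of the total number of acceptors.
    def is_majority_set(subset, all_acceptors):
        return len(set(subset)) > len(set(all_acceptors)) / 2

    acceptors = ["AC1", "AC2", "AC3", "AC4", "AC5"]
    print(is_majority_set(["AC1", "AC2", "AC3"], acceptors))  # True  (3 > 5/2)
    print(is_majority_set(["AC1", "AC2"], acceptors))         # False (2 < 5/2)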

Participants: the leader

Coordinates the commit algorithm
All the instances of Paxos share the same leader
It is not a single point of failure (unlike 2PC)
Assumed always defined (true, many leader-(s)election algorithms exist) and unique (not necessarily true, but unlike 3PC safety does not rely on it)

Page 5: the Paxos Commit algorithm

Participants: the acceptors

A denotes the set of acceptors
All the instances of Paxos share the same set A of acceptors
2F+1 acceptors are involved in order to achieve tolerance to F failures
We will consider only F+1 acceptors, leaving F more for "spare" purposes (less communication overhead)
Each acceptor keeps track of its own progress in a Nx1 vector
The vectors need to be merged into a Nx|MS| table, called aState, in order to take the global decision (we want "many" p's)

(Figure: RM1, RM2 and RM3 each run their own instance of Paxos, the 1st, 2nd and 3rd, against the same "consensus box" of acceptors AC1-AC5 (a majority set MS); RM1 proposes 'a' while RM2 and RM3 propose 'p', and each gets back "Ok!". The acceptors' vectors form the columns Acc1-Acc5 of the aState table, one row per instance.)
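To fix ideas, a small sketch (my own, using plain Python dicts rather than the paper's notation) of how the acceptors' Nx1 vectors are merged into the aState table:

    # Each acceptor reports a vector with one entry per RM's instance; an entry
    # is a (ballot, value) pair such as (0, 'p'). The merged table is indexed as
    # astate[rm][acc].
    def merge_astate(acceptor_vectors):
        """acceptor_vectors: dict acceptor id -> dict(RM id -> (ballot, value))."""
        astate = {}
        for acc, vector in acceptor_vectors.items():
            for rm, entry in vector.items():
                astate.setdefault(rm, {})[acc] = entry
        return astate

    vectors = {
        "AC1": {"RM1": (0, "a"), "RM2": (0, "p"), "RM3": (0, "p")},
        "AC2": {"RM1": (0, "a"), "RM2": (0, "p"), "RM3": (0, "p")},
        "AC3": {"RM2": (0, "p"), "RM3": (0, "p")},
    }
    print(merge_astate(vectors)["RM2"])  # {'AC1': (0, 'p'), 'AC2': (0, 'p'), 'AC3': (0, 'p')}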

Page 6: the Paxos Commit algorithm


Paxos Commit (base)

Not blocked iff F acceptors respond

(Figure: message flow for N=5 resource managers RM0-RM4 and F=2, with acceptors AC0-AC2 and the leader L co-located with AC0; writes to the log are marked. RM0 starts with BeginCommit and a phase 2a message (0, 0, v(0)) for its own instance; the leader sends prepare to the other N-1 RMs; each RM rm sends a phase 2a message (rm, 0, v(rm)) to the acceptors, N(F+1)-1 messages in total; the acceptors reply with their phase 2b messages, plus optionally F more; if the Global Commit condition holds, the leader sends commit in phase 3, otherwise abort. The values v range over {p, a}; T1 and T2 mark the failure scenarios discussed on the next slides.)

Page 7: the Paxos Commit algorithm


Global Commit Condition

$\mathit{GlobalCommit} \;\triangleq\; \forall\, rm \in RM:\ \exists\, b:\ \exists\, MS:\ \forall\, acc \in MS:\ \text{a phase 2b message } \langle rm,\, b,\, p \rangle \text{ was sent by } acc \text{ (and received)}$

That is: there must be one and only one row for each RM involved in the commit; in each of those rows there must be at least F+1 entries that have 'p' as a value and refer to the same ballot
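A sketch of how the condition can be checked over the aState table built in the earlier sketch (my own code; the F+1 threshold corresponds to a majority of the notional 2F+1 acceptors):

    from collections import Counter

    def global_commit(astate, rms, f):
        """astate[rm][acc] = (ballot, value) reported by acceptor acc for rm's instance."""
        for rm in rms:
            row = astate.get(rm, {})
            # Count, per ballot, how many acceptors reported 'p' for this RM.
            prepared_per_ballot = Counter(b for (b, v) in row.values() if v == "p")
            # Need at least F+1 entries with 'p' that refer to the same ballot.
            if not any(count >= f + 1 for count in prepared_per_ballot.values()):
                return False
        return True

    astate = {
        "RM1": {"AC0": (0, "p"), "AC1": (0, "p"), "AC2": (0, "p")},
        "RM2": {"AC0": (0, "p"), "AC1": (0, "p")},
    }
    print(global_commit(astate, ["RM1", "RM2"], f=2))  # False: RM2 has only 2 'p' entries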

Page 8: the Paxos Commit algorithm


[T1] What if some RMs do not submit their LCV?

RM j's value is missing, so the leader acts as a proposer in RM j's instance of Paxos, starting a ballot bL1 > 0 with one majority of acceptors:

p1a ("prepare?"), from the leader: «Has resource manager j ever proposed you a value v in {p, a}?»

p1b ("promise"), from each acceptor i, which promises not to answer any ballot bL2 < bL1 and replies either:
(1) «Yes, in my last session (ballot) bi with it I accepted its proposal vi»
(2) «No, never»

p2a ("accept?"), from the leader, if at least |MS| acceptors answered:
If for ALL of them case (2) holds, then V = 'a' [FREE]
else V = v(maximum({bi})) [FORCED]
Leader: «I am j, I propose V»
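The leader's free/forced choice can be sketched as follows (my own illustration; the answers list is a hypothetical encoding of the phase 1b replies from a majority of acceptors):

    # None encodes «No, never»; (ballot, value) encodes «Yes, I accepted ...».
    def choose_value_for_missing_rm(answers):
        accepted = [a for a in answers if a is not None]
        if not accepted:
            return "a"                   # FREE: no acceptor ever accepted a value, abort is safe
        _, value = max(accepted, key=lambda bv: bv[0])
        return value                     # FORCED: value accepted at the highest ballot

    print(choose_value_for_missing_rm([None, None, None]))          # 'a' (free)
    print(choose_value_for_missing_rm([None, (0, "p"), (1, "p")]))  # 'p' (forced)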

Page 9: the Paxos Commit algorithm



[T2] What if the leader fails?

If the leader fails, some leader-(s)election algorithm is executed. A faulty election (2+ leaders) doesn't preclude safety (unlike 3PC), but can impede progress…

Non-terminating example: an infinite sequence of p1a-p1b-p2a messages from 2 leaders
Not really likely to happen; it can be avoided (random T?)

(Figure: two leaders L1 and L2 keep starting higher ballots b1 > 0, b2 > b1, b3 > b2, b4 > b3 after each timeout T; the majority set MS of acceptors trusts each new ballot and ignores the messages of the previously trusted leader.)
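The "random T" remark can be read as randomizing the retry delay, so that two dueling leaders are unlikely to keep pre-empting each other forever; a toy sketch of that idea (my own assumption, not part of the slides):

    import random

    # Randomized, capped backoff (in seconds) before starting a new, higher ballot.
    def retry_delay(attempt, base=0.05, cap=1.0):
        return random.uniform(0, min(cap, base * (2 ** attempt)))

    for attempt in range(4):
        print(f"attempt {attempt}: wait {retry_delay(attempt):.3f}s before the next ballot")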

Page 10: the Paxos Commit algorithm


Optimizations for Paxos Commit (1)

Co-Location: each acceptor is on the same node as an RM, and the initiating RM is on the same node as the initial leader

-1 message phase (BeginCommit), -(F+2) messages

“Real-Time assumptions”: RMs can prepare spontaneously. The prepare phase is no longer needed; the RMs just “know” they have to prepare within some amount of time

-1 message phase (Prepare), -(N-1) messages

(Figure: co-located deployment of RM0-RM4 on the three acceptor nodes AC0-AC2, with the leader L on AC0's node; the BeginCommit, p2a and p3 messages travel between the nodes. In the second diagram, under the real-time assumption, the prepare messages to the other N-1 RMs are not needed anymore.)

Page 11: the Paxos Commit algorithm


Optimizations for Paxos Commit (2)

Phase 3 elimination: the acceptors send their phase 2b messages (the columns of aState) directly to the RMs, which then evaluate the global commit condition

Paxos Commit + Phase 3 Elimination = Faster Paxos Commit (FPC)
FPC + Co-location + R.T.A. = Optimal Consensus Algorithm

(Figure: without the optimization the acceptors AC0-AC2 send their p2b messages to the leader, which then sends p3 to RM0-RM4; with phase 3 elimination the acceptors send their p2b messages directly to all the RMs.)

Page 12: the Paxos Commit algorithm


Performance

                               2PC               Paxos Commit             Faster Paxos Commit
                               No coloc.  Coloc.  No coloc.    Coloc.      No coloc.   Coloc.
Message delays*                4          3       5            4           4           3
Messages*                      3N-1       3N-3    NF+F+3N-1    NF+3N-3     2NF+3N-1    2NF-2F+3N-3
Stable storage write delays**  2                  2                        2
Stable storage writes**        N+1                N+F+1                    N+F+1

If we deploy only one acceptor for Paxos Commit (F=0), its fault tolerance and cost are the same as 2PC’s. Are they exactly the same protocol in that case?

* Not assuming RMs' concurrent preparation (slides-like scenario)
** Assuming RMs' concurrent preparation (r.t. constraints needed)
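As a quick sanity check of the "Messages*" formulas, plugging in the N=5, F=2 configuration used in the earlier figures (my own arithmetic, not part of the slides):

    N, F = 5, 2
    messages = {
        "2PC, no coloc.":          3*N - 1,                 # 14
        "2PC, coloc.":             3*N - 3,                 # 12
        "Paxos Commit, no coloc.": N*F + F + 3*N - 1,       # 26
        "Paxos Commit, coloc.":    N*F + 3*N - 3,           # 22
        "FPC, no coloc.":          2*N*F + 3*N - 1,         # 34
        "FPC, coloc.":             2*N*F - 2*F + 3*N - 3,   # 28
    }
    for name, count in messages.items():
        print(f"{name}: {count} messages")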

Page 13: the Paxos Commit algorithm


Paxos Commit vs. 2PC

Yes, but…

(Figure: two message diagrams (T1 and T2) between the TM, RM1 and the other RMs: 2PC as described in Lamport and Gray's paper and 2PC as presented in the slides of the course.)

…two slightly different versions of 2PC!

Page 14: the Paxos Commit algorithm


Using a dynamic set of RMs

You add one process, the registrar, which acts just like another resource manager, except for the following:

RMs can join the transaction until the Commit Protocol begins

The global commit condition now holds on the set of resource managers proposed by the registrar and decided in its own instance of Paxos:

$v_{\mathit{registrar}} = \{\, rm : rm \text{ joined the transaction} \,\}$ (the value proposed by the registrar, instead of a value in $\{p, a\}$)

$\mathit{GlobalCommit}_{\mathit{DynRM}} \;\triangleq\; \forall\, rm \in v_{\mathit{registrar}}:\ \exists\, b:\ \exists\, MS:\ \forall\, acc \in MS:\ \text{a phase 2b message } \langle rm,\, b,\, p \rangle \text{ was sent by } acc \text{ (and received)}$
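Reusing the global_commit sketch given for the static condition (again my own code, not the slides'), the dynamic variant only changes which set of RMs the condition ranges over:

    # registrar_decided_rms is the value chosen in the registrar's own instance:
    # either the set {rm : rm joined the transaction} or 'a' for abort.
    def global_commit_dyn_rm(astate, registrar_decided_rms, f):
        if registrar_decided_rms == "a":
            return False
        return global_commit(astate, registrar_decided_rms, f)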

(Figure: RM1, RM2 and RM3 send join messages to the registrar REG; the registrar proposes the set RM1;RM2;RM3 in its own instance of Paxos, while RM1 proposes 'a' and RM2 and RM3 propose 'p' in theirs; all instances run against the majority set MS of acceptors AC1-AC5, and each gets back "Ok!".)

Page 15: the Paxos Commit algorithm


Thank You!

Questions?