Paxos: Asynchronous Consensus

Page 1: Paxos: Asynchronous Consensus

Paxos: Asynchronous Consensus

Page 2: Paxos: Asynchronous Consensus

Consensus in asynchronous systems

• We cannot “solve” consensus in asynchronous systems.
  • We cannot meet both safety and liveness requirements.
• Maybe it is OK to guarantee just one requirement.
• Option 1:
  • Set a very conservative timeout so that the algorithm always terminates.
  • Safety is violated if a process (or the network) is very, very slow.
• Option 2:
  • Focus on guaranteeing safety under all possible scenarios.
  • If the real situation is not too dire, the algorithm will hopefully terminate.

Page 3: Paxos: Asynchronous Consensus

No-failure consensus (Ricart-Agrawala)

1. Send proposal to all processes
2. Decide once everyone replies

Page 4: Paxos: Asynchronous Consensus

Synchronous Consensus

1. Send proposal to everyone
2. Decide when everyone replies or on timeout

Problem: the timeout needs to be large enough to be certain that any non-reply means a crashed process

Page 5: Paxos: Asynchronous Consensus

Asynchronous Consensus

1. Send proposal to everyone
2. Decide when a quorum (strict majority) replies

Note: liveness requires < N/2 failures

Paxos:
• avoid deadlock
• minimize livelock issues
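
The quorum rule can be made concrete with a little arithmetic. The following is a minimal sketch; the function names quorum_size and max_tolerated_failures are hypothetical and chosen only for illustration.

```python
def quorum_size(n: int) -> int:
    """Smallest strict majority of n processes."""
    return n // 2 + 1


def max_tolerated_failures(n: int) -> int:
    """Largest number of crashed processes that still leaves a quorum alive."""
    return n - quorum_size(n)


for n in (3, 4, 5):
    q = quorum_size(n)
    f = max_tolerated_failures(n)
    # Any two quorums overlap in at least one process because 2 * q > n.
    print(f"N={n}: quorum={q}, tolerated failures={f}, overlap={2 * q > n}")
```

Note that going from 3 to 4 processes raises the quorum size without tolerating any extra failure, which is one reason odd cluster sizes are a common choice.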

Page 6: Paxos: Asynchronous Consensus

Deadlock

• 4 processes
• P1, P2 agree to proposal X
• P3, P4 agree to proposal Y

No majority possible without changing decision!

Page 7: Paxos: Asynchronous Consensus

Prioritization

Use proposal ID (say, process ID) to break ties. Highest ID wins.
• P1 proposes X with ID 1
• P1, P2 accept X
• P4 proposes Y with ID 4
• P3, P4 accept Y
• P1, P2 change their mind to Y since 4 > 1

No deadlock

Page 8: Paxos: Asynchronous Consensus

Protocol so far

Processes propose values to others
Processes accept first proposal they see
Majority accepts = consensus
Processes accept second proposal if higher ID (change their mind)

Proposer:
• send( (value,ID) ) to all processes
• wait for responses
• majority OK => declare proposal accepted

Acceptor:
• initialize: accepted_ID = -1
• receive proposal (value,ID)
• if ID > accepted_ID:
  • accepted_ID = ID, reply OK

Page 9: Paxos: Asynchronous Consensus

Prioritization problem

• P1 proposes X with ID 1
• P1, P2, P3 accept X
• P1 gets majority response, sets decision to X
• P4 proposes Y with ID 4
• P2, P3, P4 accept Y
• P4 gets majority response, sets decision to Y
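
This scenario can be replayed in a few lines. The sketch below models only the single-phase acceptor rule (accept if ID > accepted_ID). The names NaiveAcceptor and propose are hypothetical, and messages are modelled as direct, lossless method calls.

```python
class NaiveAcceptor:
    """Single-phase acceptor: accepts any proposal with a higher ID."""

    def __init__(self):
        self.accepted_id = -1
        self.accepted_value = None

    def receive(self, value, proposal_id):
        if proposal_id > self.accepted_id:
            self.accepted_id = proposal_id
            self.accepted_value = value
            return True  # reply OK
        return False


def propose(reached, total, value, proposal_id):
    """Send (value, ID) to the reachable acceptors; decide on a strict majority of OKs."""
    oks = sum(a.receive(value, proposal_id) for a in reached)
    return value if oks > total // 2 else None


p1, p2, p3, p4 = (NaiveAcceptor() for _ in range(4))
decision_p1 = propose([p1, p2, p3], 4, "X", 1)  # P1, P2, P3 accept X
decision_p4 = propose([p2, p3, p4], 4, "Y", 4)  # P2, P3 change their mind
print(decision_p1, decision_p4)  # X Y
```

Both proposers see a majority of OK replies, so two different values are decided. That is a safety violation, and it is what the two-phase protocol is meant to prevent.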

Page 10: Paxos: Asynchronous Consensus

Two-phase protocol

Phase 1: proposer asks about previously accepted values
Phase 2: proposer uses the accepted value with the largest ID (or its own if none accepted)

In both phases, wait for a majority of responses

Page 11: Paxos: Asynchronous Consensus

Two-phase protocol

• P1 calls prepare with ID 1
• P1, P2, P3 reply with {} (no accepted values)
• P1 calls propose with ID 1, value X
• P1, P2, P3 accept (X,1)
• P4 calls prepare with ID 4
• P2, P3 reply with {(X,1)}
• P4 calls propose with ID 4 and value X
• P2, P3 accept (X, 4)

Page 12: Paxos: Asynchronous Consensus

Two-phase protocol

• P4 calls prepare with ID 4
• P2, P3 reply with {}
• P4 calls propose with ID 4 and value Y
• P1 calls prepare with ID 1
• P2, P3 reply with {}
• P1 calls propose with ID 1, value X
• P1, P2, P3 accept (X,1)
• P1 receives accept, decides on X
• P2, P3 receive & accept (Y, 4)
• P4 receives accept, decides on Y

Page 13: Paxos: Asynchronous Consensus

Prepare Promise

Reply to prepare with ID n includes:
• accepted value with highest ID
• a promise not to accept any values with id < n

Page 14: Paxos: Asynchronous Consensus

Two-phase protocol

• P4 calls prepare with ID 4
• P2, P3 reply with {}, promise ID=4
• P4 calls propose with ID 4 and value Y
• P1 calls prepare with ID 1
• P2, P3 reply with nack
• P1 cannot gather a majority of promises, so it never proposes (X,1) and never decides on X
• P2, P3 receive & accept (Y, 4)
• P4 receives accept, decides on Y

Page 15: Paxos: Asynchronous Consensus

Two-phase protocol

• P4 calls prepare with ID 4
• P2, P3 reply with {}, promise ID=4
• P4 crashes
• P1 can no longer get a majority!

Solution: a process can pick a new ID (as long as it's unique)

Page 16: Paxos: Asynchronous Consensus

Full code

Proposer:
    initialize v to input
    pick unique proposal # n
    multicast( prepare(n) ) to acceptors
    if receive promise from a majority:
        if any promises include accepted proposal:
            let (n’, v’) be promise with largest proposal #
            set v = v’
        multicast( propose(n, v) )
        wait for majority accept replies
    on timeout:
        set n to be higher than previous proposals,
        restart

Acceptor:
    initialize promised_id = None,
        accepted_id = None, accepted_v = None
    on receive( prepare(n) ):
        if promised_id = None or n > promised_id:
            set promised_id = n
            if accepted_id != None:
                send( promise(accepted_id, accepted_v) )
            else:
                send( promise(None) )
    on receive( propose(n, v) ):
        if promised_id = None or n >= promised_id:
            accepted_id = promised_id = n
            accepted_v = v
            send( accept(n, v) )
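
The pseudocode above can be turned into a small in-process simulation. This is only a sketch under simplifying assumptions: messages are direct, lossless method calls, there are no timeouts or retries, and a rejected prepare is reported as None (standing in for the nack used in the earlier example). The names Acceptor, on_prepare, on_propose and run_proposer are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Acceptor:
    promised_id: Optional[int] = None
    accepted_id: Optional[int] = None
    accepted_v: Any = None

    def on_prepare(self, n: int) -> Optional[Tuple[Optional[int], Any]]:
        """Return a promise (accepted_id, accepted_v), or None as a nack."""
        if self.promised_id is None or n > self.promised_id:
            self.promised_id = n
            return (self.accepted_id, self.accepted_v)
        return None

    def on_propose(self, n: int, v: Any) -> bool:
        """Accept (n, v) unless a higher-numbered prepare has been promised."""
        if self.promised_id is None or n >= self.promised_id:
            self.promised_id = self.accepted_id = n
            self.accepted_v = v
            return True
        return False


def run_proposer(acceptors: List[Acceptor], n: int, v: Any) -> Any:
    """Run one round; return the chosen value, or None if no majority was reached."""
    majority = len(acceptors) // 2 + 1
    promises = [p for p in (a.on_prepare(n) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None  # a real proposer would retry with a higher n
    prior = [p for p in promises if p[0] is not None]
    if prior:
        # Adopt the value of the accepted proposal with the largest number.
        v = max(prior, key=lambda p: p[0])[1]
    accepts = sum(a.on_propose(n, v) for a in acceptors)
    return v if accepts >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = run_proposer(acceptors, n=1, v="X")   # chooses X
second = run_proposer(acceptors, n=4, v="Y")  # must adopt X, not Y
stale = run_proposer(acceptors, n=2, v="Z")   # rejected: 2 < promised 4
print(first, second, stale)  # X X None
```

The second proposer wanted Y but is forced to re-propose X, which is the behaviour the safety argument relies on.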

Page 17: Paxos: Asynchronous Consensus

Safety

• Suppose a majority accepted proposal (n,v)
• Let n’ be the first proposal with n’ > n
  • Its proposer must have received promises from a majority
  • That majority must intersect with the majority that accepted (n,v)
  • The intersection must have accepted (n,v) before receiving prepare(n’)
  • Therefore it sent promise(n,v) in response to prepare(n’)
  • Therefore proposal n’ must have value v
• By induction, all proposals with n’ > n must have value v

Page 18: Paxos: Asynchronous Consensus

Liveness

Livelock is possible

Optimization: use leader election to pick a distinguished proposer
Key: leader election only has to be probabilistically correct

Page 19: Paxos: Asynchronous Consensus

Other Paxos features

• Separate learners from proposers
• Multi-Paxos: agree on a sequence of values
• Crash-recovery: acceptor can come back up as long as it remembers state (promises)

Page 20: Paxos: Asynchronous Consensus

Log Consensus

• Paxos algorithm (discussed so far) is used for deciding on a single value.

• Many practical systems need to decide on a sequence of values (log).

Page 21: Paxos: Asynchronous Consensus

Replicated Log

• Replicated log => replicated state machine
  • All servers execute same commands in same order
• Consensus module ensures proper log replication

[Figure: clients send commands (e.g. shl) to the servers; each server has a consensus module, a log (add, jmp, mov, shl), and a state machine]
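
A minimal sketch of why an identical log gives identical replicas: each server applies the same deterministic commands in the same order. The command semantics below (an integer register with add, mov and shl) and the names apply_command and replay are hypothetical, loosely modelled on the log entries in the figure.

```python
def apply_command(state: int, command: tuple) -> int:
    """Deterministically apply one log entry to the state machine."""
    op, arg = command
    if op == "add":
        return state + arg
    if op == "mov":
        return arg
    if op == "shl":
        return state << arg
    raise ValueError(f"unknown command: {op}")


def replay(log) -> int:
    """Run a whole log from the initial state 0."""
    state = 0
    for command in log:
        state = apply_command(state, command)
    return state


log = [("mov", 3), ("add", 2), ("shl", 1)]
replicas = [replay(list(log)) for _ in range(3)]
print(replicas)  # [10, 10, 10]

# The same commands in a different order give a different state,
# which is why the consensus module must agree on the order.
reordered = replay([("add", 2), ("mov", 3), ("shl", 1)])
print(reordered)  # 6
```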

Page 22: Paxos: Asynchronous Consensus

Paxos is difficult to understand

“The dirty little secret of the NSDI* community is that at most five people really, truly understand every part of Paxos ;-).”
– Anonymous NSDI reviewer

*The USENIX Symposium on Networked Systems Design and Implementation

Page 23: Paxos: Asynchronous Consensus

Paxos is difficult to implement

“There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system…the final system will be based on an unproven protocol.”
– Chubby authors

Page 24: Paxos: Asynchronous Consensus

Raft: A Consensus Algorithm for Replicated Logs

Slides from Diego Ongaro and John Ousterhout, Stanford University

Page 25: Paxos: Asynchronous Consensus

Goal: Replicated Log

• Replicated log => replicated state machine
  • All servers execute same commands in same order
• Consensus module ensures proper log replication
• System makes progress as long as any majority of servers are up
• Failure model: fail-stop (not Byzantine), delayed/lost messages

[Figure: clients send commands (e.g. shl) to the servers; each server has a consensus module, a log (add, jmp, mov, shl), and a state machine]

Page 26: Paxos: Asynchronous Consensus

Goal: Design for understandability

• Main objective of Raft’s design
  • Whenever possible, select the alternative that is the easiest to understand.
• Techniques that were used include
  • Dividing problems into smaller problems.
  • Reducing the number of system states to consider.

Page 27: Paxos: Asynchronous Consensus

Approaches to Consensus

Two general approaches to consensus:
• Symmetric, leader-less:
  • All servers have equal roles
  • Clients can contact any server
• Asymmetric, leader-based:
  • At any given time, one server is in charge, others accept its decisions
  • Clients communicate with the leader
• Raft uses a leader:
  • Decomposes the problem (normal operation, leader changes)
  • Simplifies normal operation (no conflicts)
  • More efficient than leader-less approaches

Page 28: Paxos: Asynchronous Consensus

Raft Overview

1. Leader election:
   • Select one of the servers to act as leader
   • Detect crashes, choose new leader
2. Normal operation (basic log replication)
3. Safety and consistency after leader changes
4. Neutralizing old leaders

Page 30: Paxos: Asynchronous Consensus

Server States

• At any given time, each server is either:
  • Leader: handles all client interactions, log replication
    • At most 1 viable leader at a time
  • Follower: completely passive: issues no RPCs (requests), responds to incoming RPCs
  • Candidate: used to elect a new leader
• Normal operation: 1 leader, N-1 followers

Page 31: Paxos: Asynchronous Consensus

Quick Detour: RPCs

• Raft servers communicate via RPCs.
• What are RPCs?
  • Remote Procedure Calls: procedure calls between functions in different processes
  • Convenient programming abstraction.

[Figure: P1 invokes P2.call(“foo”, args, reply): (1) P1 sends “foo” and args to P2; (2) P2 runs foo(args) { ... return reply }; (3) P2 sends reply back to P1]

Page 32: Paxos: Asynchronous Consensus

Server States

• At any given time, each server is either:
  • Leader: handles all client interactions, log replication
    • At most 1 viable leader at a time
  • Follower: completely passive: issues no RPCs, responds to incoming RPCs
  • Candidate: used to elect a new leader
• Normal operation: 1 leader, N-1 followers

March 3, 2013

[Figure: state transition diagram]
• start -> Follower
• Follower -> Candidate: timeout, start election
• Candidate -> Candidate: timeout, new election
• Candidate -> Leader: receive votes from majority of servers
• Candidate -> Follower: discover current server or higher term (“step down”)
• Leader -> Follower: discover server with higher term

Page 33: Paxos: Asynchronous Consensus

Terms

• Time divided into terms:
  • Election
  • Normal operation under a single leader
• At most 1 leader per term
• Some terms have no leader (failed election)
• Each server maintains current term value
• Key role of terms: identify obsolete information

[Figure: timeline split into Term 1 through Term 5; a term begins with an election followed by normal operation, except for a term whose election ends in a split vote]
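
One way to picture how terms identify obsolete information is a check applied to every incoming message. This is a minimal sketch, not Raft's full rule set. The Server class and its observe_term method are hypothetical names, and the step-down on a higher term follows the state diagram shown earlier.

```python
class Server:
    def __init__(self):
        self.current_term = 0
        self.state = "follower"

    def observe_term(self, msg_term: int) -> bool:
        """Return False if the message comes from an older term and should be rejected."""
        if msg_term < self.current_term:
            return False  # obsolete information from a stale term
        if msg_term > self.current_term:
            # Our own view is the stale one: adopt the newer term and step down.
            self.current_term = msg_term
            self.state = "follower"
        return True


s = Server()
s.current_term, s.state = 3, "leader"
stale_ok = s.observe_term(2)   # message from an old leader or candidate
state_after_stale = s.state
newer_ok = s.observe_term(5)   # someone has moved on to term 5
print(stale_ok, state_after_stale, newer_ok, s.state, s.current_term)
# False leader True follower 5
```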

Page 34: Paxos: Asynchronous Consensus

Heartbeats and Timeouts

• Servers start up as followers
• Followers expect to receive RPCs from leaders or candidates
• Leaders must send heartbeats (empty AppendEntries RPCs) to maintain authority
• If electionTimeout elapses with no RPCs:
  • Follower assumes leader has crashed
  • Follower starts new election
  • Timeouts typically 100-500ms

Page 35: Paxos: Asynchronous Consensus

Election Basics

• On timeout:
  • Increment current term
  • Change to Candidate state
  • Vote for self
  • Send RequestVote RPCs to all other servers:
    1. Receive votes from majority of servers:
       • Become leader
       • Send AppendEntries heartbeats (RPCs) to all other servers
    2. Receive RPC from valid leader:
       • Return to follower state
    3. No one wins election (election timeout elapses):
       • Increment term, start new election
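
The three possible outcomes of an election can be sketched as event handlers on a single node. This is an illustrative sketch only: no RPCs are actually sent, the Node class and its method names are hypothetical, and a "valid leader" is assumed to mean one whose term is at least as large as the node's own.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: int
    cluster_size: int
    current_term: int = 0
    state: str = "follower"
    votes: int = 0

    def on_election_timeout(self) -> None:
        """No RPCs received, or the previous election failed: start a new election."""
        self.current_term += 1
        self.state = "candidate"
        self.votes = 1  # vote for self
        # RequestVote RPCs to all other servers would be sent here.

    def on_vote_granted(self) -> None:
        if self.state != "candidate":
            return
        self.votes += 1
        if self.votes > self.cluster_size // 2:
            self.state = "leader"  # would now send AppendEntries heartbeats

    def on_append_entries(self, leader_term: int) -> None:
        """RPC from a leader; step back to follower if its term is not older than ours."""
        if leader_term >= self.current_term:
            self.current_term = leader_term
            self.state = "follower"


winner = Node(node_id=1, cluster_size=5)
winner.on_election_timeout()
winner.on_vote_granted()
winner.on_vote_granted()  # 3 of 5 votes: majority
rival = Node(node_id=2, cluster_size=5)
rival.on_election_timeout()
rival.on_append_entries(leader_term=1)  # hears from the new leader
print(winner.state, winner.current_term, rival.state)  # leader 1 follower
```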

Page 36: Paxos: Asynchronous Consensus

Elections, cont’d

• Safety: allow at most one winner per term
  • Each server gives out only one vote per term (persist on disk)
  • Two different candidates can’t accumulate majorities in same term
• Liveness: some candidate must eventually win
  • Choose election timeouts randomly in [T, 2T]
  • One server usually times out and wins election before others wake up
  • Works well if T >> broadcast time
• Safety is guaranteed. Liveness is not.
  • Election may result in a split vote: no candidate gets a majority.

[Figure: servers; a majority voted for candidate A, so B can’t also get majority]
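
The two mechanisms on this slide, one vote per term and randomized timeouts, can be sketched as follows. This is a simplified illustration. The Voter class and the election_timeout helper are hypothetical names, persistence to disk is only noted in a comment, and the extra conditions that full Raft places on granting a vote (such as comparing logs) are left out.

```python
import random


def election_timeout(T: float, rng: random.Random) -> float:
    """Pick a timeout uniformly at random in [T, 2T]."""
    return rng.uniform(T, 2 * T)


class Voter:
    def __init__(self):
        self.current_term = 0
        self.voted_for = None  # a real server persists this to disk before replying

    def request_vote(self, term: int, candidate_id: str) -> bool:
        if term < self.current_term:
            return False  # stale candidate
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None  # new term: the vote is available again
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False  # already voted for someone else in this term


voters = [Voter() for _ in range(5)]
votes_a = sum(v.request_vote(1, "A") for v in voters[:3])  # A reaches 3 servers first
votes_b = sum(v.request_vote(1, "B") for v in voters)      # B asks everyone
print(votes_a, votes_b)  # 3 2

rng = random.Random(0)
timeouts = [election_timeout(0.15, rng) for _ in range(5)]
```

Because each voter grants at most one vote per term, A's three votes leave at most two for B, so B cannot also reach a majority in term 1.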
