Paxos: Asynchronous Consensus
• We cannot “solve” consensus in asynchronous systems.
• We cannot meet both safety and liveness requirements.
• Maybe it is ok to guarantee just one requirement.
• Option 1:
• Let’s set a super-conservative timeout for a terminating algorithm.
• Safety violated if a process (or the network) is very, very slow.
• Option 2:
• Let’s focus on guaranteeing safety under all possible scenarios.
• If the real situation is not too dire, hopefully the algorithm will terminate.
Consensus in asynchronous systems
No-failure consensus (Ricart-Agrawala)
1. Send proposal to all processes
2. Decide once everyone replies
Synchronous Consensus
1. Send proposal to everyone
2. Decide when everyone replies, or timeout
Problem: the timeout needs to be large enough to be certain that any non-reply means a crashed process
Asynchronous Consensus
1. Send proposal to everyone
2. Decide when a quorum (strict majority) replies
Note: liveness requires < N/2 failures
Paxos:
• avoid deadlock
• minimize livelock issues
Deadlock
• 4 processes
• P1, P2 agree to proposal X
• P3, P4 agree to proposal Y
No majority possible without changing decision!
Prioritization
Use proposal ID (say, process ID) to break ties. Highest ID wins.
• P1 proposes X with ID 1
• P1, P2 accept X
• P4 proposes Y with ID 4
• P3, P4 accept Y
• P1, P2 change their mind to Y since 4 > 1
No deadlock
Protocol so far
Processes propose values to others
Processes accept the first proposal they see
Majority accepts = consensus
Processes accept a second proposal if it has a higher ID (change their mind)
Proposer:
• send (value, ID) to all processes
• wait for responses
• majority OK => declare proposal accepted

Acceptor:
• initialize: accepted_ID = -1
• receive proposal (value, ID)
• if ID > accepted_ID: accepted_ID = ID, reply OK
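A minimal runnable sketch of this one-phase protocol (class and function names are mine; message passing is simulated with direct method calls rather than a network):

```python
# Sketch of the one-phase protocol above. Hypothetical names; messages
# are simulated as direct method calls instead of a real network.

class Acceptor:
    def __init__(self):
        self.accepted_id = -1
        self.accepted_value = None

    def receive(self, value, proposal_id):
        # Accept any proposal with a higher ID (acceptors may change
        # their mind; this breaks ties between competing proposals).
        if proposal_id > self.accepted_id:
            self.accepted_id = proposal_id
            self.accepted_value = value
            return True  # reply OK
        return False

def propose(acceptors, value, proposal_id):
    # Proposer: declare the value decided once a majority replies OK.
    oks = sum(a.receive(value, proposal_id) for a in acceptors)
    return oks > len(acceptors) // 2

acceptors = [Acceptor() for _ in range(4)]
print(propose(acceptors, "X", 1))  # True
print(propose(acceptors, "Y", 4))  # True: everyone changed their mind
```

Note that both proposals "win" a majority here, which is exactly the prioritization problem the next slide demonstrates.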
Prioritization problem
• P1 proposes X with ID 1
• P1, P2, P3 accept X
• P1 gets majority response, sets decision to X
• P4 proposes Y with ID 4
• P2, P3, P4 accept Y
• P4 gets majority response, sets decision to Y
Two-phase protocol
Phase 1: the proposer asks about previously accepted values
Phase 2: the proposer uses the accepted value with the largest ID (or its own value if none was accepted)
In both phases, wait for a majority of responses
Two-phase protocol
• P1 calls prepare with ID 1
• P1, P2, P3 reply with {} (no accepted values)
• P1 calls propose with ID 1, value X
• P1, P2, P3 accept (X, 1)
• P4 calls prepare with ID 4
• P2, P3 reply with {(X, 1)}
• P4 calls propose with ID 4 and value X
• P2, P3 accept (X, 4)
Two-phase protocol
• P4 calls prepare with ID 4
• P2, P3 reply with {}
• P4 calls propose with ID 4 and value Y
• P1 calls prepare with ID 1
• P2, P3 reply with {}
• P1 calls propose with ID 1, value X
• P1, P2, P3 accept (X, 1)
• P1 receives accepts, decides on X
• P2, P3 receive & accept (Y, 4)
• P4 receives accepts, decides on Y
Prepare Promise
Reply to prepare with ID n includes:
• the accepted value with the highest ID
• a promise not to accept any values with ID < n
Two-phase protocol
• P4 calls prepare with ID 4
• P2, P3 reply with {}, promise ID=4
• P4 calls propose with ID 4 and value Y
• P1 calls prepare with ID 1
• P2, P3 reply with nack (they promised not to accept IDs < 4)
• P1 cannot get a majority of promises, so it never proposes X
• P2, P3 receive & accept (Y, 4)
• P4 receives accepts, decides on Y
Two-phase protocol
• P4 calls prepare with ID 4
• P2, P3 reply with {}, promise ID=4
• P4 crashes
• P1 can no longer get a majority!
Solution: a process can pick a new ID (as long as it’s unique)
Full code

Proposer:
  initialize v to input
  pick unique proposal # n
  multicast( prepare(n) ) to acceptors
  if receive promise from a majority:
    if any promises include an accepted proposal:
      let (n’, v’) be the promise with the largest proposal #
      set v = v’
    multicast( propose(n, v) )
    wait for majority accept replies
  on timeout:
    set n to be higher than previous proposals, restart

Acceptor:
  initialize promised_id = None, accepted_id = None, accepted_v = None
  on receive( prepare(n) ):
    if promised_id = None or n > promised_id:
      set promised_id = n
      if accepted_id != None:
        send( promise(accepted_id, accepted_v) )
      else:
        send( promise(None) )
  on receive( propose(n, v) ):
    if n >= promised_id:
      accepted_id = promised_id = n
      accepted_v = v
      send( accept(n, v) )
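The pseudocode above can be turned into a runnable local sketch of one synchronous single-decree round (names are mine; real Paxos sends messages over a network and retries on timeout, both omitted here):

```python
# Runnable local sketch of the proposer/acceptor pseudocode above.

class Acceptor:
    def __init__(self):
        self.promised_id = None
        self.accepted_id = None
        self.accepted_v = None

    def prepare(self, n):
        # Phase 1: promise to ignore proposals numbered below n.
        if self.promised_id is None or n > self.promised_id:
            self.promised_id = n
            return ("promise", self.accepted_id, self.accepted_v)
        return ("nack", None, None)

    def propose(self, n, v):
        # Phase 2: accept unless a higher-numbered promise was made.
        if self.promised_id is None or n >= self.promised_id:
            self.accepted_id = self.promised_id = n
            self.accepted_v = v
            return True
        return False

def run_proposer(acceptors, n, v):
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    promises = [r for r in replies if r[0] == "promise"]
    if len(promises) < majority:
        return None  # would pick a higher n and restart
    # Adopt the accepted value with the largest proposal id, if any.
    accepted = [(pid, pv) for _, pid, pv in promises if pid is not None]
    if accepted:
        v = max(accepted)[1]
    accepts = sum(a.propose(n, v) for a in acceptors)
    return v if accepts >= majority else None

acceptors = [Acceptor() for _ in range(5)]
print(run_proposer(acceptors, 1, "X"))  # X
print(run_proposer(acceptors, 4, "Y"))  # X — the later proposer adopts X
```

The second call shows the two-phase fix in action: the ID-4 proposer learns of the accepted (X, 1) in phase 1 and proposes X instead of its own value Y.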
Safety
• Suppose a majority accepted proposal (n, v)
• Let n’ be the first proposal with n’ > n
  • Its proposer must have received majority promises
  • That majority must intersect with the majority that accepted (n, v)
  • The intersection must have accepted (n, v) before receiving prepare(n’)
  • Therefore it sent promise(n, v) in response to prepare(n’)
  • Therefore proposal n’ must have value v
• By induction, all proposals with n’ > n must have value v
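The counting fact this argument rests on is quorum intersection: any two majorities of N servers share at least one server, since |A| + |B| > N. A brute-force check for N = 5:

```python
# Verify that every pair of majorities of N = 5 servers intersects.
from itertools import combinations

N = 5
servers = range(N)
majorities = [set(c) for k in range(N // 2 + 1, N + 1)
              for c in combinations(servers, k)]
assert all(a & b for a in majorities for b in majorities)
print(f"all {len(majorities)} majorities of {N} pairwise intersect")
```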
Liveness
Livelock is possible
Optimization: use leader election to pick a distinguished proposer
Key: leader election only has to be probabilistically correct
Other Paxos features
• Separate learners from proposers
• Multi-Paxos: agree on a sequence of values
• Crash-recovery: an acceptor can come back up as long as it remembers its state (promises)
Log Consensus
• Paxos algorithm (discussed so far) is used for deciding on a single value.
• Many practical systems need to decide on a sequence of values (log).
• Replicated log => replicated state machine
  • All servers execute same commands in same order
• Consensus module ensures proper log replication
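The log-to-state-machine idea fits in a few lines: since each command is deterministic, replicas that replay the same log from the same start state reach the same final state (the add/mul command set here is made up for illustration; the slide’s add/jmp/mov/shl are machine instructions):

```python
# Each replica folds the same command log over the same start state;
# deterministic commands => identical final states.
from functools import reduce

def apply(state, command):
    # A deterministic toy state machine: state is an int.
    op, arg = command
    return state + arg if op == "add" else state * arg

log = [("add", 3), ("mul", 4), ("add", 1)]   # hypothetical command log
replicas = [reduce(apply, log, 0) for _ in range(3)]
print(replicas)  # [13, 13, 13]
```

The consensus module’s job is precisely to guarantee the precondition: every replica sees the same `log` in the same order.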
Replicated Log
[Figure: three servers, each with a Consensus Module, a State Machine, and a log of commands (add, jmp, mov, shl); clients submit a command (shl) to the servers]
“The dirty little secret of the NSDI* community is that at most five people really, truly understand every part of Paxos ;-).”– Anonymous NSDI reviewer
*The USENIX Symposium on Networked SystemsDesign and Implementation
Paxos is difficult to understand
“There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system…the final system will be based on an unproven protocol.”– Chubby authors
Paxos is difficult to implement
Raft: A Consensus Algorithm for Replicated Logs
Slides from Diego Ongaro and John Ousterhout, Stanford University
• Replicated log => replicated state machine
  • All servers execute same commands in same order
• Consensus module ensures proper log replication
• System makes progress as long as any majority of servers are up
• Failure model: fail-stop (not Byzantine), delayed/lost messages
Goal: Replicated Log
[Figure: three servers, each with a Consensus Module, a State Machine, and a log of commands (add, jmp, mov, shl); clients submit a command (shl) to the servers]
Goal: Design for understandability
• Main objective of Raft’s design:
  • Whenever possible, select the alternative that is the easiest to understand.
• Techniques that were used include:
  • Dividing problems into smaller problems.
  • Reducing the number of system states to consider.
Two general approaches to consensus:
• Symmetric, leader-less:
  • All servers have equal roles
  • Clients can contact any server
• Asymmetric, leader-based:
  • At any given time, one server is in charge; others accept its decisions
  • Clients communicate with the leader
• Raft uses a leader:
  • Decomposes the problem (normal operation, leader changes)
  • Simplifies normal operation (no conflicts)
  • More efficient than leader-less approaches
Approaches to Consensus
1. Leader election:
   • Select one of the servers to act as leader
   • Detect crashes, choose new leader
2. Normal operation (basic log replication)
3. Safety and consistency after leader changes
4. Neutralizing old leaders
Raft Overview
• At any given time, each server is either:
  • Leader: handles all client interactions, log replication
    • At most 1 viable leader at a time
  • Follower: completely passive: issues no RPCs (requests), responds to incoming RPCs
  • Candidate: used to elect a new leader
• Normal operation: 1 leader, N-1 followers
Server States
• Raft servers communicate via RPCs.
• What are RPCs?
  • Remote Procedure Calls: procedure calls between functions in different processes
  • A convenient programming abstraction.
Quick Detour: RPCs
P1 calls P2.call(“foo”, args, reply):
1. P1 sends “foo” and args to P2
2. P2 executes foo(args) and computes a reply
3. P2 sends the reply back to P1
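The pattern in the diagram can be shown with Python’s stdlib `xmlrpc` (the function `foo`, the increment it performs, and the port choice are all mine, used only for illustration):

```python
# RPC illustration: the client invokes foo() as if it were local,
# but it actually runs inside the server.
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def foo(x):
    return x + 1  # step 2 in the diagram: runs on the server side

# Bind to an OS-chosen free port.
server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
server.register_function(foo)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Steps 1 and 3: the proxy ships the name "foo" plus args to the
# server and returns the reply to the caller.
client = ServerProxy(f"http://localhost:{port}")
reply = client.foo(41)
print(reply)  # 42
server.shutdown()
```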
March 3, 2013
Server States
State transitions:
• start => Follower
• Follower: timeout => start election, become Candidate
• Candidate: receive votes from majority of servers => become Leader
• Candidate: timeout => new election
• Candidate: discover current leader or higher term => return to Follower
• Leader: discover server with higher term => “step down” to Follower
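The transitions above can be written as a small table-driven state machine (event names are mine):

```python
# Table-driven sketch of the Raft server-state diagram.
TRANSITIONS = {
    ("follower",  "timeout"):        "candidate",  # start election
    ("candidate", "majority_votes"): "leader",
    ("candidate", "timeout"):        "candidate",  # new election
    ("candidate", "current_leader"): "follower",   # valid leader found
    ("leader",    "higher_term"):    "follower",   # "step down"
}

def step(state, event):
    # Unknown (state, event) pairs leave the state unchanged.
    return TRANSITIONS.get((state, event), state)

s = "follower"
for event in ["timeout", "timeout", "majority_votes"]:
    s = step(s, event)
print(s)  # leader
```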
• Time divided into terms:
  • Election
  • Normal operation under a single leader
• At most 1 leader per term
• Some terms have no leader (failed election)
• Each server maintains current term value
• Key role of terms: identify obsolete information

Terms
[Figure: timeline of terms 1-5; each term starts with an election followed by normal operation; a split vote yields a term with no leader]
• Servers start up as followers
• Followers expect to receive RPCs from leaders or candidates
• Leaders must send heartbeats (empty AppendEntries RPCs) to maintain authority
• If electionTimeout elapses with no RPCs:
  • Follower assumes leader has crashed
  • Follower starts new election
  • Timeouts typically 100-500ms
Heartbeats and Timeouts
• On timeout:
  • Increment current term
  • Change to Candidate state
  • Vote for self
  • Send RequestVote RPCs to all other servers
1. Receive votes from majority of servers:
   • Become leader
   • Send AppendEntries heartbeats (RPCs) to all other servers
2. Receive RPC from valid leader:
   • Return to follower state
3. No one wins election (election timeout elapses):
   • Increment term, start new election
Election Basics
• Safety: allow at most one winner per term
  • Each server gives out only one vote per term (persisted on disk)
  • Two different candidates can’t accumulate majorities in the same term
• Liveness: some candidate must eventually win
  • Choose election timeouts randomly in [T, 2T]
  • One server usually times out and wins the election before others wake up
  • Works well if T >> broadcast time
• Safety is guaranteed. Liveness is not.
  • An election may result in a split vote – no candidate gets a majority.
Elections, cont’d
[Figure: a majority of servers voted for candidate A, so B can’t also get a majority]
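The randomized-timeout trick is tiny in code (T = 150 ms is an assumed value within the 100-500 ms range mentioned above):

```python
# Randomized election timeouts in [T, 2T]: one server usually times
# out well before the rest, starts an election alone, and wins it.
import random

T = 0.150  # seconds; assumed base timeout

def election_timeout():
    return random.uniform(T, 2 * T)

timeouts = sorted(election_timeout() for _ in range(5))
# The gap between the two earliest timeouts is usually larger than a
# LAN broadcast time, so the first candidate collects votes unopposed.
print(f"lead of first candidate: {(timeouts[1] - timeouts[0]) * 1000:.1f} ms")
```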