Paxos: Asynchronous Consensus
• We cannot “solve” consensus in asynchronous systems.
• We cannot meet both safety and liveness requirements.
• Maybe it is ok to guarantee just one requirement.
• Option 1:
• Let’s set a super-conservative timeout for a terminating algorithm.
• Safety violated if a process (or the network) is very, very slow.
• Option 2:
• Let’s focus on guaranteeing safety under all possible scenarios.
• If the real situation is not too dire, hopefully the algorithm will terminate.
Consensus in asynchronous systems
No-failure consensus (Ricart-Agrawala)
1. Send proposal to all processes
2. Decide once everyone replies
Synchronous Consensus
1. Send proposal to everyone
2. Decide when everyone replies, or timeout
Problem: the timeout needs to be large enough to be certain that any non-reply means a crashed process
Asynchronous Consensus
1. Send proposal to everyone
2. Decide when a quorum (strict majority) replies
Note: liveness requires < N/2 failures
Paxos:
• avoid deadlock
• minimize livelock issues
Deadlock
• 4 processes
• P1, P2 agree to proposal X
• P3, P4 agree to proposal Y
No majority possible without changing decision!
Prioritization
Use proposal ID (say, process ID) to break ties. Highest ID wins.
• P1 proposes X with ID 1
• P1, P2 accept X
• P4 proposes Y with ID 4
• P3, P4 accept Y
• P1, P2 change their mind to Y since 4 > 1
No deadlock
Protocol so far
Processes propose values to others
Processes accept the first proposal they see
Majority accepts = consensus
Processes accept a second proposal if it has a higher ID (change their mind)
Proposer:
• send (value, ID) to all processes
• wait for responses
• majority OK => declare proposal accepted

Acceptor:
• initialize: accepted_ID = -1
• receive proposal (value, ID)
• if ID > accepted_ID: accepted_ID = ID, reply OK
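A minimal runnable sketch of this one-phase protocol (class and function names are mine; message passing is simulated with direct method calls rather than a network):

```python
# Sketch of the one-phase protocol above. Hypothetical names; messages
# are simulated as direct method calls instead of a real network.

class Acceptor:
    def __init__(self):
        self.accepted_id = -1
        self.accepted_value = None

    def receive(self, value, proposal_id):
        # Accept any proposal with a higher ID (acceptors may change
        # their mind; this breaks ties between competing proposals).
        if proposal_id > self.accepted_id:
            self.accepted_id = proposal_id
            self.accepted_value = value
            return True  # reply OK
        return False

def propose(acceptors, value, proposal_id):
    # Proposer: declare the value decided once a majority replies OK.
    oks = sum(a.receive(value, proposal_id) for a in acceptors)
    return oks > len(acceptors) // 2

acceptors = [Acceptor() for _ in range(4)]
print(propose(acceptors, "X", 1))  # True
print(propose(acceptors, "Y", 4))  # True: everyone changed their mind
```

Note that both proposals "win" a majority here, which is exactly the prioritization problem the next slide demonstrates.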
Prioritization problem
• P1 proposes X with ID 1
• P1, P2, P3 accept X
• P1 gets majority response, sets decision to X
• P4 proposes Y with ID 4
• P2, P3, P4 accept Y
• P4 gets majority response, sets decision to Y
Two-phase protocol
Phase 1: the proposer asks about previously accepted values
Phase 2: the proposer uses the accepted value with the largest ID (or its own value if none was accepted)
In both phases, wait for a majority of responses
Two-phase protocol
• P1 calls prepare with ID 1
• P1, P2, P3 reply with {} (no accepted values)
• P1 calls propose with ID 1, value X
• P1, P2, P3 accept (X, 1)
• P4 calls prepare with ID 4
• P2, P3 reply with {(X, 1)}
• P4 calls propose with ID 4 and value X
• P2, P3 accept (X, 4)
Two-phase protocol
• P4 calls prepare with ID 4
• P2, P3 reply with {}
• P4 calls propose with ID 4 and value Y
• P1 calls prepare with ID 1
• P2, P3 reply with {}
• P1 calls propose with ID 1, value X
• P1, P2, P3 accept (X, 1)
• P1 receives accepts, decides on X
• P2, P3 receive & accept (Y, 4)
• P4 receives accepts, decides on Y
Prepare Promise
Reply to prepare with ID n includes:
• the accepted value with the highest ID
• a promise not to accept any values with ID < n
Two-phase protocol
• P4 calls prepare with ID 4
• P2, P3 reply with {}, promise ID=4
• P4 calls propose with ID 4 and value Y
• P1 calls prepare with ID 1
• P2, P3 reply with nack (they promised not to accept IDs < 4)
• P1 cannot get a majority of promises, so it never proposes X
• P2, P3 receive & accept (Y, 4)
• P4 receives accepts, decides on Y
Two-phase protocol
• P4 calls prepare with ID 4
• P2, P3 reply with {}, promise ID=4
• P4 crashes
• P1 can no longer get a majority!
Solution: a process can pick a new ID (as long as it’s unique)
Full code

Proposer:
  initialize v to input
  pick unique proposal # n
  multicast( prepare(n) ) to acceptors
  if receive promise from a majority:
    if any promises include an accepted proposal:
      let (n’, v’) be the promise with the largest proposal #
      set v = v’
    multicast( propose(n, v) )
    wait for majority accept replies
  on timeout:
    set n to be higher than previous proposals, restart

Acceptor:
  initialize promised_id = None, accepted_id = None, accepted_v = None
  on receive( prepare(n) ):
    if promised_id = None or n > promised_id:
      set promised_id = n
      if accepted_id != None:
        send( promise(accepted_id, accepted_v) )
      else:
        send( promise(None) )
  on receive( propose(n, v) ):
    if n >= promised_id:
      accepted_id = promised_id = n
      accepted_v = v
      send( accept(n, v) )
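The pseudocode above can be turned into a runnable local sketch of one synchronous single-decree round (names are mine; real Paxos sends messages over a network and retries on timeout, both omitted here):

```python
# Runnable local sketch of the proposer/acceptor pseudocode above.

class Acceptor:
    def __init__(self):
        self.promised_id = None
        self.accepted_id = None
        self.accepted_v = None

    def prepare(self, n):
        # Phase 1: promise to ignore proposals numbered below n.
        if self.promised_id is None or n > self.promised_id:
            self.promised_id = n
            return ("promise", self.accepted_id, self.accepted_v)
        return ("nack", None, None)

    def propose(self, n, v):
        # Phase 2: accept unless a higher-numbered promise was made.
        if self.promised_id is None or n >= self.promised_id:
            self.accepted_id = self.promised_id = n
            self.accepted_v = v
            return True
        return False

def run_proposer(acceptors, n, v):
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    promises = [r for r in replies if r[0] == "promise"]
    if len(promises) < majority:
        return None  # would pick a higher n and restart
    # Adopt the accepted value with the largest proposal id, if any.
    accepted = [(pid, pv) for _, pid, pv in promises if pid is not None]
    if accepted:
        v = max(accepted)[1]
    accepts = sum(a.propose(n, v) for a in acceptors)
    return v if accepts >= majority else None

acceptors = [Acceptor() for _ in range(5)]
print(run_proposer(acceptors, 1, "X"))  # X
print(run_proposer(acceptors, 4, "Y"))  # X — the later proposer adopts X
```

The second call shows the two-phase fix in action: the ID-4 proposer learns of the accepted (X, 1) in phase 1 and proposes X instead of its own value Y.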
Safety
• Suppose a majority accepted proposal (n, v)
• Let n’ be the first proposal with n’ > n
  • Its proposer must have received majority promises
  • That majority must intersect with the majority that accepted (n, v)
  • The intersection must have accepted (n, v) before receiving prepare(n’)
  • Therefore it sent promise(n, v) in response to prepare(n’)
  • Therefore proposal n’ must have value v
• By induction, all proposals with n’ > n must have value v
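The counting fact this argument rests on is quorum intersection: any two majorities of N servers share at least one server, since |A| + |B| > N. A brute-force check for N = 5:

```python
# Verify that every pair of majorities of N = 5 servers intersects.
from itertools import combinations

N = 5
servers = range(N)
majorities = [set(c) for k in range(N // 2 + 1, N + 1)
              for c in combinations(servers, k)]
assert all(a & b for a in majorities for b in majorities)
print(f"all {len(majorities)} majorities of {N} pairwise intersect")
```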
Liveness
Livelock is possible
Optimization: use leader election to pick a distinguished proposer
Key: leader election only has to be probabilistically correct
Other Paxos features
• Separate learners from proposers
• Multi-Paxos: agree on a sequence of values
• Crash-recovery: an acceptor can come back up as long as it remembers its state (promises)
Log Consensus
• Paxos algorithm (discussed so far) is used for deciding on a single value.
• Many practical systems need to decide on a sequence of values (log).
• Replicated log => replicated state machine
  • All servers execute same commands in same order
• Consensus module ensures proper log replication
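The log-to-state-machine idea fits in a few lines: since each command is deterministic, replicas that replay the same log from the same start state reach the same final state (the add/mul command set here is made up for illustration; the slide’s add/jmp/mov/shl are machine instructions):

```python
# Each replica folds the same command log over the same start state;
# deterministic commands => identical final states.
from functools import reduce

def apply(state, command):
    # A deterministic toy state machine: state is an int.
    op, arg = command
    return state + arg if op == "add" else state * arg

log = [("add", 3), ("mul", 4), ("add", 1)]   # hypothetical command log
replicas = [reduce(apply, log, 0) for _ in range(3)]
print(replicas)  # [13, 13, 13]
```

The consensus module’s job is precisely to guarantee the precondition: every replica sees the same `log` in the same order.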
Replicated Log
[Figure: three servers, each with a Consensus Module, a State Machine, and a log of commands (add, jmp, mov, shl); clients submit a command (shl) to the servers]
“The dirty little secret of the NSDI* community is that at most five people really, truly understand every part of Paxos ;-).”– Anonymous NSDI reviewer
*The USENIX Symposium on Networked SystemsDesign and Implementation
Paxos is difficult to understand
“There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system…the final system will be based on an unproven protocol.”– Chubby authors
Paxos is difficult to implement
Raft: A Consensus Algorithm for Replicated Logs
Slides from Diego Ongaro and John Ousterhout, Stanford University
• Replicated log => replicated state machine
  • All servers execute same commands in same order
• Consensus module ensures proper log replication
• System makes progress as long as any majority of servers are up
• Failure model: fail-stop (not Byzantine), delayed/lost messages
Goal: Replicated Log
[Figure: three servers, each with a Consensus Module, a State Machine, and a log of commands (add, jmp, mov, shl); clients submit a command (shl) to the servers]
Goal: Design for understandability
• Main objective of Raft’s design:
  • Whenever possible, select the alternative that is the easiest to understand.
• Techniques that were used include:
  • Dividing problems into smaller problems.
  • Reducing the number of system states to consider.
Two general approaches to consensus:
• Symmetric, leader-less:
  • All servers have equal roles
  • Clients can contact any server
• Asymmetric, leader-based:
  • At any given time, one server is in charge; others accept its decisions
  • Clients communicate with the leader
• Raft uses a leader:
  • Decomposes the problem (normal operation, leader changes)
  • Simplifies normal operation (no conflicts)
  • More efficient than leader-less approaches
Approaches to Consensus
1. Leader election:
   • Select one of the servers to act as leader
   • Detect crashes, choose new leader
2. Normal operation (basic log replication)
3. Safety and consistency after leader changes
4. Neutralizing old leaders
Raft Overview
• At any given time, each server is either:
  • Leader: handles all client interactions, log replication
    • At most 1 viable leader at a time
  • Follower: completely passive: issues no RPCs (requests), responds to incoming RPCs
  • Candidate: used to elect a new leader
• Normal operation: 1 leader, N-1 followers
Server States
• Raft servers communicate via RPCs.
• What are RPCs?
  • Remote Procedure Calls: procedure calls between functions in different processes
  • A convenient programming abstraction.
Quick Detour: RPCs
P1 calls P2.call(“foo”, args, reply):
1. P1 sends “foo” and args to P2
2. P2 executes foo(args) and computes a reply
3. P2 sends the reply back to P1
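The pattern in the diagram can be shown with Python’s stdlib `xmlrpc` (the function `foo`, the increment it performs, and the port choice are all mine, used only for illustration):

```python
# RPC illustration: the client invokes foo() as if it were local,
# but it actually runs inside the server.
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def foo(x):
    return x + 1  # step 2 in the diagram: runs on the server side

# Bind to an OS-chosen free port.
server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
server.register_function(foo)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Steps 1 and 3: the proxy ships the name "foo" plus args to the
# server and returns the reply to the caller.
client = ServerProxy(f"http://localhost:{port}")
reply = client.foo(41)
print(reply)  # 42
server.shutdown()
```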
March 3, 2013
Server States
State transitions:
• start => Follower
• Follower: timeout => start election, become Candidate
• Candidate: receive votes from majority of servers => become Leader
• Candidate: timeout => new election
• Candidate: discover current leader or higher term => return to Follower
• Leader: discover server with higher term => “step down” to Follower
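The transitions above can be written as a small table-driven state machine (event names are mine):

```python
# Table-driven sketch of the Raft server-state diagram.
TRANSITIONS = {
    ("follower",  "timeout"):        "candidate",  # start election
    ("candidate", "majority_votes"): "leader",
    ("candidate", "timeout"):        "candidate",  # new election
    ("candidate", "current_leader"): "follower",   # valid leader found
    ("leader",    "higher_term"):    "follower",   # "step down"
}

def step(state, event):
    # Unknown (state, event) pairs leave the state unchanged.
    return TRANSITIONS.get((state, event), state)

s = "follower"
for event in ["timeout", "timeout", "majority_votes"]:
    s = step(s, event)
print(s)  # leader
```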
• Time divided into terms:
  • Election
  • Normal operation under a single leader
• At most 1 leader per term
• Some terms have no leader (failed election)
• Each server maintains current term value
• Key role of terms: identify obsolete information

Terms
[Figure: timeline of terms 1-5; each term starts with an election followed by normal operation; a split vote yields a term with no leader]
• Servers start up as followers
• Followers expect to receive RPCs from leaders or candidates
• Leaders must send heartbeats (empty AppendEntries RPCs) to maintain authority
• If electionTimeout elapses with no RPCs:
  • Follower assumes leader has crashed
  • Follower starts new election
  • Timeouts typically 100-500ms
Heartbeats and Timeouts
• On timeout:
  • Increment current term
  • Change to Candidate state
  • Vote for self
  • Send RequestVote RPCs to all other servers
1. Receive votes from majority of servers:
   • Become leader
   • Send AppendEntries heartbeats (RPCs) to all other servers
2. Receive RPC from valid leader:
   • Return to follower state
3. No one wins election (election timeout elapses):
   • Increment term, start new election
Election Basics
• Safety: allow at most one winner per term
  • Each server gives out only one vote per term (persisted on disk)
  • Two different candidates can’t accumulate majorities in the same term
• Liveness: some candidate must eventually win
  • Choose election timeouts randomly in [T, 2T]
  • One server usually times out and wins the election before others wake up
  • Works well if T >> broadcast time
• Safety is guaranteed. Liveness is not.
  • An election may result in a split vote – no candidate gets a majority.
Elections, cont’d
[Figure: a majority of servers voted for candidate A, so B can’t also get a majority]
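The randomized-timeout trick is tiny in code (T = 150 ms is an assumed value within the 100-500 ms range mentioned above):

```python
# Randomized election timeouts in [T, 2T]: one server usually times
# out well before the rest, starts an election alone, and wins it.
import random

T = 0.150  # seconds; assumed base timeout

def election_timeout():
    return random.uniform(T, 2 * T)

timeouts = sorted(election_timeout() for _ in range(5))
# The gap between the two earliest timeouts is usually larger than a
# LAN broadcast time, so the first candidate collects votes unopposed.
print(f"lead of first candidate: {(timeouts[1] - timeouts[0]) * 1000:.1f} ms")
```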