Unreliable Failure Detectors for Reliable Distributed Systems


Page 1: Unreliable Failure Detectors for Reliable Distributed Systems

Unreliable Failure Detectors

for

Reliable Distributed Systems

Tushar Deepak Chandra

Sam Toueg

Presentation for EECS454

Lawrence Leinweber

Page 2: Unreliable Failure Detectors for Reliable Distributed Systems

Two-Army Problem

• Unreliable Channel

– Can’t Guarantee Correct Communication

– Last Message May be Lost

Page 3: Unreliable Failure Detectors for Reliable Distributed Systems

Byzantine Generals Problem (1)

• Unreliable Processors (Traitors)

– Report Incorrect Values (Troop Levels)

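[Diagram: four generals exchange troop levels. The loyal generals report 1, 3, and 4 consistently; the traitor reports 7, 2, and 1 to different generals.]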

Page 4: Unreliable Failure Detectors for Reliable Distributed Systems

Byzantine Generals Problem (2)

• Loyal Generals Need to Verify Reports

– Use Reports as Votes on Correct Values

– That’s About It with the Color Diagrams

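[Diagram: each general relays the vector of reports it collected. The loyal generals hold (1,2,3,4), (1,7,3,4), and (1,1,3,4); the traitor relays arbitrary vectors such as (4,6,6,8) and (1,1,1,1).]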
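
The voting step can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `majority_vector` is hypothetical, and the assignment of the diagram's vectors to one particular loyal general is an assumption made for the example.

```python
from collections import Counter

def majority_vector(vectors):
    """Return the per-position strict-majority value, or None if there is none."""
    result = []
    for column in zip(*vectors):
        value, count = Counter(column).most_common(1)[0]
        result.append(value if count > len(column) / 2 else None)
    return result

# Vectors relayed to one loyal general (illustrative values from the diagram;
# the last vector is assumed to come from the traitor).
received = [(1, 7, 3, 4), (1, 1, 3, 4), (4, 6, 6, 8)]
decided = majority_vector(received)
print(decided)
```

The loyal generals' values survive the vote, while the traitor's slot has no majority and is marked unknown (`None`).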

Page 5: Unreliable Failure Detectors for Reliable Distributed Systems

Distributed System

1. System of Processors

2. Connected In a Network

3. Running Independently

4. Solving Problems Together

Page 6: Unreliable Failure Detectors for Reliable Distributed Systems

Types of Failure

1. Unreliable Communication Channels

2. Processors Crash or Create Mischief

3. Synchronizing Processors

• Atomic Broadcast

4. Problems Agreeing On Results

• Consensus

Page 7: Unreliable Failure Detectors for Reliable Distributed Systems

Scope of This Solution

1. Processors Can Crash

• Crashed Processors Never Recover

• Processors are Not Malicious

2. Reliable Communication Channels

3. Asynchronous

• Synchronize After a Finite Number of Steps

4. At Least One Processor is Correct

• Every Down Processor is Detected By at Least One Up Processor

• At Least One Up Processor is Recognized as Up By All Up Processors

Page 8: Unreliable Failure Detectors for Reliable Distributed Systems

Failure Detectors

• Attached to Each Processor

• Determine the Crash State of Some Processors

– Processors Communicate Crash State Information

• Imperfect

– Suspect Processors Crashed

– Slow Processors Might Become “Unsuspected”

– Cause Host Processor to Abandon Other Processors
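
One common way to build such a detector is with heartbeats and timeouts. The slides do not prescribe an implementation, so the sketch below is an assumption throughout: the class name `HeartbeatDetector`, the timeout values, and the rule of doubling the timeout after a false suspicion are all illustrative. Time is passed in explicitly so the example is deterministic.

```python
class HeartbeatDetector:
    """Suspects a peer when no heartbeat arrived within that peer's timeout."""

    def __init__(self, peers, timeout=1.0):
        self.timeout = {q: timeout for q in peers}
        self.last_heard = {q: 0.0 for q in peers}
        self.suspected = set()

    def on_heartbeat(self, q, now):
        self.last_heard[q] = now
        if q in self.suspected:
            # False suspicion: q was only slow, so unsuspect it and be more patient.
            self.suspected.discard(q)
            self.timeout[q] *= 2

    def check(self, now):
        for q, heard in self.last_heard.items():
            if now - heard > self.timeout[q]:
                self.suspected.add(q)
        return set(self.suspected)

fd = HeartbeatDetector(["q", "r"], timeout=1.0)
fd.on_heartbeat("q", 0.5)
fd.on_heartbeat("r", 0.5)
first = fd.check(2.0)       # both peers have been silent for longer than 1.0
fd.on_heartbeat("q", 2.1)   # q was merely slow
second = fd.check(3.0)
```

Here `q` is suspected and later unsuspected, which is exactly the kind of mistake an unreliable detector is allowed to make.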

Page 9: Unreliable Failure Detectors for Reliable Distributed Systems

Completeness & Accuracy

• Completeness

– Down Processors are Abandoned

• Accuracy

– Up Processors are Not Abandoned

Page 10: Unreliable Failure Detectors for Reliable Distributed Systems

Function Definitions

• abandons(p, q, t)

– Processor p Abandons Processor q at Time t

• isDown(q, t)

– Processor q is Really Down at Time t

Page 11: Unreliable Failure Detectors for Reliable Distributed Systems

Completeness

• Strong Completeness

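– Every Down Processor is Abandoned by Every Up Processor Eventually

$\forall p, \forall q, \exists t_0, \forall t > t_0:\ \mathit{isDown}(q, t) \Rightarrow \mathit{abandons}(p, q, t)$

• Weak Completeness

– Every Down Processor is Abandoned by At Least One Up Processor Eventually

$\forall q, \exists p, \exists t_0, \forall t > t_0:\ \mathit{isDown}(q, t) \Rightarrow \mathit{abandons}(p, q, t)$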
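
The two completeness properties can be checked against a recorded run. The sketch below makes several assumptions: time is a finite range of integer steps (so "eventually" means "from some step before the end of the trace"), `history[p][t]` is the set of processors that p abandons at step t, `crash_time[q]` is the step at which q crashed or `None`, and p ranges over up processors as in the prose. All names are hypothetical.

```python
def is_down(crash_time, q, t):
    """isDown(q, t): q has crashed at or before time t."""
    return crash_time[q] is not None and t >= crash_time[q]

def eventually_abandons(history, crash_time, p, q, horizon):
    """Exists t0 such that for every later t: isDown(q, t) implies abandons(p, q, t)."""
    return any(
        all(not is_down(crash_time, q, t) or q in history[p][t]
            for t in range(t0 + 1, horizon))
        for t0 in range(horizon - 1)
    )

def completeness(history, crash_time, horizon, strong):
    up = [p for p, c in crash_time.items() if c is None]
    down = [q for q, c in crash_time.items() if c is not None]
    some_or_every = all if strong else any
    return all(
        some_or_every(eventually_abandons(history, crash_time, p, q, horizon) for p in up)
        for q in down
    )

crash_time = {"a": None, "b": None, "c": 1}      # c crashes at step 1
history = {
    "a": [set(), set(), {"c"}, {"c"}],          # a notices the crash
    "b": [set(), set(), set(), set()],          # b never does
}
weak = completeness(history, crash_time, 4, strong=False)
strong = completeness(history, crash_time, 4, strong=True)
```

In this trace only `a` abandons the crashed processor, so weak completeness holds and strong completeness does not.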

Page 12: Unreliable Failure Detectors for Reliable Distributed Systems

Accuracy

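• Strong Accuracy (Perpetual/Eventual)

– No Up Processor is Abandoned by Any Processor Ever/Eventually

– Perpetual: $\forall p, \forall q, \forall t:\ \neg\mathit{isDown}(q, t) \Rightarrow \neg\mathit{abandons}(p, q, t)$

– Eventual: $\forall p, \forall q, \exists t_0, \forall t > t_0:\ \neg\mathit{isDown}(q, t) \Rightarrow \neg\mathit{abandons}(p, q, t)$

• Weak Accuracy (Perpetual/Eventual)

– At Least One Up Processor is Not Abandoned by Any Processor Ever/Eventually

– Perpetual: $\exists q, \forall p, \forall t:\ \neg\mathit{isDown}(q, t) \Rightarrow \neg\mathit{abandons}(p, q, t)$

– Eventual: $\exists q, \forall p, \exists t_0, \forall t > t_0:\ \neg\mathit{isDown}(q, t) \Rightarrow \neg\mathit{abandons}(p, q, t)$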
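
The four accuracy properties can be checked on a finite trace in the same spirit. This is a sketch under assumptions: `history[p][t]` is the set of processors that p abandons at step t, `crash_time[q]` is q's crash step or `None`, "eventually" means "from some step after the first", and for weak accuracy the witness q is restricted to processors that never crash, following the prose. The function names are hypothetical.

```python
def is_down(crash_time, q, t):
    return crash_time[q] is not None and t >= crash_time[q]

def unabandoned_from(history, crash_time, q, start, horizon):
    """For all p and all t >= start: not isDown(q, t) implies not abandons(p, q, t)."""
    return all(
        is_down(crash_time, q, t) or q not in history[p][t]
        for p in history
        for t in range(start, horizon)
    )

def accuracy(history, crash_time, horizon, strong, eventual):
    # Perpetual properties must hold from step 0; eventual ones from some later step.
    starts = range(1, horizon) if eventual else [0]
    if strong:
        targets, some_or_every = list(crash_time), all
    else:
        targets = [q for q, c in crash_time.items() if c is None]
        some_or_every = any
    return any(
        some_or_every(unabandoned_from(history, crash_time, q, s, horizon) for q in targets)
        for s in starts
    )

crash_time = {"a": None, "b": None, "c": None}   # nobody crashes
history = {
    "a": [{"b"}, set(), set(), set()],           # a wrongly abandons b at step 0
    "b": [set(), set(), set(), set()],
    "c": [{"a"}, {"a"}, set(), set()],           # c wrongly abandons a at steps 0 and 1
}
strong_perpetual = accuracy(history, crash_time, 4, strong=True, eventual=False)
weak_perpetual = accuracy(history, crash_time, 4, strong=False, eventual=False)
strong_eventual = accuracy(history, crash_time, 4, strong=True, eventual=True)
```

The early mistakes break perpetual strong accuracy, but `c` is never abandoned (weak accuracy holds) and all mistakes stop by step 2 (eventual strong accuracy holds).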

Page 13: Unreliable Failure Detectors for Reliable Distributed Systems

Classes of Failure Detectors

Completeness | Strong Perpetual Accuracy | Weak Perpetual Accuracy | Strong Eventual Accuracy | Weak Eventual Accuracy
Strong | P | S | ◇P | ◇S
Weak | Q | W | ◇Q | ◇W

• 8 Combinations of Completeness and Accuracy

Page 14: Unreliable Failure Detectors for Reliable Distributed Systems

Reducibility (Emulation)

• Some Classes are More Powerful Than Others

– Strong Complete Can Emulate Weak Complete

• Some Classes Can Emulate Others Using an Algorithm:

– Up Processors Share Lists of Abandoned Processors, Exclude Themselves

– Abandoned by One Becomes Abandoned by All

– Weak Complete Can Emulate Strong Complete
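
The sharing algorithm can be sketched as a single exchange among the up processors. This is a simplified, synchronous illustration with hypothetical names (`exchange`, `local_suspects`, `output`); in a real asynchronous system the messages would arrive at arbitrary times and the exchange would repeat forever.

```python
def exchange(local_suspects, output, up):
    """Every up processor broadcasts the suspects of its own (weakly complete)
    detector; each receiver merges the list and clears the sender, since a
    processor that just sent a message is evidently up."""
    for sender in up:
        for receiver in up:
            output[receiver] = (output[receiver] | local_suspects[sender]) - {sender}
    return output

up = ["a", "b"]                               # "c" has crashed
local_suspects = {"a": {"c"}, "b": set()}     # only a's detector noticed
output = exchange(local_suspects, {p: set() for p in up}, up)
```

After the exchange both `a` and `b` abandon `c`, which is how a weakly complete detector is turned into a strongly complete one.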

Page 15: Unreliable Failure Detectors for Reliable Distributed Systems

Completeness Classes Are Equivalent

Completeness | Strong Perpetual Accuracy | Weak Perpetual Accuracy | Strong Eventual Accuracy | Weak Eventual Accuracy
Strong | P | S | ◇P | ◇S
Weak | Q | W | ◇Q | ◇W

• 4 Distinct Accuracy Classes

Page 16: Unreliable Failure Detectors for Reliable Distributed Systems

Relationship of Accuracy Classes

• Perpetual is More Powerful Than Eventual

– Perpetual: $\forall t$

– Eventual: $\exists t_0, \forall t > t_0$

• Strong is More Powerful Than Weak

– Strong: $\forall q$

– Weak: $\exists q$

Page 17: Unreliable Failure Detectors for Reliable Distributed Systems

Relationship of Failure Detector Classes

Completeness | Strong Perpetual Accuracy | Weak Perpetual Accuracy | Strong Eventual Accuracy | Weak Eventual Accuracy
Strong | P | S | ◇P | ◇S
Weak | Q | W | ◇Q | ◇W

• P is Most Powerful; ◇S is Least Powerful

Page 18: Unreliable Failure Detectors for Reliable Distributed Systems

The Consensus Problem

• Processors Reach Agreement on a Value

– Termination: All Up Processors Eventually Decide

– Agreement: All Agree to Same Value

– Integrity: Decision is Final

– Validity: A Proposed Value is Chosen

• If They Can Agree on One Thing, They Can Agree on Anything

• Algorithms for S and ◇S Detectors

– At Least One Up Processor Using S Detectors

– A Majority of Up Processors Using ◇S Detectors

Page 19: Unreliable Failure Detectors for Reliable Distributed Systems

Algorithm for S Detectors

• S Detectors: At Least One Up Processor is Not Abandoned by Any Up Processor Ever

1. Collect Proposed Values from Each Processor

– or the News That the Process Crashed

2. Collect Other Processors’ Knowledge of Proposed Values

– Discard Values not Known to All

3. Pick (Consistently) a Value from Known Values

• All Processors Get Phase 1 & 2 Information from the Processor That is Never Abandoned
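
Phases 2 and 3 can be sketched as follows, assuming phase 1 has already produced one vector per processor, with slot i holding processor i's proposal or `None` if it is unknown. The function name `decide` and the rule "pick the first known slot" are illustrative assumptions; any rule works as long as every processor applies the same one.

```python
def decide(vectors):
    """vectors: one tuple per processor heard from in phase 2."""
    # Phase 2: keep a slot only if every received vector knows its value.
    merged = [
        column[0] if all(x is not None for x in column) else None
        for column in zip(*vectors)
    ]
    # Phase 3: every processor applies the same rule, here "first known slot".
    return next(x for x in merged if x is not None)

vectors = [
    ("x", "y", None, "w"),
    ("x", None, "z", "w"),
    ("x", "y", "z", "w"),
]
decision = decide(vectors)
```

With an S detector at least one slot always survives phase 2, namely that of the processor that is never abandoned, so the final `next(...)` finds a value.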

Page 20: Unreliable Failure Detectors for Reliable Distributed Systems

Algorithm for ◇S Detectors

• Rotating Coordinator

– Each Processor Takes Its Turn

– Tries to Make Decision

– If the Processor is Up and is Not Abandoned by Any Up Processor, the Decision is Made

Page 21: Unreliable Failure Detectors for Reliable Distributed Systems

Each Round of ◇S Algorithm

• At Least One Up Processor is Not Abandoned by Any Up Processor Eventually

1. All Processors Send Value and the Round Number to Coordinator

2. Coordinator Waits for a Majority and Sends the Value with the Latest Round Number to All Processors

3. Each Processor Indicates If It Abandoned Coordinator

4. Coordinator Waits for a Majority of Replies; If No Processor Abandoned the Coordinator, the Value is Decided

• Repeat Until Coordinator is Not Abandoned Eventually
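
The coordinator's side of one round can be sketched as below. This is a simplified illustration with hypothetical names: `estimates` holds the (value, round stamp) pairs received in step 1, `replies` holds the answers collected in step 4 (`True` meaning the sender did not abandon the coordinator), and the coordinator of a round is assumed to be chosen by round number modulo the number of processors.

```python
def coordinator_for(round_number, processors):
    """Rotating coordinator: each processor takes its turn."""
    return processors[round_number % len(processors)]

def run_round(estimates, replies, n):
    """Return the decided value, or None if this round decides nothing."""
    majority = n // 2 + 1
    if len(estimates) < majority or len(replies) < majority:
        return None  # still waiting for a majority
    # Step 2: adopt the estimate carrying the latest round stamp.
    value, _ = max(estimates, key=lambda e: e[1])
    # Step 4: decide only if nobody who replied abandoned the coordinator.
    return value if all(replies) else None

processors = ["p0", "p1", "p2", "p3", "p4"]
leader = coordinator_for(7, processors)
estimates = [("a", 0), ("b", 2), ("c", 1)]
decided = run_round(estimates, [True, True, True], n=5)
blocked = run_round(estimates, [True, False, True], n=5)
```

When a reply reports that the coordinator was abandoned, the round decides nothing and the next coordinator takes its turn.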

Page 22: Unreliable Failure Detectors for Reliable Distributed Systems

Atomic Broadcast

• All Processors Receive the Same Messages in the Same Order

• Atomic Broadcast is Equivalent to Consensus

– Each Can Be Reduced to the Other

– Solution to Consensus Applies to Atomic Broadcast

Page 23: Unreliable Failure Detectors for Reliable Distributed Systems

Atomic Broadcast Reduces to Consensus

• Atomic Broadcast Can Be Implemented Using a Consensus Algorithm

– Each Processor Proposes a Message

– Consensus is Used to Decide Which Message is Recognized as the Next Atomically Broadcast Message
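
The reduction can be sketched as a loop of consensus instances, one per delivered message. This is a centralized simulation under assumptions: `consensus` is a stand-in that simply returns the smallest proposal (a real instance would run one of the detector-based algorithms), and `pending` maps each processor to the messages it wants delivered. Both names are hypothetical.

```python
def consensus(proposals):
    """Stand-in for a real consensus instance: every caller gets the same
    value, chosen from the proposals (here simply the smallest)."""
    return min(proposals)

def atomic_broadcast(pending):
    """Return the single delivery order that every processor observes."""
    order = []
    undelivered = set().union(*pending.values())
    while undelivered:
        # Each processor with something left proposes one of its messages.
        proposals = [
            min(msgs & undelivered)
            for msgs in pending.values()
            if msgs & undelivered
        ]
        chosen = consensus(proposals)   # agreed next message
        order.append(chosen)
        undelivered.discard(chosen)
    return order

order = atomic_broadcast({"p": {"m2"}, "q": {"m1", "m3"}})
```

Because every processor learns the same consensus result in each iteration, all of them deliver the messages in the same order.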

Page 24: Unreliable Failure Detectors for Reliable Distributed Systems

Consensus Reduces to Atomic Broadcast

• Consensus Can Be Implemented Using an Atomic Broadcast Algorithm

– To Propose a Value, a Process Atomically Broadcasts It

– Every Process Decides on the First Value Delivered

– Go to Lunch Early

Page 25: Unreliable Failure Detectors for Reliable Distributed Systems

Summary

• Reliable Distributed Systems

• Unreliable Failure Detectors

• Relationship of Detector Classes

• Algorithms for Consensus

• Equivalence with Atomic Broadcast