Failure Detection in Distributed Systems · PRESTO, Japan Science & Tech . Agency (JST) PDCA TÕ05,...

Xavier DÉFAGO

School of Information Science,Japan Adv. Inst. of Science & Tech. (JAIST)

PRESTO, Japan Science & Tech. Agency (JST)

PDCAT’05, Dalian, China, December 5, 2005

Failure Detection in Distributed Systems:

Retrospective and recent advances


Failure Detection

• Context• Two computer programs

• Exchange messages

2

B

<script

var a=

var xl

if(xls

A

<script

var a=

var xl

if(xls


Failure Detection

• Context• A crashes

• B must detect crash

3

B

<script

var a=

var xl

if(xls

A

<script

var a=

var xl

if(xls

BOOM

A BOOM


Simple Approach

• Solution

• Use time & timeouts

• Problem

• How to set timeout?

• What implications,

• when too long.

• when too short.

4


Simple Approach

• Problems

• Depends on application needs.

• Depends on system behavior.

• Needs global time?

• Outcome

• Long failover / reconfiguration time

• Unstable applications

• Many side-effects: difficult to maintain

5 PDCAT’05, Dalian, China, December 5, 2005

Question

6

Can we do better?How?


Answer

7

• Can we do better?

• YES!

• How?

• Depends on context

PDCAT’05, Dalian, China, December 5, 2005 8

Illustration: Case 1

• Simple illustration• Set of tasks

CPU farm

Job dispatcher



• Simple illustration• Set of tasks

• Dispatch tasks; wait for results

CPU farm

Job dispatcher

BOOM



• Dispatcher• Failover

• Must keep consistency

Backup dispatcher

10

CPU farm

Job dispatcher BOOM


Outline

• I: Theory

• Agreement problems, Unreliable failure detectors

• II: QoS

• QoS metrics, Comparison of FDs

• III: Implementation

• Basic FDs, Adaptive FDs

• IV: Accrual FDs

• Novel concept

• V: Conclusion


Part ITheory


Context: Case 2

• Dispatcher• Failover

• Must keep consistency

• Requires agreement between dispatchers

Backup dispatcher

13

CPU farm

Job dispatcher BOOM


Outline: Part I

• System Model

• Consensus & Impossibility

• Unreliable FDs

• Solving Consensus w/FD

14


Outline: Part I

• System Model


• Unreliable FDs


15 PDCAT’05, Dalian, China, December 5, 2005 16

• Processes

• represents running program (or state machine)

• Communication

• message driven

• fully connected

System Model

C = {cij |cij : channel from pi to pj}

p1

p2

p3

p4

p5

p6

P = {p1, p2, . . . , pn}


Failure Models

• Crash failures

• Failed process stops executing any event.

• Omission failures

• Failed process omits executing some events.

• Arbitrary failures (Byzantine)

• Failed process can do anything.


Failure Models

• Crash failures

• Failed process stops executing any event.

• Omission failures

• Failed process omits executing some events.

• Arbitrary failures (Byzantine)

• Failed process can do anything.

18


• Synchronous

• Bound on comm. delays

• Bound on process speed

• Asynchronous

• No bounds

• Semi-synchronous (e.g.)

• GST: global stabilization time

• Unknown bounds after GST

Synchrony

p1

p2

p3

p4

p5

p6


Outline: Part I

• System Model


• Unreliable FDs


20


• Problem• All (correct) processes decide

• Decision value is same for all processes

• Decision value is one of proposed values

Consensus

Consensus

Application

propose decide

p1

p3

p2 Consensus

propose(vi) decide(v)

v = v2v1

v2

v3


• Integrity

• Every process decides at most once.

• Validity

• If a process decides v, then v was proposed by some process.

• Agreement

• Two correct processes do not decide differently

• Termination

• Every correct process eventually decides.

Consensus (specification)



• Agreement• Two correct processes do not decide differently

p1

p3

p2 Consensus


v1

v2

v3



• Validity• If a process decides v, then v was proposed by some

process.

p1

p3

p2 Consensus


v1

v2

v3



• Termination• Every correct process eventually decides.

p1

p3

p2 Consensus


v1

v2

v3



• Integrity• Every process decides at most once.

p1

p3

p2 Consensus


v1

v2

v3


Impossibility

• Model

• Asynchronous

• Some process may crash

• Result

• Consensus has no deterministic solution

• Reason

• Impossibility to distinguish crashed vs. slow process

• NB

• Cannot guarantee Termination in all cases.

• Probabilistic Termination possible (w/Prob.=1)

27

from [FLP85]


Related Agreement Problems

• Total Order Broadcast

• Deliver messages in same sequence

• See [DSU04]

• Leader Election

• Elect one correct leader

• Group Membership

• Agree on group composition

• See [CKV01]

• Atomic Commit

• Decide on issue of transaction: Commit / Abort

28

Impossible

Impossible

Impossible

Impossible


Outline: Part I

• System Model


• Unreliable FDs



• FD module

• Attached to each process

• Queried locally

• Outputs list of suspected processes

• Failure Detector

• Distributed entity

• Federation of FD modules

• Global properties

Failure Detectors (concept)

Failure Detector

Oracle

p1 p2

p3 p4

FD1 FD2

FD3 FD4

query

FD modulehost

from [CT96]


• FD Unreliable

• Can make mistakes; can change its mind

• Possible Mistakes

• suspect p and p has not crashed (wrong suspicion)

• trust p and p has crashed

• Properties

• Restrict allowed mistakes

• Completeness; Accuracy

Unreliable Failure Detectors


• Strong Completeness• Eventually every process that crashes is permanently

suspected by all processes.

Completeness

p1

p3

p2

p4

BOOM p1!Dp2

p1!Dp3

p1!Dp4

suspect

suspect

suspect


• Strong Accuracy• No process is suspected before it crashes.

Accuracy

Attention: counter!exampl"

p1

p3

p2

p4

BOOM

p1!Dp3

suspect

suspect

suspect

p4!Dp2

p2!Dp4

BOOM


• Weak Accuracy• Some correct process is never suspected

Accuracy

p1

p3

p2

p4

BOOM

p1!Dp3

suspect

suspect

suspect

p4!Dp2

p2!Dp4

BOOM


• Eventual Strong Accuracy• There is a time after which correct processes are not

suspected by any correct process.

• Eventual Weak Accuracy• There is a time after which some correct process is never

suspected by any correct process

Eventual Accuracy

p1

p3

p2Chaotic Period

no guarantee

Good Period

accuracy satisfied


Failure Detector Classes

• Remark• Other classes of failure detectors exist.

Accuracy Perpetual Eventual

Strong

Weak

P !P

!SS

(Perfect) (Eventually Perfect)

(Strong)(Eventually Strong)


Outline: Part I

• System Model


• Unreliable FDs



• System Model

• Asynchronous System

• Crash permanent failures

• Quasi-Reliable channels

• Failure detector

• Assumptions

• At least majority of processes are correct

Solving Consensus with !Sfrom [CT96]

FD ! !S

t =

!

n ! 1

2

"

where t is max. faulty processes


• Requirements• Reliable Broadcast, failure detector

• Concept• Use rotating coordinator

• Asynchronous rounds, 4 phases

Solving Consensus with !S

! !S

p1

p3

p2 Consensus


v = v2v1

v2

v3


• Selection of coordinator

• One coordinator/round

• Suspicion

• Suspect coordinator => reject round => go to next one.

• Estimate

• Processes keep estimate (of decision value)

• Initialized with initial value

• Modified in round i by coordinator of i

• Timestamp

• Estimate modified in what round

Solving Consensus with !S

round i : c = p(1+i mod n)


Solving Consensus with !S(crash-free case)

41

p1

p2

p3

p4

p5

v5

v4

v3

v2

v1

Phase 1

estimate



42

p1

p2

p3

p4

p5

v5

v4

v3

v2

v1

Phase 1

estimate

Phase 2

propose

v := v1



43

p1

p2

p3

p4

p5

v5

v4

v3

v2

v1

Phase 1

estimate

Phase 2

propose

v := v1

Phase 3

ack

v1

v1

v1

v1

v1



44

p1

p2

p3

p4

p5

v5

v4

v3

v2

v1

Phase 1

estimate

Phase 2

propose

v := v1

Phase 3

ack

v1

v1

v1

v1

v1

Phase 4

decide

v1

v1

v1

v1

v1


Solving Consensus with !S(one-crash case)

Phase 1

Phase 2

p1

p2

p3

p4

p5

estimate

Round 1

v5

v4

v3

v2

v1

v := v1



Phase 1

Phase 2

p1

p2

p3

p4

p5

estimate

Round 1

v5

v4

v3

v2

v1

v := v1 BOOM

46



Phase 1 Phase 3

Phase 2

p1

p2

p3

p4

p5

propose

estimate

Round 1

v5

v4

v3

v2

v1

v := v1 BOOM



Phase 1 Phase 3

Phase 2

p1

p2

p3

p4

p5

propose

estimate

Round 1

v5

v4

v3

v2

v1

v := v1 BOOM

48

Suspect p1

Suspect p1ack

v1

v1

Nack



Phase 1 Phase 3

Phase 2

p1

p2

p3

p4

p5

propose

estimate

Round 1

v5

v4

v3

v2

v1

v := v1 BOOM

49

ack

v1

v1

Nack

Round 2

v2

v3

Phase 1

estimate



Phase 1 Phase 3

Phase 2

p1

p2

p3

p4

p5

propose

estimate

Round 1

v5

v4

v3

v2

v1

v := v1 BOOM

50

ack

v1

v1

Nack

Round 2

v2

v3

Phase 1

estimate

Phase 2

v := v1

propose

Phase 4

v1

v1

v1

v1

ack

decide

Phase 3


Event. Accuracy in Practice

• Observations• Bad period: stagnation (progress possible)

• Good period: progress guaranteed

• Termination => finished

51

Termination

Stagnation

ProgressStagnation

Progress

p1

p3

p2

Chao

tic P

erio

d

Good P

erio

d

Agreement


Event. Accuracy in Practice

• In practice• Good period “often enough” => Termination

• FD: must ensure accuracy “often enough”

52

Termination

Stagnation

ProgressStagnation

Progress

p1

p3

p2

Chao

tic P

erio

d

Good P

erio

d

Agreement


Weakest Failure Detectorfor Consensus

• Properties: !W

• Weak Completeness:

Crash detected by some correct process.

• Eventual Weak Accuracy:

Eventually, some correct process never suspected.

• Equivalence

• ♢W ! ♢S

from [CHT96]


": Eventual Leadership

• Definition

• Eventually, one correct process becomes leader for all.

• Application

• Solve Consensus: e.g., PAXOS algorithm of Lamport.

• Equivalence

• ♢W ! !

54

from [Lam98]


Part IIQuality of Service


Outline: Part II

• Quality of Service Metrics

• Comparison

56


QoS of Failure Detectors

• Latency Metricwhen p faulty:

• Detection time

• “How long to detect?” The shorter the better.

57

trust

suspect

up

down

BOOM

detection time

monitored

process

FD

output


QoS of Failure Detectors

• Accuracy Metricswhen p correct:

• Average mistake rate

• Query accuracy prob.

• Good period duration

trust

suspect

up

mistakes!

FD

output

monitored

process

58


QoS Tradeoff

• Aggressive FD• Short latency; low accuracy

• Conservative FD• Long latency; high accuracy

detection time

(in

)acc

ura

cy

59

!{d,a}!{d',a'}

!{d",a"}

Aggressive

Conservative


detection time

(in

)acc

ura

cy

Requirements vs. Guarantees

• FD QoS• !{d,a} : effect. detection time, effect. mistakes

• Application requirements• "{D,A} : max. detect. time, max. mistakes

60

!{D,A}

!{d,a}!{d',a'}

!{d",a"}


Parametric Failure Detector

• Parametric FD protocol• Parameter value defines FD best QoS

• Tradeoff: accuracy <-> detection latency

(in

)acc

ura

cy

detection time

61

!{d,a}

!{d',a'}


QoS Coverage

• Coverage of FD• FD could be tuned to support app. req.

• Measure of FD

detection time

(in

)acc

ura

cy

62


detection time

(in

)acc

ura

cy

Dynamic QoS Coverage

• Approximate coverage• Instantiate several QoS sets

• Find minimal set; minimal change


Comparing Parametric FDs

• Simple case• 2 FDs

• A includes B

• So, A better than B

detection time

(in

)acc

ura

cy

64

FDAFDB


Comparing Parametric FDs

• Complex case• 3 FDs: aggressive, intermediate, conservative

• Which one is better?

detection time

(in

)acc

ura

cy

65

FDA

FDBFDC


Part IIIImplementations


Outline: Part III

• Basic FDs

• Adaptive FDs

• Experimental Results


Interrogation

• Advantage• Simplest

• Drawback• Longest detection time

68

!i query interval!to timeout

p

q!i

!to

BOOM

!to

“alive?”

“YES!”


Heartbeat

• Advantage• Good with broadcast medium

• Drawback• q takes active role

p

q

!i !i !i !i

BOOM

!to !to

!i heartbeat interval

!to timeout


Parameters & QoS

• Detection Time

• Heartbeat interval

• Timeout

• Transmission delays

• Accuracy

• Timeout

• Variations in transmission delays

70


Heartbeat Interval

• Observation

• Small influence on accuracy

• Limited by network admin.

• Detection time " transmission + interval

• Rule-of-thumb

• Same order as average transmission delay


Timeout

• Observation

• Influence on detection time

• Influence on accuracy

• Too short = instability

• Problem

• Depends on system behavior

• System behavior changes

• => need adaptive timeout

72


• Concept

• Periodic heartbeat

• Timeout => suspicion

• Timeout adjusted dynamically

• Parameters

• Period

• Safety margin

Adaptive Approach

Heartbeat

!to!to !to

!i !i !i

FDp

FDq

p suspects q

BOOM

p1

To1 estim.

FDq

QoS

heartbeat

sampling


Chen’s Failure Detector

• Context

• Heartbeat failure detector

• Adaptive

• Several variants

• Adaptation

• Freshness points

• Parameters

• Known heartbeat interval

• Safety margin based on QoS

74

from [CTA02]


Chen’FD: Freshness Points

• Freshness points• Samples heartbeat arrivals

• Compute normalized distribution

• Estimates future arrivals

• Adds safety margin !

75

p

q

!i !i !i

EAi EAi+1!i+1

!i+2!i

!

!i : heartbeat interval

HB i : ith heartbeat msgEAi : expected arrival for HB i

! : safety margin"i : freshness point for HB i

Samplearrivals


Chen’s FD: Mechanism

• Suspicion when...

• Freshness point i past

• No heartbeat k#i received

• QoS tuning

• Adjust safety margin based on QoS requirements

• Other

• variants based on other assumptions (e.g., synchronized clocks, etc.)

76


Bertier’s Failure Detector

• Principle

• Based on Chen’s failure detector

• Adaptive, but not tunable

• Safety margin

• Safety margin adjusted dynamically

• Use Van Jacobson’s estimation

• Very aggressive failure detector

77

from [BMS02]


PHI-FD

• PHI-FD• Estimates arrival distribution probability

• Increase level based on probability

• Sets threshold for suspicions (based on QoS)

78

App. 3

do action(!)network

Plater (t)last arrival !

Failure Detector

App. 1

! > !1 ! suspect

App. 2

! > !2 ! suspect

heartbeat arrivals

sampling window

estimation

Plater (t)

t

tnow

! log10 Plater (tnow ! Tlast)Tlast

from [HDYK04]


Experimentation: LAN

• LAN

• single FastEther hub

• Parameters

• HB interval: 20$ms

• Duration: 5% hour

• Total HB: 1’000’000

• no loss

1e-05

0.0001

0.001

0.01

0.1

1

0 0.2 0.4 0.6 0.8 1 1.2 1.4

Mis

take r

ate

[1/s

]

Detection time [s]

! FD

Chen FD

Bertier FD ! FDChen FD

Bertier FD

Bertier FD

Chen FD

! accrual FD


Experimentation: WAN

• WAN

• JAIST (JP) – EPFL (CH)

• Parameters

• HB interval: 100$ms

• Duration: 1 week

• Total HB: ~ 6’000’000

0.001

0.01

0.1

0.4

0 0.5 1 1.5 2 2.5

Mis

take r

ate

[1/s

]

Detection time [s]

Chen FD

!-FD

Bertier FD

! FDChen FD

Bertier FDBertier FD

Chen FD

! accrual FD

80


Experimentation: WAN

81

Apr 3

0:00 0:00 0:00 0:00 0:00 0:00 0:00

Apr 4 Apr 5 Apr 6 Apr 7 Apr 8 Apr 9

time (UTC)

0

50

100

150

200

250

300

350

400

450

500

0 5 10 15 20 25

Occ

ure

nce

[#

burs

ts]

Burst length [# lost messages]

2004

W32/Netsky.T@mm


Part IVAccrual FDs


Outline: Part IV

• Motivation

• Definition (accrual FDs)

• Equivalence: "P

• Relation w/QoS


Motivation

• Objective (long term)

• Offer failure detection as generic service

• E.g., NTP for clock synchronization

• Open issues

• interaction style

• notification

• self-configuration, deployment

84

Accomodate various usage patterns and

QoS requirements simultaneously


Different Patterns

• Dispatch. – Worker

• Action: release resources

• Needs stability

• => conservative FD

• Dispatch. replica

• Action: failover

• Needs quick reaction

(see Consensus)

• => mid. aggressive FD

85

Backup dispatcher


Variable Suspicion Costs

• Case 1:

• Cost varies with time:

• amount work completed

• available resources

• Case 2:

• Important task

• Most likely up machine

86

?

BOOM

BOOM


Decoupling

• Failure detection• 2 roles: monitoring, interpretation

• interpretation –> QoS

• => decoupling

87

Failure

Detection

Service

Programs,

Protocols

Monitoring

Interpretation

Action

Interpretation

ActionParametric

Action

suspicion level

suspicionssuspicions

Monitoring

Interpretation

Action Action Action

suspicions

Binary FD Accrual FD


Accrual Failure Detectors

• Abstraction

• Provides suspicion level

• Separates monitoring / interpretation

• QoS handled locally (e.g., by thresholds)

• Formally

• Well-defined behavior

• Preserves theoretical characteristics

• Relation between threshold & QoS

• Practically

• Many implementations possible

88

from [DUHK05]


Suspicion Level

• Suspicion Level

• function of time to non-negative

• means “confidence that p is faulty”: (0 = trust p)

89

slqp : T !" R+0

slqp(t)

t

slqp(t)


Property 1 (Upper bound)

• If p is correct

• slqp(t) is bounded

• bound SLmax is unknown

90

slqp(t)

t

SLmax

slqp(t)

p ! correct(F ) " #SLmax : $t (slqp(t) % SLmax )


Property 2 (Accruement)

• If p is faulty

• slqp(t) is eventually monotonic

• (unknown) minimum increase rate

91

SLmax

slqp(t)

t

slqp(t)

BOOM

monotonic

p! faulty(F ) " #K#Q$k % K

!

slqp(tqryq(k)) & slqp(tqryq

(k+1))'slqp(tqryq

(k)) < slqp(tqryq(k+Q))

"


Class !Pac

• Class !Pac

• for all pairs of processes:

- Upper bound holds

- Accruement holds

• Equivalence: !Pac ! !P

• can solve same set of problems

• implementable in same systems

92


Equivalence

• Accrual to Binary

• dynamic thresholds

• suspect: threshold & trust

• trust: rate insufficient

• Binary to Accrual

• at each query:

- p suspected => increm. by #

- p trusted => reset to 0

93

slqp(t)

slqp(t)D!Pac

t

Tsusp(t)

trust

suspectD!"P

trust

suspectD!Pslqp(t)

slqp(t)

t

D!"Pac


Equivalence

• Means...

• same computational power

- can solve same set of problems

- can be implemented over same set of systems

• but...

• loss of information

• overall performance can be different

94

!Pac ! !P


• Threshold function• function of time

• triggers suspicion

• defines binary FD:

slqp(t)

t

slqp(t)

QoS w/multiple thresholds

95

Ti : T !" R+

DTi

T1(t)

trust

DT1suspect

T2(t)

DT2

trust



• Hypotheses

• Hyp.1: At all time, T1(t) & T2(t)

• Theorem

• "T2 suspects p only if DT1 suspects p

• Detection time

• detect. time w/DT1 & detect. time w/DT2

• Query accuracy prob.

• prob. trust w/DT1 & prob. trust w/DT2

96


• Threshold function• function of time

• triggers suspicion

• defines binary FD:

slqp(t)

t

slqp(t)


97

Ti : T !" R+

DTi

T1(t)

trust

DT1suspect

T2(t)

DT2

trust


• Trust threshold function• shared by all detectors

slqp(t)

t

slqp(t)

T1(t)

trust

DT1suspect

T2(t)

DT2

trust

T0(t)


98



• Hypotheses

• Hyp.1: At all time, T1(t) & T2(t)

• Hyp.2: D#T1 & D#T2 use same trust threshold

• Theorem

• "#T2 has T-transition => D#T1 has T-transition

• Mistake rate

• mistake rate w/D#T1 >= mistake rate w/D#T2

• Good period duration

• good period duration w/D#T1 <= w/D#T2

• mistake duration: cannot say99 PDCAT’05, Dalian, China, December 5, 2005

Implementations

• Simple implementation• Partially synchronous model

• Susp. level increase with time

• Reset when receive heartbeat

100

slqp(t)

t

slqp(t)

p

q

BOOM


Implementations

• Chen-based adaptation [CTA02]

• After “expected arrival” point, increase with time


• Safety margin ! set with threshold

101

slqp(t)

t

slqp(t)

p

q

BOOM

EA1 EA2 EA3 EA4 EA5

!

!

!5

T=!


Implementations

• PHI accrual FD [HDYK04]

• Samples arrival probability

• Estimate prob. receive HB later


102

Plater (t)

t

App. 3

do action(!)network

Plater (t)last arrival !

Failure Detector

App. 1

! > !1 ! suspect

App. 2

! > !2 ! suspect

heartbeat arrivals

sampling window

estimation

Plater (t)

t

tnow

! log10 Plater (tnow ! Tlast)Tlast


Implementations

• Difference• Dominant metric for suspicion level

103

suspicion level related to...

Simple implementationdetection time

Chen-based adaptation

$ accrual FD accuracy


Part VConclusion


Other Important Issues

• Notification

• Gossiping

• Hierarchy, spanning tree

• Interface

• QoS negotiation

• Evaluation

• Representative environments

• Benchmarks


Conclusions

• Failure Detection

• Active area of research

• Theory well-developed

• Practice lagging

• Open questions...

• Better abstractions / interfaces

• Better performance (QoS)

• Lower overhead

• Self-configuration

106

Failure Detection in Distributed Systems · PRESTO, Japan Science & Tech . Agency (JST) PDCA TÕ05,...

Documents

Transcript of Failure Detection in Distributed Systems · PRESTO, Japan Science & Tech . Agency (JST) PDCA TÕ05,...