Page 1: Using Communication by Time to Implement Fail-Safe Duplex Redundancy

IFIP 10.4 Workshop on Time and Dependability
Les Trois Ilets, Martinique, 20-21 January 2000

David Powell
Jean Arlat
Didier Essamé

Page 2: An Early Example of Fail-Safe Design

[Figure labels: go, stop]

Page 3: Fault-Tolerant Fail-Safe Design?

[Figure labels: go, stop]

Page 4: Fault-Tolerant Fail-Safe Design?

[Figure labels: go, stop, "?"]

Page 5: Automatic Subway System

[Figure: track Sections i-1, i and i+1; section Controllers i-1, i and i+1 with their I/O; I/O of section i; onboard controller; control room]

Page 6: Coded Processor Technique (1)

■ Arithmetic code (principle):

  17    8 ( = 1 + 7 )
+ 25    7 ( = 2 + 5 )
= 42    6 ( = 4 + 2 )

Correct since: 8 + 7 = 15 ⇒ 6 ( = 1 + 5 )

type SAFE_INTEGER is record
   F : INTEGER;  -- functional value
   C : INTEGER;  -- code, i.e., modulo 9 of the functional value
end record;

-- Addition applies to the functional part and to the code part separately
function "+" (x, y : SAFE_INTEGER) return SAFE_INTEGER is
begin
   return (x.F + y.F, (x.C + y.C) mod 9);
end "+";
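The arithmetic-code principle above can be tried out with a minimal Python sketch. It only mirrors the modulo 9 example of the slide; the class name `SafeInteger` and the helpers `encode` and `is_consistent` are illustrative assumptions, not part of the coded processor itself.

```python
from dataclasses import dataclass

MODULUS = 9  # the slide's example uses residues modulo 9 (digit sums)


@dataclass(frozen=True)
class SafeInteger:
    f: int  # functional value
    c: int  # code: functional value modulo 9

    @classmethod
    def encode(cls, value: int) -> "SafeInteger":
        """Build a coded value from a plain integer (hypothetical helper)."""
        return cls(value, value % MODULUS)

    def __add__(self, other: "SafeInteger") -> "SafeInteger":
        # Functional part and code part are computed independently.
        return SafeInteger(self.f + other.f, (self.c + other.c) % MODULUS)

    def is_consistent(self) -> bool:
        """Check that the code still matches the functional value."""
        return self.f % MODULUS == self.c


total = SafeInteger.encode(17) + SafeInteger.encode(25)
# Simulated arithmetic fault: the functional part is off by one,
# while the code part is unchanged.
corrupted = SafeInteger(total.f + 1, total.c)
```

Here `total` carries the functional value 42 with code 6, as on the slide, while `corrupted.is_consistent()` is false because 43 mod 9 no longer matches the code.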

Page 7: Coded Processor Technique (2)

■ Full code:

x => ( x, r(x) + Bx + D )    (field widths shown on the slide: 8 bits, 48 bits)

● r(x): arithmetic code (to detect incorrect arithmetic)
● Bx: static signature (to detect addressing errors)
● D: dynamic signature (to detect old values)

■ Source to runtime production chain:

[Figure: source program → signature determination tool → instrumented source program → COTS compiler → object program → COTS processor. Other labels: functional values v1.F, v2.F, v3.F …; codes v1.C, v2.C, v3.C …; fail-safe code checker; properties of the code; power; outputs to plant]

Page 8: Principle of Operation

[Figure: Section A, Lock, Section B; Block; negative detectors; trains T3, T2, T1; Controller A, Controller B]

■ Target attribution
● Section controllers give each train a "target" - the point up to which it may advance:
➥ next station
➥ block before next train
➥ exit point of inter-section Lock

■ Train handover from one section to the next (via the Lock)
● Controller B detects, interrogates and registers trains entering the Lock
● Controller A detects and unregisters trains leaving the Lock

Page 9: Handover Scenario without Redundancy

[Figure: Section A, Lock, Section B; trains T3, T2, T1; target Y; Controller A, Controller B]

● Controller B registers train T2 and assigns it the target Y
➥ if controller B does not register T2: T2 stops at exit of Lock (its last target)
➥ if controller B fails before assigning the target Y: T2 stops at exit of Lock (its last target)
➥ if controller B fails after assigning the target Y: T2 advances towards Y (its new target)

Page 10: Handover Scenario with Redundancy (1)

[Figure: Section A, Lock, Section B; trains T3, T2, T1; target Y; Controller A (units A1, A2), Controller B (units B1, B2)]

● Unit B1 (primary) registers train T2 and assigns it the target Y

● Unit B2 (secondary) does not register T2: B1 and B2 have become inconsistent

Page 11: Handover Scenario with Redundancy (2)

[Figure: Section A, Lock, Section B; trains T3, T2, T1; points X and Y; Controller A (units A1, A2), Controller B (units B1, B2). Annotations: T2 advances towards Y (its new target); T2 has become a "ghost train"]

● Unit B1 (primary) registers train T2 and assigns it the target Y
● Unit B2 (secondary) does not register T2: B1 and B2 have become inconsistent
● Unit B1 fails after assigning the target Y
● Unit B2 becomes primary
● Train T3 enters the Lock
● Unit B2 registers train T3 and assigns it the target Y instead of X because it is not aware of T2
● Train T3 advances towards Y, through point X ...

Page 12: Problem Statement

[Figure label: Communication system]

Inconsistency of redundant unit states causes safety problems

Use an atomic multicast protocol!

Yes, of course, but what distributed system model can be safely assumed?

Page 13: Assumptions: Timing Models (1)

[Figure: P and Q on a shared network; P sends m1 at TA, Q receives it at TB, Q sends m2 at TC, P receives it at TD]

● Communication time (TB-TA), (TD-TC): ∃ bound ∆P in the time reference of P?
● Reaction time (TC-TB): ∃ bound σP in the time reference of P?

Both bounds must exist so that P can detect that something has failed

One bound must be guaranteed so that P can decide what has failed

Page 14: Assumptions: Timing Models (2)

[Figure: P and Q on a shared network; messages m1 and m2; times TA, TB, TC, TD]

Asynchronous, or "time-free" model
➥ either communication or reaction time bound does not exist
➥ P cannot decide if Q has stopped, or if Q, m1 or m2 are very slow

Cannot solve consensus, so no atomic multicast

Synchronous, or "bounded time" model
➥ communication bound guaranteed (the network never fails)
➥ P can declare that Q has failed if TD-TA > 2∆P + σP

Assumption coverage?
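The synchronous-model test TD-TA > 2∆P + σP can be written down as a small Python sketch. It is only valid under the synchronous assumptions of the slide (guaranteed bounds ∆P and σP); the function names and the handling of a reply that has not yet arrived are illustrative assumptions.

```python
from typing import Optional


def round_trip_bound(delta_p: float, sigma_p: float) -> float:
    """Longest round trip P should observe: two transmissions plus Q's reaction."""
    return 2 * delta_p + sigma_p


def q_has_failed(t_a: float, t_d: Optional[float], now: float,
                 delta_p: float, sigma_p: float) -> bool:
    """Decide, on P's clock, whether Q can be declared failed.

    t_a: time at which P sent m1.
    t_d: time at which P received m2, or None if P is still waiting
         (assumption: the current time `now` is then used instead).
    """
    elapsed = (t_d if t_d is not None else now) - t_a
    return elapsed > round_trip_bound(delta_p, sigma_p)
```

For example, with ∆P = 10 and σP = 5 the bound is 25 time units: a reply received after 20 units is acceptable, while 30 units of silence lets P declare Q failed. In the asynchronous model no such bound exists, so this test cannot be used there.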

Page 15: The Real System

[Figure label: Communication system]

■ Assumptions
● human lives are at stake, so must assume that communication is uncertain:
➥ messages can be lost (omission failures)
➥ messages can be delayed (performance failures)
● fail-safe processing units (coded processor technique)
● table-driven process scheduling
● fail-safe local clocks

Page 16: Timed Asynchronous Model

■ The real system
● human lives are at stake, so must assume that communication is uncertain:
➥ messages can be lost (omission failure)
➥ messages can be delayed (performance failure)
● fail-safe processing units (coded processor technique)
● table-driven process scheduling
● fail-safe local clocks

■ The model
● Datagram service
➥ Defined upper quantile on transmission delay (δ)
➥ Messages can only suffer omission/performance failures
● Process management service
➥ Defined upper quantile on scheduling delay (σ)
➥ Processes can only suffer crash/performance failures
● Hardware clock service
➥ Each non-crashed process has access to a hardware clock with a known upper bound on drift rate (ρ) (NB. clocks are not (cannot be) deterministically synchronized)

[Cristian & Fetzer 1998]

Page 17: Fail-Aware Datagram Service

[Figure: real-time axis t1, t2, t3, t4. Q sends m' at t1 (local clock TQ(t1)); P receives m' at t2 (TP(t2)); P sends m at t3 (TP(t3)); Q receives m at t4 (TQ(t4)). The delay of m' is ≥ δmin; td(m) ≤ ub(m)]

■ Let td(m) be the real delay incurred by a message m

td(m) = t4 - t3 = (t4 - t1) - (t3 - t2) - (t2 - t1)

■ Upper bound ub(m) on td(m):

ub(m) = (TQ(t4) - TQ(t1)) (1+ρ) - (TP(t3) - TP(t2)) (1-ρ) - δmin

■ Choose a constant ∆ so that m can be classified according to:
● if ub(m) ≤ ∆ message is fast
● if ub(m) > ∆ message is slow

■ Moreover, if periodicity of messages is ≤ τ, can calculate ∆ such that, when td(m') < δ and td(m) < δ (P and Q "connected"), then m is delivered as fast:
● ∆ ≥ 4τρ + (2+4ρ) δ - δmin (ensures progress when P, Q and the channel between them are timely)

[Fetzer & Cristian 1997]
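The bound and the fast/slow classification above can be sketched in Python as follows. The formulas are those of the slide; the function and parameter names are illustrative assumptions, and clock readings are taken as plain numbers in a common time unit.

```python
def upper_bound(tq_t1: float, tp_t2: float, tp_t3: float, tq_t4: float,
                rho: float, delta_min: float) -> float:
    """ub(m): upper bound on the real delay td(m) of message m.

    tq_t1, tq_t4: Q's clock when it sent m' and when it received m.
    tp_t2, tp_t3: P's clock when it received m' and when it sent m.
    rho: bound on clock drift rate; delta_min: minimum transmission delay.
    """
    return ((tq_t4 - tq_t1) * (1 + rho)
            - (tp_t3 - tp_t2) * (1 - rho)
            - delta_min)


def classify(ub: float, delta: float) -> str:
    """A message is "fast" if ub(m) <= delta, otherwise "slow"."""
    return "fast" if ub <= delta else "slow"


def min_delta(tau: float, rho: float, delta: float, delta_min: float) -> float:
    """Smallest delta satisfying the slide's progress condition."""
    return 4 * tau * rho + (2 + 4 * rho) * delta - delta_min
```

As a usage example with drift-free clocks (ρ = 0): if Q measures 100 units between sending m' and receiving m, P held the exchange for 60 units, and δmin = 5, then ub(m) = 35, so m is fast for ∆ = 40 and slow for ∆ = 30.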

Page 18: Protocol for Asymmetric Duplex REdundancy (PADRE)

[Figure: two identical protocol stacks, each with the layers application module / PADRE / fail-aware datagram / unreliable datagram]

■ Idea:
● Cannot guarantee consistency of duplicated units since communication is uncertain
● So, build a fail-aware multicast protocol
● Indicator signals when consistency is ensured
➥ Nominal duplex configuration
• primary unit in primary mode
• secondary unit in standby mode
● Inhibit redundancy switching when consistency is not ensured
➥ Safe duplex configuration
• primary unit in primary mode
• secondary unit in quarantine mode

[Essamé et al. 1999]

Page 19: PADRE System Modes

[State diagram. States: nominal duplex config., safe duplex config., simplex config., Safe (benign failure), Unsafe (catastrophic failure). Other label: nominal service. Transition labels: potential inconsistency (transmission fault); state restoration; fault of secondary; fault of primary; fault of primary or secondary; repair]

Page 20: Protocol Properties

■ Safety properties
● Unique Primary (UP): at any instant, only one unit is in the primary mode
● Quarantine (MQ): Secondary must leave standby mode within bounded delay if inconsistent with Primary; return to standby mode only allowed when consistent
● Prefix of History (PH): history of Primary must always be a prefix of that of Secondary

■ Progress properties
● Agreement (AP): in the absence of faults, any input accepted by one unit at time t is accepted by the other unit in the interval [ t-ω , t+ω ]
● Limited Quarantine (LQ): in the absence of faults, a unit in quarantine must eventually switch to standby

Page 21: Protocol Principle

■ Primary only accepts an input if the secondary:
➥ has accepted it, or
➥ has been placed in quarantine, or
➥ has failed

■ Secondary only accepts messages sent to it from the primary
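The two acceptance rules above can be expressed as simple predicates. This is a minimal sketch: the enumeration `SecondaryStatus`, including its `UNKNOWN` member, and the string used to identify the sender are illustrative assumptions rather than elements of PADRE itself.

```python
from enum import Enum


class SecondaryStatus(Enum):
    ACCEPTED = "secondary has accepted the input"
    QUARANTINED = "secondary has been placed in quarantine"
    FAILED = "secondary has failed"
    UNKNOWN = "none of the above is established"  # hypothetical extra case


def primary_may_accept(status: SecondaryStatus) -> bool:
    """Primary accepts an input only in the three cases listed on the slide."""
    return status is not SecondaryStatus.UNKNOWN


def secondary_may_accept(sender: str) -> bool:
    """Secondary only accepts messages sent to it from the primary."""
    return sender == "primary"
```

The point of the asymmetry is that the primary never gets ahead of a secondary that could still take over: as long as the status is unknown, the input stays pending.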

Page 22: Reception Protocol

[Figure: message timelines for Primary and Secondary, with messages mi, mi+1, ack(mi) and mr, and the acknowledgement timeout interval A. Annotations: "can accept since secondary has accepted"; "can only accept when sure that secondary is in quarantine or has failed"; "secondary, go to quarantine!"]

Page 23: Quarantine Control Protocol

[Figure labels: go, stop; "don't go to quarantine", quarantine]

Page 24: Quarantine Control Protocol

R: refresh period
I: survival timeout interval
Q: delay for certain quarantine or failure of secondary

[Figure: message timelines for Primary and Secondary, with messages mi, mi+1, ack(mi) and mr, the acknowledgement timeout interval A, periodic "don't go to quarantine" messages every R, survival timeout intervals I at the secondary, and the delay Q at the primary]
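One way to read the roles of R and I is as a watchdog on the secondary side: the primary refreshes a "don't go to quarantine" message every R, and the secondary enters quarantine by itself if no refresh arrives within I. The sketch below encodes that reading only; it is an assumption drawn from the legend and the go/stop analogy, and the class `SecondaryWatchdog` and its methods are hypothetical names.

```python
class SecondaryWatchdog:
    """Secondary-side survival timer (hypothetical helper, not from the slides)."""

    def __init__(self, survival_timeout: float) -> None:
        self.survival_timeout = survival_timeout  # I on the slide
        self.last_refresh = 0.0  # local clock time of the last refresh
        self.mode = "standby"

    def on_refresh(self, local_time: float) -> None:
        """Handle a "don't go to quarantine" message from the primary."""
        if self.mode == "standby":
            self.last_refresh = local_time
        # In quarantine, refreshes are ignored: leaving quarantine is assumed
        # to require a state transfer, not a simple refresh.

    def tick(self, local_time: float) -> str:
        """Check the timer; silence longer than I forces quarantine."""
        silent_for = local_time - self.last_refresh
        if self.mode == "standby" and silent_for > self.survival_timeout:
            self.mode = "quarantine"
        return self.mode
```

Under this reading, a primary that wants the secondary in quarantine simply stops refreshing and waits for the delay Q, after which the secondary is either in quarantine or has failed.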

Page 25: Choice of Value for Q

[Figure: real-time axis t1, t2, t3. Primary clock readings TP(t1), TP(t3); secondary clock readings TS(t2), TS(t3); a "don't go to quarantine" message; intervals I and Q]

Need: TP(t3) < TP(t1) + Q

Equivalently: Q ≥ TP(t3) - TP(t1)

Now: (t3-t1) = (t3-t2) + (t2-t1)

and:
(t3-t2) ≤ I (1+ρ)
(t2-t1) ≤ ∆ (fail-aware datagram)

so: (t3-t1) ≤ ∆ + I (1+ρ)

but: TP(t3) - TP(t1) ≤ (t3-t1) (1+ρ)

Therefore, must choose Q such that:

Q ≥ [∆ + I (1+ρ)] (1+ρ)

or, neglecting the ρ² term:

Q ≥ ∆ (1+ρ) + I (1+2ρ)
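The two forms of the bound on Q can be compared numerically with a short Python sketch. The formulas are those derived above; the function names are illustrative assumptions.

```python
def q_bound_exact(delta: float, interval: float, rho: float) -> float:
    """Q >= [delta + I (1 + rho)] (1 + rho)."""
    return (delta + interval * (1 + rho)) * (1 + rho)


def q_bound_first_order(delta: float, interval: float, rho: float) -> float:
    """Q >= delta (1 + rho) + I (1 + 2 rho), i.e. the exact bound without I * rho**2."""
    return delta * (1 + rho) + interval * (1 + 2 * rho)


def dropped_term(interval: float, rho: float) -> float:
    """Second-order term that separates the two forms."""
    return interval * rho ** 2
```

The first-order form is smaller than the exact one by I ρ², which is negligible for realistic drift rates; when in doubt, the exact form is the conservative choice.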

Page 26: Unique Primary Property

[Figure: two identical protocol stacks, each with the layers application module / PADRE / fail-aware datagram / unreliable datagram]

● Unique Primary (UP): at any instant, only one unit is in the primary mode
➥ Software implementation would require third party to allow majority election of a leader
➥ Hardware implementation by means of a bistable safety relay

Page 27: De-quarantine Protocol

Objective: secondary to revert from quarantine to standby, to resume its role as a back-up

Principle:
● transfer state of primary to secondary
● in general case, state cannot be transferred in single message
● state of primary may be updated while transfer is being carried out

[Bondavalli et al. 1998]

[Figure: Primary holds State [1] ... State [n], each with a tag (shown as 0, 0, 1, 0, 1, 1, ..., 1, 1); Secondary (in quarantine) holds State [1] ... State [n]. Labels: concurrent update; do while ∃ tag=1; primary: if last, resume sending "don't go to quarantine"; secondary: if last, switch to "standby" mode]
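The tagged transfer loop can be illustrated with a small single-threaded simulation. This is a sketch under stated assumptions: a tag of 1 is taken to mean "block still to be sent", concurrent updates are simulated by a hypothetical `updates` schedule keyed by message count, and one block is sent per message.

```python
from typing import Dict, List, Tuple


def transfer_state(primary: List[str],
                   updates: Dict[int, Tuple[int, str]]) -> Tuple[List[str], int]:
    """Copy the primary's state blocks to the secondary, one block per message.

    primary: state blocks of the primary (modified in place by updates).
    updates: simulated concurrent updates, mapping a message count to
             (block index, new value) applied right after that message.
    Returns the secondary's copy and the number of messages sent.
    """
    tags = [1] * len(primary)            # 1 = block must (still) be sent
    secondary = [""] * len(primary)
    sent = 0
    while any(tags):                     # "do while there exists tag = 1"
        index = tags.index(1)
        secondary[index] = primary[index]
        tags[index] = 0
        if sent in updates:              # concurrent update re-tags a block
            block, value = updates[sent]
            primary[block] = value
            tags[block] = 1
        sent += 1
    # After the last block: the primary would resume sending
    # "don't go to quarantine" and the secondary would switch to standby.
    return secondary, sent


primary_state = ["a", "b", "c"]
copy, messages = transfer_state(primary_state, {1: (0, "A")})
```

In this run block 0 is updated after it has already been sent, so it is tagged again and re-sent: four messages are needed for three blocks, and the secondary ends with the same state as the primary. The loop terminates only if updates eventually stop arriving faster than blocks are sent.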

Page 28: Conclusion

■ Timed asynchronous model
● safety does not rely on coverage of synchronous assumptions
● progress can be made when system behaves "as if" it were synchronous
● appropriate model for designing fail-safe distributed systems

■ Asymmetric redundancy management
● tolerance of potential inconsistency
● fault-tolerance temporarily sacrificed to guarantee safety

■ Feasibility study
● automatic subway in San Juan, Puerto Rico

■ Projected applications
● automatic subway in Hong Kong
● automatic subway (Canarsie Line) in New York