Using Communication by Time to Implement Fail-Safe Duplex Redundancy
IFIP 10.4 Workshop on Time and Dependability
Les Trois Ilets, Martinique, 20-21 January 2000
David [email protected]
Jean [email protected]
Didier Essamé [email protected]
An Early Example of Fail-Safe Design
[Figure: signal labelled "go" and "stop"]
Fault-Tolerant Fail-Safe Design?
[Figure: two signals, each labelled "go" and "stop", followed by a question mark]
Automatic Subway System
[Figure: track sections i-1, i and i+1, each with its I/O units and its section controller (Controller i-1, Controller i, Controller i+1); onboard controller; control room]
Coded Processor Technique (1)
■ Arithmetic code (principle):
    17      8 ( = 1 + 7 )
  + 25      7 ( = 2 + 5 )
  ----
    42      6 ( = 4 + 2 )
Correct since: 8 + 7 = 15 ⇒ 6 ( = 1 + 5 )
type SAFE_INTEGER is record
   F : INTEGER;  -- functional value
   C : INTEGER;  -- code, i.e., the functional value modulo 9
end record;

function "+" (X, Y : SAFE_INTEGER) return SAFE_INTEGER is
begin
   -- add the functional values and, independently, the codes modulo 9
   return (F => X.F + Y.F, C => (X.C + Y.C) mod 9);
end "+";
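The arithmetic-code principle above can be tried out with a minimal Python sketch. It assumes the same modulo-9 code as the slide's example; the class name `SafeInteger`, the helper `encode` and the method `is_consistent` are illustrative names, not part of the original material, and the "faulty adder" is simulated by hand.

```python
from dataclasses import dataclass

MODULUS = 9  # the slide's example code: residue modulo 9


@dataclass(frozen=True)
class SafeInteger:
    f: int  # functional value
    c: int  # code: functional value modulo 9

    @classmethod
    def encode(cls, value: int) -> "SafeInteger":
        # Hypothetical helper: attach the code to a plain integer.
        return cls(value, value % MODULUS)

    def __add__(self, other: "SafeInteger") -> "SafeInteger":
        # The functional part and the code part are computed independently.
        return SafeInteger(self.f + other.f, (self.c + other.c) % MODULUS)

    def is_consistent(self) -> bool:
        # A checker recomputes the code from the functional value.
        return self.f % MODULUS == self.c


total = SafeInteger.encode(17) + SafeInteger.encode(25)
# Simulated arithmetic fault: the functional value is off by one,
# while the code part was computed correctly.
corrupted = SafeInteger(total.f + 1, total.c)
```

Here `total` carries the functional value 42 and the code 6, as on the slide, while `corrupted` fails the check. A plain modulo-9 code cannot detect an error that is a multiple of 9, which is one reason the full code on the next slide adds further components.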
Coded Processor Technique (2)
■ Full code:
x => ( x, r(x) + Bx + D )
● arithmetic code (to detect incorrect arithmetic)
● static signature (to detect addressing errors)
● dynamic signature (to detect old values)
● field widths shown on the slide: 8 bits, 48 bits
■ Source to runtime production chain:
[Figure: source program, signature determination tool, instrumented source program, COTS compiler, object program, COTS processor; coded values (v1.F, v1.C), (v2.F, v2.C), (v3.F, v3.C), ...; properties of the code; fail-safe code checker; power; outputs to plant]
Principle of Operation
[Figure: Section A and Section B joined by the Lock; blocks; negative detectors; trains T3, T2, T1; Controller A, Controller B]
■ Target attribution
● Section controllers give each train a "target", the point up to which it may advance:
➥ next station
➥ block before next train
➥ exit point of inter-section Lock
■ Train handover from one section to the next (via the Lock)
● Controller B detects, interrogates and registers trains entering the Lock
● Controller A detects and unregisters trains leaving the Lock
Handover Scenario without Redundancy
[Figure: Section A, Lock, Section B; trains T3, T2, T1; target point Y; Controller A, Controller B]
● Controller B registers train T2 and assigns it the target Y
➥ if controller B does not register T2: T2 stops at exit of Lock (its last target)
➥ if controller B fails before assigning the target Y: T2 stops at exit of Lock (its last target)
➥ if controller B fails after assigning the target Y: T2 advances towards Y (its new target)
Handover Scenario with Redundancy (1)
[Figure: Section A, Lock, Section B; trains T3, T2, T1; target point Y; Controller A (units A1, A2); Controller B (units B1, B2)]
● Unit B1 (primary) registers train T2 and assigns it the target Y
● Unit B2 (secondary) does not register T2: B1 and B2 have become inconsistent
Handover Scenario with Redundancy (2)
[Figure: Section A, Lock, Section B; trains T3, T2, T1; points X and Y; Controller A (units A1, A2); Controller B (units B1, B2); T2 advances towards Y (its new target); T2 has become a "ghost train"]
● Unit B1 (primary) registers train T2 and assigns it the target Y
● Unit B2 (secondary) does not register T2: B1 and B2 have become inconsistent
● Unit B1 fails after assigning the target Y
● Unit B2 becomes primary
● Train T3 enters the Lock
● Unit B2 registers train T3 and assigns it the target Y instead of X because it is not aware of T2
● Train T3 advances towards Y, through point X ...
Problem Statement
[Figure: communication system]
● Inconsistency of redundant unit states causes safety problems
● "Use an atomic multicast protocol!"
● "Yes, of course, but what distributed system model can be safely assumed?"
Assumptions: Timing Models (1)
[Figure: P and Q communicate over a shared network; P sends m1 at TA, Q receives it at TB, Q sends m2 at TC, P receives it at TD]
● Communication time (TB-TA), (TD-TC): ∃ bound ∆P in the time reference of P?
● Reaction time (TC-TB): ∃ bound σP in the time reference of P?
● Both bounds must exist so that P can detect that something has failed
● One bound must be guaranteed so that P can decide what has failed
Assumptions: Timing Models (2)
[Figure: P and Q on a shared network, messages m1 and m2, instants TA, TB, TC, TD]
● Asynchronous, or "time-free" model
➥ either communication or reaction time bound does not exist
➥ P cannot decide if Q has stopped, or if Q, m1 or m2 are very slow
➥ Cannot solve consensus, so no atomic multicast
● Synchronous, or "bounded time" model
➥ communication bound guaranteed (the network never fails)
➥ P can declare that Q has failed if TD-TA > 2∆P + σP
➥ Assumption coverage?
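The synchronous-model test above amounts to a timeout check on P's side. The following is a minimal Python sketch, assuming all readings are taken on P's clock and expressed in the same unit; the function name `q_has_failed` and its parameter names are illustrative and do not come from the slides.

```python
def q_has_failed(t_a: float, t_d: float, delta_p: float, sigma_p: float) -> bool:
    """Return True if P may declare Q failed under the synchronous model.

    t_a:     P's clock reading when m1 was sent (TA)
    t_d:     P's clock reading when m2 arrived, or the current reading if
             m2 has not arrived yet (TD)
    delta_p: assumed bound on one-way communication time (∆P)
    sigma_p: assumed bound on Q's reaction time (σP)
    """
    # Two message transits plus one reaction must fit in the round trip.
    return (t_d - t_a) > 2 * delta_p + sigma_p


# Example with ∆P = 5 and σP = 10 (arbitrary units): the deadline is 20.
on_time = q_has_failed(t_a=0, t_d=18, delta_p=5, sigma_p=10)
too_late = q_has_failed(t_a=0, t_d=25, delta_p=5, sigma_p=10)
```

The verdict is only sound if the communication bound really holds, which is the assumption-coverage question raised on the slide.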
The Real System
[Figure: communication system]
■ Assumptions
● human lives are at stake, so must assume that communication is uncertain:
➥ messages can be lost (omission failures)
➥ messages can be delayed (performance failures)
● fail-safe processing units (coded processor technique)
● table-driven process scheduling
● fail-safe local clocks
Timed Asynchronous Model
■ The real system
● human lives are at stake, so must assume that communication is uncertain:
➥ messages can be lost (omission failure)
➥ messages can be delayed (performance failure)
● fail-safe processing units (coded processor technique)
● table-driven process scheduling
● fail-safe local clocks
■ The model
● Datagram service
➥ Defined upper quantile on transmission delay (δ)
➥ Messages can only suffer omission/performance failures
● Process management service
➥ Defined upper quantile on scheduling delay (σ)
➥ Processes can only suffer crash/performance failures
● Hardware clock service
➥ Each non-crashed process has access to a hardware clock with a known upper bound on drift rate (ρ) (NB: clocks are not, and cannot be, deterministically synchronized)
[Cristian & Fetzer 1998]
Fail-Aware Datagram Service
[Figure: timelines of P and Q against real time t1, t2, t3, t4: m′ leaves Q at t1 (clock reading TQ(t1)) and reaches P at t2 (TP(t2)) after a delay ≥ δmin; m leaves P at t3 (TP(t3)) and reaches Q at t4 (TQ(t4)); td(m) ≤ ub(m)]
■ Let td(m) be the real delay incurred by a message m
td(m) = t4 - t3 = (t4 - t1) - (t3 - t2) - (t2 - t1)
■ Upper bound ub(m) on td(m):
ub(m) = (TQ(t4) - TQ(t1)) (1+ρ) - (TP(t3) - TP(t2)) (1-ρ) - δmin
■ Choose a constant ∆ so that m can be classified according to:
● if ub(m) ≤ ∆ message is fast
● if ub(m) > ∆ message is slow
■ Moreover, if periodicity of messages is ≤ τ, can calculate ∆ such that, when td(m′) < δ and td(m) < δ (P and Q "connected"), then m is delivered as fast:
● ∆ ≥ 4τρ + (2+4ρ) δ - δmin (ensures progress when P, Q and the channel between them are timely)
[Fetzer & Cristian 1997]
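The bound ub(m) follows term by term from td(m): the interval t4 - t1 is over-estimated from Q's clock by the factor (1+ρ), the interval t3 - t2 is under-estimated from P's clock by the factor (1-ρ), and t2 - t1 is at least δmin. A minimal Python sketch of the computation and of the fast/slow classification is given below. The function and parameter names are illustrative, and the way the receiver obtains P's two clock readings (presumably carried in m) is an assumption not spelled out on the slide.

```python
def upper_bound(tq_t1, tp_t2, tp_t3, tq_t4, rho, delta_min):
    """ub(m) = (TQ(t4) - TQ(t1))(1+rho) - (TP(t3) - TP(t2))(1-rho) - delta_min."""
    round_trip = (tq_t4 - tq_t1) * (1 + rho)  # over-estimates t4 - t1
    hold_time = (tp_t3 - tp_t2) * (1 - rho)   # under-estimates t3 - t2
    return round_trip - hold_time - delta_min


def classify(ub, big_delta):
    """Classify a message as fast or slow against the constant ∆."""
    return "fast" if ub <= big_delta else "slow"


def min_big_delta(tau, rho, delta, delta_min):
    """Smallest ∆ satisfying ∆ >= 4*tau*rho + (2 + 4*rho)*delta - delta_min."""
    return 4 * tau * rho + (2 + 4 * rho) * delta - delta_min


# Example values in milliseconds (all numbers are made up for illustration).
rho = 1e-4
ub = upper_bound(tq_t1=1000, tp_t2=500, tp_t3=520, tq_t4=1030,
                 rho=rho, delta_min=1)
big_delta = min_big_delta(tau=100, rho=rho, delta=10, delta_min=1)
verdict = classify(ub, big_delta)
```

With these numbers ub(m) is about 9.005 ms and the smallest admissible ∆ is about 19.044 ms, so the message is classified as fast.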
PADRE: Protocol for Asymmetric Duplex REdundancy
[Figure: two identical protocol stacks, one per unit: application module, PADRE, fail-aware datagram, unreliable datagram]
■ Idea:
● Cannot guarantee consistency of duplicated units since communication is uncertain
● So, build a fail-aware multicast protocol
● Indicator signals when consistency is ensured
➥ Nominal duplex configuration
• primary unit in primary mode
• secondary unit in standby mode
● Inhibit redundancy switching when consistency is not ensured
➥ Safe duplex configuration
• primary unit in primary mode
• secondary unit in quarantine mode
[Essamé et al. 1999]
PADRE System Modes
[Figure: state diagram. Configuration states: Nominal duplex config., Safe duplex config., Simplex config. Other labels: Nominal service, Safe, Unsafe, (Benign failure), Catastrophic failure. Transition labels: Fault of primary or secondary; Fault of secondary; Fault of primary (twice); Potential inconsistency (transmission fault); State restoration; Repair (twice)]
Protocol Properties
■ Safety properties
● Unique Primary (UP): at any instant, only one unit is in the primary mode
● Quarantine (MQ): Secondary must leave standby mode within bounded delay if inconsistent with Primary; return to standby mode only allowed when consistent
● Prefix of History (PH): history of Primary must always be a prefix of that of Secondary
■ Progress properties
● Agreement (AP): in the absence of faults, any input accepted by one unit at time t is accepted by the other unit in the interval [t-ω, t+ω]
● Limited Quarantine (LQ): in the absence of faults, a unit in quarantine must eventually switch to standby
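The Prefix of History (PH) property can be illustrated with a small check over two logs. This is a minimal Python sketch, assuming each unit's history is simply the ordered list of inputs it has accepted; the function name `is_prefix` and the sample message names are illustrative.

```python
from typing import Sequence


def is_prefix(primary_history: Sequence[str],
              secondary_history: Sequence[str]) -> bool:
    """Return True if the primary's history is a prefix of the secondary's."""
    n = len(primary_history)
    return (n <= len(secondary_history)
            and list(secondary_history[:n]) == list(primary_history))


# The secondary may be ahead of the primary, but never behind it.
ph_holds = is_prefix(["m1", "m2"], ["m1", "m2", "m3"])
ph_violated = is_prefix(["m1", "m2", "m3"], ["m1", "m2"])
```

The second case corresponds to the handover scenario shown earlier, in which the primary has registered a train that the secondary does not know about.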
Protocol Principle
■ Primary only accepts an input if the secondary:
➥ has accepted it, or
➥ has been placed in quarantine, or
➥ has failed
■ Secondary only accepts messages sent to it from the primary
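The two acceptance rules above can be written down as predicates. The following is a minimal Python sketch, assuming the primary's knowledge of the secondary can be summarised by a single status value; the enumeration `SecondaryStatus` and both function names are illustrative and are not taken from the protocol specification.

```python
from enum import Enum


class SecondaryStatus(Enum):
    ACCEPTED = "has accepted the input"
    QUARANTINED = "has been placed in quarantine"
    FAILED = "has failed"
    UNKNOWN = "none of the above is established yet"


def primary_may_accept(status: SecondaryStatus) -> bool:
    # The primary must wait while the secondary's situation is unknown.
    return status is not SecondaryStatus.UNKNOWN


def secondary_may_accept(sender: str, primary_id: str) -> bool:
    # The secondary ignores anything not sent to it by the primary.
    return sender == primary_id
```

Accepting only in these three situations is what keeps the primary's history a prefix of the secondary's whenever the secondary is still eligible to take over.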
Reception Protocol
[Figure: message sequence chart between Primary and Secondary, with messages mi, mi+1, mr and ack(mi), points marked A, and the acknowledgement timeout interval. Annotations: for mi, the primary "can accept since secondary has accepted"; for mi+1, the primary "can only accept when sure that secondary is in quarantine or has failed"; "secondary, go to quarantine!"]
Quarantine Control Protocol
[Figure: "go" / "stop" signal alongside "don't go to quarantine" / "quarantine"]
● R: refresh period
● I: survival timeout interval
● Q: delay for certain quarantine or failure of secondary
[Figure: message sequence chart between Primary and Secondary, with messages mi, mi+1, mr and ack(mi) and points marked A; "Don't go to quarantine" messages repeated with refresh period R; survival timeout intervals I; delay Q before mi+1]
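One plausible reading of the labels above is a dead-man's-handle scheme: the secondary stays in standby only while "don't go to quarantine" messages keep arriving, and puts itself in quarantine when its survival timeout I expires. The slides do not spell this rule out, so the following Python sketch is an assumption throughout; the class name `SecondaryWatchdog` and its methods are hypothetical, and times are readings of the secondary's local clock.

```python
class SecondaryWatchdog:
    """Survival timeout on the secondary (illustrative reading of the slide)."""

    def __init__(self, survival_timeout, now):
        self.survival_timeout = survival_timeout  # I
        self.deadline = now + survival_timeout
        self.mode = "standby"

    def tick(self, now):
        """Check the timeout at local time `now` and return the current mode."""
        if self.mode == "standby" and now >= self.deadline:
            self.mode = "quarantine"
        return self.mode

    def on_refresh(self, now):
        """A "don't go to quarantine" message arrived at local time `now`."""
        self.tick(now)
        if self.mode == "standby":
            # Only a unit still in standby may re-arm its timeout; leaving
            # quarantine is the job of the de-quarantine protocol.
            self.deadline = now + self.survival_timeout


watchdog = SecondaryWatchdog(survival_timeout=30, now=0)
watchdog.on_refresh(now=10)           # deadline moves to 40
mode_before = watchdog.tick(now=35)   # refresh was recent enough
mode_after = watchdog.tick(now=40)    # no further refresh: timeout expires
watchdog.on_refresh(now=45)           # too late, the unit stays in quarantine
```

Under this reading the refresh period R has to be shorter than I, otherwise a healthy secondary would time out between two refreshes.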
Choice of Value for Q
[Figure: Primary and Secondary timelines against real time t1, t2, t3, with primary clock readings TP(t1) and TP(t3), secondary clock readings TS(t2) and TS(t3), the "Don't go to quarantine" message, the interval I and the delay Q]
Need: TP(t3) < TP(t1) + Q
Equivalently: Q ≥ TP(t3) - TP(t1)
Now: (t3-t1) = (t3-t2) + (t2-t1)
and:
(t3-t2) ≤ I (1+ρ)
(t2-t1) ≤ ∆ (fail-aware datagram)
so:
(t3-t1) ≤ ∆ + I (1+ρ)
but:
TP(t3) - TP(t1) ≤ (t3-t1) (1+ρ)
Therefore, must choose Q such that:
Q ≥ [∆ + I (1+ρ)] (1+ρ)
or, expanding and neglecting the ρ² term:
Q ≥ ∆ (1+ρ) + I (1+2ρ)
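The two forms of the bound on Q differ only by the second-order term: [∆ + I(1+ρ)](1+ρ) = ∆(1+ρ) + I(1+2ρ+ρ²). A minimal Python sketch comparing them is given below; the function names and the sample values are illustrative only.

```python
def q_exact(big_delta, i, rho):
    """Q >= [∆ + I(1+ρ)](1+ρ)."""
    return (big_delta + i * (1 + rho)) * (1 + rho)


def q_first_order(big_delta, i, rho):
    """Q >= ∆(1+ρ) + I(1+2ρ), i.e. the exact bound without the I·ρ² term."""
    return big_delta * (1 + rho) + i * (1 + 2 * rho)


# Made-up values in milliseconds: ∆ = 20, I = 100, drift rate ρ = 1e-4.
big_delta, i, rho = 20.0, 100.0, 1e-4
exact = q_exact(big_delta, i, rho)
approx = q_first_order(big_delta, i, rho)
gap = exact - approx  # equals I * rho**2 up to rounding
```

For realistic drift rates the gap I·ρ² is negligible, but the first-order form is very slightly smaller than the exact one, so a small margin on Q keeps the choice conservative.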
Unique Primary Property
● Unique Primary (UP): at any instant, only one unit is in the primary mode
➥ Software implementation would require third party to allow majority election of a leader
➥ Hardware implementation by means of a bistable safety relay
[Figure: two identical protocol stacks, one per unit: application module, PADRE, fail-aware datagram, unreliable datagram]
De-quarantine Protocol
Objective: secondary to revert from quarantine to standby, to resume its role as a back-up
Principle:
● transfer state of primary to secondary
● in general case, state cannot be transferred in single message
● state of primary may be updated while transfer is being carried out
[Bondavalli et al. 1998]
[Figure: the primary holds state blocks State[1], State[2], ..., State[n-1], State[n], each with a tag (values shown: 0, 0, 1, 0, 1, 1, ..., 1, 1); the secondary (in quarantine) holds its own copies State[1] ... State[n]; concurrent updates modify the primary's blocks during the transfer. Loop: do while ∃ tag=1. On the primary side: if last, resume sending "don't go to quarantine". On the secondary side: if last, switch to "standby" mode]
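The tagged transfer loop in the figure can be sketched as follows. This is a minimal Python simulation, assuming that a tag of 1 marks a block that still has to be sent (or sent again after a concurrent update) and that one block travels per message; the function name `transfer_state` and the way concurrent updates are scheduled are illustrative choices, not the authors' algorithm.

```python
def transfer_state(primary, updates_during_transfer=()):
    """Copy the primary's blocks to a fresh secondary copy.

    primary: list of state blocks (modified in place by simulated updates).
    updates_during_transfer: (round_number, index, value) tuples simulating
        updates applied to the primary after the given transfer round.
    Returns the secondary's copy and the number of messages sent.
    """
    n = len(primary)
    secondary = [None] * n
    tags = [1] * n                         # 1 = block still to be (re)sent
    pending = list(updates_during_transfer)
    rounds = 0
    while any(tags):                       # "do while ∃ tag=1"
        index = tags.index(1)
        secondary[index] = primary[index]  # one block per message
        tags[index] = 0
        rounds += 1
        for _, i, value in [u for u in pending if u[0] == rounds]:
            primary[i] = value
            tags[i] = 1                    # updated block must be sent again
        pending = [u for u in pending if u[0] != rounds]
    # All tags are 0: the secondary may switch to standby and the primary
    # may resume sending "don't go to quarantine".
    return secondary, rounds


state = ["a", "b", "c"]
copy, messages = transfer_state(state, [(2, 0, "A")])
```

In this run block 0 is updated after the second message, so it is sent twice and the transfer takes four messages. The loop only terminates if updates eventually stop outpacing the transfer, which is consistent with the Limited Quarantine property being stated for the fault-free case.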
Conclusion
■ Timed asynchronous model
● safety does not rely on coverage of synchronous assumptions
● progress can be made when system behaves "as if" it were synchronous
● appropriate model for designing fail-safe distributed systems
■ Asymmetric redundancy management
● tolerance of potential inconsistency
● fault-tolerance temporarily sacrificed to guarantee safety
■ Feasibility study
● automatic subway in San Juan, Puerto Rico
■ Projected applications
● automatic subway in Hong Kong
● automatic subway (Canarsie Line) in New York