A Fusion-based Approach for Tolerating Faults in Finite State Machines

Vinit Ogale, Bharath Balasubramanian (Parallel and Distributed Systems Lab, Electrical and Computer Engineering Dept., University of Texas at Austin); Vijay K. Garg (IBM India Research Lab)


Transcript of A Fusion-based Approach for Tolerating Faults in Finite State Machines

Page 1: A Fusion-based Approach for Tolerating Faults in Finite State Machines

A Fusion-based Approach for Tolerating Faults in Finite State Machines

Vinit Ogale, Bharath Balasubramanian
Parallel and Distributed Systems Lab
Electrical and Computer Engineering Dept., University of Texas at Austin

Vijay K. Garg
IBM India Research Lab

Page 2: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Outline
  Motivation
  Related Work
  Questions and Issues Addressed
  Model
  Partition Lattice
  Fault Graphs
  Fault Tolerance in FSMs and (f,m)-fusion
  Algorithms: Generating Backups and Recovery
  Implementation Results
  Conclusion and Future Work

Page 3: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Motivation
  Many real applications are modeled as FSMs
  Embedded systems: traffic controllers, home appliances
  Sensor networks: e.g. hundreds of sensors (temperature, pressure, etc.) need to be backed up

Page 4: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Problem

Given a set of finite state machines (FSMs), some FSMs may either crash (fail-stop faults) or lie about their execution state (Byzantine faults).

[Figure: two example FSMs, a counter counting 'a's with states a0, a1, a2 and a counter counting 'b's with states b0, b1, b2]

Page 5: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Existing Solution - Replicate

n·f extra FSMs to tolerate f crash faults; 2·n·f extra FSMs to tolerate f Byzantine faults (where n is the number of original FSMs)

[Figure: 1-crash fault tolerant setup: the counter counting 'a's (a0, a1, a2) and the counter counting 'b's (b0, b1, b2), each with one replica]

Page 6: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Related Work
  Traditional approach: redundancy
    n·k backup machines to tolerate k faults in n machines
  Fault Tolerance in Finite State Machines using Fusion (Balasubramanian, Ogale, Garg 08)
    Exponential algorithm for generating machines which can tolerate crash faults
    Number of faults = number of machines
  Fusible Data Structures (Garg, Ogale 06)
    Fuse common data structures such as linked lists, hash tables, etc.; the fused structure is smaller than the sum of the original structures
  Erasure coding
    Fault tolerance in data

Page 7: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Reachable Cross Product

[Figure: machine A, a counter counting 'a's (states a0, a1, a2), and machine B, a counter counting 'b's (states b0, b1, b2)]

R(A, B), the reachable cross product of {A, B}, has the states:

<a0, b0> <a0, b1> <a0, b2>
<a1, b0> <a1, b1> <a1, b2>
<a2, b0> <a2, b1> <a2, b2>
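The reachable cross product can be computed with a breadth-first search over tuples of component states. The following is a minimal sketch, assuming both counters wrap around modulo 3 and that a machine ignores events it does not count; the names `reachable_cross_product` and `mod3_counter` and the dictionary encoding of transitions are illustrative choices, not taken from the slides.

```python
from collections import deque

def reachable_cross_product(machines, events):
    """Breadth-first search over tuples of component states.

    machines: list of (initial_state, transitions), where transitions maps
    (state, event) -> next_state. Assumption: an event with no entry leaves
    that machine's state unchanged.
    """
    start = tuple(initial for initial, _ in machines)
    delta = {}
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for event in events:
            nxt = tuple(
                transitions.get((s, event), s)
                for s, (_, transitions) in zip(state, machines)
            )
            delta[(state, event)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen, delta

def mod3_counter(prefix, event):
    """Hypothetical helper: a counter over states <prefix>0 .. <prefix>2."""
    transitions = {
        (f"{prefix}{i}", event): f"{prefix}{(i + 1) % 3}" for i in range(3)
    }
    return f"{prefix}0", transitions

A = mod3_counter("a", "a")
B = mod3_counter("b", "b")
states, delta = reachable_cross_product([A, B], ["a", "b"])
print(len(states))                   # 9
print(delta[(("a2", "b0"), "a")])    # ('a0', 'b0')
```

For these two independent counters every pair of states is reachable, so the product has all 9 states; for machines that share events the search may visit far fewer.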

Page 8: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Can We Do Better?

[Figure: counter counting 'a's (mod 3) with states a0, a1, a2; counter counting 'b's (mod 3) with states b0, b1, b2; fused machine F1 computing (a + b) modulo 3, shown on the input "a a b"]

Page 9: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Can We Do Better?

[Figure: 2-crash fault tolerant setup: the counters counting 'a's (mod 3) and 'b's (mod 3) together with the fused machines F1 = (a + b) modulo 3 and F2 = (a - b) modulo 3]
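To see why this setup survives two crashes, note that any two of the four values determine the other two. The sketch below simulates the four machines as plain integers modulo 3 and recovers lost counter states; the function names are hypothetical and the integer encoding is an assumption made for illustration.

```python
def run(events):
    """Feed the same event sequence to A, B, F1 and F2 (all modulo 3)."""
    a = b = f1 = f2 = 0
    for e in events:
        if e == "a":
            a, f1, f2 = (a + 1) % 3, (f1 + 1) % 3, (f2 + 1) % 3
        elif e == "b":
            b, f1, f2 = (b + 1) % 3, (f1 + 1) % 3, (f2 - 1) % 3
    return a, b, f1, f2

def recover_a(b, f1):
    """A crashed: a = (a + b) - b (mod 3)."""
    return (f1 - b) % 3

def recover_both(f1, f2):
    """A and B crashed: a + b = f1 and a - b = f2 (mod 3).

    Adding gives 2a = f1 + f2, subtracting gives 2b = f1 - f2;
    2 is its own inverse modulo 3, so multiply by 2 to solve.
    """
    return (2 * (f1 + f2)) % 3, (2 * (f1 - f2)) % 3

a, b, f1, f2 = run("aab")
print(a, b, f1, f2)           # 2 1 0 1
print(recover_a(b, f1))       # 2
print(recover_both(f1, f2))   # (2, 1)
```

Replication would need four extra counters for the same two-crash guarantee, whereas the fused setup uses two.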

Page 10: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Questions and Issues Addressed
  Can we do better than the cross product?
  How many faults can be tolerated? What is the minimum number of machines required to tolerate f crash faults?
  Can these machines tolerate Byzantine faults? (For example, in the previous slide, DFSMs A and B along with F1 and F2 can tolerate one Byzantine fault.)

Main aims:
  Develop theory to understand and define this problem
  Efficient algorithms based on this theory to generate backup machines

Page 11: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Application Scenario: Sensor Network
  1000 sensors (simple counters), each recording a parameter (temperature, pressure, etc.). The sensors will be collected later and their data analyzed offline
  10 sensors are expected to crash
  Replication requires 1000 x 10 = 10,000 backup sensors to ensure fault tolerant operation
  Can we use just 10 extra sensors instead of 10,000?

Page 12: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Model
  FSMs (machines) execute independently (in parallel)
    The inputs to an FSM are not determined by any other FSM
    FSMs act concurrently on the same set of events
  Fail-stop (crash) faults
    Loss of current state; the underlying FSM remains intact
  Byzantine faults
    Machines can lie about their current state

Page 13: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Join of Two FSMs

Join (⊔): reachable cross product; 4 states in this case instead of 9

Page 14: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Less Than or Equal To Relation (≤)

Given FSMs A and B:

A ≤ B ⇔ A ⊔ B = B

Given the state of B, we can determine the current state of A

Page 15: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Partitions

Given any FSM, we can partition the states into blocks such that the transitions for all states in a block are consistent

E.g. if states t0 and t3 have to be combined to form one block

[Figure: FSM with states t0, t1, t2, t3 and transitions on inputs 0 and 1]

Page 16: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Largest Consistent Partition Containing {t0,t3}

[Figure: the FSM with states t0, t1, t2, t3 and the resulting machine with blocks {t0,t3}, {t1}, {t2}]

Page 17: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Largest Consistent Partition Containing {t0,t1}

[Figure: the FSM with states t0, t1, t2, t3 and the resulting machine with blocks {t0,t1,t2}, {t3}]

Page 18: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Partition Lattice

The set of all FSMs corresponding to partitions of a given FSM (say T) forms a lattice with respect to the ≤ relation [HarSte66].

I.e., for any two FSMs A and B formed by partitioning T, there exists a unique C ≤ T such that:
  C = A ⊔ B (join): A ≤ C and B ≤ C, and C is the smallest such element
  C = A ⊓ B (meet): C ≤ A and C ≤ B, and C is the largest such FSM

Page 19: A Fusion-based Approach for Tolerating Faults in Finite State Machines

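[Figure: partition lattice of T. Top element ⊤ with blocks {t0}, {t1}, {t2}, {t3}. Three-block machines: {t0,t3},{t1},{t2}; {t0},{t1},{t2,t3}; {t0},{t1,t2},{t3}; {t0,t2},{t1},{t3}. Two-block machines: {t0,t2,t3},{t1}; {t0,t3},{t1,t2}; {t0},{t1,t2,t3}; {t0,t1,t2},{t3}. Bottom element with the single block {t0,t1,t2,t3}. Labels in the figure: F1 (A), F2 (B), F3, F4, S1, S2, S3, S4]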

Page 20: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Top Element (⊤)

Given a set of FSMs 𝒜 = {A1, …, An}:

⊤ = A1 ⊔ A2 ⊔ … ⊔ An

All FSMs we consider henceforth are less than or equal to ⊤

Intuitively, ⊤ has information about the state of every machine in the original set 𝒜

Page 21: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Bottom Element of Lattice (⊥)

Single-state FSM
  Contains one block with all the states
  On any input it transitions to itself
  Conveys no information about the current state of any machine

Page 22: A Fusion-based Approach for Tolerating Faults in Finite State Machines

[Figure: partition lattice of T with top element ⊤ and machines labelled F1, F2, F3, F4, S1, S2, S3, S4]

Page 23: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Tolerating Faults

[Figure: machines F1 and F2]

Page 24: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Tolerating Faults

[Figure: machines F1 and F2 with a fault marked X, and ⊤ = T, the reachable cross product, with states t0, t1, t2, t3]

Page 25: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Fault Graph: Fault tolerance indicator

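[Figure: T, the reachable cross product (⊤), with states t0, t1, t2, t3; F1 with blocks {t0,t3}, {t1}, {t2}; F2 with blocks {t0}, {t1}, {t2,t3}; a fault marked X]

Fault graph G(𝒜, T), where 𝒜 = {F1, F2} are the original machines: nodes t0, t1, t2, t3; edge weights 1, 1, 2, 2, 2, 2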

Page 26: A Fusion-based Approach for Tolerating Faults in Finite State Machines

[Figure: partition lattice of T with 𝒜 = {FSMs in yellow region}, and the fault graph on t0, t1, t2, t3 with edge weights 1, 1, 2, 2, 2, 2]

Page 27: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Hamming Distance

Hamming distance d(ti, tj): weight of the edge separating the states (ti, tj) in the fault graph
  e.g. d(t0, t1) = 2

Minimum Hamming distance dmin(T, 𝒜): the weight of the weakest edge in the fault graph
  e.g. dmin(T, 𝒜) = 1

[Figure: fault graph on t0, t1, t2, t3 with edge weights 1, 1, 2, 2, 2, 2; dmin(T, 𝒜) = 1]
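The edge weights can be computed directly from the partitions: the weight between two states of T is the number of machines that place them in different blocks. A minimal sketch follows, using the blocks of F1 and F2 from the fault graph slide; the function names and the encoding of a machine as a list of sets are assumptions made for illustration.

```python
from itertools import combinations

def block_of(partition, state):
    """Return the block of the partition that contains the given state."""
    return next(block for block in partition if state in block)

def fault_graph(states, machines):
    """Weight of (s, t) = number of machines that separate s from t."""
    return {
        (s, t): sum(1 for p in machines if block_of(p, s) != block_of(p, t))
        for s, t in combinations(states, 2)
    }

def d_min(states, machines):
    """Weight of the weakest edge in the fault graph."""
    return min(fault_graph(states, machines).values())

T = ["t0", "t1", "t2", "t3"]
F1 = [{"t0", "t3"}, {"t1"}, {"t2"}]
F2 = [{"t0"}, {"t1"}, {"t2", "t3"}]

graph = fault_graph(T, [F1, F2])
print(graph[("t0", "t1")])   # 2
print(graph[("t0", "t3")])   # 1
print(d_min(T, [F1, F2]))    # 1
```

The two edges of weight 1 are (t0, t3), which F1 cannot distinguish, and (t2, t3), which F2 cannot distinguish.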

Page 28: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Fault Tolerance in FSMs (crash faults)

Theorem 1: A set of machines 𝒜 can tolerate up to f crash faults iff dmin(T(𝒜), 𝒜) > f

e.g. 𝒜 = {A, B, M1, M2}
  dmin(T(𝒜), 𝒜) = 3
  can tolerate 2 crash faults

[Figure: fault graph on t0, t1, t2, t3 with edge weights 3, 3, 3, 3, 4, 4; dmin(T(𝒜), 𝒜) = 3]

Page 29: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Fault Tolerance in FSMs (Byzantine faults)

Theorem 2: A set of machines 𝒜 can tolerate up to f Byzantine faults iff dmin(T(𝒜), 𝒜) > 2f

e.g. 𝒜 = {A, B, M1, M2}

Let the machines be in the following states: A = {t0, t3}, B = {t0}, M1 = {t0, t2}, M2 = {t3}
B and M1 are lying about their state (f = 2)
Since dmin(T(𝒜), 𝒜) = 3 < 4 = 2f, we cannot determine the state of T

[Figure: fault graph on t0, t1, t2, t3 with edge weights 3, 3, 3, 3, 4, 4; dmin(T(𝒜), 𝒜) = 3]

Page 30: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Fault Tolerance in FSMs (Byzantine faults)

Let the machines be in the following states: A = {t0, t3}, B = {t0}, M1 = {t3}, M2 = {t3}
Only B is lying about its state (f = 1)
Since dmin(T(𝒜), 𝒜) = 3 > 2 = 2f, we can determine the state of T as t3

Henceforth, dmin(T(𝒜), 𝒜) is written as dmin(𝒜)

[Figure: fault graph on t0, t1, t2, t3 with edge weights 3, 3, 3, 3, 4, 4; dmin(T(𝒜), 𝒜) = 3]

Page 31: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Fault Tolerance and (f,m)-fusion

Given a set of n machines 𝒜, the set of m machines ℱ is an (f,m)-fusion of 𝒜 if dmin(𝒜 ∪ ℱ) > f

The set of machines in 𝒜 ∪ ℱ can tolerate f crash faults or ⌊f/2⌋ Byzantine faults

E.g. 𝒜 = {A, B}, ℱ = {M1, M2}, dmin(𝒜 ∪ ℱ) = 3; ℱ = {M1, M2} is a (2,2)-fusion of 𝒜
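The fusion condition is a single comparison once dmin is available. The sketch below checks the example; the slides do not list the blocks of M1 and M2, so the two partitions used here are an assumption, chosen because they are consistent with the states reported on the Byzantine fault slides and reproduce the stated dmin of 3. The name `is_fusion` is hypothetical.

```python
from itertools import combinations

def d_min(states, machines):
    """Smallest number of machines separating any pair of states."""
    def separates(partition, s, t):
        return not any(s in block and t in block for block in partition)
    return min(
        sum(separates(p, s, t) for p in machines)
        for s, t in combinations(states, 2)
    )

def is_fusion(states, originals, backups, f):
    """True if the backups form an (f, len(backups))-fusion of the originals."""
    return d_min(states, originals + backups) > f

T = ["t0", "t1", "t2", "t3"]
A = [{"t0", "t3"}, {"t1"}, {"t2"}]
B = [{"t0"}, {"t1"}, {"t2", "t3"}]
M1 = [{"t0", "t2"}, {"t1"}, {"t3"}]   # assumed blocks, not given in the slides
M2 = [{"t0"}, {"t1", "t2"}, {"t3"}]   # assumed blocks, not given in the slides

print(d_min(T, [A, B, M1, M2]))            # 3
print(is_fusion(T, [A, B], [M1, M2], 2))   # True
print(is_fusion(T, [A, B], [M1, M2], 3))   # False
```

With f = 2 the combined set tolerates two crash faults or one Byzantine fault, matching the example.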

Page 32: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Minimal Fusion

Given a set of machines 𝒜, an (f,m)-fusion ℱ is minimal if there does not exist another (f,m)-fusion ℱ' such that

∀ F ∈ ℱ, ∃ F' ∈ ℱ' : F' ≤ F, and ∃ (F ∈ ℱ, F' ∈ ℱ') : F' < F

Page 33: A Fusion-based Approach for Tolerating Faults in Finite State Machines

[Figure: partition lattice of T with 𝒜 = {FSMs in yellow region}, n = 2; annotations "(1,1) fusion" and "Minimal (1,1) fusion"]

Page 34: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Minimal Fusion: Example

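[Figure: ⊤ with states t0, t1, t2, t3; F1 with blocks {t0,t3}, {t1}, {t2}; F2 with blocks {t0}, {t1}, {t2,t3}; S4 with blocks {t0,t1,t2}, {t3}; a fault marked X]

Fault graph G(𝒜, T): nodes t0, t1, t2, t3; edge weights 2, 2, 2, 2, 2, 3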

Page 35: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Algorithm: Generating Backups

Aim: Add the least possible number of machines that tolerate f faults

Input: Set of machines 𝒜, number of faults f

Output: Minimal fusion set with the least size

If |T| = N and the size of the event set is |E|, the time complexity of the algorithm is O(N³ · |E| · f)

Page 36: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Algorithm overview

f: number of faults, 𝒜: given set of machines, ℱ: fusion set being built

1. While dmin(𝒜 ∪ ℱ) ≤ f
   1. M := ⊤
   2. While M […]
      1. Compute the lower cover of M, i.e. LC(M)
      2. If ∃ machine F ∈ LC(M) : dmin({F} ∪ 𝒜 ∪ ℱ) > dmin(𝒜 ∪ ℱ)
            M := F
         Else ℱ := ℱ ∪ {M}
2. Return ℱ
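One possible reading of this descent, written out as a sketch: start at the top of the lattice, keep moving to a lower-cover machine while adding it still raises dmin, and add the last such machine to the fusion set. Everything below is an assumption made for illustration rather than the authors' implementation: machines are encoded as partitions of the states of T, the lower cover is found by merging two blocks and closing the result, the first qualifying machine is taken, and the example T is the product of two mod-3 counters.

```python
from itertools import combinations

def close(blocks, delta, events):
    """Coarsen a partition until every block maps into a single block."""
    blocks = [set(b) for b in blocks]
    changed = True
    while changed:
        changed = False
        for e in events:
            for b in blocks:
                targets = {delta[(s, e)] for s in b}
                hit = [c for c in blocks if c & targets]
                if len(hit) > 1:           # targets straddle blocks: merge them
                    blocks = [c for c in blocks if c not in hit]
                    blocks.append(set().union(*hit))
                    changed = True
                    break
            if changed:
                break
    return frozenset(frozenset(b) for b in blocks)

def leq(p, q):
    """p <= q: every block of q lies inside some block of p."""
    return all(any(bq <= bp for bp in p) for bq in q)

def lower_cover(m, delta, events):
    """Maximal closed partitions strictly below m."""
    cands = set()
    for b1, b2 in combinations(m, 2):
        rest = [b for b in m if b not in (b1, b2)]
        cands.add(close(rest + [b1 | b2], delta, events))
    return [c for c in cands if not any(c != d and leq(c, d) for d in cands)]

def d_min(states, machines):
    def separates(p, s, t):
        return not any(s in b and t in b for b in p)
    return min(sum(separates(p, s, t) for p in machines)
               for s, t in combinations(states, 2))

def gen_fusion(states, delta, events, originals, f):
    top = frozenset(frozenset([s]) for s in states)
    fusion = []
    while d_min(states, originals + fusion) <= f:
        base = d_min(states, originals + fusion)
        m = top                            # top separates every pair of states
        while True:
            better = [c for c in lower_cover(m, delta, events)
                      if d_min(states, originals + fusion + [c]) > base]
            if not better:
                break
            m = better[0]                  # descend to a smaller machine
        fusion.append(m)
    return fusion

# Hypothetical example: T is the product of two mod-3 counters.
states = [(i, j) for i in range(3) for j in range(3)]
delta = {}
for i, j in states:
    delta[((i, j), "a")] = ((i + 1) % 3, j)
    delta[((i, j), "b")] = (i, (j + 1) % 3)
A = frozenset(frozenset(s for s in states if s[0] == i) for i in range(3))
B = frozenset(frozenset(s for s in states if s[1] == j) for j in range(3))

fusion = gen_fusion(states, delta, ["a", "b"], [A, B], f=1)
print(len(fusion), len(fusion[0]))   # 1 3
```

On this example the descent stops at a three-state machine, which corresponds to (a + b) or (a - b) modulo 3 depending on iteration order, instead of the nine-state cross product.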

Page 37: A Fusion-based Approach for Tolerating Faults in Finite State Machines

[Figure: partition lattice of T with 𝒜 = {FSMs in yellow region}, and the fault graph on t0, t1, t2, t3 with edge weights 1, 1, 2, 2, 2, 2; w = 1]

Page 38: A Fusion-based Approach for Tolerating Faults in Finite State Machines

[Figure: partition lattice of T with 𝒜 = {FSMs in yellow region}, and the fault graph on t0, t1, t2, t3 with edge weights 2, 2, 3, 3, 3, 3; w = 2]

Page 39: A Fusion-based Approach for Tolerating Faults in Finite State Machines

[Figure: partition lattice of T with 𝒜 = {FSMs in yellow region}, and the fault graph on t0, t1, t2, t3 with edge weights 2, 2, 3, 2, 3, 3; w = 2]

Page 40: A Fusion-based Approach for Tolerating Faults in Finite State Machines

[Figure: partition lattice of T with 𝒜 = {FSMs in yellow region}, and the fault graph on t0, t1, t2, t3 with edge weights 2, 1, 3, 2, 2, 3; w = 1]

Page 41: A Fusion-based Approach for Tolerating Faults in Finite State Machines

[Figure: partition lattice of T with 𝒜 = {FSMs in yellow region}, and the fault graph on t0, t1, t2, t3 with edge weights 2, 2, 2, 2, 3, 2; w = 2]

Page 42: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Algorithm: Recovery

Aim: Recover the state of the faulty machines for f crash or ⌊f/2⌋ Byzantine faults, given the state of the remaining machines

Input: Current states of all available machines in 𝒜 ∪ ℱ

Output: Correct state of T

The time complexity of the algorithm is O((n + m) · f)

Page 43: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Algorithm overview

S: set of current states of machines in 𝒜 ∪ ℱ
count: vector of size |T|, initialized to 0

1. For all (s in S) do
   1. For all (ti in s) do
      1. ++count[i]
2. Return tc : 1 ≤ c ≤ N and count[c] is the maximal element in count

Page 44: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Algorithm : Example

Consider machines A, B, M1, M2: dmin({A, B, M1, M2}) = 3; they can tolerate one Byzantine fault

Let the machines be in the following states: A = {t0, t3}, B = {t0}, M1 = {t1, t2, t3}, M2 = {t0}
M1 is lying about its state
The recovery algorithm will return t0, since count[0] = 3 is greater than count[1] = 1, count[2] = 1 and count[3] = 2
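The counting step can be sketched in a few lines. Each reported state is encoded here as the set of indices of the states of T in its block; that encoding and the function name `recover` are assumptions made for illustration.

```python
def recover(reported_states, num_states):
    """Vote over the states of T named by each machine's reported block."""
    count = [0] * num_states
    for block in reported_states:
        for i in block:
            count[i] += 1
    # On a tie, max() returns the lowest index.
    best = max(range(num_states), key=lambda i: count[i])
    return best, count

reported = {
    "A": {0, 3},
    "B": {0},
    "M1": {1, 2, 3},   # the lying machine in the example
    "M2": {0},
}
state, count = recover(reported.values(), 4)
print(state)   # 0
print(count)   # [3, 1, 1, 2]
```

A crashed machine simply contributes no block, so the same vote covers both the crash and the Byzantine case.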

Page 45: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Experimental Results

Original machines | f (faults) | State space for replication | State space for fusion
MESI, Counter A and B, Shift register | 2 | 7,569 | 1,521
Even and Odd Parity Checkers, Toggle Switch, Pattern Generator, MESI | 3 | 262,144 | 32,768
Counters A and B, Divider, Machine A, Machine B | 2 | 6,724 | 504
Pattern Generator, TCP, Machine A, Machine B | 2 | 3,136 | 2,464

Page 46: A Fusion-based Approach for Tolerating Faults in Finite State Machines

Conclusion/Future Work
  It is not always necessary to have n·f backups to tolerate f faults
  Polynomial time algorithm to generate the smallest minimal set that tolerates f faults
  Implementation of this algorithm shows that many complex state machines have efficient fusions
  Will machines outside the lattice give better results?
  Backup machines need to be given all events; can we do better?