A Fusion-based Approach for Tolerating Faults in Finite State Machines

Post on 03-Jan-2016


Vinit Ogale, Bharath Balasubramanian
Parallel and Distributed Systems Lab, Electrical and Computer Engineering Dept., University of Texas at Austin

Vijay K. Garg
IBM India Research Lab

Outline
- Motivation
- Related Work
- Questions and Issues Addressed
- Model
- Partition Lattice
- Fault Graphs
- Fault Tolerance in FSMs and (f,m)-fusion
- Algorithms: Generating Backups and Recovery
- Implementation Results
- Conclusion and Future Work

Motivation
- Many real applications are modeled as FSMs
- Embedded systems: traffic controllers, home appliances
- Sensor networks: e.g. hundreds of sensors (temperature, pressure, etc.) need to be backed up

Problem
- Given a set of finite state machines (FSMs), some FSMs may either crash (fail-stop faults) or lie about their execution state (Byzantine faults)

[Figure: two three-state counters. One counts 'a's (states a0, a1, a2) and the other counts 'b's (states b0, b1, b2).]

Existing Solution - Replicate
- n·f extra FSMs to tolerate f crash faults; 2·n·f extra FSMs to tolerate f Byzantine faults (where n is the number of original FSMs)

[Figure: 1-crash-fault-tolerant setup. The counter counting 'a's (a0, a1, a2) and the counter counting 'b's (b0, b1, b2) are each shown with one replica.]

Related Work
- Traditional approach: redundancy
  - n·k backup machines to tolerate k faults in n machines
- Fault Tolerance in Finite State Machines using Fusion (Balasubramanian, Ogale, Garg 08)
  - Exponential algorithm for generating machines that can tolerate crash faults
  - Number of faults = number of machines
- Fusible Data Structures (Garg, Ogale 06)
  - Fuse common data structures such as linked lists, hash tables, etc.; the fused structure is smaller than the sum of the original structures
- Erasure coding
  - Fault tolerance in data

Reachable Cross Product

[Figure: machine A, the counter counting 'a's (a0, a1, a2), machine B, the counter counting 'b's (b0, b1, b2), and their reachable cross product R(A, B).]

R(A, B), the reachable cross product of {A, B}, has nine states:
<a0, b0> <a0, b1> <a0, b2>
<a1, b0> <a1, b1> <a1, b2>
<a2, b0> <a2, b1> <a2, b2>
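The construction on this slide can be sketched as a breadth-first search over tuples of component states. This is an illustration only: the function names and the dictionary encoding of the transition functions are assumptions, and the counters are assumed to wrap around modulo 3, as the later slides state.

```python
from collections import deque

def mod3_counter(prefix, event):
    """Three-state counter: `event` advances prefix0 -> prefix1 -> prefix2 -> prefix0."""
    delta = {(f"{prefix}{i}", event): f"{prefix}{(i + 1) % 3}" for i in range(3)}
    return f"{prefix}0", delta

def reachable_cross_product(machines, events):
    """Breadth-first search over tuples of component states.

    machines: list of (initial_state, delta), delta[(state, event)] -> next state.
    An event missing from a machine's delta leaves that machine unchanged.
    """
    start = tuple(initial for initial, _ in machines)
    seen, queue, transitions = {start}, deque([start]), {}
    while queue:
        state = queue.popleft()
        for event in events:
            nxt = tuple(delta.get((s, event), s)
                        for s, (_, delta) in zip(state, machines))
            transitions[(state, event)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen, transitions

A = mod3_counter("a", "a")
B = mod3_counter("b", "b")
states, transitions = reachable_cross_product([A, B], events=["a", "b"])
print(len(states))  # 9
```

For these two counters every combination is reachable, so the product has 3 x 3 = 9 states; machines that share events can have far fewer reachable combinations.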

Can We Do Better?

[Figure: the counter counting 'a's (mod 3) and the counter counting 'b's (mod 3), backed up by a single three-state machine F1 that tracks (a + b) modulo 3; example input "a a b".]

[Figure: 2-crash-fault-tolerant setup. The two counters are backed up by F1, tracking (a + b) modulo 3, and F2, tracking (a - b) modulo 3.]
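A brute-force check makes the claim concrete: with the two counters plus F1 and F2, any two surviving machines determine both counts modulo 3. The sketch below is illustrative; the names `MACHINES`, `consistent_counts` and `tolerates_two_crashes` are assumptions, not part of the original slides.

```python
from itertools import combinations, product

# Each machine's state as a function of the number of 'a's and 'b's seen so far.
MACHINES = {
    "A": lambda a, b: a % 3,
    "B": lambda a, b: b % 3,
    "F1": lambda a, b: (a + b) % 3,
    "F2": lambda a, b: (a - b) % 3,
}

def consistent_counts(observed):
    """All (a mod 3, b mod 3) pairs that agree with the surviving machines."""
    return [(a, b) for a, b in product(range(3), repeat=2)
            if all(MACHINES[name](a, b) == value for name, value in observed.items())]

def tolerates_two_crashes():
    """True if every choice of two survivors pins down (a, b) uniquely."""
    for a, b in product(range(3), repeat=2):
        full = {name: f(a, b) for name, f in MACHINES.items()}
        for survivors in combinations(MACHINES, 2):
            observed = {name: full[name] for name in survivors}
            if consistent_counts(observed) != [(a, b)]:
                return False
    return True

print(tolerates_two_crashes())                  # True
# After input "a a b": a = 2, b = 1, so F1 = 0 and F2 = 1.
print(consistent_counts({"F1": 0, "F2": 1}))    # [(2, 1)]
```

Two three-state backups are enough here, whereas replication would need two copies of each counter to survive the same two crashes.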

Questions and Issues Addressed
- Can we do better than the cross product?
- How many faults can be tolerated? What is the minimum number of machines required to tolerate f crash faults?
- Can these machines tolerate Byzantine faults? (For example, in the previous slide, DFSMs A and B along with F1 and F2 can tolerate one Byzantine fault.)

Main aims:
- Develop theory to understand and define this problem
- Develop efficient algorithms based on this theory to generate backup machines

Application Scenario: Sensor Network
- 1000 sensors (simple counters), each recording a parameter (temperature, pressure, etc.). The sensors will be collected later and their data analyzed offline
- 10 sensors are expected to crash
- Replication requires 1000 x 10 backup sensors to ensure fault-tolerant operation
- Can we use just 10 extra sensors instead of 10000?

Model
- FSMs (machines) execute independently (in parallel)
  - The inputs to an FSM are not determined by any other FSM
- FSMs act concurrently on the same set of events
- Fail-stop (crash) faults
  - Loss of current state; the underlying FSM remains intact
- Byzantine faults
  - Machines can "lie" about their current state

Join of Two FSMs
- Join (⊔): the reachable cross product; 4 states in this case instead of 9

Less Than or Equal To Relation (≤)
- Given FSMs A and B: A ≤ B ⇔ A ⊔ B = B
- Given the state of B, we can determine the current state of A
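When machines are represented as partitions of the states of a common machine T, both the join and the order can be computed directly on blocks. The sketch below assumes that representation (each machine is a set of frozensets of state indices) and uses the partitions that this deck gives for F1 and F2; the function names are illustrative.

```python
def join(p, q):
    """Coarsest common refinement: intersect every block of p with every block of q."""
    return {a & b for a in p for b in q if a & b}

def leq(p, q):
    """p <= q when every block of q lies inside some block of p,
    i.e. knowing q's block tells you p's block."""
    return all(any(b <= a for a in p) for b in q)

# Partitions of T = {t0, t1, t2, t3}, written with state indices 0..3.
F1 = {frozenset({0, 3}), frozenset({1}), frozenset({2})}
F2 = {frozenset({0}), frozenset({1}), frozenset({2, 3})}
TOP = {frozenset({i}) for i in range(4)}

print(join(F1, F2) == TOP)                    # True: the join has 4 states
print(leq(F1, TOP), leq(F1, F2))              # True False
print(leq(F1, F2) == (join(F1, F2) == F2))    # True: A <= B iff A join B = B
```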

Partitions
- Given any FSM, we can partition the states into blocks such that the transitions for all states in a block are consistent
- E.g. suppose states t0 and t3 have to be combined into one block

[Figure: an FSM with states t0, t1, t2, t3 and transitions on Input 0 and Input 1.]

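Largest Consistent Partition Containing {t0,t3}

[Figure: merging t0 and t3 in the four-state FSM yields the partition {t0,t3}, {t1}, {t2}.]

Largest Consistent Partition Containing {t0,t1}

[Figure: merging t0 and t1 forces t2 into the same block, yielding the partition {t0,t1,t2}, {t3}.]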
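Computing such a partition amounts to merging the requested states and then repeatedly merging the successors of any two states that share a block. The sketch below uses a small union-find for this. The transition table is hypothetical: the transitions of the slide's FSM are in a figure that is not reproduced here, so `DELTA` was chosen only so that the two results match the partitions stated on these slides.

```python
def close_partition(states, delta, merge):
    """Finest partition, closed under every input, in which the states in `merge` share a block."""
    parent = {s: s for s in states}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]   # path halving
            s = parent[s]
        return s

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx == ry:
            return False
        parent[rx] = ry
        return True

    merge = list(merge)
    for other in merge[1:]:
        union(merge[0], other)

    changed = True
    while changed:                           # propagate until no block grows
        changed = False
        for x in states:
            for y in states:
                if x < y and find(x) == find(y):
                    for moves in delta.values():
                        if union(moves[x], moves[y]):
                            changed = True

    blocks = {}
    for s in states:
        blocks.setdefault(find(s), []).append(s)
    return sorted(sorted(b) for b in blocks.values())

# Hypothetical transitions over state indices 0..3 (NOT taken from the slides).
DELTA = {
    0: {0: 0, 1: 1, 2: 2, 3: 3},   # input 0: every state stays put
    1: {0: 0, 1: 2, 2: 2, 3: 3},   # input 1: t1 moves to t2
}
STATES = [0, 1, 2, 3]

print(close_partition(STATES, DELTA, (0, 3)))   # [[0, 3], [1], [2]]
print(close_partition(STATES, DELTA, (0, 1)))   # [[0, 1, 2], [3]]
```

In the second call, merging t0 and t1 forces t2 into the block because input 1 sends t0 to t0 and t1 to t2.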

Partition Lattice
- The set of all FSMs corresponding to partitions of a given FSM (say T) forms a lattice with respect to the ≤ relation [HarSte66]
- i.e., for any two FSMs A and B formed by partitioning T, there exists a unique C ≤ T such that
  - C = A ⊔ B (join, ⊔): A ≤ C and B ≤ C, and C is the smallest such element
  - C = A ⊓ B (meet, ⊓): C ≤ A and C ≤ B, and C is the largest such FSM

[Figure: the lattice of partitions of T. Top (⊤): {t0}, {t1}, {t2}, {t3}. Three-block machines, labeled F1 (A), F2 (B), F3 and F4: {t0,t3}, {t1}, {t2}; {t0}, {t1}, {t2,t3}; {t0}, {t1,t2}, {t3}; {t0,t2}, {t1}, {t3}. Two-block machines, labeled S1 to S4: {t0,t2,t3}, {t1}; {t0,t3}, {t1,t2}; {t0}, {t1,t2,t3}; {t0,t1,t2}, {t3}. Bottom: {t0,t1,t2,t3}.]

Top Element (⊤)
- Given a set of FSMs 𝒜 = {A1, …, An}: ⊤ = A1 ⊔ A2 ⊔ … ⊔ An
- All FSMs we consider henceforth are less than or equal to ⊤
- Intuitively, ⊤ has information about the state of every machine in the original set 𝒜

Bottom Element of Lattice (⊥)
- Single-state FSM
  - Contains one block with all the states
  - On any input it transitions to itself
  - Conveys no information about the current state of any machine


Tolerating Faults

[Figure: machines F1 and F2 below the top element ⊤ (T, the reachable cross product, with states t0, t1, t2, t3); a fault is marked with an X.]

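Fault Graph: Fault Tolerance Indicator

[Figure: T, the reachable cross product (⊤), with states t0, t1, t2, t3; F1 = {t0,t3}, {t1}, {t2}; F2 = {t0}, {t1}, {t2,t3}; and the fault graph G(𝒜, T) for the original machines 𝒜 = {F1, F2}, a complete graph on t0, t1, t2, t3 with two edges of weight 1 and four edges of weight 2.]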

[Figure: the partition lattice of T with 𝒜 = {FSMs in the yellow region}, next to the fault graph on t0, t1, t2, t3 with edge weights 1, 1, 2, 2, 2, 2.]

Hamming Distance
- Hamming distance d(ti, tj): the weight of the edge separating the states (ti, tj) in the fault graph, e.g. d(t0, t1) = 2
- Minimum Hamming distance dmin(T, 𝒜): the weight of the weakest edge in the fault graph, e.g. dmin(T, 𝒜) = 1

[Figure: fault graph on t0, t1, t2, t3 with edge weights 1, 1, 2, 2, 2, 2; dmin(T, 𝒜) = 1.]
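The fault graph and its minimum distance are easy to compute once each machine is written as a partition of T's states. The sketch below assumes, as the example weights on these slides suggest, that the weight of edge (ti, tj) is the number of machines that place ti and tj in different blocks. The function names are illustrative; the partitions are those given for F1 and F2.

```python
from itertools import combinations

def fault_graph(states, machines):
    """Map each pair (ti, tj) to the number of machines that separate ti from tj."""
    def separates(partition, x, y):
        return not any(x in block and y in block for block in partition)
    return {(x, y): sum(separates(p, x, y) for p in machines)
            for x, y in combinations(states, 2)}

def dmin(states, machines):
    """Weight of the weakest edge in the fault graph."""
    return min(fault_graph(states, machines).values())

STATES = [0, 1, 2, 3]                 # indices of t0..t3
F1 = [{0, 3}, {1}, {2}]
F2 = [{0}, {1}, {2, 3}]

graph = fault_graph(STATES, [F1, F2])
print(graph[(0, 1)])                  # 2, matching d(t0, t1) = 2
print(sorted(graph.values()))         # [1, 1, 2, 2, 2, 2]
print(dmin(STATES, [F1, F2]))         # 1
```

The two weight-1 edges are (t0, t3), which only F2 separates, and (t2, t3), which only F1 separates.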

Fault Tolerance in FSMs (Crash Faults)
- Theorem 1: A set of machines 𝒜 can tolerate up to f crash faults iff dmin(T(𝒜), 𝒜) > f
- E.g. 𝒜 = {A, B, M1, M2}
  - dmin(T(𝒜), 𝒜) = 3
  - can tolerate 2 crash faults

[Figure: fault graph on t0, t1, t2, t3 with edge weights 3, 3, 3, 3, 4, 4; dmin(T(𝒜), 𝒜) = 3.]

Fault Tolerance in FSMs (Byzantine Faults)
- Theorem 2: A set of machines 𝒜 can tolerate up to f Byzantine faults iff dmin(T(𝒜), 𝒜) > 2f
- E.g. 𝒜 = {A, B, M1, M2}
  - Let the machines be in the following states: A = {t0, t3}, B = {t0}, M1 = {t0, t2}, M2 = {t3}
  - B and M1 are lying about their state (f = 2)
  - Since dmin(T(𝒜), 𝒜) = 3 is not greater than 2f = 4, we cannot determine the state of T

[Figure: fault graph on t0, t1, t2, t3 with edge weights 3, 3, 3, 3, 4, 4; dmin(T(𝒜), 𝒜) = 3.]

Fault Tolerance in FSMs (Byzantine Faults)
- Let the machines be in the following states: A = {t0, t3}, B = {t0}, M1 = {t3}, M2 = {t3}
- Only B is lying about its state (f = 1)
- Since dmin(T(𝒜), 𝒜) = 3 > 2f = 2, we can determine the state of T as t3
- Henceforth, dmin(T(𝒜), 𝒜) is abbreviated to dmin(𝒜)
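Theorems 1 and 2 turn a minimum distance directly into fault-tolerance bounds: the largest f with dmin > f for crash faults, and the largest f with dmin > 2f for Byzantine faults. A minimal sketch of that arithmetic follows; the function names are illustrative.

```python
def max_crash_faults(dmin):
    """Largest f satisfying dmin > f (Theorem 1)."""
    return dmin - 1

def max_byzantine_faults(dmin):
    """Largest f satisfying dmin > 2f (Theorem 2)."""
    return (dmin - 1) // 2

# The four-machine example on these slides has dmin = 3.
print(max_crash_faults(3), max_byzantine_faults(3))   # 2 1
# The two original machines alone have dmin = 1 and tolerate no faults.
print(max_crash_faults(1), max_byzantine_faults(1))   # 0 0
```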

Fault Tolerance and (f,m)-fusion
- Given a set of n machines 𝒜, the set of m machines ℱ is an (f,m)-fusion of 𝒜 if dmin(𝒜 ∪ ℱ) > f
- The set of machines in 𝒜 ∪ ℱ can tolerate f crash faults or ⌊f/2⌋ Byzantine faults
- E.g. 𝒜 = {A, B}, ℱ = {M1, M2}, dmin(𝒜 ∪ ℱ) = 3, so ℱ = {M1, M2} is a (2,2)-fusion of 𝒜

Minimal Fusion
- Given a set of machines 𝒜, a fusion set ℱ is minimal if there does not exist another (f,m)-fusion ℱ' such that
  ∀ F ∈ ℱ, ∃ F' ∈ ℱ' : F' ≤ F, and ∃ (F ∈ ℱ, F' ∈ ℱ') : F' < F

[Figure: the partition lattice of T with 𝒜 = {FSMs in the yellow region}, n = 2. One machine is marked as a (1,1)-fusion and another as a minimal (1,1)-fusion.]

Minimal Fusion: Example

[Figure: ⊤ with states t0, t1, t2, t3; F1 = {t0,t3}, {t1}, {t2}; F2 = {t0}, {t1}, {t2,t3}; S4 = {t0,t1,t2}, {t3}; and the resulting fault graph on t0, t1, t2, t3 with edge weights 2, 2, 2, 2, 2, 3.]

Algorithm: Generating Backups
- Aim: add the least possible number of machines that tolerate f faults
- Input: set of machines 𝒜, number of faults f
- Output: minimal fusion set with the least size
- If |T| = N and the size of the event set is |E|, the time complexity of the algorithm is O(N^3 · |E| · f)

Algorithm overview
f: number of faults, 𝒜: given set of machines, ℱ: fusion set (initially empty)
1. While dmin(𝒜 ∪ ℱ) ≤ f
   1. M := ⊤
   2. Repeat
      1. Compute the lower cover of M, i.e. LC(M)
      2. If ∃ machine F ∈ LC(M) : dmin({F} ∪ 𝒜 ∪ ℱ) > dmin(𝒜 ∪ ℱ)
            M := F
         Else ℱ := ℱ ∪ {M}, and leave the inner loop
2. Return ℱ
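One possible reading of this overview is a greedy descent through the partition lattice: start at ⊤, keep moving to a lower-cover machine as long as adding it would still raise the minimum distance, and add the last such machine to the fusion. The sketch below follows that reading and is not the authors' implementation. The transition table `DELTA` is hypothetical (chosen so that its closed partitions match the lattice described on these slides), all function names are assumptions, and no attempt is made to meet the stated complexity bound.

```python
from itertools import combinations

def dmin(states, machines):
    """Fewest machines separating any pair of states."""
    return min(sum(all(not {x, y} <= b for b in m) for m in machines)
               for x, y in combinations(states, 2))

def closure(partition, delta, x, y):
    """Finest closed partition coarser than `partition` with x and y in one block."""
    block_of = {s: b for b in partition for s in b}
    pending = [(x, y)]
    while pending:
        u, v = pending.pop()
        bu, bv = block_of[u], block_of[v]
        if bu == bv:
            continue
        merged = bu | bv
        for s in merged:
            block_of[s] = merged
        first = next(iter(merged))
        for moves in delta.values():      # successors of one block must share a block
            pending.extend((moves[first], moves[s]) for s in merged)
    return frozenset(block_of.values())

def finer(p, q):
    """True when p is strictly finer than q."""
    return p != q and all(any(b <= c for c in q) for b in p)

def lower_cover(partition, delta):
    """Maximal closed partitions strictly coarser than `partition`."""
    candidates = {closure(partition, delta, min(b1), min(b2))
                  for b1, b2 in combinations(partition, 2)}
    return [c for c in candidates if not any(finer(d, c) for d in candidates)]

def gen_fusion(states, delta, originals, f):
    top = frozenset(frozenset([s]) for s in states)
    fusion = []
    while dmin(states, originals + fusion) <= f:
        m = top
        while True:
            current = dmin(states, originals + fusion)
            better = [c for c in lower_cover(m, delta)
                      if dmin(states, originals + fusion + [c]) > current]
            if not better:
                break
            m = better[0]
        fusion.append(m)
    return fusion

def part(*blocks):
    return frozenset(frozenset(b) for b in blocks)

# Hypothetical transitions over state indices 0..3 (NOT taken from the slides).
DELTA = {0: {0: 0, 1: 1, 2: 2, 3: 3}, 1: {0: 0, 1: 2, 2: 2, 3: 3}}
STATES = [0, 1, 2, 3]
F1, F2 = part({0, 3}, {1}, {2}), part({0}, {1}, {2, 3})

fusion = gen_fusion(STATES, DELTA, [F1, F2], f=1)
print(fusion == [part({0, 1, 2}, {3})])   # True: a single two-state backup
```

With 𝒜 = {F1, F2} and f = 1, the descent ends at the two-block machine {t0,t1,t2}, {t3}, the partition labeled S4 in the minimal fusion example of this deck.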

[Figure sequence: five slides stepping through the algorithm on the partition lattice of T, with 𝒜 = {FSMs in the yellow region}. Each slide shows the fault graph on t0, t1, t2, t3 and its weakest edge weight w:
- edge weights 1, 1, 2, 2, 2, 2; w = 1
- edge weights 2, 2, 3, 3, 3, 3; w = 2
- edge weights 2, 2, 2, 3, 3, 3; w = 2
- edge weights 1, 2, 2, 2, 3, 3; w = 1
- edge weights 2, 2, 2, 2, 2, 3; w = 2]

Algorithm: Recovery
- Aim: recover the state of the faulty machines for f crash or ⌊f/2⌋ Byzantine faults, given the state of the remaining machines
- Input: current states of all available machines in 𝒜 ∪ ℱ
- Output: correct state of T
- The time complexity of the algorithm is O((n + m) · f)

Algorithm overview
S: set of current states of machines in 𝒜 ∪ ℱ
count: vector of size |T|, initialized to 0
1. For all (s in S) do
   1. For all (ti in s) do
      1. ++count[i]
2. Return tc : 0 ≤ c < N and count[c] is the maximal element in count

Algorithm: Example
- Consider machines A, B, M1, M2: dmin({A, B, M1, M2}) = 3; they can tolerate one Byzantine fault
- Let the machines be in the following states: A = {t0, t3}, B = {t0}, M1 = {t1, t2, t3}, M2 = {t0}
- M1 is lying about its state
- The recovery algorithm will return t0, since count[0] = 3 is greater than count[1] = 1, count[2] = 1 and count[3] = 2
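The recovery step is a vote: every available machine reports a block of T's states, each state in a reported block gets one vote, and the state with the most votes wins. A minimal sketch, using the reports from the example above (the function name and the list encoding are assumptions):

```python
def recover_state(reported_blocks, num_states):
    """Return (index of the most-voted state of T, vote counts).

    reported_blocks: one set of state indices per available machine.
    """
    count = [0] * num_states
    for block in reported_blocks:
        for i in block:
            count[i] += 1
    winner = max(range(num_states), key=count.__getitem__)
    return winner, count

# A = {t0, t3}, B = {t0}, M1 = {t1, t2, t3} (lying), M2 = {t0}
reports = [{0, 3}, {0}, {1, 2, 3}, {0}]
winner, count = recover_state(reports, num_states=4)
print(winner, count)   # 0 [3, 1, 1, 2]
```

A crashed machine simply contributes no block; the vote then runs over the machines that remain.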

Experimental Results

Original machines | f (faults) | State space for replication | State space for fusion
MESI, Counters A and B, Shift register | 2 | 7,569 | 1,521
Even and Odd Parity Checkers, Toggle Switch, Pattern Generator, MESI | 3 | 262,144 | 32,768
Counters A and B, Divider, Machine A, Machine B | 2 | 6,724 | 504
Pattern Generator, TCP, Machine A, Machine B | 2 | 3,136 | 2,464

Conclusion / Future Work
- It is not always necessary to have n·f backups to tolerate f faults
- Polynomial-time algorithm to generate the smallest minimal set that tolerates f faults
- Implementation of this algorithm shows that many complex state machines have efficient fusions
- Will machines outside the lattice give better results?
- Backup machines need to be given all events; can we do better?