6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault...

102
6. Fault Tolerance (a) Introduction. (b) Types of faults. (c) Fault models. (d) Fault coverage. (e) Redundancy. (f) Fault detection techniques. (g) Hardware fault tolerance. (h) Software fault tolerance. (i) Fault tolerant architectures (j) Example: The space shuttle CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sect. 6 6-1

Transcript of 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault...

Page 1: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

6. Fault Tolerance

(a) Introduction.

(b) Types of faults.

(c) Fault models.

(d) Fault coverage.

(e) Redundancy.

(f) Fault detection techniques.

(g) Hardware fault tolerance.

(h) Software fault tolerance.

(i) Fault tolerant architectures

(j) Example: The space shuttle

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sect. 6 6-1

Page 2: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

(a) Introduction

Faults are essentially unavoidable.

Fault tolerance aims at designing a system in such away that faults do not result in system failure.

All techniques are based on some degree ofredundancy .

Many reasons for introducing fault tolerance – it can bereliability, availability, dependability, safety.

E.g. main memory in computers has always someerror checking mechanism in order to tolerate errorsdue to radioactive particles.Goal is in this case high degree of availability (orreliability).

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (a) 6-2

Page 3: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Historic RemarkFault tolerance already present in early computers.

The EDVAC (designed 1949) had duplicated ALUs inorder to detect errors in calculations.

Von Neumann (1956) developed the theoretical basisfor fault tolerance.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (a) 6-3

Page 4: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

(b) Types of Faults

Faults can be characterised by the following criteria:Nature (random/systematic).Duration.Extent.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (b) 6-4

Page 5: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Classification by NatureFaults can be classified by their nature: random vs.systematic faults:

(i):::::::::::

Random:::::::::

faults.Random faults are faults in components of a system,which occur with a certain probability.We can predict random faults by collecting statisticaldata from large numbers of samples of similarcomponents.Random faults are usually hardware faults.

Reason for random faults are that any hardware issubject to environmental influences, which mightaffect its correct operation.· E.g. radioactivity, radiation, humidity, warmth,

cold, wear and tear.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (b) 6-5

Page 6: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Classification by NatureIn software reliability engineering one considers aswell software errors as random.Because random faults can be predicted well, theyare more easy to tolerate.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (b) 6-6

Page 7: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Classification by Nature(ii)

::::::::::::::

Systematic:::::::::

FaultsSystematic faults are faults which are not random.

Either a component has it, or it doesn’t have it.Software faults are usually considered to besystematic.Three kinds of systematic faults:

Mistakes in the specification of a system.Mistakes in the implementation of software.Mistakes in the design of hardware.

Difficult to tolerate.E.g. if two programmers write the same program,it might be that both make the same systematicmistake.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (b) 6-7

Page 8: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Classification by DurationFaults can be classified by their duration:

(i)::::::::::::::

Permanent::::::::

faults .Remain in existence indefinitely, until correctiveaction is taken.Software faults are always permanent.Many hardware component faults are permanent.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (b) 6-8

Page 9: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Classification by Duration(ii)

::::::::::::

Transient::::::::

faults .Appear, and vanish again.Typical example are effects of radioactive particleshitting a semi conductor of a memory chip.

If it happens, the state of a few bits is changed.But there is no lasting damage to the chip.

Although infrequent and not lasting, one needs totake steps to correct this error before a system erroris caused.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (b) 6-9

Page 10: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Classification by Duration(iii)

:::::::::::::::

Intermittent::::::::

faults .Appear, disappear, and then reappear after sometime.Results of

poor solder joints, corrosion on connectorcontacts.At some times connections are possible, at othersnot.electromagnetic radiation.·

:::::::::::::::::::::

Electromagnetic:::::::::::::::::

compatibility:::::::::

(EMC) is theability of a system to work correctly in thepresence of (electromagnetic) interference fromother electrical equipment, and not to interferewith other equipment or other parts itself.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (b) 6-10

Page 11: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Classification by DurationProblems of electromagnetic radiation occur· within wires of a digital circuits· between computers and other sources of noise

(usually caused by direct electromagneticradiation or by coupling through common powerlines).

· Particular problems close to car engines, jetengines, nuclear power reactors, high-powerelectric motors.

· Mobile phones and CD/DVD players causenowadays problems (e.g. not allowed in planes).

Software errors , especially those caused by raceconditions, often appear to be intermittent, but arealways permanent .

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (b) 6-11

Page 12: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Classification by ExtentFaults can be classified by their extent:

(i) Localised faults affect only a single hardware orsoftware module.

(ii) Global faults have effects that permeate throughthe entire system.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (b) 6-12

Page 13: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

(c) Fault Models

In order to analyse the effect of hardware faults on theentire system, one uses

::::::

fault::::::::::

models .

These are not perfect representations of what isphysically actually happening.

However, they help to design test procedures , tosimulate fault conditions , and to developfault tolerant systems.

Testing of safety critical software needs to take intoaccount that safety requirements are met even if oneor two hardware components fail.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (c) 6-13

Page 14: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Fault ModelsWe consider 3 fault models:

Single-stuck-at fault model .Bridging fault model .Stuck-open fault model .

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (c) 6-14

Page 15: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Single-Stuck-At Model

:::::::::::::::::::

Single-stuck-at:::::::::

model assumes that a fault within amodule causes it

to respond as if one of its inputs or outputs is stuckat logic 0 or 1.and such that the basic functionality of the circuit isotherwise unaffected.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (c) 6-15

Page 16: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Example (Single-Stuck-At Model)

Module uneffected

0 or 1

Input 1 stuck at 0 or 1

0 or 1

Input 2 stuck at 0 or 1 Output stuck at 0 or 1

0 or 1

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (c) 6-16

Page 17: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Example: Or-GateAssume for instance an or gate with inputs input1,input2 and output output:

input1

input2

output

If input1 is stuck at 0 then the output is computed asfollows:

output = 0 ∨ input2 = input2

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (c) 6-17

Page 18: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Example: Or-GateIf input1 is stuck at 1, then the output is constant 1:

output = 1 ∨ input2 = 1

This is identical to the situation where the output isstuck at 1.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (c) 6-18

Page 19: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Analysis of Single-Stuck-At ModelMajority of faults arising from broken tracks, open orshort-circuit components and shorts between trackscan be represented by the single-stuck-at model.

Single-stuck-at model cannot represent accuratelytransient or intermittent faults.

Complexity of testing :For a circuit with N nodes there are only 2Nsingle-stuck-at faults:one for each node stuck at 0,one for each node stuck at 1.Exhaustive testing feasible.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (c) 6-19

Page 20: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Bridging ModelA

:::::::::::

bridging or::::::::::::::::

short-circuit::::::

fault occurs when two ormore nodes in a circuit are accidentally joined togetherto form a permanent fault.

In positive logic, i.e.1 represented by power on,0 represented by power off,

a bridging fault between two inputs has the effect ofboth inputs being ANDed together (see next slide).

In negative logic, i.e.1 represented by power off,0 represented by power on,

a bridging fault between two inputs has the effect ofboth inputs being ORed together (see next slide).

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (c) 6-20

Page 21: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Simple Example (Bridging Fault)

Module unaffected Bridging fault

Bridge

Effect in positive Logic Effect in negative Logic

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (c) 6-21

Page 22: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

More Complex Bridging FaultsBridges between inputs and outputs might result inconverting combinatorial circuits into sequential ones,and might result in instability or oscillation.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (c) 6-22

Page 23: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Complex Example (Bridging Fault)Bridge

Unaffected Bridging fault 1

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (c) 6-23

Page 24: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Complex Example (Bridging Fault)

Bridge

Bridging fault 2

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (c) 6-24

Page 25: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Analysis of Bridging ModelBridging faults behave usually different fromsingle-stuck-at faults.

Complexity of testing more complex:

For a circuit with N nodes there are(

N

M

)

bridgingfaults between M notes at the same time.

Especially, there are(

N

2

)

= N ·(N−1)2 bridging faults

between two nodes.Makes exhaustive testing impossible in mostcases.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (c) 6-25

Page 26: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

The Stuck-Open Model

:::::::::::::::

Stuck-open fault occurs if in a CMOS gate, both outputtransistors are turned off because of an internal open-or short-circuit.

Therefore output is neither pulled to high nor to low.

Depending on the exact fault, the gate will alternatebetween

driving the outputor maintaining its previous output.

Length of time it maintains its output depends on thegate and the nature of the fault.

Therefore circuit gets a complex sequentialcharacteristic.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (c) 6-26

Page 27: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Use of Fault ModelsExhaustive testing of circuits is not feasible except forsimple combinatorial circuits.

Using fault models, test vectors can be developedwhich test for faults occurring by one of the above faultmodels.

Testing only feasible by assuming single failures.

E.g. a circuit with N nodes can have 3N− 1 multiple

stuck-at faults, which is infeasible to test.Why 3N

− 1 faults?Each of the nodes can be error free, stuck at 1 andstuck at 0, giving 3N possibilities.The only case when the circuit is correct is whenall nodes are error free. Excluding it we get 3N

− 1faulty cases.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (c) 6-27

Page 28: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Use of Fault ModelsTesting for bridging faults usually infeasible as well.Usually restriction to testing for single occurrences ofsingle-stuck-at faults.

Fault models can be used for developing strategies fortolerating faults.

Limitations:Hardware design faults (especially wrong logicdesign) are usually not be covered by those faultmodels.Software faults are usually not covered by thesemodels.

In software reliability engineering, fault models forsoftware are developed.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (c) 6-28

Page 29: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

(d) Fault Coverage

::::::

Fault:::::::::::::

coverage is the fraction of possible faults that canbe avoided, removed, detected or tolerated.

Usually it is difficult to give a numerical estimate,except when good fault models can be used.

::::::

Fault:::::::::::

removal:::::::::::::

coverage is the fraction of faults foundduring the testing phase of system development.

Testing vectors aim at 100% fault removal coveragefor the faults in the underlying fault model.However, fault models never include all possiblefaults.Especially, most models cover only single faults anddon’t cover transient or intermittent faults.Therefore, fault removal coverage is never 100%.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (d) 6-29

Page 30: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Fault Coverage

::::::

Fault:::::::::::::

detection:::::::::::::

coverage is the ability of a system todetect faults during operation.

Using fault models, fault detection coverage can beestimated, but we have the same limitations asabove.

::::::

Fault:::::::::::::

tolerance:::::::::::::

coverage is the ability of a system totolerate faults.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (d) 6-30

Page 31: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

(e) Redundancy

::::::::::::::::

Redundancy is the use of additional elements within asystem which would not be required if the system wasfree of faults.

Already the early approaches towards fault-tolerantsystem used duplicated hardware modules.

For instance by using the triple modularredundancy (TMR) system.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (e) 6-31

Page 32: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Triple Modular Redundancy (TMR)In the TMR system, we have 3 identical hardwaremodules, and one voting module.

The voting element will have as output the output of themajority of the modules.

If one module fails, the majority of the three will becorrect.

If two modules fail, TMR might make a wrong decision.

Problem: faults in the voting element are not tolerated.However, the voting element will usually be muchsimpler than the modules, and can therefore bedesigned with a higher degree of reliability.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (e) 6-32

Page 33: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Triple Modular Redundancy (TMR)

Input Voting Element

Module 1

Module 2

Module 3

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (e) 6-33

Page 34: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Forms of Redundancy

::::::::::::

Hardware::::::::::::::::

redundancy .Use of redundant hardware. E.g. the TMR above.

:::::::::::

Software:::::::::::::::::

redundancy .Use of redundant software.

:::::::::::::::

Information::::::::::::::::

redundancy .Use of redundant information.

E.g. parity bits and other error detecting/correctingcodes.E.g. extra data about persons (not only studentnumber but as well name).

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (e) 6-34

Page 35: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Forms of Redundancy

::::::::::::

Temporal:::::::::::::::::

redundancy.

Used in order to tolerate or detect transient faults .E.g. repeating calculations and comparison of theresults obtained.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (e) 6-35

Page 36: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Design DiversityProblem with TMR above is that

it doesn’t cover design faults in the modules,that if one module fails, it is likely that identicalmodules fail at the same time.

For software faults, doubling the program doesn’t helpat all.

Therefore, one aims at combining redundancy withsome degree of diversity.

E.g. Use of hardware modules from differentvendors.E.g. Use of different architectures for differentmodules.E.g. Writing of software by different groups or evencompanies, using different programming languages.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (e) 6-36

Page 37: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Limitations of Design DiversityProblem: The same typical software errors, especially(but not only) software specification errors, are madeindependently by different teams.

Has been demonstrated by studies.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (e) 6-37

Page 38: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

(f) Fault Detection Techniques

Many fault-tolerance techniques rely on detection offaults.

In practice often difficult to detect faults – one can oftenonly detect errors created by faults.

Sometimes it is not even necessary to detect the exactfaults – detecting that an error occurred as aconsequence, suffices.

E.g. in case of transient faults, detection of faults isnot necessary.One only needs to know that a fault has occurred, inorder to deal with the consequences.

We consider some software and hardware techniquesfor fault detection.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (f) 6-38

Page 39: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Functionality Checking

:::::::::::::::::

Functionality::::::::::::

checking is the use of software orhardware routines in order to check whether thehardware of a system is functioning correctly.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (f) 6-39

Page 40: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Examples of Functionality CheckingChecking of RAM done by writing into memory andreading it back again, and checking whether the valuestored is reproduced.

Problem 1 : It is in general not feasible to check thecomplete memory space.Problem 2 : Stuck-at faults might be not detected, ifthe checking test vector checks for the value at whichmemory is stuck-at (so it always returns that value).

Can be overcome by checking with two testvectors, of which the second is the negation of thefirst one.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (f) 6-40

Page 41: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Examples of Functionality CheckingProblem 3 : Even if a memory location isnon-existent, it might be that for a short time due tothe capacitance of the bus the test vector is stillreproduced.

Can be overcome by interleaving writes and readsto different locations.

Other problems, which can be overcome by usingsophisticated test routines.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (f) 6-41

Page 42: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Examples of Functionality CheckingChecking of processors done by executing sequencesof calculations and comparing them with known results(usually stored in ROM).

Checking of connections in multiprocessor systems bychecking that each processor can communicate with itsneighbours.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (f) 6-42

Page 43: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Misc. Checking Methodologies

::::::::::::::::

Consistency::::::::::::

checking uses knowledge about thenature of the information within the system, e.g. thatdata must be within a certain range (

:::::::

range:::::::::::::

checking ).

::::::::

Signal::::::::::::::::

comparison checks in systems withredundancies the signals at various points in themodules and compares them.

::::::::::::

Checking:::::::

pairs is a special case of signalcomparison. Here one checks whether the outputs ofidentical modules with the same inputs are identical.

:::::::::::::::

Information::::::::::::::::

redundancy uses error-detecting codes inorder to detect errors in the data given.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (f) 6-43

Page 44: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Instruction Monitoring

::::::::::::::

Instruction:::::::::::::::

monitoring is checking for illegalinstructions.

If the binary code is corrupted, one might (especiallyif the opcode is corrupted) obtain an illegalinstruction.Some processors immediately raise an exception,others might proceed.

Some processors use illegal instructions forinternal testing purposes (this is often notdocumented), and therefore will not raise anexception.For critical systems one should use processorswhich raise an exception in such cases.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (f) 6-44

Page 45: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Loopback Testing

:::::::::::::

Loopback:::::::::

testing means that one sends a signalarriving at its destination back to its origin and checkswhether the original signal and the signal sent backcoincide.

Finds stuck-at failures in the signal lines.If the outward and return path are too close, then theoutward path might influence the return path andtransmit its content even so the path is broken.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (f) 6-45

Page 46: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Watchdog Timers

:::::::::::::

Watchdog:::::::::

timers are used in order to check whetherthe processor has crashed.

The watchdog timer starts with a certain value anddecrements periodically.If the watchdog timer has reached zero, theprocessor will be reset.The processor regularly resets the watchdog timer,before it reaches 0.If the processor crashes, it will in most cases (but notall) not reset the watchdog timer, and therefore theprocessor will be reset.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (f) 6-46

Page 47: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Watchdog TimersLimitations:

It takes a few milliseconds before a crash isdetected.

During that period the results of the processor arewrong and this might cause hazards.

It might be that the processor crashes in such a waythat the watchdog timer is reset periodically.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (f) 6-47

Page 48: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Bus MonitoringIn

:::::

bus:::::::::::::::

monitoring one checks that the addresses sendto the bus are in a certain range.

Allows to detect, if the processor has crashed and isexecuting illegal instructions.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (f) 6-48

Page 49: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Power Supply Monitoring

::::::::

Power:::::::::

supply:::::::::::::::

monitoring is not directly a faultdetection mechanism but a fault prevention mechanism.

One checks whether the power is in a certain range,especially above a certain limit (against to highvoltage usually overvoltage protection is sufficient).If it is out of range, the processor is instructed to takemeasurements in order to protect its data (and inorder not to produce incorrect output).

In some cases one needs an uninterruptible powersource using large-capacity batteries which are all thetime reloaded in order to provide power in the event of apower supply failure.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (f) 6-49

Page 50: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

(g) Hardware Fault Tolerance

There are 3 methods for obtaining hardware faulttolerance:

(i):::::::

Static::::::::::::::::

redundancy uses fault masking , whichmeans it hides faults, so that the system works stillcorrectly even if a fault occurs.(ii)

:::::::::::

Dynamic::::::::::::::::

redundancy uses fault detection . If afault occurs, the system reconfigures itself in order tonullify the effects of faults.(iii)

:::::::::

Hybrid:::::::::::::::

approaches use fault masking in orderto prevent errors from propagating through thesystem, and fault detection in order to reconfigurethe systems so that faulty units are removed from thesystem.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-50

Page 51: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

(i) Static RedundancyUse of voting mechanisms in order to compare theoutput of modules and mask the effects of faults.

We have seen already TMR (triple modularredundancy) systems (repeated on next slide)

TMR prevents single-point failures in the modules.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-51

Page 52: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Triple Modular Redundancy (TMR)

Input Voting Element

Module 1

Module 2

Module 3

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-52

Page 53: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Modular RedundancyOne problem is that sensors might fail.

Therefore one usually adds static redundancy tothe sensors using doubled or tripled sensors andvoting on them.Problem is that sensor data are digitised analoguedata,

which are therefore usually floating point values,which never coincide completely,which are usually taken at slightly different timesand physical locations, and are therefore neveridentical and usually not synchronised.

Voting therefore more complicated,and it might be impractical to perform it directly byhardware.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-53

Page 54: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Triplicated VotingOne problem is the danger of a failure of the votingmechanism itself.

In order to prevent single point failures in the votingelements, one can

:::::::::::

triplicate:::::

the:::::::::

voting , and pass the 3outputs of the voting elements on to the next modules.

See next slide.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-54

Page 55: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Module 1.1

Module 1.2

Module 1.3Input 3

Input 2

Input 1 Voter 1.1

Voter 1.2

Voter 1.3

Module 2.1

Module 2.2

Module 2.3

Voter 2.1

Voter 2.2

Voter 2.3

Output 1

Output 2

Output 3

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-55

Page 56: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Limitations of TMRHelps only against random failures, not againstsystematic failures, which usually affect all modulessimultaneously.

Doesn’t help against simultaneous failures of two ormore modules.

Necessary to add monitoring of discrepancies in thevoting, in order to detect failures and be able to removethem during maintenance.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-56

Page 57: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

N-Modular Redundancy

:::::::::::::

N-Modular:::::::::::::::::

Redundancy (::::::

NMR) uses N instead of 3modules and voting among those.

The system will be able to tolerate the failure of N−12

modules without producing a system failure.Disadvantage: Additional cost, size, weight andpower consumption.In practice the number of identical modules israrely greater than 4

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-57

Page 58: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Implementation of the VoterImplementation of the voting element by hardware :

If one has 3 inputs i1, i2 i3, the best out of the threeis obtained by the Boolean formula

(i1 ∧ i2) ∨ (i1 ∧ i3) ∨ (i2 ∧ i2)

This can be implemented by a very simple circuit.Therefore it can be designed in a highly reliable way.Problems

This arrangement doesn’t provide the possibilityfor monitoring of discrepancies.One often has a large amount of data to compare(not only one bit), therefore the voting elementscan become still complicated.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-58

Page 59: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Implementation of the VoterImplementation of the voting element by software .

Advantages:More complex voting possible.Monitoring of discrepancies easy.

Disadvantages:Much longer response times than hardware voting(which reacts with almost no delay).Particular problem since safety critical systems areoften real time systems (aerospace!!).Higher complexity of the underlying computersand the software results in lower reliability.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-59

Page 60: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

(ii) Dynamic RedundancyIn dynamic redundancy one uses one unit, which isnormally in use, and one or more standby systems ,which are available, if the main unit fails.

Requires less units than static redundancy (where allunits have at least to be tripled).Necessary to have good fault detection mechanisms,as discussed in Subsect. (f).

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-60

Page 61: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Standby Spare ArrangementIn a

::::::::::

standby::::::::

spare:::::::::::::::::

arrangement , one module isoperated with some fault detection mechanism (asdiscussed in Subsect. (f)).

Another module is on standby.

Unless a fault is detected, the output of the first moduleis taken as output of the system.

In case a fault is detected,the standby module is activated,the faulty module is deactivated,and the output is taken from the standby module.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-61

Page 62: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Standby Spare Arrangement

Operating Module

Standby Module

Fault Detector

Activation

SwitchInput

Output

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-62

Page 63: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Standby Spare ArrangementThe standby module can be

on::::::

cold:::::::::::

standby , which means it is switched off.Disadvantage:· In case of a fault the disruption is longer, since it

takes a while before the standby module isactivated.

· Fault detection mechanism cannot make use ofthe data processed by the standby module.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-63

Page 64: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Standby Spare ArrangementThe standby module can be

on::::

hot:::::::::::

standbyMeans that the standby module is effectivelyprocessing the input, but the output is dismissed.Disadvantage:· More power consumption.· The standby unit is subject to the same

operating stress as main module, therefore thelikelihood of simultaneous failures of the mainand the standby module is higher.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-64

Page 65: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Multiple standby modulesOne can extend the above by having more than onestandby module.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-65

Page 66: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Self-Checking PairsA

:::::::::::::::::

self-checking:::::::

pairs arrangement consists ofone main module,one checking module,

both of which receive the same input,a comparator, which checks, whether the output ofthe main module coincides with the output of thechecking module.

The output of the main module (not of the checkingmodule) and the result of the comparison are passedonwards.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-66

Page 67: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Self-Checking PairsIt is not part of the self-checking pairs arrangements toreconfigure itself or mask errors in case the output ofthe two modules doesn’t coincide.

Therefore the arrangement itself does not providefault tolerance, but only error detection.However, the output of the comparison can be usedfor instance in dynamic fault-tolerant systems inorder to carry out reconfiguration in a standby sparearrangement.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-67

Page 68: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Self-Checking Pairs

Main Module

Checking ModuleFailure detected

Output

Input

Comparator

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-68

Page 69: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Implementation of the ComparatorThe comparator can be implemented by hardware .

One can take the disjunction of the XOR of all theone-bit signal lines from both modules (see nextslide).

Note that the XOR of two signals is 1 iff the twosignals don’t coincide.The disjunction of the XORs is therefore true, iffthere was at least one disagreement between thebits of the signal, i.e. if there was a fault.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-69

Page 70: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Comparator (Hardware Implement.)

Main Module

Checking

Module

Output 1

Output 2

Output 3

Output 1

Output 2

Output 3

Fault Detected

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-70

Page 71: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Comparator, Hardware Impl.The advantages/disadvantages of the hardwareimplementation are similar to those of the TMR.In order to cover against a single-point failure in thecomparator, one can duplicate the comparator andtake the disjunction of the results of all thecomparators.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-71

Page 72: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Implementation of the ComparatorThe comparator can as well be implemented bysoftware , in case the modules include a processor.

Then one can add a dual port memory, in which theoutput of both modules is written.Then both processors compare after they havecomputed their results these results with the resultobtained by the other processor.

Should be done by both processors in order toprotect against a single-point failures during thecomparison phase.

If there is a discrepancy, then a fail signal is outputon a special line.The disjunction of the two fail signals indicates that afailure has been detected.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-72

Page 73: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Comparator (Software Implement.)

Main Module

Checking Module

(With Processor and Memory)

Memory)(With Processor and

Dual Port Memory

Output

Fail

Fail

Fail

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-73

Page 74: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

(iii) Hybrid RedundancyUse of a combination of voting, fault detection andmodule switching.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-74

Page 75: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

N-Modular Redundancy with SparesUse of N modules plus M spares connected to a voter.

Initially N modules take input in parallel, and theirresults are compared.In case there is no disagreements, the result ispassed on as output.In case of a disagreement,

the output given by the majority of the activemodules is passed on as output,the faulty module is removed,one of the spare modules is activated,and afterwards the system continues using the Nmain modules (of which one is the spare one) andM-1 spares.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-75

Page 76: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

N-Modular Redundancy with Spares

����

����

����

����

����

����

����

����

����

����

����

����

����

����

����

Main Module 1

Main Module 2

Main Module N

Spare Module 1

Spare Module 2

Spare Module M

Switch VoterOutput

DisagreementDetector

Input

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-76

Page 77: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

N-Modular Redundancy with Spares

System tolerates up to N−12 simultaneous faults in the

main modules, and can compensate up to M faults byusing the spare modules.

Analysis:Problem:Doesn’t tolerate single-point failures in the switch,the voter and the disagreement detector.Advantage:A good compromise between the advantages of

static redundancy· immediate fault maskingand dynamic redundancy· removal of faulty modules,· monitoring of faults in the system.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-77

Page 78: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Module SynchronisationIn all the above techniques one needs to compare theoutputs of different modules.

Problem: If the modules don’t share the same clocktheir output will not be synchronised.

One solution is that all processors share the sameclock.

Then the modules are said to be in:::::

lock:::::::

step .Problem: Single-point failures in the clock are nottolerated and not detected.

Otherwise one needs to construct the voting andfault detection mechanisms in such a way thatproblems with synchronisation are taken intoaccount.

Easier, if this is done by software.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (g) 6-78

Page 79: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

(h) Software Fault Tolerance

In this subsection we will consider how to tolerate faultsby using software.

We will consider two techniques:(i) N-version programming.(ii) Recovery blocks.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (h) 6-79

Page 80: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

(i) N-Version ProgrammingN-version programming means that N different versionsof the software are written.

All fulfil the same specification.Written by different teams or companies, and maybeusing different tools, languages or techniques.

The N versions are run on the same input dataon the same processor (interleaved)or on separate processors (operating in parallel).

The results are compared using a softwareimplementation of one of the techniques introduced inthe section on hardware redundancy.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (h) 6-80

Page 81: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

N-Version ProgrammingProblem of N-version programming:

Additional development costs.Additional processing power needed for runningseveral versions of the program and for the votingprocess.

N-version programming only used for very criticalapplications, e.g.

Airbus 330/340,Space shuttle.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (h) 6-81

Page 82: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

(ii) Recovery Blocks

::::::::::::

Recovery::::::::

block:::::::::::::

technique based on acceptancetests .

Acceptance tests are software versions of faultdetection.Acceptance test check the consistency of the outputof one software module (a unit inside the programlike a procedure, package or a class).

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (h) 6-82

Page 83: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Examples of Acceptance TestsCheck whether output is within certain boundaries.

Check for run time errors (e.g. arithmetic over- orunderflow),

Check for excessive execution time.

Check for arithmetic correctness by reversing thecomputation.

E.g., if a module computes the square root, theacceptance test checks whether the square of theoutput is equal to the input.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (h) 6-83

Page 84: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Recovery Blocks (Cont.)For critical modules, several implementations aredeveloped, of which one is the main one.

Then the following is executed:Execute the main module.Carry out acceptance test.If this fails, switch to an alternative module.Carry out acceptance test.If this fails, switch to the second alternative module.Etc.If everything fails, raise an error.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (h) 6-84

Page 85: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Recovery BlocksProblem: if one module fails, one has to make sure thatany side-effects carried out by it are reversed.

Done by establishing a recovery point at the beginningof the module, and by introducing a mechanism in orderto make sure that one can switch back to the recoverypoint in case of failure of the acceptance test.

Storing the state of all variables when reaching therecovery point too expensive and time consuming ingeneral.

Instead one stores only those variables which arepossibly changed by the module.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (h) 6-85

Page 86: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Recovery BlocksIt was suggested to provide a special languageconstruct as follows:

ensure <acceptance test>by <main module>

else by <alternative module 1>

else by <alternative module 2>

· · ·

else by <alternative module n>

else error

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (h) 6-86

Page 87: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

(i) Fault-Tolerant Architectures

The use of:::::::::::::::::

Fault-tolerant::::::::::::::::::

architectures means that weuse for the critical parts non-computer-basedmechanisms as additional safe guards.

Example on next slide:Use of a computerised control method in order tocontrol the pump which feeds toxic liquid into a tank.Danger is that the tank overflows and the toxic liquidis spilled.The computerised control system is safety critical,and needs to be implemented with higheststandards.

Very expensive to implement this.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (i) 6-87

Page 88: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Fully Computerised Solution

Pump

Further processing

Tank

Depth gaugeControlsystem

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (i) 6-88

Page 89: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Better SolutionA better solution is to

keep the computerised control system,but add an additional non-computerised float switch,which directly switches off the pump in case the tankis full.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (i) 6-89

Page 90: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Better Solution

Pump

Further processing

Tank

Depth gaugeControlsystem

Float Switch

Switch

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (i) 6-90

Page 91: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Better SolutionResult is that

there are two independent systems which control thetank,the non-computerised float-switch can much moreeasily be designed to meet high reliability standards.

Often one can, by using simple (usuallynon-computerised) safeguards, obtain a much higherdegree of safety than using complex computerisedsolutions.

This reduces as well the cost.

The example demonstrates that one can combine acomplex computer system with a relatively simplesafety control system.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (i) 6-91

Page 92: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Better SolutionOne can even achieve with very little extra cost an evenhigher degree of safety by adding two or more floatswitches, which independently switch of the pump.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (i) 6-92

Page 93: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

(j) Example: The Space Shuttle

At certain stages of the flight many flight-criticalfunctions of the space shuttle are totally dependent onits on-board computers.

On this relythe lives of the crew,the vehicle, which costs several billion dollars,the national prestige of U.S.

Therefore the application has been developed up tohighest standards w.r.t. reliability, integrity, availabilityand fault tolerance.

A combination of redundancy, hardware and softwarevoting, fault masking, fault tolerance and designdiversity was used in order to achieve this.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (j) 6-93

Page 94: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Architecture of the Comp. SystemThe computer system consists of 5 identical computers.

Conventional processors are used.

They are connected by using an array of serial buses.5 buses provide communication between thecomputers.23 buses link the computers to other subsystems.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (j) 6-94

Page 95: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Architecture of the Comp. System

Memory Displays Sensors TelemetryControlPanels

BoostersPayloadsTelecomm−unication

CPU 1 CPU 2 CPU 4CPU 3 CPU 523 SerialData Buses

5 SerialInterprocessorBuses

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (j) 6-95

Page 96: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Explanation of Terminology

::::::::::

Payload seems to mean control of the load the spaceshuttle is carrying.

::::::::::::

Boosters = auxiliary rocket or engine for additionalspeed.

:::::::::::::

Telemetry = Process of recording the reading ofinstruments and transmitting them by radio.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (j) 6-96

Page 97: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Architecture of the Comp. SystemMost fault detection is performed using software ratherthan hardware techniques.

The configuration of the computers is controlled bysoftware.

During critical phases, 4 of the 5 computers areconfigured in a 4-modular redundancy (NMR)arrangement.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (j) 6-97

Page 98: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Architecture of the Comput. SystemThe 4 computers get similar input data and compute thesame functions.

Hardware voting is preformed by the actuators inorder to provide fault masking.Additionally, each processor

has extensive self-test facilities.· If an error is detected, it is reported to the crew,

which then can switch off the faulty unit.compares its results with those produced by itsneighbours.· If a processor detects a disagreement, it signals

this, and voting is used in order to remove theoffending computer.

has a watchdog timer, in order to detect crashes.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (j) 6-98

Page 99: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Architecture of the Comput. SystemThe 5th computer normally performs non-criticalfunctions (e.g. communications).

It contains additionally a diverse implementation of the(critical) flight control software, produced by a differentcontractor, which can be used in an emergency (seebelow).

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (j) 6-99

Page 100: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Architecture of the Comput. SystemIf one processor is switched off, one obtains a triplemodular redundancy arrangement (TMR).

If a second processor is switched off, the system isswitched into duplex mode, where the two computerscompare their results in order to detect any furtherfailure.

In case of a third failure, the system reports theinconsistencies to the crew and uses fault detectiontechniques in order to identify the offending unit.

This provides therefore protection against failures oftwo units and fault detection and limited faulttolerance against the failure of a third unit.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (j) 6-100

Page 101: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

Architecture of the Comput. SystemIn an emergency, the fifth computer can take overcritical functions

The 5th computer allows protection againstsystematic faults in the software.

If one or two computers fail, the crew or the controllerson earth might decide to abort the mission.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (j) 6-101

Page 102: 6. Fault Tolerance - Swanseacsetzer/lectures/critsys/09/critsys... · 2009-11-12 · Fault tolerance aims at designing a system in such a way that faults do not result in system failure.

AnalysisThe arrangement provides excellent fault tolerance ,and protection against systematic faults .

Problems:Heavy dependency on software for voting, faultdetection, configuration control.

Therefore heavy dependency on the correctdesign of this software .

The 5 computers are identical, therefore noprotection against systematic hardware faultsaffecting all 5 computers simultaneously.

In total the design is still very good.Due to the high costs this degree of redundancy canonly be obtained for very critical applications.

CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sec. 6 (j) 6-102