Introduction to Fault Tolerance

8/10/2019 Introduction to Fault Tolerance


BY: ANKIT BHATT
ME-VLSI & EMBEDDED



What Is Failure?

o A system is said to fail when it cannot meet its promises.
o A failure is brought about by the existence of errors in the system.
o The cause of an error is called a fault.



Concept of Fault Tolerance

o Hardware, software and networks cannot be totally free from failures.
o Fault tolerance is a non-functional (QoS) requirement that requires a system to continue to operate, even in the presence of faults.
o Fault tolerance should be achieved with minimal involvement of users or system administrators.
o Distributed systems can be more fault tolerant than centralized systems, but with more processor hosts the occurrence of individual faults is generally likely to be more frequent.



Attributes, Consequences and Strategies

Attributes (what is a dependable system?)
o Availability
o Reliability
o Safety
o Confidentiality
o Integrity
o Maintainability

Consequences (how to distinguish faults?)
o Fault
o Error
o Failure

Strategies (how to handle faults?)
o Fault prevention
o Fault tolerance
o Fault recovery
o Fault forecasting




Terminology of Fault Tolerance

Fault --(causes)--> Error --(results in)--> Failure

o A fault is a defect within the system.
o An error is observed as a deviation from the expected behaviour of the system.
o A failure occurs when the system can no longer perform as required (does not meet spec).
o Fault tolerance is the ability of a system to provide a service, even in the presence of errors.



Strategies to Handle Faults

Actions to identify and remove errors:
o Design reviews
o Testing
o Use of certified tools

Analysis:
o Hazard analysis
o Formal methods (proof and refinement)

No non-trivial system can be guaranteed free from error. We must have an expectation of failure and make appropriate provision.

o Fault avoidance: techniques that aim to prevent faults from entering the system during the design stage.
o Fault removal: methods that attempt to find faults within a system before it enters service.
o Fault detection: techniques used during service to detect faults within the operational system.
o Fault tolerance: techniques designed to tolerate faults, i.e. to allow the system to operate correctly in the presence of faults.



Fault Models

o A fault model identifies targets for testing.
o A fault model makes analysis possible.
o Effectiveness is measurable by experiments.

Different types:
o Stuck-at faults
o Multiple stuck-at faults
o Bridging faults



Single Stuck-At Fault

Single (line) stuck-at fault: the given line has a constant value (0 or 1), independent of other signal values in the circuit.

Properties
o Only one line is faulty
o The faulty line is permanently set to 0 or 1
o The fault can be at an input or output of a gate
o The simple logical model is independent of technology
o It reduces the complexity of fault detection
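
The single stuck-at model above can be tried out on a small circuit. The sketch below is only illustrative: the circuit y = (a AND b) OR c and the line names a, b, c, n1, y are assumptions made for this example, not taken from the slides. It forces one line at a time to 0 or 1 and lists the input vectors whose output differs from the fault-free circuit.

```python
from itertools import product

# Hypothetical circuit for illustration: y = (a AND b) OR c.
# Lines: inputs a, b, c; internal line n1 (AND output); output y.
LINES = ["a", "b", "c", "n1", "y"]


def simulate(a, b, c, fault=None):
    """Evaluate the circuit; fault is None or a (line, stuck_value) pair."""

    def force(line, value):
        # A stuck-at line ignores the value that drives it.
        if fault is not None and fault[0] == line:
            return fault[1]
        return value

    a, b, c = force("a", a), force("b", b), force("c", c)
    n1 = force("n1", a & b)
    return force("y", n1 | c)


def detecting_vectors(fault):
    """Input vectors (a, b, c) on which the faulty output differs from the good one."""
    return [v for v in product((0, 1), repeat=3)
            if simulate(*v) != simulate(*v, fault=fault)]


# Each line can be stuck at 0 or stuck at 1: 2 faults per line.
single_faults = [(line, value) for line in LINES for value in (0, 1)]

for line, value in single_faults:
    print(f"{line} stuck-at-{value}: detected by {detecting_vectors((line, value))}")
```

In this example each of the 10 single faults is detected by at least one input vector, which is what makes the model a practical target list for test generation.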



Example:

XOR circuit has 12 fault sites and 24 single stuck-at faults



Multiple Stuck-At Faults

Multiple stuck-at fault: several single stuck-at faults occur at the same time.

Multiple stuck-at faults are usually not considered in practice, for two reasons:
o The number of multiple stuck-at faults in a circuit with k lines is 3^k - 1, which is too large a number even for circuits of moderate size
o Tests for single stuck-at faults are known to cover a very high percentage (greater than 99.6%) of multiple stuck-at faults when the circuit is large and has several outputs
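
The 3^k - 1 count follows from each line being in one of three states (fault-free, stuck-at-0, stuck-at-1), minus the single case where every line is fault-free. A minimal sketch of the arithmetic, with function names chosen for this example only; the brute-force enumeration is there just to cross-check the formula on a small k.

```python
from itertools import product


def single_stuck_at_count(k):
    """Two single stuck-at faults (s-a-0, s-a-1) per line."""
    return 2 * k


def multiple_stuck_at_count(k):
    """Three states per line, excluding the all-fault-free combination."""
    return 3 ** k - 1


def enumerate_fault_combinations(k):
    """Brute-force list of every combination with at least one faulty line."""
    states = (None, 0, 1)  # None = fault-free, 0 = stuck-at-0, 1 = stuck-at-1
    return [combo for combo in product(states, repeat=k)
            if any(state is not None for state in combo)]


# Cross-check the closed form against enumeration for a small circuit.
assert len(enumerate_fault_combinations(4)) == multiple_stuck_at_count(4)

# For the 12 fault sites of the XOR example:
print(single_stuck_at_count(12))    # 24
print(multiple_stuck_at_count(12))  # 531440
```

Even at 12 lines the combined fault list is more than twenty thousand times larger than the single-fault list, which illustrates the first reason given above.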



Bridging Fault

Two or more normally distinct points (lines) are shorted together.

Two types of bridging faults:
o Input bridging: can form wired logic or a voting model.
o Feedback (input-to-output) bridging: can introduce feedback, and can cause oscillation or latching.
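
The wired-logic view of input bridging can be sketched as follows. This is an illustrative model under stated assumptions: the circuit is a hypothetical 2-to-1 multiplexer, the bridge is between its data inputs a and b, and the shorted lines are assumed to settle to either the AND or the OR of the two driven values. Feedback bridging is not modeled here.

```python
from itertools import product


def mux(a, b, c):
    """Hypothetical fault-free circuit: y = a when c == 1, otherwise y = b."""
    return (a & c) | (b & (1 - c))


def bridged_mux(a, b, c, model):
    """Same circuit with inputs a and b shorted together."""
    if model == "wired-AND":
        shorted = a & b   # a 0 on either line pulls both lines to 0
    elif model == "wired-OR":
        shorted = a | b   # a 1 on either line pulls both lines to 1
    else:
        raise ValueError(f"unknown bridging model: {model}")
    return mux(shorted, shorted, c)


def bridge_tests(model):
    """Input vectors (a, b, c) that expose the bridge under the given model."""
    return [v for v in product((0, 1), repeat=3)
            if mux(*v) != bridged_mux(*v, model=model)]


print("wired-AND:", bridge_tests("wired-AND"))
print("wired-OR: ", bridge_tests("wired-OR"))
```

In this example the two models are exposed by different input vectors, so a test set chosen for one wired-logic assumption does not automatically cover the other.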



Transistor Fault

o A MOS transistor is considered an ideal switch.
o Two types of faults are modeled:
    Stuck-open: a single transistor is permanently stuck in the open state. This can turn the circuit into a sequential one, so a sequence of at least 2 tests is needed to detect a single fault.
    Stuck-on: a single transistor is permanently shorted irrespective of its gate voltage.
o Detection of a stuck-open fault requires two vectors.
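
Why two vectors are needed can be seen in a switch-level sketch of a 2-input CMOS NAND gate. The model below is an idealized assumption (perfect switches, no leakage or charge sharing), and the transistor names PA, PB, NA, NB are invented for this example. When neither network conducts, the output node floats and keeps its previous value, which is the memory effect that makes the faulty gate behave sequentially.

```python
class CmosNand:
    """Switch-level 2-input CMOS NAND with an optional stuck-open transistor.

    PA and PB are the parallel pull-up PMOS devices (on when their input is 0).
    NA and NB are the series pull-down NMOS devices (on when their input is 1).
    """

    def __init__(self, stuck_open=None, initial_out=0):
        self.stuck_open = stuck_open   # name of the open transistor, or None
        self.out = initial_out         # charge currently held on the output node

    def _conducts(self, name, turned_on):
        # A stuck-open transistor never conducts, whatever its gate voltage.
        return turned_on and name != self.stuck_open

    def apply(self, a, b):
        pull_up = self._conducts("PA", a == 0) or self._conducts("PB", b == 0)
        pull_down = self._conducts("NA", a == 1) and self._conducts("NB", b == 1)
        if pull_up:
            self.out = 1
        elif pull_down:
            self.out = 0
        # Otherwise the output floats and retains its previous value.
        return self.out


def responses(vectors, stuck_open=None, initial_out=0):
    """Apply a sequence of (a, b) vectors and collect the outputs."""
    gate = CmosNand(stuck_open, initial_out)
    return [gate.apply(a, b) for a, b in vectors]


# One vector is not enough: if the node already holds 1, the fault is hidden.
print(responses([(0, 1)], initial_out=1), responses([(0, 1)], "PA", initial_out=1))

# Two vectors: (1, 1) initializes the output to 0, then (0, 1) should raise it to 1.
two_vector_test = [(1, 1), (0, 1)]
print(responses(two_vector_test), responses(two_vector_test, "PA"))
```

The first vector sets the output to a known value and the second one tries to change it through the transistor under test; the fault shows up as an output that fails to change.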



Example of Transistor Stuck-Open Fault



Hardware Faults Classification

Three types of faults:
o Transient faults: disappear after a relatively short time. Example: a memory cell whose contents are changed spuriously due to some electromagnetic interference. Overwriting the memory cell with the right content will make the fault go away.
o Permanent faults: never go away; the component has to be repaired or replaced.
o Intermittent faults: cycle between active and benign states. Example: a loose connection.



Fault Tolerance Techniques

o Hardware Redundancy
o Software Redundancy
o Information Redundancy
o Time Redundancy



Hardware Redundancy

Extra hardware is added to override the effects of a failed component.

o Static hardware redundancy: for immediate masking of a failure. Example: use three processors and vote on the result. The wrong output of a single faulty processor is masked.
o Dynamic hardware redundancy: spare components are activated upon the failure of a currently active component.
o Hybrid hardware redundancy: a combination of static and dynamic redundancy techniques.
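
The three-processor voting example (static redundancy) can be sketched in software as a majority voter. This is a minimal illustration, not a hardware design: the three "processors" are plain Python functions, and the faulty one carries a deliberately injected error.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by more than half of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise RuntimeError("no majority: too many modules disagree")
    return value


def run_redundant(modules, x):
    """Feed the same input to every module and vote on the results."""
    return majority_vote([module(x) for module in modules])


def healthy_processor(x):
    return x * x


def faulty_processor(x):
    return x * x + 1   # injected fault, for illustration only


# The single wrong output is outvoted by the two correct ones.
print(run_redundant([healthy_processor, faulty_processor, healthy_processor], 7))  # 49
```

Masking is immediate because no fault detection or reconfiguration step is needed; the voter simply ignores the minority output. With three modules only one faulty module can be masked.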



Software Redundancy

o Multiple teams of programmers write different versions of software for the same function.
o The hope is that such diversity will ensure that not all the copies will fail on the same set of input data.
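
A small sketch of the idea, assuming three hypothetical teams each wrote a "largest element" routine in a different way. The third version contains a deliberately injected bug (it never inspects the last element) so that the versions disagree on some inputs but not on others. A crash in one version is treated as a missing vote.

```python
from collections import Counter


def largest_v1(xs):
    """Team 1: relies on the built-in."""
    return max(xs)


def largest_v2(xs):
    """Team 2: explicit scan over all elements."""
    best = xs[0]
    for x in xs[1:]:
        if x > best:
            best = x
    return best


def largest_v3(xs):
    """Team 3: index-based scan with an injected off-by-one bug."""
    best = xs[0]
    for i in range(1, len(xs) - 1):   # bug: the last element is never checked
        if xs[i] > best:
            best = xs[i]
    return best


def n_version_execute(versions, data):
    """Run every version on the same input and return the majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(data))
        except Exception:
            pass   # a crashing version simply casts no vote
    if not results:
        raise RuntimeError("all versions failed")
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(versions):
        raise RuntimeError("no majority among versions")
    return value


VERSIONS = [largest_v1, largest_v2, largest_v3]
print(n_version_execute(VERSIONS, [1, 5, 3]))   # all three versions agree
print(n_version_execute(VERSIONS, [1, 3, 5]))   # version 3 is wrong and is outvoted
```

The scheme only helps when the versions fail on different inputs; if two teams make the same mistake, the vote returns the wrong answer.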



Information Redundancy

o Add check bits to the original data bits so that an error in the data bits can be detected and even corrected.
o Error-detecting and error-correcting codes have been developed and are being used.
o Information redundancy often requires hardware redundancy to process the additional check bits.
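
As one concrete instance of check bits, the sketch below uses a single even-parity bit for detection and a Hamming(7,4) code for single-error correction. The slides do not name a specific code, so the choice of Hamming(7,4) and the function names are assumptions for this example.

```python
def even_parity_bit(bits):
    """Check bit that makes the total number of 1s even (detects any odd number of flips)."""
    return sum(bits) % 2


def hamming74_encode(data):
    """Encode 4 data bits as [p1, p2, d1, p4, d2, d3, d4] (positions 1 to 7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit; return (data bits, error position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
sent = hamming74_encode(word)
received = list(sent)
received[4] ^= 1                      # flip the bit at position 5 in transit
decoded, position = hamming74_decode(received)
print(sent, received, decoded, position)
```

Three check bits on four data bits is the price for correction here; a single parity bit is cheaper but can only report that something went wrong. This code corrects one flipped bit per codeword; two flips are miscorrected.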



Time Redundancy

o Provide additional time during which a failed execution can be repeated.
o Most failures are transient: they go away after some time.
o If enough slack time is available, the failed unit can recover and redo the affected computation.
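
Repeating a failed execution can be sketched as a bounded retry loop. The names below (`TransientError`, `run_with_retries`, `make_flaky`) are hypothetical helpers written for this example; the attempt limit stands in for the available slack time.

```python
import time


class TransientError(Exception):
    """Signals a failure that is expected to go away on its own."""


def run_with_retries(operation, max_attempts=3, delay_s=0.0):
    """Repeat a failed execution until it succeeds or the slack is used up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise              # no slack left: report the failure
            time.sleep(delay_s)    # give the transient fault time to clear


def make_flaky(failures_before_success):
    """Build an operation that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("transient glitch")
        return "done"

    return operation


# Two transient failures fit within three attempts, so the third attempt succeeds.
print(run_with_retries(make_flaky(2), max_attempts=3))   # done
```

Retrying only helps against transient faults; a permanent fault fails on every attempt and simply uses up the slack.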



THANK YOU
