Introduction to Fault Tolerance

  • 8/10/2019 Introduction to Fault Tolerance

    BY: ANKIT BHATT, ME-VLSI & EMBEDDED

    What Is Failure?

    A system is said to fail when it cannot meet its promises.

    A failure is brought about by the existence of errors in the system.

    The cause of an error is called a fault.

    Concept of Fault Tolerance

    Hardware, software and networks cannot be totally free from failures.

    Fault tolerance is a non-functional (QoS) requirement that requires a system to continue to operate, even in the presence of faults.

    Fault tolerance should be achieved with minimal involvement of users or system administrators.

    Distributed systems can be more fault tolerant than centralized systems, but with more processor hosts the occurrence of individual faults is generally likely to be more frequent.

    Attributes, Consequences and Strategies

    Attributes (What is a dependable system?)

        Availability
        Reliability
        Safety
        Confidentiality
        Integrity
        Maintainability

    Consequences (How to distinguish faults?)

        Fault
        Error
        Failure

    Strategies (How to handle faults?)

        Fault prevention
        Fault tolerance
        Fault recovery
        Fault forecasting

    Terminology of Fault Tolerance

    Fault --(causes)--> Error --(results in)--> Failure

    Fault is a defect within the system.

    Error is observed by a deviation from the expected behaviour of the system.

    Failure occurs when the system can no longer perform as required (does not meet spec).

    Fault tolerance is the ability of a system to provide a service, even in the presence of errors.

    Strategies to Handle Faults

    Fault avoidance: techniques aim to prevent faults from entering the system during the design stage.

    Fault removal: methods attempt to find faults within a system before it enters service.

    Fault detection: techniques used during service to detect faults within the operational system.

    Fault tolerance: techniques designed to tolerate faults, i.e. to allow the system to operate correctly in the presence of faults.

    Actions to identify and remove errors:

        Design reviews
        Testing
        Use certified tools

    Analysis:

        Hazard analysis
        Formal methods (proof and refinement)

    No non-trivial system can be guaranteed free from error. We must have an expectation of failure and make appropriate provision.

    Fault Models

    A fault model identifies targets for testing.

    A fault model makes analysis possible.

    Effectiveness is measurable by experiments.

    Different types:

        Stuck-at faults
        Multiple stuck-at faults
        Bridging faults

    Single Stuck-At Fault

    Single (line) stuck-at fault: the given line has a constant value (0/1) independent of other signal values in the circuit.

    Properties

    o Only one line is faulty

    o The faulty line is permanently set to 0 or 1

    o The fault can be at an input or output of a gate

    o Simple logical model is independent of technology

    o It reduces the complexity of fault detection
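
    The single stuck-at model can be sketched as a small fault simulator. The circuit used here (d = a AND b, z = d OR c) and the names `simulate`, `FAULTS` and `detecting_vectors` are assumptions made for illustration; they do not come from the slides. The sketch forces one line at a time to a constant value and lists the input vectors for which the faulty output differs from the fault-free output.

```python
from itertools import product

# Hypothetical circuit for illustration: d = a AND b, z = d OR c.
# It has no fanout, so its 5 lines (a, b, c, d, z) are the 5 fault sites.
LINES = ("a", "b", "c", "d", "z")


def simulate(a, b, c, fault=None):
    """Evaluate the circuit; `fault` is None or a (line, stuck_value) pair."""
    def line(name, value):
        # A faulty line ignores its driven value and holds the stuck value.
        if fault is not None and fault[0] == name:
            return fault[1]
        return value

    a, b, c = line("a", a), line("b", b), line("c", c)
    d = line("d", a & b)
    return line("z", d | c)


# Every line can be stuck at 0 or stuck at 1: two faults per fault site.
FAULTS = [(name, value) for name in LINES for value in (0, 1)]


def detecting_vectors(fault):
    """Input vectors whose output differs between good and faulty circuit."""
    return [vec for vec in product((0, 1), repeat=3)
            if simulate(*vec) != simulate(*vec, fault=fault)]


if __name__ == "__main__":
    for fault in FAULTS:
        print(f"{fault[0]} stuck-at-{fault[1]}: {detecting_vectors(fault)}")
```

    Because only one line is faulty at a time, the fault list grows linearly with the number of lines (two faults per site), which is what keeps the model tractable.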

    Example:

    XOR circuit has 12 fault sites and 24 single stuck-at faults

    Multiple Stuck-At Faults

    Multiple stuck-at fault: several single stuck-at faults occur at the same time.

    Multiple stuck-at faults are usually not considered in practice, for two reasons:

    o The number of multiple stuck-at faults in a circuit with k lines is 3^k - 1, which is too large a number even for circuits of moderate size

    o Tests for single stuck-at faults are known to cover a very high percentage (greater than 99.6%) of multiple stuck-at faults when the circuit is large and has several outputs
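
    The count 3^k - 1 follows from each of the k lines being in one of three states (fault-free, stuck-at-0, stuck-at-1), minus the one combination in which every line is fault-free. The sketch below checks this by brute-force enumeration for small k; the function name `count_multiple_stuck_at` is an assumption made for illustration.

```python
from itertools import product


def count_multiple_stuck_at(k):
    """Count fault combinations by brute force for a circuit with k lines."""
    # Each line is in one of three states: fault-free, stuck-at-0, stuck-at-1.
    states = ("ok", "sa0", "sa1")
    combos = product(states, repeat=k)
    # Exclude the single combination in which every line is fault-free.
    return sum(1 for combo in combos if any(s != "ok" for s in combo))


if __name__ == "__main__":
    for k in range(1, 8):
        print(k, count_multiple_stuck_at(k), 3**k - 1)
```

    For a circuit with 12 lines this gives 3^12 - 1 = 531,440 fault combinations, against only 24 single stuck-at faults.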

    Bridging Fault

    Two or more normally distinct points (lines) are shorted together.

    Two types of bridging faults:

    Input bridging

        Can form wired logic or voting model.

    Feedback (input-to-output) bridging

        Can introduce feedback.

        Can cause oscillation or latching.
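
    Input bridging under a wired-logic model can be sketched as follows. The circuit (z = a OR (b AND c)), the choice of shorting inputs a and b, and the names `good_circuit`, `bridged_circuit` and `detecting_vectors` are assumptions made for illustration. Feedback bridging is not modelled here, since it needs a simulator that handles state.

```python
from itertools import product


def good_circuit(a, b, c):
    """Hypothetical fault-free circuit: z = a OR (b AND c)."""
    return a | (b & c)


def bridged_circuit(a, b, c, model):
    """Same circuit with input lines a and b shorted together."""
    if model == "wired-and":
        shorted = a & b   # a 0 on either line pulls both lines to 0
    elif model == "wired-or":
        shorted = a | b   # a 1 on either line pulls both lines to 1
    else:
        raise ValueError(f"unknown bridging model: {model}")
    return good_circuit(shorted, shorted, c)


def detecting_vectors(model):
    """Input vectors on which the bridged circuit gives a wrong output."""
    return [vec for vec in product((0, 1), repeat=3)
            if good_circuit(*vec) != bridged_circuit(*vec, model=model)]


if __name__ == "__main__":
    for model in ("wired-and", "wired-or"):
        print(model, detecting_vectors(model))
```

    Only vectors that drive the two shorted lines to different values can expose the bridge, and which of those vectors work depends on the wired-logic model assumed.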

    Transistor Fault

    o MOS transistor is considered an ideal switch.

    o Two types of faults are modeled:

    Stuck-open - A single transistor is permanently stuck in the open state. This turns the circuit into a sequential one, so a sequence of at least 2 tests is needed to detect a single fault.

    Stuck-on - A single transistor is permanently shorted irrespective of its gate voltage.

    o Detection of a stuck-open fault requires two vectors.
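
    Why a stuck-open fault needs two vectors can be sketched with a switch-level model of a 2-input CMOS NAND gate. The gate choice, the transistor names ("pA", "pB" for the parallel pull-up pMOS devices, "nA", "nB" for the series pull-down nMOS devices) and the helper `detects` are assumptions made for illustration. When neither network conducts, the output node floats and keeps its previous value, which is the memory behaviour the slide refers to.

```python
class CmosNand:
    """Switch-level 2-input CMOS NAND with an optional stuck-open transistor."""

    def __init__(self, stuck_open=None, initial_output=0):
        self.stuck_open = stuck_open
        self.output = initial_output  # charge held on the output node

    def _on(self, name, conducting):
        # A stuck-open transistor never conducts, whatever its gate sees.
        return conducting and name != self.stuck_open

    def apply(self, a, b):
        pull_up = self._on("pA", a == 0) or self._on("pB", b == 0)
        pull_down = self._on("nA", a == 1) and self._on("nB", b == 1)
        if pull_up:
            self.output = 1
        elif pull_down:
            self.output = 0
        # Otherwise the node floats and keeps its previous value.
        return self.output


def detects(test_sequence, stuck_open):
    """True if the sequence exposes the fault for every initial output value."""
    for initial in (0, 1):
        good = CmosNand(initial_output=initial)
        bad = CmosNand(stuck_open=stuck_open, initial_output=initial)
        good_out = [good.apply(a, b) for a, b in test_sequence]
        bad_out = [bad.apply(a, b) for a, b in test_sequence]
        if good_out == bad_out:
            return False
    return True


if __name__ == "__main__":
    print(detects([(0, 1)], "pA"))          # single vector
    print(detects([(1, 1), (0, 1)], "pA"))  # initialise, then test
```

    In this model the first vector sets the output to a known value and the second asks the faulty transistor to change it; a single vector is not enough because the unknown initial charge may happen to match the expected output.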

    Example of Transistor Stuck-Open Fault

    Hardware Faults Classification

    Three types of faults:

    Transient faults - disappear after a relatively short time.
    Example: a memory cell whose contents are changed spuriously due to some electromagnetic interference. Overwriting the memory cell with the right content will make the fault go away.

    Permanent faults - never go away; the component has to be repaired or replaced.

    Intermittent faults - cycle between active and benign states.
    Example: a loose connection.

    Fault Tolerance Techniques

    Hardware Redundancy

    Software Redundancy

    Information Redundancy

    Time Redundancy

    Hardware Redundancy

    Extra hardware is added to override the effects of a failed component.

    Static Hardware Redundancy - for immediate masking of a failure.
    Example: use three processors and vote on the result. The wrong output of a single faulty processor is masked.

    Dynamic Hardware Redundancy - spare components are activated upon the failure of a currently active component.

    Hybrid Hardware Redundancy - a combination of static and dynamic redundancy techniques.
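
    The three-processor voting example (static redundancy) can be sketched with a majority voter. The function names `majority_vote` and `tmr`, and the use of plain Python callables to stand in for processors, are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many modules disagree")
    return value


def tmr(modules, x):
    """Run three redundant modules on the same input and vote on the result."""
    return majority_vote([module(x) for module in modules])


if __name__ == "__main__":
    healthy = lambda x: x * x
    faulty = lambda x: x * x + 1      # simulated faulty processor
    print(tmr([healthy, faulty, healthy], 7))
```

    The voter masks one faulty module immediately, without first locating it. It cannot help if two modules fail in the same way, and the voter itself is a single point of failure in this sketch.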

    Software Redundancy

    Multiple teams of programmers write different versions of software for the same function.

    The hope is that such diversity will ensure that not all the copies will fail on the same set of input data.
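
    Design diversity can be sketched with three independently written versions of one function, here an integer square root, combined by a vote. The choice of function, the three implementations and the name `n_version_isqrt` are assumptions made for illustration. The floating-point version is correct for small inputs but loses precision for very large ones, so it stands in for a version that fails on part of the input space.

```python
import math
from collections import Counter


def isqrt_newton(n):
    """Version 1: integer Newton iteration."""
    if n < 2:
        return n
    x, y = n, (n + 1) // 2
    while y < x:
        x, y = y, (y + n // y) // 2
    return x


def isqrt_bisect(n):
    """Version 2: binary search, keeping lo*lo <= n < hi*hi."""
    lo, hi = 0, n + 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid
    return lo


def isqrt_float(n):
    """Version 3: floating point; inexact for very large n."""
    return int(math.sqrt(n))


def n_version_isqrt(n):
    """Run all versions and return the majority answer."""
    results = [f(n) for f in (isqrt_newton, isqrt_bisect, isqrt_float)]
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError(f"versions disagree: {results}")
    return value


if __name__ == "__main__":
    n = 2**60 - 1
    print(isqrt_float(n), n_version_isqrt(n))
```

    The vote only helps when the versions fail on different inputs; if two teams make the same mistake, the wrong answer wins.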

    Information Redundancy

    Add check bits to original data bits so that an error in the data bits can be detected and even corrected.

    Error detecting and correcting codes have been developed and are being used.

    Information redundancy often requires hardware redundancy to process the additional check bits.
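
    As one concrete instance of check bits, the sketch below uses a Hamming(7,4) code, which adds 3 check bits to 4 data bits and can correct any single-bit error. The slides do not name a specific code, so this choice and the function names `hamming74_encode` and `hamming74_decode` are assumptions made for illustration.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # checks codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # checks codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # checks codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data bits, error position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome, read as a binary number, is the 1-based error position.
    error_pos = s1 + 2 * s2 + 4 * s4
    if error_pos:
        c[error_pos - 1] ^= 1   # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]], error_pos


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1                       # inject a single-bit error
    print(hamming74_decode(word))
```

    The syndrome computation is the extra processing the check bits require, which in hardware means additional XOR logic alongside the stored bits.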

    Time Redundancy

    Provide additional time during which a failed execution can be repeated.

    Most failures are transient - they go away after some time.

    If enough slack time is available, the failed unit can recover and redo the affected computation.
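
    Time redundancy can be sketched as re-execution within a bounded number of attempts. The names `TransientError`, `run_with_retries` and `make_flaky`, and the attempt limit standing in for the available slack time, are assumptions made for illustration.

```python
import time


class TransientError(Exception):
    """Raised by an operation hit by a (simulated) transient fault."""


def run_with_retries(operation, max_attempts=3, delay_s=0.0):
    """Re-execute `operation` until it succeeds or the slack is used up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except TransientError:
            if attempt == max_attempts:
                raise            # slack exhausted: give up on this unit
            time.sleep(delay_s)  # give the transient fault time to clear


def make_flaky(failures_before_success, result):
    """Hypothetical unit that fails a fixed number of times, then recovers."""
    remaining = [failures_before_success]

    def operation():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientError("transient fault")
        return result

    return operation


if __name__ == "__main__":
    print(run_with_retries(make_flaky(2, 42)))
```

    Retrying costs time rather than hardware, so it only suits faults that clear on their own; a permanent fault uses up every attempt and still fails.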

    THANK YOU