Introduction to Fault Tolerance
8/10/2019 Introduction to Fault Tolerance
1/20
BY:
ANKIT BHATT
ME-VLSI & EMBEDDED
-
What Is Failure?
A system is said to fail when it cannot meet its promises.
A failure is brought about by the existence of errors in the system.
The cause of an error is called a fault.
-
Concept of Fault Tolerance
Hardware, software and networks cannot be totally free from failures.
Fault tolerance is a non-functional (QoS) requirement that requires a system to continue to operate, even in the presence of faults.
Fault tolerance should be achieved with minimal involvement of users or system administrators.
Distributed systems can be more fault tolerant than centralized systems, but with more processor hosts the occurrence of individual faults is generally likely to be more frequent.
-
Attributes, Consequences and Strategies
Attributes (what is a dependable system?)
Availability
Reliability
Safety
Confidentiality
Integrity
Maintainability
Consequences (how to distinguish faults?)
Fault
Error
Failure
Strategies (how to handle faults?)
Fault prevention
Fault tolerance
Fault recovery
Fault forecasting
-
Terminology of Fault Tolerance
A fault causes an error; an error results in a failure.
Fault is a defect within the system.
Error is observed by a deviation from the expected behaviour of the system.
Failure occurs when the system can no longer perform as required (does not meet spec).
Fault tolerance is the ability of a system to provide a service, even in the presence of errors.
-
Strategies to Handle Faults
Fault avoidance: techniques that aim to prevent faults from entering the system during the design stage.
Fault removal: methods that attempt to find faults within a system before it enters service.
Fault detection: techniques used during service to detect faults within the operational system.
Fault tolerance: techniques designed to tolerate faults, i.e. to allow the system to operate correctly in the presence of faults.
Actions to identify and remove errors:
Design reviews
Testing
Use certified tools
Analysis:
Hazard analysis
Formal methods (proof and refinement)
No non-trivial system can be guaranteed free from error. We must have an expectation of failure and make appropriate provision.
-
Fault Models
A fault model identifies targets for testing.
A fault model makes analysis possible.
Effectiveness is measurable by experiments.
Different types:
Stuck-at faults
Multiple stuck-at faults
Bridging faults
-
Single Stuck-At Fault
Single (line) stuck-at fault: the given line has a constant value (0 or 1), independent of other signal values in the circuit.
Properties
o Only one line is faulty
o The faulty line is permanently set to 0 or 1
o The fault can be at an input or output of a gate
o The simple logical model is independent of technology
o It reduces the complexity of fault detection
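The single stuck-at model can be tried out on a toy circuit. The sketch below is only an illustration, not a circuit from these slides: it assumes a hypothetical circuit y = (a AND b) OR c with five lines (a, b, c, the internal AND output n1, and y), injects one stuck-at fault at a time, and lists the input vectors whose output differs from the fault-free circuit.

```python
from itertools import product

# Lines of the hypothetical circuit y = (a AND b) OR c.
LINES = ["a", "b", "c", "n1", "y"]

def simulate(a, b, c, fault=None):
    """Evaluate the circuit; fault is (line_name, stuck_value) or None."""
    def drive(line, value):
        # A stuck-at line ignores its driver and holds a constant value.
        if fault is not None and fault[0] == line:
            return fault[1]
        return value

    a = drive("a", a)
    b = drive("b", b)
    c = drive("c", c)
    n1 = drive("n1", a & b)
    return drive("y", n1 | c)

def detecting_vectors(fault):
    """Input vectors (a, b, c) on which the faulty output differs from the good one."""
    return [v for v in product((0, 1), repeat=3)
            if simulate(*v) != simulate(*v, fault=fault)]

# Two single stuck-at faults (stuck-at-0, stuck-at-1) per line.
faults = [(line, value) for line in LINES for value in (0, 1)]

if __name__ == "__main__":
    for fault in faults:
        print(fault, detecting_vectors(fault))
```

In this toy circuit, for instance, input a stuck-at-0 is detected only by the vector (1, 1, 0), because that is the only input on which the value of a decides the output.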
-
Example:
The XOR circuit has 12 fault sites and therefore 24 single stuck-at faults (stuck-at-0 and stuck-at-1 at each site).
-
Multiple Stuck-At Faults
Multiple stuck-at fault: several single stuck-at faults occur at the same time.
Multiple stuck-at faults are usually not considered in practice, for two reasons:
o The number of multiple stuck-at faults in a circuit with k lines is 3^k - 1, which is too large a number even for circuits of moderate size
o Tests for single stuck-at faults are known to cover a very high percentage (greater than 99.6%) of multiple stuck-at faults when the circuit is large and has several outputs
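The count 3^k - 1 follows from each of the k lines being in one of three states (fault-free, stuck-at-0, stuck-at-1), minus the single combination in which every line is fault-free; this total includes the 2k single faults. A small sketch of the arithmetic (the function name is illustrative only):

```python
def stuck_at_fault_counts(k):
    """Return (single_faults, all_fault_combinations) for a circuit with k lines."""
    single = 2 * k          # stuck-at-0 and stuck-at-1 on each line
    combined = 3 ** k - 1   # every state assignment except "all lines fault-free"
    return single, combined

if __name__ == "__main__":
    for k in (4, 12, 20):
        print(k, stuck_at_fault_counts(k))
```

For k = 12 lines this gives 24 single faults against 531440 fault combinations, which shows how quickly the multiple-fault space grows.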
-
Bridging Fault
Two or more normally distinct points (lines) are shorted together.
Two types of bridging faults:
Input bridging
Can form wired logic or voting model.
Feedback (input-to-output) bridging
Can introduce feedback.
Can cause oscillation or latching.
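As a rough illustration of the wired-logic model for input bridging, the sketch below assumes a hypothetical 2:1 multiplexer, y = (a AND c) OR (b AND NOT c), whose inputs a and b are shorted. Under the wired-AND assumption both lines carry a AND b; under wired-OR both carry a OR b. Feedback bridging is not modelled here.

```python
from itertools import product

def bridge(x, y, model):
    """Values seen on two shorted lines under a wired-logic model."""
    if model == "wired-AND":
        v = x & y
    elif model == "wired-OR":
        v = x | y
    else:
        raise ValueError(f"unknown bridging model: {model}")
    return v, v

def mux(a, b, c, bridged=None):
    """Hypothetical circuit: y = a if c == 1 else b, with an optional a-b bridge."""
    if bridged is not None:
        a, b = bridge(a, b, bridged)
    return (a & c) | (b & (1 - c))

def detecting_vectors(model):
    """Input vectors (a, b, c) on which the bridged circuit differs from the good one."""
    return [v for v in product((0, 1), repeat=3)
            if mux(*v) != mux(*v, bridged=model)]

if __name__ == "__main__":
    for model in ("wired-AND", "wired-OR"):
        print(model, detecting_vectors(model))
```

In both models a detecting vector must drive the two shorted lines to opposite values, which is why only vectors with a != b appear in the results.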
-
Transistor Fault
o A MOS transistor is considered an ideal switch.
o Two types of faults are modeled:
Stuck-open: a single transistor is permanently stuck in the open state. This turns the circuit into a sequential one, so a sequence of at least 2 tests is needed to detect a single fault.
Stuck-on: a single transistor is permanently shorted, irrespective of its gate voltage.
o Detection of a stuck-open fault requires two vectors.
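Why two vectors are needed can be seen in a switch-level sketch. The example below is an assumption-laden illustration, not the circuit from the slides: it models a CMOS 2-input NAND whose pMOS transistor on input a may be stuck open, and it assumes that a floating output node simply keeps its previous value (no leakage).

```python
class Nand2:
    """Switch-level CMOS 2-input NAND; the pMOS on input a may be stuck open."""

    def __init__(self, pmos_a_stuck_open=False, initial_output=1):
        self.pmos_a_stuck_open = pmos_a_stuck_open
        self.out = initial_output  # charge stored on the output node

    def apply(self, a, b):
        # Pull-up network: two pMOS in parallel, each conducting on a 0 input.
        pull_up = (a == 0 and not self.pmos_a_stuck_open) or (b == 0)
        # Pull-down network: two nMOS in series, conducting only for a = b = 1.
        pull_down = (a == 1 and b == 1)
        if pull_up:
            self.out = 1
        elif pull_down:
            self.out = 0
        # Otherwise the node floats and retains its previous value.
        return self.out

def run_sequence(gate, vectors):
    """Apply the input vectors in order and collect the outputs."""
    return [gate.apply(a, b) for a, b in vectors]

if __name__ == "__main__":
    test = [(1, 1), (0, 1)]  # initialise the output to 0, then try to pull it up through a
    print("good  :", run_sequence(Nand2(), test))
    print("faulty:", run_sequence(Nand2(pmos_a_stuck_open=True), test))
```

The first vector (1, 1) initialises the output to 0; the second vector (0, 1) should pull it to 1 through the pMOS on a. With that transistor stuck open the output floats and stays at 0, so the fault shows up only on the second vector of the pair.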
-
Example of Transistor Stuck-Open Fault
-
Hardware Faults Classification
Three types of faults:
Transient faults: disappear after a relatively short time.
Example: a memory cell whose contents are changed spuriously due to some electromagnetic interference. Overwriting the memory cell with the right content will make the fault go away.
Permanent faults: never go away; the component has to be repaired or replaced.
Intermittent faults: cycle between active and benign states.
Example: a loose connection.
-
Fault Tolerance Techniques
Hardware Redundancy
Software Redundancy
Information Redundancy
Time Redundancy
-
Hardware Redundancy
Extra hardware is added to override the effects of a failed component.
Static hardware redundancy: for immediate masking of a failure.
Example: use three processors and vote on the result. The wrong output of a single faulty processor is masked.
Dynamic hardware redundancy: spare components are activated upon the failure of a currently active component.
Hybrid hardware redundancy: a combination of static and dynamic redundancy techniques.
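The three-processor voting example is commonly known as triple modular redundancy (TMR). A minimal sketch of the voter, assuming the three processor outputs are available as integers and are voted on bit by bit (the function name is illustrative):

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three integer outputs."""
    # A result bit is 1 exactly when at least two of the three inputs have it set.
    return (a & b) | (b & c) | (a & c)

if __name__ == "__main__":
    correct = 0b1010
    faulty = 0b0110  # output of the single faulty processor
    print(bin(majority_vote(correct, correct, faulty)))
```

The voter masks one faulty module without any detection or reconfiguration step, which is what makes the scheme static. If two modules fail in the same way, the voter passes the wrong value through.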
-
Software Redundancy
Multiple teams of programmers write different versions of software for the same function.
The hope is that such diversity will ensure that not all the copies will fail on the same set of input data.
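This approach is often called N-version programming. The sketch below is a toy illustration under stated assumptions: three hypothetical, independently written leap-year checks, one of them deliberately flawed, combined by a majority voter. The function names and the choice of task are made up for the example.

```python
import calendar
from collections import Counter

def leap_v1(year):
    """Version 1: the Gregorian rule written out by hand."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def leap_v2(year):
    """Version 2: delegates to the standard library."""
    return calendar.isleap(year)

def leap_v3(year):
    """Version 3: deliberately flawed, ignores the century rule."""
    return year % 4 == 0

def vote(versions, x):
    """Run every version on x and return the strict-majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            continue  # a crashed version casts no vote
    if not results:
        raise RuntimeError("all versions failed")
    value, count = Counter(results).most_common(1)[0]
    if count <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return value

if __name__ == "__main__":
    versions = [leap_v1, leap_v2, leap_v3]
    for year in (1900, 2000, 2024):
        print(year, vote(versions, year))
```

For the year 1900 the flawed version answers True, but the other two outvote it. The scheme only helps while the versions do not share the same defect on the same input.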
-
Information Redundancy
Add check bits to the original data bits so that an error in the data bits can be detected and even corrected.
Error detecting and correcting codes have been developed and are being used.
Information redundancy often requires hardware redundancy to process the additional check bits.
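One well-known code of this kind is the Hamming(7,4) code, which adds three check bits to four data bits and can correct any single-bit error. The slides do not name a specific code, so the sketch below is just one possible illustration; bit positions are numbered 1 to 7 with check bits at positions 1, 2 and 4.

```python
def hamming74_encode(data):
    """Encode 4 data bits as 7 code bits [p1, p2, d3, p4, d5, d6, d7]."""
    d3, d5, d6, d7 = data
    p1 = d3 ^ d5 ^ d7  # covers positions 1, 3, 5, 7
    p2 = d3 ^ d6 ^ d7  # covers positions 2, 3, 6, 7
    p4 = d5 ^ d6 ^ d7  # covers positions 4, 5, 6, 7
    return [p1, p2, d3, p4, d5, d6, d7]

def hamming74_decode(code):
    """Return (data_bits, error_position); position 0 means no error was detected."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # the syndrome points at the flipped bit
    if position:
        c[position - 1] ^= 1  # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], position

if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # flip the bit at position 5 to imitate a transient fault
    print(hamming74_decode(word))
```

Only single-bit errors are corrected here; two flipped bits in the same word would be miscorrected, which is why practical designs often add an overall parity bit for double-error detection.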
-
Time Redundancy
Provide additional time during which a failed execution can be repeated.
Most failures are transient; they go away after some time.
If enough slack time is available, the failed unit can recover and redo the affected computation.
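A minimal sketch of time redundancy, assuming the computation is a callable that raises an exception when it fails and that the slack time is given in seconds. The helper make_transient_task is hypothetical and only imitates a fault that disappears after a few attempts.

```python
import time

def run_with_retries(task, slack_seconds, delay_seconds=0.0):
    """Re-run task until it succeeds or the slack time is used up.

    Returns (result, attempts). Re-raises the last exception once no slack is left.
    """
    deadline = time.monotonic() + slack_seconds
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            # Give up when another attempt would start at or after the deadline.
            if time.monotonic() + delay_seconds >= deadline:
                raise
            time.sleep(delay_seconds)

def make_transient_task(failures):
    """Hypothetical task that fails `failures` times and then succeeds."""
    remaining = [failures]

    def task():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("transient fault")
        return "ok"

    return task

if __name__ == "__main__":
    print(run_with_retries(make_transient_task(2), slack_seconds=5.0))
```

Retrying helps only against transient faults; a permanent fault uses up the slack and the failure is reported anyway.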
-
THANK YOU