7/31/2019 Lect 1 Intro Taxonomy
Fault Tolerant Systems
Dependable & Secure Systems
Text Book to be Followed
Course Outline
Introduction - Basic concepts
Dependability measures
Redundancy techniques
Hardware fault tolerance
Error detecting and correcting codes
Redundant disks (RAID)
Fault-tolerant networks
Software fault tolerance
Checkpointing
Case studies of fault-tolerant systems
Defect tolerance in VLSI circuits
Fault detection in cryptographic systems
Simulation techniques
Need For Fault Tolerance - Critical Applications
Aircraft, nuclear reactors, chemical plants, medical equipment
A malfunction of a computer in such applications can lead to catastrophe
Their probability of failure must be extremely low, possibly one in a billion per hour of operation
Also included - financial applications
Need for Fault Tolerance - Harsh Environments
A computing system operating in a harsh environment, where it is subjected to
electromagnetic disturbances
particle hits and the like
A very large number of failures means the system will not produce useful results unless some fault tolerance is incorporated
Need For Fault Tolerance - Highly Complex Systems
Complex systems consist of millions of devices
Every physical device has a certain probability of failure
A very large number of devices implies that the likelihood of failures is high
The system will experience faults at such a frequency that it is rendered useless
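The scaling argument above can be made concrete with a minimal sketch. It assumes that devices fail independently and with the same probability; both the independence assumption and the numeric values are illustrative and do not come from the slides. The function name prob_any_failure is hypothetical.

```python
import math

def prob_any_failure(n_devices: int, p_fail: float) -> float:
    """Probability that at least one of n independent devices fails.

    Computes 1 - (1 - p)^n via log1p/expm1 so that tiny p values
    do not lose precision. Independence is an assumption.
    """
    return -math.expm1(n_devices * math.log1p(-p_fail))

if __name__ == "__main__":
    # Illustrative per-device failure probability (assumed, not from the slides)
    p = 1e-8
    for n in (1_000, 1_000_000, 100_000_000):
        print(f"{n:>11} devices -> P(at least one failure) = {prob_any_failure(n, p):.6f}")
```

With these assumed numbers, a single device almost never fails, yet a system of 10^8 such devices sees at least one failure with probability of roughly 0.63.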
Fault Taxonomy
Basic Concepts of Dependability
Dependability: the trustworthiness of a computer system such that reliance can justifiably be placed on the service it delivers.
It is the system property that integrates such attributes as reliability, availability, safety, security, survivability, and maintainability.
A systematic exposition of the concepts of dependability consists of three parts: the threats to, the attributes of, and the means by which dependability is attained.
Dependability Tree
Fault-Error-Failure Model
System Under Consideration:
Fault: cause of error (and failure)
Error: unintended state
Failure: deviation of actual service from intended service
Faults and errors are states; failures are external events.
Failure denotes an element's inability to perform its designed function because of errors in the element or its environment, which in turn are caused by various faults.
Fault, Error, Failure Examples
Fault: cosmic ray knocks charge off of DRAM cell
Error: bit flip in memory
Failure: computation produces incorrect result
Fault: software bug could allow for NULL pointer
Error: bug gets exercised and we get NULL pointer
Failure: program gets a segmentation fault when it tries to access the pointer
Duration of Faults/Errors
Transient (soft): occurs once and disappears
E.g., Cosmic ray knocks charge off transistor -> bit flip
Tend to be due to transient physical phenomena
Also known as Single Event Upset (SEU)
Intermittent: occurs occasionally
E.g., Loose connection -> occasionally open circuit
E.g., Software bug in rounding -> incorrect data
Permanent (hard): occurs and does not go away
E.g., Broken connection -> always open circuit
Software Faults/Errors
Types of bugs (or errors/failures that are due to bugs)
Incorrect algorithm
Array bounds violation
Memory leak (C, C++, but not Java): allocating memory, but not de-allocating it
Reference to NULL pointer (C, C++, but not Java)
Incorrect synchronization in multithreaded code
Allowing more than 1 thread in critical section at a time
Blocking when holding a lock
Inability to handle unanticipated inputs
Software Failure
What happens if we exercise a software bug?
Failures can occur in:
User-level software:
Incorrect data
Livelock/deadlock
Exception that triggers OS to kill process (segmentation fault, bus error)
Operating system software (including device drivers):
Livelock/deadlock
Crash and reboot
Incorrect I/O
Dependability and its Attributes
Availability: readiness for correct service
Reliability: continuity of correct service
Safety: absence of catastrophic consequences on the user(s) and the environment
Confidentiality: absence of unauthorized disclosure of information
Integrity: absence of improper system alterations
Maintainability: ability to undergo modifications and repairs
Security is a composite of the attributes availability, confidentiality, and integrity.
Traditional Measures - Reliability
Assumption: The system can be in one of two states: up or down
Examples:
Lightbulb - good or burned out
Wire - connected or broken
Reliability, R(t): Probability that the system is up during the whole interval [0,t], given it was up at time 0
Related measure - Mean Time To Failure, MTTF: Average time the system remains up before it goes down and has to be repaired or replaced
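As a hedged illustration of R(t) and MTTF, the sketch below assumes a constant failure rate, which gives the exponential law R(t) = exp(-lambda t) and MTTF = 1/lambda. The slides do not state which failure law applies, so the exponential model, the function names, and the numeric rate are all assumptions.

```python
import math

def reliability(t_hours: float, failure_rate: float) -> float:
    """R(t) under an ASSUMED constant failure rate (exponential model)."""
    return math.exp(-failure_rate * t_hours)

def mttf(failure_rate: float) -> float:
    """Mean time to failure for the exponential model: 1 / lambda."""
    return 1.0 / failure_rate

if __name__ == "__main__":
    lam = 1e-4  # assumed failures per hour, for illustration only
    print("MTTF (hours):", mttf(lam))
    for t in (100, 1_000, 10_000):
        print(f"R({t}) = {reliability(t, lam):.4f}")
```

Under this model the probability of surviving until t = MTTF is exp(-1), so the MTTF should not be read as a time the system is likely to reach without failing.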
Traditional Measures - Availability
Availability, A(t): Fraction of time system is up during the interval [0,t]
Point Availability, A_p(t): Probability that the system is up at time t
Long-Term Availability, A: $A = \lim_{t \to \infty} A(t) = \lim_{t \to \infty} A_p(t)$
Availability is used in systems with recovery/repair
Related measures:
Mean Time To Repair, MTTR
Mean Time Between Failures, MTBF = MTTF + MTTR
$A = \frac{MTTF}{MTBF} = \frac{MTTF}{MTTF + MTTR}$
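The relation between MTTF, MTTR, MTBF and long-term availability can be sketched in a few lines. The function names and the sample values are illustrative assumptions, not taken from the slides.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures: MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours

def long_term_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-term availability: A = MTTF / MTBF."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)

if __name__ == "__main__":
    # Assumed example: fails on average every 1000 h, takes 2 h to repair
    a = long_term_availability(1000.0, 2.0)
    print(f"A = {a:.6f}")
    print(f"Expected downtime per year (hours): {(1 - a) * 8760:.1f}")
```

With the assumed values the availability is about 0.998; shortening the repair time improves A just as effectively as lengthening the time to failure.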
Need For More Measures
The assumption of the system being in state up or down is very limiting
Example: A processor with one of its several hundreds of millions of gates stuck at logic value 0 and the rest functional - may affect the output of the processor once in every 25,000 hours of use
The processor is not fault-free, but cannot be defined as being down
More detailed measures than the general reliability and availability are needed
Computational Capacity Measures
Example: N processors in a gracefully degrading system
System is useful as long as at least one processor remains operational
Let P_i = Prob{i processors are operational}
$R(t) = \sum_{i=1}^{N} P_i$
Let c = computational capacity of a processor (e.g., number of fixed-size tasks it can execute)
Computational capacity of i processors: C_i = i c
Average computational capacity of system: $\sum_{i=1}^{N} C_i P_i$
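A small sketch of the average computational capacity, the sum of C_i P_i with C_i = i c. The slides leave the distribution P_i open; here it is assumed, for illustration only, that each processor is operational independently with the same probability, which makes P_i binomial. The function names are hypothetical.

```python
from math import comb

def operational_distribution(n: int, p_up: float) -> list[float]:
    """P_i for i = 0..n, ASSUMING independent processors, each up with probability p_up."""
    return [comb(n, i) * p_up**i * (1 - p_up) ** (n - i) for i in range(n + 1)]

def average_capacity(p: list[float], c: float) -> float:
    """Sum over i of C_i * P_i with C_i = i * c (the i = 0 term contributes nothing)."""
    return sum(i * c * p_i for i, p_i in enumerate(p))

def prob_useful(p: list[float]) -> float:
    """Probability that at least one processor is operational: sum of P_i for i >= 1."""
    return sum(p[1:])

if __name__ == "__main__":
    dist = operational_distribution(4, 0.9)  # assumed N = 4, p_up = 0.9
    print("P(system useful)  =", round(prob_useful(dist), 6))
    print("Average capacity  =", round(average_capacity(dist, c=1.0), 6))
```

The two numbers answer different questions: the first only says whether any service is delivered, while the second reflects how much service is delivered on average.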
Another Measure - Performability
Another approach - consider everything from the perspective of the application
Application is used to define accomplishment levels L1, L2,...,Ln
Each represents a level of quality of service delivered by the application
Example: Li indicates i system crashes during the mission time period T
Performability is a vector (P(L1),P(L2),...,P(Ln)) where P(Li) is the probability that the computer functions well enough to permit the application to reach up to accomplishment level Li
Network Connectivity Measures
Focus on the network that connects the processors
Classical Node and Line Connectivity - the minimum number of nodes and lines, respectively, that have to fail before the network becomes disconnected
Measure indicates how vulnerable the network is to disconnection
A network disconnected by the failure of just one (critically-positioned) node is potentially more vulnerable than another which requires several nodes to fail before it becomes disconnected
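Node connectivity can be illustrated with a brute-force sketch that tries ever larger sets of failed nodes until the surviving network falls apart. This is only practical for small graphs and is not presented as the method used in the course; the adjacency-list representation and the function names are assumptions. Line connectivity would follow the same pattern with links removed instead of nodes.

```python
from itertools import combinations

def is_connected(adj: dict, removed: set) -> bool:
    """True if the nodes that survive the removal form one connected component."""
    alive = [v for v in adj if v not in removed]
    if not alive:
        return True
    seen, stack = {alive[0]}, [alive[0]]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in removed and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(alive)

def node_connectivity(adj: dict) -> int:
    """Minimum number of node failures that disconnects the network (brute force)."""
    n = len(adj)
    for k in range(n - 1):
        for failed in combinations(adj, k):
            if not is_connected(adj, set(failed)):
                return k
    return n - 1  # convention for a fully connected network

if __name__ == "__main__":
    ring = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
    star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
    print("ring:", node_connectivity(ring))  # two failures are needed
    print("star:", node_connectivity(star))  # the hub alone disconnects it
```

The star network illustrates the critically-positioned node mentioned above: losing the hub is enough, whereas the ring survives any single node failure.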
Connectivity - Examples
Network Resilience Measures
Classical connectivity distinguishes between only two network states: connected and disconnected
It says nothing about how the network degrades as nodes fail before becoming disconnected
Two possible resilience measures:
Average node-pair distance
Network diameter - maximum node-pair distance
Both calculated given probability of node and/or link failure
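The two resilience measures can be sketched for one given failure scenario using breadth-first search. Weighting scenarios by their failure probabilities, as the slide describes, would wrap this function in an outer loop over scenarios; that step is omitted here. The graph representation and the function names are illustrative assumptions.

```python
from collections import deque

def hop_distances(adj: dict, src, failed: frozenset) -> dict:
    """Breadth-first search distances from src, skipping failed nodes."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in failed and w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def resilience_measures(adj: dict, failed: frozenset = frozenset()):
    """Return (average node-pair distance, diameter) over surviving nodes.

    Returns None if the surviving network is disconnected or has fewer than two nodes.
    """
    alive = [v for v in adj if v not in failed]
    pair_dists = []
    for i, u in enumerate(alive):
        dist = hop_distances(adj, u, failed)
        for v in alive[i + 1:]:
            if v not in dist:
                return None
            pair_dists.append(dist[v])
    if not pair_dists:
        return None
    return sum(pair_dists) / len(pair_dists), max(pair_dists)

if __name__ == "__main__":
    ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
    print("fault-free:   ", resilience_measures(ring))
    print("node 0 failed:", resilience_measures(ring, frozenset({0})))
```

In the six-node ring, a single node failure leaves the network connected but stretches both the average distance and the diameter, which is exactly the degradation that classical connectivity does not capture.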
Means to Attain Dependability
Fault prevention: means to prevent the occurrence or introduction of faults
Fault tolerance: means to avoid service failures in the presence of faults
Fault removal: means to reduce the number and severity of faults
Fault forecasting: means to estimate the present number, the future incidence, and the likely consequences of faults
Note:
Fault prevention and fault tolerance aim to provide the ability to deliver a service that can be trusted. [Procurement]
Fault removal and fault forecasting aim to reach confidence in that ability by justifying that the functional and dependability specifications are adequate and that the system is likely to meet them. [Validation]
Failure Modes
A system does not always fail in the same way. Its failure modes characterize incorrect service according to four viewpoints:
the failure domain
the perception of a failure by system users
the detectability of failures
the consequences of failures on the environment
A Taxonomy of Faults
All faults that may affect a system during its life are classified according to eight basic viewpoints.
Classes of Faults - Tree Representation
Classes of Combined Faults
Key System/Functional Unit Properties
Fail Safe: In case of a fault, the system or functional unit transitions to a safe state.
Fail Silent: In case of a fault, the output interfaces are disabled in such a way that no further outputs are made.
Fail Operational: The ability of a system or functional unit to continue normal operation at its output interfaces despite the presence of hardware or software faults.
Graceful Degradation: The system continues to operate in the presence of errors, accepting partial degradation of performance during recovery.
EASIS's View on Fail-Silent
Electronic Control Unit (ECU)
CPU Faults/Errors
Processing core:
I. Calculating errors (e.g. HW fault, logic error)
II. Value errors (e.g. HW fault, memory/register corruption, EMI, SEU, etc.)
III. Program flow errors (e.g. HW error)
IV. Interrupt errors (sequence, frequency, delay, disregarding, etc.)
V. Algorithmic errors (= Compiler/Logic Synthesizer errors / design faults)
VI. Timing errors
RAM/ROM:
VII. Errors in the RAM/ROM (memory cell defective)
VIII. Faulty RAM/ROM access (wrong memory address)
IX. Faulty memory mapping (= Compiler or linker errors / design faults)
X. Memory overflow
I/O-Interface:
XI. Interface errors (errors in ADC/digital IO/ ... )
Supervisor Faults/Errors
I. Internal error (the same as CPU faults/errors if the supervisor is a processor).
II. Synchronization lost between CPU and supervisor.
III. Supervisor and CPU receive different information from the outside world.
IV. Supervisor loses control over the enable lines.
V. CPU and supervisor use different, but both valid, rules to judge the control.
SW Related Faults/Errors
Scheduling Faults/Errors
I. Missed activation deadline
II. Missed termination deadline
Communication between SW components
I. Data values of the received data are faulty
II. The data is received later than a deadline
III. The data is received too early
IV. The data cannot be sent out in the given time range
V. The data cannot be sent out
VI. API access fault (e.g. dynamic argument is out of range)
Actuator Faults/Errors
I. The actuator is not driven.
II. The actuator is permanently driven (without controller command).
III. The actuator is not driven at the right time.
IV. The actuator is not driven with the correct performance.
V. The actuator cannot be driven correctly.
Sensor Faults/Errors
I. The sensor delivers no value or an error signal.
II. The read value of the sensor is wrong.
III. The sensor delivers a value with a wrong timing.
Internal Power Supply Faults
I. Over voltage
II. Under voltage
III. Short circuit
IV. Over current (due to erroneously activated actuators, defective actuators, defective components, misuse of components, etc.)
V. Leakage current too high
VI. Brown out (slow decrease of the supply voltage below the minimum limit)
VII. Startup timing
VIII. Shutdown timing
External Power Supply Faults
I. Over voltage (load dump, ISO pulse, generator error)
II. Under voltage (due to low battery, line break)
III. Current limit
IV. Short circuit
Faults/Errors in Communication Systems
At the node level
I. Data values of a received message are faulty (faulty data value).
II. The message is received later than a deadline (late message).
III. The message is received too early.
IV. The message cannot be sent out in the given time range.
V. The message cannot be sent out.
At the system level
I. All receivers of the message (in a special case only one receiver exists) regard the message as faulty with respect to the same main fault type, which is one of the faults (I to III).
II. All receivers of the message regard the message as faulty with respect to one of the main fault types (I to III), which can be different for each receiver.
III. Some of the receivers get a correct message, while the others get a faulty message with respect to one of the main fault types (I to III), which is the same for each receiver of the faulty message.
IV. Some of the receivers get a correct message, while the others get a faulty message with respect to one of the main fault types (I to III), which can be different for each receiver of the faulty message.
Comprehensive Fault Model
Specification Faults
Adequacy faults: some of the properties expressed in the specification are in contradiction with the required properties.
Over-specification: the specification satisfies the required properties, but some feasible solutions are excluded because of the presence of unnecessary properties; the specification is too detailed.
Under-specification: all the properties expressed in the specification are adequate, but some unacceptable solutions are accepted; the specification is not precise enough.
Source: NUREG/CR-6316 Guidelines
Requirement Faults (NASA fault taxonomy)
Incompleteness
Omitted/Missing
Incorrect
Ambiguous
Infeasible
Inconsistent
Over-specification
Not Traceable
Misplaced
Unachievable Item
Non-verifiable
Intentional Deviation
Redundant or Duplicate
Design Faults
Software design faults
Application design faults
Basic software design faults
Scheduling faults
Services faults
Calibration faults
Firmware design faults
Hardware design faults
Component design faults
ECU design faults
Malicious design faults: disrupt or halt service, causing denial of service; improper modification of system behavior
System design faults: relating to architecture design, communication infrastructure, wiring harness, EMI protection, etc.
Manufacturing Faults
Arise from weaknesses in the manufacturing and assembly processes at the various levels of detail, from component manufacturing to the final vehicle assembly.
Such a fault could be caused by low quality in materials/components, but may also be caused by a software/hardware fault in the manufacturing system.
Operational Faults (Refer to EASIS Fault Model)
Hardware faults
Node faults
CPU faults
Supervisor/watchdog faults
Internal communication (SPI) faults
Reset logic faults
Actuator faults
Sensor faults
Power-supply faults
Communication faults/errors
Operational Faults (Contd.)
Susceptibility faults
Electrical susceptibility (EMI transported by cabling)
Electromagnetic susceptibility (transported by air)
Environmental susceptibility
Maintenance faults
Wrong software download
Wrong replacement parts
Wrong maintenance procedure followed
Malicious faults
Software intrusions
Hardware intrusions
Fault Hypothesis
The fault hypothesis partitions the fault space into two sets:
Level-1 faults: the set of faults that will be tolerated by the fault-tolerance mechanisms.
Level-2 faults: the set of faults that will not be tolerated by the fault-tolerance mechanisms. These faults must be rare events.
If no precise fault hypothesis is available, it is impossible to test the proper behavior of the fault-tolerance mechanisms.
If, during the test and installation phase, level-2 faults turn out not to be rare events, then there is a fundamental design problem:
Either the fault hypothesis is wrong
Or the implementation is deficient.
Hardware Redundancy
Extra hardware is added to override the effects of a failed component
Static Hardware Redundancy - for immediate masking of a failure
Example: Use three processors and vote on the result. The wrong output of a single faulty processor is masked
Dynamic Hardware Redundancy - Spare components are activated upon the failure of a currently active component
Hybrid Hardware Redundancy - A combination of static and dynamic redundancy techniques
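The three-processor voting example of static redundancy can be sketched in software as a majority voter. In a real system the voter is typically a hardware circuit; this Python function, including its name and its error handling, is only an illustrative assumption.

```python
from collections import Counter

def majority_vote(outputs: list):
    """Return the value produced by a strict majority of the modules.

    Raises RuntimeError when no value has a strict majority, in which
    case the fault cannot be masked by voting.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 > len(outputs):
        return value
    raise RuntimeError("no majority: fault cannot be masked")

if __name__ == "__main__":
    # Three replicated processors; the third one is faulty
    print(majority_vote([42, 42, 7]))
```

The single wrong output is outvoted and never reaches the consumer of the result. Two simultaneous disagreeing faults would defeat a three-way voter, which is why the sketch raises an error rather than guessing.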
Software Redundancy Example
Multiple teams of programmers
Write different versions of software for the same function
The hope is that such diversity will ensure that not all the copies will fail on the same set of input data
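A toy sketch of the multi-version idea: three independently written versions of an integer square root, one of which carries a deliberately seeded bug, run on the same input and voted on. The versions, the seeded bug, and the function names are all hypothetical and serve only to illustrate design diversity.

```python
import math
from collections import Counter

def isqrt_team_a(n: int) -> int:
    """Version A: relies on the standard library."""
    return math.isqrt(n)

def isqrt_team_b(n: int) -> int:
    """Version B: integer Newton iteration."""
    if n < 2:
        return n
    x = n
    while x * x > n:
        x = (x + n // x) // 2
    return x

def isqrt_team_c(n: int) -> int:
    """Version C: contains a SEEDED bug for the input 16 (illustration only)."""
    return math.isqrt(n) - (1 if n == 16 else 0)

def run_versions(versions, n: int) -> int:
    """Run every version on the same input and return the majority result."""
    results = [version(n) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("versions disagree without a majority")
    return value

if __name__ == "__main__":
    versions = [isqrt_team_a, isqrt_team_b, isqrt_team_c]
    print(run_versions(versions, 16))  # version C is wrong here but is outvoted
```

The scheme only helps when the versions fail on different inputs; if two teams make the same mistake, the vote confirms the wrong answer.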
Information Redundancy
Add check bits to original data bits so that an error in the data bits can be detected and even corrected
Error detecting and correcting codes have been developed and are being used
Information redundancy often requires hardware redundancy to process the additional check bits
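As one concrete instance of check bits, the sketch below uses a Hamming(7,4) code, which adds three check bits to four data bits and can correct any single-bit error. The slides do not name a specific code, so this choice and the function names are assumptions made for illustration.

```python
def hamming74_encode(data: list[int]) -> list[int]:
    """Encode 4 data bits into a 7-bit codeword laid out as [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(word: list[int]) -> tuple[list[int], int]:
    """Correct at most one flipped bit; return (data bits, 1-based error position or 0)."""
    c = list(word)  # work on a copy so the caller's list is untouched
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # equals the position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome

if __name__ == "__main__":
    codeword = hamming74_encode([1, 0, 1, 1])
    corrupted = list(codeword)
    corrupted[4] ^= 1  # inject a single-bit error at position 5
    print(hamming74_decode(corrupted))
```

The three parity computations in the decoder are the extra processing that the check bits require, which is the hardware cost the last bullet refers to.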
Time Redundancy
Provide additional time during which a failed execution can be repeated
Most failures are transient - they go away after some time
If enough slack time is available, the failed unit can recover and redo the affected computation
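Time redundancy can be sketched as a retry loop: the same computation is repeated until it succeeds or the slack, modeled here as a maximum number of attempts, is used up. The helper make_flaky_task simulates a transient fault and is purely hypothetical, as are the other names.

```python
def run_with_retry(task, max_attempts: int = 3):
    """Re-execute task until it succeeds or the attempt budget is exhausted.

    Returns (result, attempts_used). A fault that persists through all
    attempts is treated as permanent and re-raised.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except RuntimeError as err:
            last_error = err  # assume the fault is transient and try again
    raise RuntimeError(f"still failing after {max_attempts} attempts") from last_error

def make_flaky_task(failures_before_success: int):
    """Build a task that fails a fixed number of times, simulating a transient fault."""
    remaining = [failures_before_success]

    def task():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("transient fault")
        return "result"

    return task

if __name__ == "__main__":
    print(run_with_retry(make_flaky_task(2)))
```

The attempt budget plays the role of the slack time: a fault that outlasts it cannot be handled by repetition alone and calls for one of the other redundancy techniques.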