ESE VO Fault Tolerance

Bernhard Frömel
(based on slides by Hermann Kopetz)
Institute of Computer Engineering, Vienna University of Technology
182.722 Embedded Systems Engineering LU
October 2014
-
Part I
Fault Tolerance
-
Technological Paradise
"[In a] Technological Paradise no acts of God can be permitted and everything happens according to the blueprints." [Hannes Alfvén]¹
We are not living in a technological paradise!
¹ Nobel laureate
-
Structure of Systems
"If you look at automata which have been built by men or which exist in nature, you very frequently notice their structure is controlled to a much larger extent by the manner in which they might fail and by the (more or less effective) precautionary measures which have been taken against their failure." [Neumann and Burks, 1966]
-
Robustness
In large systems it is highly improbable that all sub-systems operate as specified.
⇒ Faults are the norm, rather than the exception.
Robustness is concerned with the delivery of a useful level of service in the face of disturbances (e.g., hardware faults, software errors, changes of specification, inappropriate use, ...).
-
Design Challenges in Safety-Critical Applications
In safety-critical applications, where the safety of the system at large (e.g., an airplane, car, ...) depends on the correct operation of the computer system (e.g., primary flight control system, x-by-wire system in a car), the following challenges must be addressed:
- The 10⁻⁹ challenge
- Modeling (process of abstraction)
- Faults (physical hardware faults, design faults, ...)
- Human failures
-
The 10⁻⁹ Challenge
- The system as a whole must be more reliable than any of its components: e.g., a system dependability of 1 Failure in Time (FIT) versus a component dependability of 1000 FIT (1 FIT = 1 failure in 10⁹ hours).
- The architecture must be distributed and support fault tolerance to mask component failures.
- The system as a whole is not testable to the required level of dependability.
- The safety argument is based on a combination of experimental evidence about expected failure modes and failure rates of Fault Containment Units (FCUs) and a formal dependability model that depicts the system structure from the point of view of dependability.
- The independence of FCUs is the critical issue.
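The FIT arithmetic behind these numbers can be checked with a short calculation (an illustrative sketch; the function name is mine, the figures are from the slide):

```python
def fit_to_mttf_hours(fit: float) -> float:
    """1 FIT = 1 failure per 10^9 device-hours, so MTTF = 10^9 / FIT."""
    return 1e9 / fit

# A component with 1000 FIT fails on average once per 10^6 hours (~114 years).
component_mttf = fit_to_mttf_hours(1000)

# The system-level target of 1 FIT corresponds to an MTTF of 10^9 hours --
# a factor of 1000 better than any single component, hence the need for
# fault tolerance rather than component quality alone.
system_mttf = fit_to_mttf_hours(1)

print(component_mttf)                    # 1e6 hours
print(system_mttf / component_mttf)      # factor 1000
```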
-
Modeling: Process of Abstracting
- The behavior of safety-critical computer systems must be explainable by a hierarchically structured set of behavioral models, each of them of a cognitive complexity that can be handled by the human mind.
- Establish a clear relationship between the behavioral model and the dependability model at such a high level of abstraction that the analysis of the dependability model becomes tractable. Example: any migration of a function from one Electronic Control Unit (ECU) to another ECU changes the dependability model and requires a new dependability analysis.
- From the hardware point of view, a complete chip forms a single FCU that can fail in an arbitrary failure mode with a probability of 10⁻⁶ failures/hour (1000 FIT).
-
Fault Hypothesis and Assumption Coverage
- The fault hypothesis states the assumptions about the types and numbers of faults that a fault-tolerant system must tolerate.
- The assumption coverage states to what extent these assumptions are met by reality. The assumption coverage limits the dependability of even a perfect fault-tolerant system.
- The fault hypothesis is the most important document in the design of fault-tolerant systems.
-
Fault Hypothesis I and Fault Hypothesis II
Fault Hypothesis I: specification of faults that must be tolerated without any impact on essential system services (e.g., arbitrary failure of any single unit).
Fault Hypothesis II: specification of faults that can be handled in the rare-event scenario, e.g., by the never-give-up strategy. Example: massive transients that cause the failure of all communication and of more than one node over a given period.
-
System States of a Fault-Tolerant System
[Figure: system states — correct states; normal failures (covered by Fault Hypothesis I) are masked by fault tolerance (FT); rare events (covered by Fault Hypothesis II) are handled by the NGU strategy.]
-
Approach to Safety: The Swiss-Cheese Model [Reason and Reason, 1997]
[Figure: Swiss-cheese model — multiple layers of defense (on-chip TMR, off-chip TMR, NGU strategy) stand between a subsystem failure during normal operation and a catastrophic system failure.]
-
Why is the Fault Hypothesis Needed?
- Design of the fault-tolerance algorithms: without a precise fault hypothesis it is not known which fault classes must be addressed during system design.
- Estimation of the assumption coverage: the probability that the assumptions contained in the fault hypothesis are not met by reality.
- Validation and certification: for the validation it must be known which faults are supposed to be tolerated by the given system.
- Design of the Never-Give-Up (NGU) strategy: in case the fault hypothesis is violated, the NGU process must be started.
-
Contents of the Fault Hypothesis
- Unit of failure: what is the FCU?
- Failure modes: what are the failure modes of the FCU?
- Frequency of failures: what is the assumed Mean Time To Failure (MTTF) for the different failure modes, e.g., transient failures versus permanent failures?
- Detection: how are failures detected? How long is the detection latency?
- State recovery: how long does it take to repair corrupted state (in the case of a transient fault)?
-
Unit of Failure: Fault Containment Unit (FCU)
A Fault Containment Unit (FCU) is a set of subsystems that shares one or more common resources, can be affected by a single fault, and is assumed to fail independently of other FCUs.
- Tolerance w.r.t. spatial-proximity faults requires spatial separation of FCUs: distributed architectures are required.
- The fault hypothesis must specify the failure modes of the FCUs and their associated frequencies.
- Beware of shared resources that compromise the independence assumption: e.g., common hardware, power supply, oscillator, earthing, single time source, ...
-
Independence of FCUs
Two basic mechanisms compromise the independence of FCUs:
- missing fault isolation, and
- error propagation.
The independence of failures of different FCUs is the most critical issue in the design of ultra-dependable systems.
- Is it justified to assume that a single silicon die can contain two independent FCUs?
- Can we assume that the failure modes of a single silicon die are well behaved (e.g., fail-silent) to the required level of probability?
-
Correlated Failures of a Single Die Caused by
- Mask alignment, which gets more critical as feature size shrinks (data-sensitive failures)
- Packaging faults
- Power supply
- Earthing
- Timing source (oscillator)
- Processing parameters out of range
- Oxidation
- Electromigration
In the aerospace community it is assumed that a single silicon die forms a single FCU that can fail in an arbitrary failure mode with a probability of 10⁻⁶ failures per hour.
-
Critical Failure Modes of an FCU
- Crash/Omission (CO) failures
- Massive transient disturbances
- Babbling-idiot failures
- Masquerading failures
- Slightly-Off-Specification (SOS) failures
-
Babbling Idiot Failures
Due to a hardware or software fault, a node sends a message on a shared communication medium without adhering to the media-access discipline.
- Fault-injection experiments show that about 1 out of 50 node failures is of the babbling-idiot type.
- A dependent bus guardian reduces this probability to about 1 out of 1000 failures.
- An independent bus guardian with its own clock-synchronization algorithm, power supply, etc. is needed in fail-operational safety-critical applications.
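A bus guardian enforces the media-access discipline independently of the (possibly babbling) node. A minimal sketch of a TDMA-style window check, assuming static equal-length slots; all names and numbers are illustrative, not from the slides:

```python
def guardian_allows(node_id: int, send_time: float,
                    slot_len: float, num_nodes: int) -> bool:
    """Enable the bus driver only inside the node's own TDMA slot.

    A babbling node that attempts to transmit outside its slot is
    simply cut off from the medium by the guardian.
    """
    round_len = slot_len * num_nodes
    pos = send_time % round_len            # position within the TDMA round
    slot_start = node_id * slot_len
    return slot_start <= pos < slot_start + slot_len

# Four nodes, 1 ms slots: node 2 owns the window [2 ms, 3 ms) of each round.
print(guardian_allows(2, 0.0022, 0.001, 4))   # transmission inside its slot
print(guardian_allows(2, 0.0005, 0.001, 4))   # babbling in node 0's slot
```

An independent guardian would derive `send_time` from its own clock and power supply, so that a common-mode fault cannot disable both the node's discipline and its enforcement.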
-
Masquerading Failures
A faulty node assumes the identity of another node and sends incorrect messages:
- Any system that relies solely on information stored in a message is potentially dangerous.
- A direct consequence of strong location transparency.
- Makes diagnosis very difficult.
Example: Controller Area Network (CAN) bus
-
Intermittent Errors
An intermittent error exists if the transient error rate is significantly higher than the natural transient error rate. Causes for intermittents:
- Slow physical degradation of the hardware (PN junctions, wires) with the effect of data-sensitive errors, temperature-sensitive errors, crosstalk, etc.
- Design errors in the production process: e.g., a slight misalignment of masks or variation of the processing steps leads to premature aging of the chip.
More than half of the observed transient errors may be caused by intermittents.
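The definition above suggests a simple screening rule: compare the observed transient error rate of a unit against the natural baseline rate. A hedged sketch (the threshold factor and all numbers are illustrative assumptions, not values from the slides):

```python
def is_intermittent(errors_observed: int, hours: float,
                    baseline_rate: float, factor: float = 10.0) -> bool:
    """Flag a unit whose observed transient error rate is significantly
    (here: 'factor' times) above the natural baseline transient rate."""
    observed_rate = errors_observed / hours
    return observed_rate > factor * baseline_rate

# Baseline: 1e-4 transient errors/hour. A node with 50 errors in 1000 h
# (5e-2 per hour) is far above that and is likely degrading.
print(is_intermittent(50, 1000.0, 1e-4))    # flagged as intermittent
print(is_intermittent(1, 10000.0, 1e-4))    # consistent with the baseline
```

Such a rule lets maintenance replace a degrading node before its intermittent faults turn into a permanent failure.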
-
Intermittent Failures: Increase of Transients
-
The Distinction between Bohrbugs and Heisenbugs [Gray, 1986]
- Bohrbugs are design errors in the software that cause reproducible failures, e.g., a logic error.
"Bohrbugs, like the Bohr atom, are solid, easily detected by standard techniques, and hence boring." [Gray, 1986]
- Heisenbugs are design errors in the software that seem to generate quasi-random failures, e.g., a synchronization error that will cause the occasional violation of an integrity condition.
"But Heisenbugs may elude a bugcatcher for years of execution. Indeed, the bugcatcher may perturb the situation just enough to make the Heisenbug disappear." [Gray, 1986]
- From a phenomenological point of view, a failure that is caused by a Heisenbug cannot be distinguished from a failure caused by a transient hardware malfunction.
- Experience shows that it is much more difficult to find and eliminate Heisenbugs than it is to eliminate Bohrbugs.
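As an illustration of the kind of synchronization error meant here (my example, not one from the slides), the classic lost-update bug only appears under one particular interleaving of two tasks. By simulating the interleaving explicitly the bug becomes reproducible — precisely the control a real Heisenbug denies us:

```python
def run(interleaving):
    """Simulate two tasks each doing 'counter += 1' as a non-atomic
    read-modify-write; 'interleaving' lists which task runs each step."""
    counter = 0
    local = {0: None, 1: None}
    step = {0: 0, 1: 0}              # 0 = next op is 'read', 1 = 'write'
    for task in interleaving:
        if step[task] == 0:
            local[task] = counter            # read shared counter
        else:
            counter = local[task] + 1        # write back (may overwrite!)
        step[task] += 1
    return counter

print(run([0, 0, 1, 1]))   # serial schedule: both increments survive -> 2
print(run([0, 1, 0, 1]))   # interleaved reads: one update is lost    -> 1
```

Run under a real scheduler, the second interleaving occurs only occasionally, which is why the resulting failures look quasi-random.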
-
Massive Transient Disturbances
- A massive transient disturbance occurs if the signals on a communication channel are distorted by an external energy source such that no communication is possible for a given interval of time (blackout interval), e.g., disturbance by Electromagnetic Interference (EMI) (radar pulse).
- Normally there are correlated effects on replicated channels.
- Self-stabilization mechanisms must:
  - detect the onset of a blackout,
  - monitor the duration of the disturbance, and
  - restart the communication.
-
Assumption about the Frequency of Faults of SoCs
Assumed behavioral hardware failure rates (order of magnitude), as type of failure — failure rate — source:
- Transient node failures (fail-silent): < 1 000 000 FIT (MTTF > 1 000 hours), about 10 000 times more probable than permanent failures. Source: neutron bombardment, aerospace.
- Transient node failures (non-fail-silent): < 10 000 FIT (MTTF > 100 000 hours). Source: fault-injection experiments.
- Permanent hardware failures: < 100 FIT (MTTF > 10 000 000 hours). Source: automotive field data.
Tendency: increase of transient failures!
-
The Cause of a Transient Fault
- External disturbances: e.g., high-energy radiation (hardware).
- Internal degradation of the chip hardware: e.g., corrosion of a PN junction (hardware).
- Heisenbugs: design errors in the software that are only activated under rare conditions, e.g., a design error in the synchronization of processes.
-
Technology Scaling Effects on Reliability
- Increase of power densities and temperatures as a consequence of device scaling.
- Higher temperatures have a negative effect on reliability because of:
  - electromigration,
  - thermo-mechanical stress caused by thermal cycles, and
  - dielectric (gate-oxide) breakdown.
- The smaller footprint of devices leads to multi-bit failures caused by a single ambient cosmic event.
- Manufacturing tolerances are more critical.
-
South Atlantic Anomaly
- Flux of energetic particles down to altitudes of about 200 km.
- Possible cause of the fast-paced degradation of the S-band amplifiers of the first-generation Globalstar² satellites.
² en.wikipedia.org/wiki/Globalstar
-
Single Event Upset for the UoSAT-3 Spacecraft³
Errors (bit flips) detected at the UoSAT-3 spacecraft in polar orbit.
³ http://www.esa.int/Our_Activities/Space_Engineering/Space_Environment/Radiation_effects
-
Integrity-Level of Application Domains
For each application class: system MTTF w.r.t. permanent failures, system MTTF w.r.t. transient failures, data-integrity requirement, market volume, and examples.
- Low integrity: > 10 years (permanent), > 1 year (transient); data integrity: low; market volume: huge; example: consumer electronics.
- Moderate integrity: > 100 years (permanent), > 10 years (transient); data integrity: moderate; market volume: large; example: present-day automotive.
- High integrity: > 1 000 years (permanent), > 100 years (transient); data integrity: very high; market volume: moderate; example: enterprise servers.
- Safety-critical: > 100 000 years (permanent), > 100 000 years (transient); data integrity: very high; market volume: small; example: flight control.
-
The Dilemma
- The Consumer Electronics (CE) domain has the market size to support the large development costs needed to build powerful SoCs.
- Since in the near future there is no need in the CE domain to mitigate the consequences of ambient cosmic radiation, the CE industry will not pay extra for hardening its chips.
- Architectural mitigation strategies therefore have to be developed such that replicated mass-market chips can be used to build high-integrity embedded systems.
-
Error Containment
In distributed computer systems the consequences of a fault (i.e., the ensuing error) can propagate outside the originating FCU by an erroneous message of the faulty node to the environment.
- A propagated error invalidates the independence assumption.
- The error detector must be in a different FCU than the faulty unit.
- Distinguish between architecture-based and application-based error detection.
- Distinguish between error detection in the time domain and error detection in the value domain.
Since an Error Containment Region (ECR) requires at least two FCUs, a single die cannot form an ECR!
-
Fault Containment versus Error Containment
-
Consequences for an Architecture
In a safety-critical application a System-on-Chip (SoC) must be considered to form a single FCU, i.e., a single unit of failure that can fail in an arbitrary failure mode, because of:
- the same physical space (physical-proximity failures),
- the same wafer production process and mask (mask-alignment issues),
- the same bulk material,
- the same power supply and the same earthing,
- the same timing source,
- the same ...
Although some of these dependencies can be eliminated, others cannot. We cannot assume an independent error detector on the same die.
-
Mitigation at the Architecture Level: TMR
Triple Modular Redundancy (TMR) is the generally accepted technique for the mitigation of component failures at the system level:
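The core of TMR is a majority voter over the outputs of three replica FCUs. A minimal sketch (it assumes replica outputs are exact, comparable values; in practice SOS failures make voting on near-identical analog values harder):

```python
from collections import Counter

def vote(replicas):
    """Majority voter over three replica outputs.

    Returns the value produced by at least two replicas, thereby
    masking one arbitrarily faulty replica; raises if no majority
    exists (fault hypothesis violated -> NGU territory).
    """
    value, count = Counter(replicas).most_common(1)[0]
    if count >= 2:
        return value
    raise RuntimeError("no majority: more than one replica failed")

print(vote([42, 42, 42]))   # fault-free case
print(vote([42, 17, 42]))   # one faulty replica is masked
```

Note that the voter itself must reside in a different FCU than the replicas, otherwise it becomes the single point of failure.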
-
Failure Modes of an FCU – Are there Restrictions?
-
Mitigation at the Architecture Level: TMR
-
Final Voter within a Voting Actuator
-
Final Voter at Actuator – Four Wheels of a Car
-
Requirements of TMR
What architectural services are needed to implement TMR at the architecture level?
- Provision of an FCU for each of the replicas,
- synchronization infrastructure,
- predictable multicast communication,
- replicated communication channels,
- support for voting, and
- deterministic (which includes timely) operation.
-
Simplex versus TMR Reliability (without repair)
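The comparison in this figure follows from the standard reliability model: with a constant failure rate λ and no repair, a single (simplex) unit has reliability R(t) = e^(−λt), and TMR with a perfect voter survives as long as at least two of three units do, giving R_TMR = 3R² − 2R³. A short sketch (λ is an illustrative value) showing that without repair TMR wins only for mission times below t = ln 2 / λ:

```python
import math

def r_simplex(lam: float, t: float) -> float:
    """Reliability of a single unit with constant failure rate lam."""
    return math.exp(-lam * t)

def r_tmr(lam: float, t: float) -> float:
    """TMR with a perfect voter survives if at least 2 of 3 units do:
    R_TMR = 3 R^2 - 2 R^3 (2-of-3 binomial)."""
    r = r_simplex(lam, t)
    return 3 * r**2 - 2 * r**3

lam = 1e-4  # failures/hour; crossover at ln(2)/lam ~ 6931 hours
# Early in the mission TMR is better; for long missions without repair it
# is worse, because three units accumulate faults faster than one.
print(r_tmr(lam, 1000) > r_simplex(lam, 1000))
print(r_tmr(lam, 10000) < r_simplex(lam, 10000))
```

This is why long-duration TMR systems additionally need repair (e.g., transient-state recovery) to restore the full replica set.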
-
Certification
An independent assessment of a given system design and its validation that ensures that the system is 'fit for purpose'.
- Carried out by a certification agency.
- Ensures that all justifiable precautions have been taken in order to minimize the risk to the public.
- Of particular importance in application fields where a single accident can cause catastrophic consequences for the public at large, e.g., nuclear power, aircraft.
- 'Shares the responsibility' in case of an accident.
-
What is a Safety Case?
A safety case comprises the totality of documented arguments and documented evidence that is used to justify the claim that a system is sufficiently safe for deployment:
- Diverse arguments to support the claim.
- Independent assessment on the basis of the documented evidence.
-
Safety-Case Principles
- Keep it simple (stupid): complexity is a source of error and unreliability. This applies to the requirements, architecture, specification, and implementation of the system and to the software engineering process.
- Phased development: the delivery of the safety case should be phased along with other project deliverables and integrated into the design process.
- Maintenance of the safety case: the safety case must be maintained in order to stay relevant.
- Foundations: the safety case should be developed in the context of a well-managed quality and safety management system.
-
The Core of the Safety Case
- Deterministic analysis of the hazards and faults that could arise and cause adverse effects (loss of life, injury, economic damage, ...).
- Demonstration of the sufficiency and adequacy of the provisions (engineering and procedural) taken. The arguments can be supported by probabilistic analysis. The use of mass-market components can help!
- Economic justification why specific measures have been taken and others have been excluded.
-
Which Evidence is Preferred?
- Deterministic over statistical
- Quantitative over qualitative
- Direct over indirect
- Product over process
-
ARINC RTCA/DO-178B
"The purpose of this document is to provide guidelines for the production of software for airborne systems and equipment that performs its intended function with a level of confidence in safety that complies with airworthiness requirements." [DO178B, 1992]
- The document has been produced by a committee consisting of representatives of the major aerospace companies, airlines, and regulatory bodies.
- RTCA/DO-178B represents an international consensus view of an approach that produces safe systems and is reasonably practical.
- It has been used in a number of major projects (e.g., Boeing 777).
-
Zero Failure Rate Software
- Is the claim of "zero failure-rate software" achievable and assessable?
- If the "zero failure-rate software" route is taken, then the first software failure invalidates the argument.
- Experience has shown that it is highly probable that software (and even hardware) is not free of design faults.
- Scientifically based statements: 10⁻⁵ failures/hour. Example: Ariane 5.
-
The ALARP Principle
-
Part II
End – Thank You!
-
References
[DO178B, 1992] DO178B, R. (1992). DO-178B: Software considerations in airborne systems and equipment certification. December 1st.
[Gray, 1986] Gray, J. (1986). Why do computers stop and what can be done about it? In Symposium on Reliability in Distributed Software and Database Systems, pages 3–12, Los Angeles, CA, USA.
[Neumann and Burks, 1966] Neumann, J. v. and Burks, A. W. (1966). Theory of Self-Reproducing Automata. University of Illinois Press.
[Reason and Reason, 1997] Reason, J. T. (1997). Managing the Risks of Organizational Accidents. Ashgate, Aldershot.