ESEVO Fault Tolerance
Bernhard Frömel, based on slides by Hermann Kopetz
Institute of Computer Engineering, Vienna University of Technology
182.722 Embedded Systems Engineering LU, October 2014

  • Part I: Fault Tolerance

  • Technological Paradise

    "[In a] technological paradise no acts of God can be permitted and everything happens according to the blueprints." [Hannes Alfvén, Nobel laureate]

    We are not living in a technological paradise!

  • Structure of Systems

    "If you look at automata which have been built by men or which exist in nature, you very frequently notice their structure is controlled to a much larger extent by the manner in which they might fail and by the (more or less effective) precautionary measures which have been taken against their failure." [Neumann and Burks, 1966]

  • Robustness

    In large systems it is highly improbable that all subsystems operate as specified.

    ⇒ Faults are the norm, rather than the exception. Robustness is concerned with the delivery of a useful level of service in the face of disturbances (e.g., hardware faults, software errors, changes of specification, inappropriate use, ...).

  • Design Challenges in Safety-Critical Applications

    In safety-critical applications, where the safety of the system-at-large (e.g., an airplane, car, ...) depends on the correct operation of the computer system (e.g., a primary flight control system, an x-by-wire system in a car), the following challenges must be addressed:

    - The 10⁻⁹ challenge
    - Modeling (the process of abstraction)
    - Faults (physical hardware faults, design faults, ...)
    - Human failures

  • The 10⁻⁹ Challenge

    - The system as a whole must be more reliable than any of its components: e.g., a system dependability of 1 Failure in Time (FIT) versus a component dependability of 1000 FIT, where 1 FIT is 1 failure in 10⁹ hours.
    - The architecture must be distributed and must support fault tolerance in order to mask component failures.
    - The system as a whole is not testable to the required level of dependability.
    - The safety argument is based on a combination of experimental evidence about the expected failure modes and failure rates of Fault Containment Units (FCUs), and a formal dependability model that depicts the system structure from the point of view of dependability.
    - The independence of FCUs is a critical issue.
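The FIT arithmetic behind the 10⁻⁹ challenge can be made concrete. A small sketch under the usual constant-failure-rate assumption (the function names are illustrative, not from the lecture):

```python
def fit_to_mttf_hours(fit: float) -> float:
    """Convert a failure rate in FIT (failures per 10^9 device-hours)
    into the Mean Time To Failure in hours."""
    return 1e9 / fit

def mttf_hours_to_fit(mttf_hours: float) -> float:
    """Inverse conversion: MTTF in hours back to a rate in FIT."""
    return 1e9 / mttf_hours

# A 1000 FIT component fails on average once every 10^6 hours,
# while the 1 FIT system target demands an MTTF of 10^9 hours --
# three orders of magnitude better than any single component.
component_mttf = fit_to_mttf_hours(1000)
system_mttf = fit_to_mttf_hours(1)
```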

  • Modeling: The Process of Abstracting

    - The behavior of safety-critical computer systems must be explainable by a hierarchically structured set of behavioral models, each of them of a cognitive complexity that can be handled by the human mind.
    - Establish a clear relationship between the behavioral model and the dependability model at such a high level of abstraction that the analysis of the dependability model becomes tractable. Example: any migration of a function from one Electronic Control Unit (ECU) to another ECU changes the dependability model and requires a new dependability analysis.
    - From the hardware point of view, a complete chip forms a single FCU that can fail in an arbitrary failure mode with a probability of 10⁻⁶ failures/hour (1000 FIT).

  • Fault Hypothesis and Assumption Coverage

    - The fault hypothesis states the assumptions about the types and numbers of faults that a fault-tolerant system must tolerate.
    - The assumption coverage states to what extent these assumptions are met by reality. The assumption coverage limits the dependability of even a perfect fault-tolerant system.
    - The fault hypothesis is the most important document in the design of fault-tolerant systems.
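The limiting effect of assumption coverage can be stated numerically: whenever reality violates the fault hypothesis, even a perfect design fails, so 1 − coverage bounds the achievable failure probability from below. A sketch with illustrative numbers:

```python
def dependability_ceiling(assumption_coverage: float) -> float:
    """Even a perfect fault-tolerant design fails whenever reality
    violates its fault hypothesis, so the system failure probability
    is bounded below by 1 - assumption_coverage."""
    return 1.0 - assumption_coverage

# With 99.99% assumption coverage, no amount of redundancy can push
# the failure probability below 10^-4 -- far short of a 10^-9 target.
floor = dependability_ceiling(0.9999)
```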

  • Fault Hypothesis I and Fault Hypothesis II

    Fault Hypothesis I: specification of the faults that must be tolerated without any impact on the essential system services (e.g., the arbitrary failure of any single unit).

    Fault Hypothesis II: specification of the faults that can be handled in the rare-event scenario, e.g., by the never-give-up (NGU) strategy. Example: massive transients that cause the failure of all communication and of more than one node over a given period.

  • System States of a Fault-Tolerant System

    [Figure: state diagram of a fault-tolerant system — correct states, normal failures covered by Fault Hypothesis I (masked by fault tolerance), and rare events covered by Fault Hypothesis II (handled by the NGU strategy).]

  • Approach to Safety: The Swiss-Cheese Model [Reason and Reason, 1997]

    [Figure: multiple layers of defense — normal operation, on-chip TMR, off-chip TMR, NGU strategy — standing between a subsystem failure and a catastrophic system failure.]

  • Why is the Fault Hypothesis Needed?

    - Design of the fault-tolerance algorithms: without a precise fault hypothesis it is not known which fault classes must be addressed during system design.
    - Estimation of the assumption coverage: the probability that the assumptions contained in the fault hypothesis are not met by reality.
    - Validation and certification: for the validation it must be known which faults are supposed to be tolerated by the given system.
    - Design of the never-give-up (NGU) strategy: in case the fault hypothesis is violated, the NGU process must be started.

  • Contents of the Fault Hypothesis

    - Unit of failure: what is the FCU?
    - Failure modes: what are the failure modes of the FCU?
    - Frequency of failures: what is the assumed Mean Time To Failure (MTTF) for the different failure modes, e.g., transient failures versus permanent failures?
    - Detection: how are failures detected? How long is the detection latency?
    - State recovery: how long does it take to repair corrupted state (in case of a transient fault)?

  • Unit of Failure: Fault Containment Unit (FCU)

    A Fault Containment Unit (FCU) is a set of subsystems that shares one or more common resources that can be affected by a single fault, and that is assumed to fail independently of other FCUs.

    - Tolerance w.r.t. spatial-proximity faults requires spatial separation of the FCUs: a distributed architecture is required.
    - The fault hypothesis must specify the failure modes of the FCUs and their associated frequencies.
    - Beware of shared resources that compromise the independence assumption: e.g., common hardware, power supply, oscillator, earthing, a single time source, ...

  • Independence of FCUs

    Two basic mechanisms compromise the independence of FCUs:

    - missing fault isolation, and
    - error propagation.

    The independence of failures of different FCUs is the most critical issue in the design of ultra-dependable systems.

    - Is it justified to assume that a single silicon die can contain two independent FCUs?
    - Can we assume that the failure modes of a single silicon die are well behaved (e.g., fail-silent) to the required level of probability?

  • Correlated Failures of a Single Die Caused by

    - Mask alignment, which gets more critical as feature size shrinks (data-sensitive failures)
    - Packaging faults
    - Power supply
    - Earthing
    - Timing source (oscillator)
    - Processing parameters out of range
    - Oxidation
    - Electromigration

    In the aerospace community it is assumed that a single silicon die forms a single FCU that can fail in an arbitrary failure mode with a probability of 10⁻⁶ failures per hour.

  • Critical Failure Modes of an FCU

    - Crash/omission (CO) failures
    - Massive transient disturbances
    - Babbling idiot failures
    - Masquerading failures
    - Slightly-Off-Specification (SOS) failures

  • Babbling Idiot Failures

    Due to a hardware or software fault, a node sends a message on a shared communication medium without adhering to the media-access discipline.

    - Fault-injection experiments show that about 1 out of 50 node failures is of the babbling-idiot type.
    - A dependent bus guardian reduces this probability to about 1 out of 1000 failures.
    - An independent bus guardian with its own clock-synchronization algorithm, power supply, etc. is needed in fail-operational safety-critical applications.
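The core of a bus guardian's job can be sketched as a slot-gating check: the transmit path is opened only during the node's own TDMA slot, so a babbling node is cut off at the physical layer. A minimal, illustrative sketch (the TDMA parameters and time units are assumptions, not from the slides):

```python
def bus_guardian_allows(now: float, slot_start: float,
                        slot_len: float, period: float) -> bool:
    """Return True only if 'now' falls inside the node's own TDMA
    slot within the periodic communication round. Any transmission
    attempt outside the slot is blocked by the guardian."""
    phase = (now - slot_start) % period
    return 0.0 <= phase < slot_len

# Node owns the slot [2.0, 3.0) in every round of length 10.
inside = bus_guardian_allows(2.5, slot_start=2.0, slot_len=1.0, period=10.0)
babbling = bus_guardian_allows(5.0, slot_start=2.0, slot_len=1.0, period=10.0)
```

An independent guardian must derive `now` from its own clock and power supply; otherwise a fault that corrupts the node's clock corrupts the guardian's view of time as well.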

  • Masquerading Failures

    A faulty node assumes the identity of another node and sends incorrect messages:

    - Any system that relies solely on information stored in a message is potentially dangerous.
    - Masquerading is a direct consequence of strong location transparency.
    - It makes diagnosis very difficult.

    Example: the Controller Area Network (CAN) bus.

  • Intermittent Errors

    An intermittent error exists if the transient error rate is significantly higher than the natural transient error rate. Causes for intermittents:

    - Slow physical degradation of the hardware (PN junctions, wires), with the effect of data-sensitive errors, temperature-sensitive errors, crosstalk, etc.
    - Design errors in the production process: e.g., a slight misalignment of the masks or variations in the processing steps lead to premature aging of the chip.

    More than half of the observed transient errors may be caused by intermittents.

  • Intermittent Failures: Increase of Transients

  • The Distinction between Bohrbugs and Heisenbugs [Gray, 1986]

    - Bohrbugs are design errors in the software that cause reproducible failures, e.g., a logic error.

      "Bohrbugs, like the Bohr atom, are solid, easily detected by standard techniques, and hence boring." [Gray, 1986]

    - Heisenbugs are design errors in the software that seem to generate quasi-random failures, e.g., a synchronization error that causes the occasional violation of an integrity condition.

      "But Heisenbugs may elude a bugcatcher for years of execution. Indeed, the bugcatcher may perturb the situation just enough to make the Heisenbug disappear." [Gray, 1986]

    - From a phenomenological point of view, a failure that is caused by a Heisenbug cannot be distinguished from a failure caused by a transient hardware malfunction.
    - Experience shows that it is much more difficult to find and eliminate Heisenbugs than to eliminate Bohrbugs.
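The classic Heisenbug, a lost update caused by a non-atomic read-modify-write, can be reproduced deterministically by fixing the thread interleaving by hand. This simulation is illustrative, not from the slides; real Heisenbugs occur only under rare, hard-to-reproduce timings:

```python
def run_counter(schedule):
    """Simulate two 'threads' each performing a non-atomic
    counter += 1 (a read step followed by a write step).
    'schedule' fixes the interleaving as a list of
    (thread_name, 'read' | 'write') pairs."""
    counter = 0
    local = {}
    for thread, step in schedule:
        if step == "read":
            local[thread] = counter          # read shared counter
        else:
            counter = local[thread] + 1      # write back incremented copy
    return counter

# Sequential interleaving: both increments survive.
ok = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]
# The 'Heisenbug' interleaving: B reads before A writes -> lost update.
racy = [("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")]
```

Only one of the 4!/(2!·2!) = 6 interleavings of the four steps loses an update in this toy; a debugger or added logging that perturbs the timing can make exactly that interleaving stop occurring, which is Gray's point.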

  • Massive Transient Disturbances

    - A massive transient disturbance occurs if the signals on a communication channel are distorted by an external energy source such that no communication is possible for a given interval of time (blackout interval), e.g., disturbance by Electromagnetic Interference (EMI) such as a radar pulse.
    - The effects on replicated channels are normally correlated.
    - Self-stabilization mechanisms must:
      - detect the onset of a blackout,
      - monitor the duration of the disturbance, and
      - restart the communication.

  • Assumptions about the Frequency of Faults of SoCs

    Assumed behavioral hardware failure rates (order of magnitude):

    - Transient node failures (fail-silent): < 1 000 000 FIT (MTTF > 1 000 hours); about 10 000 times more probable than permanent failures. Source: neutron bombardment, aerospace.
    - Transient node failures (non-fail-silent): < 10 000 FIT (MTTF > 100 000 hours). Source: fault-injection experiments.
    - Permanent hardware failures: < 100 FIT (MTTF > 10 000 000 hours). Source: automotive field data.

    Tendency: transient failures are increasing!

  • The Causes of a Transient Fault

    - External disturbances: e.g., high-energy radiation (hardware)
    - Internal degradation of the chip hardware: e.g., corrosion of a PN junction (hardware)
    - Heisenbugs: design errors in the software that are only activated under rare conditions, e.g., a design error in the synchronization of processes.

  • Technology Scaling Effects on Reliability

    - Power densities and temperatures increase as a consequence of device scaling.
    - Higher temperatures have a negative effect on reliability because of:
      - electromigration,
      - thermo-mechanical stress caused by thermal cycles, and
      - dielectric (gate-oxide) breakdown.
    - The smaller footprint of devices leads to multi-bit failures caused by a single ambient cosmic event.
    - Manufacturing tolerances become more critical.

  • South Atlantic Anomaly

    - A flux of energetic particles reaches down to altitudes of about 200 km.
    - Possible cause of the failures of first-generation Globalstar satellites (fast-paced degradation of the S-band amplifiers). See en.wikipedia.org/wiki/Globalstar.

  • Single Event Upsets at the UoSAT-3 Spacecraft

    Errors (bit flips) detected at the UoSAT-3 spacecraft in polar orbit. See http://www.esa.int/Our_Activities/Space_Engineering/Space_Environment/Radiation_effects.

  • Integrity Levels of Application Domains

    - Low integrity: system MTTF w.r.t. permanent failures > 10 years, w.r.t. transient failures > 1 year; low data-integrity requirement; huge market volume. Example: consumer electronics.
    - Moderate integrity: > 100 years (permanent), > 10 years (transient); moderate data integrity; large market volume. Example: present-day automotive.
    - High integrity: > 1 000 years (permanent), > 100 years (transient); very high data integrity; moderate market volume. Example: enterprise servers.
    - Safety-critical: > 100 000 years (permanent), > 100 000 years (transient); very high data integrity; small market volume. Example: flight control.

  • The Dilemma

    - The Consumer Electronics (CE) domain has the market size to support the large development costs needed to build powerful SoCs.
    - Since in the near future there is no need to mitigate the consequences of ambient cosmic radiation in the CE domain, the CE industry will not pay extra for hardening its chips.
    - Architectural mitigation strategies have to be developed such that replicated mass-market chips can be used to build high-integrity embedded systems.

  • Error Containment

    In distributed computer systems the consequences of a fault (i.e., the ensuing error) can propagate outside the originating FCU via an erroneous message of the faulty node to its environment.

    - A propagated error invalidates the independence assumption.
    - The error detector must be in a different FCU than the faulty unit.
    - Distinguish between architecture-based and application-based error detection.
    - Distinguish between error detection in the time domain and error detection in the value domain.

    Since an Error Containment Region (ECR) requires at least two FCUs, a single die cannot form an ECR!

  • Fault Containment versus Error Containment

  • Consequences for an Architecture

    In a safety-critical application a System-on-Chip (SoC) must be considered to form a single FCU, i.e., a single unit of failure that can fail in an arbitrary failure mode, because of:

    - the same physical space (physical-proximity failures),
    - the same wafer production process and mask (mask-alignment issues),
    - the same bulk material,
    - the same power supply and the same earthing,
    - the same timing source,
    - ...

    Although some of these dependencies can be eliminated, others cannot. We cannot assume an independent error detector on the same die.

  • Mitigation at the Architecture Level: TMR

    Triple Modular Redundancy (TMR) is the generally accepted technique for the mitigation of component failures at the system level.
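The TMR masking step is a bit-exact majority vote over the three replicas. An illustrative Python sketch (raising on total disagreement stands in for triggering the NGU strategy; it is not a prescription from the slides):

```python
from collections import Counter

def majority_vote(values):
    """Bit-exact majority voter for a TMR configuration: return the
    value delivered by a strict majority of the replicas, masking a
    single arbitrary replica failure. If no majority exists, the
    fault hypothesis is violated and the NGU strategy must take over."""
    value, count = Counter(values).most_common(1)[0]
    if count * 2 <= len(values):
        raise ValueError("no majority: fault hypothesis violated")
    return value

result = majority_vote([42, 7, 42])  # one faulty replica is masked
```

Note that bit-exact voting presupposes replica determinism: all three FCUs must compute from the same inputs in the same order, which is why the architectural services listed later (synchronization, predictable multicast, determinism) are prerequisites.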

  • Failure Modes of an FCU – Are there Restrictions?

  • Mitigation at the Architecture Level: TMR

  • Final Voter within a Voting Actuator

  • Final Voter at the Actuator – Four Wheels of a Car

  • Requirements of TMR

    What architectural services are needed to implement TMR at the architecture level?

    - Provision of an FCU for each of the replicas,
    - a synchronization infrastructure,
    - predictable multicast communication,
    - replicated communication channels,
    - support for voting, and
    - deterministic (which includes timely) operation.

  • Simplex versus TMR Reliability (without repair)
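Without repair, a TMR triple with a perfect voter survives while at least two of its three units survive, giving R_TMR(t) = 3R(t)² − 2R(t)³; TMR is more reliable than a simplex unit only while R(t) > 0.5, i.e., before t = ln 2 / λ. A quick numerical check under an assumed constant failure rate (the value of λ is illustrative):

```python
from math import exp

def r_simplex(t: float, failure_rate: float) -> float:
    """Reliability of a single unit with a constant failure rate."""
    return exp(-failure_rate * t)

def r_tmr(t: float, failure_rate: float) -> float:
    """Reliability of a non-repairable TMR triple with a perfect
    voter: the system survives while >= 2 of 3 units survive."""
    r = r_simplex(t, failure_rate)
    return 3 * r**2 - 2 * r**3

lam = 1e-4  # illustrative failure rate per hour
early = (r_tmr(100.0, lam), r_simplex(100.0, lam))       # TMR wins
late = (r_tmr(20000.0, lam), r_simplex(20000.0, lam))    # simplex wins
```

This is why TMR without repair improves short-mission reliability but degrades long-term reliability: three units accumulate failures faster than one.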

  • Certification

    An independent assessment of a given system design and its validation that ensures that the system is 'fit for purpose'.

    - Carried out by a certification agency.
    - Ensures that all justifiable precautions have been taken in order to minimize the risk to the public.
    - Of particular importance in application fields where a single accident can cause catastrophic consequences for the public at large, e.g., nuclear power, aircraft.
    - 'Shares the responsibility' in case of an accident.

  • What is a Safety Case?

    A safety case comprises the totality of documented arguments and documented evidence that is used to justify the claim that a system is sufficiently safe for deployment:

    - Diverse arguments to support the claim.
    - Independent assessment on the basis of the documented evidence.

  • Safety-Case Principles

    - Keep it Simple (Stupid): complexity is a source of error and unreliability. This applies to the requirements, architecture, specification, and implementation of the system, and to the software engineering process.
    - Phased development: delivery of the safety case should be phased along with the other project deliverables and integrated into the design process.
    - Maintenance of the safety case: the safety case must be maintained in order to stay relevant.
    - Foundations: the safety case should be developed in the context of a well-managed quality and safety management system.

  • The Core of the Safety Case

    - A deterministic analysis of the hazards and faults that could arise and cause adverse effects (loss of life, injury, economic damage, ...).
    - A demonstration of the sufficiency and adequacy of the provisions (engineering and procedural) taken. The arguments can be supported by probabilistic analysis. The use of mass-market components can help!
    - An economic justification of why specific measures have been taken and others have been excluded.

  • Which Evidence is Preferred?

    - Deterministic over statistical
    - Quantitative over qualitative
    - Direct over indirect
    - Product over process

  • ARINC RTCA/DO-178B

    "The purpose of this document is to provide guidelines for the production of software for airborne systems and equipment that performs its intended function with a level of confidence in safety that complies with airworthiness requirements." [DO178B, 1992]

    - The document has been produced by a committee consisting of representatives of the major aerospace companies, airlines, and regulatory bodies.
    - RTCA/DO-178B represents an international consensus view of an approach that produces safe systems and is reasonably practical.
    - It has been used in a number of major projects (e.g., the Boeing 777).

  • Zero-Failure-Rate Software

    - Is the claim of 'zero failure-rate software' achievable and assessable?
    - If the 'zero failure-rate software' route is taken, then the first software failure invalidates the argument.
    - Experience has shown that it is highly probable that software (and even hardware) is not free of design faults.
    - Scientifically based statements can only support on the order of 10⁻⁵ failures/hour. Example: Ariane 5.

  • The ALARP Principle

  • Part II: End – Thank You!

  • References

    [DO178B, 1992] RTCA DO-178B (1992). Software Considerations in Airborne Systems and Equipment Certification. December 1st.

    [Gray, 1986] Gray, J. (1986). Why do computers stop and what can be done about it? In Symposium on Reliability in Distributed Software and Database Systems, pages 3–12, Los Angeles, CA, USA.

    [Neumann and Burks, 1966] von Neumann, J. and Burks, A. W. (1966). Theory of Self-Reproducing Automata.

    [Reason and Reason, 1997] Reason, J. T. (1997). Managing the Risks of Organizational Accidents. Ashgate, Aldershot.