ESEVO Fault Tolerance
Bernhard Frömel, based on slides by Hermann Kopetz
Institute of Computer Engineering, Vienna University of Technology
182.722 Embedded Systems Engineering LU, October 2014

  • Part I: Fault Tolerance

  • Technological Paradise

    "[In a] technological paradise no acts of God can be permitted and everything happens according to the blueprints." [Hannes Alfvén, Nobel laureate]

    We are not living in a technological paradise!

  • Structure of Systems

    "If you look at automata which have been built by men or which exist in nature, you very frequently notice their structure is controlled to a much larger extent by the manner in which they might fail and by the (more or less effective) precautionary measures which have been taken against their failure." [Neumann and Burks, 1966]

  • Robustness

    In large systems it is highly improbable that all subsystems operate as specified.

    ⇒ Faults are the norm, rather than the exception. Robustness is concerned with the delivery of a useful level of service in the face of disturbances (e.g., hardware faults, software errors, changes of specification, inappropriate use, ...).

  • Design Challenges in Safety-Critical Applications

    In safety-critical applications, where the safety of the system-at-large (e.g., an airplane, car, ...) depends on the correct operation of the computer system (e.g., a primary flight control system, an x-by-wire system in a car), the following challenges must be addressed:

    - The 10⁻⁹ challenge
    - Modeling (the process of abstraction)
    - Faults (physical hardware faults, design faults, ...)
    - Human failures

  • The 10⁻⁹ Challenge

    - The system as a whole must be more reliable than any of its components: e.g., a system dependability of 1 Failure in Time (FIT) versus a component dependability of 1000 FIT, where 1 FIT is 1 failure in 10⁹ hours.
    - The architecture must be distributed and must support fault tolerance in order to mask component failures.
    - The system as a whole is not testable to the required level of dependability.
    - The safety argument is based on a combination of experimental evidence about the expected failure modes and failure rates of Fault Containment Units (FCUs), and a formal dependability model that depicts the system structure from the point of view of dependability.
    - The independence of FCUs is a critical issue.
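The FIT arithmetic behind the 10⁻⁹ challenge can be made concrete. A small sketch under the usual constant-failure-rate assumption (the function names are illustrative, not from the lecture):

```python
def fit_to_mttf_hours(fit: float) -> float:
    """Convert a failure rate in FIT (failures per 10^9 device-hours)
    into the Mean Time To Failure in hours."""
    return 1e9 / fit

def mttf_hours_to_fit(mttf_hours: float) -> float:
    """Inverse conversion: MTTF in hours back to a rate in FIT."""
    return 1e9 / mttf_hours

# A 1000 FIT component fails on average once every 10^6 hours,
# while the 1 FIT system target demands an MTTF of 10^9 hours --
# three orders of magnitude better than any single component.
component_mttf = fit_to_mttf_hours(1000)
system_mttf = fit_to_mttf_hours(1)
```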

  • Modeling: The Process of Abstracting

    - The behavior of safety-critical computer systems must be explainable by a hierarchically structured set of behavioral models, each of them of a cognitive complexity that can be handled by the human mind.
    - Establish a clear relationship between the behavioral model and the dependability model at such a high level of abstraction that the analysis of the dependability model becomes tractable. Example: any migration of a function from one Electronic Control Unit (ECU) to another ECU changes the dependability model and requires a new dependability analysis.
    - From the hardware point of view, a complete chip forms a single FCU that can fail in an arbitrary failure mode with a probability of 10⁻⁶ failures/hour (1000 FIT).

  • Fault Hypothesis and Assumption Coverage

    - The fault hypothesis states the assumptions about the types and numbers of faults that a fault-tolerant system must tolerate.
    - The assumption coverage states to what extent these assumptions are met by reality. The assumption coverage limits the dependability of even a perfect fault-tolerant system.
    - The fault hypothesis is the most important document in the design of fault-tolerant systems.
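The limiting effect of assumption coverage can be stated numerically: whenever reality violates the fault hypothesis, even a perfect design fails, so 1 − coverage bounds the achievable failure probability from below. A sketch with illustrative numbers:

```python
def dependability_ceiling(assumption_coverage: float) -> float:
    """Even a perfect fault-tolerant design fails whenever reality
    violates its fault hypothesis, so the system failure probability
    is bounded below by 1 - assumption_coverage."""
    return 1.0 - assumption_coverage

# With 99.99% assumption coverage, no amount of redundancy can push
# the failure probability below 10^-4 -- far short of a 10^-9 target.
floor = dependability_ceiling(0.9999)
```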

  • Fault Hypothesis I and Fault Hypothesis II

    Fault Hypothesis I: specification of the faults that must be tolerated without any impact on the essential system services (e.g., the arbitrary failure of any single unit).

    Fault Hypothesis II: specification of the faults that can be handled in the rare-event scenario, e.g., by the never-give-up (NGU) strategy. Example: massive transients that cause the failure of all communication and of more than one node over a given period.

  • System States of a Fault-Tolerant System

    [Figure: state diagram of a fault-tolerant system — correct states, normal failures covered by Fault Hypothesis I (masked by fault tolerance), and rare events covered by Fault Hypothesis II (handled by the NGU strategy).]

  • Approach to Safety: The Swiss-Cheese Model [Reason and Reason, 1997]

    [Figure: multiple layers of defense — normal operation, on-chip TMR, off-chip TMR, NGU strategy — standing between a subsystem failure and a catastrophic system failure.]

  • Why is the Fault Hypothesis Needed?

    - Design of the fault-tolerance algorithms: without a precise fault hypothesis it is not known which fault classes must be addressed during system design.
    - Estimation of the assumption coverage: the probability that the assumptions contained in the fault hypothesis are not met by reality.
    - Validation and certification: for the validation it must be known which faults are supposed to be tolerated by the given system.
    - Design of the never-give-up (NGU) strategy: in case the fault hypothesis is violated, the NGU process must be started.

  • Contents of the Fault Hypothesis

    - Unit of failure: what is the FCU?
    - Failure modes: what are the failure modes of the FCU?
    - Frequency of failures: what is the assumed Mean Time To Failure (MTTF) for the different failure modes, e.g., transient failures versus permanent failures?
    - Detection: how are failures detected? How long is the detection latency?
    - State recovery: how long does it take to repair corrupted state (in case of a transient fault)?

  • Unit of Failure: Fault Containment Unit (FCU)

    A Fault Containment Unit (FCU) is a set of subsystems that shares one or more common resources that can be affected by a single fault, and that is assumed to fail independently of other FCUs.

    - Tolerance w.r.t. spatial-proximity faults requires spatial separation of the FCUs: a distributed architecture is required.
    - The fault hypothesis must specify the failure modes of the FCUs and their associated frequencies.
    - Beware of shared resources that compromise the independence assumption: e.g., common hardware, power supply, oscillator, earthing, a single time source, ...

  • Independence of FCUs

    Two basic mechanisms compromise the independence of FCUs:

    - missing fault isolation, and
    - error propagation.

    The independence of failures of different FCUs is the most critical issue in the design of ultra-dependable systems.

    - Is it justified to assume that a single silicon die can contain two independent FCUs?
    - Can we assume that the failure modes of a single silicon die are well behaved (e.g., fail-silent) to the required level of probability?

  • Correlated Failures of a Single Die Caused by

    - Mask alignment, which gets more critical as feature size shrinks (data-sensitive failures)
    - Packaging faults
    - Power supply
    - Earthing
    - Timing source (oscillator)
    - Processing parameters out of range
    - Oxidation
    - Electromigration

    In the aerospace community it is assumed that a single silicon die forms a single FCU that can fail in an arbitrary failure mode with a probability of 10⁻⁶ failures per hour.

  • Critical Failure Modes of an FCU

    - Crash/omission (CO) failures
    - Massive transient disturbances
    - Babbling idiot failures
    - Masquerading failures
    - Slightly-Off-Specification (SOS) failures

  • Babbling Idiot Failures

    Due to a hardware or software fault, a node sends a message on a shared communication medium without adhering to the media-access discipline.

    - Fault-injection experiments show that about 1 out of 50 node failures is of the babbling-idiot type.
    - A dependent bus guardian reduces this probability to about 1 out of 1000 failures.
    - An independent bus guardian with its own clock-synchronization algorithm, power supply, etc. is needed in fail-operational safety-critical applications.
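The core of a bus guardian's job can be sketched as a slot-gating check: the transmit path is opened only during the node's own TDMA slot, so a babbling node is cut off at the physical layer. A minimal, illustrative sketch (the TDMA parameters and time units are assumptions, not from the slides):

```python
def bus_guardian_allows(now: float, slot_start: float,
                        slot_len: float, period: float) -> bool:
    """Return True only if 'now' falls inside the node's own TDMA
    slot within the periodic communication round. Any transmission
    attempt outside the slot is blocked by the guardian."""
    phase = (now - slot_start) % period
    return 0.0 <= phase < slot_len

# Node owns the slot [2.0, 3.0) in every round of length 10.
inside = bus_guardian_allows(2.5, slot_start=2.0, slot_len=1.0, period=10.0)
babbling = bus_guardian_allows(5.0, slot_start=2.0, slot_len=1.0, period=10.0)
```

An independent guardian must derive `now` from its own clock and power supply; otherwise a fault that corrupts the node's clock corrupts the guardian's view of time as well.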

  • Masquerading Failures

    A faulty node assumes the identity of another node and sends incorrect messages:

    - Any system that relies solely on information stored in a message is potentially dangerous.
    - Masquerading is a direct consequence of strong location transparency.
    - It makes diagnosis very difficult.

    Example: the Controller Area Network (CAN) bus.

  • Intermittent Errors

    An intermittent error exists if the transient error rate is significantly higher than the natural transient error rate. Causes for intermittents:

    - Slow physical degradation of the hardware (PN junctions, wires), with the effect of data-sensitive errors, temperature-sensitive errors, crosstalk, etc.
    - Design errors in the production process: e.g., a slight misalignment of the masks or variations in the processing steps lead to premature aging of the chip.

    More than half of the observed transient errors may be caused by intermittents.

  • Intermittent Failures: Increase of Transients

  • The Distinction between Bohrbugs and Heisenbugs [Gray, 1986]

    - Bohrbugs are design errors in the software that cause reproducible failures, e.g., a logic error.

      "Bohrbugs, like the Bohr atom, are solid, easily detected by standard techniques, and hence boring." [Gray, 1986]

    - Heisenbugs are design errors in the software that seem to generate quasi-random failures, e.g., a synchronization error that causes the occasional violation of an integrity condition.

      "But Heisenbugs may elude a bugcatcher for years of execution. Indeed, the bugcatcher may perturb the situation just enough to make the Heisenbug disappear." [Gray, 1986]

    - From a phenomenological point of view, a failure that is caused by a Heisenbug cannot be distinguished from a failure caused by a transient hardware malfunction.
    - Experience shows that it is much more difficult to find and eliminate Heisenbugs than to eliminate Bohrbugs.
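The classic Heisenbug, a lost update caused by a non-atomic read-modify-write, can be reproduced deterministically by fixing the thread interleaving by hand. This simulation is illustrative, not from the slides; real Heisenbugs occur only under rare, hard-to-reproduce timings:

```python
def run_counter(schedule):
    """Simulate two 'threads' each performing a non-atomic
    counter += 1 (a read step followed by a write step).
    'schedule' fixes the interleaving as a list of
    (thread_name, 'read' | 'write') pairs."""
    counter = 0
    local = {}
    for thread, step in schedule:
        if step == "read":
            local[thread] = counter          # read shared counter
        else:
            counter = local[thread] + 1      # write back incremented copy
    return counter

# Sequential interleaving: both increments survive.
ok = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]
# The 'Heisenbug' interleaving: B reads before A writes -> lost update.
racy = [("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")]
```

Only one of the 4!/(2!·2!) = 6 interleavings of the four steps loses an update in this toy; a debugger or added logging that perturbs the timing can make exactly that interleaving stop occurring, which is Gray's point.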

  • Massive Transient Disturbances

    - A massive transient disturbance occurs if the signals on a communication channel are distorted by an external energy source such that no communication is possible for a given interval of time (blackout interval), e.g., disturbance by Electromagnetic Interference (EMI) such as a radar pulse.
    - The effects on replicated channels are normally correlated.
    - Self-stabilization mechanisms must:
      - detect the onset of a blackout,
      - monitor the duration of the disturbance, and
      - restart the communication.

  • Assumptions about the Frequency of Faults of SoCs

    Assumed behavioral hardware failure rates (order of magnitude):

    - Transient node failures (fail-silent): < 1 000 000 FIT (MTTF > 1 000 hours); about 10 000 times more probable than permanent failures. Source: neutron bombardment, aerospace.
    - Transient node failures (non-fail-silent): < 10 000 FIT (MTTF > 100 000 hours). Source: fault-injection experiments.
    - Permanent hardware failures: < 100 FIT (MTTF > 10 000 000 hours). Source: automotive field data.

    Tendency: transient failures are increasing!

  • The Causes of a Transient Fault

    - External disturbances: e.g., high-energy radiation (hardware)
    - Internal degradation of the chip hardware: e.g., corrosion of a PN junction (hardware)
    - Heisenbugs: design errors in the software that are only activated under rare conditions, e.g., a design error in the synchronization of processes.

  • Technology Scaling Effects on Reliability

    - Power densities and temperatures increase as a consequence of device scaling.
    - Higher temperatures have a negative effect on reliability because of:
      - electromigration,
      - thermo-mechanical stress caused by thermal cycles, and
      - dielectric (gate-oxide) breakdown.
    - The smaller footprint of devices leads to multi-bit failures caused by a single ambient cosmic event.
    - Manufacturing tolerances become more critical.

  • South Atlantic Anomaly

    - A flux of energetic particles reaches down to altitudes of about 200 km.
    - Possible cause of the failures of first-generation Globalstar satellites (fast-paced degradation of the S-band amplifiers). See en.wikipedia.org/wiki/Globalstar.

  • Single Event Upsets at the UoSAT-3 Spacecraft

    Errors (bit flips) detected at the UoSAT-3 spacecraft in polar orbit. See http://www.esa.int/Our_Activities/Space_Engineering/Space_Environment/Radiation_effects.

  • Integrity Levels of Application Domains

    - Low integrity: system MTTF w.r.t. permanent failures > 10 years, w.r.t. transient failures > 1 year; low data-integrity requirement; huge market volume. Example: consumer electronics.
    - Moderate integrity: > 100 years (permanent), > 10 years (transient); moderate data integrity; large market volume. Example: present-day automotive.
    - High integrity: > 1 000 years (permanent), > 100 years (transient); very high data integrity; moderate market volume. Example: enterprise servers.
    - Safety-critical: > 100 000 years (permanent), > 100 000 years (transient); very high data integrity; small market volume. Example: flight control.

  • The Dilemma

    - The Consumer Electronics (CE) domain has the market size to support the large development costs needed to build powerful SoCs.
    - Since in the near future there is no need to mitigate the consequences of ambient cosmic radiation in the CE domain, the CE industry will not pay extra for hardening its chips.
    - Architectural mitigation strategies have to be developed such that replicated mass-market chips can be used to build high-integrity embedded systems.

  • Error Containment

    In distributed computer systems the consequences of a fault (i.e., the ensuing error) can propagate outside the originating FCU via an erroneous message of the faulty node to its environment.

    - A propagated error invalidates the independence assumption.
    - The error detector must be in a different FCU than the faulty unit.
    - Distinguish between architecture-based and application-based error detection.
    - Distinguish between error detection in the time domain and error detection in the value domain.

    Since an Error Containment Region (ECR) requires at least two FCUs, a single die cannot form an ECR!

  • Fault Containment versus Error Containment

  • Consequences for an Architecture

    In a safety-critical application a System-on-Chip (SoC) must be considered to form a single FCU, i.e., a single unit of failure that can fail in an arbitrary failure mode, because of:

    - the same physical space (physical-proximity failures),
    - the same wafer production process and mask (mask-alignment issues),
    - the same bulk material,
    - the same power supply and the same earthing,
    - the same timing source,
    - ...

    Although some of these dependencies can be eliminated, others cannot. We cannot assume an independent error detector on the same die.

  • Mitigation at the Architecture Level: TMR

    Triple Modular Redundancy (TMR) is the generally accepted technique for the mitigation of component failures at the system level.
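The TMR masking step is a bit-exact majority vote over the three replicas. An illustrative Python sketch (raising on total disagreement stands in for triggering the NGU strategy; it is not a prescription from the slides):

```python
from collections import Counter

def majority_vote(values):
    """Bit-exact majority voter for a TMR configuration: return the
    value delivered by a strict majority of the replicas, masking a
    single arbitrary replica failure. If no majority exists, the
    fault hypothesis is violated and the NGU strategy must take over."""
    value, count = Counter(values).most_common(1)[0]
    if count * 2 <= len(values):
        raise ValueError("no majority: fault hypothesis violated")
    return value

result = majority_vote([42, 7, 42])  # one faulty replica is masked
```

Note that bit-exact voting presupposes replica determinism: all three FCUs must compute from the same inputs in the same order, which is why the architectural services listed later (synchronization, predictable multicast, determinism) are prerequisites.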

  • Failure Modes of an FCU – Are there Restrictions?

  • Mitigation at the Architecture Level: TMR

  • Final Voter within a Voting Actuator

  • Final Voter at the Actuator – Four Wheels of a Car

  • Requirements of TMR

    What architectural services are needed to implement TMR at the architecture level?

    - Provision of an FCU for each of the replicas,
    - a synchronization infrastructure,
    - predictable multicast communication,
    - replicated communication channels,
    - support for voting, and
    - deterministic (which includes timely) operation.

  • Simplex versus TMR Reliability (without repair)
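Without repair, a TMR triple with a perfect voter survives while at least two of its three units survive, giving R_TMR(t) = 3R(t)² − 2R(t)³; TMR is more reliable than a simplex unit only while R(t) > 0.5, i.e., before t = ln 2 / λ. A quick numerical check under an assumed constant failure rate (the value of λ is illustrative):

```python
from math import exp

def r_simplex(t: float, failure_rate: float) -> float:
    """Reliability of a single unit with a constant failure rate."""
    return exp(-failure_rate * t)

def r_tmr(t: float, failure_rate: float) -> float:
    """Reliability of a non-repairable TMR triple with a perfect
    voter: the system survives while >= 2 of 3 units survive."""
    r = r_simplex(t, failure_rate)
    return 3 * r**2 - 2 * r**3

lam = 1e-4  # illustrative failure rate per hour
early = (r_tmr(100.0, lam), r_simplex(100.0, lam))       # TMR wins
late = (r_tmr(20000.0, lam), r_simplex(20000.0, lam))    # simplex wins
```

This is why TMR without repair improves short-mission reliability but degrades long-term reliability: three units accumulate failures faster than one.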

  • Certification

    An independent assessment of a given system design and its validation that ensures that the system is 'fit for purpose'.

    - Carried out by a certification agency.
    - Ensures that all justifiable precautions have been taken in order to minimize the risk to the public.
    - Of particular importance in application fields where a single accident can cause catastrophic consequences for the public at large, e.g., nuclear power, aircraft.
    - 'Shares the responsibility' in case of an accident.

  • What is a Safety Case?

    A safety case comprises the totality of documented arguments and documented evidence that is used to justify the claim that a system is sufficiently safe for deployment:

    - Diverse arguments to support the claim.
    - Independent assessment on the basis of the documented evidence.

  • Safety-Case Principles

    - Keep it Simple (Stupid): complexity is a source of error and unreliability. This applies to the requirements, architecture, specification, and implementation of the system, and to the software engineering process.
    - Phased development: delivery of the safety case should be phased along with the other project deliverables and integrated into the design process.
    - Maintenance of the safety case: the safety case must be maintained in order to stay relevant.
    - Foundations: the safety case should be developed in the context of a well-managed quality and safety management system.

  • The Core of the Safety Case

    - A deterministic analysis of the hazards and faults that could arise and cause adverse effects (loss of life, injury, economic damage, ...).
    - A demonstration of the sufficiency and adequacy of the provisions (engineering and procedural) taken. The arguments can be supported by probabilistic analysis. The use of mass-market components can help!
    - An economic justification of why specific measures have been taken and others have been excluded.

  • Which Evidence is Preferred?

    - Deterministic over statistical
    - Quantitative over qualitative
    - Direct over indirect
    - Product over process

  • ARINC RTCA/DO-178B

    "The purpose of this document is to provide guidelines for the production of software for airborne systems and equipment that performs its intended function with a level of confidence in safety that complies with airworthiness requirements." [DO178B, 1992]

    - The document has been produced by a committee consisting of representatives of the major aerospace companies, airlines, and regulatory bodies.
    - RTCA/DO-178B represents an international consensus view of an approach that produces safe systems and is reasonably practical.
    - It has been used in a number of major projects (e.g., the Boeing 777).

  • Zero-Failure-Rate Software

    - Is the claim of 'zero failure-rate software' achievable and assessable?
    - If the 'zero failure-rate software' route is taken, then the first software failure invalidates the argument.
    - Experience has shown that it is highly probable that software (and even hardware) is not free of design faults.
    - Scientifically based statements can only support on the order of 10⁻⁵ failures/hour. Example: Ariane 5.

  • The ALARP Principle

  • Part II: End – Thank You!

  • References

    [DO178B, 1992] RTCA DO-178B (1992). Software Considerations in Airborne Systems and Equipment Certification. December 1st.

    [Gray, 1986] Gray, J. (1986). Why do computers stop and what can be done about it? In Symposium on Reliability in Distributed Software and Database Systems, pages 3–12, Los Angeles, CA, USA.

    [Neumann and Burks, 1966] von Neumann, J. and Burks, A. W. (1966). Theory of Self-Reproducing Automata.

    [Reason and Reason, 1997] Reason, J. T. (1997). Managing the Risks of Organizational Accidents. Ashgate, Aldershot.