An Introduction to Fault-Tolerant Computer SystemsComputer ...€¦ · An Introduction to...

Guest lecture Digital o. datorteknik D1Johan Karlsson

Academic year 2010/11

Dept. of Computer Science and EngineeringChalmers University of Technology 1

Combitech Systems

An Introduction to Fault-TolerantComputer SystemsComputer Systems

Johan Karlsson

Dependable Real-time Systems Group

Department of Computer Science and Engineering

Chalmers University of Technology

Why fault tolerance?

We depend on computer systems!

Bank services

Drive-by-wire

Digital- och datorteknik 2010-11-17 2Johan Karlsson

Work stations

File servers

Fly-by-wire




Definition of fault tolerance

F lt t l t id i f ilFault tolerance means to avoid service failuresin the presence of faults.

Avizienis, et al., ”Basic Concepts and Taxonomy of Dependable and Secure Computing”


Fault-Tolerance – How?

• By introducing extra resources or redundancy• By introducing extra resources, or redundancy

• Forms of redundancy hardware redundancy

software redundancy

time redundancy

information redundancy





Applications areas for FT

Business-critical applicationsBusiness-critical applications financial transactions e-business general-purpose file servers web servers ...

Embedded Systemsy fly-by-wire drive-by-wire railway signaling factory automation ...


Brake-By-Wire System

Wheel node

Wheel node

Pedal node

Wheel node

Wheel node






Wheel node

Wheel node

Pedal node

Wheel node

Wheel node



Wheel node

Wheel node

Pedal node

Wheel node





Important concepts

• Fault tolerance• Fault tolerance• Fault tolerance To avoid service failures in the presence of faults

• Graceful degradation Gradual reduction of service in the presence of faults

• Safety

• Fault tolerance To avoid service failures in the presence of faults

• Graceful degradation Gradual reduction of service in the presence of faults

• Safety


Safety The occurrence of faults will not endanger human life

or the environment

Safety The occurrence of faults will not endanger human life

or the environment

Course Homepage

www.cse.chalmers.se/edu/course/EDA122

Also available via the student portal

Here you find:

• The course PM (contains all administrative information)


The course PM (contains all administrative information)

• Lecture slides

• Messages from the examiner

• Old exams, etc

10Lecture 1




Overview of topics

Fault-tolerant real-time systemsPrinciples of fault tolerance

DesignDesign

DependabilityDependability

System examplesFault tolerance in distributed systems

Life-cycle modelsProbabilistic analysis

Safety case


ManagementManagement Verification &Verification &ValidationValidation

Standards

Terminology

Hazard and risk analysis

y

Fault injection

Fault Types

• Random faults (physical faults)• Random faults (physical faults) Aging faults

External disturbances– Ionizing particle radiation

– Electromagnetic interference

• Systematic faults (design faults in HW or SW) Specification faults Specification faults

Design faults

Implementation faults





Physical hardware faults

• Discrete components wires contacts• Discrete components, wires, contacts

Open circuit

Short circuit

Bridging faults

• Integrated circuits Particle radiation-induced faultsParticle radiation induced faults

– Single event upsets (a.k.a. soft errors)

Electro migration

Gate-oxide breakdowns

Hot carrier injection, NBTI, ….


Electro Migration


Open circuit failure. Hillocking, short circuit failure. (W. D. Nix, et al., 1992)




Gate oxide breakdowns

• Gate oxide breakdowns increase leakage currents and change g gelectrical characteristics of transistors


Gate oxide in 90 nm technologyThickness: 5 atom layers

Gate oxide scalingSource: Intel 2005

Single event upset (SEU)

• SEU:s are particle induced upsets (bit-flips)(bit flips)

• Caused highly energetic particles such as neutron, protons and muons

SiO2 gate

Poly Si gate

Drain Source

Particle trajectory

Bit-flips SRAM cell


Si substratep

SiO2 gate

n+ n+

Particle strike in n-channel MOSFET transistor

Depletion region




Bathtub Curve


Terminology

Fault - Cause of an error e g an open circuit aFault - Cause of an error, e.g. an open circuit, asoftware bug, or an external disturbance.

Error - Part of the system state which is liable to

lead to failure, e.g. a wrong value in a program variable.

Failure - Delivered service does not comply with

the specification, e.g. a cruise control locks at fullspeed.





Cause-and-Effect Relationship

Specification Software

Implementation Mistakes

External Disturbances

SpecificationMistakes

SoftwareFaults

ErrorsSystemFailures


Component Defects

HardwareFaults

Hardware Redundancy

• Voting redundancy• Voting redundancy

• Stand-by redundancy





Voting redundancyTriple Modular Redundancy (TMR)

Module 1

Module 3

Module 2Input Voting

element

Output


The three modules receive the same inputs and perform thesame calculations.

The modules are assumed to fail independently, one at a time.

The voting element uses majority voting to mask errors.

The three modules receive the same inputs and perform thesame calculations.

The modules are assumed to fail independently, one at a time.

The voting element uses majority voting to mask errors.

Single Points of Failure ina TMR System

Single point of failure

Module 1

Module 2Input Voting

element

Output

g p


Module 3

Single point of failure




Masking RedundancyTMR with triple voting and triple inputs

Module 1Input 1

Module 2

Voter

Voter

Output 1

Output 2Input 2


Module 3 Voter Output 3Input 3

Masking RedundancyMulti-stage TMR

Module

Module

Voter

Voter

Input 1Module

Module

Voter

VoterInput 2

Output 1

Output 2


Module Voter Module VoterInput 3 Output 3




Standby redundancy

Primarymodule

Back-upd l

InputSwitch

Systemoutput

Failuredetector

Fail-oversignal

Digital- och datorteknik 2010-11-17

module

Johan Karlsson 25

HP’s NonStop Computer Systems

• Highly available computers for on line transaction• Highly available computers for on-line transaction processing (OLTP) systems

• Typical applications: Automatic teller machines, Stock trading, Funds transfer,

911 emergency centers, Medical records, Travel and hotel reservations, etc

• Availability: 0 99999 – “five nines” or 5 min downtime perAvailability: 0,99999 – five nines , or 5 min downtime per year

• Data integrity: 1 FIT = 10-9 undetected errors per hour (one undetected data error per billion hours)





Marketing information from HP(from 2005)

• TelecommunicationsTelecommunications 135 public telephone companies currently rely on NonStop

technology. More than half of all 911 calls in the United States and the majority

of wireless calls worldwide depend on NonStop servers.

• Finance Eighty percent of all ATM transactions worldwide and 66 percent of

all point-of-sale transactions worldwide are handled by NonStopserversservers.

NonStop technology powers 75 percent of the world’s 100 largest electronic funds transfer networks and 106 of the world’s 120 stock and commodity exchanges.


NonStop System withself-checked processors

Self-checked processorsSelf-checked processors Stop promptly if an error occurs

Prevent error propagation

Process pairs Critical software is implemented as a

process pair, with one primary and one backup process executing on different processors

The primary process execute the program and sends state changes regularly to the backup process.

Backup process takes over if the primary process fails by itself or as a result of a processor failure.





associated hardware announcements

Johan Karlsson 29Digital- och datorteknik 2010-11-17

The Itanium Processor

Itanium 2 processor from Intel Corp.

For more info see: http://en.wikipedia.org/wiki/Intel_Itanium





Logical Processors


Ariane 5 Disaster

”On 4 June 1996 the maiden flight of the Ariane 5 launcherOn 4 June 1996, the maiden flight of the Ariane 5 launcher

ended in failure. Only about 40 seconds after the initiation of

the flight sequence, at an altitude of about 3700 m, the

launcher veered off its flight path, broke up and exploded.”

Digital- och datorteknik 2010-11-17 32

(From J.L. Lions, et al, ARIANE 5 Flight 501 failure,http://www.esrin.esa.it/tidc/Press/Press96/ariane5rep.html)

Johan Karlsson





Ariane 5 Disaster

• Development cost 8 billion euros• Development cost ~8 billion euros

• Loss of ~500 million euros

• One year delay of the Ariane program





Ariane 5 Disaster

From the press release issued by ESA on July 23 1996:From the press release issued by ESA on July 23, 1996:(http://www.esrin.esa.it/tidc/Press/Press96/press33.html)

The Inquiry Board report begins by presenting the causes of the failure,analysis of the flight data having indicated:

nominal behaviour of the launcher up to H0 + 36 seconds;

failure of the back-up Inertial Reference System followedimmediately by failure of the active Inertial Reference System;

i lli i h i i f h l f h lid


swivelling into the extreme position of the nozzles of the two solidboosters and, slightly later, of the Vulcain engine, causing thelauncher to veer abruptly;

self-destruction of the launcher correctly triggered by the rupture ofthe electrical links between the solid boosters and the core stage.

Johan Karlsson

Ariane 5 Flight Control System

• An Inertial Reference System (SRI) measures the attitude of the• An Inertial Reference System (SRI) measures the attitude of thelauncher and its movements in space. The SRI calculates angles andvelocities which are sent to the On-Board Computer (OBC).

• The OBC executes the flight control program and controls the nozzlesof the solid boosters and the Vulcain Cryogenic engine.

• There are two SRIs with identical hardware and software operating


There are two SRIs with identical hardware and software operating,one active and one in hot stand-by.

Johan Karlsson




Ariane 5 DisasterChain of events, tracing backwards in time

(From J L Lions et al ARIANE 5 Flight 501 failure(From J.L. Lions, et al, ARIANE 5 Flight 501 failure,http://www.esrin.esa.it/tidc/Press/Press96/ariane5rep.html)

• The launcher started to disintegrate at about H0 + 39 seconds becauseof high aerodynamic loads due to an angle of attack of more than 20degrees that led to separation of the boosters from the main stage, inturn triggering the self-destruct system of the launcher.


• This angle of attack was caused by full nozzle deflections of the solidboosters and the Vulcain main engine.

Johan Karlsson


• These nozzle deflections were commanded by the On-Board• These nozzle deflections were commanded by the On-BoardComputer (OBC) software on the basis of data transmitted by theInertial Reference System 2 (SRI 2). Part of these data at that time didnot contain proper flight data, but showed a diagnostic bit pattern of thecomputer of the SRI 2, which was interpreted as flight data.

• The reason why the active SRI 2 did not send correct attitude data wasthat the unit had declared a failure due to a software exception.


• The OBC could not switch to the back-up SRI 1 because that unit hadalready ceased to function during the previous data cycle (72milliseconds period) for the same reason as SRI 2

Johan Karlsson





• The internal SRI software exception was caused during execution of aThe internal SRI software exception was caused during execution of adata conversion from 64-bit floating point to 16-bit signed integer value.The floating point number which was converted had a value greaterthan what could be represented by a 16-bit signed integer. Thisresulted in an Operand Error. The data conversion instructions (in Adacode) were not protected from causing an Operand Error, althoughother conversions of comparable variables in the same place in thecode were protected.


• The error occurred in a part of the software that only performsalignment of the strap-down inertial platform. This software modulecomputes meaningful results only before lift-off. As soon as thelauncher lifts off, this function serves no purpose.

Johan Karlsson


• The alignment function is operative for 50 seconds after starting of theThe alignment function is operative for 50 seconds after starting of theFlight Mode of the SRIs which occurs at H0 - 3 seconds for Ariane 5.Consequently, when lift-off occurs, the function continues for approx.40 seconds of flight. This time sequence is based on a requirement ofAriane 4 and is not required for Ariane 5.

• The Operand Error occurred due to an unexpected high value of aninternal alignment function result called BH, Horizontal Bias, related tothe horizontal velocity sensed by the platform. This value is calculated


y y pas an indicator for alignment precision over time.

• The value of BH was much higher than expected because the earlypart of the trajectory of Ariane 5 differs from that of Ariane 4 and resultsin considerably higher horizontal velocity values.

Johan Karlsson




Lessons Learned from theAriane 5 Disaster

• Both random faults and systematic (development/design) faults mustBoth random faults and systematic (development/design) faults mustbe considered.

• Do not expect software, which has proven to be reliable in oneenvironment, to be reliable in another environment.

• Ensure that system tests that are realistic.

• Use an “intelligent” error handling strategy

Shut down non-critical tasks (processes) that fail.


Try to recover from errors that occur in critical tasks before shutting downthe unit.

Enforce omission failures instead of crash failures, whenever possible.

Johan Karlsson

N-version programming

Programversion 1

ProgramInputs

Programversion 3

Programversion 2

p

Votingelement

ProgramOutput

ProgramInputs

ProgramInputs


Programversion 4

Johan Karlsson




Recovery blocks

PrimaryModule

ProgramInputs

Acceptancetests

SecondaryModule 2

SecondaryModule 1

p

N-to-1Switch

ProgramOutput

Error detection

ProgramInputs

ProgramInputs


SecondaryModule N

ProgramInputs

Johan Karlsson

Summary

• Fault tolerance Graceful degradation Safety• Fault tolerance, Graceful degradation, Safety

• Terminology: faults errors failures

• Physical faults

• Voting redundancy, Standby redundancy

• Design faults

D i di it• Design diversity





Questions?

Johan Karlsson 47Digital- och datorteknik 2010-11-17

An Introduction to Fault-Tolerant Computer SystemsComputer ...€¦ · An Introduction to...

Documents

Transcript of An Introduction to Fault-Tolerant Computer SystemsComputer ...€¦ · An Introduction to...