An Introduction to Fault-Tolerant Computer SystemsComputer ...€¦ · An Introduction to...
Transcript of An Introduction to Fault-Tolerant Computer SystemsComputer ...€¦ · An Introduction to...
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 1
Combitech Systems
An Introduction to Fault-TolerantComputer SystemsComputer Systems
Johan Karlsson
Dependable Real-time Systems Group
Department of Computer Science and Engineering
Chalmers University of Technology
Why fault tolerance?
We depend on computer systems!
Bank services
Drive-by-wire
Digital- och datorteknik 2010-11-17 2Johan Karlsson
Work stations
File servers
Fly-by-wire
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 2
Definition of fault tolerance
F lt t l t id i f ilFault tolerance means to avoid service failuresin the presence of faults.
Avizienis, et al., ”Basic Concepts and Taxonomy of Dependable and Secure Computing”
Digital- och datorteknik 2010-11-17 3Johan Karlsson
Fault-Tolerance – How?
• By introducing extra resources or redundancy• By introducing extra resources, or redundancy
• Forms of redundancy hardware redundancy
software redundancy
time redundancy
information redundancy
Digital- och datorteknik 2010-11-17 4Johan Karlsson
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 3
Applications areas for FT
Business-critical applicationsBusiness-critical applications financial transactions e-business general-purpose file servers web servers ...
Embedded Systemsy fly-by-wire drive-by-wire railway signaling factory automation ...
Digital- och datorteknik 2010-11-17 5Johan Karlsson
Brake-By-Wire System
Wheel node
Wheel node
Pedal node
Wheel node
Wheel node
Digital- och datorteknik 2010-11-17 6Johan Karlsson
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 4
Brake-By-Wire System
Wheel node
Wheel node
Pedal node
Wheel node
Wheel node
Digital- och datorteknik 2010-11-17 7Johan Karlsson
Brake-By-Wire System
Wheel node
Wheel node
Pedal node
Wheel node
Digital- och datorteknik 2010-11-17 8Johan Karlsson
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 5
Important concepts
• Fault tolerance• Fault tolerance• Fault tolerance To avoid service failures in the presence of faults
• Graceful degradation Gradual reduction of service in the presence of faults
• Safety
• Fault tolerance To avoid service failures in the presence of faults
• Graceful degradation Gradual reduction of service in the presence of faults
• Safety
Digital- och datorteknik 2010-11-17 9Johan Karlsson
Safety The occurrence of faults will not endanger human life
or the environment
Safety The occurrence of faults will not endanger human life
or the environment
Course Homepage
www.cse.chalmers.se/edu/course/EDA122
Also available via the student portal
Here you find:
• The course PM (contains all administrative information)
Digital- och datorteknik 2010-11-17 10Johan Karlsson
The course PM (contains all administrative information)
• Lecture slides
• Messages from the examiner
• Old exams, etc
10Lecture 1
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 6
Overview of topics
Fault-tolerant real-time systemsPrinciples of fault tolerance
DesignDesign
DependabilityDependability
System examplesFault tolerance in distributed systems
Life-cycle modelsProbabilistic analysis
Safety case
Digital- och datorteknik 2010-11-17 11Johan Karlsson
ManagementManagement Verification &Verification &ValidationValidation
Standards
Terminology
Hazard and risk analysis
y
Fault injection
Fault Types
• Random faults (physical faults)• Random faults (physical faults) Aging faults
External disturbances– Ionizing particle radiation
– Electromagnetic interference
• Systematic faults (design faults in HW or SW) Specification faults Specification faults
Design faults
Implementation faults
Digital- och datorteknik 2010-11-17 12Johan Karlsson
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 7
Physical hardware faults
• Discrete components wires contacts• Discrete components, wires, contacts
Open circuit
Short circuit
Bridging faults
• Integrated circuits Particle radiation-induced faultsParticle radiation induced faults
– Single event upsets (a.k.a. soft errors)
Electro migration
Gate-oxide breakdowns
Hot carrier injection, NBTI, ….
Digital- och datorteknik 2010-11-17 13Johan Karlsson
Electro Migration
Digital- och datorteknik 2010-11-17 14Johan Karlsson
Open circuit failure. Hillocking, short circuit failure. (W. D. Nix, et al., 1992)
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 8
Gate oxide breakdowns
• Gate oxide breakdowns increase leakage currents and change g gelectrical characteristics of transistors
Digital- och datorteknik 2010-11-17 15Johan Karlsson
Gate oxide in 90 nm technologyThickness: 5 atom layers
Gate oxide scalingSource: Intel 2005
Single event upset (SEU)
• SEU:s are particle induced upsets (bit-flips)(bit flips)
• Caused highly energetic particles such as neutron, protons and muons
SiO2 gate
Poly Si gate
Drain Source
Particle trajectory
Bit-flips SRAM cell
Digital- och datorteknik 2010-11-17 16Johan Karlsson
Si substratep
SiO2 gate
n+ n+
Particle strike in n-channel MOSFET transistor
Depletion region
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 9
Bathtub Curve
Digital- och datorteknik 2010-11-17 17Johan Karlsson
Terminology
Fault - Cause of an error e g an open circuit aFault - Cause of an error, e.g. an open circuit, asoftware bug, or an external disturbance.
Error - Part of the system state which is liable to
lead to failure, e.g. a wrong value in a program variable.
Failure - Delivered service does not comply with
the specification, e.g. a cruise control locks at fullspeed.
Digital- och datorteknik 2010-11-17 18Johan Karlsson
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 10
Cause-and-Effect Relationship
Specification Software
Implementation Mistakes
External Disturbances
SpecificationMistakes
SoftwareFaults
ErrorsSystemFailures
Digital- och datorteknik 2010-11-17 19Johan Karlsson
Component Defects
HardwareFaults
Hardware Redundancy
• Voting redundancy• Voting redundancy
• Stand-by redundancy
Digital- och datorteknik 2010-11-17 20Johan Karlsson
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 11
Voting redundancyTriple Modular Redundancy (TMR)
Module 1
Module 3
Module 2Input Voting
element
Output
Digital- och datorteknik 2010-11-17 21Johan Karlsson
The three modules receive the same inputs and perform thesame calculations.
The modules are assumed to fail independently, one at a time.
The voting element uses majority voting to mask errors.
The three modules receive the same inputs and perform thesame calculations.
The modules are assumed to fail independently, one at a time.
The voting element uses majority voting to mask errors.
Single Points of Failure ina TMR System
Single point of failure
Module 1
Module 2Input Voting
element
Output
g p
Digital- och datorteknik 2010-11-17 22Johan Karlsson
Module 3
Single point of failure
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 12
Masking RedundancyTMR with triple voting and triple inputs
Module 1Input 1
Module 2
Voter
Voter
Output 1
Output 2Input 2
Digital- och datorteknik 2010-11-17 23Johan Karlsson
Module 3 Voter Output 3Input 3
Masking RedundancyMulti-stage TMR
Module
Module
Voter
Voter
Input 1Module
Module
Voter
VoterInput 2
Output 1
Output 2
Digital- och datorteknik 2010-11-17 24Johan Karlsson
Module Voter Module VoterInput 3 Output 3
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 13
Standby redundancy
Primarymodule
Back-upd l
InputSwitch
Systemoutput
Failuredetector
Fail-oversignal
Digital- och datorteknik 2010-11-17
module
Johan Karlsson 25
HP’s NonStop Computer Systems
• Highly available computers for on line transaction• Highly available computers for on-line transaction processing (OLTP) systems
• Typical applications: Automatic teller machines, Stock trading, Funds transfer,
911 emergency centers, Medical records, Travel and hotel reservations, etc
• Availability: 0 99999 – “five nines” or 5 min downtime perAvailability: 0,99999 – five nines , or 5 min downtime per year
• Data integrity: 1 FIT = 10-9 undetected errors per hour (one undetected data error per billion hours)
Digital- och datorteknik 2010-11-17 26Johan Karlsson
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 14
Marketing information from HP(from 2005)
• TelecommunicationsTelecommunications 135 public telephone companies currently rely on NonStop
technology. More than half of all 911 calls in the United States and the majority
of wireless calls worldwide depend on NonStop servers.
• Finance Eighty percent of all ATM transactions worldwide and 66 percent of
all point-of-sale transactions worldwide are handled by NonStopserversservers.
NonStop technology powers 75 percent of the world’s 100 largest electronic funds transfer networks and 106 of the world’s 120 stock and commodity exchanges.
Digital- och datorteknik 2010-11-17 27Johan Karlsson
NonStop System withself-checked processors
Self-checked processorsSelf-checked processors Stop promptly if an error occurs
Prevent error propagation
Process pairs Critical software is implemented as a
process pair, with one primary and one backup process executing on different processors
The primary process execute the program and sends state changes regularly to the backup process.
Backup process takes over if the primary process fails by itself or as a result of a processor failure.
Digital- och datorteknik 2010-11-17 28Johan Karlsson
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 15
associated hardware announcements
Johan Karlsson 29Digital- och datorteknik 2010-11-17
The Itanium Processor
Itanium 2 processor from Intel Corp.
For more info see: http://en.wikipedia.org/wiki/Intel_Itanium
Digital- och datorteknik 2010-11-17 30Johan Karlsson
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 16
Logical Processors
Digital- och datorteknik 2010-11-17 31Johan Karlsson
Ariane 5 Disaster
”On 4 June 1996 the maiden flight of the Ariane 5 launcherOn 4 June 1996, the maiden flight of the Ariane 5 launcher
ended in failure. Only about 40 seconds after the initiation of
the flight sequence, at an altitude of about 3700 m, the
launcher veered off its flight path, broke up and exploded.”
Digital- och datorteknik 2010-11-17 32
(From J.L. Lions, et al, ARIANE 5 Flight 501 failure,http://www.esrin.esa.it/tidc/Press/Press96/ariane5rep.html)
Johan Karlsson
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 17
Digital- och datorteknik 2010-11-17 33Johan Karlsson
Digital- och datorteknik 2010-11-17 34Johan Karlsson
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 18
Digital- och datorteknik 2010-11-17 35Johan Karlsson
Ariane 5 Disaster
• Development cost 8 billion euros• Development cost ~8 billion euros
• Loss of ~500 million euros
• One year delay of the Ariane program
Digital- och datorteknik 2010-11-17 36Johan Karlsson
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 19
Ariane 5 Disaster
From the press release issued by ESA on July 23 1996:From the press release issued by ESA on July 23, 1996:(http://www.esrin.esa.it/tidc/Press/Press96/press33.html)
The Inquiry Board report begins by presenting the causes of the failure,analysis of the flight data having indicated:
nominal behaviour of the launcher up to H0 + 36 seconds;
failure of the back-up Inertial Reference System followedimmediately by failure of the active Inertial Reference System;
i lli i h i i f h l f h lid
Digital- och datorteknik 2010-11-17 37
swivelling into the extreme position of the nozzles of the two solidboosters and, slightly later, of the Vulcain engine, causing thelauncher to veer abruptly;
self-destruction of the launcher correctly triggered by the rupture ofthe electrical links between the solid boosters and the core stage.
Johan Karlsson
Ariane 5 Flight Control System
• An Inertial Reference System (SRI) measures the attitude of the• An Inertial Reference System (SRI) measures the attitude of thelauncher and its movements in space. The SRI calculates angles andvelocities which are sent to the On-Board Computer (OBC).
• The OBC executes the flight control program and controls the nozzlesof the solid boosters and the Vulcain Cryogenic engine.
• There are two SRIs with identical hardware and software operating
Digital- och datorteknik 2010-11-17 38
There are two SRIs with identical hardware and software operating,one active and one in hot stand-by.
Johan Karlsson
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 20
Ariane 5 DisasterChain of events, tracing backwards in time
(From J L Lions et al ARIANE 5 Flight 501 failure(From J.L. Lions, et al, ARIANE 5 Flight 501 failure,http://www.esrin.esa.it/tidc/Press/Press96/ariane5rep.html)
• The launcher started to disintegrate at about H0 + 39 seconds becauseof high aerodynamic loads due to an angle of attack of more than 20degrees that led to separation of the boosters from the main stage, inturn triggering the self-destruct system of the launcher.
Digital- och datorteknik 2010-11-17 39
• This angle of attack was caused by full nozzle deflections of the solidboosters and the Vulcain main engine.
Johan Karlsson
Ariane 5 DisasterChain of events, tracing backwards in time
• These nozzle deflections were commanded by the On-Board• These nozzle deflections were commanded by the On-BoardComputer (OBC) software on the basis of data transmitted by theInertial Reference System 2 (SRI 2). Part of these data at that time didnot contain proper flight data, but showed a diagnostic bit pattern of thecomputer of the SRI 2, which was interpreted as flight data.
• The reason why the active SRI 2 did not send correct attitude data wasthat the unit had declared a failure due to a software exception.
Digital- och datorteknik 2010-11-17 40
• The OBC could not switch to the back-up SRI 1 because that unit hadalready ceased to function during the previous data cycle (72milliseconds period) for the same reason as SRI 2
Johan Karlsson
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 21
Ariane 5 DisasterChain of events, tracing backwards in time
• The internal SRI software exception was caused during execution of aThe internal SRI software exception was caused during execution of adata conversion from 64-bit floating point to 16-bit signed integer value.The floating point number which was converted had a value greaterthan what could be represented by a 16-bit signed integer. Thisresulted in an Operand Error. The data conversion instructions (in Adacode) were not protected from causing an Operand Error, althoughother conversions of comparable variables in the same place in thecode were protected.
Digital- och datorteknik 2010-11-17 41
• The error occurred in a part of the software that only performsalignment of the strap-down inertial platform. This software modulecomputes meaningful results only before lift-off. As soon as thelauncher lifts off, this function serves no purpose.
Johan Karlsson
Ariane 5 DisasterChain of events, tracing backwards in time
• The alignment function is operative for 50 seconds after starting of theThe alignment function is operative for 50 seconds after starting of theFlight Mode of the SRIs which occurs at H0 - 3 seconds for Ariane 5.Consequently, when lift-off occurs, the function continues for approx.40 seconds of flight. This time sequence is based on a requirement ofAriane 4 and is not required for Ariane 5.
• The Operand Error occurred due to an unexpected high value of aninternal alignment function result called BH, Horizontal Bias, related tothe horizontal velocity sensed by the platform. This value is calculated
Digital- och datorteknik 2010-11-17 42
y y pas an indicator for alignment precision over time.
• The value of BH was much higher than expected because the earlypart of the trajectory of Ariane 5 differs from that of Ariane 4 and resultsin considerably higher horizontal velocity values.
Johan Karlsson
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 22
Lessons Learned from theAriane 5 Disaster
• Both random faults and systematic (development/design) faults mustBoth random faults and systematic (development/design) faults mustbe considered.
• Do not expect software, which has proven to be reliable in oneenvironment, to be reliable in another environment.
• Ensure that system tests that are realistic.
• Use an “intelligent” error handling strategy
Shut down non-critical tasks (processes) that fail.
Digital- och datorteknik 2010-11-17 43
Try to recover from errors that occur in critical tasks before shutting downthe unit.
Enforce omission failures instead of crash failures, whenever possible.
Johan Karlsson
N-version programming
Programversion 1
ProgramInputs
Programversion 3
Programversion 2
p
Votingelement
ProgramOutput
ProgramInputs
ProgramInputs
Digital- och datorteknik 2010-11-17 44
Programversion 4
Johan Karlsson
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 23
Recovery blocks
PrimaryModule
ProgramInputs
Acceptancetests
SecondaryModule 2
SecondaryModule 1
p
N-to-1Switch
ProgramOutput
Error detection
ProgramInputs
ProgramInputs
Digital- och datorteknik 2010-11-17 45
SecondaryModule N
ProgramInputs
Johan Karlsson
Summary
• Fault tolerance Graceful degradation Safety• Fault tolerance, Graceful degradation, Safety
• Terminology: faults errors failures
• Physical faults
• Voting redundancy, Standby redundancy
• Design faults
D i di it• Design diversity
Digital- och datorteknik 2010-11-17 46Johan Karlsson
Guest lecture Digital o. datorteknik D1Johan Karlsson
Academic year 2010/11
Dept. of Computer Science and EngineeringChalmers University of Technology 24
Questions?
Johan Karlsson 47Digital- och datorteknik 2010-11-17