Fundamental Concepts Neeraj Suri
description
Transcript of Fundamental Concepts Neeraj Suri
© Suri/DEEDSSWFT Course 2007-08 1
Fundamental Concepts
Neeraj Suri
Software Fault Tolerance (SWFT)
© Suri/DEEDSSWFT Course 2007-08 2
Overview
Motivation for Fault Tolerance Terminology
Faults, Errors and Failures
Dependability Recovery
Backward and forward
Redundancy Error Confinement
© Suri/DEEDSSWFT Course 2007-08 3
Examples of Software Failures Denver airport: Failure in luggage management system opening
delayed for several months Failure of a space probe sent to Mars due to inhomogeneity of
measuring units (inch and cm) Problems when space shuttle Endeavor met with Intelstat 6 due to
rounding of near-zero values Flaw in Apollo 11 software made moons gravity repulsive rather
than attractive Ariane V guidance failure AT&T system suffered a 9 hour US-wide blockade
Switch experienced abnormal behavior due to flaws in recovery Switch experienced abnormal behavior due to flaws in recovery software and propagated to all switchessoftware and propagated to all switches
Software problem caused radiation safety door of a nuclear power processing plant in the UK to open accidentally
Several patients killed through radiation overdoses due to software flaws in Therac-25 (cancer treatment system)
... (many, many, many more examples – WinXP, Vista !!!)
© Suri/DEEDSSWFT Course 2007-08 4
Software Failures
33%
62%
© Suri/DEEDSSWFT Course 2007-08 5
Has this trend continued?
[2] D. Oppenheimer et al. “Why do Internet services fail, and what can be done about it”, USENIX 2003.
(50%) (94%) (76%) (79%)(tolerance)
45%
2,5%
35%
17,5%
© Suri/DEEDSSWFT Course 2007-08 6
Microsoft EULA
EXCLUSION OF INCIDENTAL, CONSEQUENTIAL AND CERTAIN OTHER DAMAGES. TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, IN NO EVENT SHALL MICROSOFT OR ITS SUPPLIERS BE LIABLE FOR ANY SPECIAL, INCIDENTAL, INDIRECT, OR CONSEQUENTIAL DAMAGES WHATSOEVER (INCLUDING, BUT NOT LIMITED TO, DAMAGES FOR LOSS OF PROFITS OR CONFIDENTIAL OR OTHER INFORMATION, FOR BUSINESS INTERRUPTION, FOR PERSONAL INJURY, FOR LOSS OF PRIVACY, FOR FAILURE TO MEET ANY DUTY INCLUDING OF GOOD FAITH OR OF REASONABLE CARE, FOR NEGLIGENCE, AND FOR ANY OTHER PECUNIARY OR OTHER LOSS WHATSOEVER) ARISING OUT OF OR IN ANY WAY RELATED TO THE USE OF OR INABILITY TO USE THE SOFTWARE PRODUCT, THE PROVISION OF OR FAILURE TO PROVIDE SUPPORT SERVICES, OR OTHERWISE UNDER OR IN CONNECTION WITH ANY PROVISION OF THIS EULA, EVEN IN THE EVENT OF THE FAULT, TORT (INCLUDING NEGLIGENCE), STRICT LIABILITY, BREACH OF CONTRACT OR BREACH OF WARRANTY OF MICROSOFT OR ANY SUPPLIER, AND EVEN IF MICROSOFT OR ANY SUPPLIER HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
© Suri/DEEDSSWFT Course 2007-08 7
“Mistakes in software development will continue to be made, no matter how carefully the software is built, and failures will continue to occur: that is the way things are in engineering.”
John Knight, 2004
© Suri/DEEDSSWFT Course 2007-08 8
Reasons for Bad Software
Economic / Engineering Reasons need to sell software to pay for development diminishing return for fixing rare bugs
Legal Reasons no legal requirements for good software
Legacy Reasons need to be (bug) compatible
Technical Reasons software is complex!
© Suri/DEEDSSWFT Course 2007-08 9
Economical Reasons
Budget time, persons, money “too many bugs, too little time”
Time To Market expenses vs income
Reliability vs Features know thy customer!
Cost vs Benefit Tradeoff
© Suri/DEEDSSWFT Course 2007-08 10
Engineering Issues
Undetected bugs trigger conditions to complex to find bug
No / benign effects effect of bug is benign, or activation of bug is unlikely
Not easy to fix design is just flawed
Might introduce more bugs developers do not know code sufficiently well
© Suri/DEEDSSWFT Course 2007-08 11
Technical Reasons
Incomplete understanding of systems Too high complexity makes it impossible to make sure
that software is correct
Complexity For correctness, complexity should be kept minimal
© Suri/DEEDSSWFT Course 2007-08 12
Software Bugs
Professional programmer: about 100-150 “bugs” per 1000 lines of code exact value depends on many factors…
Example: Windows XP is about 45 million LOC would imply about 4.5 to 6.75 million bugs
© Suri/DEEDSSWFT Course 2007-08 13
Software Testing
Rule of thumb: In each round of testing one finds at most half of the
bugs
Multiple rounds of testing: find new bugs with new tests
Example: (million bugs) 4.5 2.251.120.560.280.14…
“Program testing can be a very effective way to
show the presence of bugs, but it is hopeless inadequate for showing their absence!”
Dijkstra
© Suri/DEEDSSWFT Course 2007-08 14
Software Size
A. Chou et al. “An Empirical Study of Operating System Errors”, SOSP 2001
© Suri/DEEDSSWFT Course 2007-08 15
Challenge: Software Continues to Grow...
Figure: Windows Generations
© Suri/DEEDSSWFT Course 2007-08 16
Projected Software Bugs
A. Chou et al. “An Empirical Study of Operating System Errors”, SOSP 2001
© Suri/DEEDSSWFT Course 2007-08 17
Software Bugs are a Fact
“If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time”
Shimon Peres
© Suri/DEEDSSWFT Course 2007-08 18
Thesis
Using software fault tolerance mechanisms one could provide better software, at a lower cost, and shorter time-to-market
Example: Apache is very successful project Uses many SWFT mechanisms to increase its robustness
© Suri/DEEDSSWFT Course 2007-08 19
Defense in Depth
Strategy: Avoid faults (e.g., reduce stress of programmers, better
languages) Remove faults (e.g., use appropriate tools) Cope with faults (i.e., tolerate faults) Predict faults
© Suri/DEEDSSWFT Course 2007-08 20
Motivation
Progress in software engineering Design Testing Formal methods ...
Experience still shows that assuming software to be bug free is … (& can be hazardous to your health )
© Suri/DEEDSSWFT Course 2007-08 21
Dependability Threats
Failure (“Ausfall”) Observable deviation from the specification
Error (“Fehler”) Part of the system state that may lead to a failure
Fault (“Störung”) “Defect” or “Flaw” of a system An active fault produces an error. Otherwise, it is
dormant.
© Suri/DEEDSSWFT Course 2007-08 22
Chain of Dependability Threats
Fault Error Failure
A fault activates an error A error propagates to become a failure
activation propagation
© Suri/DEEDSSWFT Course 2007-08 23
Example
0 int size = 0 ; long* p = 0;1 add_element (long value) 2 size += 1;3 p = (int*) realloc(p,sizeof(long)*size);4 p[size-1] = value;5 }
Activation: add_element(...) and no memory;
Error: p=0 and size > 0 (* before line 4 *);
Propagation: execution of line 4
Failure: segmentation violation
© Suri/DEEDSSWFT Course 2007-08 24
Error Propagation
Error Error
Failure of one component can activate an error in another component
One components fault is another components failure
ComponentFault
Failure
Component
© Suri/DEEDSSWFT Course 2007-08 25
Goal of Fault Tolerance/Robustness
The Goal of Fault Tolerance/Robustness is to
Avoid/Mitigate Failures in the Presence of Faults
© Suri/DEEDSSWFT Course 2007-08 26
Fault Tolerance
component
Error component
failure
fault
System Interface
fault
error free
© Suri/DEEDSSWFT Course 2007-08 27
Basic Fault Tolerant Approach
Define set of likely faults and likely bounds on fault frequency (failure model)
Components must be able to tolerate specified faults
Faults outside failure model might or might not be handled
© Suri/DEEDSSWFT Course 2007-08 28
Origin of Faults
ProcessManageme
nt
Specification
Operation
Maintenance
HumanInteraction
Design
Implementation
Environment
Component
DefectsReuse
Requirements
Engineering
Documentation
© Suri/DEEDSSWFT Course 2007-08 29
Fault Classification
Persistence Transient fault Intermittent fault (periodic fault) Permanent fault
Creation time Design fault Operational fault
Intention Accidental fault Intentional fault
© Suri/DEEDSSWFT Course 2007-08 30
Classical Failure Model
Crash failure Fail-silent and Fail-stop
Omission failure Timing failure
System fails to respond within a specified time slice Both late and early responses might be “bad” Late timing failure = performance failure
Arbitrary failure System behaves arbitrarily
© Suri/DEEDSSWFT Course 2007-08 31
Failure Hierarchy
The algorithms used forachieving any kind of
fault tolerance depend onthe system and failure model
Arbitrary
Timing
Omission
Crash
Fail-stop
© Suri/DEEDSSWFT Course 2007-08 32
Achieving Dependability
Fault tolerance should not be applied in isolation
Fault AvoidanceFault
Forecasting
Fault RemovalFault
Tolerance
© Suri/DEEDSSWFT Course 2007-08 33
Fault Avoidance
Reduce the number of faults during software construction Rigorous Software Development Process
• Requirements Specification & Analysis• Structured Design• Well-defined Mapping to Programming Languages• Clear Documentation
Formal Methods Software Reuse
© Suri/DEEDSSWFT Course 2007-08 34
Rigorous Software Development (1)
Requirements elicitation Discover what features each stakeholder expects the
system to provide Imperfect process
• Technical and non-technical people have to collaborate• Use-cases
Computer scientists can’t be experts in all application areas
Current software development processes are rigid• No support for “iterative” requirements engineering
© Suri/DEEDSSWFT Course 2007-08 35
Rigorous Software Development (2)
Requirements Analysis / Specification Specify in a clear and precise way what functionality your
system must provide Complete, but not too complex Consistent Determine (or even better: generate) test cases Should not be rigid
© Suri/DEEDSSWFT Course 2007-08 36
Drink Distributor Example (1)
Provides hot drinks: coffee, tea and chocolate User interface Cycle treatment
1. Insert money2. Choose drink3. Take change4. Take drink
Or press cancel and coins are given back
ChangeDrink
Coins Cancel
CoffeeTeaChocolate
Drink Distributor
© Suri/DEEDSSWFT Course 2007-08 37
Drink Distributor Example (2)
Incomplete specification No deadline for cancellation specified What if user inserts new coins before the end of a
cycle? What if the user changes his selection? What should be done when resources (change,
cups, spoons, sugar, coffee, tea, chocolate, water) run out?
Provide partial service?(e.g. only tea and coffee / require exact change)
If manufacturer and user make divergent interpretations, operation time failure will occur
© Suri/DEEDSSWFT Course 2007-08 38
Drink Distributor Example (3)
Augment specification Cancellation not possible once drink has been
chosen Add green / red light to indicate cycle start Only the first selected beverage is taken into
account Add lights to show availability of drinks
Each omission of constraint in the specification can lead to a failure in the service delivered to the user
Dissatisfaction Loss of money
© Suri/DEEDSSWFT Course 2007-08 39
Rigorous Software Development (3)
Structured design For instance in Object-Orientation:
Apply O-O principles, e.g. abstraction, information hiding, modularity, classification, to reduce complexity of the solution
Provide easy-to-read documentation• UML
Programming Methodology Good programming discipline
• Pair-programming Well-defined mapping of design models to programming
constructs Standards or coding conventions
© Suri/DEEDSSWFT Course 2007-08 40
Formal Methods (1)
Specifications are developed using mathematically tractable languages and tools Petri Nets, Algebraic Specifications
Permits to prove desired properties Generation of test cases Generation of code!
© Suri/DEEDSSWFT Course 2007-08 41
Formal Methods (2)
Mathematical specifications of software tend to be equal in size as the program itself just as error-prone
Tools (model-checkers) still face algorithmic challenges when attempting to prove properties of huge models
However, have been successfully applied for safety-critical components
© Suri/DEEDSSWFT Course 2007-08 42
Software Reuse
Well exercised software is less likely to fail Save development cost Undiscovered faults may appear when the
component is used in a new environment
© Suri/DEEDSSWFT Course 2007-08 43
Fault Removal
Detect and remove existing faults using proofs Assertion checking in simulator
Testing Exhaustive testing not feasible Can’t show the absence of faults Quality measures
Formal Inspection
© Suri/DEEDSSWFT Course 2007-08 44
Fault Forecasting
Also known asSoftware reliability measurement [Lyu96]
Estimation Gather failure data during operation or testing Apply statistical inference techniques
Prediction Gather software metrics during development
Fault forecasting can indicate the need for additional testing or for applying fault tolerance
© Suri/DEEDSSWFT Course 2007-08 45
Seriousness Classes (1)
DO-178B (standard for SW certification), civil aeronautics Without effects Minor / benign
• Upset passengers, small increase in workload for the Upset passengers, small increase in workload for the crewcrew
Major / significant• Injuries of the passengers / crew and reducing the Injuries of the passengers / crew and reducing the
efficiency of the crewefficiency of the crew Dangerous / serious
• Small number of casualties / serious injuries, or Small number of casualties / serious injuries, or preventing the crew from achieving its task in a preventing the crew from achieving its task in a precise and complete mannerprecise and complete manner
Catastrophic / disastrous• Leading to human lives lostLeading to human lives lost
© Suri/DEEDSSWFT Course 2007-08 46
Seriousness Classes (2)
DO-178B, civil aeronautics Without effects Minor / benign
• Probable: p > 10Probable: p > 10-5-5
Major / significant• Rare: 10Rare: 10-7-7 < p < 10 < p < 10-5-5
Dangerous / serious• Extremely rare: 10Extremely rare: 10-9 -9 < p < 10< p < 10-7-7
Catastrophic / disastrous• Extremely improbable: p < 10Extremely improbable: p < 10-9 -9
© Suri/DEEDSSWFT Course 2007-08 47
Software Fault Tolerance
Tolerate faults that remain in the system after development, preventing system failure Remove errors and their effects from the computational
state before a failure occurs
© Suri/DEEDSSWFT Course 2007-08 48
Classification
Single Version Software Monitoring techniques, atomicity of actions, decision
verification, exception handling
Multi-version Software Functionally independent, yet equivalent software Recovery blocks, N-version programming, …
Multiple Data Representation Retry blocks, N-copy programming, …
© Suri/DEEDSSWFT Course 2007-08 49
Recovery
Error detection Identify erroneous state
Error diagnosis Assess the damage
Error containment Prevent further damage / error propagation
Error recovery Substitute the erroneous state with an error-free one
© Suri/DEEDSSWFT Course 2007-08 50
Backward Error Recovery (1)
System state is saved at predetermined recovery points Called checkpointing Incremental checkpointing, log
State should be checkpointed on stable storage, not affected by failures
Recover error-free state by rolling back to a previously saved (error-free) state
© Suri/DEEDSSWFT Course 2007-08 51
Backward Error Recovery (2)
Checkpoint
Checkpoint
Faultdetected
Fault manifests
Assumption: faulty behaviorAfter last checkpoint
© Suri/DEEDSSWFT Course 2007-08 52
Backward Error Recovery (2)
Checkpoint
Checkpoint
Rollback
Depending on the technique used:• Try again• Try a different alternate• Do nothing (wait for the next request)
© Suri/DEEDSSWFT Course 2007-08 53
Backward Error Recovery (2)
Checkpoint
Checkpoint
Checkpoint
Checkpoint
© Suri/DEEDSSWFT Course 2007-08 54
Advantages of Backward Recovery
+ Requires no knowledge of the errors in the system state
+ Can handle arbitrary / unpredictable faults (as long as they do not affect the recovery mechanism)
+ Can be applied regardless of the sustained damage (the saved state must be error-free, though)
+ General scheme / application independent+ Particularly suitable for recovering from transient
faults
© Suri/DEEDSSWFT Course 2007-08 55
Disadvantages of Backward Recovery
―Requires significant resources (e.g. time, computation, stable storage) for checkpointing and recovery
―Checkpointing requires To identify consistent states The system to be halted / slowed down temporarily
―Care must be taken in concurrent systems to avoid the domino effect
© Suri/DEEDSSWFT Course 2007-08 56
Forward Error Recovery
Detect the error Damage assessment Build a new error-free state from which the system
can continue execution “Safe stop” Degraded mode Error compensation
• E.g., switching to a different component, etc…
© Suri/DEEDSSWFT Course 2007-08 57
Forward Error Recovery (2)
Faultdetecte
d
Fault manifests
State Reconstruction
Damage Assessment
© Suri/DEEDSSWFT Course 2007-08 58
Advantages of Forward Recovery
+ Efficient (time / memory) If the characteristics of the fault are well understood,
forward recovery is the most efficient solution
+ Well suited for real-time applications Missed deadlines can be addressed
+ Anticipated faults can be dealt with in a timely way using redundancy
© Suri/DEEDSSWFT Course 2007-08 59
Disadvantages of Forward Recovery
—Application-specific—Can only remove predictable errors from the
system state—Requires knowledge of the actual error—Depends on the accuracy of error detection,
potential damage prediction, and actual damage assessment
—Not usable if the system state is damaged beyond recoverability
© Suri/DEEDSSWFT Course 2007-08 60
Redundancy
Key concept of fault tolerance Hardware redundancy
• Most common use of redundancyMost common use of redundancy• We’re not going to address itWe’re not going to address it
Software redundancy• Additional applications, modules, objects used in the Additional applications, modules, objects used in the
system to support fault tolerancesystem to support fault tolerance Information redundancy
• Error-detecting or error-correcting codesError-detecting or error-correcting codes• Diverse dataDiverse data• Data produced for fault toleranceData produced for fault tolerance
Time redundancy• Use additional time for fault toleranceUse additional time for fault tolerance
© Suri/DEEDSSWFT Course 2007-08 61
Architectural Structure
Systems, especially concurrent ones, are increasingly complex
Consist of several components / subcomponents Fault tolerance must account for that
Different fault tolerance approaches for each component Failure of a subcomponent can be perceived as a fault in
the parent component Clear structuring reduces complexity
© Suri/DEEDSSWFT Course 2007-08 62
Error Confinement
System partitioned into regions, beyond which effects of faults should not propagate
Components should only be accessible through a well-defined (and preferably narrow [Kop97]) interface
Different confinement regions may employ different fault tolerance techniques depending on failure semantics of the environment and subcomponents
© Suri/DEEDSSWFT Course 2007-08 63
Normal Processing
Idealized Fault-Tolerant Component (1)
Error Processing
Local Exception
Return to normal
Service Request
Service Request Reply
Reply
Interface Exception
Failure Exception
Failure Exception
Interface Exception
© Suri/DEEDSSWFT Course 2007-08 64
Idealized Fault-Tolerant Component (2)
Receives requests for service Produces responses 3 kinds of exceptions
Interface exception: An invalid service request has been made
Local exception: An internal error is detected Failure exception: Component is unable to provide the
requested service Recursive structure
© Suri/DEEDSSWFT Course 2007-08 65
Recommended Reading
[Lyu96] Lyu, M. R. (ed.): Handbook of Software Reliability Engineering, New York, IEEE Computer Society Press, McGraw-Hill, 1996.
[Kop97] Kopetz, H.: Real–Time Systems — Design Principles for Distributed Embedded Applications. Kluwer Academic Publishers, 1997.
[RX95] Randell, B.; Xu, J.: The Evolution of the Recovery Block Concept, chapter 1, pp. 1 – 21, in Lyu, M. R. (Ed.): Software Fault Tolerance, John Wiley & Sons, 1995.
[1] M. Sullivan and R. Chillarege, “A comparison of Software Defects in Database Management Systems and Operating Systems”, FTCS, p475-484, July 1992.
[2] D. Oppenheimer, A. Ganapathi, and D. A. Patterson, “Why do Internet services fail, and what can be done about it”, in Proc. USENIX Symposium on Internet Technologies and Systems, 2003.
[3] J. Gray, “A Census of Tandem System Availability Between 1985 and 1990”, Tandem Technical Report 90.1