Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 1

Fundamental Concepts

Neeraj Suri

Software Fault Tolerance (SWFT)


Overview

Motivation for Fault Tolerance Terminology

Faults, Errors and Failures

Dependability Recovery

Backward and forward

Redundancy Error Confinement


Examples of Software Failures Denver airport: Failure in luggage management system opening

delayed for several months Failure of a space probe sent to Mars due to inhomogeneity of

measuring units (inch and cm) Problems when space shuttle Endeavor met with Intelstat 6 due to

rounding of near-zero values Flaw in Apollo 11 software made moons gravity repulsive rather

than attractive Ariane V guidance failure AT&T system suffered a 9 hour US-wide blockade

Switch experienced abnormal behavior due to flaws in recovery Switch experienced abnormal behavior due to flaws in recovery software and propagated to all switchessoftware and propagated to all switches

Software problem caused radiation safety door of a nuclear power processing plant in the UK to open accidentally

Several patients killed through radiation overdoses due to software flaws in Therac-25 (cancer treatment system)

... (many, many, many more examples – WinXP, Vista !!!)


Software Failures

33%

62%


Has this trend continued?

[2] D. Oppenheimer et al. “Why do Internet services fail, and what can be done about it”, USENIX 2003.

(50%) (94%) (76%) (79%)(tolerance)

45%

2,5%

35%

17,5%


Microsoft EULA

EXCLUSION OF INCIDENTAL, CONSEQUENTIAL AND CERTAIN OTHER DAMAGES. TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, IN NO EVENT SHALL MICROSOFT OR ITS SUPPLIERS BE LIABLE FOR ANY SPECIAL, INCIDENTAL, INDIRECT, OR CONSEQUENTIAL DAMAGES WHATSOEVER (INCLUDING, BUT NOT LIMITED TO, DAMAGES FOR LOSS OF PROFITS OR CONFIDENTIAL OR OTHER INFORMATION, FOR BUSINESS INTERRUPTION, FOR PERSONAL INJURY, FOR LOSS OF PRIVACY, FOR FAILURE TO MEET ANY DUTY INCLUDING OF GOOD FAITH OR OF REASONABLE CARE, FOR NEGLIGENCE, AND FOR ANY OTHER PECUNIARY OR OTHER LOSS WHATSOEVER) ARISING OUT OF OR IN ANY WAY RELATED TO THE USE OF OR INABILITY TO USE THE SOFTWARE PRODUCT, THE PROVISION OF OR FAILURE TO PROVIDE SUPPORT SERVICES, OR OTHERWISE UNDER OR IN CONNECTION WITH ANY PROVISION OF THIS EULA, EVEN IN THE EVENT OF THE FAULT, TORT (INCLUDING NEGLIGENCE), STRICT LIABILITY, BREACH OF CONTRACT OR BREACH OF WARRANTY OF MICROSOFT OR ANY SUPPLIER, AND EVEN IF MICROSOFT OR ANY SUPPLIER HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.


“Mistakes in software development will continue to be made, no matter how carefully the software is built, and failures will continue to occur: that is the way things are in engineering.”

John Knight, 2004


Reasons for Bad Software

Economic / Engineering Reasons need to sell software to pay for development diminishing return for fixing rare bugs

Legal Reasons no legal requirements for good software

Legacy Reasons need to be (bug) compatible

Technical Reasons software is complex!


Economical Reasons

Budget time, persons, money “too many bugs, too little time”

Time To Market expenses vs income

Reliability vs Features know thy customer!

Cost vs Benefit Tradeoff


Engineering Issues

Undetected bugs trigger conditions to complex to find bug

No / benign effects effect of bug is benign, or activation of bug is unlikely

Not easy to fix design is just flawed

Might introduce more bugs developers do not know code sufficiently well


Technical Reasons

Incomplete understanding of systems Too high complexity makes it impossible to make sure

that software is correct

Complexity For correctness, complexity should be kept minimal


Software Bugs

Professional programmer: about 100-150 “bugs” per 1000 lines of code exact value depends on many factors…

Example: Windows XP is about 45 million LOC would imply about 4.5 to 6.75 million bugs


Software Testing

Rule of thumb: In each round of testing one finds at most half of the

bugs

Multiple rounds of testing: find new bugs with new tests

Example: (million bugs) 4.5 2.251.120.560.280.14…

“Program testing can be a very effective way to

show the presence of bugs, but it is hopeless inadequate for showing their absence!”

Dijkstra


Software Size

A. Chou et al. “An Empirical Study of Operating System Errors”, SOSP 2001


Challenge: Software Continues to Grow...

Figure: Windows Generations


Projected Software Bugs

A. Chou et al. “An Empirical Study of Operating System Errors”, SOSP 2001


Software Bugs are a Fact

“If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time”

Shimon Peres


Thesis

Using software fault tolerance mechanisms one could provide better software, at a lower cost, and shorter time-to-market

Example: Apache is very successful project Uses many SWFT mechanisms to increase its robustness


Defense in Depth

Strategy: Avoid faults (e.g., reduce stress of programmers, better

languages) Remove faults (e.g., use appropriate tools) Cope with faults (i.e., tolerate faults) Predict faults


Motivation

Progress in software engineering Design Testing Formal methods ...

Experience still shows that assuming software to be bug free is … (& can be hazardous to your health )


Dependability Threats

Failure (“Ausfall”) Observable deviation from the specification

Error (“Fehler”) Part of the system state that may lead to a failure

Fault (“Störung”) “Defect” or “Flaw” of a system An active fault produces an error. Otherwise, it is

dormant.


Chain of Dependability Threats

Fault Error Failure

A fault activates an error A error propagates to become a failure

activation propagation


Example

0 int size = 0 ; long* p = 0;1 add_element (long value) 2 size += 1;3 p = (int*) realloc(p,sizeof(long)*size);4 p[size-1] = value;5 }

Activation: add_element(...) and no memory;

Error: p=0 and size > 0 (* before line 4 *);

Propagation: execution of line 4

Failure: segmentation violation


Error Propagation

Error Error

Failure of one component can activate an error in another component

One components fault is another components failure

ComponentFault

Failure

Component


Goal of Fault Tolerance/Robustness

The Goal of Fault Tolerance/Robustness is to

Avoid/Mitigate Failures in the Presence of Faults


Fault Tolerance

component

Error component

failure

fault

System Interface

fault

error free


Basic Fault Tolerant Approach

Define set of likely faults and likely bounds on fault frequency (failure model)

Components must be able to tolerate specified faults

Faults outside failure model might or might not be handled


Origin of Faults

ProcessManageme

nt

Specification

Operation

Maintenance

HumanInteraction

Design

Implementation

Environment

Component

DefectsReuse

Requirements

Engineering

Documentation


Fault Classification

Persistence Transient fault Intermittent fault (periodic fault) Permanent fault

Creation time Design fault Operational fault

Intention Accidental fault Intentional fault


Classical Failure Model

Crash failure Fail-silent and Fail-stop

Omission failure Timing failure

System fails to respond within a specified time slice Both late and early responses might be “bad” Late timing failure = performance failure

Arbitrary failure System behaves arbitrarily


Failure Hierarchy

The algorithms used forachieving any kind of

fault tolerance depend onthe system and failure model

Arbitrary

Timing

Omission

Crash

Fail-stop


Achieving Dependability

Fault tolerance should not be applied in isolation

Fault AvoidanceFault

Forecasting

Fault RemovalFault

Tolerance


Fault Avoidance

Reduce the number of faults during software construction Rigorous Software Development Process

• Requirements Specification & Analysis• Structured Design• Well-defined Mapping to Programming Languages• Clear Documentation

Formal Methods Software Reuse


Rigorous Software Development (1)

Requirements elicitation Discover what features each stakeholder expects the

system to provide Imperfect process

• Technical and non-technical people have to collaborate• Use-cases

Computer scientists can’t be experts in all application areas

Current software development processes are rigid• No support for “iterative” requirements engineering



Requirements Analysis / Specification Specify in a clear and precise way what functionality your

system must provide Complete, but not too complex Consistent Determine (or even better: generate) test cases Should not be rigid


Drink Distributor Example (1)

Provides hot drinks: coffee, tea and chocolate User interface Cycle treatment

1. Insert money2. Choose drink3. Take change4. Take drink

Or press cancel and coins are given back

ChangeDrink

Coins Cancel

CoffeeTeaChocolate

Drink Distributor



Incomplete specification No deadline for cancellation specified What if user inserts new coins before the end of a

cycle? What if the user changes his selection? What should be done when resources (change,

cups, spoons, sugar, coffee, tea, chocolate, water) run out?

Provide partial service?(e.g. only tea and coffee / require exact change)

If manufacturer and user make divergent interpretations, operation time failure will occur



Augment specification Cancellation not possible once drink has been

chosen Add green / red light to indicate cycle start Only the first selected beverage is taken into

account Add lights to show availability of drinks

Each omission of constraint in the specification can lead to a failure in the service delivered to the user

Dissatisfaction Loss of money



Structured design For instance in Object-Orientation:

Apply O-O principles, e.g. abstraction, information hiding, modularity, classification, to reduce complexity of the solution

Provide easy-to-read documentation• UML

Programming Methodology Good programming discipline

• Pair-programming Well-defined mapping of design models to programming

constructs Standards or coding conventions


Formal Methods (1)

Specifications are developed using mathematically tractable languages and tools Petri Nets, Algebraic Specifications

Permits to prove desired properties Generation of test cases Generation of code!


Formal Methods (2)

Mathematical specifications of software tend to be equal in size as the program itself just as error-prone

Tools (model-checkers) still face algorithmic challenges when attempting to prove properties of huge models

However, have been successfully applied for safety-critical components


Software Reuse

Well exercised software is less likely to fail Save development cost Undiscovered faults may appear when the

component is used in a new environment


Fault Removal

Detect and remove existing faults using proofs Assertion checking in simulator

Testing Exhaustive testing not feasible Can’t show the absence of faults Quality measures

Formal Inspection


Fault Forecasting

Also known asSoftware reliability measurement [Lyu96]

Estimation Gather failure data during operation or testing Apply statistical inference techniques

Prediction Gather software metrics during development

Fault forecasting can indicate the need for additional testing or for applying fault tolerance


Seriousness Classes (1)

DO-178B (standard for SW certification), civil aeronautics Without effects Minor / benign

• Upset passengers, small increase in workload for the Upset passengers, small increase in workload for the crewcrew

Major / significant• Injuries of the passengers / crew and reducing the Injuries of the passengers / crew and reducing the

efficiency of the crewefficiency of the crew Dangerous / serious

• Small number of casualties / serious injuries, or Small number of casualties / serious injuries, or preventing the crew from achieving its task in a preventing the crew from achieving its task in a precise and complete mannerprecise and complete manner

Catastrophic / disastrous• Leading to human lives lostLeading to human lives lost


Seriousness Classes (2)

DO-178B, civil aeronautics Without effects Minor / benign

• Probable: p > 10Probable: p > 10-5-5

Major / significant• Rare: 10Rare: 10-7-7 < p < 10 < p < 10-5-5

Dangerous / serious• Extremely rare: 10Extremely rare: 10-9 -9 < p < 10< p < 10-7-7

Catastrophic / disastrous• Extremely improbable: p < 10Extremely improbable: p < 10-9 -9


Software Fault Tolerance

Tolerate faults that remain in the system after development, preventing system failure Remove errors and their effects from the computational

state before a failure occurs


Classification

Single Version Software Monitoring techniques, atomicity of actions, decision

verification, exception handling

Multi-version Software Functionally independent, yet equivalent software Recovery blocks, N-version programming, …

Multiple Data Representation Retry blocks, N-copy programming, …


Recovery

Error detection Identify erroneous state

Error diagnosis Assess the damage

Error containment Prevent further damage / error propagation

Error recovery Substitute the erroneous state with an error-free one


Backward Error Recovery (1)

System state is saved at predetermined recovery points Called checkpointing Incremental checkpointing, log

State should be checkpointed on stable storage, not affected by failures

Recover error-free state by rolling back to a previously saved (error-free) state



Checkpoint

Checkpoint

Faultdetected

Fault manifests

Assumption: faulty behaviorAfter last checkpoint



Checkpoint

Checkpoint

Rollback

Depending on the technique used:• Try again• Try a different alternate• Do nothing (wait for the next request)



Checkpoint

Checkpoint

Checkpoint

Checkpoint


Advantages of Backward Recovery

+ Requires no knowledge of the errors in the system state

+ Can handle arbitrary / unpredictable faults (as long as they do not affect the recovery mechanism)

+ Can be applied regardless of the sustained damage (the saved state must be error-free, though)

+ General scheme / application independent+ Particularly suitable for recovering from transient

faults


Disadvantages of Backward Recovery

―Requires significant resources (e.g. time, computation, stable storage) for checkpointing and recovery

―Checkpointing requires To identify consistent states The system to be halted / slowed down temporarily

―Care must be taken in concurrent systems to avoid the domino effect


Forward Error Recovery

Detect the error Damage assessment Build a new error-free state from which the system

can continue execution “Safe stop” Degraded mode Error compensation

• E.g., switching to a different component, etc…


Forward Error Recovery (2)

Faultdetecte

d

Fault manifests

State Reconstruction

Damage Assessment


Advantages of Forward Recovery

+ Efficient (time / memory) If the characteristics of the fault are well understood,

forward recovery is the most efficient solution

+ Well suited for real-time applications Missed deadlines can be addressed

+ Anticipated faults can be dealt with in a timely way using redundancy


Disadvantages of Forward Recovery

—Application-specific—Can only remove predictable errors from the

system state—Requires knowledge of the actual error—Depends on the accuracy of error detection,

potential damage prediction, and actual damage assessment

—Not usable if the system state is damaged beyond recoverability


Redundancy

Key concept of fault tolerance Hardware redundancy

• Most common use of redundancyMost common use of redundancy• We’re not going to address itWe’re not going to address it

Software redundancy• Additional applications, modules, objects used in the Additional applications, modules, objects used in the

system to support fault tolerancesystem to support fault tolerance Information redundancy

• Error-detecting or error-correcting codesError-detecting or error-correcting codes• Diverse dataDiverse data• Data produced for fault toleranceData produced for fault tolerance

Time redundancy• Use additional time for fault toleranceUse additional time for fault tolerance


Architectural Structure

Systems, especially concurrent ones, are increasingly complex

Consist of several components / subcomponents Fault tolerance must account for that

Different fault tolerance approaches for each component Failure of a subcomponent can be perceived as a fault in

the parent component Clear structuring reduces complexity


Error Confinement

System partitioned into regions, beyond which effects of faults should not propagate

Components should only be accessible through a well-defined (and preferably narrow [Kop97]) interface

Different confinement regions may employ different fault tolerance techniques depending on failure semantics of the environment and subcomponents


Normal Processing

Idealized Fault-Tolerant Component (1)

Error Processing

Local Exception

Return to normal

Service Request

Service Request Reply

Reply

Interface Exception

Failure Exception

Failure Exception

Interface Exception


Idealized Fault-Tolerant Component (2)

Receives requests for service Produces responses 3 kinds of exceptions

Interface exception: An invalid service request has been made

Local exception: An internal error is detected Failure exception: Component is unable to provide the

requested service Recursive structure


Recommended Reading

[Lyu96] Lyu, M. R. (ed.): Handbook of Software Reliability Engineering, New York, IEEE Computer Society Press, McGraw-Hill, 1996.

[Kop97] Kopetz, H.: Real–Time Systems — Design Principles for Distributed Embedded Applications. Kluwer Academic Publishers, 1997.

[RX95] Randell, B.; Xu, J.: The Evolution of the Recovery Block Concept, chapter 1, pp. 1 – 21, in Lyu, M. R. (Ed.): Software Fault Tolerance, John Wiley & Sons, 1995.

[1] M. Sullivan and R. Chillarege, “A comparison of Software Defects in Database Management Systems and Operating Systems”, FTCS, p475-484, July 1992.

[2] D. Oppenheimer, A. Ganapathi, and D. A. Patterson, “Why do Internet services fail, and what can be done about it”, in Proc. USENIX Symposium on Internet Technologies and Systems, 2003.

[3] J. Gray, “A Census of Tandem System Availability Between 1985 and 1990”, Tandem Technical Report 90.1

Fundamental Concepts Neeraj Suri

Documents

Transcript of Fundamental Concepts Neeraj Suri