
Software Safety in Embedded Systems &

Software Safety: Why, What, and How – Leveson

UC San Diego, CSE 294

Spring Quarter 2006
Barry Demchak

Previous Paper

System Safety in Computer-Controlled Automotive Systems – Leveson (2000)
Types of accidents
Safeware Methodology
  Project Management
  Software Hazard Analysis
  Software Requirements Specification & Analysis
  Software Design & Analysis
  Design & Analysis of Human-Machine Interaction
  Software Verification
  Feedback from Operational Experience
  Change Control and Analysis

Roadmap

Safety definitions
Industrial safety and risk
Systems Issues – hardware and software
Software Safety
Analysis and Modeling
Verification and Validation
System Safety Engineering

Safety Before Computers

NASA: 10^-9 chance of failure over a 10-hour flight

British nuclear reactors: no single fault can cause a reactor to trip, and a 10^-7 chance over 5000 hours of failing to meet a demand to trip

FAA: 10^-9 chance per flight hour (i.e., not within the total life span of the entire fleet: a few thousand aircraft flying on the order of 10^5 hours each accumulate roughly 10^8 flight hours, so fewer than one such failure is expected)

Introduction of Computers

Nuclear Power Plants
Space Shuttle
Airbus Aircraft
Space Satellites
NORAD

Purpose: perform functions that are too dangerous, quick, or complex for humans

System Safety (def.)

Subdiscipline of systems engineering
Applies scientific, management, and engineering principles
Ensures adequate safety throughout the system life cycle
Constrained by operational effectiveness, time, and cost
MilSpec: “freedom from those conditions that can cause death, injury, occupational illness, or damage to or loss of equipment or property”

More Definitions

Accident
  Unwanted and unexpected release of energy
Mishap (or failure)
  Unplanned event or series of events resulting in death, injury, occupational illness, damage to or loss of equipment or property, or environmental harm
Hazard
  A condition that can lead to a mishap

More Definitions (cont’d)

Risk – a function of
  Hazard probability: the probability of a hazardous state occurring
  The probability of the hazardous state leading to a mishap
  Hazard criticality (severity): the perceived severity of the worst potential mishap that could result from the hazard
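
One common way to fold these factors into a single figure of merit (an assumption here; the slide only lists the factors) is to treat risk as the product

  Risk = P(hazard) × P(mishap | hazard) × Severity(worst-case mishap)

so that reducing any one factor reduces the overall risk.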

Early Approach

Operational or Industrial Safety
  Examining the system during its operating life
  Correcting unacceptable hazards
  Ignores the crushing effect of a single catastrophe
Assumptions
  All faults caused by human errors could be avoided completely, or located and removed prior to delivery and operation
  Relatively low complexity of hardware

Ford Pinto (early 1970s)

Specifications: 2,000 pounds, $2,000 sale price
Use existing factory tooling
Safety issue with gas tank placement
Analysis
  Deaths cost $200,000, burns cost $67,000
  Cost to make the change: $137M; benefit: $49M
Ford engineer: “But you miss the point entirely. You see, safety isn't the issue, trunk space is. You have no idea how stiff the competition is over trunk space.”
Ford president: “Safety doesn’t sell”
Verdict: $100M
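
As a rough check of how the $137M and $49M figures arise, the sketch below uses the commonly quoted internal-memo counts and unit costs; those per-unit numbers are assumptions for illustration, since the slide states only the totals.

```python
# Hedged reconstruction of the Pinto cost-benefit arithmetic.
# The counts and unit costs below are the widely quoted memo figures,
# assumed here for illustration; the slide gives only the $137M/$49M totals.

burn_deaths, burn_injuries, burned_vehicles = 180, 180, 2_100
cost_per_death, cost_per_injury, cost_per_vehicle = 200_000, 67_000, 700

benefit = (burn_deaths * cost_per_death
           + burn_injuries * cost_per_injury
           + burned_vehicles * cost_per_vehicle)          # ~ $49.5M

vehicles_affected, fix_cost_per_vehicle = 12_500_000, 11
cost = vehicles_affected * fix_cost_per_vehicle           # ~ $137.5M

print(f"benefit ≈ ${benefit / 1e6:.1f}M, cost ≈ ${cost / 1e6:.1f}M")
```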

Anecdotes

Safety devices themselves have been responsible for losses or increasing chances of mishaps

Redundancy sometimes degrades safety
Seemingly unrelated (but in fact coupled) systems cause errors

Later Approach

System Safety
  Design in an acceptable safety level before actual production or operation
  Optimize safety by applying scientific and engineering principles to identify and control hazards through analysis, design, and management procedures
Hazard analysis identifies and assesses
  Criticality level of hazards
  Risks involved in the system design

Later approach (cont’d)

Assumptions
  Complexity of software and hardware interaction causes a non-linear increase in human-error-induced faults
  Impossible to demonstrate safety ahead of usage
  Complexity and coupling are covariant

Hardware vs Systems

Hardware
  Widgets have a long history of use and fault analysis … highly responsive to redundancy techniques
  Infinite number of stable states
Software
  No history with software … reuse is rare
  Large number of discrete states without repetitive structure
  Difficult to test under realistic conditions

More Systems Issues

Difficult to specify completely – what it does, and what it does not do

Cannot identify misunderstandings about requirements

Engineers assume perfect execution environments, don’t consider transient faults

Lack of system-level methods and viewpoints

Even Bigger Systems Issues

Specifying and implementing individual components is not the same problem as specifying the interactions between components

Between-component interactions grow exponentially and are often underrepresented in analyses

Components include
  Software components
  Hardware
  Human operators

Still Bigger Systems Issues

More Components
  Development Methodologies
  Source code maintenance
  Verification/Validation Methodologies
Stakeholder Values
  Management
  Individual Programmers
  Customer
  Human Users
  Suppliers

Definitions

Reliability
  Probability that the system will perform its intended function
Safety
  Probability that a hazard will not lead to a mishap
Reliability = failure free; Safety = mishap free
Reliability and Safety often conflict (for example, a weapon that never fires is perfectly safe but completely unreliable)

Safety

Studied separately from security, reliability, or availability

Separation of concerns
  Safety requirements are identified and separated from operational requirements
  Conflicts are resolved in a well-reasoned manner

Definitions

System
  Sum total of all component parts
  Software is only a part, and its correctness exists only in relation to the other system components

Software Safety

Ensures software will execute within a system context without resulting in unacceptable risk

Safety-critical software functions
  Directly or indirectly allow a hazardous system state to exist
Safety-critical software
  Contains safety-critical functions

System Characteristics

Inputs and outputs over time
Control subsystem
  Description of the function to be performed
  Specification of operating constraints (quality, capacity, process, and safety)
Safety constraints are hazards rewritten as constraints
Safety constraints are written, maintained, and audited separately
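
To make “hazards rewritten as constraints” concrete, here is a minimal sketch assuming a hypothetical tank-pressure controller; the hazard, names, and limit are invented for illustration, not taken from the slides.

```python
# Hypothetical example: the hazard "relief valve closed while tank pressure
# exceeds its safe limit" rewritten as an enforceable safety constraint.

MAX_SAFE_PRESSURE_KPA = 800.0  # assumed limit, for illustration only

def satisfies_safety_constraint(pressure_kpa: float, valve_open: bool) -> bool:
    """Constraint: pressure above the safe limit implies the relief valve is open."""
    return pressure_kpa <= MAX_SAFE_PRESSURE_KPA or valve_open

def control_step(pressure_kpa: float, valve_open: bool) -> bool:
    """Return the commanded valve state; the constraint check is kept separate
    from (and audited independently of) the operational control logic."""
    if not satisfies_safety_constraint(pressure_kpa, valve_open):
        return True  # force the valve open: move toward the safe state
    return valve_open
```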

Constraints, Requirements, Design

Analysis and Modeling

Preliminary Hazard Analysis (PHA)
Subsystem Hazard Analysis (SSHA)
System Hazard Analysis (SHA)
Operating and Support Hazard Analysis (OSHA)

Safeware – Leveson

Hazard Analysis

Start with a list of identifiable hazards
Work backward to discover the combinations of faults that produce each hazard
Categorization
  Frequent
  Occasional
  Reasonably remote
  Remote
  … physically impossible

Hazard Examples (Nuclear Weapons)

Inadvertent nuclear detonation
Inadvertent prearming, arming, launching, firing, or releasing
Deliberate prearming, arming, launching, firing, or releasing under inappropriate conditions

Software Requirement Analysis

Hard to do
Cubby-hole mentality
Rarely includes what the system should not do
Techniques
  Fault Tree Analysis (FTA)
  Real Time Logic (RTL)
  Petri nets

Fault Tree Example
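
The original slide showed a fault tree diagram. As a stand-in, here is a minimal sketch of the gate-based, backward reasoning FTA uses; the events, gate structure, and probabilities are invented for illustration.

```python
# Minimal fault-tree sketch: the top-level hazard occurs if the relief valve
# is stuck AND (the pressure sensor fails OR the controller issues a wrong
# command). Probabilities are illustrative, not real data.

p_valve_stuck = 1e-4
p_sensor_fault = 1e-3
p_wrong_command = 1e-3

def or_gate(*probs):
    """P(at least one input event occurs), assuming independent events."""
    q = 1.0
    for p in probs:
        q *= (1.0 - p)
    return 1.0 - q

def and_gate(*probs):
    """P(all input events occur), assuming independent events."""
    q = 1.0
    for p in probs:
        q *= p
    return q

p_no_relief = or_gate(p_sensor_fault, p_wrong_command)
p_top_hazard = and_gate(p_valve_stuck, p_no_relief)
print(f"P(overpressure hazard) ≈ {p_top_hazard:.2e}")
```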

Real Time Logic

Model the system in terms of events and actions (both data dependency and temporal ordering)

Generate predicates
Determine whether a safety assertion is a theorem derivable from the model
Inherently unsafe means that the assertion cannot be derived from the model
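
As an illustration of the RTL style (this example is invented, not taken from the paper), write @(E, i) for the time of the i-th occurrence of event E. A model predicate and a safety assertion might read:

  Model:      for all i: @(RESPONSE, i) <= @(COMMAND, i) + 20
  Assertion:  for all i: @(RESPONSE, i) - @(COMMAND, i) <= 30

Here the assertion is derivable from the model (20 <= 30); if it could not be derived, the design would have to be treated as unsafe with respect to that assertion.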

Time Petri Nets

Mathematical modeling of discrete event systems in terms of conditions and events and the relationship between them

Facilitates backward analysis
Points to the failures and faults which are potentially most hazardous
Nontrivial to build and maintain
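
A minimal sketch of the condition/event idea behind Petri nets, using an invented press-interlock example; timing arcs and backward reachability analysis are omitted, and all names are illustrative.

```python
# Minimal Petri-net sketch: places hold tokens (conditions); a transition
# (event) may fire only when all of its input places are marked.

marking = {"door_closed": 1, "press_idle": 1, "press_running": 0}

transitions = {
    # transition name: (input places, output places)
    "start_press": (["door_closed", "press_idle"], ["door_closed", "press_running"]),
    "stop_press":  (["press_running"], ["press_idle"]),
}

def enabled(name):
    inputs, _ = transitions[name]
    return all(marking[p] > 0 for p in inputs)

def fire(name):
    inputs, outputs = transitions[name]
    assert enabled(name), f"{name} is not enabled"
    for p in inputs:
        marking[p] -= 1
    for p in outputs:
        marking[p] += 1

fire("start_press")   # allowed only while the door is closed
print(marking)        # {'door_closed': 1, 'press_idle': 0, 'press_running': 1}
```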

Research Question

What is the place of these analysis techniques in an agile development environment??

Safety Verification and Validation

Showing that a fault cannot occur
Showing that if a fault occurs, it is not dangerous
Only as good as the specifications
Specifications are usually incomplete, and hardware specifications are rare

Safety Verification and Validation

Methodologies
  Proofs of adequacy
  Software Fault Trees (proofs of fault tree analyses)
    Determine safety requirements
    Detect software logic errors
    Identify multiple failure sequences involving different parts of the system
    Inform critical runtime checks
    Inform testing

Safety Verification and Validation

Methodologies
  Nuclear Safety Cross Check Analysis (NSCCA)
    Demonstrate that software will not contribute to a nuclear mishap
    Multiple technical analyses demonstrate adherence to specifications
    Demonstrate security and control measures
    A lot of qualitative judgment regarding criticality
  Software Common Mode Analysis
  Sneak Software Analysis

Safety Analysis – Quantitative

Requires statistical histories which may not exist

Applies mostly to physical systems
Single-valued best estimate
  Information sufficient for determinate models
Probabilistic
  Science is understood, but only limited parameters are available
Bounding
  Putting a ceiling on the answer

System Safety Engineering

Identify hazards
Assess hazards (likelihood and criticality)
Design to eliminate or control hazards
Assess risks that cannot be eliminated or controlled

Failure Mode Definitions

Fail-safe
  Default is safe mode; no attempt to execute the operational mission
Fail-operational
  Default is to correct the fault and continue with the operational mission
Fail-soft
  Default is to continue with degraded operations
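
A minimal sketch of how the three failure modes differ in the action taken when a fault is detected; the sensor-fault scenario and the responses are invented for illustration.

```python
# Illustrative responses to a detected sensor fault under the three
# failure-mode policies. The domain and the responses are hypothetical.

from enum import Enum

class Policy(Enum):
    FAIL_SAFE = 1         # abandon the mission, go to a safe state
    FAIL_OPERATIONAL = 2  # mask or correct the fault, keep full function
    FAIL_SOFT = 3         # keep operating with reduced capability

def on_sensor_fault(policy: Policy) -> str:
    if policy is Policy.FAIL_SAFE:
        return "shut down the actuator and enter a safe hold state"
    if policy is Policy.FAIL_OPERATIONAL:
        return "switch to the redundant sensor and continue the mission"
    return "fall back to a degraded control law and continue within limits"

for p in Policy:
    print(p.name, "->", on_sensor_fault(p))
```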

Designing for Safety

Not possible to ensure safety by analysis or verification alone

Analysis and verification may be cost-prohibitive

Standard design hierarchy (in decreasing order of preference)
  Intrinsically safe
  Prevents or minimizes occurrence of hazards
  Controls the hazard
  Warns of the presence of the hazard

Safety Design Mechanisms

Lockout device
  Prevents an event from occurring when a hazard is present
Lockin device
  Maintains an event or condition
Interlock device
  Assures that operations occur in the correct sequence
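
A minimal sketch of an interlock in software, assuming a hypothetical two-step arm/fire sequence; the device and all names are invented for illustration.

```python
# Hypothetical interlock: "fire" is accepted only after "arm", and the
# sequence resets after each shot, enforcing the correct operation order.

class FiringInterlock:
    def __init__(self) -> None:
        self._armed = False

    def arm(self, operator_confirmed: bool) -> None:
        # Refuse to arm unless the explicit precondition holds.
        if operator_confirmed:
            self._armed = True

    def disarm(self) -> None:
        self._armed = False

    def fire(self) -> bool:
        # An out-of-sequence request is rejected, not deferred.
        if not self._armed:
            return False
        self._armed = False  # one shot per arm/fire cycle
        return True

ilock = FiringInterlock()
assert ilock.fire() is False          # out of sequence: refused
ilock.arm(operator_confirmed=True)
assert ilock.fire() is True           # correct sequence: accepted
```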

Safety Design Principles

Provide leverage for certification
Avoid complexity where possible
Reduce risk by reducing hazard likelihood, or severity, or both
Modularize to separate safety-critical functions from non-critical functions
Execute safety-critical functions under separate authority
Fail safe on a single-point failure

Safety Design Principles (cont’d)

Start out in safe state, and take affirmative actions to reach higher risk states

Check critical flags as close as possible to actions they protect

Avoid complements: the absence of “armed” is not “safe”

Use “true” values to indicate safety … “false” values can result from common hardware failures

Safety Design Principles (cont’d)

Detection of unsafe states
  Watchdog timer
  Independent monitors
  Asserts and exception handlers
Use backward recovery (return the system to a safe state) instead of forward recovery (plow ahead)
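
A minimal sketch combining two of these ideas: a deadline check standing in for a watchdog timer, plus backward recovery to a checkpointed state. The task, deadline, and state variables are invented for illustration.

```python
# Illustrative detection + backward recovery: a post-step deadline check
# stands in for a hardware watchdog, and recovery restores the checkpointed
# state instead of "plowing ahead" with a possibly corrupted one.

import copy
import time

DEADLINE_S = 0.050  # illustrative deadline for one control step

def control_step(state):
    """Hypothetical control computation; may overrun its deadline or raise."""
    state["setpoint_kpa"] += 1.0
    return state

def run_step(state):
    checkpoint = copy.deepcopy(state)  # known-good state to recover back to
    start = time.monotonic()
    try:
        new_state = control_step(state)
        if time.monotonic() - start > DEADLINE_S:
            raise TimeoutError("control step overran its deadline")
        return new_state
    except Exception:
        return checkpoint              # backward recovery to the safe state

state = {"valve_open": True, "setpoint_kpa": 0.0}  # assumed safe initial state
state = run_step(state)
```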

Human Factors

Define the partnership between human and computer
Avoid complacency
Avoid confusion
Avoid passive monitoring

Conclusion

Select a suite of techniques and tools spanning the entire software development process

Apply them conscientiously, consistently, and thoroughly

Consider implementation tradeoffs
  Low catastrophe, high cost alternatives
  Moderate catastrophe, moderate cost alternatives
  High catastrophe, low cost alternatives

Take Home Messages

Safety is a system issue – in the large sense

Software engineering techniques can contribute to system safety – in both a narrow and a broad context

Acceptable risk is king, and determining and achieving it is hard