Software Safety: An Oxymoron? - VanQ 2007
Software Safety: An Oxymoron?
March 29, 2007
Ken Wong, Ph.D., Senior Systems Analyst
McKesson Medical Imaging Group
Points to Ponder*
A system can be correct and reliable and yet unsafe
Software safety is not about bugs
Program testing can be used to show the presence of bugs, but never to show their absence
* We will return to these statements in the discussion
Outline
Introduction to Software Safety
Software: Meet System Safety
System Safety: Meet Software
Verifying Software Safety
Introduction to Software Safety
Software in the Real World
Therac-25 accidents
Ariane 5 Flight 501 explosion
Titan 4 Centaur/Milstar failure
TCAS collision near Überlingen, Germany
Ariane 501
Ariane 501 Events
Destruction of Ariane 501 on 4 June 1996 (from final report):
nominal behaviour of the launcher up to H0 + 36 seconds;
failure of the back-up Inertial Reference System (SRI) followed immediately by failure of the active SRI;
Building Dependable Software …
Security
Safety
Reliability
Correctness
Quality
Safety is a Distinct Property
Safety is a distinct part of the interlocking puzzle of how to build dependable software
A system can be “correct” and “reliable” and yet unsafe!
Improved software process alone does not mean a safer system
Note: These can be contentious claims even among safety engineers.
Safety is …
avoiding mishaps!
Software: Meet System Safety
“Is it Safe”?
Christian Szell: Is it safe?
Babe: Yes, it's safe, it's very safe, it's so safe you wouldn't believe it.
- Marathon Man, 1976
System Safety
“System Safety” is a systematic approach to safety primarily developed in the US for the aerospace and defense industries
Spreading to other industries, e.g., health care
Focus on managing system hazards
E.g., FDA Quality System Regulation recommends “risk analysis” (A.K.A. hazard analysis)
System Safety
Hazard ID
Hazard Analysis
Risk Assessment
Hazard Mitigation
Safety Verification
Hazard
A hazard is the system’s potential contribution to a mishap
E.g., brake failure, engine overheating
Key is understanding the system environment
Hazards and Mishaps
[Diagram labels: hazard causes, hazard, mishap; System, Environment]
Ariane 501: SRI Bug?
Uncaught exception from floating point conversion
From high value of BH (Horizontal Bias)
Programming 101!
Conversion check deliberately removed for performance reasons
SRI reused from Ariane 4
Check not required for Ariane 4 trajectory
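A minimal sketch of the conversion problem, in Python rather than the SRI's own language. It assumes the case described in the failure report: a 64-bit floating point value converted to a 16-bit signed integer. The function names, the OperandError class, and the sample values are hypothetical, and clamping is shown only as one possible guard, not as the fix the Inquiry Board prescribed.

```python
INT16_MIN, INT16_MAX = -32768, 32767


class OperandError(Exception):
    """Hypothetical error raised when a value does not fit in 16 bits."""


def to_int16_unprotected(value: float) -> int:
    """Convert with no guard in the caller: out-of-range input raises."""
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OperandError(f"{value} does not fit in a 16-bit signed integer")
    return result


def to_int16_saturating(value: float) -> int:
    """Guarded conversion: clamp to the representable range instead of raising."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


# Illustrative magnitudes only, not flight data.
BH_WITHIN_OLD_ENVELOPE = 12_000.0   # fits in 16 bits
BH_BEYOND_OLD_ENVELOPE = 50_000.0   # does not fit in 16 bits
```

With the first value both functions return 12000. With the second, the unprotected conversion raises and the saturating one returns 32767. Whether a clamped value is safe is still a question about the system, not about the conversion.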
Safety is a System Property
SRI worked exactly as specified – for Ariane 4!
Ariane 5 trajectory different from Ariane 4
SRI spec did NOT include Ariane 5 trajectory data
SRI NOT tested with Ariane 5 trajectory data
“Safety” cannot be understood without knowing the operational environment
FDA “use-related” vs “device failure” hazards
E.g., TCAS collision in Germany
When Software Met Safety
… there was a definite risk in assuming that critical equipment such as the SRI had been validated by qualification on its own, or by previous use on Ariane 4.
ARIANE 5 Flight 501 Failure Report
System Safety: Meet Software
In the beginning (or Europe) …*
Mechanical systems with well understood designs
Hazards caused by component failure from random hardware faults
Mitigation through integrity and redundancy
* Myth, but there is underlying truth in all good myths
Fault Tree Analysis
[Fault tree figure. Top event: Steering Fails. Other events: Steering Wheel Fails, Steering Assembly Fails, Driver Error, Drive Shaft Fails, Steering Control Software Fails, joined by OR gates. Legend: Basic Event, Intermediate Event.]
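A rough sketch of how an OR-gate fault tree is usually evaluated. It assumes independent input events, and it assumes a tree shape (Steering Assembly Fails as the OR of Drive Shaft Fails and Steering Control Software Fails) that is only a guess from the figure labels. All probabilities passed in are made-up placeholders; the software term in particular is a number the calculation needs but cannot justify.

```python
from math import prod


def or_gate(*probabilities: float) -> float:
    """Probability that at least one of several independent events occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def steering_fails(p_wheel: float, p_shaft: float,
                   p_software: float, p_driver: float) -> float:
    """Top-event probability for the assumed tree shape."""
    # Intermediate event: Steering Assembly Fails.
    p_assembly = or_gate(p_shaft, p_software)
    # Top event: Steering Fails.
    return or_gate(p_wheel, p_assembly, p_driver)
```

For example, or_gate(0.5, 0.5) evaluates to 0.75, and steering_fails returns 0.0 only when every input, including the software term, is set to zero.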
Is Software Another Component?
What is the probability that the steering control software fails?
If software is just another component:
1. Software cannot wear out or break down like a mechanical component
2. Only “fault” is a programming bug
3. Assuming programmers do their job, failure rate should be zero*
*Paraphrased from talk by a system safety engineer
Software Revealed
[Fault tree figure. Top event: Steering Fails. Other events: Steering Wheel, Steering Assembly Fails, Driver Error, Drive Shaft Fails, Steering Control Software Fails, joined by OR gates. Legend: Basic Event, Intermediate Event.]
The Software Werewolf
Of all the monsters that fill the nightmares of our folklore, none terrify more than werewolves, because they transform unexpectedly from the familiar into horrors … The familiar software project, at least as seen by the nontechnical manager, has something of this character …
Frederick P. Brooks, Jr. from No Silver Bullet: Essence and Accidents of Software Engineering
Ariane 501: Safety in Numbers?
In response to “fault”, the Primary SRI was deliberately shut down
Attempt made to switch to backup SRI
Typical strategy in face of random failures
However, BOTH SRIs shut down!
“Fault” due to same design in both SRIs
Exception in non-essential component
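A toy sketch of why redundancy helps with random faults but not with a shared design fault. The Sri class, the SriShutdown exception, and read_with_failover are hypothetical names for illustration and do not model the real flight software. Both units run the same code, so the same input takes both of them offline.

```python
INT16_MAX = 32767


class SriShutdown(Exception):
    """Hypothetical signal that a unit has taken itself offline."""


class Sri:
    """Toy inertial reference unit; every instance shares one design."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.active = True

    def horizontal_bias(self, bh: float) -> int:
        if not self.active:
            raise SriShutdown(f"{self.name} is offline")
        if abs(bh) > INT16_MAX:
            # Same design decision in every unit: shut down on this "fault".
            self.active = False
            raise SriShutdown(f"{self.name} shut down on out-of-range value")
        return int(bh)


def read_with_failover(primary: Sri, backup: Sri, bh: float):
    """Try the primary, then the backup; return None if both are down."""
    for unit in (primary, backup):
        try:
            return unit.horizontal_bias(bh)
        except SriShutdown:
            continue
    return None
```

If only the primary is offline (a stand-in for a random hardware fault), the backup still answers. If the input itself is out of range, the call returns None and both units end up inactive.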
Safety is an Emergent Property
Software safety is not about “faults”
Many potential “faults” but not all created equal – most have no impact on safety
“Correct” behaviour can contribute to the hazard!
Hazards can emerge from complex interactions between “correct” components
When Safety Met Software
An underlying theme in the development of Ariane 5 is the bias towards the mitigation of random failure.
Board wishes to point out that software is an expression of a highly detailed design and does not fail in the same sense as a mechanical system.
ARIANE 5 Flight 501 Failure Report
Verifying Software Safety
Software and Safety Process
Requirements
Design
Hazards
Source Code
Hazard ID, Analysis and Mitigation
Verification
Safety Verification
Limits of Testing
Program testing can be used to show the presence of bugs, but never to show their absence
E. Dijkstra in Structured Programming
Hazard-Driven Testing
Focus on hazard – force it to occur
Consider:
Hazard risk (“risk-based testing”)
Mishap scenarios
Hazard causes identified during hazard analysis
Problem reports/issues with safety implications
See Jeffrey J. Joyce and Ken Wong, Hazard-driven Testing of Safety-Related Software
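One way the idea could look in code: each test case comes from a hazard cause and tries to force the hazard, instead of checking a requirement. This is an illustrative sketch, not the method from the cited paper. The steering_command function, its limits, and the listed hazard causes are all hypothetical. The hazard assumed here is a commanded steering angle beyond MAX_ANGLE_DEG.

```python
import math
from typing import Dict, Tuple

MAX_ANGLE_DEG = 30.0          # hypothetical safe actuator limit
SENSOR_PLAUSIBLE_DEG = 90.0   # hypothetical physical range of the sensor


def steering_command(sensor_angle_deg: float) -> Tuple[str, float]:
    """Return (mode, commanded angle) for a toy steering controller."""
    if math.isnan(sensor_angle_deg) or abs(sensor_angle_deg) > SENSOR_PLAUSIBLE_DEG:
        # Mitigation: implausible input puts the controller in a safe mode.
        return ("SAFE_HOLD", 0.0)
    clamped = max(-MAX_ANGLE_DEG, min(MAX_ANGLE_DEG, sensor_angle_deg))
    return ("NORMAL", clamped)


# Hazard causes as they might come out of a hazard analysis (made up here).
HAZARD_CAUSES = {
    "sensor returns NaN": float("nan"),
    "sensor spike beyond physical range": 500.0,
    "sensor at edge of plausible range": 90.0,
    "large but plausible input": 45.0,
}


def run_hazard_tests() -> Dict[str, bool]:
    """Try to force the hazard with each cause; True means it was avoided."""
    results = {}
    for cause, value in HAZARD_CAUSES.items():
        _mode, angle = steering_command(value)
        results[cause] = abs(angle) <= MAX_ANGLE_DEG
    return results
```

A passing run shows only that these particular causes did not produce the hazard, which is the limit of testing stated in the Dijkstra quote.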
Summary and Conclusions
Safety is a distinct property
Safety is a system property
Operational and development environment factors
Safety is an emergent property
Hazards can emerge from complex interactions between “correct” components
Safety and Software: Happy Together?
References*
ARIANE 5 Flight 501 Failure Report by the Inquiry Board, Paris, July 1996
Frederick P. Brooks, Jr., No Silver Bullet: Essence and Accidents of Software Engineering, Computer Magazine, April 1987
Jeffrey J. Joyce and Ken Wong, Hazard-driven Testing of Safety-Related Software, 21st International System Safety Conference, Ottawa, Ontario, August 4-8, 2003
*All available on-line