Fantastic Failures

    Class 304: Fantastic Failures

    Embedded Systems Conference

    Wednesday, 31 March 2004

    By Kim R. Fowler

    Historical Case Studies

    Ariane 5

    (Photographic source is ESA/CNES. You can find these photos at the following website: www.mssl.ucl.ac.uk/www_plasma/missions/cluster/about_cluster/cluster1/cluster1_images.html)

    Ariane 5 recounted

    Dual-redundant processors

    3 unprotected variables that overflowed

    Processors reset on overflow, no graceful recovery

    Software reused from Ariane 4 with no check against Ariane 5 flight dynamics

    Ariane 5 had greater horizontal drift velocities

    Reuse is tricky, end-to-end system test necessary

    Find report at: www.esa.int/export/esaLA/Pr_33_1996_p_EN.html
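
The overflow item above can be illustrated with a small sketch. The inquiry report describes a 64-bit floating-point value being converted to a 16-bit signed integer; the code below is not the flight software (which was not written in Python), and the function names are hypothetical. It only contrasts an unprotected conversion, which fails outright, with a saturating conversion that clamps and flags the value so the caller can recover.

```python
from typing import Tuple

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unchecked(value: float) -> int:
    """Unprotected conversion: any out-of-range value raises and stops the caller."""
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return result


def to_int16_saturating(value: float) -> Tuple[int, bool]:
    """Protected conversion: clamp to the 16-bit range and report whether clamping occurred."""
    if value != value:  # NaN never compares equal to itself
        return 0, True
    clamped = max(INT16_MIN, min(INT16_MAX, value))
    return int(clamped), clamped != value


if __name__ == "__main__":
    for sample in (100.7, 40000.0, -1e9):
        print(sample, to_int16_saturating(sample))
```

Whether clamping is the right recovery depends on how the value is used downstream; the point of the sketch is only that the out-of-range case is handled deliberately rather than left to a processor reset.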

    Therac 25

    Medical linear accelerator for treating tumors

    Mid-1980s: overdosed six patients

    Problems:

      Quick editing by operator caused race condition

      Cryptic error messages ignored

      No explanation in User's Manual of error codes

      50 times full dose but displayed no dose given

      No mechanical interlocks

      No software reviews or audits, little documentation
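
The race between operator editing and machine setup can be sketched in general terms. This is not the Therac-25 software; the class, function, and field names below are hypothetical, and the sketch assumes a simple design in which every edit bumps a version counter. Setup works from one consistent snapshot and refuses to fire if the prescription changed while the hardware was being configured.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Prescription:
    mode: str    # hypothetical labels, e.g. "xray" or "electron"
    dose: float  # hypothetical units


class TreatmentConsole:
    """Holds the operator's prescription; every edit bumps a version counter."""

    def __init__(self, prescription):
        self._lock = threading.Lock()
        self._prescription = prescription
        self._version = 0

    def edit(self, prescription):
        with self._lock:
            self._prescription = prescription
            self._version += 1

    def snapshot(self):
        with self._lock:
            return self._version, self._prescription

    def is_current(self, version):
        with self._lock:
            return version == self._version


def set_up_and_fire(console, configure_hardware, fire_beam, on_configured=None):
    """Configure from one snapshot and abort if an edit arrived during setup."""
    version, rx = console.snapshot()
    configure_hardware(rx)
    if on_configured is not None:
        on_configured()  # test hook: lets a caller simulate a quick edit here
    if not console.is_current(version):
        return "aborted: prescription edited during setup"
    fire_beam(rx)  # fires with the same snapshot the hardware was configured for
    return "delivered"
```

A software check like this narrows the window for the race; it does not replace the mechanical interlocks the slide lists as missing.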

    Therac 25 Lessons

    Need general plan for system development

    The operator interface must be clear, intuitive, and explained

    Hardware safeguards must limit software faults

    Good design, not testing, makes a safe system

    See Appendix A, "Medical Devices: The Therac-25," in Nancy Leveson, Safeware: System Safety and Computers, Addison-Wesley, 1995.

    Chernobyl events

    Experiment called for by engineers in Moscow

    Manual shutdown, automatic control turned off

    Power dropped to 1% capacity

    Removed more control rods

    Power crept up to 7%

    Turned on more water to produce more steam

    Water cooled reactor, dropping steam and reactivity

    Removed even more control rods

    Steam production rose until 1:22 a.m. when operators shut off water flow

    Heat built up quickly, control rod sleeves bent

    Could not insert control rods

    Steam explosion

    Chernobyl Lessons

    Theoretical knowledge vs. hands-on

    Humans over-steer dynamic systems

    Humans don't handle interacting, nonlinear problems well

    Groupthink

    Understand human nature

      Clarity of function

      Reduce confounding problems

      Accommodate in system design

    Apple Lisa

    (Part of the computer collection of Giorgio Ungarelli, photograph used with permission.)

    Apple Lisa Legacy

    Brilliant concept before its time

    Mouse

    Graphical file management

    People not ready for paradigm shift

    Apple Lisa Lessons

    Prohibitive price for unappreciated capability

    Cost-effective solutions rely on users' understanding

    Failure falls into business/political arena: difficult to predict and avoid

    Navy Terrier/LEAP

    Terrier LEAP outline

    Concept for ballistic missile intercept

    Use current (early-mid 1990s) technology

    Prepare and test quickly

    Target launched from Wallops Island

    Interceptor launched from cruiser in Atlantic

    Basic human error foiled success

    LEAP Target

    (Photograph courtesy of Raytheon, Inc.)

    LEAP General Operation

    High-resolution radars at Wallops Island track target (shipboard radars insufficient)

    Wallops Island processor collected data from the radars, filtered the target track with a six-state Kalman filter, and transmitted the track to the ship.

    Sent target tracks to ship via redundant telephone landlines and Inmarsat satellite links

    Ship processor received the data, predicted the intercept time and point, and indicated when to launch the interceptor missile.
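
The track filtering and intercept prediction described above can be sketched as follows. The slides do not say what the six states were; this sketch assumes position and velocity along three axes, treats the axes as independent (so the six-state filter splits into three two-state filters), and uses a constant-velocity model with made-up noise values. All class and parameter names are hypothetical.

```python
class AxisFilter:
    """Two-state (position, velocity) Kalman filter for one axis."""

    def __init__(self, meas_var, accel_var, init_var=1.0e6):
        self.p = 0.0  # position estimate
        self.v = 0.0  # velocity estimate
        # Symmetric covariance matrix [[a, b], [b, c]]
        self.a, self.b, self.c = init_var, 0.0, init_var
        self.r = meas_var   # variance of the radar position measurement
        self.q = accel_var  # variance of the unmodeled acceleration

    def step(self, z, dt):
        # Predict with the constant-velocity model.
        p = self.p + dt * self.v
        v = self.v
        a = self.a + 2.0 * dt * self.b + dt * dt * self.c + self.q * dt ** 4 / 4.0
        b = self.b + dt * self.c + self.q * dt ** 3 / 2.0
        c = self.c + self.q * dt ** 2
        # Update with the position measurement z.
        s = a + self.r
        k0, k1 = a / s, b / s
        innovation = z - p
        self.p = p + k0 * innovation
        self.v = v + k1 * innovation
        self.a = (1.0 - k0) * a
        self.b = (1.0 - k0) * b
        self.c = c - k1 * b


class SixStateTracker:
    """Three independent axis filters: six states in total."""

    def __init__(self, meas_var=25.0, accel_var=1.0):
        self.axes = [AxisFilter(meas_var, accel_var) for _ in range(3)]

    def update(self, xyz, dt):
        for axis, z in zip(self.axes, xyz):
            axis.step(z, dt)

    def predict_position(self, seconds_ahead):
        """Extrapolate the track, as a ship-side intercept predictor might."""
        return [axis.p + axis.v * seconds_ahead for axis in self.axes]


if __name__ == "__main__":
    tracker = SixStateTracker()
    for k in range(1, 21):
        tracker.update((1000.0 + 250.0 * k, -40.0 * k, 90.0 * k), dt=1.0)
    print(tracker.predict_position(10.0))
```

A real tracker would couple the axes, model gravity and drag, and tune the noise terms against measured radar data; none of that is attempted here.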

    LEAP Missile & Intercept

    (Photograph courtesy of Raytheon, Inc.)

    LEAP Testing Finds Problems

    End-to-end tests of the system

      simulated a target launch,

      transmitted the simulated data through the entire system to the ship,

      calculated an intercept as if we were at sea.

    Redundant landlines: switch maintenance in New Jersey cut off early test

    Separate landlines: one through New Jersey, the other through Pennsylvania

    Richmond K. Turner, CG-20

    (Photograph courtesy of the Johns Hopkins University Applied Physics Laboratory.)

    Testing Finds Problems (cont'd.)

    Two shipboard radars caused problems

      SPS-49 jammed the Inmarsat receivers

      SPS-20 jammed the GPS receivers

    Inmarsat situated on port and starboard bridge to reduce superstructure blockage

    Too many dropouts with commercial modems, switched to cell phone modems

    LEAP Targeting Processor and laboratory test set

    (Photographs courtesy of the Johns Hopkins University Applied Physics Laboratory.)

    LEAP: Lessons Learned

    Technical failure

    Simple, human error can interrupt the best designs

    Careful development and thorough testing necessary

    All components must be tested within the system to uncover interactions

    Aegis LEAP

    A success story

    Three successful intercepts in 2002, more in 2003

    Carefully planned development

    Kinetic Kill Vehicle and Target Image

    (Figure and photograph courtesy of the Johns Hopkins University Applied Physics Laboratory.)

    Aegis LEAP Launch

    (Photographs courtesy of the Johns Hopkins University Applied Physics Laboratory.)

    Thorough Ground Test Program

    Separation tests: squibs, batteries, explosive bolts

    KW hover test for the closed loop pointing

    Air bearing tests of maneuvers: pitch-to-ditch, IR seeker calibration, and pointing before separation

    Hardware-in-the-loop simulation and test of avionics

    KW tests for the IR seeker characterization, stabilization, third stage interfaces

    Vacuum tests: PCB delamination, arcing, and outgassing

    Aerothermal testing in a hypersonic wind tunnel for nosecone heating and outgassing, seeker shield function, strake heating and insulation

    Types of Failure

    Examples: Product Recalls

    [. . .] recalled 45,000 heaters for defective thermostats that were improperly positioned, which could lead to overheating.

    [. . .] recalled 3.1 million dishwashers. The slide switch (the lever that selects between heat drying and energy saving) can melt and ignite over time, posing a fire hazard.

    [. . .] recalled 5,500 toy flashlights because the batteries may overheat or leak and children can suffer burns from the leaking battery.

    [. . .] recalled upright vacuum cleaners because the power cord may break inside of the handle, posing electrical shock and burn injury hazards.

    http://www.matthewslawfirm.com

    Examples: Automotive Recalls

    March 12, 2002: [. . .] recalled the [. . .]: trailer hitch circuitry in the converter is inadequate to properly manage voltage spikes that can lead to an electrical short or open circuit within the converter, causing a failure and an inoperative trailer light.

    September 11, 2000: [. . .] recalled about 270,000 [cars]: air bags that may deploy unexpectedly because of corrosion in the inflator.

    During 2000: [. . .] recalled ignition modules that could cause a car to stall. When the temperature of the ignition module rises above a certain temperature, the chances of the module cutting out also increase.

    http://www.crash-worthiness.com

    Examples: More Automotive Recalls

    [. . .] recalled 263,000 1995-97 [vehicles] . . . The airbag electronic control module (AECM) could corrode from water or road salt and then accidentally fire the driver-side airbag.

    [. . .] recalled 757,000 1992-97 [vehicles] because higher-than-specified electrical load through accessory power feed circuit may cause a short circuit and allow current to flow through ground wiring. This could cause overheating and an electrical fire.

    [. . .] recalled 1995-97 [vehicles] because improperly routed wire harness for the air-conditioner may permit wires to rub together and short circuit, resulting in a blown fuse, dead battery, or fire.

    http://www.matthewslawfirm.com

    Examples: More Automotive Recalls

    December 11, 1998: [. . .] recalled 226 [electric vehicles] to reprogram the logic in the motor electronic control unit (ECU), which can mistakenly detect a failure of an electrical current sensor at speeds above 50 mph. It can cause the sudden loss of power and unexpected deceleration.

    http://autorepair.about.com/library/recalls/

    Elements of Unintended Consequences in Previous Examples

    Passage of time: usually fielded units

    Nonobvious or obscure causes

    Environmental interactions, e.g. corrosion, overheating

    Failure modes with significant effects, e.g. fire or injury

    The Nature of Problems

    Confounding complexity

      unforeseen circumstances

      multiple causes

    Human error

      nonobviousness to user

      improper use

      design oversight, even if it appears to be a manufacturing problem

    Remedies

    Truth in advertising

      expertise, schedule estimation, management style/employee responses

    Work hard to develop reasonable schedules

      review and testing

      plan for contingencies

    Continuous learning

      lessons learned, your own experience

      others' experiences

    Reduce complexity

      understand and define interactions

      do not reinvent the wheel

      limit features

    Teamwork

    Integrity

    The Big Picture

    Truth in advertising (your capability and skills)

    Estimation and scheduling

    Plan for the long term

      your success and reputation

      your product's viability

      your company's reputation

    Failure and How to Handle It

    Types of failure

      technical

      professional

      political/societal

    Embrace failure

      admit and accept responsibility

      understand and learn

      put past behind you because others won't

      forgive others' failures; help them to rebound

    (Diagram labels: Less control, Progression)

    Personal Examples

    Technical Failure

    Ultraviolet satellite camera with image intensifier

    Automatic gain control for image intensifier

    Nonlinear control problem

    First version: blooming/collapsing picture

    Second version: unreliable transmission of gain value

    Technical Failure 1st Version

    (© 2002, Figure courtesy of the Johns Hopkins University Applied Physics Laboratory.)

    [Block diagram; labels: Camera, Video signal, Hi-threshold comparator, Pixel clock, Frame sync, Up-down counter (Up, Dn, reset), DAC, Image intensifier]

    Technical Failure 1st Version

    Problem: blooming/collapsing picture

    Background:

    Discrete logic, up-down counters

    Unstable for bright objects

    Not fully simulated or analyzed

    Short development time (flew breadboards)

    Shoulda: analyzed/simulated expected scenes during design
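
The kind of scene simulation the "shoulda" calls for can be quite small. The sketch below is a toy model, not the flight design: it assumes an exponential intensifier response, a comparator that counts pixels above a threshold each frame, and a gain code that moves by the difference between a target count and that count. Every name and number is hypothetical. Under those assumptions, a compact bright object makes the count jump all at once and the gain hunts forever, while a scene with graded brightness settles.

```python
def count_bright_pixels(scene, gain_code, threshold=1.0):
    """Count pixels whose intensified level exceeds the comparator threshold."""
    gain = 2.0 ** (gain_code / 4.0)  # assumed exponential intensifier response
    return sum(1 for radiance in scene if radiance * gain > threshold)


def simulate_agc(scene, frames=40, start_code=32, target_count=4, max_code=63):
    """Run the counter-based gain loop frame by frame; return the gain code history."""
    code = start_code
    history = []
    for _ in range(frames):
        n_hi = count_bright_pixels(scene, code)
        # Counter moves up when too few pixels are bright, down when too many.
        code = max(0, min(max_code, code + (target_count - n_hi)))
        history.append(code)
    return history


def settles(history, tail=8):
    """True if the gain code stopped changing over the last few frames."""
    return len(set(history[-tail:])) == 1


# Hypothetical scenes: one uniform bright object on a dark background,
# and one whose pixel radiances fall off gradually.
bright_object = [0.12] * 16 + [0.0] * 48
graded_scene = [0.9 * 2.0 ** (-i / 4.0) for i in range(64)]

if __name__ == "__main__":
    print("bright object settles:", settles(simulate_agc(bright_object)))
    print("graded scene settles:", settles(simulate_agc(graded_scene)))
```

Even a model this crude exposes the hunting behavior before hardware is built, which is the slide's point about simulating expected scenes during design.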

    Technical Failure 2nd Version

    (© 1996, Oxford University Press, used with permission.)

    Technical Failure 2nd Version

    Problem: unreliable transmission of gain value

    Background:

    Microcontroller implementation of AGC

    AGC stable for all scenes

    Readout of gain by ground equipment unreliable

    Analog encoding of gain into video frame

    Shoulda:

    Use digital encoding into video frame for noise margin

    Needed better understanding of noise environment
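
The noise-margin argument for digital encoding can be shown with a short sketch. It assumes an 8-bit gain value and a video level normalized to 0.0 (black) through 1.0 (white); the actual frame format is not given in the slides, and all function names are hypothetical. Analog encoding puts the gain in one pixel's level, so any noise shifts the reading; digital encoding spends one pixel per bit and decodes against a mid-level threshold, so noise smaller than half the swing has no effect.

```python
def encode_analog(gain):
    """One pixel whose level is proportional to the 8-bit gain."""
    return [gain / 255.0]


def decode_analog(pixels):
    return max(0, min(255, round(pixels[0] * 255.0)))


def encode_digital(gain):
    """Eight pixels, most significant bit first, each fully black or fully white."""
    return [1.0 if (gain >> bit) & 1 else 0.0 for bit in range(7, -1, -1)]


def decode_digital(pixels):
    value = 0
    for level in pixels:
        value = (value << 1) | (1 if level > 0.5 else 0)  # mid-level threshold
    return value


def add_noise(pixels, noise):
    """Add a per-pixel offset; extra noise samples are ignored."""
    return [level + offset for level, offset in zip(pixels, noise)]


if __name__ == "__main__":
    noise = [0.2, -0.2, 0.2, -0.2, 0.2, -0.2, 0.2, -0.2]
    gain = 173
    print("analog :", decode_analog(add_noise(encode_analog(gain), noise)))
    print("digital:", decode_digital(add_noise(encode_digital(gain), noise)))
```

The cost is eight pixels instead of one, and a real design would likely add a parity or checksum bit as well.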

    Professional Failure

    Asked to finish programming effort while original designer moved onto other projects

    False starts and procrastination

    Finally removed myself from project

    Professional Failure

    Problem: did not complete assignment

    Background:

    Mounds of documentation to plow through

    Early realization of no-win situation

    Lost motivation

    No real recognition of work obvious to me

    Shoulda:

    Either not taken the job in the first place

    Or if no choice, plow through assignment while finding another job (setting precedence)

    Professional/Business Failure

    Business deal

    My personal performance

    Technical excellence

    Professional excellence

    Maintained integrity

    Accused of bad stuff, which I did not do

    Deal fell through

    Professional/Business Failure

    Problem: business politics outside my control

    Background:

    Interesting proposition and product

    Long-term relationships

    Unknown quantities introduced early in deal

    Weirdnesses grew

    Shoulda:

    Either not made the deal in the first place

    Or left earlier before weirdness got out of hand

    Note: always deal with integrity or don't deal

    Political Failure

    Satellite subsystem

    Team's performance

    Technical excellence

    Professional excellence

    NASA sponsor pulled project in-house

    Political Failure

    Problem: politics outside my company's control

    Background:

    6-month long set of trade studies to define architecture

    Thorough studies and review

    Schedule well understood, team prepared to build system

    Groups at NASA out of work

    NASA pulled project in-house to feed their own

    Shoulda:

    None, politics happen

    A Success Story

    The Sidewinder Missile: A Success Story

    (Courtesy of the U.S. Navy. All U.S. Navy photos are public domain. http://library.thinkquest.org/jo113065/citations.htm)

    Sidewinder recounted

    Goal: simple, sturdy, cheap missile

    Small development team, 1949-1953

    Simple, clever combination of ideas

    Rollerons: simple but important control

    Proportional navigation simplified circuitry

    Torque-balance servo for maneuvering

    Canard control fins reduced wiring and connectors

    Simple data acquisition equipment

    Extensive testing and prototyping

    Sidewinder Lessons

    Breakthroughs require vision

    Small teams facilitate commitment and communications

    Simple and robust design

    Careful, thorough, and extensive testing and integration