Errors, Failures and Risks CS4020 Overview Failures and Errors in Computer Systems Case Study: The...

  • date post

    21-Dec-2015


Page 1: Errors, Failures and Risks CS4020 Overview Failures and Errors in Computer Systems Case Study: The Therac-25 Increasing Reliability and Safety Dependence,

Errors, Failures and Risks

CS4020

Overview

• Failures and Errors in Computer Systems

• Case Study: The Therac-25

• Increasing Reliability and Safety

• Dependence, Risk, and Progress

Failures and Errors in Computer Systems

• Most computer applications are so complex it is virtually impossible to produce programs with no errors

• The cause of failure is often more than one factor

• Computer professionals must study failures to learn how to avoid them

• Computer professionals must study failures to understand the impacts of poor work

An Example – Billing Errors

Cause:

• Inaccurate and misinterpreted data in databases
– Large population where people may share names
– Automated processing may not be able to recognize special cases
– Overconfidence in the accuracy of data
– Errors in data entry
– Lack of accountability for errors
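The shared-name problem can be sketched in a few lines. This is a minimal illustration, not taken from any real billing system: the `Customer` record, its fields, and the sample data are all assumptions. It shows a name-only lookup returning two people, and a stricter lookup that refuses to guess and sends the special case to manual review.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Customer:
    # Hypothetical record layout, for illustration only.
    customer_id: int
    name: str
    birth_date: date
    postal_code: str


def match_by_name(customers, name):
    """Naive lookup: returns every customer who shares the name."""
    return [c for c in customers if c.name.lower() == name.lower()]


def match_strict(customers, name, birth_date, postal_code):
    """Match on several fields and refuse to guess if the result is not unique."""
    hits = [
        c for c in match_by_name(customers, name)
        if c.birth_date == birth_date and c.postal_code == postal_code
    ]
    if len(hits) != 1:
        # Special case: hand it to a person instead of billing someone at random.
        raise LookupError(f"{len(hits)} candidates for {name!r}; needs manual review")
    return hits[0]


# Made-up sample data: two different people with the same name.
customers = [
    Customer(1, "Maria Garcia", date(1980, 3, 14), "80202"),
    Customer(2, "Maria Garcia", date(1992, 11, 2), "80014"),
]
```

Calling `match_by_name(customers, "Maria Garcia")` returns both records, so any process that bills "the" match has to pick one arbitrarily. `match_strict` either returns exactly one record or raises.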

Some more examples of failure

• AT&T, Amtrak, and NASDAQ systems have all had failures
– AT&T Wireless Services Inc. executives said that a massive software failure in November left the company unable to sign up several hundred thousand new subscribers
– NASDAQ: power failures, system software failures in reporting closing prices, etc.
– Amtrak had a system error that prevented passengers from buying tickets online and at station kiosks across the country for a weekend

• Voting system in the 2000 presidential election
– Irregularities in Florida: a wide range of errors, including the insufficient provision of adequate resources, caused a significant breakdown in the state’s plan, which resulted in a variety of problems that permeated the election process in Florida. Large numbers of Florida voters experienced frustration and anger on Election Day as they endured excessive delays, misinformation, and confusion, which resulted in the denial of their right to vote or to have their vote counted.

Some more examples of failure

• Denver Airport
– Mid-1990s: the software that controls its automated baggage system malfunctioned.

Scheduled for takeoff by last Halloween, the airport's grand opening was postponed until December to allow BAE Automated Systems time to flush the gremlins out of its $193-million system. December yielded to March. March slipped to May. In June the airport's planners, their bond rating demoted to junk and their budget hemorrhaging red ink at the rate of $1.1 million a day in interest and operating costs, conceded that they could not predict when the baggage system would stabilize enough for the airport to open.

• Ariane 5 Rocket
– Ariane 5's first test flight (Ariane 5 Flight 501) on 4 June 1996 failed, with the rocket self-destructing 37 seconds after launch because of a malfunction in the control software, which was arguably one of the most expensive computer bugs in history.
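The slide does not say what the control-software malfunction was. The public inquiry report attributed it to an unprotected conversion of a 64-bit floating-point value into a 16-bit signed integer, in code carried over from Ariane 4. The sketch below is not the flight software (which was written in Ada). It only illustrates the general idea of a range-checked narrowing conversion, with function names invented for this example.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def checked_to_int16(value: float) -> int:
    """Convert to a 16-bit range, reporting out-of-range input explicitly."""
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return int(value)


def saturating_to_int16(value: float) -> int:
    """Alternative policy: clamp to the nearest representable value.

    Assumes a finite input; int() rejects NaN and infinity.
    """
    return max(INT16_MIN, min(INT16_MAX, int(value)))
```

Which policy is right (raise and handle, or clamp) depends on the system. Either way, the decision should be made deliberately and tested against the value ranges the new vehicle can actually produce.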

What was the problem???

Denver Airport:

• Baggage system failed due to real-world problems, problems in other systems, and software errors

• Main causes:
– Time allowed for development was insufficient
– Denver made significant changes in specifications after the project began

High-level Causes of Computer-System Failures

• Lack of clear, well-thought-out goals and specifications

• Poor management and poor communication among customers, designers, programmers, etc.

• Pressures that encourage unrealistically low bids, low budget requests, and underestimates of time requirements

• Use of very new technology, with unknown reliability and problems

• Refusal to recognize or admit a project is in trouble

Safety-Critical Applications

We need to be especially careful when creating applications dealing with health and safety.

Example:

• A-320: "fly-by-wire" airplanes (many systems are controlled by computers and not directly by the pilots)
– Between 1988 and 1992, four planes crashed

• Air traffic control is extremely complex, and includes computers on the ground at airports, devices in thousands of airplanes, radar, databases, communications, and so on - all of which must work in real time, tracking airplanes that move very fast

• In spite of problems, computers and other technologies have made air travel safer

Case Study: The Therac-25

Therac-25 Radiation Overdoses:

• Radiation therapy machine produced by Atomic Energy of Canada Limited (AECL) and CGR MeV of France

• Massive overdoses of radiation were given; the machine said no dose had been administered at all

• Caused severe and painful injuries and the death of three patients (+)

• Important to study to avoid repeating errors

• Manufacturer, computer programmer, and hospitals/clinics all have some responsibility

Case Study: The Therac-25

Software and Design problems:

• Re-used software from older systems, unaware of bugs in previous software

• Weaknesses in design of operator interface

• Inadequate test plan

• Bugs in software
– Allowed beam to deploy when table not in proper position
– Ignored changes and corrections operators made at console
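The two bugs listed above both amount to the beam being enabled without the machine state being verified. The sketch below is not AECL's code and does not reproduce the actual defects. `Prescription`, `InterlockError`, and `fire_beam` are hypothetical names, and the position is modeled as a simple string. It only illustrates the kind of explicit interlock check that has to pass before anything dangerous happens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prescription:
    mode: str        # hypothetical values: "electron" or "xray"
    dose_units: int


class InterlockError(RuntimeError):
    """Raised when a safety check fails; the beam must stay off."""


def fire_beam(displayed: Prescription, loaded: Prescription,
              turntable_position: str) -> str:
    """Allow the beam only when independent checks agree."""
    # Check 1: what the operator sees must be what the controller will use,
    # so edits made at the console cannot be silently ignored.
    if displayed != loaded:
        raise InterlockError("console edits have not reached the controller")
    # Check 2: the turntable must be in the position the mode requires.
    if turntable_position != loaded.mode:
        raise InterlockError(
            f"turntable at {turntable_position!r}, mode needs {loaded.mode!r}")
    return f"beam on: {loaded.mode}, {loaded.dose_units} units"
```

A software check like this is a complement to hardware interlocks, not a replacement for them. In a real machine the position would come from an independent sensor rather than from the same program that commanded the move.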

Case Study: The Therac-25

Why So Many Incidents?

• Hospitals had never seen such massive overdoses before, were unsure of the cause

• Manufacturer said the machine could not have caused the overdoses and no other incidents had been reported (which was untrue)

• The manufacturer made changes to the turntable and claimed they had improved safety after the second accident. The changes did not correct any of the causes identified later

Case Study: The Therac-25

Why So Many Incidents? (cont.)

• Recommendations were made for further changes to enhance safety; the manufacturer did not implement them

• The FDA declared the machine defective after the fifth accident

• The sixth accident occurred while the FDA was negotiating with the manufacturer on what changes were needed

Case Study: The Therac-25

Observations and Perspective:

• Minor design and implementation errors usually occur in complex systems; they are to be expected

• The problems in the Therac-25 case were not minor and suggest irresponsibility

• Accidents occurred on other radiation treatment equipment without computer controls when the technicians:
– Left a patient after treatment started to attend a party
– Did not properly measure the radioactive drugs
– Confused micro-curies and milli-curies
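The micro-curie/milli-curie mix-up is a factor-of-1000 error (1 millicurie = 1000 microcuries). One way software can help with this class of mistake is to never pass around bare numbers. The sketch below uses a hypothetical `Activity` type and `check_dose` helper with an assumed 5% tolerance; none of these come from a real dosimetry system.

```python
from dataclasses import dataclass

# Conversion factors to microcuries.
_TO_MICROCURIES = {"uCi": 1.0, "mCi": 1000.0, "Ci": 1_000_000.0}


@dataclass(frozen=True)
class Activity:
    """Radioactivity stored in one canonical unit (microcuries)."""
    microcuries: float

    @classmethod
    def of(cls, value: float, unit: str) -> "Activity":
        # The unit is mandatory, so a bare number cannot slip through.
        if unit not in _TO_MICROCURIES:
            raise ValueError(f"unknown unit {unit!r}")
        return cls(value * _TO_MICROCURIES[unit])


def check_dose(measured: Activity, prescribed: Activity,
               tolerance: float = 0.05) -> bool:
    """True when the measured activity is within tolerance of the prescription."""
    diff = abs(measured.microcuries - prescribed.microcuries)
    return diff <= tolerance * prescribed.microcuries
```

With this, `check_dose(Activity.of(5, "mCi"), Activity.of(5, "uCi"))` is false: the same number in the wrong unit is flagged rather than accepted.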

Discussion Question

• If you were a judge who had to assign responsibility in this case, how much responsibility would you assign to the programmer, the manufacturer, and the hospital or clinic using the machine?

• Post your answers to the Discussion Board

Increasing Reliability and Safety

What goes Wrong?

• Design and development problems

• Management and use problems

• Misrepresentation, hiding problems and inadequate response to reported problems

• Insufficient market or legal incentives to do a better job

• Re-use of software without sufficiently understanding the code and testing it

• Failure to update or maintain a database

Making it Better: Professional Techniques

• Importance of good software engineering and professional responsibility

• User interfaces and human factors
– Feedback
– Should behave as an experienced user expects
– Workload that is too low can lead to mistakes

• Redundancy and self-checking

• Testing
– Include real world testing with real users
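Redundancy and self-checking can be illustrated with a small voting scheme: read the same quantity from several independent channels and accept a value only when a majority agree. This is a generic sketch, not a method prescribed in the slides. `voted_reading` and its `max_spread` parameter are assumptions made for the example.

```python
from statistics import median


def voted_reading(readings, max_spread):
    """Combine redundant sensor readings; fail loudly if they disagree."""
    if len(readings) < 3:
        raise ValueError("need at least three redundant readings")
    mid = median(readings)
    # Keep only the channels that agree with the median within max_spread.
    agreeing = [r for r in readings if abs(r - mid) <= max_spread]
    # Self-check: a strict majority of channels must agree.
    if len(agreeing) * 2 <= len(readings):
        raise RuntimeError("redundant channels disagree; no majority")
    return sum(agreeing) / len(agreeing)
```

For readings `[10.0, 10.2, 55.0]` with `max_spread=0.5`, the faulty third channel is outvoted and the result is the average of the first two. For `[1.0, 5.0, 9.0]` no majority exists and the function raises rather than returning a guess.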

Law, Regulation and Markets

Make it better by:

• Criminal and civil penalties
– Should penalize problems and provide incentives to produce good systems, but shouldn't inhibit innovation

• Warranties for consumer software
– Most software is sold ‘as-is’

• Regulation for safety-critical applications
– Hard to do, but could prevent failures

• Professional licensing
– Arguments for and against

• Taking responsibility

Dependence, Risk, and Progress

• Are We Too Dependent on Computers?
– Computers are tools
– They are not the only dependence
  • Electricity

• Risk and Progress
– Many new technologies were not very safe when they were first developed
– We develop and improve new technologies in response to accidents and disasters
– We should compare the risks of using computers with the risks of other methods and the benefits to be gained

Discussion Questions

• Err.3) Do you believe we are too dependent on computers? Why or why not?

• Err.4) In what ways are we safer due to new technologies?

• Post your answers to the discussion board.