Why do Internet services fail, and what can be done about it?

David Oppenheimer, Archana Ganapathi, and David Patterson

Computer Science Division, University of California at Berkeley

4th USENIX Symposium on Internet Technologies and Systems, March 2003

Slide 2

Motivation

• Internet service availability is important
  – email, instant messenger, web search, e-commerce, …

• User-visible failures are relatively frequent
  – especially with a non-binary definition of "failure"

• To improve availability, must know what causes failures
  – know where to focus research
  – objectively gauge potential benefit of techniques

• Approach: study failures from real Internet services
  – evaluation includes impact of humans & networks

Slide 3

Outline

• Describe methodology and services studied

• Identify most significant failure root causes
  – source: type of component
  – impact: number of incidents, contribution to TTR

• Evaluate HA techniques to see which of them would mitigate the observed failures

• Drill down on one cause: operator error

• Future directions for studying failure data

Slide 4

Methodology

• Obtain "failure" data from three Internet services
  – two services: problem tracking database
  – one service: post-mortems of user-visible failures

Slide 5

Methodology

• Obtain "failure" data from three Internet services
  – two services: problem tracking database
  – one service: post-mortems of user-visible failures

• We analyzed each incident
  – failure root cause
    » hardware, software, operator, environment, unknown
  – type of failure
    » "component failure" vs. "service failure"
  – time to diagnose + repair (TTR)

Slide 6

Methodology

• Obtain "failure" data from three Internet services
  – two services: problem tracking database
  – one service: post-mortems of user-visible failures

• We analyzed each incident
  – failure root cause
    » hardware, software, operator, environment, unknown
  – type of failure
    » "component failure" vs. "service failure"
  – time to diagnose + repair (TTR)

• Did not look at security problems
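To make the classification concrete, here is a minimal sketch of how each analyzed incident might be represented; it is illustrative only, mirrors the categories on this slide, and is not code from the study.

    # Minimal sketch of an incident record matching the classification above
    # (illustrative only; not code from the study).
    from dataclasses import dataclass
    from enum import Enum

    class RootCause(Enum):
        HARDWARE = "hardware"
        SOFTWARE = "software"
        OPERATOR = "operator"
        ENVIRONMENT = "environment"
        UNKNOWN = "unknown"

    class FailureType(Enum):
        COMPONENT = "component failure"   # a part of the service failed
        SERVICE = "service failure"       # failure visible to end users

    @dataclass
    class Incident:
        service: str          # "Online", "Content", or "ReadMostly"
        root_cause: RootCause
        failure_type: FailureType
        ttr_hours: float      # time to diagnose + repair

    # Example record, with made-up values:
    example = Incident("Online", RootCause.OPERATOR, FailureType.SERVICE, ttr_hours=4.0)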

Slide 7

Comparing the three services

Online
  – hits per day: ~100 million
  – # of machines: ~500 @ 2 sites
  – front-end node architecture: custom s/w; Solaris on SPARC, x86
  – back-end node architecture: Network Appliance filers
  – period studied: 7 months
  – # component failures: 296
  – # service failures: 40

ReadMostly
  – hits per day: ~100 million
  – # of machines: > 2000 @ 4 sites
  – front-end node architecture: custom s/w; open-source OS on x86
  – back-end node architecture: custom s/w; open-source OS on x86
  – period studied: 6 months
  – # component failures: N/A
  – # service failures: 21

Content
  – hits per day: ~7 million
  – # of machines: ~500 @ ~15 sites
  – front-end node architecture: custom s/w; open-source OS on x86
  – back-end node architecture: custom s/w; open-source OS on x86
  – period studied: 3 months
  – # component failures: 205
  – # service failures: 56

Slide 8

Outline

• Describe methodology and services studied

• Identify most significant failure root causes
  – source: type of component
  – impact: number of incidents, contribution to TTR

• Evaluate HA techniques to see which of them would mitigate the observed failures

• Drill down on one cause: operator error

• Future directions for studying failure data

Slide 9

Failure cause by % of service failures

Online: operator 33%, software 25%, network 20%, hardware 10%, unknown 12%

Content: operator 36%, software 25%, network 15%, hardware 2%, unknown 22%

ReadMostly: network 62%, operator 19%, unknown 14%, software 5%

Slide 10

Failure cause by % of TTR

Online: operator 76%, software 17%, hardware 6%, network 1%, unknown 1%

Content: operator 75%, network 19%, software 6%

ReadMostly: network 97%, operator 3%
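The percentages on the last two slides can be read as two aggregations over the same incident set: the count of service failures per root cause, and the summed TTR per root cause. A self-contained sketch of that computation, using made-up incident data rather than the study's:

    # Sketch of the two aggregations behind the charts above: share of service
    # failures and share of total TTR, per root cause. The incident list is made up.
    from collections import defaultdict

    # (root_cause, ttr_hours) for each service failure; values are illustrative.
    incidents = [("operator", 6.0), ("operator", 12.0), ("software", 2.0),
                 ("network", 1.5), ("hardware", 0.5), ("unknown", 3.0)]

    def share_by_cause(incidents, weight):
        """Return {cause: percentage}, weighting each incident by weight(ttr)."""
        totals = defaultdict(float)
        for cause, ttr in incidents:
            totals[cause] += weight(ttr)
        grand_total = sum(totals.values())
        return {cause: 100.0 * value / grand_total for cause, value in totals.items()}

    print("% of service failures:", share_by_cause(incidents, weight=lambda ttr: 1.0))
    print("% of TTR:             ", share_by_cause(incidents, weight=lambda ttr: ttr))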

Slide 11

Most important failure root cause?

• Operator error generally the largest cause of service failures
  – even more significant as fraction of total "downtime"
  – configuration errors > 50% of operator errors
  – generally happened when making changes, not repairs

• Network problems significant cause of failures

Slide 12

Related work: failure causes

• Tandem systems (Gray)
  – 1985: Operator 42%, software 25%, hardware 18%
  – 1989: Operator 15%, software 55%, hardware 14%

• VAX (Murphy)
  – 1993: Operator 50%, software 20%, hardware 10%

• Public Telephone Network (Kuhn, Enriquez)
  – 1997: Operator 50%, software 14%, hardware 19%
  – 2002: Operator 54%, software 7%, hardware 30%

Slide 13

Outline

• Describe methodology and services studied

• Identify most significant failure root causes
  – source: type of component
  – impact: number of incidents, contribution to TTR

• Evaluate HA techniques to see which of them would mitigate the observed failures

• Drill down on one cause: operator error

• Future directions for studying failure data

Slide 14

Potential effectiveness of techniques?

technique
  – pre-deployment correctness testing*
  – proactive restart*
  – pre-deployment fault injection/load test
  – component isolation*
  – post-deploy. fault injection/load testing
  – automatic configuration checking
  – redundancy*
  – expose/monitor failures*
  – post-deployment correctness testing*

* indicates technique already used by Online

Slide 15

Potential effectiveness of techniques?

failures avoided / mitigated   technique
 2   pre-deployment correctness testing*
 3   proactive restart*
 3   pre-deployment fault injection/load test
 5   component isolation*
 6   post-deploy. fault injection/load testing
 9   automatic configuration checking
 9   redundancy*
12   expose/monitor failures*
26   post-deployment correctness testing*

(40 service failures examined)
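The top techniques in this count amount to checking a live service's behavior against expectations. As an illustrative aside, not something from the talk, a minimal post-deployment correctness probe might periodically issue a known request and verify the response content rather than just liveness; the URL, expected substring, and interval below are hypothetical placeholders.

    # Minimal sketch of a post-deployment correctness probe (illustrative only).
    # The URL, expected substring, and alerting step are hypothetical placeholders.
    import time
    import urllib.error
    import urllib.request

    PROBE_URL = "http://service.example.com/health"   # hypothetical endpoint
    EXPECTED_SUBSTRING = b"status=ok"                  # hypothetical known-good content
    PROBE_INTERVAL_SECS = 60

    def probe_once(url: str) -> bool:
        """Issue one request and check that the response looks correct,
        not merely that the server answered."""
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read()
                return resp.status == 200 and EXPECTED_SUBSTRING in body
        except (urllib.error.URLError, OSError):
            return False

    if __name__ == "__main__":
        while True:
            if not probe_once(PROBE_URL):
                # In a real deployment this would page an operator or open a ticket.
                print("correctness probe FAILED at", time.ctime())
            time.sleep(PROBE_INTERVAL_SECS)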

Slide 16

Outline

• Describe methodology and services studied

• Identify most significant failure root causes
  – source: type of component
  – impact: number of incidents, contribution to TTR

• Evaluate existing techniques to see which of them would mitigate the observed failures

• Drill down on one cause: operator error

• Future directions for studying failure data

Slide 17

Drilling down: operator error

Why does operator error cause so many service failures?

Existing techniques (e.g., redundancy) are minimally effective at masking operator error.

% of component failures resulting in service failures:

Content: operator 50%, software 24%, network 19%, hardware 6%

Online: operator 25%, software 21%, network 19%, hardware 3%

Slide 18

Drilling down: operator error TTR

Why does operator error contribute so much to TTR?

Detection and diagnosis difficult because of non-failstop failures and poor error checking.

Failure cause by % of TTR:

Online: operator 76%, software 17%, hardware 6%, network 1%, unknown 1%

Content: operator 75%, network 19%, software 6%

Slide 19

Future directions in studying failures

• Quantify impact of operational practices

• Study additional types of sites– transactional, intranets, peer-to-peer

• Create a public failure data repository
  – standard taxonomy of failure causes
  – standard metrics for impact
  – techniques for automatic anonymization (see the sketch after this list)
  – security (not just reliability)
  – automatic analysis (mining for trends, fixes, attacks, …)

• Perform controlled laboratory experiments
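As an illustrative aside on the anonymization bullet above (not a technique described in the talk), one simple approach is keyed pseudonymization of identifying fields before reports enter a shared repository. The field names and key handling below are assumptions, a minimal sketch rather than a vetted scheme.

    # Minimal sketch of keyed pseudonymization for failure reports (illustrative only).
    # The field names ("hostname", "operator", "site") and key source are assumptions.
    import hashlib
    import hmac

    SECRET_KEY = b"site-local-secret"   # hypothetical; kept by the contributing site

    def pseudonymize(value: str) -> str:
        """Replace an identifying string with a stable, non-reversible token so
        records from the same site can still be correlated."""
        return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:12]

    def anonymize_report(report: dict) -> dict:
        """Return a copy of a failure report with identifying fields replaced."""
        redacted = dict(report)
        for field in ("hostname", "operator", "site"):
            if field in redacted:
                redacted[field] = pseudonymize(str(redacted[field]))
        return redacted

    if __name__ == "__main__":
        example = {"hostname": "fe-12.example.com", "operator": "jdoe",
                   "root_cause": "operator", "ttr_hours": 2.5}
        print(anonymize_report(example))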

Slide 20

Conclusion

• Operator error large cause of failures, downtime

• Many failures could be mitigated with
  – better post-deployment testing
  – automatic configuration checking (see the sketch below)
  – better error detection and diagnosis

• Longer-term: concern for operators must be built into systems from the ground up
  – make systems robust to operator error
  – reduce the time it takes operators to detect, diagnose, and repair problems
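As a hedged illustration of what automatic configuration checking could look like (not a technique described in the talk), the sketch below validates a service configuration against a handful of invariants before a change is applied; the file format, keys, and rules are hypothetical.

    # Minimal sketch of an automatic configuration checker (illustrative only).
    # The config keys and validation rules are hypothetical; a real checker would
    # encode service-specific invariants and run before any change is deployed.
    import json
    import sys

    def check_config(cfg: dict) -> list:
        """Return a list of human-readable problems; empty means the config passes."""
        problems = []
        if cfg.get("replicas", 0) < 2:
            problems.append("replicas < 2: no redundancy if one node fails")
        if not (1 <= cfg.get("port", 0) <= 65535):
            problems.append("port is missing or out of range")
        if cfg.get("timeout_secs", 0) <= 0:
            problems.append("timeout_secs must be positive")
        if cfg.get("log_dir", "") == "":
            problems.append("log_dir is empty: failures would be hard to diagnose")
        return problems

    if __name__ == "__main__":
        with open(sys.argv[1]) as f:
            config = json.load(f)
        issues = check_config(config)
        for issue in issues:
            print("CONFIG CHECK:", issue)
        sys.exit(1 if issues else 0)   # nonzero exit blocks the deployment step

A checker along these lines would run as part of the change process, targeting the configuration errors that the study found made up more than half of operator errors.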

Willing to contribute failure data, or information about problem detection/diagnosis techniques?

http://roc.cs.berkeley.edu/projects/faultmanage/

davidopp@cs.berkeley.edu