Copyright © 2010 Accenture All Rights Reserved. Accenture, its logo, and High Performance Delivered are trademarks of Accenture.
Lessons from a Cloud Malfunction An Analysis of a Global System Failure
Alex Maclinovsky – Architecture Innovation
12.09.2011
Copyright © 2011 Alex V. Maclinovsky & Accenture All Rights Reserved.
Introduction
For most of this presentation we will discuss a global
outage of Skype that took place around December
22nd, 2010. But this story is not about Skype. It is
about building dependable internet-scale systems,
and it is based on the Skype case because complex
distributed systems share many similarities, and
certain failure mechanisms manifest themselves
again and again.
Disclaimer: this analysis is not based on insider knowledge. I
relied solely on the Skype CIO's blog, my own observations during
the event, and experience running similar systems.
Approximate Skype Outage Timeline
Weeks before incident:
• A buggy version of the Windows client is released
• Bug identified and fixed; new version released
0 min: Cluster of support servers responsible for offline instant messaging becomes overloaded and fails
+30 min: Buggy clients receive delayed messages and begin to crash (20% of all clients)
+1 hour: 30% of the publicly available supernodes are down; traffic surges
+2 hours: Orphaned clients crowd the surviving supernodes, which self-destruct in turn
+3 hours: Cloud disintegrates
+6 hours: Skype introduces recovery supernodes; recovery starts
+12 hours: Recovery is slow; resources cannibalized to introduce more nodes
+24 hours: Cloud recovery complete
+48 hours: Cloud stable; resources released
+72 hours: Sacrificed capabilities restored
Eleven Lessons
1. Pervasive Monitoring
2. Early warning systems
3. Graceful degradation
4. Contagious failure awareness
5. Design for failures
6. Fail-fast and exponential back-off
7. Scalable and fault-tolerant control plane
8. Fault injection testing
9. Meaningful failure messages
10. Efficient, timely and honest communication to the end users
11. Separate command, control and recovery infrastructure
1 - Pervasive Monitoring
It appears that Skype either did not have sufficient monitoring in
place or its monitoring was not actionable: no alarms were triggered
when the system started behaving abnormally.
• Instrument everything
• Collect, aggregate and store telemetry
• Make results available in (near)real time
• Constantly monitor against normal operational boundaries
• Detect trends
• Generate events, raise alerts, trigger recovery actions
• Go beyond averages (tp90, tp99, tp999)
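The point about going beyond averages can be illustrated with a small sketch: on a hypothetical latency sample, the average looks tolerable while tp99 reveals the long tail that users actually experience. The sample values and the nearest-rank method are illustrative assumptions.

```python
def percentile(values, p):
    """Return the p-th percentile (0-100) of a sample using nearest-rank."""
    ordered = sorted(values)
    # Nearest-rank: index of the smallest value covering p percent of samples
    k = max(0, min(len(ordered) - 1, int(round(p / 100.0 * len(ordered))) - 1))
    return ordered[k]

# Hypothetical latency sample in milliseconds: mostly fast, a few very slow calls
latencies = [10] * 95 + [900] * 5

average = sum(latencies) / len(latencies)
tp90 = percentile(latencies, 90)
tp99 = percentile(latencies, 99)

print(average)  # 54.5 -- looks tolerable
print(tp90)     # 10   -- 90% of users are fine
print(tp99)     # 900  -- the tail tells the real story
```

The same average can hide arbitrarily bad tails, which is why alarms should be set on high percentiles, not means.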
2 - Early warning systems
Build mechanisms that predict that a problem is
approaching via trend analysis:
• Monitor trends
• Use "canaries"
• Use PID controllers
• Look for unusual deviations between correlated values
Skype did not detect that the support cluster was approaching its tipping point.
2 - Early warning: trend spotting
Look for unexpected changes in system state and behavior
over time.
[Figure: common problem predictor patterns - four value-over-time charts]
• Step function: an abrupt jump in a metric
• Unexpected trend: a metric drifts steadily away from its normal range
• Worst-case degradation: tp99 diverges from the average
• Deviation between correlated values: e.g. CPU load no longer tracks TPS
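The last of these predictors, deviation between correlated values, can be sketched as a simple ratio check. The signal names, the healthy-first-sample baseline, and the 50% tolerance are illustrative assumptions, not Skype's actual mechanism.

```python
def correlation_alarm(tps_history, cpu_history, tolerance=0.5):
    """Flag samples where CPU load diverges from throughput.

    In a healthy system CPU tracks traffic, so their ratio stays roughly
    constant; a sudden divergence often predicts trouble (e.g. retry storms
    burning CPU while useful throughput stalls).
    """
    baseline = cpu_history[0] / tps_history[0]  # assume the first sample is healthy
    alarms = []
    for i, (tps, cpu) in enumerate(zip(tps_history, cpu_history)):
        ratio = cpu / tps
        if abs(ratio - baseline) / baseline > tolerance:
            alarms.append(i)
    return alarms

# CPU climbs while throughput collapses: a classic contagious-failure precursor
tps = [100, 100, 100, 40, 20]
cpu = [30, 31, 33, 60, 85]
print(correlation_alarm(tps, cpu))  # [3, 4]
```

A production version would use a rolling baseline rather than a single sample, but the principle is the same: alarm on the relationship between metrics, not only on each metric in isolation.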
3 - Graceful degradation
Graceful degradation occurs when, in response to non-
normative operating conditions (e.g. overload, resource
exhaustion, failure of components or downstream
dependencies), the system continues to operate but provides
a reduced level of service rather than failing completely. It
should be viewed as a mechanism complementary to fault
tolerance in the design of highly available distributed systems.
Overload protection, such as load shedding or throttling, would have caused this event to fizzle out as a minor QoS violation that most Skype users would never have noticed.
3 - Graceful degradation - continued
There were two places where overload protection was missing:
• Simple traditional throttling in the
support cluster would have kept it
from failing and triggering the rest
of the event.
• Once that cluster failed, a more
sophisticated, globally distributed
throttling mechanism could have
prevented the contagious failure
of the supernodes, which was
the main reason for the global
outage.
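The "simple traditional throttling" mentioned above is commonly implemented as a token bucket; a minimal sketch, with an illustrative rate and burst capacity:

```python
import time

class TokenBucket:
    """Reject work beyond a sustained rate instead of letting the server tip over."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed this request: degraded service, not a dead server

bucket = TokenBucket(rate=1, capacity=5)
results = [bucket.allow() for _ in range(8)]
print(results.count(True))  # 5 -- the burst is served, the rest is shed
```

Requests beyond the burst are rejected cheaply and immediately, which is exactly the "minor QoS violation" outcome: some offline messages are delayed, but the cluster stays up.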
4 - Contagious failure awareness
• The P2P cloud outage was a classic example of a positive-
feedback-induced contagious failure scenario
• It occurs when a failure of a component in a redundant
system increases the probability of failure of its peers, which
are supposed to take over and compensate for the initial
failure
• This mechanism is quite common and is responsible for a
number of infamous events, ranging from the Tu-144 crashes
to the credit default swap debacle
• Grid architectures are susceptible to
contagious failure: e.g. the 2009 Gmail outage
5 - Design for failures
• Since failures are inevitable, dependable
systems need to follow the principles of
Recovery-Oriented Computing (ROC),
aiming at recovery from failures rather
than failure avoidance.
• Use built-in auto-recovery:
restart → reboot → reimage → replace
• The root issue was client-version-specific, and
the design should have assumed it:
– check if there is a newer version and upgrade
– downgrade to an earlier version flagged as "safe"
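The restart → reboot → reimage → replace ladder can be sketched as an escalation loop. The action names come from the slide; the health-check hook and action handlers are hypothetical placeholders, not any real Skype interface.

```python
# Escalating auto-recovery: try the cheapest remedy first, escalate on failure
RECOVERY_LADDER = ["restart", "reboot", "reimage", "replace"]

def auto_recover(node, is_healthy, actions):
    """Apply recovery actions in order of increasing cost until the node is healthy.

    is_healthy(node) -> bool and actions[name](node) are caller-supplied hooks.
    Returns the action that fixed the node, or None if it was already healthy.
    """
    if is_healthy(node):
        return None
    for action in RECOVERY_LADDER:
        actions[action](node)
        if is_healthy(node):
            return action
    raise RuntimeError("node %s unrecoverable, escalate to a human" % node)

# Toy demo: the node only comes back after a reimage
state = {"fixed_by": "reimage", "applied": []}
actions = {a: (lambda n, a=a: state["applied"].append(a)) for a in RECOVERY_LADDER}
healthy = lambda n: state["fixed_by"] in state["applied"]
result = auto_recover("node-1", healthy, actions)
print(result)  # reimage
```

The key design point is ordering by cost: most faults are cured by a cheap restart, so the expensive remedies run rarely, and a human is paged only when the whole ladder fails.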
5 - Design for failures: universal auto-recovery strategy
6 - Fail-fast and exponential back-off
• Vitally important in highly distributed systems
to avoid self-inflicted distributed denial of
service (DDoS) attacks similar to the one
which decimated the supernodes
• Since there were humans in the chain, the
back-off state should have been sticky:
persisted somewhere to prevent circumvention
by constantly restarting the client
• When building a system where the same
request might be retried by different agents, it
is important to implement persistent global
back-off to make sure that no operation is
retried more frequently than permitted.
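A sketch of sticky exponential back-off: the delay doubles on every failed attempt and the attempt counter is persisted, so restarting the process does not reset it. The state-file location, base delay, and cap are illustrative assumptions.

```python
import json
import os
import random
import tempfile

# Illustrative location; a real client would use its own durable config dir
STATE_FILE = os.path.join(tempfile.gettempdir(), "backoff_state.json")

def next_delay(base=1.0, cap=300.0):
    """Return the next retry delay in seconds, doubling on every call.

    The attempt counter is persisted so a restarted client cannot reset
    its back-off and rejoin the retry storm at full rate.
    """
    try:
        with open(STATE_FILE) as f:
            attempt = json.load(f)["attempt"]
    except (OSError, ValueError, KeyError):
        attempt = 0  # no state yet: first attempt
    with open(STATE_FILE, "w") as f:
        json.dump({"attempt": attempt + 1}, f)
    delay = min(cap, base * 2 ** attempt)
    return delay * random.uniform(0.5, 1.0)  # jitter de-synchronizes clients

def reset_backoff():
    """Call after a successful request to clear the persisted state."""
    if os.path.exists(STATE_FILE):
        os.remove(STATE_FILE)
```

The jitter term matters as much as the doubling: without it, millions of clients that failed at the same moment would all retry at the same moment too, re-creating the very stampede the back-off is meant to prevent.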
7 - Scalable fault-tolerant control plane
• Build a 23-million-way Big Red Button: the ability to
instantly control the "flash crowd"
• Most distributed systems focus on scaling the data
plane and assume the control plane is insignificant
• Skype's control plane for the client relied on relay
by the supernodes and was effectively disabled
when the cloud disintegrated
– Backing it up with a simple
RSS-style command feed would
have made it possible to control
the cloud even in its dissipated state.
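A client-side sketch of such a backup command channel: each client periodically polls a static, heavily cacheable feed and acts on commands addressed to its version, so the operator can reach every client even with the P2P overlay gone. The feed format, field names, and commands here are invented for illustration.

```python
import json

def apply_command_feed(feed_text, current_version, seen_ids):
    """Process a polled command feed; return the commands this client should run.

    The feed is assumed to be a JSON list of {"id", "max_affected_version",
    "command"} entries served from a static URL -- deliberately independent
    of the P2P overlay it is meant to control.
    """
    to_run = []
    for entry in json.loads(feed_text):
        if entry["id"] in seen_ids:
            continue  # idempotent: each command is executed at most once
        if current_version <= entry["max_affected_version"]:
            to_run.append(entry["command"])
        seen_ids.add(entry["id"])
    return to_run

# Hypothetical feed: upgrade all clients up to 5.0, throttle only buggy <= 4.0 ones
feed = json.dumps([
    {"id": 1, "max_affected_version": 5.0, "command": "upgrade"},
    {"id": 2, "max_affected_version": 4.0, "command": "throttle_reconnects"},
])
seen = set()
print(apply_command_feed(feed, current_version=4.5, seen_ids=seen))  # ['upgrade']
print(apply_command_feed(feed, 4.5, seen))  # [] -- already processed
```

Because the feed is static content, it can sit behind a CDN and absorb 23 million simultaneous polls without any scaling heroics.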
8 - Fault injection testing
• Skype's problem originated and became critical in
the parts of the system that were dealing with
failures of other components.
• This is very typical – often 50% of the code and
overall complexity is dedicated to fault handling.
• This code is extremely difficult to test outside
rudimentary white-box unit tests, so in most cases
it remains untested.
– Almost never regression tested, making it the most stale
and least reliable part of the system
– Is invoked precisely when the system is already experiencing problems
8 - Fault injection testing framework
• Requires an on-demand fault injection framework.
• Framework intercepts and controls all
communications between components and layers.
• Exposes API to simulate all conceivable kinds of
failures:
– total and intermittent component outages
– communication failures
– SLA violations
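Such a framework can be sketched as a proxy that wraps a component's interface and injects configured failures into every call. The component, the fault parameters, and the exception choices are illustrative assumptions.

```python
import random
import time

class FaultInjector:
    """Proxy that intercepts calls to a component and injects failures on demand."""

    def __init__(self, target, failure_rate=0.0, extra_latency=0.0, outage=False):
        self._target = target
        self.failure_rate = failure_rate    # probability of an intermittent failure
        self.extra_latency = extra_latency  # simulated SLA violation, in seconds
        self.outage = outage                # total component outage

    def __getattr__(self, name):
        # Called for attributes not found on the proxy itself, i.e. target methods
        method = getattr(self._target, name)
        def wrapped(*args, **kwargs):
            if self.outage:
                raise ConnectionError("injected: component down")
            if random.random() < self.failure_rate:
                raise TimeoutError("injected: intermittent failure")
            time.sleep(self.extra_latency)  # degrade latency to test SLA handling
            return method(*args, **kwargs)
        return wrapped

class Store:
    """Hypothetical downstream component under test."""
    def get(self, key):
        return "value-for-" + key

store = FaultInjector(Store(), outage=True)
try:
    store.get("k1")
except ConnectionError as e:
    print(e)  # injected: component down
```

Because callers hold the proxy instead of the real component, the same test suite can toggle outages, flakiness, and latency at runtime and exercise the fault-handling paths that normally go untested.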
9 - Meaningful failure messages
• Throughout the event, both the client and central
site were reporting assorted problems that were
often unrelated to what was actually happening:
– e.g. at some point my Skype client complained that my
credentials were incorrect
• Leverage crowdsourcing:
– clear and relevant error messages
– easy ways for users to report problems
– real-time aggregation
– secondary monitoring and alerting network
10 - Communication to the end users
• Efficient, timely and honest communication to the
end users is the only way to run Dependable
Systems
• Dedicated Status site for humans
• Status APIs and Feeds
11 - Separate command, control and recovery infrastructure
• Have a physically separate and logically
independent emergency command, control and
recovery infrastructure
• Catalog technologies, systems and dependencies
used to detect faults, build, deploy and (re)start
applications, communicate status
• Avoid circular dependencies
• Use separate implementations or old stable
versions
11 - Separate command, control and recovery infrastructure - blueprint
[Diagram: blueprint of the separate command, control and recovery infrastructure - the regular monitoring & control systems and build & deploy systems use a shared data cloud (binary content) to manage and maintain all other systems, while a standalone control & recovery stack, backed by its own standalone data cloud, mirrors those manage/maintain functions without depending on them]
Cloud Outage Redux
On April 21st, 2011, an operator error during
maintenance caused an outage of the Amazon
Elastic Block Store ("EBS") service, which in turn
brought down parts of the Amazon Elastic Compute
Cloud ("EC2") and Amazon Relational Database
Service (RDS). The outage lasted 54 hours;
data recovery took several more days,
and 0.07% of the EBS volumes
were lost.
The outage affected over 100 sites, including big
names like Reddit, Foursquare and Moby.
EC2 Outage Sequence
• An operator error occurs during a routine
maintenance operation
• Production traffic is routed into a low-capacity
backup network
• A split brain occurs
• When the network is restored, the system enters
a "re-mirroring storm"
• The EBS control plane is overwhelmed
• Dependent services start failing
Applicable Lessons
1. Pervasive Monitoring
2. Early warning systems
3. Graceful degradation
4. Contagious failure awareness
5. Design for failures
6. Fail-fast and exponential back-off
7. Scalable and fault-tolerant control plane
8. Fault injection testing
9. Meaningful failure messages
10. Efficient, timely and honest communication to the end users
11. Separate command, control and recovery infrastructure
Further Reading and Contact Information
Skype Outage Postmortem - http://blogs.skype.com/en/2010/12/cio_update.html
Amazon Outage Postmortem - http://aws.amazon.com/message/65648/
Recovery-Oriented Computing - http://roc.cs.berkeley.edu/roc_overview.html
Designing and Deploying Internet-Scale Services, James Hamilton - http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa_DesigningServices.pptx
Design Patterns for Graceful Degradation, Titos Saridakis - http://www.springerlink.com/content/m7452413022t53w1/
Alex Maclinovsky blogs at http://randomfour.com/blog/
and can be reached via: