Amazon Cloud Major Outages Analysis

Transcript
Page 1: Amazon Cloud Major Outages Analysis

Amazon Major Cloud Outage Analysis

Author: Rahul Tyagi

Page 2: Amazon Cloud Major Outages Analysis


The Agenda

• The Issue
• The Goals
• Analysis Methodology
• The Analysis

Page 3: Amazon Cloud Major Outages Analysis


The Issue

• Because the Amazon cloud is deeply embedded in enterprises, major Amazon cloud outages cause widespread impact…

• Organizations such as Netflix, Dropbox, Airbnb, and Pinterest have been affected by Amazon cloud outages

Page 4: Amazon Cloud Major Outages Analysis


The Issue

• Major cloud outages have been fairly regular events in the recent past; some of the major outages:

• Dec/24/2012
• Oct/22/2012
• Jun/29/2012
• Apr/21/2011

Page 5: Amazon Cloud Major Outages Analysis


The Goals

• We want to analyze the chain of events causing major Amazon cloud outages (from official Amazon statements)…

• We analyzed the major outages of the past two years…

• The goal is to identify probable root causes and areas with opportunity to improve…

Page 6: Amazon Cloud Major Outages Analysis


Analysis Methodology

We leverage the “Analytic Hierarchy Process” (AHP) for identifying root causes…
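For readers unfamiliar with AHP: it derives priority weights from a pairwise comparison matrix whose entries say how much more important one factor is than another. Below is a minimal Python sketch, not part of the original analysis; the pairwise judgments over the four issue categories are hypothetical.

# Minimal AHP sketch: derive priority weights for the issue categories from a
# pairwise comparison matrix and check its consistency. The judgments below
# are hypothetical, for illustration only.
import numpy as np

categories = ["Hardware", "Software", "Automation", "Process"]

# Saaty-scale judgments: A[i][j] = how much more important category i is than j,
# with A[j][i] = 1 / A[i][j].
A = np.array([
    [1.0, 1/3, 1.0, 1/5],
    [3.0, 1.0, 3.0, 1/3],
    [1.0, 1/3, 1.0, 1/5],
    [5.0, 3.0, 5.0, 1.0],
])

# Priorities = normalized principal eigenvector of A.
eigvals, eigvecs = np.linalg.eig(A)
k = eigvals.real.argmax()
weights = np.abs(eigvecs[:, k].real)
weights /= weights.sum()

# Consistency check: CR < 0.10 is conventionally acceptable (RI = 0.90 for n = 4).
n = A.shape[0]
ci = (eigvals.real[k] - n) / (n - 1)
cr = ci / 0.90

for cat, w in zip(categories, weights):
    print(f"{cat:10s} {w:.3f}")
print(f"consistency ratio: {cr:.3f}")

With these illustrative judgments the weights rank Process highest, followed by Software, which mirrors the categories that turn out to dominate in the analysis that follows.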

Page 7: Amazon Cloud Major Outages Analysis


Analysis Methodology

Analyze Amazon’s Statements about Outage → Identify “Chain of Events” causing outage → Categorize “Chain of Events” → Analysis and Conclusion

Page 8: Amazon Cloud Major Outages Analysis


The Analysis > Analyze Amazon’s Statements about Outages

Outage Date Amazon’s Statement

Dec/24/2012 http://aws.amazon.com/message/680587/

Oct/22/2012 http://aws.amazon.com/message/680342/

Jun/29/2012 http://aws.amazon.com/message/67457/

Apr/21/2011 http://aws.amazon.com/message/65648/

We analyzed the following official Amazon statements…

Page 9: Amazon Cloud Major Outages Analysis


The Analysis > Identify “Chain of Events” causing outages

Outage Core Issue

Dec-12 “The [ELB State] data was deleted by a maintenance process that was inadvertently run against the production ELB state data”

Oct-12 “The root cause of the problem was a latent bug in an operational data collection agent that runs on the EBS storage servers”

Jun-12 “In the single datacenter that did not successfully transfer to the generator backup, all servers continued to operate normally on Uninterruptable Power Supply (“UPS”) power. As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT”

Apr-11 “The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.”

The statements in double quotes are from Amazon’s press releases…

Page 10: Amazon Cloud Major Outages Analysis


The Analysis > Identify “Chain of Events” causing outages

Outage Chain of Events

Dec-12 "Maintenance process inadvertently run against production ELB state data"

Process for incident approval had loose ends

Validation of the output of the maintenance process (which ran inadvertently) was missing

"load balancers that were modified were improperly configured by the control plane"

Oct-12 "latent bug in an (EBS) operational data collection agent"

"latent memory leak bug in the reporting agent" The monitoring process of memory leak was non existent.

"the DNS update did not successfully propagate to all of the internal DNS servers"

"the (aggressive) throttling policy that was put in place was too aggressive"

Jun-12 "datacenter that did not successfully transfer to the generator backup"

"As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT""a small number of Multi-AZ RDS instances did not complete failover, due to a software bug"

"As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn’t seen before"

Apr-11 “The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.”

"We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures"

"We will audit our change process and increase the automation to prevent this mistake from happening in the future""We will also invest in increasing our visibility, control, and automation to recover volumes in an EBS cluster"

Page 11: Amazon Cloud Major Outages Analysis


The Analysis > Categorize “Chain of Events”

Outage / Chain of Events / Categories (Hardware, Software, Automation, Process; an "X" marks each applicable category)

Dec-12
"Maintenance process inadvertently run against production ELB state data" X X
Process for incident approval had loose ends X
Validation of the output of the maintenance process (which ran inadvertently) was missing X X X
"load balancers that were modified were improperly configured by the control plane" X

Oct-12
"latent bug in an (EBS) operational data collection agent" X X
"latent memory leak bug in the reporting agent"; the monitoring process for the memory leak was non-existent X X
"the DNS update did not successfully propagate to all of the internal DNS servers" X X
"the (aggressive) throttling policy that was put in place was too aggressive" X X

Jun-12
"datacenter that did not successfully transfer to the generator backup" X
"As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT" X
"a small number of Multi-AZ RDS instances did not complete failover, due to a software bug" X X
"As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn’t seen before" X X

Apr-11
“The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.” X
"We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures" X
"We will audit our change process and increase the automation to prevent this mistake from happening in the future" X
"We will also invest in increasing our visibility, control, and automation to recover volumes in an EBS cluster" X X
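To show how the category marks above roll up into the per-category totals on the following slides, here is a minimal Python tally sketch; the event-to-category assignments in it are illustrative placeholders, not taken from the table.

# Minimal tally sketch: count how many chain-of-events entries fall into each
# issue category. The assignments below are illustrative only.
from collections import Counter

events = [
    ("Dec-12", "Maintenance process run against production ELB state data", ["Process", "Automation"]),
    ("Oct-12", "Latent memory leak bug in the reporting agent",             ["Software", "Process"]),
    ("Jun-12", "Datacenter did not transfer to the generator backup",       ["Process"]),
    ("Apr-11", "Traffic shifted onto the lower-capacity redundant EBS network", ["Process", "Automation"]),
]

# Every mark counts once, so an event tagged with two categories adds one to each.
counts = Counter(cat for _, _, cats in events for cat in cats)
for cat, n in counts.most_common():
    print(cat, n)   # with this sample data: Process 4, Automation 2, Software 1

Applying the same roll-up to the full table yields the counts charted on the next slide (Software 8, Automation 4, Process 14).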

Page 12: Amazon Cloud Major Outages Analysis


The Analysis > Analysis and Conclusions

Process issues are a common theme in major Amazon cloud outages…

Page 13: Amazon Cloud Major Outages Analysis


The Analysis > Analysis and Conclusions

[Chart: Amazon Cloud Major Outage - Issue Categories, # of issues per category: Software 8, Automation 4, Process 14]

Process and Software are the leading contributing factors to major outages at Amazon…

Page 14: Amazon Cloud Major Outages Analysis


The Analysis > Analysis and Conclusions

• The majority of issues contributing to outages are related to process or software

• It seems “Process” rigor in Amazon’s cloud operations and SDLC has room to improve

• Culture? Amazon is said to have a “Just Do It” culture; process rigor may require more than just “just do it”

Page 15: Amazon Cloud Major Outages Analysis


Thank You! You are Awesome! You deserve applause!!