Amazon Cloud Major Outages Analysis

Amazon Major Cloud Outage Analysis
Author: Rahul Tyagi

Description

Amazon is a highly successful organization. Its teams innovate persistently and launch successful products to the marketplace. Over time Amazon has reinvented itself: it began its journey as an online bookseller, kept adding new products and services to its online store (toys, clothes, shoes, furniture, and more), and is now considered the biggest online store on the planet. Amazon has driven a great deal of innovation in the retail space and has honed its infrastructure skills over the years to achieve the scale it needs.

In the last few years Amazon decided to leverage this infrastructure-scale advantage and offer IT infrastructure as a service to its customers, for example storage services on a cloud-based platform. Today Amazon is one of the major cloud service providers. Amazon is also considered to have a cool work culture; many journals reference Amazon's "Just-Do-It" style. Many small and large organizations use Amazon's cloud products and services: Dropbox, Reddit, Pinterest, AirBnB, Netflix, and others rely on the Amazon cloud to run their businesses, so the cloud platform is mission critical to Amazon's customers.

In the recent past we have seen major outages of the Amazon cloud platform: Dec/24/2012, Oct/2012, and Jun/2012. Amazon cloud outages now look almost like a quarterly event. These snafus cause major business disruption for Amazon's customers; on Christmas Eve many customers were unable to enjoy Netflix streaming services, and the Oct/2012 outage impacted organizations such as Pinterest and AirBnB. It is striking that an organization so successful at providing great products and services in the retail space is failing, or at least struggling, in the cloud space. What are the potential reasons for the outages, and how can they be mitigated? I analyzed the major Amazon cloud incidents using the root-cause analyses published by Amazon, public opinion, and customer commentary. In my analysis, process flaws in cloud operations are a constant theme across the majority of Amazon's cloud outages, with software (probably SDLC) issues also observed as contributing factors. I look forward to hearing your thoughts.

Transcript of Amazon Cloud Major Outages Analysis

Page 1: Amazon Cloud Major Outages Analysis

Amazon Major Cloud Outage Analysis

Author: Rahul Tyagi

Page 2: Amazon Cloud Major Outages Analysis

The Agenda

• The Issue
• The Goals
• Analysis Methodology
• The Analysis

Page 3: Amazon Cloud Major Outages Analysis

The Issue

• Due to the deep proliferation of the Amazon cloud into enterprises, major Amazon cloud outages cause widespread impact…

• Organizations such as Netflix, Dropbox, AirBnB, and Pinterest have been impacted by Amazon cloud outages

Page 4: Amazon Cloud Major Outages Analysis

The Issue

• Major cloud outages have been fairly regular events in the recent past; some of the major outages:

• Dec/24/2012
• Oct/22/2012
• Jun/29/2012
• Apr/21/2011

Page 5: Amazon Cloud Major Outages Analysis

The Goals

• We want to analyze the chain of events causing major Amazon cloud outages (from official Amazon statements)…

• We analyzed the major outages of the past two years…

• The goal is to identify probable root causes and areas with opportunity to improve…

Page 6: Amazon Cloud Major Outages Analysis

Analysis Methodology

We leverage the "Analytical Hierarchy Process" to identify root causes…

Page 7: Amazon Cloud Major Outages Analysis

Analysis Methodology

Analyze Amazon's statements about the outage →
Identify the "chain of events" causing the outage →
Categorize the "chain of events" →
Analysis and conclusions
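A minimal sketch of this four-step pipeline in Python is shown below; it is my own scaffolding, not Amazon's tooling or a full implementation of the Analytical Hierarchy Process. The `ChainEvent` structure and function names are assumptions; the four category labels come from the categorization slide later in this deck.

```python
# Sketch of the four-step analysis pipeline shown above (illustrative only).
from dataclasses import dataclass, field
from collections import Counter

CATEGORIES = ("Hardware", "Software", "Automation", "Process")

@dataclass
class ChainEvent:
    outage: str                                     # e.g. "Dec-12"
    statement: str                                  # quoted text from Amazon's official summary
    categories: set = field(default_factory=set)    # subset of CATEGORIES

def identify_chain_of_events(statement_text: str, outage: str) -> list:
    """Step 2: split an official statement into discrete contributing events.
    In the actual analysis this split was done manually by reading the statement;
    here each non-empty line is simply wrapped as one event."""
    return [ChainEvent(outage, line.strip())
            for line in statement_text.splitlines() if line.strip()]

def categorize(event: ChainEvent) -> ChainEvent:
    """Step 3: tag each event with one or more categories (a manual judgment call)."""
    return event

def tally(events) -> Counter:
    """Step 4: count how often each category contributes across all outages."""
    counts = Counter()
    for event in events:
        counts.update(event.categories)
    return counts
```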

Page 8: Amazon Cloud Major Outages Analysis

The Analysis > Analyze Amazon’s Statements about Outages

Outage Date Amazon’s Statement

Dec/24/2012 http://aws.amazon.com/message/680587/

Oct/22/2012 http://aws.amazon.com/message/680342/

Jun/29/2012 http://aws.amazon.com/message/67457/

Apr/21/2011 http://aws.amazon.com/message/65648/

We analyzed the following official Amazon statements…
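For reference, the same table expressed as a simple data structure (a sketch; the variable name is mine, while the dates and URLs are taken from the table above):

```python
# Official AWS post-incident summaries analyzed (dates and URLs from the table above).
AMAZON_STATEMENTS = {
    "Dec/24/2012": "http://aws.amazon.com/message/680587/",
    "Oct/22/2012": "http://aws.amazon.com/message/680342/",
    "Jun/29/2012": "http://aws.amazon.com/message/67457/",
    "Apr/21/2011": "http://aws.amazon.com/message/65648/",
}
```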

Page 9: Amazon Cloud Major Outages Analysis

The Analysis > Identify “Chain of Events” causing outages

Outage    Core Issue

Dec-12    "The [ELB State] data was deleted by a maintenance process that was inadvertently run against the production ELB state data"

Oct-12    "The root cause of the problem was a latent bug in an operational data collection agent that runs on the EBS storage servers"

Jun-12    "In the single datacenter that did not successfully transfer to the generator backup, all servers continued to operate normally on Uninterruptable Power Supply ("UPS") power. As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT"

Apr-11    "The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network."

The statements in double quotes are from Amazon’s press releases…

Page 10: Amazon Cloud Major Outages Analysis

The Analysis > Identify “Chain of Events” causing outages

Outage    Chain of Events

Dec-12
• "Maintenance process inadvertently run against production ELB state data"
• The process for incident approval had loose ends
• Validation of the output of the maintenance process (which ran inadvertently) was missing (a hypothetical pre-run guard illustrating this kind of control is sketched after this list)
• "load balancers that were modified were improperly configured by the control plane"

Oct-12
• "latent bug in an (EBS) operational data collection agent"
• "latent memory leak bug in the reporting agent"; the monitoring process for the memory leak was non-existent
• "the DNS update did not successfully propagate to all of the internal DNS servers"
• "the (aggressive) throttling policy that was put in place was too aggressive"

Jun-12
• "datacenter that did not successfully transfer to the generator backup"
• "As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT"
• "a small number of Multi-AZ RDS instances did not complete failover, due to a software bug"
• "As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn't seen before"

Apr-11
• "The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network."
• "We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures"
• "We will audit our change process and increase the automation to prevent this mistake from happening in the future"
• "We will also invest in increasing our visibility, control, and automation to recover volumes in an EBS cluster"

Page 11: Amazon Cloud Major Outages Analysis

The Analysis > Categorize “Chain of Events”

Outage / Chain of Events (each item marked X against one or more of: Hardware, Software, Automation, Process)

Dec-12
• "Maintenance process inadvertently run against production ELB state data" X X
• Process for incident approval had loose ends X
• Validation of the output of the maintenance process (which ran inadvertently) was missing X X X
• "load balancers that were modified were improperly configured by the control plane" X

Oct-12
• "latent bug in an (EBS) operational data collection agent" X X
• "latent memory leak bug in the reporting agent"; the monitoring process for the memory leak was non-existent X X
• "the DNS update did not successfully propagate to all of the internal DNS servers" X X
• "the (aggressive) throttling policy that was put in place was too aggressive" X X

Jun-12
• "datacenter that did not successfully transfer to the generator backup" X
• "As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT" X
• "a small number of Multi-AZ RDS instances did not complete failover, due to a software bug" X X
• "As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn't seen before" X X

Apr-11
• "The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network." X
• "We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures" X
• "We will audit our change process and increase the automation to prevent this mistake from happening in the future" X
• "We will also invest in increasing our visibility, control, and automation to recover volumes in an EBS cluster" X X

Page 12: Amazon Cloud Major Outages Analysis

The Analysis > Analysis and Conclusions

Process issues are a common theme in major outages of the Amazon cloud…

Page 13: Amazon Cloud Major Outages Analysis

The Analysis > Analysis and Conclusions

Chart: Amazon Cloud Major Outage - Issue Categories (# of Issues)
• Process: 14
• Software: 8
• Automation: 4
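A quick arithmetic check of the chart above (assuming the three categories shown are the full set counted) shows how lopsided the distribution is:

```python
# Category counts taken from the chart above; shares are simple percentages.
counts = {"Process": 14, "Software": 8, "Automation": 4}
total = sum(counts.values())  # 26 categorized issues in total
for category, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {n} issues ({n / total:.0%})")
# Output: Process: 14 issues (54%), Software: 8 issues (31%), Automation: 4 issues (15%)
```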

Process and software are the leading contributing factors in major outages at Amazon…

Page 14: Amazon Cloud Major Outages Analysis

The Analysis > Analysis and Conclusions

• The majority of issues contributing to outages are related to process or software

• It seems "process" rigor in cloud operations and in the SDLC at Amazon has room for improvement

• Culture? We have heard that Amazon has a "Just-Do-It" culture; process rigor may require more than just "just do it"

Page 15: Amazon Cloud Major Outages Analysis

Thank You! You are Awesome! You deserve applause!!