Surviving an Amazon Outage

©Continuent 2012. Surviving An Amazon Outage Neil Armitage, Cluster implementation Engineer, Continuent Wednesday, 24 April 13

Transcript of Surviving an Amazon Outage

Page 1: Surviving an Amazon Outage

©Continuent 2012.

Surviving An Amazon Outage

Neil Armitage, Cluster implementation Engineer, Continuent

Wednesday, 24 April 13

Page 2: Surviving an Amazon Outage


Overview

• Continuent’s external/internal infrastructure is built in AWS

• Review carried out in the Summer of 2012 after several AWS Outages

• Treated the review as a Customer engagement

• Further review in Autumn of 2012 leading to the Multi-Cloud deployment

Page 3: Surviving an Amazon Outage

What is AWS

Amazon Web Services is a collection of remote computing services (also called web services) that together make up a cloud computing platform.

The central services are EC2 (Compute) and S3 (Storage) Services.

Page 4: Surviving an Amazon Outage

AWS Regions

Ireland (3 AZ)

Sao Paulo (2 AZ)

Northern Virginia (5 AZ)

Oregon (3 AZ)

California (3 AZ)

Singapore (2 AZ)

Tokyo (3 AZ)

Sydney (2 AZ)

Page 5: Surviving an Amazon Outage

AWS Availability Zones

[Diagram: two Regions, one containing three Availability Zones and the other containing two]

Page 6: Surviving an Amazon Outage

AWS Services

• Compute - EC2

• Network - Route 53 and Virtual Private Cloud (VPC)

• Content Delivery - CloudFront

• Storage - S3, Glacier, EBS

• Database - DynamoDB, RDS, Redshift, SimpleDB

• Deployment - CloudFormation, Beanstalk, OpsWorks

Page 7: Surviving an Amazon Outage

AWS Size*

• Between 100K and 500K physical servers

• 1.5 million public IP addresses

• S3 holds > 2 trillion objects and serves 1.1 million requests per second

• 1/3 of daily Internet users access a site running on AWS

• 1% of Internet traffic goes through Amazon infrastructure

* Estimates based on various internet sources

Page 8: Surviving an Amazon Outage

Continuent Systems

• External-facing website

• Jira/Confluence internal systems

• Subversion

• Jenkins build system

Page 9: Surviving an Amazon Outage

External Website

[Diagram: Internet, Elastic IP, Web Server and DB Server, all in a single Availability Zone of one Region]

Page 10: Surviving an Amazon Outage

Jira/Confluence/Subversion

[Diagram: Internet, Elastic IP and one App Server running Jira, Confluence, the SVN Server and MySQL, all in a single Availability Zone of one Region]

Page 11: Surviving an Amazon Outage

AWS Problems Summer 2012

“Amazon Cloud Hit by Real Clouds, Downing Netflix, Instagram, Other Sites”

Severe storms caused power outages at AWS US-East data centers; generators failed, taking out 7% of EC2 instances.

http://www.pcworld.com/article/258627/amazon_cloud_hit_by_real_clouds_knocking_out_popular_sites_like_netflix_instagram.html

Page 12: Surviving an Amazon Outage

Migration Plan

• Move to a clustered Continuent Tungsten environment

• Ensure all components are replicated into at least one other AWS Region

• Limited downtime on customer-facing systems

• Minimal downtime on internal systems
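
The second goal in this plan (every component replicated into at least one other AWS Region) lends itself to a simple audit. The sketch below is illustrative only: the inventory, the component names and the region placements are hypothetical and do not describe Continuent's actual deployment.

```python
from collections import defaultdict

# Hypothetical inventory of (component, region) pairs. The component names
# echo the slides; the placements are invented for illustration.
PLACEMENTS = [
    ("website-db", "us-east-1"),
    ("website-db", "us-west-1"),
    ("website-web", "us-east-1"),
    ("website-web", "us-west-1"),
    ("jira", "us-east-1"),
]


def regions_by_component(placements):
    """Group the set of regions each component is deployed in."""
    regions = defaultdict(set)
    for component, region in placements:
        regions[component].add(region)
    return dict(regions)


def single_region_components(placements):
    """Return the components that exist in fewer than two regions, sorted by name."""
    return sorted(
        name
        for name, regions in regions_by_component(placements).items()
        if len(regions) < 2
    )


if __name__ == "__main__":
    # Anything listed here would violate the "at least one other Region" goal.
    print(single_region_components(PLACEMENTS))
```

With the sample inventory, only the hypothetical "jira" entry is flagged, because it is the only component placed in a single region.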

Page 13: Surviving an Amazon Outage

[Diagram: Tungsten data service "nyc" with one Master and two Slaves, three Replicators, three Managers, and two application nodes (App Logic), each with a Tungsten Connector]

Page 14: Surviving an Amazon Outage

[Diagram: Tungsten data service "nyc" with one Master and two Slaves, three Replicators, three Managers, and two application nodes (App Logic), each with a Tungsten Connector; "Monitoring and control" labels added]

Page 17: Surviving an Amazon Outage

Website Database Tier - Round 1

[Diagram: database tier across US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C); labels: S3 Backups (x2), Connectors]

Page 18: Surviving an Amazon Outage

DB Failures - Failure in US-EAST-1C

[Diagram: database tier across US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C), illustrating the loss of US-EAST-1C; labels: S3 Backups (x2), Connectors]

Page 19: Surviving an Amazon Outage

DB Failures - Failure in US-EAST

[Diagram: database tier across US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C), illustrating the loss of the whole US-EAST Region; labels: S3 Backups (x2), Connectors]

Page 20: Surviving an Amazon Outage

DEMO

Page 21: Surviving an Amazon Outage

Website Web Tier - Round 1

[Diagram: web tier across US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C); labels: Internet, EIP, S3 Backups (x2)]

Page 22: Surviving an Amazon Outage

Web Failures - Failure in US-EAST-1C

[Diagram: web tier across US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C), illustrating the loss of US-EAST-1C; labels: Internet, EIP, S3 Backups (x2)]

Page 23: Surviving an Amazon Outage

Web Failures - Failure in US-EAST

[Diagram: web tier across US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C), illustrating the loss of the whole US-EAST Region; labels: Internet, EIP, S3 Backups (x2), DNS Update]
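
The two web-failure slides suggest a two-level response: an Elastic IP can be remapped to another instance within the same Region, whereas losing the whole Region calls for a DNS update towards the other Region. The sketch below encodes only that decision. The function name, the action labels and the health map are hypothetical, and the slides do not say how the switch was actually carried out.

```python
def plan_web_failover(healthy, primary_region="us-east-1", standby_region="us-west-1"):
    """Choose a failover action from a health map.

    `healthy` maps (region, zone) -> bool. Returns (action, target), where
    target is the (region, zone) that should serve traffic, or None.
    The action names are illustrative labels, not AWS API calls.
    """
    def up_zones(region):
        # Sorted so that the choice of target is deterministic.
        return sorted(zone for (reg, zone), ok in healthy.items() if reg == region and ok)

    primary_up = up_zones(primary_region)
    if primary_up:
        # Elastic IPs are scoped to a Region, so remapping only helps here.
        return ("remap-eip", (primary_region, primary_up[0]))

    standby_up = up_zones(standby_region)
    if standby_up:
        # The whole primary Region is down: point DNS at the standby Region.
        return ("dns-update", (standby_region, standby_up[0]))

    return ("no-healthy-target", None)


if __name__ == "__main__":
    zone_failure = {
        ("us-east-1", "1b"): True,
        ("us-east-1", "1c"): False,
        ("us-west-1", "1c"): True,
    }
    print(plan_web_failover(zone_failure))
```

A DNS update is only as fast as the record's TTL allows, which is presumably why the later "Further Plans" slide lists Route 53 DNS failover as something to investigate.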

Page 24: Surviving an Amazon Outage

Jira/Confluence/SVN - Round 1

[Diagram: application tier in US-EAST-1 (Availability Zone 1C) and US-WEST-1 (Availability Zone 1C); labels: Internet, EIP, S3 Backups (x2)]

Page 25: Surviving an Amazon Outage

AWS Failures - Autumn 2012

“Amazon Web Services outage takes out popular websites again”

• EBS degraded performance

• Problems allocating new volumes

http://www.pcworld.com/article/2012852/amazon-web-services-outage-takes-out-popular-websites-again.html

Page 26: Surviving an Amazon Outage

Website Database Tier - Round 2

[Diagram: database tier across US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C), with RackSpace added as a further site; labels: S3 Backups (x2)]

Page 27: Surviving an Amazon Outage

Website Web Tier - Round 2

[Diagram: web tier across US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C), with RackSpace added as a further site; labels: Internet, EIP, S3 Backups (x2)]

Page 28: Surviving an Amazon Outage

Jira/Confluence/SVN - Round 2

[Diagram: application tier in US-EAST-1 (Availability Zone 1C) and US-WEST-1 (Availability Zone 1C), with RackSpace added as a further site; labels: Internet, EIP, S3 Backups (x2)]

Page 29: Surviving an Amazon Outage

Best Practices

• RAID EBS Volumes (RAID1)

• Backups

  • xtrabackup (backed up into S3)

  • EBS Snapshot

ec2-consistent-snapshot \
  --mysql --freeze-filesystem /vol \
  --region eu-west-1 \
  --description "$(hostname) RAID snapshot $(date +'%Y-%m-%d %H:%M:%S')" \
  vol-1f9a6446 vol-649a643d
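
When the snapshot is driven from a scheduled script rather than typed by hand, it can help to build the argument list in one place. The sketch below assembles the same invocation in Python, using only the flags that appear on the slide. The function name and its parameters are hypothetical, and nothing is executed unless the list is handed to a process runner.

```python
import socket
from datetime import datetime


def snapshot_command(volume_ids, region, mount_point, hostname=None, now=None):
    """Build the ec2-consistent-snapshot argument list shown on the slide.

    `hostname` and `now` default to the local host name and the current time;
    they are parameters so that the result can be checked deterministically.
    """
    hostname = hostname or socket.gethostname()
    now = now or datetime.now()
    description = "{} RAID snapshot {}".format(hostname, now.strftime("%Y-%m-%d %H:%M:%S"))
    return [
        "ec2-consistent-snapshot",
        "--mysql",
        "--freeze-filesystem", mount_point,
        "--region", region,
        "--description", description,
        # All members of the RAID set are passed to the same invocation.
        *volume_ids,
    ]


if __name__ == "__main__":
    cmd = snapshot_command(["vol-1f9a6446", "vol-649a643d"], "eu-west-1", "/vol")
    print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would run the tool, assuming it is installed and AWS credentials are configured on the host.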

Page 30: Surviving an Amazon Outage

Best Practices

• Monitoring

• Nagios scripts converted to email alerts

• New Relic
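
One way to read "Nagios scripts converted to email alerts" is that existing check scripts were kept and their results mailed out directly. The sketch below shows that idea under stated assumptions: it relies on the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), while the function name, the check name and the addresses are hypothetical. It only builds the message and does not send it.

```python
from email.message import EmailMessage

# Standard Nagios plugin exit codes.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def build_alert(check_name, exit_code, output, sender, recipient):
    """Return an EmailMessage for a non-OK check result, or None if the check passed."""
    # Any exit code outside 0-3 is treated as UNKNOWN.
    state = NAGIOS_STATES.get(exit_code, "UNKNOWN")
    if state == "OK":
        return None
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "[{}] {}".format(state, check_name)
    # The plugin's output line becomes the body of the alert.
    msg.set_content(output)
    return msg


if __name__ == "__main__":
    alert = build_alert(
        "check_mysql", 2, "MySQL is not responding",
        "monitor@example.com", "ops@example.com",
    )
    print(alert["Subject"])
```

Sending could then be a call such as `smtplib.SMTP("localhost").send_message(alert)`, assuming a local mail relay is available.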

Page 31: Surviving an Amazon Outage

Lessons Learnt

• EC2 Instances fail

• One of anything is never enough

• Don’t assume you can spin up more resources instantly

• Think multi-cloud, public/private

• Resources are disposable - throw away and rebuild if needed

Page 32: Surviving an Amazon Outage

Further Plans

• Real-time replication of web assets (GlusterFS?)

• Introduce an Elastic Load Balancer in front of the US-EAST web servers to allow for automatic web failover

• Migrate into a VPC

• Investigate Route 53 for DNS Failover

Page 33: Surviving an Amazon Outage

We are Recruiting

Come to our booth for more information

Page 34: Surviving an Amazon Outage

Continuent Website: http://www.continuent.com

Tungsten Replicator 2.0: http://code.google.com/p/tungsten-replicator

Our Blogs:

http://scale-out-blog.blogspot.com

http://datacharmer.blogspot.com

http://flyingclusters.blogspot.com

560 S. Winchester Blvd., Suite 500, San Jose, CA 95128

Tel +1 (866) 998-3642, Fax +1 (408) 668-1009, e-mail: [email protected]
