Technical Lessons on how to do Backup and Disaster Recovery in the Cloud
Parmigiano, a Monastery, Love and Faith
Simone Brunozzi Senior Technology Evangelist, Amazon Web Services
@simon
Technical lessons on how to do backup and disaster recovery in the cloud
"The mind is not a vessel to be filled, but a fire to be ignited." - Plutarch
Agenda
I. Prologue: The story of Monte Cassino
II. Lessons: Backup
III. Customer Story: Shaw Media
IV. Earthquake: What happened to my Parmigiano?
V. Lessons: Disaster Recovery
VI. Conclusions: ... And a little surprise!
Prologue
Part I
Abbey of Monte Cassino
Why is Monte Cassino important?
The Treasure of Monte Cassino
• 800 papal documents
• 20,500 volumes in the Old Library
• 60,000 in the New Library
• 200 manuscripts on parchment
• 100,000 prints and paintings (including 11 Titians)
• 500 incunabula
Incunabulum: a book printed before 1501 C.E. (Gutenberg’s Bible was printed in 1455 C.E.)
Titian: one of the most influential painters ever
Business continuity continuum
High availability, backup storage, disaster recovery
• High Availability: Keeping services alive.
• Backing up: The process of copying and archiving data so that it can be used to restore the original after a data loss event.
• Disaster recovery: Recovery of the technology infrastructure critical to an organization after a natural or human-induced disaster.
Origin of Backup
• Monastery: Brilliant, scalable, low-cost, highly durable backup system.
• Origin of Universities (Charlemagne, 814 C.E.): The Empire needs educated people. Let’s ask the Church! Edict: Free education in cathedrals and monasteries. Lots of books (and backups).
• Indoctrination: One of the first critical functions within an organization (the Catholic Church) that needed to continue after any natural or human-induced disaster. It needed backups of books (Bibles, etc.) in order to function.
• Barbarians, pestilences, fires, invasions, wars, famines, revolts, etc.
World War II
Dec 1942: Many “treasures” are transported from Rome and other places to Monte Cassino, for safety
Lost in translation
Intercepted German message: “Ist der Abt noch im Kloster?” (“Is the Abt still in the monastery?”) “Ja.” (“Yes.”)
“Abt” means “Abbot”. It is also the abbreviation of “Abteilung”, which means “Military Division”.
The Treasure of Monte Cassino
Feb 1944: Schlegel and Becker (Panzer-Division Hermann Göring) had the treasures transferred to the Vatican
Escape from Monte Cassino
Lt. Col. Julius Schlegel (an Austrian Roman Catholic)
Capt. Maximilian Becker (a Protestant surgeon)
“Biggest bombing against a single target of all time”
Monte Cassino after bombing (1944)
Restoration in 1954
The Abbey of Monte Cassino today
End of Prologue
Lessons from Monte Cassino
Part II
1. My backup should be accessible
a.k.a. the pain of physical data transfer
• API
• AWS Direct Connect
• AWS Storage Gateway
• AWS Import/Export
• Customer owns the data
• Redundancy
AWS Storage Gateway
[Diagram: Gateway-Stored volumes and Gateway-Cached volumes; “cool” and “cold” storage; connectivity over VPN or Public / AWS Direct Connect; AWS Import/Export]
2. My backup should be able to scale
• “Infinite” scale with Amazon S3 and Amazon Glacier
• Scale to multiple regions
• Seamless
• No need to provision
• Cost tiers (cheaper at scale)
Global AWS Infrastructure (as of Nov 27th, 2012)
• Regions (8), GovCloud Regions (1)
• Availability Zones (23)
• Edge Locations (38): Dallas (2), St. Louis, Miami, Jacksonville, Los Angeles (2), Palo Alto, Seattle, Ashburn (2), Newark, New York (2), Dublin, London, Amsterdam (2), Stockholm, Frankfurt (2), Paris, Singapore (2), Hong Kong, Tokyo, São Paulo, South Bend, San Jose, Osaka, Milan, Sydney, Madrid
3. My backup should be safe
• SSL endpoints (Amazon S3 and Amazon Glacier)
• Signed API calls
• Store encrypted files
• Server-side encryption
• Durability: multiple copies across different data centers
• Local/cloud with AWS Storage Gateway
4. My backup should work with a DR policy
(I don’t want to wait 10 years…)
• Easy to integrate within AWS or Hybrid
• AWS Storage Gateway: Run services on Amazon EC2 (DR)
• Clear costs
• Reduced costs
• I decide redundancy/availability in relation to costs
5. Someone should care about it
• Clear ownership
• Permissions with IAM: users, groups, roles
• Logs
• AWS support
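The "permissions with IAM" point can be illustrated with a small sketch. The snippet below builds a policy document for a hypothetical backup-writer user that may list a bucket and upload objects but may not delete them. The bucket name `example-backup-bucket` and the function name are assumptions made for illustration and do not come from the talk; the action names and the policy layout follow the standard IAM JSON policy format.

```python
import json


def backup_writer_policy(bucket: str) -> dict:
    """Return an IAM policy that allows uploads to `bucket` but denies deletes."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Let the backup job see what is already stored.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {   # Let the backup job upload new objects.
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
            {   # An explicit Deny wins over any Allow granted elsewhere.
                "Effect": "Deny",
                "Action": ["s3:DeleteObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # "example-backup-bucket" is a hypothetical name.
    print(json.dumps(backup_writer_policy("example-backup-bucket"), indent=2))
```

Separating the identity that writes backups from the one that can delete them is one way to give backups a clear owner.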
Lessons from Monte Cassino
1. My backup should be accessible
2. My backup should be able to scale
3. My backup should be safe
4. My backup should work with a DR policy
5. Someone should care about it
A customer story
Part III
Augusto Rosa Manager, Server Operations – Shaw Media
augusto.rosa @ shawmedia.ca
Shaw Media
Who we are
• Shaw Media: Division of Shaw Communications Inc.
• It reaches almost 100% of Canadians; 18 specialty channels
• Global national newscast: 1+ million viewers every weekday
• Access to full episodes: 20 websites, 4 video-on-demand
• It engages with 25+ million Canadians per week
Before AWS
• Data centers in Winnipeg and Toronto
• Challenge to manage, frequent power outages, downtime
• Expensive hosting fees inherited from parent company
• Technology was old and in disarray (total revamp needed)
Mission Impossible?
Mission
• Implement a new CMS
• Empower the editorial team
• Business objectives
• Time frame of 9 months
• Be agile and cost effective
AWS
Amazon EC2, Amazon EMR, Auto Scaling, Elastic Load Balancing, Amazon CloudFront, Amazon RDS, Amazon DynamoDB, Amazon SimpleDB, Amazon ElastiCache, AWS Identity and Access Management, Amazon CloudWatch, AWS Elastic Beanstalk, AWS CloudFormation, Amazon CloudSearch, Amazon SWF, Alexa WIS and Alexa Top Sites
Amazon SQS, Amazon SNS, Amazon SES, AWS Marketplace, Amazon FPS, Amazon DevPay, Amazon Mechanical Turk, Amazon Route 53, Amazon VPC, AWS Direct Connect, Amazon S3, Amazon Glacier, Amazon EBS, AWS Import/Export, AWS Storage Gateway, AWS Support
Phase One
• Fast deployment of servers, network rules, load balancers
• First site under new CMS: Live in 4 weeks from scratch
• Full migration of 29 sites from a physical DC in 9 months
Phase Two
• Full migration of 6 other websites and web services
• From 2nd physical DC into AWS in 2 months
• Migration: Windows ’03/SQL ’05 to Windows ’08/SQL ’08
• Creating new web farms takes 1 to 5 days (versus months)
• It takes longer to procure licenses than the infrastructure
• Ability to scale and automate
Benefits of Using AWS
• Increased uptime from 98.8% to 99.99%
• Scale to success, quicker response to business needs
• $1M+ saved in capital and operational costs
• No physical investment, smaller teams
• Allowed the use of third-party service management companies
• Easy backup on AWS with 3-year retention (tax credits)
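To put the uptime figures in perspective, here is a small worked calculation. It assumes a 365-day year and simply converts each uptime percentage into the downtime it allows; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Hours per year during which the service may be down at this uptime level."""
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR


before = downtime_hours_per_year(98.8)
after = downtime_hours_per_year(99.99)
print(f"98.8%  uptime: {before:.1f} hours of downtime per year")
print(f"99.99% uptime: {after * 60:.0f} minutes of downtime per year")
```

Under these assumptions, 98.8% allows roughly 105 hours of downtime a year, while 99.99% allows about 53 minutes.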
AWS Architecture
Some Numbers
• 50+ EC2 instances (various sizes)
• 25+ TB traffic/month
• 40M+ Route 53 queries
• 10+ TB backup on Amazon S3
... And growing!
Lessons Learned
• Architect with AWS in mind from the start
• Use all Availability Zones in the region you choose to host in; divide resources across all of them
• Plan for failures: Be crazy about it (things fail)
• Backup, backup, backup
• Monthly AMI
• Windows/SQL Server workarounds (failover cluster, AD, etc.)
• Engage with AWS Solutions Architects early
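One simple way to "divide across all" Availability Zones is round-robin placement. The sketch below is only an illustration of that idea, not Shaw Media's actual tooling; the zone and instance names are hypothetical.

```python
from collections import defaultdict
from itertools import cycle


def spread_across_zones(instances, zones):
    """Assign instances to zones round-robin so zone sizes differ by at most one."""
    if not zones:
        raise ValueError("at least one Availability Zone is required")
    placement = defaultdict(list)
    for instance, zone in zip(instances, cycle(zones)):
        placement[zone].append(instance)
    return dict(placement)


if __name__ == "__main__":
    zones = ["zone-a", "zone-b", "zone-c"]            # hypothetical zone names
    web_servers = [f"web-{i}" for i in range(1, 8)]   # hypothetical instance names
    for zone, members in spread_across_zones(web_servers, zones).items():
        print(zone, members)
```

With this placement, losing a single zone takes out at most about one third of the fleet in a three-zone setup.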
Disaster Recovery
• Learn from outages all the time
• Implement changes to prevent failures at cloud level
• Document how you recover from failures
• Single component may fail; architecture shouldn’t
Backup
• Daily snapshots of all volumes, automatically
• VIP volumes: snapshots every 4 hours
• Keep the last 10 snapshots
• Dell Replay: It backs up file system files every hour
• Volumes replicated to Amazon S3 (Oregon) every 2 hours
• SQL Server backup every 30 minutes
• SQL Server backup volumes moved to Amazon S3 every 2 hours
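The "keep the last 10 snapshots" rule can be sketched as a small pruning function. This is a minimal illustration that works on plain (snapshot ID, creation time) pairs; how the snapshots are listed and deleted is left out, and the snapshot IDs and dates below are made up.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshots, keep=10):
    """Given (snapshot_id, created_at) pairs for one volume, return the IDs
    that fall outside the `keep` most recent snapshots, newest first."""
    newest_first = sorted(snapshots, key=lambda s: s[1], reverse=True)
    return [snap_id for snap_id, _ in newest_first[keep:]]


if __name__ == "__main__":
    # Hypothetical data: 14 daily snapshots of a single volume.
    start = datetime(2012, 11, 1)
    daily = [(f"snap-{i:02d}", start + timedelta(days=i)) for i in range(14)]
    print(snapshots_to_delete(daily))  # the four oldest IDs
```

Running the pruning step right after each successful snapshot keeps the retention window constant without a separate cleanup schedule.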
Future
• Move from public cloud to VPC
• Auto Scaling on Amazon EC2
• Amazon S3 as image repository for all sites
• Second cloud vendor as DR (instead of in-house)
• Amazon ElastiCache for central caching for ASP.NET apps
The 2012 Emilia Earthquake
Part IV
May 20th, 2012: Earthquake in Italy
Parmigiano warehouse (0.5B € damage)
“Let’s do something NOW”
Buy 1 Kg of Parmigiano for 1 Euro
Everybody helped
Lessons from an Earthquake
Part V
1. You NEED a DR in place!
2. Testing your DR
3. Reducing costs
4. You can have different DR solutions
1. You NEED a DR in place!
DR with High Availability
App DR with Standby
Business Impact Analysis (RTO, RPO)
• RTO (Recovery Time Objective): 1) time spent trying to fix the problem, 2) the recovery itself, 3) testing, 4) telling users
• RPO (Recovery Point Objective): how much data I can afford to lose
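The two definitions above can be turned into a short worked example. The sketch below treats RTO as the sum of the four phases listed and checks a backup interval against an RPO target, on the assumption that a failure just before the next backup loses up to one full interval of data. All durations are hypothetical and the function names are illustrative.

```python
from datetime import timedelta


def recovery_time(fix_attempt, recovery, testing, notify_users):
    """Total recovery time as the sum of the four phases."""
    return fix_attempt + recovery + testing + notify_users


def meets_rpo(backup_interval, rpo_target):
    """Worst-case data loss equals one backup interval, so the interval
    must not exceed the RPO target."""
    return backup_interval <= rpo_target


if __name__ == "__main__":
    # Hypothetical phase durations, for illustration only.
    estimate = recovery_time(
        fix_attempt=timedelta(minutes=30),
        recovery=timedelta(hours=2),
        testing=timedelta(minutes=45),
        notify_users=timedelta(minutes=15),
    )
    print("Estimated recovery time:", estimate)
    print("30-minute backups meet a 1-hour RPO:",
          meets_rpo(timedelta(minutes=30), timedelta(hours=1)))
```

Comparing the estimate with the RTO the business agreed to shows whether the current setup is good enough or a faster DR architecture is needed.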
Different Types of DR Architecture
1) Backup and Restore
2) “Pilot light” for quick recovery into AWS (cold standby)
3) Warm standby solution on AWS
4) Multi-site hybrid solution (AWS + on premises)
Storage options compared by cost ($/GB/month), performance, and durability
Service               Cost ($/GB/month)
Amazon S3             0.125
Amazon Glacier        0.01
AWS Storage Gateway   0.125 (+ 125/GW)
Amazon EBS            0.10
Amazon EBS (PIOPS)    0.125
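As a rough illustration of what these list prices mean for a backup of the size mentioned in the customer story (10 TB), here is a small calculation. It assumes 1 TB = 1,024 GB, reads "+ 125/GW" as a flat 125 per gateway per month, and ignores request, transfer, and IOPS charges, none of which appear on the slide.

```python
GB_PER_TB = 1024  # assumption: binary terabytes

# $/GB/month as listed on the slide (late-2012 prices).
PRICE_PER_GB = {
    "Amazon S3": 0.125,
    "Amazon Glacier": 0.01,
    "Amazon EBS": 0.10,
    "Amazon EBS (PIOPS)": 0.125,
}
# AWS Storage Gateway: 0.125 per GB, plus an assumed flat fee per gateway.
GATEWAY_PRICE_PER_GB = 0.125
GATEWAY_MONTHLY_FEE = 125.0


def monthly_cost(service: str, terabytes: float, gateways: int = 1) -> float:
    """Storage-only monthly cost in dollars for the given service."""
    gigabytes = terabytes * GB_PER_TB
    if service == "AWS Storage Gateway":
        return gigabytes * GATEWAY_PRICE_PER_GB + gateways * GATEWAY_MONTHLY_FEE
    return gigabytes * PRICE_PER_GB[service]


if __name__ == "__main__":
    for name in list(PRICE_PER_GB) + ["AWS Storage Gateway"]:
        print(f"{name:22s} ${monthly_cost(name, 10):,.2f} per month for 10 TB")
```

The spread between Amazon S3 and Amazon Glacier is what makes a hot/cold split of backups attractive.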
2. Testing your DR
• Dev/test in the cloud is super easy
• Spin up capacity only for the test
• Regularly test your DR
• Cost is minimal
• What about data transfer speed?
    s3cmd ls --recursive s3://datasets.elasticmapreduce/ngrams/books/ \
      | awk '{print $4; sub(/s3:\/\/datasets.elasticmapreduce/, "/array", $4); print $4}' \
      | parallel -j0 -N2 --progress /usr/bin/s3cmd --no-progress get {1} {2}
Special thanks to Craig Carl, AWS Solutions Architect
How the command works:
• s3cmd ls --recursive s3://datasets.elasticmapreduce/ngrams/books/ lists every object under that bucket path
• The awk step prints, for each object, the path to the Amazon S3 object and then the local destination path
• parallel -j0 -N2 --progress runs GNU Parallel with as many jobs as possible; '-N2' tells parallel to take two arguments at a time from stdin and assign them to {1} and {2}
• /usr/bin/s3cmd --no-progress get {1} {2} is the command that GNU Parallel will run; '{1}' is substituted with the Amazon S3 object path, '{2}' is substituted with the local destination path
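For readers who prefer a script to a one-liner, the same pattern (list, derive a local path, download concurrently) can be sketched in Python. This is an illustrative rewrite, not the speaker's method: it still shells out to `s3cmd`, the worker count of 16 is an arbitrary assumption, and the function names are made up. Like the original command, it assumes the local directories already exist.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

BUCKET_PREFIX = "s3://datasets.elasticmapreduce"
LOCAL_ROOT = "/array"


def source_and_destination(ls_line: str):
    """Mirror the awk step: take the 4th whitespace-separated field of an
    `s3cmd ls` line (the object URI) and derive the local path from it."""
    uri = ls_line.split()[3]
    return uri, uri.replace(BUCKET_PREFIX, LOCAL_ROOT, 1)


def s3cmd_get(pair):
    """Download one object by calling s3cmd, as the pipeline does."""
    uri, destination = pair
    subprocess.run(["s3cmd", "--no-progress", "get", uri, destination], check=True)
    return destination


def download_all(ls_lines, fetch=s3cmd_get, workers=16):
    """Run `fetch` over every listed object using a bounded pool of workers."""
    pairs = [source_and_destination(line)
             for line in ls_lines if len(line.split()) >= 4]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, pairs))
```

Passing a different `fetch` function makes it possible to dry-run the path mapping before any data is transferred.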
Result: the time to copy 2.4 TB went down from 48 hours to 9 hours (about 5x faster)
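The numbers above translate into sustained throughput as follows. The calculation assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes); the function name is illustrative.

```python
TB = 10**12  # assumption: decimal terabytes
MB = 10**6


def throughput_mb_per_s(total_bytes: float, hours: float) -> float:
    """Average transfer rate in MB/s for a copy of `total_bytes` taking `hours`."""
    return total_bytes / (hours * 3600) / MB


serial = throughput_mb_per_s(2.4 * TB, 48)
parallel = throughput_mb_per_s(2.4 * TB, 9)
print(f"serial:   {serial:.1f} MB/s")
print(f"parallel: {parallel:.1f} MB/s")
print(f"speedup:  {parallel / serial:.1f}x")
```

Under these assumptions the serial copy averages about 14 MB/s and the parallel copy about 74 MB/s, a speedup of roughly 5.3x.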
3. Reducing costs
1) AWS cost reduction (e.g., S3 cost reduction on Nov 28)
2) Reduced redundancy (Amazon S3)
3) Retention policy
4) Hot/warm/cool/cold backup
5) Reserved capacity/tiers
Amazon S3 pricing by tier
Tier           Standard ($/GB/month)   Reduced Redundancy ($/GB/month)
0–1 TB         0.125                   0.093
1–50 TB        0.110                   0.083
50–500 TB      0.095                   0.073
500–1,000 TB   0.090                   0.063
1–5 PB         0.080                   0.053
5+ PB          0.055                   0.037
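The tier table can be applied with a short calculation. The sketch below makes several assumptions that are not stated on the slide: each slice of data is billed at the rate of the tier it falls into, 1 TB = 1,024 GB, 1 PB = 1,000 TB, and the Standard prices for the upper tiers are 0.095, 0.090, 0.080, and 0.055, which is what the decreasing sequence in the table suggests.

```python
GB_PER_TB = 1024  # assumption: binary units

# (tier upper bound in TB, Standard $/GB, Reduced Redundancy $/GB)
TIERS = [
    (1, 0.125, 0.093),
    (50, 0.110, 0.083),
    (500, 0.095, 0.073),
    (1000, 0.090, 0.063),
    (5000, 0.080, 0.053),
    (float("inf"), 0.055, 0.037),
]


def monthly_storage_cost(terabytes: float, reduced_redundancy: bool = False) -> float:
    """Charge each slice of data at the rate of the tier it falls into."""
    total, lower = 0.0, 0.0
    for upper, standard, reduced in TIERS:
        if terabytes <= lower:
            break
        slice_tb = min(terabytes, upper) - lower
        rate = reduced if reduced_redundancy else standard
        total += slice_tb * GB_PER_TB * rate
        lower = upper
    return total


if __name__ == "__main__":
    print(f"10 TB, Standard:           ${monthly_storage_cost(10):,.2f}")
    print(f"10 TB, Reduced Redundancy: ${monthly_storage_cost(10, True):,.2f}")
```

Under these assumptions, 10 TB costs about $1,141.76 per month in Standard storage and about $860.16 with Reduced Redundancy.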
4. You can have different DR solutions
• Easy to integrate existing vendors with DR on AWS
• Approach: One vendor/hybrid/multiple vendors
• One region/multi-regions (if you need geodiversity)
Lessons from an Earthquake
1. You NEED a DR in place!
2. Testing your DR
3. Reducing costs
4. You can have different DR solutions
Conclusions
Part VI
Backups, Disaster Recovery
Action items
Agility, cost savings, control