Felix Candelario
Global Financial Services Solutions Architect
“Disaster Recovery and Business Continuity”
Agenda
• AWS Disaster Recovery Concepts & Terminology
• Architecting for Recovery & Resiliency
• Disaster Recovery Testing & Assurance
• Architecting for the Cloud
“Everything fails, all the time”
- Werner Vogels
(CTO, Amazon.com)
Concepts & Terminology
DR Terminology Map
Your Data Centers -> AWS
• Load Balancers -> ELB/Appliance
• Web/App Servers -> EC2/Auto Scaling
• DNS -> Route 53
• Database Servers -> Amazon RDS
• Firewall -> Security Groups / ACL
• Data Centers -> Availability Zones / VPC
• Geographical Redundancy -> Multi-region
What is an AWS Region?
• A geographic location that contains a cluster of
Availability Zones in a given metropolitan area
• Each region is completely isolated and
independent from other regions
• Each region consists of 2 or more AZs to support
high availability (HA) through AZ independence
Highly Reliable Global Footprint
• Over 1 million active
customers per month across
190 countries
• 2,300 government agencies
• 7,000 educational
institutions
• 35 availability zones + 9
more coming soon
• 59 edge locations
13+ worldwide regions
What are Availability Zones?
• Groupings of one or more data centers that are
physically isolated.
• AZs are connected to each other over low-
latency links within the same region
• Using 2 or more AZs within a region can provide
support for capabilities such as synchronous
database replication and better pricing when
using Amazon EC2 Spot instances
Availability Zones are Notated as Letters
35 Availability Zones (AZs)
• Example
• US East 1 (Northern VA)
– us-east-1a
– us-east-1b
– us-east-1c
– us-east-1d
– us-east-1e
[Diagram: US-EAST-1 containing Availability Zones A, B, C, D and E]
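The naming scheme above (region code plus a letter) can be sketched in code. This is a minimal illustration, not an AWS API: `parse_az` and `zones_by_region` are hypothetical helper names, and the pattern only covers names shaped like the slide's examples.

```python
import re
from typing import NamedTuple


class AvailabilityZone(NamedTuple):
    region: str  # e.g. "us-east-1"
    letter: str  # e.g. "a"


# Region code followed by one lowercase letter, as in "us-east-1a".
# Assumption: only names shaped like the slide's examples are handled.
_AZ_PATTERN = re.compile(r"^([a-z]+(?:-[a-z]+)+-\d+)([a-z])$")


def parse_az(name: str) -> AvailabilityZone:
    """Split an AZ name such as 'us-east-1a' into region and letter."""
    match = _AZ_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not an Availability Zone name: {name!r}")
    return AvailabilityZone(region=match.group(1), letter=match.group(2))


def zones_by_region(names):
    """Group AZ letters under their region code, keeping input order."""
    grouped = {}
    for name in names:
        az = parse_az(name)
        grouped.setdefault(az.region, []).append(az.letter)
    return grouped
```

Grouping the names this way makes it easy to check that a deployment really spans 2 or more AZs within one region.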
What is an Amazon VPC?
• Virtual isolated network that you define in which you can
launch AWS resources such as Amazon EC2 instances
• Complete control of your virtual networking environment,
such as:
– Set your own IP address ranges
– Create subnets
– Configure routing tables and network gateways
• Allows extension of your corporate network to the AWS
Cloud
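Choosing an IP address range and carving it into subnets can be tried out offline with Python's standard `ipaddress` module. The sketch below is only a planning aid under assumed values: the `10.0.0.0/16` range, the `/24` subnet size and the `plan_subnets` helper are illustrative, and nothing here calls AWS.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, zones: list[str], new_prefix: int) -> dict[str, str]:
    """Assign one equally sized subnet CIDR to each Availability Zone."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = vpc.subnets(new_prefix=new_prefix)  # generator of child networks
    plan = {}
    for zone in zones:
        try:
            plan[zone] = str(next(subnets))
        except StopIteration:
            raise ValueError("VPC range is too small for that many subnets") from None
    return plan


# Assumed example values: a /16 VPC split into one /24 subnet per AZ.
plan = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b"], new_prefix=24)
print(plan)
```

Placing one subnet in each Availability Zone is what lets the later DR patterns spread servers across AZs.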
VPC Pattern Diagram - Example
• Development Amazon VPC
• Integration Amazon VPC
• Pre-production Amazon VPC
• Production Amazon VPC
Putting It All Together
What Compute Services are available?
• Amazon EC2: Elastic virtual servers in the cloud
• Elastic Load Balancing: Dynamic traffic distribution
• Auto Scaling: Automated scaling of EC2 capacity
What Network Services are available?
• Amazon VPC: Private, isolated section of the AWS Cloud
• AWS Direct Connect: Private connectivity between AWS and your
datacenter
• Amazon Route 53: Domain Name System (DNS) web service
Architecting for Recovery & Resiliency
• Resiliency: Reducing likelihood of service failure
• Backup: Maintaining data integrity
• Disaster Recovery: Recovery after loss of availability
It’s not all or nothing. Choose a strategy that
fits the business objective.
• Recovery point: the data loss, measured back from the disaster
• Recovery time: the down time, measured forward from the disaster
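The two windows around a disaster can be computed directly from three timestamps. The following is a small sketch, assuming you record when the last good copy of the data was taken, when the disaster hit and when service was restored; the function names are illustrative, not part of any AWS tooling.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_good_copy: datetime, disaster: datetime, restored: datetime):
    """Return (data_loss, down_time) for one incident as timedeltas."""
    if not last_good_copy <= disaster <= restored:
        raise ValueError("expected last_good_copy <= disaster <= restored")
    data_loss = disaster - last_good_copy  # window behind the disaster (recovery point)
    down_time = restored - disaster        # window after the disaster (recovery time)
    return data_loss, down_time


def meets_objectives(data_loss: timedelta, down_time: timedelta,
                     rpo: timedelta, rto: timedelta) -> bool:
    """True when both measured windows fit inside the stated objectives."""
    return data_loss <= rpo and down_time <= rto


# Hypothetical incident: backup at 02:00, disaster at 09:30, restored at 12:30.
loss, down = recovery_metrics(
    datetime(2016, 1, 1, 2, 0),
    datetime(2016, 1, 1, 9, 30),
    datetime(2016, 1, 1, 12, 30),
)
```

With these made-up times the incident loses 7.5 hours of data and causes 3 hours of down time, which would fit a 12-hour RPO and a 4-hour RTO.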
Ascending levels of DR options
• Backup & Restore: Backup of on-premises data to AWS to use in a DR
event
– RPO: 24 hours
– aRTO: 24 hours
– COST: $
• Pilot Light: Replicate data and minimal running services into AWS,
ready to take over and flare up
– RPO: 12 hours
– aRTO: 4 hours
– COST: $$
• Warm Standby: Replicate data and services into AWS ready to take
over
– RPO: 1-4 hours
– aRTO: 15 min
– COST: $$$
• Hot-Site: Replicated and load balanced environments that are both
actively taking production traffic
– RPO: <15 min
– aRTO: 0-5 min
– COST: $$$
Scale: business continuity begins at Backup & Restore; un-interrupted
business continuity at Hot-Site
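Picking the lowest level that still meets a business objective is a simple lookup over the figures on this slide. The sketch below makes some assumptions: ranges are read at their worst case (1-4 hours as 4 hours, <15 min as 15 min, 0-5 min as 5 min), the slide order is treated as ascending cost, and `cheapest_fit` is a hypothetical helper name.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rpo: timedelta  # worst-case data loss, taken from the slide
    rto: timedelta  # worst-case down time, taken from the slide
    cost: str       # relative cost marker as shown on the slide


# Listed in the slide's ascending order; ranges use their upper bound.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=24), timedelta(hours=24), "$"),
    Strategy("Pilot Light", timedelta(hours=12), timedelta(hours=4), "$$"),
    Strategy("Warm Standby", timedelta(hours=4), timedelta(minutes=15), "$$$"),
    Strategy("Hot-Site", timedelta(minutes=15), timedelta(minutes=5), "$$$"),
]


def cheapest_fit(required_rpo: timedelta, required_rto: timedelta) -> Optional[Strategy]:
    """Return the first (lowest) level whose RPO and RTO both fit, else None."""
    for strategy in STRATEGIES:
        if strategy.rpo <= required_rpo and strategy.rto <= required_rto:
            return strategy
    return None
```

For example, a requirement of 12 hours of data loss and 4 hours of down time lands on Pilot Light, while anything tighter than the Hot-Site figures returns `None`.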
Backup and Restore Architecture
[Diagram: Internet traffic for www.example.com reaches the on-premises
active production stack in the corporate data center (web servers, app
servers, DB server). A backup system writes over iSCSI to a Storage
Gateway, and a VPN connection links the data center to the AWS region,
where data lands in an S3 bucket and a Glacier archive for AWS DR
failover.]
Cost: ~$200 / month in US-EAST, plus VPN
• S3 (1TB): $31/month
• Glacier (2TB): $22/month
• Storage Gateway: $125/month
• S3 (1TB) for the 1TB data volume: $31/month
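The monthly line items on the Backup and Restore Architecture slide can be totalled to check the "~$200 / month" headline. The figures below are copied from the slide and are not current AWS pricing; the labels for the two S3 items are my reading of the diagram.

```python
# Monthly line items in USD as shown on the slide (not current pricing).
MONTHLY_COSTS_USD = {
    "S3 bucket (1TB)": 31,
    "Glacier archive (2TB)": 22,
    "Storage Gateway": 125,
    "S3 for 1TB data volume": 31,
}


def monthly_total(costs: dict[str, int]) -> int:
    """Sum the monthly line items."""
    return sum(costs.values())


total = monthly_total(MONTHLY_COSTS_USD)
print(f"Total: ${total}/month before VPN charges")
```

The items add up to $209, which is consistent with the slide's rounded figure; the VPN connection is billed on top.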
Backup and Restore Details
Suitable for:
• Solutions that can sustain higher technical debt
• Lower business critical nature
• Low cost DR option
Leverage existing investments in:
• De-duplication
• Compression
• WAN Acceleration
Pilot light
Pilot light–prep
[Diagram: www.example.com is served from the corporate data center
(reverse proxy/caching server, application server, master database
server). Data mirroring/replication feeds the pilot light system on AWS,
which holds a subordinate database server and data volume; its reverse
proxy/caching server and application server are not running.]
Pilot light–recovery
[Diagram: www.example.com is now served from AWS. The reverse
proxy/caching server and application server start in minutes next to the
database server and data volume, with additional capacity added if
needed. The corporate data center (reverse proxy/caching server,
application server, master database server) is shown alongside.]
Pilot Light Details
Considerations
Suitable for:
• Solutions that need lower RTO & RPO
• Higher business critical nature
• Mid-range cost DR option
Warm standby
Warm standby–prep
[Diagram: Route 53 resolves www.example.com to the active stack in the
corporate data center (reverse proxy/caching server, application server,
master database server). In the AWS region, a scaled-down standby
(Elastic Load Balancer, reverse proxy/caching server, application
server, subordinate database server, data volume) is not active for
production traffic and is kept current by mirroring/replication.
Labelled step: application data source cut over.]
Warm standby–recover
[Diagram: Route 53 resolves www.example.com to the Elastic Load
Balancer in the AWS region, which is now active and scaled up for
production (reverse proxy/caching server, application server, database
server, data volume). The corporate data center (reverse proxy/caching
server, application server, master database server) is shown alongside.]
Hot site
Hot site–prep
[Diagram: Route 53 resolves www.example.com to both sites, and both are
active: the corporate data center (reverse proxy/caching server,
application server, master database server) and the AWS region (Elastic
Load Balancer, reverse proxy/caching server, application server,
subordinate database server, data volume). The AWS side is kept current
by mirroring/replication. Labelled step: application data source cut
over.]
Hot site–recovery
[Diagram: Route 53 resolves www.example.com to the Elastic Load
Balancer in the AWS region, which is active and scaled up for production
use (reverse proxy/caching server, application server, database server,
data volume). The corporate data center (reverse proxy/caching server,
application server, master database server) is shown alongside.]
Warm Standby and Multi-site Details
Considerations
Suitable for:
• Solutions that require RTO & RPO in minutes
• Core business critical functions
• Higher cost DR option
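In the hot-site pattern both environments take production traffic, and recovery amounts to shifting all of it to the surviving site. One way to reason about that is with DNS record weights, where each site receives its weight divided by the sum of all weights (the rule Route 53 weighted routing uses). The sketch below is a simplified offline model with hypothetical site names; it does not call Route 53.

```python
def traffic_shares(weights: dict[str, int]) -> dict[str, float]:
    """Fraction of traffic per site: weight / sum of all weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one site needs a positive weight")
    return {site: weight / total for site, weight in weights.items()}


def fail_over(weights: dict[str, int], failed_site: str) -> dict[str, int]:
    """Return new weights with the failed site's weight set to zero."""
    if failed_site not in weights:
        raise KeyError(failed_site)
    return {site: (0 if site == failed_site else weight)
            for site, weight in weights.items()}


# Hypothetical hot-site setup: both sites take half of the traffic.
normal = {"corporate-data-center": 50, "aws-region": 50}
after_disaster = fail_over(normal, "corporate-data-center")
print(traffic_shares(normal))
print(traffic_shares(after_disaster))
```

A warm standby can be modelled the same way by starting the AWS side at weight 0 and raising it during recovery.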
Disaster Recovery Testing & Assurance
Continuous Testing of Infrastructure
• Continuously and constantly test.
• Regularly execute tests in stable, production &
production-like test environments.
• Infrastructure as Code
• CI/CD Test in Infrastructure Build Pipeline
• Testing of infrastructure during Integration Test
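A recurring DR test in a build pipeline usually boils down to: trigger a failure, poll until the service is healthy again, and fail the build if that takes longer than the RTO. The sketch below shows only the polling part. The `probe` callable, the injected `clock` and `sleep` parameters and the function name are all assumptions made for illustration; a real probe might be an HTTP health check.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    rto_seconds: float,
    interval_seconds: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll `probe` until it returns True and return the elapsed seconds.

    Raises TimeoutError if the service is still unhealthy once the
    RTO budget has been used up.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= rto_seconds:
            raise TimeoutError(f"not healthy within {rto_seconds} s")
        sleep(interval_seconds)
```

Injecting the clock and sleep functions keeps the check itself testable without waiting in real time, which fits the idea of testing infrastructure code like any other code.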
Warm Standby – Testing
[Diagram: Route 53 resolves www.example.com to the active stack in the
corporate data center (reverse proxy/caching server, application server,
master database server). In the AWS region, a scaled-down standby
(Elastic Load Balancer, reverse proxy/caching server, application
server, subordinate database server, data volume) is not active for
production traffic and is kept current by mirroring/replication.
Labelled step: application data source cut over.]
aws rds reboot-db-instance --db-instance-identifier dbInstanceID --force-failover
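To run this failover from a scripted test, the command can be assembled as an argument list rather than a hand-typed string. The sketch below only builds and prints the command; the `force_failover_command` helper is hypothetical, and actually running it would require the AWS CLI, credentials and a real instance identifier.

```python
import shlex


def force_failover_command(db_instance_id: str) -> list[str]:
    """Build the AWS CLI argument list for a forced RDS failover reboot."""
    if not db_instance_id:
        raise ValueError("db_instance_id must not be empty")
    return [
        "aws", "rds", "reboot-db-instance",
        "--db-instance-identifier", db_instance_id,
        "--force-failover",
    ]


# "dbInstanceID" is the placeholder identifier used on the slide.
command = force_failover_command("dbInstanceID")
print(shlex.join(command))
```

The same list could be handed to `subprocess.run` in a pipeline step, ideally against a test instance first.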
Architecting for Cloud
Architecting for Resiliency
Cloud Based Architectures
• High level of control over the environment
• Automate Everything! – Utilise AWS APIs
• Infrastructure as code – CloudFormation
• Parallel environment
• Rolling Update / All at Once
• Blue / Green Deployments
• A significant difference between physical and cloud environments is
the control and visibility the cloud provides
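The difference between a rolling update and an all-at-once deployment comes down to how many servers are replaced per step. A minimal sketch, assuming a plain list of instance identifiers (the `i-1` style names are made up) and hypothetical helper names:

```python
def rolling_batches(instances: list[str], batch_size: int) -> list[list[str]]:
    """Split instances into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [instances[i:i + batch_size]
            for i in range(0, len(instances), batch_size)]


def all_at_once(instances: list[str]) -> list[list[str]]:
    """All-at-once is a rolling update with a single batch."""
    return rolling_batches(instances, max(len(instances), 1))


fleet = ["i-1", "i-2", "i-3", "i-4", "i-5"]
print(rolling_batches(fleet, 2))  # three steps, capacity stays mostly online
print(all_at_once(fleet))         # one step, fastest but riskiest
```

A blue/green deployment avoids batching altogether by building a parallel environment and switching traffic to it, which is where the automation bullets above come in.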
Common thread: Environment automation
Deployment success depends on
mitigating risk for:
• Application issues (functional)
• Application performance
• People/process errors
• Infrastructure failure
• Rollback capability
• Large costs
CloudFormation is the most comprehensive automation platform
• Scope stacks from network to software
• Control higher-level automation services: Elastic Beanstalk, ECS,
OpsWorks, Auto Scaling
[Diagram axis: strength of automation platform]
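As an illustration of scoping a stack at the network level, the sketch below generates a small CloudFormation template (a VPC with one subnet per Availability Zone) as JSON. The CIDR ranges, zone names and the `vpc_template` helper are assumptions for the example; the template is only printed here and has not been validated or deployed against AWS.

```python
import json


def vpc_template(vpc_cidr: str, subnets: dict[str, str]) -> dict:
    """Build a CloudFormation template dict: one VPC plus one subnet per AZ."""
    resources = {
        "Vpc": {
            "Type": "AWS::EC2::VPC",
            "Properties": {"CidrBlock": vpc_cidr},
        }
    }
    # Sorted so the logical IDs (Subnet1, Subnet2, ...) are stable across runs.
    for index, (zone, cidr) in enumerate(sorted(subnets.items()), start=1):
        resources[f"Subnet{index}"] = {
            "Type": "AWS::EC2::Subnet",
            "Properties": {
                "VpcId": {"Ref": "Vpc"},
                "AvailabilityZone": zone,
                "CidrBlock": cidr,
            },
        }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative DR network stack (assumed values)",
        "Resources": resources,
    }


template = vpc_template(
    "10.0.0.0/16",
    {"us-east-1a": "10.0.0.0/24", "us-east-1b": "10.0.1.0/24"},
)
print(json.dumps(template, indent=2))
```

Keeping the template in version control means the same network can be recreated in a recovery region from code rather than from memory.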
Benefits of deployment on AWS
AWS:
• Agile deployments
• Flexible options
• RPO/RTO & Business
Continuity objectives
• Scalable capacity
• Pay for what you use
• Automation capabilities
Enterprise Observations
• Business Enablement
• Art of the Possible
• Legacy Tech Debt
Art of the Possible - State of DevOps 2016
• Frequent Deployments: 200x more frequent deployment
• Faster Recovery: 24x faster recovery from failure
• Lower Failure Rate: 3x lower change failure rate
• Less Unplanned Work: 22% less time spent on unplanned work and
rework
• Shorter Lead Times: 2,555x shorter lead times
Source: Puppet Labs - State of DevOps 2016 Report
Thank You