NEDRIX-Architecting Complex Business Continuity Solutions
Transcript of NEDRIX-Architecting Complex Business Continuity Solutions
2005 Annual Conference, October 24-26, 2005, Newport, RI
2006 Winter Conference, February 16th, 2006
Sheraton Braintree Hotel, Braintree, MA
Architecting Complex Business Continuity Solutions
Gavin Lewellin – Global Practice Manager – Replication and Data Migration
EMC Corporation
Synopsis: This topic will cover the key high-level technical areas that should be considered when evaluating various technology options and strategies for replicating data, including: distance and network quality, data center placement, tiering and ILM, data federation and recovery groups. It also examines the most common best and worst practices that are often utilized when making critical decisions in addition to briefly discussing current trends and strategies.
Presentation Agenda
• A Quick Technology Background
• General Technology Planning Best Practices
• Data Center Placement
• Distance and Network Quality
• Unlocking Latent Demand
• Data Federation and its Impact on Recovery/Restart
• Technology Evaluation Criteria
• Current Trends and Strategies
DR Solution Spectrum
[Figure: spectrum of DR solutions following a DISASTER, running from Recovery through Restart to Running, with STORAGE BASED and DBMS BASED categories. Labels: Physical & Logical Recovery; DATABASE Recovery; Traditional DR; Point-In-Time Copies; Asynchronous; Synchronous; In-System (Local) Replication; Log Shipping; Hot Stand-By; i.e. Oracle Data Guard; Database Replication Techniques (Oracle Streams and Adv Replication); Hybrid Storage and Database Solution.]
Common Data Replication Modes
• Synchronous Replication
• Asynchronous Replication
• Point-In-Time-Copies (PITC)
• Three Data Center Strategies
• Log Shipping
• “Hybrid” Solutions
Synchronous and Asynchronous
Synchronous Replication (Source to Target, Limited Distance)
• No data exposure
• Some performance impact
• Limited distance
Asynchronous Replication (Source to Target, Unlimited Distance)
• Predictable RPO
• No performance impact
• Unlimited distance
Point-In-Time Copies
Point-In-Time Copies (1) (Source to Target, Unlimited Distance)
• Predictable RPO (Zero – Hours)
• Some performance impact
• Unlimited distance
Point-In-Time Copies (2) (Source to Target, Unlimited Distance)
• Predictable RPO (Hours)
• No performance impact
• Unlimited distance
[Figure: Source-to-Target diagrams for both variants; labels: Prod, Bunker.]
Three Data Center Strategies
[Figure: the Primary Site (Source) replicates Sync to the Bunker Site (Target, zero time lag RPO) and Async to the Long Distance Site (Target', seconds-minutes time lag RPO); a further Async link connects the Bunker Site and the Long Distance Site.]
Result:
• Bunker Site and Long Distance Site can be incrementally synchronized to most recent data
• Third link maintains continuous protection
Log Shipping
1. Single database instance Synchronous full one time population
2. Redo logs copied to archive logs
3. Archive logs manually copied to remote site over the network
4. Manually apply the archive logs on the remote site to create a database point of consistency
[Figure: Data Files, Other Data, Redo Logs and Archive Logs at the source; Data Files, Other Data and Archive Logs at the remote site; arrows numbered 1 to 4 correspond to the steps above.]
Log Shipping (No data loss)
1. Single database instance Synchronous full one time population
2. Redo logs remain in sync mode
3. Redo logs copied to archive logs at log switch
4. Archive logs copied to remote site over the network
5. Manually apply the archive logs on the remote site to create a database point of consistency
[Figure: Data Files, Other Data, Redo Logs and Archive Logs at both the source and the remote site; numbered arrows correspond to the steps above.]
“Hybrid” Storage and DBMS Replication
Sybase Mirror Activator addresses the gap created by using either storage replication or transaction replication alone.
– Works in conjunction with storage replication vendors to provide a live standby DBMS with guaranteed transactional integrity
– Extensive testing with EMC including both SRDF/S and SRDF/A (white paper available)
– Works with EMC SRDF, IBM PPRC, Veritas Volume Replicator, NetApp SnapMirror, and Hitachi TrueCopy
[Figure: Primary Site (Web Server, App Server, Data Systems DBMS, File System, Storage) and Disaster Recovery Site (Web Server, App Server, Mirror Activator DBMS, File System, Storage), linked by a Block Replicator.]
Business Continuity Program Lifecycle
[Figure: lifecycle diagram with the phases Plan, Build and Manage, underpinned by Program Management and Integration.]
Start-up & Preparation
A. Project Planning
B. Profile Environment
1. Assess Program/Service Levels
1.a Review Related Information Protection Programs
2. Define Business Requirements
3. Evaluate Availability and Recovery Alternatives
4. Design Infrastructure
5. Conduct Implementation Planning
6. Test and Implement Technologies
7. Develop Recovery / Failover Plans
8. Conduct Recovery Testing
9. Develop / Update Program Definition
10. Manage Resources, Improvements & Measurement
General Architecture Best Practices
• What are the differences between Estimation, Modeling and Simulation?
  – It’s all in the planning... Ask for deliverables with clearly defined caveats, even if it is via a paid engagement.
Design Category: ESTIMATION, MODELING, SIMULATION
• Design Project Cost: Free - $1M
• Time to complete: 2 days – 6 months+
• Accuracy of Analysis: 40% - 90%
Things we hear!
• “Yes, last weekend was my yearly processing peak.”
• “Oh, we maybe do about 5,000 – 15,000 I/O per second.”
• “We are only going to see 20% growth this year.”
• “We have no network in place. Can you tell us if this solution will work?”
• “We are going to completely re-architect the application(s) 1 week before we want this DR project to go-live.”
• “We bought an OC-3 between Singapore and London. Can you make your solution fit?”
• “We have suddenly got $5 million in our budget that will disappear by the end of this week. We’d like to look at DR.”
What is the maximum Synchronous distance?
Best Practices
• Check the mathematics BEFORE you crack out the trowel and start building your second data center.
• Use simulators and models to determine performance impacts.
Worst Practices
• Build second data center too far away and then try to squash workload into distance induced performance problems.
• Determine synchronous performance problems by “sticking it into production and seeing what happens”.
Determining Synchronous Distance
• It depends (of course)
  – Write distribution (skewness)
  – Current response time
  – Projected response time
  – Write access density
[Chart: Writes per Second by Volume. Y-axis: writes/second, 0 to 350. X-axis: individual volumes (CI9003, SC3023, CI9072, SC3021, FU9080, DB3210, ...).]
Example: If the current write response time is 2 ms, the maximum write rate to a single volume is 500 per second. Introducing an additional 2 milliseconds of latency due to synchronous overheads reduces this to 250 writes per second per volume.
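The slide's arithmetic can be checked with a few lines of Python. This is a minimal sketch, assuming writes to a single volume are serialized (one write outstanding at a time), which is what the 2 ms to 500 writes/second example implies. The function names are illustrative and not from the presentation.

```python
def max_writes_per_second(response_time_ms: float) -> float:
    """Upper bound on serialized writes to one volume at a given response time."""
    if response_time_ms <= 0:
        raise ValueError("response time must be positive")
    return 1000.0 / response_time_ms


def sync_write_ceiling(current_ms: float, added_latency_ms: float) -> float:
    """Write ceiling once synchronous replication adds latency to every write."""
    return max_writes_per_second(current_ms + added_latency_ms)


print(max_writes_per_second(2.0))    # 500.0
print(sync_write_ceiling(2.0, 2.0))  # 250.0
```

Comparing the ceiling against the busiest volumes in the writes-per-second chart shows which volumes would be throttled at a proposed distance.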
Synchronous Distance – Case Study
• Financial institution
• Has 3 sites in place
• Requires 4th site as a hub for replication
• Where should the site go?
  – $M’s at risk
  – What is the lowest latency network available?
  – Synchronous required (zero data loss)
  – Guess could be career altering
[Figure: existing sites DC1, DC2 and DC3, with a location marked X.]
Asynchronous: Flow control problem.
• All asynchronous replication techniques use some form of buffering.
• The amount of buffering depends on the data arrival rate and the bandwidth available
• If there is insufficient bandwidth, ultimately, the buffer will over-flow
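The flow control problem can be illustrated with a small simulation. This is a sketch under simplifying assumptions: a single buffer of fixed size, a constant link rate, and a workload sampled at fixed intervals. The sample values are hypothetical, and real replication products manage their buffers in product-specific ways.

```python
def first_overflow(arrival_mb_s, link_mb_s, buffer_mb, interval_s=1.0):
    """Return the index of the first interval in which the backlog exceeds
    the buffer, or None if the buffer never overflows."""
    backlog_mb = 0.0
    for index, arrival in enumerate(arrival_mb_s):
        # Data arrives at the host write rate and drains at the link rate.
        backlog_mb += (arrival - link_mb_s) * interval_s
        backlog_mb = max(backlog_mb, 0.0)  # an empty buffer cannot go negative
        if backlog_mb > buffer_mb:
            return index
    return None


workload = [10, 10, 50, 50, 50, 10]  # hypothetical write rate samples, MB/s
print(first_overflow(workload, link_mb_s=20, buffer_mb=50))   # 3
print(first_overflow(workload, link_mb_s=20, buffer_mb=100))  # None
```

A larger buffer only delays the overflow when the average arrival rate stays above the link rate, so it is no substitute for bandwidth.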
Asynchronous: Best and Worst
Best Practices
• Measure workload (a)
• Measure average transfer size (b)
• Determine flow rate (a * b)
• Divide by compression ratio
• Test link quality
• Understand workload fluctuations
Worst Practices
• Don’t measure anything
• Don’t test anything
• Buy the biggest box you can (or the smallest)
• “Max out the cache” (buy the maximum amount of cache).
• Run it on a T1
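The best-practice sizing steps can be sketched as follows. The workload figures are hypothetical, the unit conversion assumes 1 KB = 1000 bytes, and protocol overhead is ignored, so the result is a lower bound on the link needed. The T1 and OC-3 figures are the standard line rates.

```python
T1_MBIT_S = 1.544    # T1 line rate
OC3_MBIT_S = 155.52  # OC-3 line rate


def required_mbit_s(writes_per_s, avg_transfer_kb, compression_ratio):
    """Flow rate (a * b) divided by the compression ratio, in Mbit/s."""
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    flow_kb_s = writes_per_s * avg_transfer_kb  # a * b
    compressed_kb_s = flow_kb_s / compression_ratio
    return compressed_kb_s * 8 / 1000  # KB/s to Mbit/s


def fits(required, link_mbit_s):
    """True if the link rate covers the required replication rate."""
    return required <= link_mbit_s


# Hypothetical workload: 5,000 writes/s, 8 KB average transfer, 2:1 compression.
need = required_mbit_s(5000, 8, 2.0)
print(need)                    # 160.0
print(fits(need, T1_MBIT_S))   # False
print(fits(need, OC3_MBIT_S))  # False
```

Sizing against the measured peak rather than the average matters here, which is why the list also calls for understanding workload fluctuations.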
Gotchas for Three Data Center Strategies
Best Practices
• Design the Synchronous and Asynchronous legs independently.
• Be able to recover from EITHER the synchronous or asynchronous data center depending on which is the most recent.
Worst Practices
• Make the synchronous mistakes
  – Put the primary and secondary data centers too far apart
  – Find out it doesn’t work after “sticking it into production”
• Make the asynchronous mistakes
  – Don’t measure/test anything
  – Buy the biggest boxes you can
  – Put in maximum cache
  – Run it through a T1
Quality of Networks
• Validating network characteristics is exceptionally critical.
• Bandwidth and Throughput are rarely ever the same value.
• Latency and Physical Circuit Distance are never the same.
• IP networks are especially prone to packet loss and latency.
[Chart: Throughput (y-axis) versus Latency/Packet Loss (x-axis).]
• 1 Millisecond of latency is equal to (best case) 100km circuit distance.
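The 1 ms per 100 km rule of thumb converts directly between circuit distance and a latency budget. This is a minimal sketch that takes the slide's best-case figure at face value; reading it as the latency added to each replicated write is an assumption, and real circuits add equipment and routing delay on top.

```python
MS_PER_100_KM = 1.0  # the slide's best-case rule of thumb


def best_case_latency_ms(circuit_km: float) -> float:
    """Best-case latency for a given circuit (not straight-line) distance."""
    return circuit_km / 100.0 * MS_PER_100_KM


def max_circuit_km(latency_budget_ms: float) -> float:
    """Longest circuit that fits within a latency budget, best case."""
    return latency_budget_ms / MS_PER_100_KM * 100.0


print(best_case_latency_ms(250))  # 2.5
print(max_circuit_km(2.0))        # 200.0
```

Because circuit distance is never the same as physical distance, the carrier's actual route length is the right input, not the map distance between the sites.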
Improving Network Quality
• Set SLAs with bandwidth carriers around throughput, packet loss and latency.
• Examine Network Acceleration/Fast Write capabilities (e.g. NetEx)
• Look at running a “Proof of Concept” to validate network improvements.
[Chart: Throughput (y-axis) versus Latency/Packet Loss (x-axis).]
• Network Acceleration assists with poorly performing networks.
Unlocking Latent Demand
• Switching from Synchronous operations to Asynchronous operations.
• Technology refreshes that enhance application/database performance.
• Assessing latent demand is an exceptionally difficult task.
• You should always periodically measure workloads to assess whether workload changes will impact your DR capability.
[Chart: Network Throughput Required (Normal Week). Y-axis: Megabytes/sec, 0 to 100. X-axis: time of day across the week. Series: Current Workload, Latent Demand.]
Database Federation and Tiering
• How interdependent are your applications and databases?
[Figure: applications and databases grouped by tier. TIER 1: Oracle, Sybase, IMS, z/OS DB2. TIER 2: SAP, Oracle, Oracle. TIER 3: Exchange, SAP.]
• Understanding these dependencies will drive how you alter your tiering and will define new classes of “recovery groups”.
• Changes to your tiering and recovery groups will alter the way you need to think about strategically dispersing your production processing.
  – How federated are my apps?
  – Tightly coupled, loosely coupled?
  – Can “billing” tolerate latency insertion on every transaction?
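One way to make the link between dependencies and recovery groups concrete is to treat systems as nodes and dependencies as edges, then take each connected set as a candidate recovery group. This is only a sketch: the system names, dependency pairs and tier numbers are hypothetical, dependencies are treated as undirected, and assigning each group the strictest tier of its members is an assumption rather than something the presentation prescribes.

```python
from collections import defaultdict


def recovery_groups(systems, dependencies):
    """Group systems that are directly or indirectly dependent on each other."""
    adjacency = defaultdict(set)
    for a, b in dependencies:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    groups = []
    for system in systems:
        if system in seen:
            continue
        # Depth-first walk collects everything reachable from this system.
        group, stack = set(), [system]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


def group_tier(group, tiers):
    """Assumed rule: a group inherits its most critical (lowest numbered) tier."""
    return min(tiers[name] for name in group)


systems = ["billing", "crm", "mail", "erp"]       # hypothetical
dependencies = [("billing", "crm"), ("crm", "erp")]
tiers = {"billing": 1, "crm": 2, "erp": 3, "mail": 3}
for group in recovery_groups(systems, dependencies):
    print(group, "tier", group_tier(group, tiers))
```

In this hypothetical, a Tier 3 system that feeds a Tier 1 system ends up in a Tier 1 recovery group, which is the kind of tiering change the slide warns about.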
Geographic Dispersal of Production Workloads
• How do you best utilize “idle” recovery assets?
• Clearly define your recovery groups and tiers.
  – It’s not an all-or-nothing approach.
• Understand app/database tolerances for running active/active over distance.
• Test segregation of workloads regularly.
• Examine “hybrid” solutions as a way to perform secondary tasks at secondary or tertiary data centers.
• Embed DR Planning in every aspect of change control.
[Figure: Primary, Secondary and Tertiary data centers.]
Technology Evaluation Criteria
• Develop assessment criteria by which you can assign weighting based on your needs.
• Categorize by Solution Architecture Type:
  – DBMS, Log Shipping, Hybrid Solutions, Array based (Sync, Async, PITC), Server, Switch/Network, File, Volume/LUN
• Site Strategies Supported:
  – Single, Dual, Three+, In/Out of Region?
• Disaster Recovery and Operational Recovery/Resumption Point and Time Objectives:
  – Define the difference between High Availability and Disaster Recovery.
• Architecture Scalability:
  – How many hosts? How much storage? What are the performance characteristics?
  – Do you have reference architectures? What is your install base?
  – Assess product/solution maturity.
Technology Evaluation Criteria (cont’d)
• Deployment and Operational Complexity:
  – Is there a migration plan to get me from A to B?
  – How will I manage this infrastructure?
  – Are there any monitoring or automation capabilities?
• Product/Solution Functionality Pros and Cons:
  – Despite a leveling of the playing field, this is still a critical evaluation criterion.
  – RAID Flexibility, Local and Remote Replication capabilities, integration with clustering technologies etc.
• Relative Cost:
  – Storage, Server, Bandwidth, Network.
  – Operational cost (staffing for new assets and business processes)
Technology Evaluation Criteria (cont’d)
• Failover/Failback capabilities:
  – What manual intervention is required?
  – Is this a full re-sync or incremental?
  – How does the solution handle error injection (link drops etc.)?
• Database Federation:
  – Does the solution support interdependent databases/applications to enable cross-platform restart (e.g. CICS on Mainframe coupled with Oracle)?
  – Can I manage multiple Replication Groups/Tiers with this solution?
  – What if my recovery groups change?
• Supported transmission/network protocols:
  – Fibre Channel, FICON, Gig-E, 10GIG, SONET, iSCSI etc.
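Assigning weights to the evaluation criteria can be reduced to a weighted average per candidate solution. The sketch below assumes each criterion is scored on a common scale (1 to 5 here) and that the weights express relative importance. The criterion names, weights, scores and solution labels are all hypothetical.

```python
def weighted_score(weights, scores):
    """Weighted average of criterion scores; every weighted criterion must be scored."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"unscored criteria: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


def rank(weights, candidates):
    """Return candidate names ordered from highest to lowest weighted score."""
    return sorted(candidates,
                  key=lambda name: weighted_score(weights, candidates[name]),
                  reverse=True)


# Hypothetical weights and 1-5 scores for two candidate solutions.
weights = {"scalability": 3, "complexity": 1, "cost": 2}
candidates = {
    "Solution A": {"scalability": 4, "complexity": 2, "cost": 3},
    "Solution B": {"scalability": 5, "complexity": 1, "cost": 1},
}
for name in rank(weights, candidates):
    print(name, round(weighted_score(weights, candidates[name]), 2))
```

Agreeing the weights before scoring any vendor keeps the weighting tied to business needs rather than to a preferred product.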
Evaluation Criteria – RPO/RTO
Business Unit | HA Objectives: ORPO | HA Objectives: ORTO | DR Objectives: DRPO | DR Objectives: DRTO
ABC | No HA | No HA | 12, 8, 4 hrs | 12, 8, 4 hrs
GEH | 0 min | 15, 5, 0 min | 15, 5, 0 min | 12, 2 hrs
DEF | 0 min | 15, 5, 0 min | 60, 30, 0 min | 12, 4, 1 hrs
• Risk Mitigation Requirements (for two or three data centers)
  – At least one Data Center must be “out-of-region”.
  – At least one Data Center must not be in a major metropolitan area.
  – Split GEH production systems between “in-region” Data Centers.
  – Split ABC production systems between “out-of-region” Data Centers.
Evaluation Requirements – Site Solutions
• Site scenarios to be analyzed...
[Table: Two Site and Three Site solutions, listed by Site Dynamic (Site 1: Active; Site 2: Active or Passive; Site 3: Passive, Bunker or n/a) and Replication Consideration (One Way or Bi-Directional).]
Creating a Context
• Different industries require different strategies
[Figure: industries positioned between Low Failsafe and High Failsafe and mapped to Single, Dual and Triple Data Center strategies. Scale labels: 100% Procedural (0% IT Architectural Redundancy); Manual; Transparent Failsafe; 24 hrs x 7 days; Resources; Low security; High Security; Low Volume; High Volume. Industries shown: Non Critical Business, Small Industries; Manufacturing, Retail & Online; Consumer Goods Manufacturing; Transportation, Logistics; Telecommunications, Food Manufacturer; Banks, Financial Services; Essential Services (Utilities, Airlines, Hospital).]
So...What is your organization doing?
[Figure: matrix of Data Center Strategy (Single, Dual, Triple) against Disaster Type (Local Component Level, Entire Data Center, Regional), with an X marking each organization type: Wall Street, Large Insurance, SMB, Large Manufacturing, Health, Largest Banks.]
Technology Trends
• The Move to Automation and Monitoring
  – Automated Restart Operations.
  – Integration with Management Frameworks (Tivoli, OpenView etc).
  – End to end monitoring to pinpoint and diagnose potential problems.
• Three Data Center Strategies
  – Enabling the ability to provide local High Availability and extended distance disaster recovery/restart.
  – Lower cost variants are on the way.
• Geographic Dispersal of Production Workloads
  – Recovery Group management.
  – Frequent cycling of workloads across data centers
    • Enables more realistic DR testing.
  – Impact of Virtualization TBD
Thank You