6.Data Center Disaster Recovery and Business Continuance


  • 8/13/2019 6.Data Center Disaster Recovery and Business Continuance

    1/99

    2009 Cisco Systems, Inc. All rights reserved. Cisco Public. BRKDCT-2987

    Data Center Disaster Recovery and Business Continuance

    BRKDCT-2987

    Housekeeping

    We value your feedback. Don't forget to complete your online session evaluations after each session, and complete the Overall Conference Evaluation, which will be available online from Thursday

    Visit the World of Solutions

    Please remember this is a 'non-smoking' venue!

    Please switch off your mobile phones

    Please make use of the recycling bins provided

    Please remember to wear your badge at all times

    Business Continuance Drivers

    Cost of application downtime, lost data and productivity

    Regulatory mandates (Homeland Defense, Basel II, HIPAA, GLB, SEC)
        Firms must recover business operations the same business day a disruption occurs
        Out-of-region data center, 200+ miles away
        Mandates backup data centers on separate grids

    Hurricanes

    The Northeast Blackout

    NYC Blizzard of 2003

    Business Continuance Is More Critical than Ever

    75% of IT decision-makers have altered Disaster Recovery/Business Continuance programs as a result of September 11

    Following a disaster, 43% of directly affected businesses do not reopen and 29% fail within 24 months as a result

    Only 15% of Global 2000 enterprises have a full-fledged business continuity plan.

    Disasters: fire, storm, floods, earthquakes, chemical accidents, nuclear accidents, wars

    Sources: Disaster Recovery Journal, Gartner Group

    Agenda

    Introduction to Data Center - The Evolution

    Data Center Disaster Recovery
        Objectives
        Failure Scenarios
        Design Options

    Components of Disaster Recovery
        Site Selection - Front End GSLB
        Server High Availability - Clustering
        Data Replication and Synchronization - SAN Extension

    Sample Design

    The Evolution of Data Centers

    Data Center Evolution

    [Figure: timeline from 1960 through 1980 and 2000 to 2010, plotted against business agility. Compute evolution labels: Mainframes, Terminal, Client/Server, Thin Client: HTTP. Network evolution labels: TCP/IP, Network Optimization, Internet Computing, Content Networking. Networked data center phase labels: Data Center Consolidation, Data Center Distributed, Data Center Continuous Availability, Data Center Networking.]

    1. Consolidation
    2. Integration
    3. Distributed
    4. High Availability

    What Is Involved in a Data Center

    Application solution: Linux/HP, Solaris/SunFire, WebLogic, J2EE custom app, etc.

    Database solution: Linux/HP, Solaris/SunFire, Oracle 10G RAC, etc.

    Storage solution: MDS9000

    Network infrastructure solution: Cisco GSRs, Cisco Catalyst 6500, Cisco Catalyst 4000

    Layer 4-7 services solution: CSM, SSLM, CSS, CE, GSS

    Network security solution: PIX, FWSM, IDSM, VPNSM, CSA

    Management and instrumentation solution: Terminal servers, NAM, CiscoWorks LMS/VMS, HSE

    What Is a Distributed Data Center

    [Figure: a Primary Data Center and a Secondary Data Center, hosting applications (APP A, APP B, APP A, APP C) on FC-attached storage, with data replication between the two sites.]

    Why Distributed Data Centers

    Provide disaster recovery and business continuance

    Avoid a single, concentrated data repository

    High availability of applications and data access

    Load balancing together with performance scalability

    Better response and optimal content routing: proximity to clients

    Front-end IP Access Layer

    Content Routing: site selection

    [Figure: the Primary and Secondary Data Centers (APP A, APP B, APP A, APP C, FC storage), with content routing for site selection at the front-end IP access layer.]

    Application and Database Layer

    Content Switching: Load Balancing

    Server Clustering: High Availability

    [Figure: the Primary and Secondary Data Centers (APP A, APP B, APP A, APP C, FC storage), with the application and database layer highlighted.]

    Backend SAN Extension

    Storage & Optical: Data Mirroring and Replication

    [Figure: the Primary and Secondary Data Centers (APP A, APP B, APP A, APP C, FC storage), with the back-end SAN extension highlighted.]

    Data Center Disaster Recovery

    Disaster Recovery

    Recovery of data and resumption of service: ensuring the business can recover and continue after failure or disaster

    Ability of a business to adapt, change and continue when confronted with various outside impacts

    Mitigating the impact of a disaster

    What It Means for Business

    Business Resilience: Continued Operation of Business During a Failure

    Disaster Recovery: Protecting Data Through Offsite Data Replication and Backup

    Business Continuance: Restoration of Business After a Failure

    Zero Down Time is the ultimate goal

    Disaster Recovery Planning

    Business Impact Analysis (BIA)
        Determines the impact of various disasters on specific business functions and company assets

    Risk Analysis
        Identifies important functions and assets that are critical to the company's operations

    Disaster Recovery Plan (DRP)
        Restores operability of the target systems, applications, or computing facility at the secondary data center after the disaster

    Disaster Recovery Objectives

    Recovery Point Objective (RPO)
        The point in time (prior to the outage) to which systems and data must be restored
        Tolerable loss of data in the event of disaster or failure
        The impact of data loss and the cost associated with the loss

    Recovery Time Objective (RTO)
        The period of time after an outage within which the systems and data must be restored to the predetermined RPO
        The maximum tolerable outage time

    Recovery Access Objective (RAO)
        Time required to reconnect users to the recovered application, regardless of where it is recovered
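
    The three objectives can be checked against the timestamps of a recovery exercise. The sketch below is illustrative only: the function name, its parameters, and the sample times are assumptions, and it assumes the RAO clock starts once service is restored, which the slides do not specify.

```python
from datetime import datetime, timedelta

def meets_objectives(last_good_copy, outage_start, service_restored,
                     users_reconnected, rpo, rto, rao):
    """Compare one recovery exercise against its RPO, RTO, and RAO targets."""
    data_loss_window = outage_start - last_good_copy    # writes after the last copy are lost
    recovery_time = service_restored - outage_start     # time systems and data were unavailable
    access_time = users_reconnected - service_restored  # assumed start point for RAO
    return {
        "rpo_met": data_loss_window <= rpo,
        "rto_met": recovery_time <= rto,
        "rao_met": access_time <= rao,
    }

# Hypothetical exercise: last replica 10 minutes old, service back after 3 hours.
outage = datetime(2010, 1, 1, 12, 0)
result = meets_objectives(
    last_good_copy=outage - timedelta(minutes=10),
    outage_start=outage,
    service_restored=outage + timedelta(hours=3),
    users_reconnected=outage + timedelta(hours=3, minutes=20),
    rpo=timedelta(minutes=15),
    rto=timedelta(hours=2),
    rao=timedelta(minutes=30),
)
print(result)
```

    In this made-up exercise the RPO and RAO targets are met but the RTO is missed, which would point toward a faster recovery method rather than more frequent replication.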

    Recovery Point/Time vs. Cost

    Smaller RPO/RTO: higher $$$, replication, hot standby

    Larger RPO/RTO: lower $$$, tape backup/restore, cold standby

    [Figure: timeline with time t0 (the point to which critical data is recovered), time t1 (disaster strikes), and time t2 (systems recovered and operational). The recovery point spans t0 to t1 on a scale of secs, mins, hours, days, with methods listed as Synchronous Replication, Asynchronous Replication, Periodic Replication, Tape backup. The recovery time spans t1 to t2 on a scale of secs, mins, hours, days, weeks, with methods listed as Extended Cluster, Manual Migration, Tape Restore. Cost increases as either interval shrinks.]

    Network Failures

    ISP failure
        Dual ISP connections
        Multiple ISPs

    Connection failure within the network
        EtherChannel
        Multiple route paths

    [Figure: data center connected to the Internet through Service Provider A and Service Provider B.]

    Device Failures

    Routers, Switches, FWs
        HSRP
        VRRP

    Hosts
        HA cluster

    [Figure: data center connected to the Internet through Service Provider A and Service Provider B.]

    Storage Failures

    Disk arrays
        RAID

    Disk Controllers

    [Figure: data center connected to the Internet through Service Provider A and Service Provider B.]

    Site Failures

    Partial Site Failure
        Application maintenance
        Application migration
        Application scheduled DR exercise

    Complete Site Failure
        Disaster

    [Figure: data center connected to the Internet through Service Provider A and Service Provider B.]

    Cold Standby

    One or more data centers with appropriately configured space, equipped with pre-qualified environmental, electrical, and communication conditioning

    Hardware and software installation, network access, and data restoration all need manual intervention

    Least expensive to implement and maintain

    Substantial delay from standby to full operation

    Disaster Recovery Active/Standby

    [Figure: a Primary Data Center and a Secondary Data Center (Cold Standby), each hosting APP A and APP B on FC-attached storage.]

    Warm Standby

    A data center that is partially equipped with hardware and communications interfaces capable of providing backup operating support

    Latest backups from the production data center must be delivered

    Network access needs to be activated

    Provides better RTO and RPO than cold standby backup

    Disaster Recovery Active/Standby

    [Figure: a Primary Data Center and a Secondary Data Center (Warm Standby), each hosting APP A and APP B on FC-attached storage, connected over an IP/Optical Network.]

    Hot Standby

    A data center that is environmentally ready and has sufficient hardware and software to provide data processing service with little or no downtime

    Hot backup offers disaster recovery with little or no human intervention

    Application data is replicated from the primary site

    A hot backup site provides very good RTO and RPO

    Disaster Recovery Active/Standby

    [Figure: a Primary Data Center and a Secondary Data Center, hosting applications (APP A, APP B, APP A, APP C) on FC-attached storage, connected over an IP/Optical Network.]

    Multiple Tiers of Application

    Presentation Tier

    Application Tier

    Storage Tier

    [Figure: the three tiers connected to the Internet through Service Provider A and Service Provider B.]

    Active/Active Data Centers

    Active/Active Web Hosting

    Active/Active Application Processing

    Active/Standby Database Processing (or Active/Active)

    [Figure: two data centers, each with an internal network, connected to the Internet through Service Provider A and Service Provider B.]

    Site Selection Mechanisms

    Site selection mechanisms depend on the technology or mix of technologies adopted for request routing:

    1. HTTP Redirect

    2. DNS Based

    3. L3 Routing with Route Health Injection (RHI)

    Health of servers and/or applications needs to be taken into account

    Optionally, other metrics (like load) can be measured and utilized for a better selection

    HTTP Redirection: The Idea

    Leveraging the HTTP redirect function: HTTP return code 302

    Proper site selection is made after the initial DNS request has been resolved, via redirection

    Mainly used as a method of providing site persistence while providing local server farm failure recovery

    Can be used with the Location Cookie feature of the CSS to provide redirection after wrong site selection
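
    A minimal sketch of the redirect decision, assuming a site that receives a request knows which site-specific host names are currently healthy. The function name and the health map are hypothetical; the host names are the ones shown on the traffic-flow slide. Falling back to a 503 when no site is healthy is an assumption of this sketch, not something the slides state.

```python
def redirect_response(site_health, preferred_order):
    """Return (status, headers) sending the client to the first healthy site."""
    for host in preferred_order:
        if site_health.get(host, False):
            # 302 tells the browser to re-issue the request to the chosen site,
            # so later requests in the session go straight to that host name.
            return 302, {"Location": f"http://{host}/"}
    return 503, {}  # no healthy site to redirect to (assumed fallback)

sites = ["www1.cisco.com", "www2.cisco.com"]
status, headers = redirect_response(
    {"www1.cisco.com": False, "www2.cisco.com": True}, sites
)
print(status, headers)
```

    Because the client ends up using a site-specific name, it stays on that site for the rest of the session, which is the persistence property the slide refers to.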

    HTTP Redirection: Traffic Flow

    [Figure: a client request to http://www.cisco.com/ is redirected to a site-specific name, http://www1.cisco.com/ or http://www2.cisco.com/.]

    Advantages of the HTTP Redirection Approach

    Can be implemented without any other GSLB devices or mechanisms

    Inherent persistence to the selected location

    Can be used in conjunction with other methods to provide more sophisticated site selection

    DNS-Based Site Selection: The Idea

    The client's D-proxy (local name server) performs iterative queries

    The device that acts as site selector is the authoritative name server for the domain(s) distributed in multiple locations

    The site selector sends keepalives to servers or server load balancers in the local and remote locations

    The site selector selects a site for the name resolution, according to the predefined answers and site load-balancing method

    The user traffic is sent to the selected location
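
    The selection step can be sketched as a small class that tracks keepalive results and answers each name with a healthy address. This is a toy model, not the behavior of any specific GSLB product: the class, its methods, the round-robin balancing method, and the documentation-range addresses are all assumptions for illustration.

```python
import itertools

class SiteSelector:
    """Toy authoritative responder: one VIP per site, round-robin among healthy VIPs."""

    def __init__(self, answers):
        self.answers = answers  # name -> list of predefined VIP answers
        self.alive = {vip: True for vips in answers.values() for vip in vips}
        self._counters = {name: itertools.count() for name in answers}

    def record_keepalive(self, vip, ok):
        # Called with the result of each keepalive probe to a site's VIP.
        self.alive[vip] = ok

    def resolve(self, name):
        healthy = [vip for vip in self.answers[name] if self.alive[vip]]
        if not healthy:
            return None  # no site can serve this name
        return healthy[next(self._counters[name]) % len(healthy)]

selector = SiteSelector({"www.cisco.com": ["192.0.2.10", "198.51.100.10"]})
first = selector.resolve("www.cisco.com")
second = selector.resolve("www.cisco.com")
selector.record_keepalive("192.0.2.10", False)  # data center 1 stops answering
after_failure = selector.resolve("www.cisco.com")
print(first, second, after_failure)
```

    Answers already cached by a D-proxy or client are not affected by a later keepalive failure, which is why the caching limitations listed for this approach matter.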

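    DNS-Based Site Selection: Traffic Flow

    [Figure: ten numbered steps. The client asks its DNS proxy to resolve www.cisco.com. Over UDP port 53, the proxy queries the root name server / authoritative name server for .com, the authoritative name server for cisco.com, and the authoritative name server for www.cisco.com. The client then opens http://www.cisco.com/ over TCP port 80 to Data Center 1 or Data Center 2.]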

    Advantages of the DNS Approach

    Protocol independent: works with any application that uses name resolution

    Minimal configuration changes in the current IP and DNS infrastructure (DNS authoritative server)

    Implementation can be different for specific host names

    A-records can be changed on the fly

    Can take load or data center size into account

    Can provide proximity

    Limitations of the DNS-Based Approach

    Visibility limited to the D-proxy (not the client)

    Cannot guarantee 100% session persistence
        DNS caching in the D-proxy
        DNS caching in the client application

    Order of multiple A-record answers can be altered by D-proxies

    Route Health Injection: The Idea

    Server and application health monitoring is provided by local Server Load Balancers (SLBs)

    An SLB can advertise or withdraw a VIP address to upstream routing devices depending on the availability of the local server farm

    The same VIP addresses can be advertised from multiple data centers (IP Anycast)

    Relies on L3 routing protocols for route propagation and content request routing

    Disaster recovery is provided by network convergence
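
    The advertise/withdraw behavior can be modeled with a toy routing table in which each site injects a host route for the shared VIP while its server farm is healthy, and traffic follows the lowest-cost remaining route. The class, function, site names, costs, and the VIP value are illustrative assumptions; a real deployment relies on the routing protocol itself rather than code like this.

```python
class RoutingTable:
    """Toy upstream router: keeps per-site costs for each advertised VIP."""

    def __init__(self):
        self.routes = {}  # vip -> {site: cost}

    def advertise(self, vip, site, cost):
        self.routes.setdefault(vip, {})[site] = cost

    def withdraw(self, vip, site):
        self.routes.get(vip, {}).pop(site, None)

    def best_site(self, vip):
        candidates = self.routes.get(vip, {})
        return min(candidates, key=candidates.get) if candidates else None

def health_update(table, vip, site, cost, farm_healthy):
    # The load balancer injects the route while the farm is up, removes it otherwise.
    if farm_healthy:
        table.advertise(vip, site, cost)
    else:
        table.withdraw(vip, site)

VIP = "192.0.2.80"  # hypothetical VIP shared by both locations
table = RoutingTable()
health_update(table, VIP, "location_b", cost=10, farm_healthy=True)    # preferred
health_update(table, VIP, "location_a", cost=1000, farm_healthy=True)  # backup
normal = table.best_site(VIP)
health_update(table, VIP, "location_b", cost=10, farm_healthy=False)   # farm B fails
after_failure = table.best_site(VIP)
print(normal, after_failure)
```

    How quickly clients actually move to the backup location depends on routing convergence, which this sketch treats as instantaneous.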

    Route Health Injection: Implementation

    [Figure: Client A and Client B reach VIP x.y.w.z through Routers 10, 11, 12, and 13. Location B is the preferred location for the VIP and Location A is the backup location; the two paths are labeled Low Cost and Very High Cost.]

    Advantages of the RHI Approach

    Supports legacy applications and does not rely on a DNS infrastructure

    Very good re-convergence time, especially in intranets where L3 protocols can be fine-tuned appropriately

    Protocol-independent: works with any application

    Robust protocols and proven features

    Limitations of the RHI Approach

    Relies on host routes (32 bits), which cannot be propagated all over the Internet (more on this later)

    Requires tight integration between the application-aware devices and the L3 routers

    Inability to intelligently load balance among the data centers

    Cluster Overview

    A cluster is two or more servers configured to appear as one

    Two types of clustering: Load Balancing (LB) and High Availability (HA)

    Clustering provides benefits for availability, reliability, scalability, and manageability

    LB clustering: multiple copies of the same application against the same data set, usually read only

    HA clustering: multiple copies of a long-running application that requires access to a common data repository, usually read and write

    [Figure: Web Servers, Application Servers, and Database Servers.]

    HA Cluster Connections

    Public Network (typically Ethernet) for client/application requests

    Servers with the same hardware, OS, and application software

    Private Network (typically Ethernet) for interconnection between nodes. Could be a direct connection, or optionally go through the public network

    Storage Disk (typically Fiber): shared storage array, NAS or SAN

    Typical HA Cluster Components

    Application software that is clustered to provide High Availability. Examples: Microsoft Exchange, SQL, Oracle database, File and Print Services

    Operating System that runs on the server hardware. Examples: Microsoft Windows 2000 or 2003, Linux (and the other flavors of UNIX), IBM VMS or z/OS (for mainframe)

    Cluster Software that provides the HA clustering service for the application. Examples: Microsoft MSCS, EMC AutoStart (Legato), Veritas Cluster Server, HP TruCluster and OpenVMS

    Optionally, Cluster Enabler, software that synchronizes the cluster software with the storage disk array software

    File System Approaches for HA Clusters

    Shared Everything
        Equal access to all storage
        Each node mounts all storage resources
        Provides a single layout reference system for all nodes
        Changes updated in the layout reference

    Shared Nothing
        Traditional file system with peer-to-peer communication
        Each node mounts only its semi-private storage
        Data stored on the peer system's storage is accessed via the peer-to-peer communication
        A failed node's storage needs to be mounted by the peer

    Geo-clusters

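    Geo-cluster: a cluster that spans multiple data centers

    [Figure: node1 in the local data center and node2 in the remote data center, connected over a WAN, with synchronous or asynchronous disk replication between the sites; the diagram is annotated "2 x RTT".]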

    Split-Brain

    Split-brain happens when all of the network communication links between two or more cluster nodes fail.

    Both nodes could potentially go active and concurrently access the disk, thus corrupting data

    [Figure: node1 and node2 both writing to the same disk, resulting in data corruption.]

    Resolution for Split Brain: Quorum

    A quorum device serves as a tiebreaker to arbitrate which system has access to resources.

    The quorum ensures that even if there is no communication between the nodes, only one node can continue to access the disk.

    Only the node that owns the quorum (or a majority of quorum votes) can bring resources online.

    Any resource can be used as the arbitrator to break the tie.

    [Figure: node1 and node2 attached to a quorum resource and to the application data.]
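
    The majority-vote rule can be sketched as follows. The vote assignment (one vote per node plus one for the quorum device) and all names are assumptions chosen for illustration; real cluster software defines its own voting scheme.

```python
def has_quorum(votes_held, total_votes):
    """A partition may run resources only with a strict majority of all votes."""
    return 2 * votes_held > total_votes

def partition_allowed_online(partitions, total_votes):
    """Return the name of the partition holding a majority, or None if none does."""
    for name, votes in partitions.items():
        if has_quorum(sum(votes.values()), total_votes):
            return name
    return None

# Hypothetical two-node cluster with a quorum device as tiebreaker.
votes = {"node1": 1, "node2": 1, "quorum_device": 1}
total = sum(votes.values())

# The heartbeat links fail; node1 wins the race for the quorum device.
split = {
    "side_a": {"node1": 1, "quorum_device": 1},
    "side_b": {"node2": 1},
}
winner = partition_allowed_online(split, total)
print(winner)
```

    Since two disjoint partitions cannot both hold a strict majority, at most one side brings resources online, which is what prevents the concurrent disk access described on the split-brain slide.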

    Extended Layer 2 Network

    In most implementations, a common L2 network is needed for the heartbeat between the nodes, as well as for public client access

    Extending a VLAN on a geographical basis is not considered best practice because of the impact of broadcasts, multicast, flooding, and Spanning Tree integration issues

    [Figure: node1 in the local data center and node2 in the remote data center share a public Layer 2 network and a private Layer 2 network across the WAN, with synchronous or asynchronous disk replication.]

    Resolution: L3 Routed Solution

    In certain cases an L3 routed solution is possible

    Microsoft MSCS
        Requires that the 2 nodes be on the same subnet
        The communication between the 2 nodes is UDP unicast
        Local Area Mobility (LAM) allows the placement of the nodes on 2 different subnets

    Veritas VCS
        Allows having nodes with IP addresses in different subnets
        The Virtual Address needs to change when moving from node1 to node2
        DNS can be used to provide name-to-multiple-IP mapping

    [Figure: node1 (11.20.5.x) and node2 (172.28.210.x) connected by an extended SAN, with synchronous or asynchronous disk replication.]

    Storage Disk Zoning

    Which storage disk array should node2 be zoned to before and after a failure on node1?

    To complete the failover you need to change the zoning configuration

    Software is needed to synchronize the cluster software with the disk array's software, i.e. Cluster Enabler

    [Figure: node1 and node2 (one active, one standby) on an extended SAN with disk arrays sym1320 and sym1291; volumes labeled RW and RD.]

    Resolution: Cluster Enabler

    The Cluster Enabler (CE) provides the interface between the clustering software and the disk array's software

    When the clustering software detects a failure and wants to fail over the node, the Cluster Enabler instructs the disk array to perform a failover

    Cluster Enabler also allows node1 to be zoned to sym1320 and node2 to be zoned to sym1291

    The Cluster Enabler running on each node typically communicates with the Cluster Enabler software running on the remote node with local multicast messages

    [Figure: node1 (active) and node2 (standby) on an extended SAN with disk arrays sym1320 and sym1291; volumes labeled RW and WD.]

    Terminology

    Storage subsystem
        Just a bunch of disks (JBOD)
        Redundant array of independent disks (RAID)

    Storage I/O devices
        Host Bus Adapter (HBA)
        Small Computer System Interface (SCSI)

    Storage protocols
        SCSI
        iSCSI
        FC (FCIP)

    Terminology (Cont'd)

    Direct Attached Storage (DAS)
        Storage is local behind the server
        No storage sharing possible
        Costly to scale; complex to manage

    Network Attached Storage (NAS)
        Storage is accessed at a file level over an IP network
        Storage can be shared between servers

    Storage Area Networks (SAN)
        Storage is accessed at a block level
        Separation of storage from the server
        High-performance interconnect providing high I/O throughput

    Storage for Applications

    Presentation Tier
        Unrelated small data files commonly stored on internal disks
        Manual distribution

    Application Processing Tier
        Transitional, unrelated data
        Small files residing on file systems
        May use RAID to spread data over multiple disks

    Storage Tier
        Large, permanent data files or raw data
        Large batch updates, most likely real time
        Log and data on separate volumes

    Backup and Replication

    Offsite tape vaulting
        Backup tapes stored at offsite location

    Electronic vaulting
        Transmission of backup data to offsite location

    Remote disk replication
        Continuous copying of data to offsite location
        Transparent to host

    Other methods of replication
        Host-based mirroring
        Network-based replication


    Replication: Modes of Operation

    Synchronous
      All data written to cache of local and remote arrays before I/O is complete and acknowledged to host

    Asynchronous
      Write acknowledged after write to local array cache; changes (writes) are replicated to remote array asynchronously

    Semi-synchronous
      Write acknowledged with a single subsequent WRITE command pending from remote array

    Synchronous vs. Asynchronous Trade-Off

    Synchronous
      Impact to application performance
      Distance limited (are both sites within the same threat radius?)
      No data loss

    Asynchronous
      No application performance impact
      Unlimited distance (second site outside threat radius)
      Exposure to possible data loss

    Enterprises must evaluate the trade-offs
      Maximum tolerable distance ascertained by assessing each application
      Cost of data loss
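The distance limit on synchronous replication follows largely from propagation delay, since every write waits for the remote array. A rough sketch of that penalty is below. It assumes about 5 microseconds per km one way in fibre and two round trips per replicated write; both figures are illustrative assumptions, not values from the slides, and real arrays and write-acceleration features will differ.

```python
# Rough estimate of the latency synchronous replication adds to each write.
# Assumptions (not from the slides): ~5 us/km one-way propagation in fibre,
# and two round trips per remote write.
FIBRE_DELAY_US_PER_KM = 5.0


def sync_write_penalty_ms(distance_km: float, round_trips: int = 2) -> float:
    """Added latency per write, in milliseconds, from propagation delay alone."""
    one_way_us = distance_km * FIBRE_DELAY_US_PER_KM
    return round_trips * 2 * one_way_us / 1000.0


for km in (10, 100, 1000):
    print(f"{km:>5} km: +{sync_write_penalty_ms(km):.1f} ms per write")
```

Comparing the result with each application's tolerable write latency is one way to arrive at the "maximum tolerable distance" per application.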


    Data Replication with DB Example

    Control Files identify the other files making up the database and record the content and state of the DB

    Datafiles are only updated periodically

    Redo logs record DB changes resulting from transactions
      Used to play back changes that may not have been written to datafile when failure occurred
      Typically archived as they fill to local and DR site destinations

    Figure: database file types
      Control Files (identify Datafiles and Redo Log Files): DB name, creation date, backup performed, redo log time period, datafile state
      Datafiles: Tablespaces, Indexes, Data Dictionary
      Redo Log Files (record changes to Datafiles): Database changes

    Data Replication with DB Example (Contd)

    Database restored to state at time of failure (time t1) by:

    1. Restoring Control Files and Datafiles from the last Hot Backup (time t0)
    2. Sequentially replaying changes from subsequent Redo Logs (archived and online), i.e. changes made between time t0 and t1

    Figure: timeline
      t0: Hot Backup of Datafiles and Control Files taken
      t0 to t1: Archived Redo Logs, then Online Redo Logs
      t1: Failure or disaster occurs, e.g. media failure (disk), human error (datafile deletion), database corruption
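The two recovery steps can be illustrated with a toy model: restore the backup taken at t0, then replay redo records in time order up to t1. The record layout `(timestamp, key, value)` and the `recover` function are hypothetical and exist only for this sketch; they are not how any particular database stores redo.

```python
# Toy model of the recovery procedure: restore the hot backup taken at t0,
# then replay redo records in order up to the failure time t1.
# The record format (timestamp, key, value) is hypothetical.
from typing import Dict, List, Tuple

RedoRecord = Tuple[int, str, str]


def recover(backup: Dict[str, str], redo: List[RedoRecord],
            t0: int, t1: int) -> Dict[str, str]:
    db = dict(backup)                    # step 1: restore from the hot backup
    for ts, key, value in sorted(redo):  # step 2: replay sequentially by time
        if t0 < ts <= t1:
            db[key] = value
    return db


backup_t0 = {"balance": "100"}
redo_logs = [(3, "balance", "80"), (1, "balance", "120"), (2, "owner", "alice")]
print(recover(backup_t0, redo_logs, t0=0, t1=3))
```

Choosing an earlier `t1` yields the state at that earlier moment, which is why losing the most recent redo log means losing the most recent transactions.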

    Data Replication with DB Example (Contd)

    Mixture of sync and async replication technologies commonly used
      Usually only redo logs sync replicated to remote site
      Archive logs created from redo log and copied when redo log switches
      Point in time (PiT) copies of datafiles and control files copied periodically (e.g. nightly)

    Figure: Primary Site to Secondary Site over SAN Extension Transport
      Redo Logs (cyclic): copy of every committed transaction, synchronously replicated for zero loss
      Archive Logs: replicated/copied
      Database: point in time copy taken when DB quiescent; database copy at time t0 replicated/copied
      Earlier DB backups


    Data Center Interconnection Options

    Figure: two data centers, each showing the same components
      Internet
      Intrusion Detection
      Stateful Firewalls
      Content Caching
      Server Load Balancing
      High Density Multilayer LAN Switch
      Front-End Application Servers
      Back-End Application Servers
      High Density Multilayer SAN Director
      Enterprise-Class Storage Arrays

    Interconnect options between the data centers
      SONET/SDH
      DWDM/CWDM
      IP/Metro E


    Data Center Transport Options

    Figure: transport options against increasing distance (Data Center, Campus, Metro, Regional, National)

    Optical
      Dark Fiber: Sync
      CWDM: Sync (2 Gbps)
      DWDM: Sync (2 Gbps lambda)
      SONET/SDH: Sync (1 Gbps+ subrate), Async

    IP
      MDS9000 FCIP: Sync (Metro Eth), Async (1 Gbps+)

    Distance limits noted in the figure
      Limited by optics (power budget)
      Limited by BB_Credits
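The BB_Credits limit can be estimated from first principles: to keep a long link full, a port needs enough buffer-to-buffer credits to cover one round trip's worth of frames. The sketch below assumes full-size 2148-byte frames, 8b/10b encoding, and roughly 5 microseconds per km in fibre; it ignores inter-frame idles and smaller frames, so treat the result as a lower bound rather than a vendor sizing rule.

```python
# Estimate of buffer-to-buffer credits needed to keep a Fibre Channel link
# busy over a given distance. Assumptions: full-size frames, 8b/10b encoding,
# ~5 us/km one-way fibre delay, no inter-frame gaps.
import math

FRAME_BYTES = 2148            # maximum-size FC frame including SOF, CRC, EOF
BITS_PER_BYTE_ON_WIRE = 10    # 8b/10b encoding
FIBRE_DELAY_S_PER_KM = 5e-6   # assumed one-way propagation delay


def bb_credits_needed(distance_km: float, line_rate_gbaud: float) -> int:
    """Credits required so the sender never stalls waiting for R_RDY."""
    frame_time_s = FRAME_BYTES * BITS_PER_BYTE_ON_WIRE / (line_rate_gbaud * 1e9)
    round_trip_s = 2 * distance_km * FIBRE_DELAY_S_PER_KM
    return math.ceil(round_trip_s / frame_time_s)


# 1 Gbps FC signals at 1.0625 Gbaud, 2 Gbps FC at 2.125 Gbaud.
for rate in (1.0625, 2.125):
    print(f"{rate} Gbaud, 100 km: {bb_credits_needed(100, rate)} credits")
```

This reproduces the familiar rule of thumb of about one credit per 2 km at 1 Gbps and one per km at 2 Gbps, and shows why doubling the link speed doubles the credits needed for the same distance.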

    Data Center Replication with SAN Extension

    Extend the normal reach of a Fibre Channel fabric

    Replication

    Remote host to target array

    Shared data clusters

    Figure: two FC fabrics joined by a SAN Extension Network, used for replication, shared data cluster, or remote host access to storage


    SAN Design for Data Replication

    Servers with two Fibre Channel connections to storage arrays for high availability
      Use of multipath software is required in dual fabric host design

    SAN extension fabrics typically separate from host access fabrics
      Replication fabric requirements generally specified by array vendor

    Figure: Site A and Site B, each with FC server access fabrics and separate replication fabrics, joined by a DC Interconnect Network


    Data Center Disaster Recovery

    Sample Design


    Disaster Impact Radius

    Disasters are characterized by their impact
      Local, metro, regional, global
      Fire, flood, earthquake, attack

    Is the backup site within the threat radius?

    Figure: impact radii around the Primary Data Center, with a Secondary Data Center and a DR Site
      Local: 1-2 km
      Metro: < 50 km
      Regional: < 400 km
      Global
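The threat-radius question can be written as a simple check. The sketch below uses the radii from the figure (2 km, 50 km, 400 km) as hard cut-offs, which is a simplifying assumption; the function name and the dictionary are made up for this illustration.

```python
# Upper bound of each impact scope in km, taken from the slide's figure.
# Treating them as hard cut-offs is a simplifying assumption.
SCOPE_RADIUS_KM = {"local": 2, "metro": 50, "regional": 400}


def outside_threat_radius(site_distance_km: float, disaster_scope: str) -> bool:
    """True if a backup site at this distance lies beyond the disaster's reach."""
    if disaster_scope == "global":
        return False  # no distance escapes a global event
    return site_distance_km >= SCOPE_RADIUS_KM[disaster_scope]


for scope in ("local", "metro", "regional", "global"):
    print(scope, outside_threat_radius(120, scope))
```

A site 120 km away survives local and metro events but not a regional one, which is the usual argument for adding a third, out-of-region DR site.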


    Active/Standby Architecture - Today

    Figure labels
      Sites: CA High Availability Site 1, CA High Availability Site 2, NC Disaster Recovery Site
      Per site: Hosts 1/2/3, MDS 9509s, MDS 9509 Gateway, Storage 1/2/3
      HA Cluster(s)
      Bunker
      Synch CWDM Replication
      Synch FCIP Replication
      Asynchronous FCIP Replication
      Dual OC12
      Electronic Journaling


    Frame Based Replication

    Figure: Data Center 1 (Production Cluster, PROD) and Data Center 2 (D/R) linked by dual OC12
      Each site: MDS, EMC/DMX, Redo and Arch volumes
      Replication: SRDF, SRDF/A (R1, R2)
      Local copies: TimeFinder, BCV, PiT
      Triple Threat


    Active/Active Architecture - Tomorrow

    Figure labels: request handling
      User
      GSS performs site (DC) selection according to pre-configured condition, using FQDN
      SSLM decrypts request
      CSM routes request
      CSM probes track application health
      ACNS caches pages (Content Engine)
      Requests directed to primary application
      Requests directed to backup application

    Figure labels: layers
      Service Locator Group
      Presentation Layer
      Data Centers

    Figure labels: back end
      DC1: clustered backend, X active, Y standby
      DC2: clustered backend, Y active, X standby
      Active Data X, Active Data Y (mirror)
      Standby Data X, Standby Data Y
      Asynchronous replication


    SANTap and Continuous Data Protection

    SANTap
      Appliance-based storage replication
      Reliable copy of WRITE operations
      SCSI-FCIP communication

    Continuous Data Protection
      Automatic and continuous backups
      Time Addressable Storage (TAS)
      Any point-in-time recovery
      Application based or network based

    Figure: Production Servers with Primary and Secondary storage on an MDS SAN, and a CDP Appliance attached through SANTap
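Any point-in-time recovery depends on keeping every WRITE with its timestamp rather than only the latest block contents. A minimal sketch of such a time-addressable journal follows; the `WriteJournal` class and its methods are hypothetical and say nothing about how a real CDP appliance stores data.

```python
# Minimal model of time-addressable storage: every write is journaled with
# its timestamp, so the volume image at any past time can be rebuilt.
# WriteJournal is a hypothetical class for illustration only.
from bisect import bisect_right
from typing import Dict, List, Tuple


class WriteJournal:
    def __init__(self) -> None:
        self._times: List[float] = []
        self._writes: List[Tuple[int, bytes]] = []

    def record(self, ts: float, block: int, data: bytes) -> None:
        """Append one copied WRITE; writes must arrive in time order."""
        if self._times and ts < self._times[-1]:
            raise ValueError("writes must be recorded in time order")
        self._times.append(ts)
        self._writes.append((block, data))

    def image_at(self, ts: float) -> Dict[int, bytes]:
        """Rebuild block contents as of time ts by replaying earlier writes."""
        upto = bisect_right(self._times, ts)
        image: Dict[int, bytes] = {}
        for block, data in self._writes[:upto]:
            image[block] = data
        return image


journal = WriteJournal()
journal.record(1.0, 0, b"a")
journal.record(2.0, 1, b"b")
journal.record(3.0, 0, b"c")
print(journal.image_at(2.5))
```

Unlike a nightly point-in-time copy, the recovery point here is chosen after the fact, for example the instant just before a corruption was written.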


    Fabric Based Replication with CDP

    Figure: Data Center 1 (Production Cluster, PROD) and Data Center 2 (D/R) linked by dual OC12
      Each site: MDS with SANTap, Replication/CDP Appliance, TAS/SATA holding APiT copies
      Arrays: EMC/DMX with Redo and Arch volumes, BCV
      Array replication: SRDF/A


    End-End Data Center Resilience

    Figure: Primary Location (DC-1, DC-2) and Secondary Location (DC-3) connected over an IP/Optical Network
      Corp. DNS, GSS-1, GSS-2
      CSS-1, CSS-2, CSS-3
      Web/App server farm
      DB
      FC SANs linked by CWDM/DWDM


    Design Details

    Data centers 1 and 2 are in the primary location, close enough to each other to provide DC HA for active/active access

    Data Center 3 (DR) is farther than the tolerable disaster radius away from primary DCs 1 and 2

    Web/App server farms are load balanced geographically

    DB servers are within a geo-HA cluster and running in an L3 design

    Synchronous data replication between data centers within the primary location

    Asynchronous data replication is done between the primary and secondary storage systems


    Business Continuity Planning


    BCP Concept: Two Tiers

    BCP Management Tier
      Issues BCP policy
      Champions the Process
      Executes the Plan

    BCP Process Tier
      Develops and maintains the Plan
      Consists of stages

    Figure: BCP Lifecycle, showing BCP Management and BCP Process


    Management and Process Activities

    BCP Management
      Create Business Continuity Policy
      Establish BCP Steering Committee
      Establish BC Plan Development Project
      Establish BCP Training and Awareness Program
      Coordinate BCP with Pertinent Laws, Regulations, and Industry Standards
      Coordinate with Other Internal / External BCP Related Agencies

    BCP Process
      Plan Development Project
        Risk Management
        Business Impact Analysis
        BC Strategy Development
        BC Plan Development
        BC Plan Testing
      Maintain Disaster Readiness Project
        BC Plan Maintenance and Regular Testing
      Execute BC Plan


    Example: Cisco's Corporate Program

    BCP Concept is a model
      Commonly adapted (tailored) to a specific organization's needs

    Figure labels
      BCM Program Initiation
      Assessment
      Business Continuity Plan Development
      Testing
      BCM Training
      Embedding BCM


    BCP Deliverables

    Plan Development Project
      Risk Management
      Business Impact Analysis
      BC Strategy Development
      BC Plan Development
      BC Plan Testing

    Deliverables
      Risk and Controls: Threats, Exposures, Risk Levels, and Risk Controls
      Business Impacts: Critical Processes, Operational and Financial Impacts, and Recovery Requirements
      Continuity Strategy: Alternative Critical Resources and Services, and Recovery Methods


    Business Impact Analysis


    References

    www.drj.com

    www.drii.org

    www.contingencyplanning.org

    www.thebci.org


    Complete Your Session Evaluation

    Please give us your feedback!

    Complete the evaluation form you were given when you entered the room

    This is session BRKDCT-2987

    Don't forget to complete the overall event evaluation form included in your registration kit

    YOUR FEEDBACK IS VERY IMPORTANT FOR US! THANKS
