Elden Christensen Senior Program Manager Lead Microsoft Session Code: SVR319.



Multi-Site Clustering with Windows Server 2008 R2


Session Objectives And Takeaways

Session Objective(s):
- Understanding the need and benefit of multi-site clusters
- What to consider as you plan, design, and deploy your first multi-site cluster

Windows Server Failover Clustering is a great solution for not only high availability, but also disaster recovery


Multi-Site Clustering

Introduction Networking Storage Quorum Workloads

Is my Cluster Resilient to Site Failures?

But what if there is a catastrophic event? Fire, flood, earthquake …

[Diagram: Site A, with the cluster nodes and the SAN in the same physical location]

Multi-Site Clusters for DR

Extends a cluster from being a High Availability solution, to also being a Disaster Recovery solution

- Node is moved to a physically separate site
- Applications are failed over to a separate physical location

[Diagram: Site A and Site B, each with its own SAN]

Benefits of a Multi-Site Cluster

- Protects against loss of an entire datacenter
- Automates failover
  - Reduced downtime
  - Lower complexity disaster recovery plan
- Reduces administrative overhead
  - Automatically synchronize application and cluster changes
  - Easier to keep consistent than standalone servers

The primary reason DR solutions fail is dependence on people


Multi-Site Clustering

Introduction Networking Storage Quorum Workloads

Network Considerations

Network Options:
1. Stretch VLANs across sites
2. Cluster nodes can reside in different subnets

[Diagram: Site A and Site B joined by a Public Network and a Separate Network; node addresses shown: 10.10.10.1, 20.20.20.1, 30.30.30.1, 40.40.40.1]

Stretching the Network

- Longer distance traditionally means greater network latency
- Too many missed health checks can cause false failover
- Heartbeating is fully configurable
  - SameSubnetDelay (default = 1 second): Frequency heartbeats are sent
  - SameSubnetThreshold (default = 5 heartbeats): Missed heartbeats before an interface is considered down
  - CrossSubnetDelay (default = 1 second): Frequency heartbeats are sent to nodes on dissimilar subnets
  - CrossSubnetThreshold (default = 5 heartbeats): Missed heartbeats before an interface is considered down to nodes on dissimilar subnets
- Command Line: Cluster.exe /prop
- PowerShell (R2): Get-Cluster | fl *
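
How the delay and threshold settings combine can be sketched in a few lines of Python. This is a simplified model, not the cluster service's actual logic: it assumes the time before an interface is considered down is roughly delay × threshold. The function names and the relaxed values (2 seconds, 10 heartbeats) are hypothetical illustrations.

```python
def detection_window_seconds(delay_seconds: float, threshold: int) -> float:
    """Approximate time without heartbeats before an interface is considered down.

    Simplified assumption: window = heartbeat delay * missed-heartbeat threshold.
    """
    return delay_seconds * threshold


def survives_outage(outage_seconds: float, delay_seconds: float, threshold: int) -> bool:
    """True if a network blip of this length ends before the window expires."""
    return outage_seconds < detection_window_seconds(delay_seconds, threshold)


# Defaults from the slide: 1 second delay, 5 missed heartbeats.
default_window = detection_window_seconds(1.0, 5)

# Hypothetical relaxed cross-subnet settings for a high-latency WAN link.
relaxed_window = detection_window_seconds(2.0, 10)
```

Under this model, an 8 second WAN blip would cause a false failover with the defaults but not with the relaxed values, at the cost of slower detection of real failures.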

Security over the WAN

- Encrypt intra-node traffic
  - 0 = clear text
  - 1 = signed (default)
  - 2 = encrypted

[Diagram: Site A and Site B; node addresses shown: 10.10.10.1, 20.20.20.1, 30.30.30.1, 40.40.40.1]

Enhanced Dependencies – OR

Network Name resource stays up if either IP Address Resource A OR IP Address Resource B is up

[Diagram: Network Name resource depends on IP Address Resource A OR IP Address Resource B]
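
The difference between an OR dependency and a conventional AND dependency can be shown with a minimal Python sketch. The function names are hypothetical; this only models the rule stated on the slide, not the cluster resource monitor itself.

```python
def name_up_with_or(ip_a_online: bool, ip_b_online: bool) -> bool:
    """OR dependency: the Network Name stays up if either IP address is up."""
    return ip_a_online or ip_b_online


def name_up_with_and(ip_a_online: bool, ip_b_online: bool) -> bool:
    """AND dependency, for contrast: every IP address must be up."""
    return ip_a_online and ip_b_online


# In a multi-subnet cluster only the IP address for the current site can be online.
site_a_active = name_up_with_or(ip_a_online=True, ip_b_online=False)
site_b_active = name_up_with_or(ip_a_online=False, ip_b_online=True)
with_and_dependency = name_up_with_and(ip_a_online=True, ip_b_online=False)
```

With an AND dependency the Network Name could never come online across subnets, because one of the two IP addresses is always offline.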

Client Reconnect Considerations

- Nodes in dissimilar subnets
- Failover changes resource’s IP Address
- Clients need that new IP Address from DNS to reconnect

[Diagram: file server FS moves from 10.10.10.111 (Site A) to 20.20.20.222 (Site B). The DNS record is created as FS = 10.10.10.111, updated to FS = 20.20.20.222, replicated between DNS Server 1 and DNS Server 2, and then obtained by the client]

Solution #1: Configure NN Setting

- RegisterAllProvidersIP (default = 0 for FALSE)
  - Determines if all IP Addresses for a Network Name will be registered by DNS
  - TRUE (1): IP Addresses can be online or offline and will still be registered
  - Ensure application is set to try all IP Addresses, so clients can connect quicker
- HostRecordTTL (default = 1200 seconds)
  - Controls time the DNS record lives on client for a cluster network name
  - Shorter TTL: DNS records for clients updated sooner
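
What "try all IP Addresses" means for a client can be sketched as follows. This is a minimal Python illustration, not any real client's reconnect logic: `first_reachable` and the injected `try_connect` callback are hypothetical names, and the addresses are the ones used in the slides.

```python
from typing import Callable, Iterable, Optional


def first_reachable(addresses: Iterable[str],
                    try_connect: Callable[[str], bool]) -> Optional[str]:
    """Return the first registered address that accepts a connection, or None."""
    for address in addresses:
        if try_connect(address):
            return address
    return None


# With RegisterAllProvidersIP = 1, DNS returns both addresses for the name.
registered = ["10.10.10.111", "20.20.20.222"]

# After a cross-site failover only the Site B address is online.
online_after_failover = {"20.20.20.222"}
chosen = first_reachable(registered, lambda a: a in online_after_failover)
```

A client that walks the whole list reconnects without waiting for a DNS record update; a client that only tries the first address would still have to wait for the record to change and the TTL to expire.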

Solution #2: Prefer Local Failover

- Local failover for higher availability
  - No change in IP Address
- Cross-site failover for disaster recovery

[Diagram: FS = 10.10.10.111 registered on DNS Server 1 and DNS Server 2; 10.10.10.111 at Site A, 20.20.20.222 at Site B]
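
The preference order can be sketched as a small selection function. This is an illustrative Python model only; the node names, the `pick_failover_target` helper, and the data layout are hypothetical, and a real cluster expresses this preference through its own configuration rather than code like this.

```python
from typing import Dict, Optional, Tuple


def pick_failover_target(failed: str,
                         nodes: Dict[str, Tuple[str, bool]]) -> Optional[str]:
    """Pick a surviving node, preferring one in the failed node's own site.

    nodes maps a node name to (site, is_up).
    """
    failed_site = nodes[failed][0]
    candidates = [name for name, (_, up) in nodes.items() if up and name != failed]
    local = [name for name in candidates if nodes[name][0] == failed_site]
    remote = [name for name in candidates if nodes[name][0] != failed_site]
    ordered = local + remote  # local failover first, cross-site only for DR
    return ordered[0] if ordered else None


# One node fails: a local node takes over, so the IP address does not change.
one_node_down = {"A1": ("Site A", False), "A2": ("Site A", True), "B1": ("Site B", True)}
local_target = pick_failover_target("A1", one_node_down)

# The whole site fails: only the cross-site node is left.
site_down = {"A1": ("Site A", False), "A2": ("Site A", False), "B1": ("Site B", True)}
dr_target = pick_failover_target("A1", site_down)
```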

Solution #3: Stretch VLANs

Deploying a VLAN minimizes client reconnection times

[Diagram: Site A and Site B on one stretched VLAN; FS = 10.10.10.111 on DNS Server 1 and DNS Server 2, and 10.10.10.111 at both sites]

Solution #4: Abstraction in Device

- Network device uses 3rd IP
- 3rd IP is the one registered in DNS & used by client
- Example: http://www.cisco.com/en/US/docs/solutions/Enterprise/Data_Center/App_Networking/extmsftw2k8vistacisco.pdf

[Diagram: 10.10.10.111 at Site A and 20.20.20.222 at Site B behind a network device at 30.30.30.30; DNS Server 1 and DNS Server 2 hold FS = 30.30.30.30]


This is generic guidance…

If you have other creative ideas, that’s ok!


Multi-Site Clustering

Introduction Networking Storage Quorum Workloads

Storage in Multi-Site Clusters

Different than local clusters:
- Multiple storage arrays – independent per site
- Nodes commonly access own site storage
- No “true” shared disk visible to all nodes

[Diagram: Site A and Site B, each with its own storage]

Storage Considerations

Need a data replication mechanism between sites

[Diagram: Changes are made on Site A and replicated to the Replica on Site B]

Replication Options

Replication levels:
- Hardware storage-based replication
- Software host-based replication
- Application-based replication

Synchronous Replication

Host receives “write complete” response from the storage after the data is successfully written on both storage devices

[Diagram: Write Request to Primary Storage, Replication to Secondary Storage, Acknowledgement back to Primary Storage, then Write Complete to the host]

Asynchronous Replication

Host receives “write complete” response from the storage after the data is successfully written to the primary storage device

[Diagram: Write Request to Primary Storage, Write Complete to the host, then Replication to Secondary Storage]
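
The two acknowledgement orders can be contrasted with a toy in-memory model. This Python sketch is an illustration under simplifying assumptions (no real I/O, no link failures); the `ReplicatedVolume` class and its method names are hypothetical and do not correspond to any vendor API.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a primary/secondary storage pair."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.primary = []       # blocks written at the primary site
        self.secondary = []     # blocks written at the secondary site
        self.pending = deque()  # blocks acknowledged but not yet replicated

    def write(self, block: str) -> str:
        self.primary.append(block)
        if self.synchronous:
            # Replicate and wait for the acknowledgement before completing.
            self.secondary.append(block)
        else:
            # Complete immediately; replication happens later.
            self.pending.append(block)
        return "write complete"

    def drain(self) -> None:
        """Ship all queued blocks to the secondary (asynchronous catch-up)."""
        while self.pending:
            self.secondary.append(self.pending.popleft())

    def blocks_lost_if_primary_fails(self) -> int:
        return len(self.primary) - len(self.secondary)


sync_volume = ReplicatedVolume(synchronous=True)
sync_volume.write("block-1")

async_volume = ReplicatedVolume(synchronous=False)
async_volume.write("block-1")
exposed_before_drain = async_volume.blocks_lost_if_primary_fails()
async_volume.drain()
```

In the asynchronous case, the window between "write complete" and the drain is where a hard failure of the primary can lose data.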

Synchronous vs. Asynchronous

Synchronous | Asynchronous
No data loss | Potential data loss on hard failures
Requires high bandwidth/low latency connection | Enough bandwidth to keep up with data replication
Stretches over shorter distances | Stretches over longer distances
Write latencies impact application performance | No significant impact on application performance
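
Why write latency matters for synchronous replication can be seen from a back-of-the-envelope estimate. The Python sketch below assumes a simple sequential model (local write, then a round trip to the remote array, then the remote write); real arrays may overlap these steps. The function names and all millisecond figures are hypothetical.

```python
def sync_write_latency_ms(local_write_ms: float,
                          round_trip_ms: float,
                          remote_write_ms: float) -> float:
    """The host waits for the local write, the inter-site round trip, and the remote write."""
    return local_write_ms + round_trip_ms + remote_write_ms


def async_write_latency_ms(local_write_ms: float) -> float:
    """The host only waits for the local write; replication happens afterwards."""
    return local_write_ms


# Hypothetical figures: 1 ms array writes and a 10 ms round trip between sites.
sync_latency = sync_write_latency_ms(1.0, 10.0, 1.0)
async_latency = async_write_latency_ms(1.0)
```

With these assumed numbers every synchronous write takes 12 ms instead of 1 ms, which is why synchronous replication needs a low latency link and stretches over shorter distances.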

Storage Resource Dependencies

[Diagram: a Resource Group containing a Workload Resource (example File Server), a Network Name Resource, IP Address Resources*, a Disk Resource, and a Custom Resource]

- Group determines smallest unit of failover
- Establishes start order timing
- Ensures node is communicating with local storage and array state
- Ensures application comes online after replication is complete
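
How dependencies establish a start order can be sketched with a topological sort. The dependency edges below are an assumption read off the diagram (the workload depends on the network name and the disk, the network name on the IP addresses, the disk on the custom replication resource); the slide does not spell them out. The sketch uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Assumed dependency edges: each resource maps to the resources it depends on.
depends_on = {
    "Workload Resource (File Server)": {"Network Name Resource", "Disk Resource"},
    "Network Name Resource": {"IP Address Resources"},
    "Disk Resource": {"Custom Resource"},
}

# static_order() yields every resource after all of its dependencies.
start_order = list(TopologicalSorter(depends_on).static_order())
```

Whatever order the sorter picks among independent resources, the Custom Resource always comes online before the Disk Resource, and the workload comes online last.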

Cluster Validation and Replication

Multi-Site clusters are not required to pass the Storage tests to be supported

Validation Guide and Policy: http://go.microsoft.com/fwlink/?LinkID=119949

HP’s Multi-Site Implementation & Demo

Matthias Popp, Architect, HP (partner)

HP's Multi-Site Implementation: CLX for Windows

[Diagram: Virtual Machine, VM Config File, and Physical Disk resources on top of an HP CLX resource]

- All Physical Disk resources of one Resource Group (VM) depend on a CLX resource
- Very smooth integration

HP Cluster Extension – What’s new?

- Support for Hyper-V Live Migration across disk arrays
- Support for Windows 2008 R2
- Support for Windows Hyper-V Server 2008 R2
- TT337AAE – HP StorageWorks Cluster Extension EVA for Window e-LTU
- There is no change to current CLX product pricing
- XP Cluster Extension does not yet support Live Migration; planned for 2010

Live Migration with Storage Failover

[Diagram: Host 1 and Host 2, each attached to HP EVA Storage, with storage based remote replication between the arrays]

- Initiate Live Migration
- Create VM on target node
- Copy memory pages from source server to target server via Ethernet
- Check disk array for replication link and disk pair states
- Final state transfer
  - Pause virtual machine
  - Move storage connectivity from source server to target server
  - Change storage replication direction
- Run new VM on target server; Delete VM on source server

HP Storage for Virtualization

Hyper-V Live Migration between Replicated Disk Arrays

- End-user transparent app migration across data centers; across servers and storage
- Zero Downtime Array Load Balancing (IOPS, cache utilization, response times, power consumption, etc.)
- Zero Downtime Maintenance
  - Firmware/HBA/Server updates without user interruption
  - Plan maintenance without the need to check for downtimes
- Follow the sun/moon data center access model
  - Move the app/VM closest to the users or closest to the cheapest power source
- Failover, failback, Quick and Live Migration using the same management software
  - No need to learn x different tools and their limitations

EVA CLX with Exchange 2010 Live Migration (demo)

Hyper-V Geo Cluster with Exchange

[Diagram: a Hyper-V Cluster with HP Cluster Extension spanning two sites over the LAN. Each site has a SAN with an EVA 4400, plus Command View and SCVMM. Virtual Machines: Mailbox server (G:\ OS Disk 30 GB, K:\ Database Disk 100 GB), Hub Transport server (OS Disk 30GB), Client Access server (OS Disk 30GB). The VHDs of all VMs are replicated in DR Group 001, DR Group 002, and DR Group 003; Live Migration runs over the virtual network]

Hyper-V Geo Cluster with Exchange

Automatically re-direct storage replication during Live Migration

[Diagram: the same two-site Hyper-V Cluster with HP Cluster Extension, two EVA 4400 arrays, Command View, SCVMM, and DR Groups 001 to 003, showing the Mailbox server VM (G:\ OS Disk 30 GB, K:\ Database Disk 100 GB) alongside the Hub Transport server and Client Access server VMs (OS Disk 30GB each)]


Additional HP Resources

- HP website for Hyper-V: www.hp.com/go/hyper-v
- HP and Microsoft Frontline Partnership website: www.hp.com/go/microsoft
- HP website for Windows Server 2008 R2: www.hp.com/go/ws2008r2
- HP website for management tools: www.hp.com/go/insight
- HP OS Support Matrix: www.hp.com/go/osssupport
- Information on HP ProLiant Network Adapter Teaming for Hyper-V: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01663264/c01663264.pdf
- Technical overview on HP ProLiant Network Adapter Teaming: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01415139/c01415139.pdf?jumpid=reg_R1002_USEN
- Whitepaper: Disaster Tolerant Virtualization Architecture with HP StorageWorks Cluster Extension and Microsoft Hyper-V™: http://h20195.www2.hp.com/V2/getdocument.aspx?docname=4AA2-6905ENW.pdf


Multi-Site Clustering

Introduction Networking Storage Quorum Workloads

Quorum Overview

- Majority is greater than 50%
- Possible Voters: Nodes (1 each) + 1 Witness (Disk or File Share)
- 4 Quorum Types
  - Disk only (not recommended)
  - Node and Disk majority
  - Node majority
  - Node and File Share majority

[Diagram: five voters, each with one Vote]
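
The "greater than 50%" rule translates directly into integer arithmetic. A minimal Python sketch follows; the function names are hypothetical.

```python
def votes_needed(total_voters: int) -> int:
    """Smallest number of votes that is strictly greater than 50% of the voters."""
    return total_voters // 2 + 1


def has_quorum(votes_present: int, total_voters: int) -> bool:
    return votes_present >= votes_needed(total_voters)


# Five voters (for example 5 nodes, or 4 nodes plus 1 witness) need 3 votes.
needed_for_five = votes_needed(5)

# Four voters also need 3 votes, so a 2/2 split leaves neither side with quorum.
needed_for_four = votes_needed(4)
even_split_has_quorum = has_quorum(2, 4)
```

The even-split case is the reason a witness vote is added when the number of nodes is even.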

Replicated Disk Witness

- A witness is a decision maker when nodes lose network connectivity
- When a witness is not a single decision maker, problems occur
- Do not use in multi-site clusters unless directed by vendor

[Diagram: three node Votes and a witness on Replicated Storage from vendor, marked “?”]

Node Majority

5 Node Cluster: Majority = 3 (Majority in Primary Site)

Cross site network connectivity broken!

- Nodes in the primary site: Can I communicate with majority of the nodes in the cluster? Yes, then Stay Up
- Nodes in the other site: Can I communicate with majority of the nodes in the cluster? No, drop out of Cluster Membership

[Diagram: Site A and Site B, each with its own SAN]

Node Majority

5 Node Cluster: Majority = 3 (Majority in Primary Site)

Disaster at Site 1

- Nodes in the primary site: We are down!
- Nodes in the other site: Can I communicate with majority of the nodes in the cluster? No, drop out of Cluster Membership

Need to force quorum manually

[Diagram: Site A and Site B, each with its own SAN]
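
Both Node Majority scenarios can be worked through numerically. The Python sketch below assumes the 5 nodes are split 3 in the primary site and 2 in the other site, which is consistent with "Majority in Primary Site" but is not stated explicitly on the slides; the names are hypothetical.

```python
def has_quorum(votes_present: int, total_votes: int) -> bool:
    """Majority is strictly greater than 50% of all votes."""
    return votes_present > total_votes / 2


TOTAL_NODES = 5
PRIMARY_SITE_NODES = 3  # assumed split: 3 nodes in the primary site
OTHER_SITE_NODES = 2    # and 2 nodes in the other site

# Scenario 1: cross-site link is broken, both sites are still running.
primary_stays_up = has_quorum(PRIMARY_SITE_NODES, TOTAL_NODES)
other_stays_up = has_quorum(OTHER_SITE_NODES, TOTAL_NODES)

# Scenario 2: disaster at the primary site, only the other site survives.
survivors_have_quorum = has_quorum(OTHER_SITE_NODES, TOTAL_NODES)
needs_forced_quorum = not survivors_have_quorum
```

In the second scenario the two surviving nodes hold only 2 of 5 votes, so the cluster does not recover automatically and quorum has to be forced by an administrator.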

Forcing Quorum

- Always understand why quorum was lost
- Used to bring cluster online without quorum
- Cluster starts in a special “forced” state
- Once majority achieved, no more “forced” state
- Command Line: net start clussvc /fixquorum (or /fq)
- PowerShell (R2): Start-ClusterNode -FixQuorum (or -fq)

Multi-Site With File Share Witness

Complete resiliency and automatic recovery from the loss of any 1 site

[Diagram: Site A and Site B, each with a SAN and Replicated Storage, connected over the WAN; the File Share Witness \\Foo\Cluster1 is at Site C]

Multi-Site With File Share Witness

Complete resiliency and automatic recovery from the loss of connection between sites

- One site: Can I communicate with majority of the nodes (+FSW) in the cluster? Yes, then Stay Up
- Other site: Can I communicate with majority of the nodes in the cluster? No (lock failed), drop out of Cluster Membership

[Diagram: Site A and Site B, each with a SAN and Replicated Storage, connected over the WAN; the File Share Witness \\Foo\Cluster1 is at Site C]
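
The tie-breaking role of the file share witness can be sketched numerically. The Python example below assumes a 4 node cluster with 2 nodes per site plus the witness at a third site (5 votes in total); the node counts and the `site_stays_up` helper are hypothetical illustrations, not values from the slides.

```python
def site_stays_up(site_nodes: int, holds_witness: bool, total_nodes: int) -> bool:
    """A site stays up if its nodes plus the witness vote (if it holds the lock)
    are strictly more than half of all votes (nodes + 1 witness)."""
    total_votes = total_nodes + 1
    votes = site_nodes + (1 if holds_witness else 0)
    return votes > total_votes / 2


TOTAL_NODES = 4  # assumed: 2 nodes in Site A, 2 nodes in Site B

# Connection between sites is lost: only one site can lock the witness.
site_with_lock = site_stays_up(2, holds_witness=True, total_nodes=TOTAL_NODES)
site_without_lock = site_stays_up(2, holds_witness=False, total_nodes=TOTAL_NODES)

# Loss of an entire site: the surviving site still reaches the witness.
survivor_after_site_loss = site_stays_up(2, holds_witness=True, total_nodes=TOTAL_NODES)
```

Because exactly one side can hold the lock on the share, exactly one side reaches 3 of 5 votes, which is what makes the recovery automatic.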

FSW Considerations

- Simple Windows File Server
- Single file server can serve as a witness for multiple clusters
  - Each cluster requires its own share
  - Can be clustered in a second cluster
- Recommended to be at 3rd separate site so that there is no single point of failure
- FSW cannot be on a node in the same cluster

Quorum Model Summary

- No Majority: Disk Only
  - Not Recommended
  - Use as directed by vendor
- Node and Disk Majority
  - Use as directed by vendor
- Node Majority
  - Odd number of nodes
  - More nodes in primary site
- Node and File Share Majority
  - Even number of nodes
  - Best availability solution – FSW in 3rd site
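
The odd/even guidance above can be condensed into a small helper. This Python sketch is a simplification: it covers only the two models the summary ties to node count and ignores the disk-based models, which are to be used as directed by the storage vendor. The function name is hypothetical.

```python
def suggested_quorum_model(node_count: int) -> str:
    """Map node count to the quorum model suggested in the summary."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    if node_count % 2 == 1:
        # Odd number of nodes: the nodes alone can always form a majority.
        return "Node Majority"
    # Even number of nodes: add a file share witness as the tie-breaker vote.
    return "Node and File Share Majority"


five_node_model = suggested_quorum_model(5)
four_node_model = suggested_quorum_model(4)
```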


Multi-Site Clustering

Introduction Networking Storage Quorum Workloads

Hyper-V in a Multi-Site Cluster

Considerations by area:

Network:
- On cross-subnet failover, if guest is …
  - DHCP, then IP updated automatically
  - Statically configured IP, then admin needs to configure new IP
- Use of a VLAN is preferred with live migration between sites

Storage:
- 3rd party replication solution required
- Configuration with CSV (explained next)

Quorum:
- No special considerations

Links: http://technet.microsoft.com/en-us/library/dd197488.aspx

CSV in a Multi-Site Cluster

- Architectural assumptions collide…
  - Replication solutions assume only 1 array accessed at a time
  - CSV assumes all nodes can concurrently access the LUN
- CSV is not required for Live Migration
- Talk to your storage vendor for their support story
- CSV requires VLANs

[Diagram: a VHD on Read/Write storage at the Nodes in Primary Site, replicated to Read/Only storage at the Nodes in Disaster Recovery Site; a VM attempts to access the replica]

SQL in a Multi-Site Cluster

Considerations by area:

Network:
- SQL does not support OR dependency
- Need to stretch VLAN between sites

Storage:
- No special considerations
- 3rd party replication solution required

Quorum:
- No special considerations

Links:
- http://technet.microsoft.com/en-us/library/ms189134.aspx
- http://technet.microsoft.com/en-us/library/ms178128.aspx

Exchange in a Multi-Site Cluster

Considerations by area:

Network:
- No VLAN needed
- Change HostRecordTTL from 20 minutes to 5 minutes
- CCR supports 2 nodes, one per site

Storage:
- Exchange CCR provides application-based replication

Quorum:
- File share witness on the Hub Transport server on primary site

Links:
- http://technet.microsoft.com/en-us/library/bb124721.aspx
- http://technet.microsoft.com/en-us/library/aa998848.aspx

Session Summary

- Multi-Site Failover Clustering has many benefits
- Redundancy is needed everywhere
- Understand your replication needs
- Compare VLANs with multiple subnets
- Plan quorum model & nodes before deployment
- Follow the checklist and best practices

Resources

- Sessions On-Demand & Community: www.microsoft.com/teched
- Resources for IT Professionals: http://microsoft.com/technet
- Resources for Developers: http://microsoft.com/msdn
- Microsoft Certification & Training Resources: www.microsoft.com/learning

Related Content

Breakout Sessions
- SVR208 Gaining Higher Availability with Windows Server 2008 R2 Failover Clustering
- SVR319 Multi-Site Clustering with Windows Server 2008 R2
- DAT312 All You Needed to Know about Microsoft SQL Server 2008 Failover Clustering
- UNC307 Microsoft Exchange Server 2010 High Availability
- SVR211 The Challenges of Building and Managing a Scalable and Highly Available Windows Server 2008 R2 Virtualisation Solution
- SVR314 From Zero to Live Migration. How to Set Up a Live Migration

Demo Sessions
- SVR01-DEMO Free Live Migration and High Availability with Microsoft Hyper-V Server 2008 R2

Hands-on Labs
- UNC12-HOL Microsoft Exchange Server 2010 High Availability and Storage Scenarios

Multi-Site Clustering Content

- Design guide: http://technet.microsoft.com/en-us/library/dd197430.aspx
- Deployment guide/checklist: http://technet.microsoft.com/en-us/library/dd197546.aspx


© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.