Exchange Server 2013 Site Resilience
Scott Schnoll
Agenda
• The Preferred Architecture
• Namespace Planning and Principles
• Datacenter Switchovers and Failovers
• Dynamic Quorum and DAGs
The Preferred Architecture
Site Resilience changes in Exchange 2013
Frontend/Backend recovery are independent
Most protocol access in Exchange Server 2013 is HTTP
• DNS resolves to multiple IP addresses
• HTTP clients have built-in IP failover capabilities
• Clients skip past IPs that produce hard TCP failures
Namespace no longer a single point of failure
• Single or multiple namespace options
• Admins can switch over by removing a VIP from DNS or disabling it
• No more dealing with DNS latency
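The client failover behavior described above can be sketched with standard PowerShell cmdlets; this is a simplified illustration of what an HTTP client does internally, with mail.contoso.com and port 443 as placeholder values:

```powershell
# Resolve the namespace to all of its A records (one VIP per datacenter)
$ips = (Resolve-DnsName -Name mail.contoso.com -Type A).IPAddress

# An HTTP client effectively walks the list, skipping any IP that
# produces a hard TCP failure, and uses the first one that answers
foreach ($ip in $ips) {
    if (Test-NetConnection -ComputerName $ip -Port 443 -InformationLevel Quiet) {
        Write-Output "Connecting to $ip"
        break
    }
}
```

Because the namespace returns multiple IPs, no DNS change is needed for clients to survive the loss of one VIP.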
Preferred Architecture: Namespace Design
For a site resilient datacenter pair, a single namespace / protocol is deployed across both datacenters
Autodiscover: autodiscover.contoso.com
HTTP: mail.contoso.com
IMAP: imap.contoso.com
SMTP: smtp.contoso.com
Load balancers are configured without session affinity, with one VIP per datacenter
Round-robin, geo-DNS, or other solutions are used to distribute traffic equally across both datacenters
Preferred Architecture: DAG Design
• Each datacenter should be its own Active Directory site
• Deploy unbound DAG model spanning each DAG across two datacenters
• Distribute active copies across all servers in the DAG
• Deploy 4 copies, 2 copies in each datacenter
• One copy will be a lagged copy (7 days) with automatic play down enabled
• Native Data Protection is used
• Single network is used for MAPI and replication traffic
• Third datacenter used for Witness server, if possible
• Increase DAG size density before creating new DAGs
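The copy layout above can be sketched in the Exchange Management Shell; server and database names here are hypothetical, and the first copy is assumed to already exist where the database was created:

```powershell
# Second copy in the local datacenter, third and fourth in the remote one
Add-MailboxDatabaseCopy -Identity DB01 -MailboxServer EX2 -ActivationPreference 2
Add-MailboxDatabaseCopy -Identity DB01 -MailboxServer EX3 -ActivationPreference 3

# Fourth copy is lagged 7 days; the replay lag manager can automatically
# play down the lag when copy redundancy is compromised
Add-MailboxDatabaseCopy -Identity DB01 -MailboxServer EX4 -ActivationPreference 4 `
    -ReplayLagTime 7.00:00:00
Set-DatabaseAvailabilityGroup -Identity DAG1 -ReplayLagManagerEnabled $true
```

With Native Data Protection, the four copies (including the lagged copy) replace traditional backups.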
[Diagram: DAG spanning both datacenters, one mail VIP per datacenter, witness server in a third location]
Preferred Architecture
[Diagram: Selina (somewhere in NA) resolves na.contoso.com and Batman (somewhere in Europe) resolves eur.contoso.com; DNS directs each user to the na VIP or eur VIP in front of the regional DAG]
Namespace Planning & Principles
Namespace Planning
• No need for the namespaces required by Exchange 2010
• Can still deploy regional namespaces to control traffic
• Can still have specific namespaces for protocols
• Two namespace models: Bound Model and Unbound Model
• Leverage split-DNS to minimize namespaces and control connectivity
• Deploy separate namespaces for internal and external Outlook Anywhere host names
Bound Model
[Diagram: Sue and Jane (somewhere in NA) resolve mail.contoso.com and mail2.contoso.com to separate VIPs; each namespace is bound to a specific DAG (DAG1 or DAG2), with active copies in one datacenter and passive copies in the other]
Unbound Model
[Diagram: Sue (somewhere in NA) resolves mail.contoso.com, which round-robins between VIP #1 and VIP #2; a single DAG spans both datacenters and copies can be active in either]
Load Balancing
• Exchange 2013 no longer requires session affinity to be maintained on the load balancer
• For each protocol session, CAS now maintains a 1:1 relationship with the Mailbox server hosting the user’s data
• Load balancer configuration and health probes will factor into namespace design
• Remember to configure health probes to monitor healthcheck.htm; otherwise the load balancer and Managed Availability will be out of sync
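The per-virtual-directory check a Layer 7 probe performs can be approximated from an admin workstation; the namespace is a placeholder, and the virtual directory list is illustrative:

```powershell
# Managed Availability publishes healthcheck.htm per protocol;
# an HTTP 200 means that protocol is healthy on the CAS behind the VIP
$vdirs = 'owa','ecp','ews','Microsoft-Server-ActiveSync','oab','mapi','rpc','Autodiscover'
foreach ($v in $vdirs) {
    $r = Invoke-WebRequest -Uri "https://mail.contoso.com/$v/healthcheck.htm" -UseBasicParsing
    "{0,-30} {1}" -f $v, $r.StatusCode
}
```

A Layer 4 probe can only check one such URL per VIP, which is why namespace design and probe granularity interact.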
Single Namespace / Layer 4
[Diagram: user connects to mail.contoso.com and autodiscover.contoso.com through a Layer 4 load balancer, which runs a single health check against a CAS hosting the OWA, ECP, EWS, EAS, OAB, MAPI, RPC, and Autodiscover virtual directories]
Single Namespace / Layer 7
[Diagram: user connects to mail.contoso.com and autodiscover.contoso.com through a Layer 7 load balancer; the health check executes against each virtual directory (OWA, ECP, EWS, EAS, OAB, MAPI, RPC, Autodiscover)]
Multiple Namespaces / Layer 4
[Diagram: user connects through a Layer 4 load balancer using a separate namespace per protocol (mail.contoso.com, mapi.contoso.com, ecp.contoso.com, ews.contoso.com, eas.contoso.com, oab.contoso.com, oa.contoso.com, autodiscover.contoso.com), so each virtual directory is health-checked independently]
Datacenter Switchovers and Failovers
Witness Server Placement
New witness server placement options are available
Choose based on business needs and available options
A DAG witness server in a third location improves DAG recovery behaviors
• Automatic recovery on loss of an entire datacenter
• The third location's network infrastructure must have failure modes independent of both datacenters
Deployment scenario and recommendation:
• DAG(s) deployed in a single datacenter: locate the witness server in the same datacenter as the DAG members; one witness server can be shared across DAGs
• DAG(s) deployed across two datacenters, no additional locations available: locate the witness server in the primary datacenter; one witness server can be shared across DAGs
• DAG(s) deployed across two or more datacenters: locate the witness server in a third location; one witness server can be shared across DAGs
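Placing (or moving) the witness is a single command; the file server name and directory below are hypothetical:

```powershell
# Point the DAG at a witness in the third location; Exchange creates
# and secures the file share witness directory on that server
Set-DatabaseAvailabilityGroup -Identity DAG1 `
    -WitnessServer fs01.stockholm.contoso.com `
    -WitnessDirectory C:\DAG1Witness
```

The witness server must not be a member of the DAG, and sharing one witness server across multiple DAGs requires a distinct witness directory per DAG.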
Site Resilience - CAS
[Diagram: primary datacenter Redmond (cas1, cas2; VIP 192.168.1.50, failed) and alternate datacenter Portland (cas3, cas4; VIP 10.0.1.50)]
mail.contoso.com resolves to 192.168.1.50 and 10.0.1.50; removing the failing IP from DNS puts you in control of the VIP's time in service
With multiple VIP endpoints sharing the same namespace, if one VIP fails, clients automatically fail over to the alternate VIP, and mail.contoso.com resolves to 10.0.1.50
Site Resilience - Mailbox
[Diagram: primary datacenter Redmond (mbx1, mbx2, failed), alternate datacenter Portland (mbx3, mbx4), witness server in third datacenter Stockholm]
Assuming MBX3 and MBX4 are operating and one of them can lock the witness.log file, automatic failover should occur
Site Resilience - Mailbox
[Diagram: primary datacenter Redmond (mbx1, mbx2, witness; all failed), alternate datacenter Portland (mbx3, mbx4)]
1. Mark the failed servers/site as down: Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Redmond
2. Stop the Cluster service on the remaining DAG members: Stop-Clussvc
3. Activate the DAG members in the second datacenter: Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Portland
Site Resilience - Mailbox
[Diagram: primary datacenter Redmond (mbx1, mbx2; witness failed), alternate datacenter Portland (mbx3, mbx4, alternate witness)]
1. Mark the failed servers/site as down: Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Redmond
2. Stop the Cluster service on the remaining DAG members: Stop-Clussvc
3. Activate the DAG members in the second datacenter: Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Portland
Activation Block Comparison
• Suspend-MailboxDatabaseCopy -ActivationOnly (value: N/A; per database copy)
  Keep the active copy off a working but questionable drive
• Set-MailboxServer -DatabaseCopyAutoActivationPolicy ("Blocked" or "Unrestricted"; per server)
  Used to control active/passive site-resilient configurations and maintenance; an admin can still force a move
• Set-MailboxServer -DatabaseCopyActivationDisabledAndMoveNow ($true or $false; per server)
  Used to do faster site failovers while maintaining database availability; databases are not blocked from failing back; continuous move-off operation
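The three activation controls compared above look like this in practice; the database and server identities are hypothetical:

```powershell
# Per database copy: keep this copy passive (e.g., questionable disk)
Suspend-MailboxDatabaseCopy -Identity DB01\EX2 -ActivationOnly

# Per server: block automatic activation on this server entirely
Set-MailboxServer -Identity EX2 -DatabaseCopyAutoActivationPolicy Blocked

# Per server: continuously move actives off, without blocking failback
Set-MailboxServer -Identity EX2 -DatabaseCopyActivationDisabledAndMoveNow $true
```

Remember to reverse each setting (Resume-MailboxDatabaseCopy, Unrestricted, $false) when the maintenance or failover event is over.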
DatabaseCopyActivationDisabledAndMoveNow
• New server setting to improve site resilience
• Gets all active databases off the server fast; keeps an active copy on the server only as a last resort
• Proactively continues database move attempts
• The server can still be in service, with databases mounted and mail delivery working
Best Practices
Automate your recovery logic; make it reliable
Think of it as rack/site maintenance
Exercise it regularly
Recovery times are directly dependent on detection and decision times!
Flip the bit! Don't wait on repair estimates; if there's an outage, go
Humans are the biggest threat to recovery times
Dynamic Quorum and DAGs
Dynamic Quorum
In Windows Server 2008 R2, quorum majority is fixed, based on the initial cluster configuration
In Windows Server 2012 (and later), cluster quorum majority is determined by the set of nodes that are active members of the cluster at a given time
This new feature is called Dynamic Quorum, and it is enabled for all clusters by default
Dynamic Quorum
The cluster dynamically manages vote assignment to nodes, based on the state of each node
• When a node shuts down or crashes, the node loses its quorum vote
• When a node rejoins the cluster, it regains its quorum vote
By adjusting the assignment of quorum votes, the cluster can dynamically increase or decrease the number of quorum votes required to keep running
Dynamic Quorum
By dynamically adjusting the quorum majority requirement, a cluster can sustain sequential node shutdowns all the way down to a single node
This is referred to as a "Last Man Standing" scenario
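The recalculation is plain arithmetic; the sketch below is a simplified simulation of sequential shutdowns, not how the Cluster service computes it internally:

```powershell
# Each time a node shuts down, the cluster removes its vote and
# recomputes the majority over the remaining votes
$votes = 8
while ($votes -ge 1) {
    $majority = [math]::Floor($votes / 2) + 1
    "{0} votes -> majority of {1} required" -f $votes, $majority
    $votes--    # one node shuts down; its vote is removed
}
```

Because the majority shrinks with each single failure, an 8-node cluster can be walked down to one node, which is exactly the Last Man Standing scenario.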
Dynamic Quorum
Does not allow a cluster to sustain a simultaneous failure of a majority of voting members
To continue running, the cluster must always maintain quorum after a node shutdown or failure
If you manually remove a node’s vote, the cluster does not dynamically add the vote back
Dynamic Quorum
[Diagram sequence: as nodes shut down or fail one at a time, the cluster removes their votes (DynamicWeight goes from 1 to 0) and recalculates the required majority: 7, then 4, then 3, then 2, down to a last man standing]
Dynamic Quorum
Use Get-ClusterNode to verify votes (0 = does not have a quorum vote; 1 = has a quorum vote):

Get-ClusterNode <Name> | ft name, *weight, state

Name DynamicWeight NodeWeight State
---- ------------- ---------- -----
EX1  1             1          Up
Dynamic Quorum
Works with most DAGs
Third-party replication DAGs have not been tested
All internal testing has it enabled
Office 365 servers use it
Exchange is not dynamic quorum-aware
Does not change quorum requirements
Dynamic Quorum
Cluster team guidance:
• Generally increases the availability of the cluster
• Enabled by default; strongly recommended to leave it enabled
• Allows the cluster to continue running in failure scenarios that are not possible when this option is disabled
Exchange team guidance:
• Leave it enabled for the majority of DAG members
• In some cases where a Windows Server 2008 R2 DAG would have lost quorum, a Windows Server 2012 DAG can maintain quorum
• Don't factor it into availability plans
Dynamic Witness
Witness offline: the witness vote is removed by the cluster
Witness online: if necessary, the witness vote is added back by the cluster
Witness failure: the witness vote is removed by the cluster
Windows Server 2012 R2 and later
Questions?