Maximize Availability and Uptime by Clustering Your Physical Data Centers Within Metro Distances

MICHAEL NAKAMURA, SENIOR SOLUTIONS ARCHITECT

HENRY CHU, SENIOR SOLUTIONS ARCHITECT

OCTOBER 2012

WEBTECH EDUCATIONAL SERIES

As IT infrastructures continue to be virtualized, data center architects are looking for ways to increase the mobility and high availability of virtual machines beyond a single data center.

Expanding data centers across multiple locations has become an increasingly common strategy to address high-availability and disaster recovery needs for businesses with high uptime requirements.

Join Hitachi Data Systems for this webinar and learn how you can:

• Accelerate tier-1 virtualization adoption by providing best-in-class SLAs

• Dynamically move workloads within and across data centers to avoid contention and support utility-on-demand models

• Provide automated recovery of applications with high return on investment

UPCOMING WEBTECHS

November

Comprehensive and Simplified Management for VMware vSphere environments, November 14, 11 a.m. PT, 2 p.m. ET

Microsoft SQL Server 2012 Data Warehouse solutions on Hitachi converged platform, November 27, 9 a.m. PT, 12 p.m. ET

Check www.hds.com/webtech for:

• Links to the recording, the presentation and Q&A (available next week)

• Schedule and registration for upcoming WebTech sessions

AGENDA

Customer challenges

VMware Metro Storage Cluster overview

Hitachi Storage Cluster for VMware vSphere technical review

Best practices

© Hitachi Data Systems Corporation and Brocade Communications Systems, Inc. 2012. All Rights Reserved.

CUSTOMER CHALLENGES

Downtime

‒ Key component(s) failure in single data center

‒ Planned maintenance

‒ No disaster recovery without downtime

Reluctance to migrate mission-critical apps

‒ Fear of performance degradation

‒ Data recovery is an issue; inability to meet recovery time objectives (RTO) and recovery point objectives (RPO)

Lack of a single point of management across data centers

Inability to pool resources across data centers, which limits application deployment flexibility

VMWARE METRO STORAGE CLUSTER OVERVIEW

VMware vSphere Metro Storage Cluster (vMSC) is a new certified configuration in which a single storage device is stretched across storage systems at multiple geographic locations

Hitachi Storage Cluster certification is complete, and the solution is listed on the VMware Hardware Compatibility List

Implemented for disaster and downtime avoidance

WHAT IS A METRO STORAGE CLUSTER?

WHY USE A METRO STORAGE CLUSTER?

Maximize availability and uptime by clustering physical data centers within metro distances

Leverage VMware infrastructure high-availability benefits with storage-based synchronous replication awareness

Stretched storage clusters provide new architectures that enable:

‒ Nondisruptive workload mobility

‒ Cross-site load balancing of resources

‒ Avoidance of disaster and downtime

Uniform host access model: provides a single view of a datastore across sites

Data consistency across 2 sites in the case of failure
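As a rough illustration of why a stretched cluster is limited to metro distances: with synchronous replication, a write is not complete until the remote system has received it, so every write pays at least one inter-site round trip. The sketch below estimates that floor. It assumes light travels through fiber at about 200,000 km/s (roughly 5 microseconds per kilometer one way); the function name, the default of one round trip per write, and the example distances are illustrative only, and real links add switch, protocol and array overhead on top.

```python
# Assumption: signal propagation in optical fiber is about 200,000 km/s,
# i.e. about 5 microseconds per kilometer in one direction.
FIBER_US_PER_KM = 5.0


def min_sync_write_penalty_ms(distance_km: float, round_trips: int = 1) -> float:
    """Lower bound on the latency that synchronous replication adds per write.

    distance_km -- fiber path length between the two sites (hypothetical input)
    round_trips -- inter-site round trips per write (protocol dependent; assumed 1)
    """
    if distance_km < 0 or round_trips < 1:
        raise ValueError("distance must be >= 0 and round_trips >= 1")
    one_way_us = distance_km * FIBER_US_PER_KM
    # Out and back, once per round trip, converted from microseconds to ms.
    return 2 * one_way_us * round_trips / 1000.0


if __name__ == "__main__":
    for km in (10, 50, 100):
        print(f"{km:>4} km: at least {min_sync_write_penalty_ms(km):.2f} ms per write")
```

Under these assumptions the propagation delay alone grows linearly with distance, which is why the WAN link shows up later as the main performance consideration.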

HITACHI STORAGE CLUSTER FOR VMWARE VSPHERE: INFRASTRUCTURE OVERVIEW

HITACHI STORAGE CLUSTER FOR VMWARE VSPHERE: MANAGEMENT OVERVIEW

vCenter Server contains these management components:

‒ vCenter

‒ Hitachi Dynamic Link Manager (HDLM) command

‒ vSphere CLI

‒ Hitachi Command Control Interface (CCI), also known as RAID Manager

Command device (Cmd Dev) presented from both Hitachi Virtual Storage Platform (VSP) systems

Best practice: Place vCenter at a 3rd site so that a sitewide failure at either data center does not affect virtual infrastructure management

HITACHI STORAGE CLUSTER FOR VMWARE VSPHERE: ARCHITECTURE OVERVIEW

Hitachi High Availability Manager (HAM) installed on each VSP

P-VOL and S-VOL seen as a single volume

‒ On failover, the remote control unit (RCU) takes the serial number of the main control unit (MCU)

Write data transferred from MCU to RCU cache via synchronous Hitachi TrueCopy®

‒ Supports external storage and Hitachi Dynamic Provisioning volumes

Quorum disk on external storage

‒ Used by both MCU and RCU

‒ Unique quorum disk for each MCU-RCU relationship

‒ Allows verification of data integrity before failover

‒ Denotes location of most recent host data
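The synchronous write path on this slide can be modeled in a few lines. This is a toy sketch, not Hitachi's implementation: the class names `Array` and `SyncPair` are hypothetical, caches are plain dictionaries, and the quorum disk is left out. It shows only the ordering, in which the host write is answered after the data has been placed in MCU cache and copied to RCU cache, and the behavior when the copy cannot be made (mirroring stops while the P-VOL side keeps serving writes).

```python
class Array:
    """Toy storage system: its cache is a dict of block address -> data."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cache: dict[int, bytes] = {}
        self.reachable = True

    def write_cache(self, lba: int, data: bytes) -> None:
        if not self.reachable:
            raise ConnectionError(f"{self.name} is unreachable")
        self.cache[lba] = data


class SyncPair:
    """Hypothetical model of a P-VOL/S-VOL pair mirrored from MCU to RCU."""

    def __init__(self, mcu: Array, rcu: Array) -> None:
        self.mcu, self.rcu = mcu, rcu
        self.in_sync = True

    def host_write(self, lba: int, data: bytes) -> str:
        self.mcu.write_cache(lba, data)          # 1. land the write in MCU cache
        if self.in_sync:
            try:
                self.rcu.write_cache(lba, data)  # 2. copy it to RCU cache
            except ConnectionError:
                # Replication path lost: mirroring stops, P-VOL keeps serving.
                self.in_sync = False
        return "ack"                             # 3. only now answer the host
```

A short usage example: `SyncPair(Array("MCU"), Array("RCU")).host_write(0, b"data")` returns `"ack"` with the block present in both caches.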

HITACHI DYNAMIC LINK MANAGER (HDLM) WITH HIGH AVAILABILITY MANAGER (HAM): INTRODUCTION

Virtual storage represents P-VOL and S-VOL as a single volume

‒ P-VOL and S-VOL have same VOL ID in SCSI inquiry

HDLM in ESX manages path selection

‒ Active I/O sent to P-VOL

‒ S-VOL in standby state in normal operation

‒ Load balancing algorithms: extended round robin, extended least I/O, extended least blocks

HAM uses synchronous TrueCopy to replicate from P-VOL to S-VOL
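To make the three load balancing policies concrete, here is a minimal sketch of their plain (non-extended) forms, applied to the set of active paths to the P-VOL. It is an illustration under stated assumptions, not HDLM code: the `Path` and `Balancer` names and the per-path counters are hypothetical, and the "extended" behavior of the HDLM variants is not modeled.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Path:
    name: str
    outstanding_ios: int = 0      # I/Os issued but not yet completed
    outstanding_blocks: int = 0   # blocks covered by those I/Os


class Balancer:
    """Plain versions of the three policies, for illustration only."""

    def __init__(self, active_paths: list[Path]) -> None:
        if not active_paths:
            raise ValueError("need at least one active path")
        self.paths = active_paths
        self._turn = count()

    def round_robin(self) -> Path:
        # Cycle through the paths in order, one I/O each.
        return self.paths[next(self._turn) % len(self.paths)]

    def least_io(self) -> Path:
        # Prefer the path with the fewest I/Os in flight.
        return min(self.paths, key=lambda p: p.outstanding_ios)

    def least_blocks(self) -> Path:
        # Prefer the path with the fewest blocks in flight.
        return min(self.paths, key=lambda p: p.outstanding_blocks)
```

The two "least" policies can disagree: a path carrying a few large I/Os wins under least I/O but loses under least blocks.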

HDLM WITH HAM: VMOTION AND DYNAMIC RESOURCE SCHEDULER

vMotioned VMs

‒ Hosts within the cluster will use active paths to P-VOLs

HDLM WITH HAM: VMWARE HIGH AVAILABILITY (HA)

VMware HA failover

‒ VMs fail over to existing ESX nodes in HA cluster

‒ I/O continues to active P-VOL paths

HDLM WITH HAM: PATH FAILOVER

When paths to P-VOL fail, HDLM PSP handles the path failover

HDLM WITH HAM: STORAGE FAILOVER

When all paths to P-VOL or MCU fail

‒ Paths to S-VOL become active

‒ Verify data integrity with quorum disk before failover

‒ RCU splits S-VOL with write-enabled status
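The difference between path failover and storage failover can be summarized as a small decision function. This is a sketch of the logic as the slides describe it, not product behavior: the function and enum names are hypothetical, the quorum check is reduced to a boolean, and the `STOP_IO` outcome is an assumption for the case the slides do not cover (no usable P-VOL path and no verified S-VOL).

```python
from enum import Enum


class Action(Enum):
    USE_PVOL = "keep I/O on a remaining P-VOL path (path failover)"
    FAIL_OVER_TO_SVOL = "split S-VOL write-enabled and activate its paths"
    STOP_IO = "no safe target (assumed outcome, not stated in the slides)"


def decide(pvol_paths_online: int, svol_paths_online: int,
           quorum_says_svol_current: bool) -> Action:
    """Toy decision logic for the two failover levels."""
    if pvol_paths_online > 0:
        # Some path to the P-VOL survives: the multipath layer just switches.
        return Action.USE_PVOL
    if svol_paths_online > 0 and quorum_says_svol_current:
        # All P-VOL paths are gone: check the quorum disk, then fail over.
        return Action.FAIL_OVER_TO_SVOL
    return Action.STOP_IO
```

For example, `decide(0, 2, True)` returns `Action.FAIL_OVER_TO_SVOL`, while `decide(1, 2, True)` stays on the P-VOL.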

HDLM WITH HAM: PATH RECOVERY

Storage recovery will require reverse sync

‒ pairresync -swaps / -swapp

When storage recovers and paths to P-VOL recover

‒ Paths to S-VOL become standby

‒ P-VOL paths become active
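For scripting the reverse sync, a helper along these lines could assemble the CCI command line. It is a hedged sketch: it only builds the argument list and never runs it, it assumes the usual CCI `-g <group>` addressing, the group name `ham_grp01` is hypothetical, and it follows the CCI convention that `-swaps` is issued from the S-VOL side and `-swapp` from the P-VOL side. Instance options and any HAM-specific prerequisites are out of scope, so check the CCI reference before relying on it.

```python
import shlex


def pairresync_swap_cmd(group: str, run_from: str) -> list[str]:
    """Build (but do not run) a CCI pairresync swap command line.

    group    -- device group name from the HORCM configuration (hypothetical here)
    run_from -- "S-VOL" or "P-VOL": the side on which the command is issued
    """
    option = {"S-VOL": "-swaps", "P-VOL": "-swapp"}.get(run_from)
    if option is None:
        raise ValueError("run_from must be 'S-VOL' or 'P-VOL'")
    return ["pairresync", "-g", group, option]


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(shlex.join(pairresync_swap_cmd("ham_grp01", "S-VOL")))
```

Printing the command rather than executing it keeps the failback a reviewed, deliberate step.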

QUORUM FAILURE

Remote mirroring between P-VOL and S-VOL stops

P-VOL continues to process host I/O

REPLICATION LINK FAILURE

P-VOL continues to process host I/O

HDLM WITH HAM: SITE FAILURE

VM failover handled by VMware HA

Storage failover handled by HAM

Path failover to replicated storage handled by HDLM

WAN LINK FAILURE (UNDER REVIEW)

The link used for replication and for access to the remote site has failed, but links to the local site are active

‒ P-VOL cannot process host I/O

‒ HDLM switches the I/O path to S-VOL

‒ Site 1: I/O paths to S-VOL also cannot be used, so Site 1 can access neither P-VOL nor S-VOL

‒ Site 2: S-VOL continues to process host I/O
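The failure scenarios above can be collected into a lookup table, which is handy when drafting a runbook. The sketch below only restates the slides: the names `Outcome`, `FAILURES` and `served_by` are hypothetical, the `handled_by` field is left empty where the slides name no component, and the path and site rows record the volume implied by the slides rather than stated outright. The WAN link row reflects a slide that is itself marked as under review.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    handled_by: tuple[str, ...]   # empty when the slides do not name a component
    io_served_by: str             # volume that keeps processing host I/O


# Each row restates a slide; nothing here is measured or vendor-verified.
FAILURES: dict[str, Outcome] = {
    "path":             Outcome(("HDLM",), "P-VOL"),
    "storage":          Outcome(("HAM", "HDLM"), "S-VOL"),
    "quorum":           Outcome((), "P-VOL"),
    "replication link": Outcome((), "P-VOL"),
    "site":             Outcome(("VMware HA", "HAM", "HDLM"), "S-VOL"),
    "WAN link":         Outcome(("HDLM",), "S-VOL"),  # slide marked "under review"
}


def served_by(volume: str) -> list[str]:
    """Failure types after which `volume` is the one serving host I/O."""
    return [name for name, o in FAILURES.items() if o.io_served_by == volume]
```

Calling `served_by("P-VOL")` lists the failures that do not require a storage failover.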

BEST PRACTICE DESIGN RECOMMENDATIONS

Performance bottleneck dependent on WAN latency and bandwidth

‒ Optionally use VMware HA with N+1 settings in combination with DRS affinity rules to keep VMs at the site where the active volume resides

Quorum disk should be located at a 3rd site so that a sitewide failure at either data center does not affect quorum access.

vCenter should be located at a 3rd site so that a sitewide failure at either data center does not affect virtual infrastructure management.
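The affinity recommendation above amounts to a simple placement check: a VM should run at the site whose storage system holds the active volume for its datastore, so that its I/O does not cross the WAN. The following sketch expresses that check over plain dictionaries. All names and inventory data are hypothetical; a real check would pull this information from vCenter and the storage systems.

```python
def misplaced_vms(vm_site: dict[str, str],
                  vm_datastore: dict[str, str],
                  active_site: dict[str, str]) -> list[str]:
    """Return VMs running at a different site than their datastore's active volume.

    vm_site      -- VM name -> site where the VM currently runs
    vm_datastore -- VM name -> datastore backing the VM
    active_site  -- datastore -> site where the active (P-VOL) copy resides
    """
    return sorted(
        vm for vm, site in vm_site.items()
        if active_site[vm_datastore[vm]] != site
    )


if __name__ == "__main__":
    print(misplaced_vms(
        {"vm1": "site1", "vm2": "site2"},
        {"vm1": "ds1", "vm2": "ds1"},
        {"ds1": "site1"},
    ))
```

VMs reported by such a check are candidates for a DRS affinity rule or a vMotion back to the site of the active volume.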

BEST PRACTICE DESIGN RECOMMENDATIONS

Perform storage failback during scheduled downtime

‒ Perform a clean and controlled storage failback by migrating high-uptime virtual machines to a single host via VMware vMotion and then performing storage failback

Avoid single points of failure by architecting with redundancy in mind

QUESTIONS

THANK YOU

MICHAEL NAKAMURA

HENRY CHU
