© 2008 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
How the Disaster Proof OpenVMS Cluster Recovered So Fast, and How Yours Can, Too
Keith Parris Systems/Software Engineer
HP
Monday, May 19 and Wednesday, May 21
Story of the OpenVMS Cluster in the Disaster Proof Video
4 30 July 2015
Disaster Proof Demonstration and Video
Camden, Arkansas NTS
The Failover Datacenter
The original “green” datacenter
Nature gets in on the act!
KABOOM! Arkansas on the ground
OpenVMS Disaster-Proof configuration & application
[Diagram: a stream of I/Os to a shadow set mirrored across XP12000 and XP24000 storage arrays, served by the cluster nodes KABOOM:: (AlphaServer ES40), QUORUM:: (Integrity rx2620), and SDBOOM:: (Integrity Superdome).]
• All I/Os must complete to all spindles before a write is considered done.
• When a spindle drops out, the shadow set is reduced.
• I/Os “in flight” wait for the shadow set to be reduced.
The longest-outstanding I/O request during the Disaster Proof demo was 13.71 seconds.
GQB ready for a ride!
Disaster Proof Demo OpenVMS Cluster
How the Disaster Proof OpenVMS Cluster Recovered So Fast, and How Yours Can, Too
OpenVMS Cluster Failure Detection Mechanisms and Cluster State Transitions
OpenVMS Cluster Connection Manager and Transient Failures
• Some failures are temporary and transient
− Especially in a LAN environment
• To prevent the disruption of unnecessarily removing a node from the cluster, when a communications failure is detected the Connection Manager waits for a time in the hope that the problem will go away by itself
− This time is called the Reconnection Interval
• SYSGEN parameter RECNXINTERVAL
− RECNXINTERVAL is dynamic and may thus be temporarily raised if needed for something like a scheduled LAN outage
OpenVMS Cluster Connection Manager and Communications or Node Failures
• If the Reconnection Interval passes without connectivity being restored, or if the node has “gone away”, the cluster cannot continue without a reconfiguration
• This reconfiguration is called a State Transition, and one or more nodes will be removed from the cluster
Failure and Repair/Recovery within Reconnection Interval
[Timeline: failure occurs → failure detected (virtual circuit broken) → problem fixed → fixed state detected (virtual circuit re-opened). The RECNXINTERVAL timer runs from failure detection, and here the repair is detected before it expires.]
Hard Failure
[Timeline: failure occurs → failure detected (virtual circuit broken) → RECNXINTERVAL expires → state transition (node removed from cluster).]
Late Recovery
[Timeline: failure occurs → failure detected (virtual circuit broken) → RECNXINTERVAL expires → state transition (node removed from cluster). The problem is then fixed and the fix detected, but too late: when the node learns it has been removed from the cluster, it does a CLUEXIT bugcheck.]
Failure Detection Mechanisms
• Mechanisms to detect a node or communications failure
− Last-Gasp Datagram
− Periodic checking
• Multicast Hello packets on LANs
• Polling on CI and DSSI
• TIMVCFAIL check
PEDRIVER Hello Packet Timing
• Hello packet Transmit Interval
−Default is 3 seconds
−Dithered by reducing the interval by as much as half, to avoid forming “packet trains”
• so Hellos could be spaced as close as 1.5 seconds, or as far apart as 3 seconds
• Hello packet Listen Timeout
−Default is 8 seconds
−Allows detection of a failure within 8 to 9 seconds
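The "between 8 and 9 seconds" figure is the listen timeout plus up to one second of timer granularity. A minimal sketch in Python (illustrative only; the interval values are the defaults quoted above):

```python
# Worst-case failure detection time for PEDRIVER Hello packets.
# Assumes the defaults above: 8-second listen timeout, and a listen
# timer that is checked once per second.

LISTEN_TIMEOUT = 8.0      # seconds without a Hello before the VC is broken
TIMER_GRANULARITY = 1.0   # listen timer is checked once per second

# Best case: the timeout is noticed exactly when it expires.
best_case = LISTEN_TIMEOUT
# Worst case: expiry is only noticed at the next one-second tick.
worst_case = LISTEN_TIMEOUT + TIMER_GRANULARITY

print(f"failure detected between {best_case:.0f} and {worst_case:.0f} seconds")
```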
Failure Detection on LAN Interconnects
[Diagram: the remote node sends Hello packets at t=0, 3, 6, and 9 seconds. The local node's listen timer counts up with each clock tick and is reset to zero whenever a Hello arrives. One lost Hello (at t=6) just lets the timer count a little higher before the t=9 Hello resets it.]
Failure Detection on LAN Interconnects
[Diagram: after the Hello at t=0, the Hellos at t=3 and t=6 are lost. The local node's listen timer keeps counting; when it reaches 8 seconds without a Hello, the virtual circuit is broken.]
TIMVCFAIL Mechanism
[Diagram: the local node sends a request to the remote node at t=1/3 of TIMVCFAIL and another at t=2/3; each request receives a response, so the virtual circuit stays open.]
TIMVCFAIL Mechanism
[Diagram: the response to the request at t=1/3 of TIMVCFAIL arrives (1), but the node fails some time before t=2/3 and the second request (2) receives no response. At t=TIMVCFAIL the virtual circuit is broken; worst-case detection thus takes one full TIMVCFAIL period.]
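The detection bounds implied by this timeline come out of simple arithmetic. A Python sketch, assuming TIMVCFAIL is in 10-millisecond units (as the demo settings quoted later indicate) and that the connection is checked every third of the period:

```python
# Failure-detection bounds from the TIMVCFAIL mechanism.
# Assumption: TIMVCFAIL is in 10-millisecond units, checks occur every
# TIMVCFAIL/3, and the virtual circuit is broken if no response has
# arrived by the full TIMVCFAIL period.

def timvcfail_bounds(timvcfail_units):
    """Return (best, worst) detection times in seconds."""
    period = timvcfail_units * 0.010   # convert 10-ms units to seconds
    # A node that dies just before a check is detected in about 2/3 of
    # the period; one that dies just after responding takes the full period.
    return (2.0 * period / 3.0, period)

best, worst = timvcfail_bounds(1600)   # default setting (16 seconds)
print(f"default: detected in {best:.1f} to {worst:.1f} s")
best, worst = timvcfail_bounds(400)    # Disaster Proof demo setting
print(f"demo:    detected in {best:.1f} to {worst:.1f} s")
```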
Sequence of Events During a State Transition
• Determine new cluster configuration
• If quorum is lost:
− QUORUM capability bit removed from all CPUs
− No process can be scheduled to run
− Disks all put into mount verification
• If quorum is not lost, continue…
• Rebuild lock database:
− Stall lock requests
− I/O synchronization
− Do rebuild work
− Resume lock handling
Measuring State Transition Effects
• Determine the type of the last lock rebuild:
$ ANALYZE/SYSTEM
SDA> READ SYS$LOADABLE_IMAGES:SCSDEF
SDA> EVALUATE @(@CLU$GL_CLUB + CLUB$B_NEWRBLD_REQ) & FF
Hex = 00000002   Decimal = 2   ACP$V_SWAPPRV
• Rebuild type values:
1. Merge (locking not disabled)
2. Partial
3. Directory
4. Full
Measuring State Transition Effects
• Determine the duration of the last lock request stall period:
SDA> DEFINE TOFF = @(@CLU$GL_CLUB+CLUB$L_TOFF)
SDA> DEFINE TON = @(@CLU$GL_CLUB+CLUB$L_TON)
SDA> EVALUATE TON-TOFF
Hex = 0000026B Decimal = 619 PDT$Q_COMQH+00003
Minimizing Impact of State Transitions
• Configuration issues:
− Few (e.g. exactly 3) nodes
− Quorum node; no quorum disk
− Set up LAN cluster interconnect to minimize length of time packet-forwarding is blocked
• Original IEEE 802.1d Spanning Tree algorithm could take 35-40 seconds to converge and start forwarding packets again
− Two completely-independent spanning trees could help avoid communications being blocked on both at once
• Newer IEEE 802.1w Rapid Spanning Tree (and IEEE 802.1s Multiple Spanning Tree) protocols can be configured to recover in less than 1 second
Disaster Proof Demonstration Settings and Behavior
OpenVMS System Parameter Settings for the Disaster Proof Demonstration
• SHADOW_MBR_TMO lowered from default of 120 down to 8 seconds
• RECNXINTERVAL lowered from default of 20 down to 10 seconds
• TIMVCFAIL lowered from default of 1600 to 400 (4 seconds, in 10-millisecond clock units) to detect node failure in 4 seconds, worst-case, (detecting failure at the SYSAP level)
• LAN_FLAGS bit 12 set to enable Fast LAN Transmit Timeout (give up on a failed packet transmit in 1.25 seconds, worst case, instead of an order of magnitude more in some cases)
• PE4 set to hexadecimal 0703 (Hello transmit interval of 0.7 seconds, nominal; Listen Timeout of 3 seconds), to detect node failure in 3-4 seconds at the PEDRIVER level
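The PE4 value can be unpacked to show where the 3-4 second figure comes from. A Python sketch; the byte layout (low byte = listen timeout in seconds, next byte = Hello interval in tenths of a second) is inferred from the slide's description of 0703, not from documentation:

```python
# Unpacking the PE4 value used in the demo (hexadecimal 0703).
# PE4 is an undocumented PEDRIVER tuning parameter; the assumed layout
# below is inferred from this presentation, so treat it as illustration:
#   low byte  = Hello listen timeout, in seconds
#   next byte = Hello transmit interval, in tenths of a second

PE4 = 0x0703

listen_timeout_s = PE4 & 0xFF                  # 0x03 -> 3 seconds
hello_interval_s = ((PE4 >> 8) & 0xFF) / 10    # 0x07 -> 0.7 seconds

print(f"Hello transmit interval: {hello_interval_s} s")
print(f"Listen timeout: {listen_timeout_s} s")
# With roughly one second of timer granularity, detection at the
# PEDRIVER level falls between the timeout and one second more:
# the 3-4 seconds quoted on the slide.
print(f"detection window: {listen_timeout_s} to {listen_timeout_s + 1} s")
```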
Disaster Proof Demo Timeline
• Time = 0: Explosion occurs
• Time around 3.5 seconds: Node failure detected, via either PEDRIVER Hello Listen Timeout or TIMVCFAIL mechanism. VC closed; Reconnection Interval starts.
• Time = 8 seconds: Shadow Member Timeout expires; shadowset members removed.
• Time around 13.5 seconds: Reconnection Interval expires; State Transition begins.
• Time = 13.71 seconds: Recovery complete; Application processing resumes.
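The timeline above is essentially the parameter settings added up. A Python sketch tying the numbers together (the 3.5-second detection figure and the 13.71-second total are measured values from the demo; the rest is arithmetic on the settings):

```python
# The Disaster Proof demo timeline, reconstructed from the settings.

SHADOW_MBR_TMO = 8.0    # s: failed shadow set members removed
RECNXINTERVAL = 10.0    # s: Connection Manager wait after detection

detection = 3.5                                # PEDRIVER or TIMVCFAIL (measured)
members_removed = SHADOW_MBR_TMO               # 8 s after the explosion
transition_begins = detection + RECNXINTERVAL  # about 13.5 s
recovery_complete = 13.71                      # longest outstanding I/O (measured)

for label, t in [("node failure detected", detection),
                 ("shadow set members removed", members_removed),
                 ("state transition begins", transition_begins),
                 ("application resumes", recovery_complete)]:
    print(f"T = {t:5.2f} s  {label}")
```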
Disaster Proof Demo Timeline
[Timeline: T = 0: explosion. T ≈ 3.5 seconds: failure detected (PEDRIVER Hello Listen Timeout or TIMVCFAIL timeout). T = 8 seconds: Shadow Member Timeout; failed shadowset members removed. T ≈ 13.5 seconds: Reconnection Interval expires; node removed from cluster; cluster state transition and lock database rebuild begin. T = 13.71 seconds: application resumes.]
Simulation and Testing of Long Distance DR/DT Configurations
Trends
Trends
• Increase in disasters
• Longer inter-site distances for better protection
• Business pressures for shorter distances for performance
• Increasing pressure not to bridge LANs between sites
Trends
− Increase in Disasters
“Natural disasters have quadrupled over the last two decades, from an average of 120 a year in the early 1980s to as many as 500 today.”
Continuity Insights Magazine
Nov./Dec. 2007 issue, page 10
“There has been a six-fold increase in floods since 1980. The number of floods and wind-storms have increased from 60 in 1980 to 240 last year.”
Continuity Insights Magazine
Nov./Dec. 2007 issue, page 10
Increase in Disasters
http://www.oxfam.org/en/files/bp108_climate_change_alarm_0711.pdf/download
Trends
− Longer inter-site distances for better protection
“Some CIOs are imagining potential disasters that go well beyond the everyday hiccups that can disrupt applications and networks. Others, recognizing how integral IT is to business today, are focusing on the need to recover instantaneously from any unforeseen event.” …“It's a different world. There are so many more things to consider than the traditional fire, flood and theft.”
“Redefining Disaster“
Mary K. Pratt, Computerworld, June 20, 2005
http://www.computerworld.com/hardwaretopics/storage/story/0,10801,102576,00.html
Northeast US Before Blackout
Source: NOAA/DMSP
Northeast US After Blackout
Source: NOAA/DMSP
“The blackout has pushed many companies to expand their data center infrastructures to support data replication between two or even three IT facilities -- one of which may be located on a separate power grid.”
Computerworld, August 2, 2004
http://www.computerworld.com/securitytopics/security/recovery/story/0,10801,94944,00.html
“You have to be far enough apart to make sure that conditions in one place are not likely to be duplicated in the other.” … “A useful rule of thumb might be a minimum of about 50 km, the length of a MAN, though the other side of the continent might be necessary to play it safe.”
“Disaster Recovery Sites: How Far Away is Far Enough?”
Drew Robb, Datamation, October 4, 2005
http://www.enterprisestorageforum.com/continuity/features/article.php/3552971
Trends: Longer Inter-site Distances for Better Protection
• In the past, protection focused on risks like fires, floods, and tornadoes; 1 to 5 miles between sites was fine.
• Right after 9/11, 60 to 100 miles looked much better.
• After the Northeast Blackout of 2003, and with growing awareness that a terrorist group obtaining a nuclear device and wiping out an entire metropolitan area is no longer inconceivable:
− The resulting pressure is for inter-site distances of 1,000 to 1,500 miles
• Challenges:
− Telecommunications links
− Latency due to speed of light adversely affects performance
Trends
− Business pressures for shorter distances for performance
“A 1-millisecond advantage in trading applications can be worth $100 million a year to a major brokerage firm, by one estimate.”
Richard Martin, InformationWeek,
April 23, 2007
“The fastest systems, running from traders' desks to exchange data centers, can execute transactions in a few milliseconds -- so fast, in fact, that the physical distance between two computers processing a transaction can slow down how fast it happens.”
Richard Martin, InformationWeek, April 23, 2007
“This problem is called data latency -- delays measured in split seconds. To overcome it, many high-frequency algorithmic traders are moving their systems as close to the Wall Street exchanges as possible.”
Richard Martin, InformationWeek,
April 23, 2007
Trends
− Increasing pressure not to bridge LANs between sites
Trends: Increasing Resistance to LAN Bridging
• In the past, setting up a VLAN spanning sites for an OpenVMS disaster-tolerant cluster was common
• Networks are now IP-centric
• IP network mindset sees LAN bridging as “bad,” sometimes even “totally unacceptable”
• Alternatives:
− Separate, private link for OpenVMS Multi-site Cluster
− Metropolitan Area Networks (MANs) using MPLS
− Ethernet-over-IP (EoIP)
− SCS-over-IP support planned for OpenVMS 8.4
Site Selection and Inter-Site Distance
Planning for DT: Site Selection
Sites must be carefully selected:
• Avoid hazards
− Especially hazards common to both sites, which could cause the loss of both datacenters at once
• Make them a “safe” distance apart
• Select site separation in a “safe” direction
Planning for DT: What is a “Safe Distance”?
Analyze likely hazards of proposed sites:
• Natural hazards
− Fire (building, forest, gas leak, explosive materials)
− Storms (Tornado, Hurricane, Lightning, Hail, Ice)
− Flooding (excess rainfall, dam breakage, storm surge, broken water pipe)
− Earthquakes, Tsunamis
Planning for DT: What is a “Safe Distance”?
Analyze likely hazards of proposed sites:
• Man-made hazards
− Nearby transportation of hazardous materials (highway, rail)
− Terrorist with a bomb
− Disgruntled customer with a weapon
− Enemy attack in war (nearby military or industrial targets)
− Civil unrest (riots, vandalism)
Former Atlas E Missile Silo Site in Kimball, Nebraska
Planning for DT: Site Separation Distance
• Make sites a “safe” distance apart
• This must be a compromise. Factors:
− Risks
− Performance (inter-site latency)
− Interconnect costs
− Ease of travel between sites
− Availability of workforce
Planning for DT: Site Separation Distance
• Select site separation distance (threat radius):
− 1-3 miles: protects against most building fires, natural gas leaks, armed intruders, terrorist bombs
− 10-30 miles: protects against most tornadoes, floods, hazardous material spills, release of poisonous gas, non-nuclear military bomb strike
− 100-300 miles: protects against most hurricanes, earthquakes, tsunamis, forest fires, most biological weapons, most power outages, suitcase-sized nuclear bomb
− 1,000-3,000 miles: protects against “dirty” bombs, major region-wide power outages, and possibly military nuclear attacks
“You have to be far enough away to be beyond the immediate threat you are planning for.” … “At the same time, you have to be close enough for it to be practical to get to the remote facility rapidly.”
“Disaster Recovery Sites: How Far Away is Far Enough?” By Drew Robb
Enterprise Storage Forum, September 30, 2005
http://www.enterprisestorageforum.com/continuity/features/article.php/3552971
“A Watertight Plan” By Penny Lunt Crosman, IT Architect, Sept. 1, 2005
http://www.itarchitect.com/showArticle.jhtml?articleID=169400810
“Survivors of hurricanes, floods, and the London terrorist bombings offer best practices and advice on disaster recovery planning.”
Planning for DT: Site Separation Direction
• Select site separation direction:
− Not along same earthquake fault-line
− Not along likely storm tracks
− Not in same floodplain or downstream of same dam
− Not on the same coastline
− Not in line with prevailing winds (that might carry hazardous materials or radioactive fallout)
Long-Distance Disaster Tolerance Using OpenVMS Clusters
Background
Historical Context
Example: New York City, USA
• 1993 World Trade Center bombing raised awareness of DR and prompted some improvements
• Sept. 11, 2001 has had dramatic and far-reaching effects
−Scramble to find replacement office space
−Many datacenters moved off Manhattan Island, some out of NYC entirely
− Increased distances to DR sites
− Induced regulatory responses (in USA & abroad)
Trends and Driving Forces in the US
• BC, DR and DT in a post-9/11 world:
−Recognition of greater risk to datacenters
• Particularly in major metropolitan areas
−Push toward greater distances between redundant datacenters
• It is no longer inconceivable that, for example, terrorists might obtain a nuclear device and destroy the entire NYC metropolitan area
Trends and Driving Forces in the US
• “Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System”
−http://www.sec.gov/news/studies/34-47638.htm
• Agencies involved:
Federal Reserve System
Department of the Treasury
Securities & Exchange Commission (SEC)
• Applies to:
Financial institutions critical to the US economy
US Draft Interagency White Paper
The early “concept release” inviting input made mention of a 200-300 mile limit (only as part of an example when asking for feedback as to whether any minimum distance value should be specified or not):
“Sound practices. Have the agencies sufficiently described expectations regarding out-of-region back-up resources? Should some minimum distance from primary sites be specified for back-up facilities for core clearing and settlement organizations and firms that play significant roles in critical markets (e.g., 200-300 miles between primary and back-up sites)? What factors should be used to identify such a minimum distance?”
US Draft Interagency White Paper
This induced panic in several quarters:
• NYC feared additional economic damage of companies moving out
• Some pointed out the technology limitations of some synchronous mirroring products and of Fibre Channel at the time which typically limited them to a distance of 100 miles or 100 km
Revised draft contained no specific distance numbers; just cautionary wording
Ironically, that same non-specific wording now often results in DR datacenters 1,000 to 1,500 miles away
US Draft Interagency White Paper
“Maintain sufficient geographically dispersed resources to meet recovery and resumption objectives.”
“Long-standing principles of business continuity planning suggest that back-up arrangements should be as far away from the primary site as necessary to avoid being subject to the same set of risks as the primary location.”
US Draft Interagency White Paper
“Organizations should establish back-up facilities a significant distance away from their primary sites.”
“The agencies expect that, as technology and business processes … continue to improve and become increasingly cost effective, firms will take advantage of these developments to increase the geographic diversification of their back-up sites.”
Ripple effect of Regulatory Activity Within the USA
• National Association of Securities Dealers (NASD):
−Rule 3510 & 3520
• New York Stock Exchange (NYSE):
−Rule 446
Ripple effect of Regulatory Activity Outside the USA
• United Kingdom: Financial Services Authority:
− Consultation Paper 142 – Operational Risk and Systems Control
• Europe:
− Basel II Accord
• Australian Prudential Regulation Authority:
− Prudential Standard for business continuity management APS 232 and guidance note AGN 232.1
• Monetary Authority of Singapore (MAS):
− “Guidelines on Risk Management Practices – Business Continuity Management,” affecting “Significantly Important Institutions” (SIIs)
Resiliency Maturity Model project
• The Financial Services Technology Consortium (FSTC) has begun work on a Resiliency Maturity Model
−Taking inspiration from the Carnegie Mellon Software Engineering Institute’s Capability Maturity Model (CMM) and Networked Systems Survivability Program
− Intent is to develop industry standard metrics to evaluate an institution’s business continuity, disaster recovery, and crisis management capabilities
Long-distance Effects: Inter-site Latency
Long-distance Cluster Issues
• Latency due to speed of light becomes significant at higher distances. Rules of thumb:
− About 1 ms per 100 miles, one-way
− About 1 ms per 50 miles round-trip latency
• Actual circuit path length can be longer than highway mileage between sites
• Latency can adversely affect performance of
− Remote I/O operations
− Remote locking operations
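The rule of thumb is easy to apply. A Python sketch (illustrative only; real circuit paths and equipment delays vary):

```python
# Rule-of-thumb speed-of-light latency: about 1 ms per 100 circuit miles
# one-way, i.e. 1 ms per 50 miles round trip. (Light in fiber covers
# roughly 124 miles per millisecond; the rule of thumb allows for real
# circuit paths being longer and for equipment delays.)

def round_trip_ms(circuit_miles):
    return circuit_miles / 50.0

for miles in (100, 500, 1000, 1500):
    print(f"{miles:5d} circuit miles: ~{round_trip_ms(miles):.0f} ms round trip")
```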
OpenVMS Lock Request Latencies
[Chart data: lock request latency, in microseconds]
Gigabit Ethernet, zero distance:      200
Fast Ethernet, zero distance:         240
ATM, 30 miles:                        400
DS-3, 250 miles:                    4,400
OC-3, 1,400 miles:                 23,000
Inter-site Latency: Actual Customer Measurements
Highway Mileage      Link       Latency (ms)   Est. Circuit Path Length
5 miles              ATM OC-3   0.5            30 miles
35 miles             n/a        1.5            95 miles
25 to 35 miles       IP DLSW    3 to 4         190-250 miles (effective)
130 miles            DS-3       4.4            275 miles
“Over 150” miles     n/a        5.5            350 miles
1,250 miles          DS-3       30             1,875 miles
Differentiate between latency and bandwidth
• Can’t get around the speed of light and its latency effects over long distances
− Higher-bandwidth link doesn’t mean lower latency
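The point can be made concrete with the simple model transfer time = round-trip latency + size / bandwidth. A Python sketch with illustrative numbers (not from the slides):

```python
# For small synchronous transfers the latency term dominates, so a
# higher-bandwidth link barely helps. Illustrative figures only.

def transfer_ms(size_bytes, bandwidth_mbps, rtt_ms):
    return rtt_ms + (size_bytes * 8) / (bandwidth_mbps * 1000.0)

RTT_MS = 20.0  # ~1,000-mile circuit at 1 ms per 50 miles round trip
for mbps in (45, 155, 1000):             # roughly DS-3, OC-3, Gigabit
    t = transfer_ms(8192, mbps, RTT_MS)  # one 8 KB synchronous write
    print(f"{mbps:5d} Mb/s: {t:.1f} ms per 8 KB write")
```

Going from DS-3 to Gigabit Ethernet here shaves barely more than a millisecond off a 21 ms operation; only shortening the circuit reduces the dominant term.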
Long-distance Techniques:SAN Extension
SAN Extension
• Fibre Channel distance over fiber is limited to about 100 kilometers
−Shortage of buffer-to-buffer credits adversely affects Fibre Channel performance above about 50 kilometers
• Various vendors provide “SAN Extension” boxes to connect Fibre Channel SANs over an inter-site link
• See SAN Design Reference Guide Vol. 4 “SAN extension and bridging”:
−http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00310437/c00310437.pdf
Long-distance Data Replication
Disk Data Replication
• Data mirroring schemes
− Synchronous
• Slower, but no chance of data loss in conjunction with a site loss
− Asynchronous
• Faster, and works for longer distances, but can lose seconds’ or minutes’ worth of data (more under high loads) in a site disaster
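The asynchronous data-loss exposure can be roughly bounded: whatever was written during the replication lag is at risk. A Python sketch with illustrative numbers (not from the slides):

```python
# Rough bound on asynchronous-replication data loss: anything written
# during the replication lag is at risk if the primary site is destroyed.
# The 5 MB/s write rate is an illustrative assumption.

def data_at_risk_mb(write_rate_mb_per_s, lag_seconds):
    return write_rate_mb_per_s * lag_seconds

for lag_s in (1, 30, 300):   # seconds of replication lag
    print(f"lag {lag_s:3d} s at 5 MB/s writes: "
          f"{data_at_risk_mb(5, lag_s):,.0f} MB at risk")
```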
Continuous Access Synchronous Replication
[Diagram sequence: a node at each site connects through FC switches to an EVA array at each site; a mirrorset spans both EVAs, with one controller in charge of the mirrorset:
1. The host issues a write to the controller in charge of the mirrorset.
2. That controller forwards the write to the remote EVA.
3. The remote EVA returns success status.
4. The controller returns success status to the host, and the application continues.]
Continuous Access Asynchronous Replication
[Diagram sequence, same configuration:
1. The host issues a write to the controller in charge of the mirrorset.
2. The controller immediately returns success status, and the application continues.
3. The controller then forwards the write to the remote EVA, which returns its success status later.]
Synchronous versus Asynchronous Replication and Link Bandwidth
[Chart: application write bandwidth (MB/sec) over a day (8 am, 12 noon, 5 pm, 12 midnight). A synchronous link (RPO = 0) must be sized for the peak application write rate; an asynchronous link with an RPO of at most 2 hours can be sized smaller; an even smaller link implies an RPO of many hours.]
Data Replication and Long Distances
• Some vendors claim synchronous mirroring is impossible at a distance over 100 kilometers, 100 miles, or 200 miles, because their product cannot support synchronous mirroring over greater distances
• OpenVMS Volume Shadowing does synchronous mirroring
−Acceptable application performance is the only limit found so far on inter-site distance for HBVS
Long-distance SynchronousHost-based Mirroring Software Tests
• OpenVMS Host-Based Volume Shadowing (HBVS) software (host-based mirroring software)
• SAN Extension used to extend SAN using FCIP boxes
• AdTech box used to simulate distance via introduced packet latency
• No OpenVMS Cluster involved across this distance (no OpenVMS node at the remote end; just “data vaulting” to a “distant” disk controller)
Long-distance HBVS Test Results
Delay, 1-way     Throughput        Distance       Distance
(milliseconds)   (bytes/second)    (kilometers)   (miles)
0 ms             11 megabytes      0 km           0 miles
10 ms            226 kilobytes     2,000 km       1,250 miles
50 ms            45 kilobytes      10,000 km      6,250 miles
100 ms           24 kilobytes      20,000 km      12,500 miles
200 ms           15 kilobytes      40,000 km      25,000 miles
300 ms           9 kilobytes       60,000 km      37,500 miles
400 ms           8 kilobytes       80,000 km      50,000 miles
485 ms           6.5 kilobytes     97,000 km      60,625 miles
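These results are consistent with a roughly fixed amount of data in flight per round trip, so throughput falls off as 1/RTT. A Python sketch deriving the implied window from the measured points (an interpretation of the numbers, not a statement about how HBVS is implemented):

```python
# Implied "data in flight" per round trip: throughput ~= window / RTT,
# so window ~= throughput * RTT. Derived from the measured table above.

results = [              # (one-way delay in ms, throughput in bytes/s)
    (10, 226_000), (50, 45_000), (100, 24_000),
    (200, 15_000), (300, 9_000), (400, 8_000),
]

for delay_ms, bps in results:
    rtt_s = 2 * delay_ms / 1000.0        # round-trip time in seconds
    window_kb = bps * rtt_s / 1000.0     # implied data in flight
    print(f"{delay_ms:3d} ms one-way: implied window ~{window_kb:.1f} KB")
```

The implied window stays in the 4.5-6.5 KB range across a 40x spread of delays, which is why halving the latency roughly doubles throughput here while adding bandwidth would not.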
Mitigating the Effects of Long Inter-site Distances
Minimizing Round Trips Between Sites
• Some vendors have Fibre Channel SCSI-3 protocol tricks to do writes in 1 round trip vs. 2
−e.g. Brocade’s “FastWrite” or Cisco’s “Write Acceleration”
• Application design can also affect number of round-trips required between sites
Mitigating Impact of Inter-Site Latency
How applications are distributed across a multi-site OpenVMS cluster can affect performance
This represents a trade-off among performance, availability, and resource utilization
Application Scheme 1: Hot Primary/Cold Standby
• All applications normally run at the primary site
− Second site is idle, except for data replication work, until primary site fails, then it takes over processing
• Performance will be good (all-local locking)
• Fail-over time will be poor, and risk high (standby systems not active and thus not being tested)
• Wastes computing capacity at the remote site
Application Scheme 2: Hot/Hot but Alternate Workloads
• All applications normally run at one site or the other, but not both; data is mirrored between sites, and the opposite site takes over upon a failure
• Performance will be good (all-local locking)
• Fail-over time will be poor, and risk moderate (standby systems in use, but specific applications not active and thus not being tested from that site)
• Second site’s computing capacity is actively used
Application Scheme 3: Uniform Workload Across Sites
• All applications normally run at both sites simultaneously. (This would be considered the “norm” for most OpenVMS clusters.)
• Surviving site takes all load upon failure
• Performance may be impacted (some remote locking) if inter-site distance is large
• “Fail-over” time will be excellent, and risk low (all systems are already in use running the same applications, thus constantly being tested)
• Both sites’ computing capacity is actively used
Work-arounds being used today
• Multi-hop replication
−Synchronous to nearby site
−Asynchronous to far-away site
• Transaction-based replication
−e.g. replicate transaction (a few hundred bytes) with Reliable Transaction Router instead of having to replicate all the database page updates (often 8 kilobytes or 64 kilobytes per page) and journal log file writes behind a database
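The bandwidth saving is easy to quantify. A Python sketch using the slide's figures (300 bytes stands in for "a few hundred"; journal writes are left out, so the real saving is larger):

```python
# Bandwidth ratio of replicating transactions versus database pages.

TXN_BYTES = 300            # one replicated transaction ("a few hundred bytes")
PAGE_BYTES = 8 * 1024      # one 8 KB database page update

ratio = PAGE_BYTES / TXN_BYTES
print(f"one 8 KB page update is ~{ratio:.0f}x one replicated transaction")
```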
Data Replication over Long Distances: Multi-Hop Replication
• It may be desirable to synchronously replicate data to a nearby “short-haul” site, and asynchronously replicate from there to a more-distant site
− This is sometimes called “cascaded” data replication
[Diagram: Primary → (synchronous, short-haul, 100 miles) → Secondary → (asynchronous, long-haul, 1,000 miles) → Tertiary]
Testing & Simulation of Long Distances
Testing / Simulation
• Before incurring the risk and expense of site selection, datacenter construction, and inter-site link procurement:
• Test within a single-datacenter test environment, with distance simulated by introducing packet latency, and bandwidth simulated by throttling traffic flow
• Techniques for simulating distance with latency:
−Hardware Network Emulators
−Software Network Emulators
Hardware Network Emulators
• A couple of vendors / products:
−Shunra STORM Network Emulator
−Spirent AdTech
Software Network Emulators
• A couple of examples:
−NIST Net from the National Institute of Standards and Technology
• http://snad.ncsl.nist.gov/nistnet/
−D4 (Dick’s Dynamic Delay Device) in OpenVMS
D4
• Capability added to OpenVMS Gigabit Ethernet LAN drivers
• Packets can be:
−Delayed
− Lost
• Bandwidth can be throttled/limited
D4
• Controlled by LAN SDA Extension:
−SDA> LAN DELAY PARAM /qualifiers
−SDA> LAN DELAY STATUS /qualifiers
D4
• LAN packets are handled / affected between a pair of Gigabit Ethernet NICs
• One non-Primary CPU recommended per pair of NICs
−Use Fast_Path to move interrupts off of Primary CPU onto a non-Primary CPU for both NICs
• So a quad-CPU OpenVMS system with 6 Gigabit Ethernet NICs can handle 3 LAN traffic streams
D4
• OpenVMS 8.3 or later, plus a LAN patch kit:
− 8.3 on Alpha: VMS83A_LAN-V0300 (or later)
− 8.3 on Integrity: VMS83I_LAN-V0700 (or later)
− 8.3-1H1: VMS831I_LAN-V0100 (or later)
• Functionality is contained in _MON images. Set SYSTEM_CHECK to 1, or:
− Copy SYS$LOADABLE_IMAGES:SYS$EI1000_MON.EXE to SYS$LOADABLE_IMAGES:SYS$EI1000.EXE
− Copy SYS$LOADABLE_IMAGES:SYS$EW5700_MON.EXE to SYS$LOADABLE_IMAGES:SYS$EW5700.EXE
Example D4_SETUP.COM
$ !
$ ! Configure RX4640 system for LAN Delay Function using EIC/EID, EIE/EIF, EWA/EWB
$ !
$ set noon
$ !
$ ! Set preferred CPU of other devices
$ !
$ set dev fga0/pref=0
$ set dev fgb0/pref=0
$ set dev eia/pref=0
$ set dev eib/pref=0
$ set dev eig/pref=0
$ set dev eih/pref=0
$ set dev ewc/pref=0
$ !
$ ! Devices to use are the AB465A Broadcom ports (Ruchba combo)
$ !
$ set dev ewa/pref=1
$ set dev ewb/pref=1
$ !
$ ! Devices to use are the A7012A Intel ports
$ !
$ set dev eic/pref=2
$ set dev eid/pref=2
$ !
$ ! Devices to use are the AB545A Intel ports (quad card)
$ !
$ set dev eie/pref=3
$ set dev eif/pref=3
$ !
$ ! Turn off LAN driver tracing on all devices
$ !
$ mc lancp set dev/notrace/all
$ ! Turn on LAN driver tracing on interesting devices, excluding fork begin/end entries
$ !
$ ! mc lancp set dev/trace=(mask=(%xFFFFFFF3,-1),size=2048) ewa
$ ! mc lancp set dev/trace=(mask=(%xFFFFFFF3,-1),size=2048) ewb
$ ! mc lancp set dev/trace=(mask=(%xFFFFFFF3,-1),size=2048) eic
$ ! mc lancp set dev/trace=(mask=(%xFFFFFFF3,-1),size=2048) eid
$ ! mc lancp set dev/trace=(mask=(%xFFFFFFF3,-1),size=2048) eie
$ ! mc lancp set dev/trace=(mask=(%xFFFFFFF3,-1),size=2048) eif
SDA> LAN commands
• SDA> LAN DELAY PARAM /DEVICE=(device1,device2) /AGE=value /BANDWIDTH=value /BUFFER=value /DELAY=value /LOSS=value /TLOSS=value
SDA> LAN commands
−/DEVICE=(device1,device2) specifies the two LAN devices to use. They must both be assigned to the same secondary CPU.
−/DELAY=value specifies the amount of delay in microseconds to be imposed on each received packet before it is transmitted on the other device. Zero is the default.
−/BANDWIDTH=value specifies the maximum bandwidth allowed in megabits per second. Zero (default) means there is no bandwidth limit.
SDA> LAN commands
−/AGE=value specifies the packet age limit to be imposed, in microseconds. Packets older than this age are discarded. Zero (default) means there is no age limit.
−/BUFFER=value specifies the maximum amount of data in bytes to be buffered. Incoming packets that would cause this limit to be exceeded are discarded. Zero (default) means there is no buffering limit.
−/LOSS=value specifies the packet loss rate to be imposed, as the number of packets to be discarded each second. Zero (default) is no intentional packet loss.
−/TLOSS=value specifies the total number of packets to be discarded. Zero (default) means there is no limit to the number of packets that will be discarded.
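Putting these qualifiers together, a simulated WAN link matching the status display shown later in this talk (5 ms of delay and a 50 Mb/s bandwidth cap on the EIC/EID device pair — device names are from that example and should be adjusted for your configuration) might be set up as:

```
SDA> LAN DELAY PARAM /DEVICE=(EIC,EID) /DELAY=5000 /BANDWIDTH=50
```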
• SDA> LAN DELAY STATUS /DEVICE=(device1,device2) /CONTINUOUS=value /HISTOGRAM /RESET
−/DEVICE=(device1,device2) specifies the two LAN devices to use. They must both be assigned to the same secondary CPU. If no devices are specified, status will be displayed for all device pairs.
−/CONTINUOUS=value specifies that the status display is to be repeated every value seconds. The default is no repetitions.
−/HISTOGRAM specifies that histogram data should be displayed, which includes:
• Delay Variance (not a true statistical variance) – the difference between the expected time that a transmit was to be issued and the time it actually was. For example, if the specified delay was 50 microseconds and a packet was transmitted 55 microseconds after the packet was received, the histogram bucket incremented is the one for 5 microseconds. This gives you an idea of how accurate the delay function is. There are 64 buckets of 1024 CPU cycles each, so on a 1000 MHz processor each bucket is 1.024 microseconds wide. Note that this does not include any additional delay, such as when the transmit queue on the device backs up because of load or the effect of flow control.
• Packets Outstanding – the number of packets outstanding to the other device for transmit. There are 16 buckets of 64 packets each, so the first bucket is for 0-63 packets outstanding, etc.
• Bytes Outstanding – the number of bytes outstanding to the other device for transmit. There are 16 buckets of 64K bytes each, so the first bucket is for 0-65535 bytes, etc.
• Packet Length – the length of each received packet, in the 16 buckets given in the display: 64..127, 128..191, etc.
−/RESET – clears the counters before the display (you can also use LAN DELAY PARAM /DEVICE=(device1,device2) to clear the counters).
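For ongoing monitoring, the STATUS qualifiers above can be combined — for example, clearing the counters and then repeating a histogram display for the EIC/EID pair every 10 seconds (the interval and device names here are illustrative):

```
SDA> LAN DELAY STATUS /DEVICE=(EIC,EID) /HISTOGRAM /RESET /CONTINUOUS=10
```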
LAN DELAY STATUS Example
WAN$SDA(X-1) Extension on VLAN4 (HP rx4640 (1.30GHz/3.0MB)) at 9-JUL-2006 13:02:10.96
---------------------------------------------------------------------------------------
Device 1: EIC (Active) Device 2: EID (Active) CPU affinity: 2
Delay (usec): 5000 Max packet age (usecs): 0 Loss rate (pk/sec): 0
Bandwidth (mbits/sec): 50 Max buffering (bytes): 0 Total loss (pks): 0
EIC Xmt (pk) 1668495 (by) 13668246768 (mpk) 8 (mby) 1264 Lost (age) 0
EIC Rcv (pk) 1668228 (by) 13666059504 (mpk) 8 (mby) 1264 Lost (buffering) 0
EIC MBits/sec (128 pk) Xmt 0.00 Rcv 0.00 X+R 0.00 Lost (intentional) 0
EIC MBits/sec (512 pk) Xmt 0.00 Rcv 0.00 X+R 0.01 Lost (pool) 0
EIC MBits/sec (4096 pk) Xmt 0.04 Rcv 0.04 X+R 0.08 Current xmt (pk) 0/8
EIC MBits/sec (All pk) Xmt 11.91 Rcv 11.91 X+R 23.83 Current xmt (by) 0/57344
EIC Failures: Link 1 Xmt 0 Rcv 0 Elapsed time (sec) 9178
EID Xmt (pk) 1668228 (by) 13666059504 (mpk) 8 (mby) 1264 Lost (age) 0
EID Rcv (pk) 1668594 (by) 13669057776 (mpk) 8 (mby) 1264 Lost (buffering) 0
EID MBits/sec (128 pk) Xmt 0.00 Rcv 0.00 X+R 0.00 Lost (intentional) 0
EID MBits/sec (512 pk) Xmt 0.00 Rcv 0.00 X+R 0.01 Lost (pool) 0
EID MBits/sec (4096 pk) Xmt 0.04 Rcv 0.04 X+R 0.08 Current xmt (pk) 100/483
EID MBits/sec (All pk) Xmt 11.91 Rcv 11.91 X+R 23.83 Current xmt (by) 819200/3956736
EID Failures: Link 1 Xmt 0 Rcv 0 Elapsed time (sec) 9178
SDA>
LAN DELAY STATUS/HISTOGRAM Example
WAN$SDA(X-1) Extension on VLAN4 (HP rx4640 (1.30GHz/3.0MB)) at 27-AUG-2006 13:32:33.17
---------------------------------------------------------------------------------------
Device 1: EIC (Active) Device 2: EID (Active) CPU affinity: 2
Delay (usec): 0 Max packet age (usecs): 0 Loss rate (pk/sec): 0
Bandwidth (mbits/sec): 0 Max buffering (bytes): 0 Total loss (pks): 0
EIC Delay Variance (0..49+ usec): - - - 23% 44% 19% 10% - - 1% 3% - - - - -
- - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - -
EIC Packets Outstanding (0..960+): 100% - - - - - - - - - - - - - - -
EIC Bytes Outstanding (0..960k+) : 100% - - - - - - - - - - - - - - -
EIC Packet Length: 64+ 128+ 192+ 256+ 384+ 448+ 512+ 756+ 1024 1280 1519 2048 3072 4096 6144 8192
EIC Packets: 33% 1% - 32% - - - - 17% 13% - - - - - 4%
EID Delay Variance (0..49+ usec): - - - 25% 46% 23% 2% - - 1% 3% - - - - -
- - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - -
EID Packets Outstanding (0..960+): 100% - - - - - - - - - - - - - - -
EID Bytes Outstanding (0..960k+) : 100% - - - - - - - - - - - - - - -
EID Packet Length: 64+ 128+ 192+ 256+ 384+ 448+ 512+ 756+ 1024 1280 1519 2048 3072 4096 6144 8192
EID Packets: 33% 1% - 32% - - - - 17% 13% - - - - - 4%
Real-Life Examples
Real-Life Example: Credit Lyonnais, Paris
•Credit Lyonnais fire in May 1996
•OpenVMS multi-site cluster with data replication between sites (Volume Shadowing) saved the data
•Fire occurred over a weekend; the DR site plus quick procurement of replacement hardware allowed the bank to reopen on Monday
Source: Metropole Paris
“In any disaster, the key is to protect the data. If you lose your CPUs, you can replace them. If you lose your network, you can rebuild it. If you lose your data, you are down for several months. In the capital markets, that means you are dead. During the fire at our headquarters, the DIGITAL VMS Clusters were very effective at protecting the data.”
Patrick Hummel, IT Director, Capital Markets Division, Credit Lyonnais
Headquarters for Manhattan's Municipal Credit Union (MCU) were across the street from the World Trade Center, and were devastated on Sept. 11. "It took several days to salvage critical data from hard-drive arrays and back-up tapes and bring the system back up" ... "During those first few chaotic days after Sept. 11, MCU allowed customers to withdraw cash from its ATMs, even when account balances could not be verified. Unfortunately, up to 4,000 people fraudulently withdrew about $15 million."
Ann Silverthorn, Network World Fusion, 10/07/2002
http://www.nwfusion.com/research/2002/1007feat2.html
Real-Life Examples: Commerzbank on 9/11
• Datacenter near WTC towers
• Generators took over after power failure, but dust & debris eventually caused A/C units to fail
• Data replicated to remote site 30 miles away
• One AlphaServer continued to run despite 104° F temperatures, running off the copy of the data at the opposite site after the local disk drives had succumbed to the heat
• See http://h71000.www7.hp.com/openvms/brochures/commerzbank/
“Because of the intense heat in our datacenter, all systems crashed except for our AlphaServer GS160... OpenVMS wide-area clustering and volume-shadowing technology kept our primary system running off the drives at our remote site 30 miles away.”
Werner Boensch, Executive Vice President, Commerzbank, North America
Real-Life Examples of OpenVMS: International Securities Exchange
• All-electronic stock derivatives (options) exchange
• First new stock exchange in the US in 26 years
• Went from nothing to majority market share in 3 years
• OpenVMS Disaster-Tolerant Cluster at the core, surrounded by other OpenVMS systems
• See http://h71000.www7.hp.com/openvms/brochures/ise/
“OpenVMS is a proven product that’s been battle-tested in the field. That’s why we were extremely confident in building the technology architecture of the ISE on OpenVMS AlphaServer systems.”
Danny Friel, Sr. Vice President, Technology / Chief Information Officer, International Securities Exchange
“ We just had a disaster at one of our 3 sites 4 hours ago. Both the site's 2 nodes and 78 shadow members dropped when outside contractors killed all power to the computer room during maintenance. Fortunately the mirrored site 8 miles away and a third quorum site in another direction kept the cluster up after a minute of cluster state transition.”
Lee Mah,Capital Health Authority
writing in comp.os.vms, Aug. 20, 2004
“I have lost an entire data center due to a combination of a faulty UPS combined with a car vs. power pole, and again when we needed to do major power maintenance. Both times, the remaining half of the cluster kept us going.”
Ed Wilts, Merrill Corporation
writing in comp.os.vms, July 22, 2005
Business Continuity
Business Continuity: Not Just IT
•The goal of Business Continuity is the ability for the entire business, not just IT, to continue operating despite a disaster.
•Not just computers and data:
−People
−Facilities
−Communications: Data networks and voice
−Transportation
−Supply chain, distribution channels
−etc.
Useful Resources
Business Continuity Resources
• Disaster Recovery Journal:
− http://www.drj.com/
• Continuity Insights Magazine:
− http://www.continuityinsights.com//
• Contingency Planning & Management Magazine
− http://www.contingencyplanning.com/
• All are high-quality journals. The first two are available free to qualified subscribers
• All hold conferences as well
Multi-OS Disaster-Tolerant Reference Architectures Whitepaper
• Entitled “Delivering high availability and disaster tolerance in a multi-operating-system HP Integrity server environment”
• Describes DT configurations across all of HP’s platforms: HP-UX, OpenVMS, Linux, Windows, and NonStop
• http://h71028.www7.hp.com/ERC/downloads/4AA0-6737ENW.pdf
Tabb Research Report
• "Crisis in Continuity: Financial Markets Firms Tackle the 100 km Question"
−available from https://h30046.www3.hp.com/campaigns/2005/promo/wwfsi/index.php?mcc=landing_page&jumpid=ex_R2548_promo/fsipaper_mcc%7Clanding_page
Draft Interagency White Paper
• "Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System"
− http://www.sec.gov/news/studies/34-47638.htm
• Agencies involved: Federal Reserve System, Department of the Treasury, Securities & Exchange Commission (SEC)
• Applies to: Financial institutions critical to the US economy
• But many other agencies around the world are adopting similar rules
Business Continuity and Disaster Tolerance Services from HP
Web resources:
• BC Services:
− http://h20219.www2.hp.com/services/cache/10107-0-0-225-121.aspx
• DT Services:
− http://h20219.www2.hp.com/services/cache/10597-0-0-225-121.aspx
OpenVMS Disaster-Tolerant Cluster Resources
• OpenVMS Documentation at the OpenVMS website:
− OpenVMS Cluster Systems
− HP Volume Shadowing for OpenVMS
− Guidelines for OpenVMS Cluster Configurations
• OpenVMS High-Availability and Disaster-Tolerant Cluster information at the HP corporate website: http://h71000.www7.hp.com/availability/index.html and http://h18002.www1.hp.com/alphaserver/ad/disastertolerance.html
• More-detailed seminar and workshop notes at http://www2.openvms.org/kparris/ and http://www.geocities.com/keithparris/
• Book “VAXcluster Principles” by Roy G. Davis, Digital Press, 1993, ISBN 1-55558-112-9
Questions?
Speaker Contact Info:
•Keith Parris
•E-mail: [email protected] [email protected]
•Web: http://www2.openvms.org/kparris/