© 2008 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
How the Disaster Proof OpenVMS Cluster Recovered So Fast, and How Yours Can, Too
Keith Parris Systems/Software Engineer
HP
Monday, May 19 and Wednesday, May 21
Story of the OpenVMS Cluster in the Disaster Proof Video
4 30 July 2015
Disaster Proof Demonstration and Video
Camden, Arkansas NTS
The Failover Datacenter
The original “green” datacenter
Nature gets in on the act!
KABOOM! Arkansas on the ground
OpenVMS Disaster-Proof configuration & application
[Diagram: a stream of I/Os to a shadow set mirrored across XP12000 and XP24000 storage arrays, served by the cluster nodes KABOOM:: (AlphaServer ES40), QUORUM:: (Integrity rx2620), and SDBOOM:: (Integrity Superdome).]
• All I/Os must complete to all spindles before a write is considered done.
• When a spindle drops out, the shadow set is reduced.
• I/Os “in flight” wait for the shadow set to be reduced.
The longest-outstanding I/O request during the Disaster Proof demo was 13.71 seconds.
GQB ready for a ride!
Disaster Proof Demo OpenVMS Cluster
How the Disaster Proof OpenVMS Cluster Recovered So Fast, and How Yours Can, Too
OpenVMS Cluster Failure Detection Mechanisms and Cluster State Transitions
OpenVMS Cluster Connection Manager and Transient Failures
• Some failures are temporary and transient
− Especially in a LAN environment
• To prevent the disruption of unnecessarily removing a node from the cluster, when a communications failure is detected the Connection Manager waits for a time in the hope that the problem will go away by itself
− This time is called the Reconnection Interval
• SYSGEN parameter RECNXINTERVAL
− RECNXINTERVAL is dynamic and may thus be temporarily raised if needed for something like a scheduled LAN outage
OpenVMS Cluster Connection Manager and Communications or Node Failures
• If the Reconnection Interval passes without connectivity being restored, or if the node has “gone away”, the cluster cannot continue without a reconfiguration
• This reconfiguration is called a State Transition, and one or more nodes will be removed from the cluster
Failure and Repair/Recovery within Reconnection Interval
[Timeline: failure occurs → failure detected (virtual circuit broken) → problem fixed → fixed state detected (virtual circuit re-opened). The RECNXINTERVAL timer runs from failure detection, and here the repair is detected before it expires.]
Hard Failure
[Timeline: failure occurs → failure detected (virtual circuit broken) → RECNXINTERVAL expires → state transition (node removed from cluster).]
Late Recovery
[Timeline: failure occurs → failure detected (virtual circuit broken) → RECNXINTERVAL expires → state transition (node removed from cluster). The problem is then fixed and the fix detected, but too late: when the node learns it has been removed from the cluster, it does a CLUEXIT bugcheck.]
Failure Detection Mechanisms
• Mechanisms to detect a node or communications failure
− Last-Gasp Datagram
− Periodic checking
• Multicast Hello packets on LANs
• Polling on CI and DSSI
• TIMVCFAIL check
PEDRIVER Hello Packet Timing
• Hello packet Transmit Interval
−Default is 3 seconds
−Dithered by reducing the interval by as much as half, to avoid forming “packet trains”
• so Hellos could be spaced as close as 1.5 seconds, or as far apart as 3 seconds
• Hello packet Listen Timeout
−Default is 8 seconds
−Allows detection of a failure within 8 to 9 seconds
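The "between 8 and 9 seconds" figure is the listen timeout plus up to one second of timer granularity. A minimal sketch in Python (illustrative only; the interval values are the defaults quoted above):

```python
# Worst-case failure detection time for PEDRIVER Hello packets.
# Assumes the defaults above: 8-second listen timeout, and a listen
# timer that is checked once per second.

LISTEN_TIMEOUT = 8.0      # seconds without a Hello before the VC is broken
TIMER_GRANULARITY = 1.0   # listen timer is checked once per second

# Best case: the timeout is noticed exactly when it expires.
best_case = LISTEN_TIMEOUT
# Worst case: expiry is only noticed at the next one-second tick.
worst_case = LISTEN_TIMEOUT + TIMER_GRANULARITY

print(f"failure detected between {best_case:.0f} and {worst_case:.0f} seconds")
```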
Failure Detection on LAN Interconnects
[Diagram: the remote node sends Hello packets at t=0, 3, 6, and 9 seconds. The local node's listen timer counts up with each clock tick and is reset to zero whenever a Hello arrives. One lost Hello (at t=6) just lets the timer count a little higher before the t=9 Hello resets it.]
Failure Detection on LAN Interconnects
[Diagram: after the Hello at t=0, the Hellos at t=3 and t=6 are lost. The local node's listen timer keeps counting; when it reaches 8 seconds without a Hello, the virtual circuit is broken.]
TIMVCFAIL Mechanism
[Diagram: the local node sends a request to the remote node at t=1/3 of TIMVCFAIL and another at t=2/3; each request receives a response, so the virtual circuit stays open.]
TIMVCFAIL Mechanism
[Diagram: the response to the request at t=1/3 of TIMVCFAIL arrives (1), but the node fails some time before t=2/3 and the second request (2) receives no response. At t=TIMVCFAIL the virtual circuit is broken; worst-case detection thus takes one full TIMVCFAIL period.]
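The detection bounds implied by this timeline come out of simple arithmetic. A Python sketch, assuming TIMVCFAIL is in 10-millisecond units (as the demo settings quoted later indicate) and that the connection is checked every third of the period:

```python
# Failure-detection bounds from the TIMVCFAIL mechanism.
# Assumption: TIMVCFAIL is in 10-millisecond units, checks occur every
# TIMVCFAIL/3, and the virtual circuit is broken if no response has
# arrived by the full TIMVCFAIL period.

def timvcfail_bounds(timvcfail_units):
    """Return (best, worst) detection times in seconds."""
    period = timvcfail_units * 0.010   # convert 10-ms units to seconds
    # A node that dies just before a check is detected in about 2/3 of
    # the period; one that dies just after responding takes the full period.
    return (2.0 * period / 3.0, period)

best, worst = timvcfail_bounds(1600)   # default setting (16 seconds)
print(f"default: detected in {best:.1f} to {worst:.1f} s")
best, worst = timvcfail_bounds(400)    # Disaster Proof demo setting
print(f"demo:    detected in {best:.1f} to {worst:.1f} s")
```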
Sequence of Events During a State Transition
• Determine new cluster configuration
• If quorum is lost:
− QUORUM capability bit removed from all CPUs
− No process can be scheduled to run
− Disks all put into mount verification
• If quorum is not lost, continue…
• Rebuild lock database:
− Stall lock requests
− I/O synchronization
− Do rebuild work
− Resume lock handling
Measuring State Transition Effects
• Determine the type of the last lock rebuild:
$ ANALYZE/SYSTEM
SDA> READ SYS$LOADABLE_IMAGES:SCSDEF
SDA> EVALUATE @(@CLU$GL_CLUB + CLUB$B_NEWRBLD_REQ) & FF
Hex = 00000002   Decimal = 2   ACP$V_SWAPPRV
• Rebuild type values:
1. Merge (locking not disabled)
2. Partial
3. Directory
4. Full
Measuring State Transition Effects
• Determine the duration of the last lock request stall period:
SDA> DEFINE TOFF = @(@CLU$GL_CLUB+CLUB$L_TOFF)
SDA> DEFINE TON = @(@CLU$GL_CLUB+CLUB$L_TON)
SDA> EVALUATE TON-TOFF
Hex = 0000026B Decimal = 619 PDT$Q_COMQH+00003
Minimizing Impact of State Transitions
• Configuration issues:
− Few (e.g. exactly 3) nodes
− Quorum node; no quorum disk
− Set up LAN cluster interconnect to minimize length of time packet-forwarding is blocked
• Original IEEE 802.1d Spanning Tree algorithm could take 35-40 seconds to converge and start forwarding packets again
− Two completely-independent spanning trees could help avoid communications being blocked on both at once
• Newer IEEE 802.1w Rapid Spanning Tree (and IEEE 802.1s Multiple Spanning Tree) protocols can be configured to recover in less than 1 second
Disaster Proof Demonstration Settings and Behavior
OpenVMS System Parameter Settings for the Disaster Proof Demonstration
• SHADOW_MBR_TMO lowered from default of 120 down to 8 seconds
• RECNXINTERVAL lowered from default of 20 down to 10 seconds
• TIMVCFAIL lowered from default of 1600 to 400 (4 seconds, in 10-millisecond clock units) to detect node failure in 4 seconds, worst-case, (detecting failure at the SYSAP level)
• LAN_FLAGS bit 12 set to enable Fast LAN Transmit Timeout (give up on a failed packet transmit in 1.25 seconds, worst case, instead of an order of magnitude more in some cases)
• PE4 set to hexadecimal 0703 (Hello transmit interval of 0.7 seconds, nominal; Listen Timeout of 3 seconds), to detect node failure in 3-4 seconds at the PEDRIVER level
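The PE4 value can be unpacked to show where the 3-4 second figure comes from. A Python sketch; the byte layout (low byte = listen timeout in seconds, next byte = Hello interval in tenths of a second) is inferred from the slide's description of 0703, not from documentation:

```python
# Unpacking the PE4 value used in the demo (hexadecimal 0703).
# PE4 is an undocumented PEDRIVER tuning parameter; the assumed layout
# below is inferred from this presentation, so treat it as illustration:
#   low byte  = Hello listen timeout, in seconds
#   next byte = Hello transmit interval, in tenths of a second

PE4 = 0x0703

listen_timeout_s = PE4 & 0xFF                  # 0x03 -> 3 seconds
hello_interval_s = ((PE4 >> 8) & 0xFF) / 10    # 0x07 -> 0.7 seconds

print(f"Hello transmit interval: {hello_interval_s} s")
print(f"Listen timeout: {listen_timeout_s} s")
# With roughly one second of timer granularity, detection at the
# PEDRIVER level falls between the timeout and one second more:
# the 3-4 seconds quoted on the slide.
print(f"detection window: {listen_timeout_s} to {listen_timeout_s + 1} s")
```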
Disaster Proof Demo Timeline
• Time = 0: Explosion occurs
• Time around 3.5 seconds: Node failure detected, via either PEDRIVER Hello Listen Timeout or TIMVCFAIL mechanism. VC closed; Reconnection Interval starts.
• Time = 8 seconds: Shadow Member Timeout expires; shadowset members removed.
• Time around 13.5 seconds: Reconnection Interval expires; State Transition begins.
• Time = 13.71 seconds: Recovery complete; Application processing resumes.
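The timeline above is essentially the parameter settings added up. A Python sketch tying the numbers together (the 3.5-second detection figure and the 13.71-second total are measured values from the demo; the rest is arithmetic on the settings):

```python
# The Disaster Proof demo timeline, reconstructed from the settings.

SHADOW_MBR_TMO = 8.0    # s: failed shadow set members removed
RECNXINTERVAL = 10.0    # s: Connection Manager wait after detection

detection = 3.5                                # PEDRIVER or TIMVCFAIL (measured)
members_removed = SHADOW_MBR_TMO               # 8 s after the explosion
transition_begins = detection + RECNXINTERVAL  # about 13.5 s
recovery_complete = 13.71                      # longest outstanding I/O (measured)

for label, t in [("node failure detected", detection),
                 ("shadow set members removed", members_removed),
                 ("state transition begins", transition_begins),
                 ("application resumes", recovery_complete)]:
    print(f"T = {t:5.2f} s  {label}")
```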
Disaster Proof Demo Timeline
[Timeline: T = 0: explosion. T ≈ 3.5 seconds: failure detected (PEDRIVER Hello Listen Timeout or TIMVCFAIL timeout). T = 8 seconds: Shadow Member Timeout; failed shadowset members removed. T ≈ 13.5 seconds: Reconnection Interval expires; node removed from cluster; cluster state transition and lock database rebuild begin. T = 13.71 seconds: application resumes.]
Simulation and Testing of Long Distance DR/DT Configurations
Trends
Trends
• Increase in disasters
• Longer inter-site distances for better protection
• Business pressures for shorter distances for performance
• Increasing pressure not to bridge LANs between sites
Trends
− Increase in Disasters
“Natural disasters have quadrupled over the last two decades, from an average of 120 a year in the early 1980s to as many as 500 today.”
Continuity Insights Magazine
Nov./Dec. 2007 issue, page 10
“There has been a six-fold increase in floods since 1980. The number of floods and wind-storms have increased from 60 in 1980 to 240 last year.”
Continuity Insights Magazine
Nov./Dec. 2007 issue, page 10
Increase in Disasters
http://www.oxfam.org/en/files/bp108_climate_change_alarm_0711.pdf/download
Trends
− Longer inter-site distances for better protection
“Some CIOs are imagining potential disasters that go well beyond the everyday hiccups that can disrupt applications and networks. Others, recognizing how integral IT is to business today, are focusing on the need to recover instantaneously from any unforeseen event.” …“It's a different world. There are so many more things to consider than the traditional fire, flood and theft.”
“Redefining Disaster“
Mary K. Pratt, Computerworld, June 20, 2005
http://www.computerworld.com/hardwaretopics/storage/story/0,10801,102576,00.html
Northeast US Before Blackout
Source: NOAA/DMSP
Northeast US After Blackout
Source: NOAA/DMSP
“The blackout has pushed many companies to expand their data center infrastructures to support data replication between two or even three IT facilities -- one of which may be located on a separate power grid.”
Computerworld, August 2, 2004
http://www.computerworld.com/securitytopics/security/recovery/story/0,10801,94944,00.html
“You have to be far enough apart to make sure that conditions in one place are not likely to be duplicated in the other.” … “A useful rule of thumb might be a minimum of about 50 km, the length of a MAN, though the other side of the continent might be necessary to play it safe.”
“Disaster Recovery Sites: How Far Away is Far Enough?”
Drew Robb, Datamation, October 4, 2005
http://www.enterprisestorageforum.com/continuity/features/article.php/3552971
Trends: Longer Inter-site Distances for Better Protection
• In the past, protection focused on risks like fires, floods, and tornadoes; 1 to 5 miles between sites was fine.
• Right after 9/11, 60 to 100 miles looked much better.
• After the Northeast Blackout of 2003, and with growing awareness that a terrorist group obtaining a nuclear device and wiping out an entire metropolitan area is no longer inconceivable:
− The resulting pressure is for inter-site distances of 1,000 to 1,500 miles
• Challenges:
− Telecommunications links
− Latency due to speed of light adversely affects performance
Trends
− Business pressures for shorter distances for performance
“A 1-millisecond advantage in trading applications can be worth $100 million a year to a major brokerage firm, by one estimate.”
Richard Martin, InformationWeek,
April 23, 2007
“The fastest systems, running from traders' desks to exchange data centers, can execute transactions in a few milliseconds -- so fast, in fact, that the physical distance between two computers processing a transaction can slow down how fast it happens.”
Richard Martin, InformationWeek, April 23, 2007
“This problem is called data latency -- delays measured in split seconds. To overcome it, many high-frequency algorithmic traders are moving their systems as close to the Wall Street exchanges as possible.”
Richard Martin, InformationWeek,
April 23, 2007
Trends
− Increasing pressure not to bridge LANs between sites
Trends: Increasing Resistance to LAN Bridging
• In the past, setting up a VLAN spanning sites for an OpenVMS disaster-tolerant cluster was common
• Networks are now IP-centric
• IP network mindset sees LAN bridging as “bad,” sometimes even “totally unacceptable”
• Alternatives:
− Separate, private link for OpenVMS Multi-site Cluster
− Metropolitan Area Networks (MANs) using MPLS
− Ethernet-over-IP (EoIP)
− SCS-over-IP support planned for OpenVMS 8.4
Site Selection and Inter-Site Distance
Planning for DT: Site Selection
Sites must be carefully selected:
• Avoid hazards
− Especially hazards common to both sites, which could cause the loss of both datacenters at once
• Make them a “safe” distance apart
• Select site separation in a “safe” direction
Planning for DT: What is a “Safe Distance”?
Analyze likely hazards of proposed sites:
• Natural hazards
− Fire (building, forest, gas leak, explosive materials)
− Storms (Tornado, Hurricane, Lightning, Hail, Ice)
− Flooding (excess rainfall, dam breakage, storm surge, broken water pipe)
− Earthquakes, Tsunamis
Planning for DT: What is a “Safe Distance”?
Analyze likely hazards of proposed sites:
• Man-made hazards
− Nearby transportation of hazardous materials (highway, rail)
− Terrorist with a bomb
− Disgruntled customer with a weapon
− Enemy attack in war (nearby military or industrial targets)
− Civil unrest (riots, vandalism)
Former Atlas E Missile Silo Site in Kimball, Nebraska
Planning for DT: Site Separation Distance
• Make sites a “safe” distance apart
• This must be a compromise. Factors:
− Risks
− Performance (inter-site latency)
− Interconnect costs
− Ease of travel between sites
− Availability of workforce
Planning for DT: Site Separation Distance
• Select site separation distance (threat radius):
− 1-3 miles: protects against most building fires, natural gas leaks, armed intruders, terrorist bombs
− 10-30 miles: protects against most tornadoes, floods, hazardous material spills, release of poisonous gas, non-nuclear military bomb strike
− 100-300 miles: protects against most hurricanes, earthquakes, tsunamis, forest fires, most biological weapons, most power outages, suitcase-sized nuclear bomb
− 1,000-3,000 miles: protects against “dirty” bombs, major region-wide power outages, and possibly military nuclear attacks
“You have to be far enough away to be beyond the immediate threat you are planning for.” … “At the same time, you have to be close enough for it to be practical to get to the remote facility rapidly.”
“Disaster Recovery Sites: How Far Away is Far Enough?” By Drew Robb
Enterprise Storage Forum, September 30, 2005
http://www.enterprisestorageforum.com/continuity/features/article.php/3552971
“A Watertight Plan” By Penny Lunt Crosman, IT Architect, Sept. 1, 2005
http://www.itarchitect.com/showArticle.jhtml?articleID=169400810
“Survivors of hurricanes, floods, and the London terrorist bombings offer best practices and advice on disaster recovery planning.”
Planning for DT: Site Separation Direction
• Select site separation direction:
− Not along same earthquake fault-line
− Not along likely storm tracks
− Not in same floodplain or downstream of same dam
− Not on the same coastline
− Not in line with prevailing winds (that might carry hazardous materials or radioactive fallout)
Long-Distance Disaster Tolerance Using OpenVMS Clusters
Background
Historical Context
Example: New York City, USA
• 1993 World Trade Center bombing raised awareness of DR and prompted some improvements
• Sept. 11, 2001 has had dramatic and far-reaching effects
−Scramble to find replacement office space
−Many datacenters moved off Manhattan Island, some out of NYC entirely
− Increased distances to DR sites
− Induced regulatory responses (in USA & abroad)
Trends and Driving Forces in the US
• BC, DR and DT in a post-9/11 world:
−Recognition of greater risk to datacenters
• Particularly in major metropolitan areas
−Push toward greater distances between redundant datacenters
• It is no longer inconceivable that, for example, terrorists might obtain a nuclear device and destroy the entire NYC metropolitan area
Trends and Driving Forces in the US
• “Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System”
−http://www.sec.gov/news/studies/34-47638.htm
• Agencies involved:
Federal Reserve System
Department of the Treasury
Securities & Exchange Commission (SEC)
• Applies to:
Financial institutions critical to the US economy
US Draft Interagency White Paper
The early “concept release” inviting input made mention of a 200-300 mile limit (only as part of an example when asking for feedback as to whether any minimum distance value should be specified or not):
“Sound practices. Have the agencies sufficiently described expectations regarding out-of-region back-up resources? Should some minimum distance from primary sites be specified for back-up facilities for core clearing and settlement organizations and firms that play significant roles in critical markets (e.g., 200-300 miles between primary and back-up sites)? What factors should be used to identify such a minimum distance?”
US Draft Interagency White Paper
This induced panic in several quarters:
• NYC feared additional economic damage of companies moving out
• Some pointed out the technology limitations of some synchronous mirroring products and of Fibre Channel at the time which typically limited them to a distance of 100 miles or 100 km
Revised draft contained no specific distance numbers; just cautionary wording
Ironically, that same non-specific wording now often results in DR datacenters 1,000 to 1,500 miles away
US Draft Interagency White Paper
“Maintain sufficient geographically dispersed resources to meet recovery and resumption objectives.”
“Long-standing principles of business continuity planning suggest that back-up arrangements should be as far away from the primary site as necessary to avoid being subject to the same set of risks as the primary location.”
US Draft Interagency White Paper
“Organizations should establish back-up facilities a significant distance away from their primary sites.”
“The agencies expect that, as technology and business processes … continue to improve and become increasingly cost effective, firms will take advantage of these developments to increase the geographic diversification of their back-up sites.”
Ripple effect of Regulatory Activity Within the USA
• National Association of Securities Dealers (NASD):
−Rule 3510 & 3520
• New York Stock Exchange (NYSE):
−Rule 446
Ripple effect of Regulatory Activity Outside the USA
• United Kingdom: Financial Services Authority:
− Consultation Paper 142 – Operational Risk and Systems Control
• Europe:
− Basel II Accord
• Australian Prudential Regulation Authority:
− Prudential Standard for business continuity management APS 232 and guidance note AGN 232.1
• Monetary Authority of Singapore (MAS):
− “Guidelines on Risk Management Practices – Business Continuity Management,” affecting “Significantly Important Institutions” (SIIs)
Resiliency Maturity Model project
• The Financial Services Technology Consortium (FSTC) has begun work on a Resiliency Maturity Model
−Taking inspiration from the Carnegie Mellon Software Engineering Institute’s Capability Maturity Model (CMM) and Networked Systems Survivability Program
− Intent is to develop industry standard metrics to evaluate an institution’s business continuity, disaster recovery, and crisis management capabilities
Long-distance Effects: Inter-site Latency
Long-distance Cluster Issues
• Latency due to speed of light becomes significant at higher distances. Rules of thumb:
− About 1 ms per 100 miles, one-way
− About 1 ms per 50 miles round-trip latency
• Actual circuit path length can be longer than highway mileage between sites
• Latency can adversely affect performance of
− Remote I/O operations
− Remote locking operations
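The rule of thumb is easy to apply. A Python sketch (illustrative only; real circuit paths and equipment delays vary):

```python
# Rule-of-thumb speed-of-light latency: about 1 ms per 100 circuit miles
# one-way, i.e. 1 ms per 50 miles round trip. (Light in fiber covers
# roughly 124 miles per millisecond; the rule of thumb allows for real
# circuit paths being longer and for equipment delays.)

def round_trip_ms(circuit_miles):
    return circuit_miles / 50.0

for miles in (100, 500, 1000, 1500):
    print(f"{miles:5d} circuit miles: ~{round_trip_ms(miles):.0f} ms round trip")
```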
OpenVMS Lock Request Latencies
[Chart data: lock request latency, in microseconds]
Gigabit Ethernet, zero distance:      200
Fast Ethernet, zero distance:         240
ATM, 30 miles:                        400
DS-3, 250 miles:                    4,400
OC-3, 1,400 miles:                 23,000
Inter-site Latency: Actual Customer Measurements
Highway Mileage      Link       Latency (ms)   Est. Circuit Path Length
5 miles              ATM OC-3   0.5            30 miles
35 miles             n/a        1.5            95 miles
25 to 35 miles       IP DLSW    3 to 4         190-250 miles (effective)
130 miles            DS-3       4.4            275 miles
“Over 150” miles     n/a        5.5            350 miles
1,250 miles          DS-3       30             1,875 miles
Differentiate between latency and bandwidth
• Can’t get around the speed of light and its latency effects over long distances
− Higher-bandwidth link doesn’t mean lower latency
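The point can be made concrete with the simple model transfer time = round-trip latency + size / bandwidth. A Python sketch with illustrative numbers (not from the slides):

```python
# For small synchronous transfers the latency term dominates, so a
# higher-bandwidth link barely helps. Illustrative figures only.

def transfer_ms(size_bytes, bandwidth_mbps, rtt_ms):
    return rtt_ms + (size_bytes * 8) / (bandwidth_mbps * 1000.0)

RTT_MS = 20.0  # ~1,000-mile circuit at 1 ms per 50 miles round trip
for mbps in (45, 155, 1000):             # roughly DS-3, OC-3, Gigabit
    t = transfer_ms(8192, mbps, RTT_MS)  # one 8 KB synchronous write
    print(f"{mbps:5d} Mb/s: {t:.1f} ms per 8 KB write")
```

Going from DS-3 to Gigabit Ethernet here shaves barely more than a millisecond off a 21 ms operation; only shortening the circuit reduces the dominant term.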
Long-distance Techniques:SAN Extension
SAN Extension
• Fibre Channel distance over fiber is limited to about 100 kilometers
−Shortage of buffer-to-buffer credits adversely affects Fibre Channel performance above about 50 kilometers
• Various vendors provide “SAN Extension” boxes to connect Fibre Channel SANs over an inter-site link
• See SAN Design Reference Guide Vol. 4 “SAN extension and bridging”:
−http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00310437/c00310437.pdf
Long-distance Data Replication
Disk Data Replication
• Data mirroring schemes
− Synchronous
• Slower, but no chance of data loss in conjunction with a site loss
− Asynchronous
• Faster, and works for longer distances, but can lose seconds’ or minutes’ worth of data (more under high loads) in a site disaster
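The asynchronous data-loss exposure can be roughly bounded: whatever was written during the replication lag is at risk. A Python sketch with illustrative numbers (not from the slides):

```python
# Rough bound on asynchronous-replication data loss: anything written
# during the replication lag is at risk if the primary site is destroyed.
# The 5 MB/s write rate is an illustrative assumption.

def data_at_risk_mb(write_rate_mb_per_s, lag_seconds):
    return write_rate_mb_per_s * lag_seconds

for lag_s in (1, 30, 300):   # seconds of replication lag
    print(f"lag {lag_s:3d} s at 5 MB/s writes: "
          f"{data_at_risk_mb(5, lag_s):,.0f} MB at risk")
```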
Continuous Access Synchronous Replication
[Diagram sequence: a node at each site connects through FC switches to an EVA array at each site; a mirrorset spans both EVAs, with one controller in charge of the mirrorset:
1. The host issues a write to the controller in charge of the mirrorset.
2. That controller forwards the write to the remote EVA.
3. The remote EVA returns success status.
4. The controller returns success status to the host, and the application continues.]
Continuous Access Asynchronous Replication
[Diagram sequence, same configuration:
1. The host issues a write to the controller in charge of the mirrorset.
2. The controller immediately returns success status, and the application continues.
3. The controller then forwards the write to the remote EVA, which returns its success status later.]
Synchronous versus Asynchronous Replication and Link Bandwidth
[Chart: application write bandwidth (MB/sec) over a day (8 am, 12 noon, 5 pm, 12 midnight). A synchronous link (RPO = 0) must be sized for the peak application write rate; an asynchronous link with an RPO of at most 2 hours can be sized smaller; an even smaller link implies an RPO of many hours.]
Data Replication and Long Distances
• Some vendors claim synchronous mirroring is impossible at a distance over 100 kilometers, 100 miles, or 200 miles, because their product cannot support synchronous mirroring over greater distances
• OpenVMS Volume Shadowing does synchronous mirroring
−Acceptable application performance is the only limit found so far on inter-site distance for HBVS
Long-distance SynchronousHost-based Mirroring Software Tests
• OpenVMS Host-Based Volume Shadowing (HBVS) software (host-based mirroring software)
• SAN Extension used to extend SAN using FCIP boxes
• AdTech box used to simulate distance via introduced packet latency
• No OpenVMS Cluster involved across this distance (no OpenVMS node at the remote end; just “data vaulting” to a “distant” disk controller)
Long-distance HBVS Test Results
Delay, 1-way     Throughput        Distance       Distance
(milliseconds)   (bytes/second)    (kilometers)   (miles)
0 ms             11 megabytes      0 km           0 miles
10 ms            226 kilobytes     2,000 km       1,250 miles
50 ms            45 kilobytes      10,000 km      6,250 miles
100 ms           24 kilobytes      20,000 km      12,500 miles
200 ms           15 kilobytes      40,000 km      25,000 miles
300 ms           9 kilobytes       60,000 km      37,500 miles
400 ms           8 kilobytes       80,000 km      50,000 miles
485 ms           6.5 kilobytes     97,000 km      60,625 miles
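These results are consistent with a roughly fixed amount of data in flight per round trip, so throughput falls off as 1/RTT. A Python sketch deriving the implied window from the measured points (an interpretation of the numbers, not a statement about how HBVS is implemented):

```python
# Implied "data in flight" per round trip: throughput ~= window / RTT,
# so window ~= throughput * RTT. Derived from the measured table above.

results = [              # (one-way delay in ms, throughput in bytes/s)
    (10, 226_000), (50, 45_000), (100, 24_000),
    (200, 15_000), (300, 9_000), (400, 8_000),
]

for delay_ms, bps in results:
    rtt_s = 2 * delay_ms / 1000.0        # round-trip time in seconds
    window_kb = bps * rtt_s / 1000.0     # implied data in flight
    print(f"{delay_ms:3d} ms one-way: implied window ~{window_kb:.1f} KB")
```

The implied window stays in the 4.5-6.5 KB range across a 40x spread of delays, which is why halving the latency roughly doubles throughput here while adding bandwidth would not.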
Mitigating the Effects of Long Inter-site Distances
Minimizing Round Trips Between Sites
• Some vendors have Fibre Channel SCSI-3 protocol tricks to do writes in 1 round trip vs. 2
−e.g. Brocade’s “FastWrite” or Cisco’s “Write Acceleration”
• Application design can also affect number of round-trips required between sites
Mitigating Impact of Inter-Site Latency
How applications are distributed across a multi-site OpenVMS cluster can affect performance
This represents a trade-off among performance, availability, and resource utilization
Application Scheme 1: Hot Primary/Cold Standby
• All applications normally run at the primary site
− Second site is idle, except for data replication work, until primary site fails, then it takes over processing
• Performance will be good (all-local locking)
• Fail-over time will be poor, and risk high (standby systems not active and thus not being tested)
• Wastes computing capacity at the remote site
Application Scheme 2: Hot/Hot but Alternate Workloads
• All applications normally run at one site or the other, but not both; data is mirrored between sites, and the opposite site takes over upon a failure
• Performance will be good (all-local locking)
• Fail-over time will be poor, and risk moderate (standby systems in use, but specific applications not active and thus not being tested from that site)
• Second site’s computing capacity is actively used
Application Scheme 3: Uniform Workload Across Sites
• All applications normally run at both sites simultaneously. (This would be considered the “norm” for most OpenVMS clusters.)
• Surviving site takes all load upon failure
• Performance may be impacted (some remote locking) if inter-site distance is large
• “Fail-over” time will be excellent, and risk low (all systems are already in use running the same applications, thus constantly being tested)
• Both sites’ computing capacity is actively used
Work-arounds being used today
• Multi-hop replication
−Synchronous to nearby site
−Asynchronous to far-away site
• Transaction-based replication
−e.g. replicate transaction (a few hundred bytes) with Reliable Transaction Router instead of having to replicate all the database page updates (often 8 kilobytes or 64 kilobytes per page) and journal log file writes behind a database
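The bandwidth saving is easy to quantify. A Python sketch using the slide's figures (300 bytes stands in for "a few hundred"; journal writes are left out, so the real saving is larger):

```python
# Bandwidth ratio of replicating transactions versus database pages.

TXN_BYTES = 300            # one replicated transaction ("a few hundred bytes")
PAGE_BYTES = 8 * 1024      # one 8 KB database page update

ratio = PAGE_BYTES / TXN_BYTES
print(f"one 8 KB page update is ~{ratio:.0f}x one replicated transaction")
```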
Data Replication over Long Distances: Multi-Hop Replication
• It may be desirable to synchronously replicate data to a nearby “short-haul” site, and asynchronously replicate from there to a more-distant site
− This is sometimes called “cascaded” data replication
[Diagram: Primary → (synchronous, short-haul, 100 miles) → Secondary → (asynchronous, long-haul, 1,000 miles) → Tertiary]
Testing & Simulation of Long Distances
Testing / Simulation
• Before incurring the risk and expense of site selection, datacenter construction, and inter-site link procurement:
• Test within a single-datacenter test environment, with distance simulated by introducing packet latency, and bandwidth simulated by throttling traffic flow
• Techniques for simulating distance with latency:
−Hardware Network Emulators
−Software Network Emulators
Hardware Network Emulators
• A couple of vendors / products:
−Shunra STORM Network Emulator
−Spirent AdTech
Software Network Emulators
• A couple of examples:
−NIST Net from the National Institute of Standards and Technology
• http://snad.ncsl.nist.gov/nistnet/
−D4 (Dick’s Dynamic Delay Device) in OpenVMS
D4
• Capability added to OpenVMS Gigabit Ethernet LAN drivers
• Packets can be:
−Delayed
− Lost
• Bandwidth can be throttled/limited
D4
• Controlled by LAN SDA Extension:
−SDA> LAN DELAY PARAM /qualifiers
−SDA> LAN DELAY STATUS /qualifiers
D4
• LAN packets are handled / affected between a pair of Gigabit Ethernet NICs
• One non-Primary CPU recommended per pair of NICs
−Use Fast_Path to move interrupts off of Primary CPU onto a non-Primary CPU for both NICs
• So a quad-CPU OpenVMS system with 6 Gigabit Ethernet NICs can handle 3 LAN traffic streams
D4
• OpenVMS 8.3 or later, plus a LAN patch kit:
− 8.3 on Alpha: VMS83A_LAN-V0300 (or later)
− 8.3 on Integrity: VMS83I_LAN-V0700 (or later)
− 8.3-1H1: VMS831I_LAN-V0100 (or later)
• Functionality is contained in _MON images. Set SYSTEM_CHECK to 1, or:
− Copy SYS$LOADABLE_IMAGES:SYS$EI1000_MON.EXE to SYS$LOADABLE_IMAGES:SYS$EI1000.EXE
− Copy SYS$LOADABLE_IMAGES:SYS$EW5700_MON.EXE to SYS$LOADABLE_IMAGES:SYS$EW5700.EXE
Example D4_SETUP.COM
$ !
$ ! Configure RX4640 system for LAN Delay Function using EIC/EID, EIE/EIF, EWA/EWB
$ !
$ set noon
$ !
$ ! Set preferred CPU of other devices
$ !
$ set dev fga0/pref=0
$ set dev fgb0/pref=0
$ set dev eia/pref=0
$ set dev eib/pref=0
$ set dev eig/pref=0
$ set dev eih/pref=0
$ set dev ewc/pref=0
$ !
$ ! Devices to use are the AB465A Broadcom ports (Ruchba combo)
$ !
$ set dev ewa/pref=1
$ set dev ewb/pref=1
$ !
$ ! Devices to use are the A7012A Intel ports
$ !
$ set dev eic/pref=2
$ set dev eid/pref=2
$ !
$ ! Devices to use are the AB545A Intel ports (quad card)
$ !
$ set dev eie/pref=3
$ set dev eif/pref=3
$ !
$ ! Turn off LAN driver tracing on all devices
$ !
$ mc lancp set dev/notrace/all
$ ! Turn on LAN driver tracing on interesting devices, excluding fork begin/end entries
$ !
$ ! mc lancp set dev/trace=(mask=(%xFFFFFFF3,-1),size=2048) ewa
$ ! mc lancp set dev/trace=(mask=(%xFFFFFFF3,-1),size=2048) ewb
$ ! mc lancp set dev/trace=(mask=(%xFFFFFFF3,-1),size=2048) eic
$ ! mc lancp set dev/trace=(mask=(%xFFFFFFF3,-1),size=2048) eid
$ ! mc lancp set dev/trace=(mask=(%xFFFFFFF3,-1),size=2048) eie
$ ! mc lancp set dev/trace=(mask=(%xFFFFFFF3,-1),size=2048) eif
SDA> LAN commands
• SDA> LAN DELAY PARAM /DEVICE=(device1,device2) /AGE=value /BANDWIDTH=value /BUFFER=value /DELAY=value /LOSS=value /TLOSS=value
SDA> LAN commands
−/DEVICE=(device1,device2) specifies the two LAN devices to use. They must both be assigned to the same secondary CPU.
−/DELAY=value specifies the amount of delay in microseconds to be imposed on each received packet before it is transmitted on the other device. Zero is the default.
−/BANDWIDTH=value specifies the maximum bandwidth allowed in megabits per second. Zero (default) means there is no bandwidth limit.
SDA> LAN commands
−/AGE=value specifies the packet age limit to be imposed, in microseconds. Packets older than this age are discarded. Zero (default) means there is no age limit.
−/BUFFER=value specifies the maximum amount of data in bytes to be buffered. Incoming packets that would cause this limit to be exceeded are discarded. Zero (default) means there is no buffering limit.
−/LOSS=value specifies the packet loss rate to be imposed, as the number of packets to be discarded each second. Zero (default) is no intentional packet loss.
−/TLOSS=value specifies the total number of packets to be discarded. Zero (default) means there is no limit to the number of packets that will be discarded.
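Putting these qualifiers together, a simulated WAN link matching the status display shown later in this talk (5 ms of delay and a 50 Mb/s bandwidth cap on the EIC/EID device pair — device names are from that example and should be adjusted for your configuration) might be set up as:

```
SDA> LAN DELAY PARAM /DEVICE=(EIC,EID) /DELAY=5000 /BANDWIDTH=50
```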
• SDA> LAN DELAY STATUS /DEVICE=(device1,device2) /CONTINUOUS=value /HISTOGRAM /RESET
−/DEVICE=(device1,device2) specifies the two LAN devices to use. They must both be assigned to the same secondary CPU. If no devices are specified, status will be displayed for all device pairs.
−/CONTINUOUS=value specifies that the status display is to be repeated every value seconds. The default is no repetitions.
−/HISTOGRAM specifies that histogram data should be displayed, which includes:
• Delay Variance (not a true statistical variance) – the difference between the expected time that a transmit was to be issued and the time it actually was. For example, if the specified delay was 50 microseconds and a packet was transmitted 55 microseconds after the packet was received, the histogram bucket incremented is the one for 5 microseconds. This gives you an idea of how accurate the delay function is. There are 64 buckets of 1024 CPU cycles each, so on a 1000 MHz processor each bucket is 1.024 microseconds wide. Note that this does not include any additional delay, such as when the transmit queue on the device backs up because of load or the effect of flow control.
• Packets Outstanding – the number of packets outstanding to the other device for transmit. There are 16 buckets of 64 packets each, so the first bucket is for 0-63 packets outstanding, etc.
• Bytes Outstanding – the number of bytes outstanding to the other device for transmit. There are 16 buckets of 64K bytes each, so the first bucket is for 0-65535 bytes, etc.
• Packet Length – the length of each received packet, in the 16 buckets given in the display: 64..127, 128..191, etc.
−/RESET – clears the counters before the display (you can also use LAN DELAY PARAM /DEVICE=(device1,device2) to clear the counters).
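For ongoing monitoring, the STATUS qualifiers above can be combined — for example, clearing the counters and then repeating a histogram display for the EIC/EID pair every 10 seconds (the interval and device names here are illustrative):

```
SDA> LAN DELAY STATUS /DEVICE=(EIC,EID) /HISTOGRAM /RESET /CONTINUOUS=10
```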
LAN DELAY STATUS Example
WAN$SDA(X-1) Extension on VLAN4 (HP rx4640 (1.30GHz/3.0MB)) at 9-JUL-2006 13:02:10.96
---------------------------------------------------------------------------------------
Device 1: EIC (Active) Device 2: EID (Active) CPU affinity: 2
Delay (usec): 5000 Max packet age (usecs): 0 Loss rate (pk/sec): 0
Bandwidth (mbits/sec): 50 Max buffering (bytes): 0 Total loss (pks): 0
EIC Xmt (pk) 1668495 (by) 13668246768 (mpk) 8 (mby) 1264 Lost (age) 0
EIC Rcv (pk) 1668228 (by) 13666059504 (mpk) 8 (mby) 1264 Lost (buffering) 0
EIC MBits/sec (128 pk) Xmt 0.00 Rcv 0.00 X+R 0.00 Lost (intentional) 0
EIC MBits/sec (512 pk) Xmt 0.00 Rcv 0.00 X+R 0.01 Lost (pool) 0
EIC MBits/sec (4096 pk) Xmt 0.04 Rcv 0.04 X+R 0.08 Current xmt (pk) 0/8
EIC MBits/sec (All pk) Xmt 11.91 Rcv 11.91 X+R 23.83 Current xmt (by) 0/57344
EIC Failures: Link 1 Xmt 0 Rcv 0 Elapsed time (sec) 9178
EID Xmt (pk) 1668228 (by) 13666059504 (mpk) 8 (mby) 1264 Lost (age) 0
EID Rcv (pk) 1668594 (by) 13669057776 (mpk) 8 (mby) 1264 Lost (buffering) 0
EID MBits/sec (128 pk) Xmt 0.00 Rcv 0.00 X+R 0.00 Lost (intentional) 0
EID MBits/sec (512 pk) Xmt 0.00 Rcv 0.00 X+R 0.01 Lost (pool) 0
EID MBits/sec (4096 pk) Xmt 0.04 Rcv 0.04 X+R 0.08 Current xmt (pk) 100/483
EID MBits/sec (All pk) Xmt 11.91 Rcv 11.91 X+R 23.83 Current xmt (by) 819200/3956736
EID Failures: Link 1 Xmt 0 Rcv 0 Elapsed time (sec) 9178
SDA>
LAN DELAY STATUS/HISTOGRAM Example
WAN$SDA(X-1) Extension on VLAN4 (HP rx4640 (1.30GHz/3.0MB)) at 27-AUG-2006 13:32:33.17
---------------------------------------------------------------------------------------
Device 1: EIC (Active) Device 2: EID (Active) CPU affinity: 2
Delay (usec): 0 Max packet age (usecs): 0 Loss rate (pk/sec): 0
Bandwidth (mbits/sec): 0 Max buffering (bytes): 0 Total loss (pks): 0
EIC Delay Variance (0..49+ usec): - - - 23% 44% 19% 10% - - 1% 3% - - - - -
- - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - -
EIC Packets Outstanding (0..960+): 100% - - - - - - - - - - - - - - -
EIC Bytes Outstanding (0..960k+) : 100% - - - - - - - - - - - - - - -
EIC Packet Length: 64+ 128+ 192+ 256+ 384+ 448+ 512+ 756+ 1024 1280 1519 2048 3072 4096 6144 8192
EIC Packets: 33% 1% - 32% - - - - 17% 13% - - - - - 4%
EID Delay Variance (0..49+ usec): - - - 25% 46% 23% 2% - - 1% 3% - - - - -
- - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - -
EID Packets Outstanding (0..960+): 100% - - - - - - - - - - - - - - -
EID Bytes Outstanding (0..960k+) : 100% - - - - - - - - - - - - - - -
EID Packet Length: 64+ 128+ 192+ 256+ 384+ 448+ 512+ 756+ 1024 1280 1519 2048 3072 4096 6144 8192
EID Packets: 33% 1% - 32% - - - - 17% 13% - - - - - 4%
Real-Life Examples
Real-Life Example: Credit Lyonnais, Paris
•Credit Lyonnais fire in May 1996
•OpenVMS multi-site cluster with data replication between sites (Volume Shadowing) saved the data
•Fire occurred over a weekend; the DR site plus quick procurement of replacement hardware allowed the bank to reopen on Monday
Source: Metropole Paris
“In any disaster, the key is to protect the data. If you lose your CPUs, you can replace them. If you lose your network, you can rebuild it. If you lose your data, you are down for several months. In the capital markets, that means you are dead. During the fire at our headquarters, the DIGITAL VMS Clusters were very effective at protecting the data.”
Patrick Hummel, IT Director, Capital Markets Division, Credit Lyonnais
Headquarters for Manhattan's Municipal Credit Union (MCU) were across the street from the World Trade Center, and were devastated on Sept. 11. "It took several days to salvage critical data from hard-drive arrays and back-up tapes and bring the system back up" ... "During those first few chaotic days after Sept. 11, MCU allowed customers to withdraw cash from its ATMs, even when account balances could not be verified. Unfortunately, up to 4,000 people fraudulently withdrew about $15 million."
Ann Silverthorn, Network World Fusion, 10/07/2002
http://www.nwfusion.com/research/2002/1007feat2.html
Real-Life Examples: Commerzbank on 9/11
• Datacenter near WTC towers
• Generators took over after power failure, but dust & debris eventually caused A/C units to fail
• Data replicated to remote site 30 miles away
• One AlphaServer continued to run despite 104° F temperatures, running off the copy of the data at the opposite site after the local disk drives had succumbed to the heat
• See http://h71000.www7.hp.com/openvms/brochures/commerzbank/
“Because of the intense heat in our datacenter, all systems crashed except for our AlphaServer GS160... OpenVMS wide-area clustering and volume-shadowing technology kept our primary system running off the drives at our remote site 30 miles away.”
Werner Boensch, Executive Vice President, Commerzbank, North America
Real-Life Examples of OpenVMS: International Securities Exchange
• All-electronic stock derivatives (options) exchange
• First new stock exchange in the US in 26 years
• Went from nothing to majority market share in 3 years
• OpenVMS Disaster-Tolerant Cluster at the core, surrounded by other OpenVMS systems
• See http://h71000.www7.hp.com/openvms/brochures/ise/
“OpenVMS is a proven product that’s been battle-tested in the field. That’s why we were extremely confident in building the technology architecture of the ISE on OpenVMS AlphaServer systems.”
Danny Friel, Sr. Vice President, Technology / Chief Information Officer, International Securities Exchange
“ We just had a disaster at one of our 3 sites 4 hours ago. Both the site's 2 nodes and 78 shadow members dropped when outside contractors killed all power to the computer room during maintenance. Fortunately the mirrored site 8 miles away and a third quorum site in another direction kept the cluster up after a minute of cluster state transition.”
Lee Mah,Capital Health Authority
writing in comp.os.vms, Aug. 20, 2004
“I have lost an entire data center due to a combination of a faulty UPS combined with a car vs. power pole, and again when we needed to do major power maintenance. Both times, the remaining half of the cluster kept us going.”
Ed Wilts, Merrill Corporation
writing in comp.os.vms, July 22, 2005
Business Continuity
Business Continuity: Not Just IT
•The goal of Business Continuity is the ability for the entire business, not just IT, to continue operating despite a disaster.
•Not just computers and data:
−People
−Facilities
−Communications: Data networks and voice
−Transportation
−Supply chain, distribution channels
−etc.
Useful Resources
Business Continuity Resources
• Disaster Recovery Journal:
− http://www.drj.com/
• Continuity Insights Magazine:
− http://www.continuityinsights.com//
• Contingency Planning & Management Magazine
− http://www.contingencyplanning.com/
• All are high-quality journals. The first two are available free to qualified subscribers
• All hold conferences as well
Multi-OS Disaster-Tolerant Reference Architectures Whitepaper
• Entitled “Delivering high availability and disaster tolerance in a multi-operating-system HP Integrity server environment”
• Describes DT configurations across all of HP’s platforms: HP-UX, OpenVMS, Linux, Windows, and NonStop
• http://h71028.www7.hp.com/ERC/downloads/4AA0-6737ENW.pdf
Tabb Research Report
• "Crisis in Continuity: Financial Markets Firms Tackle the 100 km Question"
−available from https://h30046.www3.hp.com/campaigns/2005/promo/wwfsi/index.php?mcc=landing_page&jumpid=ex_R2548_promo/fsipaper_mcc%7Clanding_page
Draft Interagency White Paper
• "Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System"
− http://www.sec.gov/news/studies/34-47638.htm
• Agencies involved: Federal Reserve System, Department of the Treasury, Securities & Exchange Commission (SEC)
• Applies to: Financial institutions critical to the US economy
• But many other agencies around the world are adopting similar rules
Business Continuity and Disaster Tolerance Services from HP
Web resources:
• BC Services:
− http://h20219.www2.hp.com/services/cache/10107-0-0-225-121.aspx
• DT Services:
− http://h20219.www2.hp.com/services/cache/10597-0-0-225-121.aspx
OpenVMS Disaster-Tolerant Cluster Resources
• OpenVMS Documentation at the OpenVMS website:
− OpenVMS Cluster Systems
− HP Volume Shadowing for OpenVMS
− Guidelines for OpenVMS Cluster Configurations
• OpenVMS High-Availability and Disaster-Tolerant Cluster information at the HP corporate website: http://h71000.www7.hp.com/availability/index.html and http://h18002.www1.hp.com/alphaserver/ad/disastertolerance.html
• More-detailed seminar and workshop notes at http://www2.openvms.org/kparris/ and http://www.geocities.com/keithparris/
• Book “VAXcluster Principles” by Roy G. Davis, Digital Press, 1993, ISBN 1-55558-112-9
Questions?
Speaker Contact Info:
•Keith Parris
•E-mail: [email protected] [email protected]
•Web: http://www2.openvms.org/kparris/