Post on 25-Dec-2015
1
Going Beyond Recovery to Continuity: Lessons Learned
Dave Swartz, Vice President & CIO
The George Washington University
2
Brief Background on GW
• Main campus
– Washington, DC
– ~100 buildings
– Blocks from the White House, IMF/World Bank, State Dept.
• 27,000 people
– 20K students (50% UG and 50% graduate and professional students)
– 7K faculty and staff
– Of the 20K there are 8K resident students
• Major medical center – the ER for the leadership of our government
• Two other smaller campuses in region
• 2.5 Gb into Internet and Internet-2
• 15K voice connections and 17K data connections
• Two major data centers – 34 miles apart
[Map: GW in relation to the White House, the Pentagon, and the IMF/World Bank and State Dept.]
3
Some Drivers for Business Continuity at GW
• Explosions in Manholes in the Street
– Recurring, unexplained accumulations of flammable liquids in the storm drains explode, shutting off power to a few buildings for days.
• Flood Hits Academic Center with Data Center
– A backed-up city sewer system caused a flood in a building not designed to house a data center.
• Change Management Issues
– Our Facilities group is prone to taking significant actions without much notice, including cutting off power or cooling to a building.
• Email Systems Failure
– We lost the SAN; basic email was down for 24 hours, and it was 3 days until the archive could be restored.
• Cybersecurity Incidents
– After a major worm infestation and a hack on a trusted host in 2000, GW created its Information Security Program.
• 9/11
– “The tragic events of Sept. 11 and their aftermath have resulted in changes in the way all of us conduct our lives,” said President Stephen Joel Trachtenberg. “Just as GW strives for academic excellence, we also want to take all appropriate steps to ensure the safety and well being of our community and the continued operation of the university.”
– GW was close to ground zero that day, and all land-based phone and cell phone networks were congested for much of the day.
• Sarbanes-Oxley
– A risk-conscious Board of Trustees has led to a number of initiatives to address BC at GW.
4
Who Owns BC at GW?
• John Petrie, AVP for Public Safety & Emergency Mgmt., holds the AB degree from Villanova University and a master’s and doctorate from The Fletcher School of Law and Diplomacy.
• A career Naval officer, he was the head of the Naval Station at Norfolk, the world’s largest Naval complex, and also professor and head of research at the War College.
• The AVP position was created after 9/11 and was designed to broaden, coordinate, and execute the University’s crisis management, business continuity, emergency preparedness and public safety plans and activities.
• “We need to have people at the local level comfortable with what’s expected of them and what they have the authority to do,” Petrie says. “If they are confident and comfortable, then the chances of their being able to prepare, respond, or recover are easier.”
• John’s number one priority is the safety and welfare of people.
• He sits on regional and national emergency management response groups and represents the regional universities in exercises.
• References:
– BC Plan - http://www.gwu.edu/~response/contents.cfm
– Advisories and Alerts - http://www.gwu.edu/~gwalert/
John Petrie, AVP for Public Safety & Emergency Mgt
John has helped to lead the development and administration of BC plans and testing, and an integrated system of advisories, alerts and real-time communications.
5
Role of IT in Campus BC
• Address the risks of IT failures
• IT has helped to coordinate and fund the development of the departmental plans for the 19 core offices
– Many core departments had to be assisted to get their BC plans done since they felt IT had things under control, so why should they have to plan?
– They also had difficulty freeing themselves from other priorities – needed their VP to make BC a priority!
• IT has also helped to deliver:
– Campus Alerts (web page, portal, email, 3rd party call service)
– Back up web site
– Redundant email system and broadcast server (reflector and Listserv)
– Alternate routing to different area code for our main incoming and outgoing phone lines
– Emergency intercom broadcasts over speaker phones
– A network of Blackberries and support for management
– Online directories and BC response plans
– A fully configured and supported command center.
6
The Planning Process
• Identify sources of risks and plan accordingly
• Provide assistance
– Standard templates and questions to facilitate preparation of plans (available on request)
– Expert assistance to develop plan
– Review of plans
• Enlist support
– Of senior management, the Board and all core offices
• Prioritize efforts
– Not every department needs a comprehensive plan. At GW we identified 19 core offices that needed detailed plans.
• Make the plan easily available
• Test the plan and the ability to think on your feet regularly
• Keep plans current
– All plans require periodic review, validation and update.
The online plan for GW is called the Incident Planning, Response, and Recovery Manual; it includes the individual BC Plans.
7
The GW IT Recovery Profile
• Rebuild & Replace Disaster Recovery
– Tape backup and priority shipment of equipment
– Weeks to recovery
• Hot-Site Disaster Recovery
– Off site arrangements with a hot-site provider
– Several days to recovery
• High Availability Operations
– Redundant data centers, networks and telecom
– Less than one day and ideally less than a couple of hours to recovery.
[Chart: Hours to Recovery (axis 0 to 450) by year, 2000 to 2006, for Rebuild & Replace, Hot-Site, and High-Availability. Data labels: 420 (projected), 84, 12, < 2.]
8
Dealing with Risk: Continuity rather than Recovery
• Common areas of IT risk were addressed with a focus on major risks and points of failure:
– Data Center
– Telecommunications
– Network and ISP
– Data
– Security
– Power and Cooling
– Change and Service Management
– Classrooms
1. Continuity of operations needs to be built into the architecture and culture from the bottom up.
2. If you live with it and use it day to day, then it is less of a big deal when a disaster hits.
3. BC at a comprehensive local level is essential to enable IT to deliver the sustainability of data and information services.
9
Data Center Redundancy
• We have created dual data centers
– separated by 34 miles
– connected by a DWDM link over a redundant dark fiber ring
• We split Test/Dev from the Prod instances.
• We also deploy VMware and virtualize servers across centers.
• Not all of production is at one site, but split on a 35-65% basis.
• We mirror data between data centers.
• We have staff split between centers.
• We routinely test failover during maintenance and upgrades.
• This design enables continuity of operations without the need to recover from most disasters.
[Diagram: Data Center Backup Architecture, showing the Loudoun County Data Center and the Foggy Bottom Data Center. Labels: DWDM (one per site), Dark Fiber, Ethernet Connection, SAN Fiber, SAN, WAN, EMC SYM-0 and EMC SYM-1 with M1, M2, M3 and BCV volumes, L700, ATA disks (10 Tb) at each site, SAN Attached Host, WAN Attached Host, Media Manager, Back-up Manager.]
10
Telecommunications Redundancy
• We have several PBX switches (Avaya S8700s) interconnected, load balanced, and spatially distributed.
– Two are on the main campus and separated. The third is on a remote campus 34 miles away in a different area code.
• We have the ability to re-route incoming and outgoing calls through different campuses and area codes.
• There are redundant emergency 911 and analog lines as a back up to our main trunks.
• Some specific phone numbers are protected and given regional priority for accessibility and sustainability during a major incident.
• We maintain copper connections for voice to permit inline power off of diesel generators to 15,000 phones.
11
Data Redundancy
• All enterprise data is mirrored between data centers, including ERP, data marts, email, one-card, portal, and web systems.
• The main campus file servers are automatically backed up. Legacy departmental systems are slowly transitioning to central support and sustainability – a difficult political process.
• Desktops in many core offices have a standard image and automatically store to a central suite of file servers.
• Critical documents are being stored online in an enterprise document management system and archived to tape.
• We regularly test data backups to make sure we can restore from them.
• One of the most critical aspects of continuity is rapid access to the data!
On-site fire rated vault in addition to off-site storage
12
Information Security
• Protecting the university from security risks that can interrupt operations and cost millions of dollars in lost productivity and liability is an important priority in BC.
• Like an onion, the best approach is defense in depth.
• One of our newest efforts after securing campus file servers is our desktop initiative.
– We now use Novell Patchlinks, Cisco Clean Access and IPS to automate updates, verify conformance to standards and non-infection.
– As a result, desktop infection problems have declined to a trickle.
• Creating a focused Information Security program, setting standards, and centralizing services, are critical to success.
“Rounding Up Rogue Servers”, article in July 2005 Chronicle.
13
Power and Cooling
• Power Redundancy
– Conditioned Commercial Power
– 450 kW Diesel Generator w/Maintenance Tap
– Automatic Transfer Switch
– Uninterruptible Power Supplies (UPS)
– Multiple Power supplies in each computer system
– 48 hours supply of diesel (going to 96 hrs) with priority shipments from three regional vendors possible
• Redundant Air Conditioning Systems
– Chilled Water Plant & Two 60 Ton Dry Coolers
– Glycol & Chilled Water Air Handlers
14
Change & Service Management
[Diagram: Change Control via Integration. Functions shown: Work Requests, Prob Tickets & Service Orders, Asset Management, S/W License Mgmt, App. Change Control. Systems shown: C3, Remedy, Kintana, Upside, Aperture, TBD.]
Adoption of integrated change control is one of the major factors in the improvement and reliability of operations.
15
Classrooms
• What happens if we lose some classroom space? How could we continue to conduct classes?
1. Using R25i (Resource25 3.3) to complement Schedule25, we can identify and reallocate any available university space to classrooms.
2. Using Bb and Elluminate, we can conduct classes virtually from home.
a. We are piloting this approach now for snow days and other unscheduled ad hoc gatherings such as study sessions.
b. We are also suggesting that faculty teach one virtual class every month so they have practice.
3. Podcasting = Apreso + iPods
a. GW is supporting Podcasting of its non-credit lecture series to provide access to recorded presentations.
b. Could this be expanded for credit classes? Depends on support from faculty.
16
Selling BC: not the WHAT, but the HOW
• Rational Approach
– The risk or probability of the event multiplied by the potential loss provides a suggested magnitude for the investment in protecting a university from disaster. Not many use this approach.
• Peer Group Benchmarks
– A very common and accepted approach is to compare the university against the market basket of peer institutions to see what they are doing.
• Leverage the Crisis
– The emotional side of living through a crisis tends to ease the flow of funds, so capture the opportunity when it arises.
• Partnering with the Board and Audit Team
– The Board has the ability to drive improvements. The External and Internal Audit Teams are agents of the Board and should be viewed as partners, not as the threat they are often taken to be.
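The rational approach can be made concrete with a short calculation. The sketch below multiplies an assumed annual probability for each event by an assumed loss. The event names, probabilities, and dollar figures are hypothetical placeholders for illustration, not GW data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    annual_probability: float  # chance the event occurs in a given year (0..1)
    potential_loss: float      # estimated loss in dollars if it occurs


def expected_annual_loss(risk: Risk) -> float:
    """Probability of the event multiplied by the potential loss."""
    return risk.annual_probability * risk.potential_loss


# Hypothetical figures for illustration only; not taken from the GW program.
risks = [
    Risk("data center flood", 0.02, 5_000_000),
    Risk("multi-day power outage", 0.10, 400_000),
]

for r in risks:
    print(f"{r.name}: ${expected_annual_loss(r):,.0f} per year")

total = sum(expected_annual_loss(r) for r in risks)
print(f"suggested magnitude of annual investment: ${total:,.0f}")
```

The total is only as good as the probability and loss estimates behind it, which may be one reason few institutions use this approach.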
17
Risks of Complexity
Standardization, documentation, and tight change control help to reduce risks from complexity.
Virtualization, distant centers, and split operations add complexity, which has its own attendant risks.
18
Factors Related to Distance
• How far away is far enough for a second center?
– GW has selected 34 miles
– USC has designated a “bunker” just a few miles away
– Others are saying 70+ miles.
• It really depends
– You need to consider the types of risks in your region.
• The greater the distance
– The greater the cost or the lesser the functionality and immediacy of response.
• You may want to
– Have a secondary high-availability or hot-site nearby and a tertiary cold-site much farther away.
• You need to consider
– The impacts on your staff and their ability to make it to the different sites both for routine maintenance and during a disaster
– Some types of clustering do not work at a distance
– Real-time mirroring is also adversely affected by distance.
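One reason mirroring suffers with distance is propagation delay: a synchronous write is not complete until the remote site acknowledges it. The sketch below gives a rough lower bound on the round-trip time. It assumes light travels through fiber at about 200,000 km/s and that the fiber path equals the quoted distance; real routes are longer and equipment adds delay. The 5-mile case is a hypothetical stand-in for "just a few miles".

```python
KM_PER_MILE = 1.609344
# Assumption: signal speed in optical fiber is roughly two thirds of c.
FIBER_SPEED_KM_PER_S = 200_000.0


def round_trip_ms(miles: float) -> float:
    """Minimum round-trip propagation delay in milliseconds."""
    km = miles * KM_PER_MILE
    return 2 * km / FIBER_SPEED_KM_PER_S * 1000


# 5 miles is a hypothetical nearby site; 34 and 70 come from the slide.
for miles in (5, 34, 70):
    print(f"{miles:>3} miles: at least {round_trip_ms(miles):.2f} ms per round trip")
```

Every synchronously mirrored write pays at least this delay, so the penalty grows in direct proportion to the distance between sites.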
19
Support those Blackberries
• A critical element of the GW BC program is a network of Blackberries. All senior management at GW have them and use them every day.
• Blackberries are more like a laptop than a phone and require expert assistance
• They have cell phone and radio capability
• They can send and receive email and instant text messages
• They have the ability to surf the web and access calendars, directories and online documents that can be used to support BC
• We have a dedicated expert with backup to provide support to the Blackberries and the command centers.
20
Doesn’t it cost a great deal?
• GW had a hot-site,
– Costing several hundred thousand dollars per year.
• Went to a high-availability 2nd site.
– One-time cost about $1 million
– The ongoing costs were not more than the previous base budget due to the reallocation of the funds from the hot-site contract.
• Increase in base needed was:
– $136K/yr: $1 million loaned at 6% over 10 years
• To offset costs we are leasing excess space:
– We are recovering the incremental operating costs of the 2nd site.
• More reliable service without large additional costs - A NO-BRAINER!
[Chart: Investment (cost) versus Time to Restoration of Operations (2 Weeks, 1 Week, 72 Hours, 48 Hours, 24 Hours, Minutes), spanning Rebuild & Replace, Hot Site / Mobile Recovery, and High Availability. Two curves: GW Cost Curve and Expected Cost Curve.]
A myth propagated by hot-site vendors is that the cost of customer owned high-availability is prohibitive
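The $136K/yr figure is consistent with the standard level-payment loan formula. The check below is a minimal sketch that assumes one payment per year with annual compounding; the slide does not state the payment schedule.

```python
def annual_payment(principal: float, rate: float, years: int) -> float:
    """Level annual payment that repays `principal` at `rate` over `years`."""
    return principal * rate / (1 - (1 + rate) ** -years)


# $1 million at 6% over 10 years, as quoted on the slide.
payment = annual_payment(1_000_000, 0.06, 10)
print(f"${payment:,.0f} per year")
```

Under these assumptions the payment rounds to $136K per year. A monthly payment schedule would give a somewhat lower annual total.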
21
Partnerships
• National Capital Regional Emergency Response Partnership
– Emergency Response groups across the region coordinate efforts and share experiences
– First Responder Access Card (FRAC)
– Regional exercises
– Information sharing with key groups
• University Partnerships:
– Cost and resource sharing or exchange programs
– Georgetown University & GW back one another up
– MAX (Mid-Atlantic Crossroads gigapop)
• Vendor Partnerships:
– Have helped GW identify best practices and utilize new technology useful to BC.
– Their support in a disaster can be critical
The FRAC helps to get approved personnel across road-blocks and barriers.
22
Questions?
Dave Swartz