Designing High Availability Networks, Systems, and Software
for the University Environment
Deke Kassabian and Shumon Huque
The University of Pennsylvania
January 14, 2004
About Penn
- The University of Pennsylvania was founded by Ben Franklin in 1751
- Penn is part of the Ivy League
- Located in western Philadelphia
- Community of more than 30,000 people
General Goals
- Networked services available as expected by our users
- Minimized time to repair (TTR) when outages do occur
- Ability to perform maintenance and upgrades (planned downtime) non-disruptively
- Cost effectiveness in meeting these goals
Definitions
- Availability
- High Availability (HA)
- Rapid Recovery (RR)
- Disaster Recovery (DR)
- Basic Systems
Definitions
Disaster Recovery (DR) - the process of restoring a service to full operation after an interruption in service
Definitions
Basic System - a {Network, System, Service} with only the most basic of protections against outages
Examples:
- A network recoverable using spare parts
- A single computer system with RAID disk
- A service recoverable from tape backups
Definitions
Availability - the percentage of total time that a {Network, System, Service} is available for use
Related points:
- Advertised periods of availability
- Availability as advertised
- Absolute availability
Definitions
High Availability (HA) - a {Network, System, Service} with specific design elements intended to keep availability above a high threshold (e.g., 99.99%)
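To make a threshold such as 99.99% concrete, it helps to convert it into a yearly downtime budget. The following is a minimal sketch, not taken from the slides; the function name and the 365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the yearly downtime (in minutes) allowed by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


# Print the budget for two illustrative targets.
for target in (99.9, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes/year")
```

Under these assumptions a 99.99% target leaves a little under an hour of total downtime per year, which is why planned maintenance also has to be non-disruptive.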
Definitions
Rapid Recovery (RR) - a {Network, System, Service} with specific design elements intended to recover from downtime very quickly (e.g., 15 minutes)
Metrics
- Economics of high availability (the costs of non-availability)
- Calculating availability
- How availability measurements are performed
Economics of high availability
What is the cost of an outage in your:
- Student Courseware systems and student record systems
- Financial systems
- Primary campus web site and Email servers
- DNS, DHCP and AuthN systems
- Internet connection(s)
- Development / Gifts systems
How much should you be willing to spend to minimize downtime of any or all of these?
Calculating availability
Availability can be measured directly through periodic polling (e.g., SNMP, Mon, Nagios)
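As a rough illustration of polling-based measurement, the sketch below turns a series of poll results into an availability figure. The `Poll` record and function name are hypothetical and do not correspond to the output format of SNMP, Mon, or Nagios.

```python
from dataclasses import dataclass


@dataclass
class Poll:
    """One probe result (hypothetical record, not a real tool's format)."""
    timestamp: float  # seconds since the epoch
    up: bool          # True if the service answered the probe


def measured_availability(polls: list[Poll]) -> float:
    """Return the fraction of polls that found the service up."""
    if not polls:
        raise ValueError("at least one poll is required")
    return sum(1 for p in polls if p.up) / len(polls)


# One day of one-minute polls with three failed probes.
day = [Poll(timestamp=60.0 * i, up=(i not in (100, 101, 102))) for i in range(1440)]
print(f"{measured_availability(day):.4%}")
```

A measurement like this is only as fine-grained as the polling interval: an outage shorter than the gap between two probes can go unnoticed.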
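A formula for predicting availability of a single component:

$$A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{TTR}} \quad\text{or}\quad A = 1 - \frac{\mathrm{TTR}}{\mathrm{MTBF} + \mathrm{TTR}}$$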
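The single-component formula can be checked numerically. This is a minimal sketch in which MTBF is read as mean time between failures and TTR as time to repair; the function name and the sample figures (10,000 hours and 4 hours) are illustrative assumptions, not values from the talk.

```python
def predicted_availability(mtbf_hours: float, ttr_hours: float) -> float:
    """Predict availability of one component as MTBF / (MTBF + TTR)."""
    if mtbf_hours <= 0 or ttr_hours < 0:
        raise ValueError("MTBF must be positive and TTR must be non-negative")
    return mtbf_hours / (mtbf_hours + ttr_hours)


mtbf, ttr = 10_000.0, 4.0  # illustrative values only
first_form = predicted_availability(mtbf, ttr)
second_form = 1.0 - ttr / (mtbf + ttr)  # the equivalent "1 - TTR/(MTBF+TTR)" form
print(f"{first_form:.6f} {second_form:.6f}")
```

Both forms give the same number, and the second one makes the lever explicit: for a fixed failure rate, shortening the repair time is what raises availability.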
Design Principles Towards HA
- Minimize points of catastrophic failure
- Maximize redundancy
- Minimize fault zones
- Minimize complexity and cost
Applying the above principles to:
- Networks
- Systems
- Services
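The effect of redundancy on availability can be sketched with the standard series and parallel combinations. This example is not from the slides, and it assumes that component failures are statistically independent, which shared power, cooling, or fiber paths can easily violate (one reason to minimize fault zones). The function names are hypothetical.

```python
from math import prod
from typing import Iterable


def series_availability(availabilities: Iterable[float]) -> float:
    """All components are required: the service is up only if every one is up."""
    return prod(availabilities)


def parallel_availability(availabilities: Iterable[float]) -> float:
    """Redundant components: the service is up if at least one of them is up."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Three redundant replicas, each assumed to be 99% available on its own.
print(f"{parallel_availability([0.99, 0.99, 0.99]):.6f}")
# Two components that must both work, each assumed to be 99% available.
print(f"{series_availability([0.99, 0.99]):.4f}")
```

Adding replicas raises availability quickly, while every extra component placed in series lowers it, which is the quantitative side of "minimize complexity".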
Specific examples at Penn
- High Availability Services
- Rapid Recovery Services
High Availability Design
Strategies employed to achieve HA:
- Server redundancy
- Hardware component redundancy
- Storage redundancy (RAID)
- Network redundancy
- Redundant power, A/C, cooling, etc.
- Application protocols that can transparently fail over to alternate servers
- Secondary offsite hosting (of some services like DNS)
Rapid Recovery Design
Strategies employed to achieve RR:
- Standby servers and storage
- Some HA design elements: hardware redundancy, storage redundancy, network redundancy, power and A/C redundancy, etc.
Note: services deployed in the RR model typically don’t have an easy way to transparently fail over to alternate servers (e.g., E-mail, Web)
Network Aggregation Point
- Abbreviation: NAP
- Machine rooms in separate campus locations that house critical network electronics and servers
- Good environmentals and extensive connectivity to campus fiber-optic cable plant
- Both HA and RR services utilize multiple NAPs
Central Infra. Networks
- AKA “NOC Networks” (historical name)
- 3 highly redundant IP networks that house systems providing critical infrastructure services
- Each network is triply connected to campus routing core via distinct NAP locations
- Network wiring traverses physically diverse fiber conduit pathways
- Use of router redundancy protocols (VRRP) and Layer-2 path redundancy (802.1D) for high availability
HA Server Platforms
- Two sets of three replicated servers
  - 3 KDC servers: central authentication
  - 3 NOC servers: everything else
- Kerberos runs on separate systems mainly for security reasons.
High Availability: KDCs
KDCs (3):
- 3 distinct machines (kdc1, kdc2, kdc3)
- Run only Kerberos AS and TGS
- Each located in a different campus machine room
- Each connected to a distinct IP network
  - Via a distinct IP core router
  - Additionally each network is triply connected to the campus routing core via 3 NAPs
High Availability: NOCs
- 3 “NOC” systems (a historical name)
- Provide: DNS, DHCP, NTP, RADIUS plus a few homegrown services
- Same physical and network connectivity as the KDCs
- In addition: some servers have a secondary interface on a different NOC network (for reasons to be explained later)
HA Application Failover
- Kerberos
- DNS
- RADIUS
- NTP
- DHCP
  - Current spec supports only 2 failover systems
- Non-HA homegrown services: PennNames
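Protocol-level failover of the kind these services rely on generally means the client knows several servers and moves on to the next one when a request fails. The sketch below shows that general pattern only; it is not how Kerberos, DNS, or RADIUS clients are actually implemented, and the function names and host names are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def query_with_failover(servers: Sequence[str], query: Callable[[str], T]) -> T:
    """Try each server in order and return the first successful answer."""
    errors = []
    for server in servers:
        try:
            return query(server)
        except OSError as exc:  # covers timeouts and refused connections
            errors.append(f"{server}: {exc}")
    raise ConnectionError("all servers failed: " + "; ".join(errors))


def fake_query(server: str) -> str:
    """Stand-in for a real network request; the first server is 'down'."""
    if server == "ns1.example.edu":
        raise TimeoutError("no response")
    return f"answer from {server}"


print(query_with_failover(["ns1.example.edu", "ns2.example.edu"], fake_query))
```

Passing the request in as a callable keeps the retry logic separate from any particular protocol, so the same loop can be exercised without a network.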
Rapid Recovery service
- Example: E-mail and Web service
- A set of servers and storage is replicated at two sites: primary and standby
  - Primary site: active servers and storage
  - Secondary site: standby servers and replicated storage
- Data from 1st site is synchronously replicated to 2nd
- Two separate Fibre Channel networks interconnect systems and storage at both sites
- Catastrophic failure event: system can be manually reconfigured to use the standby servers and/or secondary storage (~30 minutes)
- Servers are located on the HA primary infrastructure network
Experiences at Penn
- Where these approaches have been helpful
  - Higher availability, non-disruptive maintenance
- Where they have not
  - Complexity can be hard to manage!
- Where cost has been high
  - Replicated systems and networks, high-end storage solutions
- Real availability experience
  - DNS, a critical service, went from 99.0% to 99.999% availability!
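Expressed as yearly downtime, and assuming a 365-day year (8,760 hours, or 525,600 minutes), the two DNS figures work out roughly as follows:

$$(1 - 0.99) \times 8760\ \text{h} = 87.6\ \text{h/year}, \qquad (1 - 0.99999) \times 525{,}600\ \text{min} \approx 5.3\ \text{min/year}$$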
Future Enhancements
- Making RR services highly available: “clustering”, IETF rserpool, etc.
- Metropolitan area DR (or better)
- Rolling disaster protection
- Others:
  - IP Multipathing
  - Trunking links to servers: 802.3ad, SMLT, DMLT or similar
  - Rapid Spanning Tree (IEEE 802.1w)
  - Multi-master KADM service
  - Improved management and monitoring infrastructure
Feedback
- Questions, comments
- Your designs, experiences, successes
Contact Info:[email protected]@isc.upenn.edu