Scalable high availability technologies
Gergely Tomka
Session Outline
• What is high availability (a practical approach)
• Variations:
• Hardware solutions
• Hot/Cold clusters
• Large clusters
• Requirements, implementations and results
High availability
• The service must be available when we want it
• 24/7/365 availability – is it realistic?
• Problems:
− How to detect a service failure?
− How to detect if the service is working?
− How to stop the service reliably?
− How to start a service reliably?
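A minimal sketch of what detecting and restarting can mean in practice; the service name "myservice" and port 8080 are made-up placeholders:
# A process can exist while the service is dead, so check the
# listening port too (nc -z is the OpenBSD netcat probe syntax).
$ pgrep -f myservice >/dev/null || echo "process missing"
$ nc -z -w 2 localhost 8080 || echo "port not answering"
# Reliable stop: polite TERM first, unconditional KILL if it hangs.
$ pkill -TERM -f myservice; sleep 10; pkill -KILL -f myservice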
• Where is the single point of failure?
Single Point Of Failure
• Desktop machine: one disk, one NIC, one CPU, one PSU
• Server machine: one motherboard/system bus, multiple CPUs, PSUs, NICs
• Mainframe: three cores, where two work on the same computation and the third checks that their results match.
• Two node cluster from server machines: no SPOF in hardware!
• Are you sure?
Hardware redundancy
• Every cable must be duplicated
• Every network device must be replicated
• The network protocol must be able to handle the loss of a link (a bonding sketch follows below)
• SAN connections and devices must be doubled too
• System operators must be redundant and available too, no lonely heroes...
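One standard way to survive a lost link on Linux is interface bonding; a sketch, assuming interfaces eth0 and eth1 and a reasonably recent iproute2:
# Active-backup bond: traffic fails over to eth1 if eth0 loses link,
# checked by MII link monitoring every 100 ms.
$ ip link add bond0 type bond mode active-backup miimon 100
$ ip link set eth0 down; ip link set eth0 master bond0
$ ip link set eth1 down; ip link set eth1 master bond0
$ ip link set bond0 up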
Single Point Of Failure
• Problems with the two nodes:
− The connection used for checking service availability
− Common resources, mostly SAN disks
− Different levels of disaster: citywide? Or just a small fire in the server room?
− Price
• Example of risk assessment:
− A huge bomb could cause destruction over a 20 km radius
− Place the two servers at least 30 km apart
− If more than one 10 Mt H-bomb goes off, trading will stop anyway
Hot/Cold clusters
• Two nodes, one is working, one is waiting for trouble
• Users must define:
− Resources: disks, network interfaces, processes
− Scripts for checking, starting and stopping the service or its resources (a sketch follows below)
• Veritas Cluster Server (VCS) or Coyote can be used for this purpose
• Failover: the roles of the two nodes are swapped
• Heartbeat: various mechanisms that let the nodes check on each other, usually over the network
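In practice the scripts reduce to three entry points (VCS names them online, offline and monitor); a generic sketch, where "myservice" is a placeholder and this is not a real VCS agent:
#!/bin/sh
# Minimal cluster-agent skeleton: the framework calls it with
# start, stop or monitor and trusts the exit code.
case "$1" in
  start)   /etc/init.d/myservice start ;;
  stop)    /etc/init.d/myservice stop ;;
  monitor) pgrep -f myservice >/dev/null ;;  # 0 = healthy, else failed
esac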
Failure modes of Hot/Cold clusters
• Maintenance script issues: the service is down, but the script reports it as up, etc.
− Solution: better scripts
• Owner issues: users can usually manage the service without VCS, which leaves the service outside cluster control
− Hard to detect: the service is up, but the VCS view is out of sync with reality
− Good recipe for disaster
• Lost connection: lost heartbeat, can lead to split brain
− Caused by network failure or overload
• Split brain: both nodes become hot at the same time
− Perfect recipe for disaster, especially with shared storage
− Gazelle/SCSI disk reservation can fence a node off (see the sketch below)
• Failed resources
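SCSI-3 persistent reservations are the usual building block behind such disk fencing; a sketch with sg_persist from sg3_utils (device path and keys are invented):
# Each node registers its own key on the shared disk.
$ sg_persist --out --register --param-sark=0xA /dev/sdb
# The hot node takes a "Write Exclusive, Registrants Only" reservation.
$ sg_persist --out --reserve --param-rk=0xA --prout-type=5 /dev/sdb
# On failover the survivor (key 0xB) preempts the old key, so a
# half-dead node can no longer write to the disk during a split brain.
$ sg_persist --out --preempt --param-rk=0xB --param-sark=0xA --prout-type=5 /dev/sdb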
Special cases
• Hot/hot nodes: two services, each node being the default owner of one
• Various fencing solutions:
− Human: only manual failover
− SCSI disk reservation
Multinode clusters
• Eventually one server is just not powerful enough
• Variations:
− Load-balancing clusters
− Grids
− ESX
• Applications must be modified
• The sky is the limit for performance
• Can be built cost-efficiently
Load balancing (diagram slide)
Load balancers
• The load balancer is a tool to distribute load between computing nodes
• Works with many clients and small jobs (transactions)
• It is, in itself, a single point of failure
• Clustering, quick failover and preservation of state are important
• Methods for distributing the load:
− DNS: slow, not flexible, but easy (zone-file sketch below)
− Router: quick, flexible, but the traffic must go through the load balancer
− Layer 7/app-level load balancing: the load balancer must examine the traffic, not just pass it to the nodes
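The DNS variant is nothing more than several A records on one name; a BIND-style zone fragment (addresses from the documentation range):
; Resolvers hand out the records in rotating order - round robin.
; "Slow": the TTL decides how long clients keep hitting a dead node.
www    IN  A  192.0.2.10
www    IN  A  192.0.2.11
www    IN  A  192.0.2.12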
Load balancers
• Selecting nodes:
− Round robin
− Least loaded
− Standby nodes
• F5 appliances can be used, but the Linux Virtual Server is also very good for testing (example below)
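A minimal Linux Virtual Server setup with ipvsadm, for testing; the VIP and real-server addresses are made up:
# Virtual service on the VIP with round-robin scheduling (-s rr);
# -s wlc would instead favor the least-loaded (fewest connections) node.
$ ipvsadm -A -t 192.0.2.1:80 -s rr
# Two real servers in direct-routing mode (-g): replies bypass the balancer.
$ ipvsadm -a -t 192.0.2.1:80 -r 192.0.2.10:80 -g
$ ipvsadm -a -t 192.0.2.1:80 -r 192.0.2.11:80 -g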
Grid
• In simple terms, a grid is a group of computers working on the same tasks
• Works with big clients and large parallel tasks (contrast with the many small transactions of load balancing)
• Distributing the tasks and collecting the results is itself a big job (see the sketch after this list)
− Overloaded fileservers
− Torrent-like task and input data distribution
− Bandwidth
• The application must be rewritten for hundreds of independent nodes
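A toy illustration of the overloaded-fileserver point (hostnames and paths are hypothetical): every node pulls the full input from one server, so even run in parallel the server's uplink is the bottleneck; torrent-like distribution sidesteps this by letting nodes fetch pieces from each other.
# Naive distribution: N full copies of the input leave one file server.
$ for n in node001 node002 node003; do scp /data/input.tar "$n":/scratch/ & done; wait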
Grid
• Nodes must be:
− Cheap
− Efficient
− Easy to replace
− Properly monitored
Grid effect (diagram slide)
Grid Management
• Managing hundreds of nodes is a different art
• No individual alerts for errors
• No hardware repairs
• Statistical analysis of logfiles
• Geospatial analysis: find broken rooms/racks/networks/buildings
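The next slide shows a real session; as a smaller taste, if hostnames encode location (say a hypothetical building-rack-node scheme like ny1-r42-n07) and syslog is collected centrally (syslog.all below is a placeholder), the broken-rack question is a one-liner:
# Error count per rack: field 4 of a syslog line is the hostname.
# One rack with 100x the median count means hardware or network
# trouble in that rack, not an application problem.
$ grep "Waiting for busy volume" syslog.all | cut -d" " -f4 | cut -d- -f1,2 | sort | uniq -c | sort -nr | head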
Sample Grid troubleshooting
$ /afs/grid_utils/bin/search "Waiting for busy volume" /var/log/syslog |
    /afs/grid_utils/bin/message_filter.pl
Shows that it occurs 50,000 times a day on 17,000 hosts, roughly 3/host/day; the winner has
13/day. So it's not a small group of hosts, not a single personality, not application
related, and probably not a huge deal at all, just background noise.
Is it AFS-cell related? No, the distribution of the error is even among AFS cells:
$ cut -d" " -f15 afs.txt | sort | uniq -c
Is it tied to a single or a few AFS volumes? Not sure: 157 volumes show the error per day,
but only 12 have 1000+ errors and 20 have 10-1000 errors; the rest have fewer than 3. Top
AFS volume IDs: 5879181, 5369925, 5368179.
$ cut -d" " -f11 afs.txt | sort | uniq -c | sort -nr
Is it tied to a certain time of day? Cutting the hours:minutes out of the timestamp: not
really, it's nicely distributed throughout the day, every hour has a few. However, it stands
out that it always occurs at :00 minutes, every hour, with higher activity at 7:00, 15:00
and 21:00 - now that's suspicious.
$ cut -d" " -f4 afs.txt | cut -d":" -f1,2 | sort | uniq -c
ESX
• VMware ESXi is one of the current solutions for virtualized desktops
• Advantages of virtual desktops:
− Smaller, quieter devices in the workspace (thin client, old desktop machines)
− Desktop applications could be closer to the data
• Disadvantages:
− Added complexity
− Very new idea, lots of new problems
ESX
• Effects on teamwork:
− Windows desktops
− vCenter runs on Windows servers
− ESXi is a stripped-down Linux
− Storage is on NetApp filers
ESX clusters
• vMotion – seamless migration of virtual machines between nodes
− vCenter software is necessary
− Cluster size is only 16 nodes
− One vCenter can handle only 300 nodes
− Good for maintenance and for performance
• vCenter is a serious disadvantage when you need hundreds of nodes
• Thousands of desktops are troublesome
− Very different culture in maintenance (reboot!)
− Disk activity, scheduled tasks (antivirus)
Storage solutions
• Cache systems
− Can help on read operations
− Cannot help on write operations
• Deduplication
− For ESX
− Needs more CPU to recognize similar blocks/files, but uses less space (see the sketch below)
• Speed
− Hand-built NFS file server: the 10 Gb write path got saturated immediately
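File-level deduplication is easy to demonstrate: hash everything and look for repeated hashes; block-level dedup on a filer applies the same idea below the file system. A sketch (the /vmstore path is made up):
# Duplicate files by content hash; identical VM images show up as
# groups of equal checksums (GNU uniq: -w32 compares only the hash).
$ find /vmstore -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate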
Q & A