Clusters
Paul Krzyzanowski
pxk@cs.rutgers.edu, ds@pk.org
Distributed Systems
Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.
Designing highly available systems
• Incorporate elements of fault-tolerant design
  – Replication, TMR (triple modular redundancy)
• A fully fault-tolerant system would offer non-stop availability
  – You can't achieve this!
• Problem: expensive!
Designing highly scalable systems
• SMP architecture
• Problem: performance gain as f(# processors) is sublinear
  – Contention for resources (bus, memory, devices)
  – Also … the solution is expensive!
Clustering
• Achieve reliability and scalability by interconnecting multiple independent systems
• Cluster: a group of standard, autonomous servers configured so they appear on the network as a single machine
  – Goal: approach a single-system image
Ideally…
• A bunch of off-the-shelf machines
• Interconnected on a high-speed LAN
• Appear as one system to external users
• Processes are load-balanced
  – May migrate
  – May run on different systems
  – All IPC mechanisms and file access available
• Fault tolerant
  – Components may fail
  – Machines may be taken down
we don’t get all that (yet)
(at least not in one package)
Clustering types
• Supercomputing (HPC)
• Batch processing
• High availability (HA)
• Load balancing
High Performance Computing (HPC)
The evolution of supercomputers
• Target complex applications:
  – Large amounts of data
  – Lots of computation
  – Parallelizable applications
• Many custom efforts
  – Typically Linux + message-passing software + remote exec + remote monitoring
Clustering for performance
Example: one popular effort
– Beowulf
  • Initially built to address problems associated with large data sets in Earth and space science applications
  • From the Center of Excellence in Space Data & Information Sciences (CESDIS), a division of the Universities Space Research Association at the Goddard Space Flight Center
What makes it possible
• Commodity off-the-shelf computers are cost effective
• Publicly available software:
  – Linux, GNU compilers & tools
  – MPI (Message Passing Interface)
  – PVM (Parallel Virtual Machine)
• Low-cost, high-speed networking
• Experience with parallel software
  – Difficult: solutions tend to be custom
What can you run?
• Programs that do not require fine-grain communication
• Nodes are dedicated to the cluster
  – Performance of nodes is not subject to external factors
• Interconnect network is isolated from the external network
  – Network load is determined only by the application
• Global process ID provided
  – Global signaling mechanism
Beowulf configuration
Includes:
– BPROC: Beowulf distributed process space
  • Start processes on other machines
  • Global process ID, global signaling
– Network device drivers
  • Channel bonding, scalable I/O
– File system (file sharing is generally not critical)
  • NFS root
  • unsynchronized
  • synchronized periodically via rsync
Programming tools: MPI
• Message Passing Interface
• API for sending/receiving messages (sketched below)
  – Optimizations for shared memory & NUMA
  – Group communication support
• Other features:
  – Scalable file I/O
  – Dynamic process management
  – Synchronization (barriers)
  – Combining results
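To make the API concrete, here is a minimal C sketch (not from the original slides): each rank computes a partial value, a barrier synchronizes the ranks, and MPI_Reduce combines the results on rank 0. Build with mpicc and run with mpirun.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of ranks */

    int partial = rank + 1;   /* stand-in for real per-node computation */
    int total = 0;

    MPI_Barrier(MPI_COMM_WORLD);             /* synchronization (barrier) */
    MPI_Reduce(&partial, &total, 1, MPI_INT, /* combining results on rank 0 */
               MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %d\n", size, total);
    MPI_Finalize();
    return 0;
}
```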
Programming tools: PVM
• Software that emulates a general-purpose heterogeneous computing framework on interconnected computers
• Presents a view of virtual processing elements (sketched below)
  – Create tasks
  – Use global task IDs
  – Manage groups of tasks
  – Basic message passing
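A hedged sketch of the corresponding PVM 3 calls: the parent joins the virtual machine, spawns worker tasks addressed by global task IDs, and exchanges messages with them. It assumes a separate "worker" binary installed in PVM's search path; error handling is omitted.

```c
#include <pvm3.h>
#include <stdio.h>

#define NWORKERS 4

int main(void)
{
    int mytid = pvm_mytid();           /* join PVM; get a global task ID */
    int tids[NWORKERS];

    /* Create tasks: spawn NWORKERS copies of "worker" on any hosts */
    int n = pvm_spawn("worker", NULL, PvmTaskDefault, "", NWORKERS, tids);
    printf("parent %x spawned %d workers\n", mytid, n);

    for (int i = 0; i < n; i++) {
        pvm_initsend(PvmDataDefault);  /* basic message passing: pack... */
        pvm_pkint(&i, 1, 1);
        pvm_send(tids[i], 1);          /* ...and send, addressed by task ID */
    }

    for (int i = 0; i < n; i++) {
        int result;
        pvm_recv(-1, 2);               /* receive a reply from any task */
        pvm_upkint(&result, 1, 1);
        printf("got reply %d\n", result);
    }

    pvm_exit();                        /* leave the virtual machine */
    return 0;
}
```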
Beowulf programming tools
• PVM and MPI libraries
• Distributed shared memory
  – Page based: software-enforced ownership and consistency policy
• Cluster monitor
• Global ps, top, uptime tools
• Process management
  – Batch system
  – Write software to control synchronization and load balancing with MPI and/or PVM
  – Preemptive distributed scheduling: not part of Beowulf (two packages: Condor and Mosix)
Another example
• Rocks Cluster Distribution
  – Based on CentOS Linux
  – Mass installation is a core part of the system
    • Mass re-installation for application-specific configurations
  – Front-end central server + compute & storage nodes
  – Rolls: collections of packages
    • The base roll includes: PBS (Portable Batch System), PVM (Parallel Virtual Machine), MPI (Message Passing Interface), job launchers, …
Another example
• Microsoft HPC Server 2008
  – Windows Server 2008 + clustering package
  – Systems management
    • Management Console: plug-in to the System Center UI with support for Windows PowerShell
    • RIS (Remote Installation Service)
  – Networking
    • MS-MPI (Message Passing Interface)
    • ICS (Internet Connection Sharing): NAT for cluster nodes
    • NetworkDirect RDMA (Remote DMA)
  – Job scheduler
  – Storage: iSCSI SAN and SMB support
  – Failover support
Batch Processing
Batch processing
• Common application: graphics rendering
  – Maintain a queue of frames to be rendered
  – Have a dispatcher to remotely exec the rendering process
• Virtually no IPC needed
• Coordinator dispatches jobs
Single-queue work distribution
Render farms:
– Pixar:
  • 1,024 2.8-GHz Xeon processors running Linux and RenderMan
  • 2 TB RAM, 60 TB disk space
  • Custom Linux software for articulating, animating/lighting (Marionette), scheduling (Ringmaster), and rendering (RenderMan)
  • Cars: each frame took 8 hours to render; consumes ~32 GB storage on a SAN
– DreamWorks:
  • >3,000 servers and >1,000 Linux desktops: HP xw9300 workstations and HP DL145 G2 servers with 8 GB/server
  • Shrek 3: 20 million CPU render hours; Platform LSF used for scheduling + Maya for modeling + Avid for editing + Python for pipelining; the movie uses 24 TB storage
Single-queue work distribution
Render farms:
– ILM:
  • 3,000-processor (AMD) render farm; expands to 5,000 by harnessing desktop machines
  • 20 Linux-based SpinServer NAS storage systems and 3,000 disks from Network Appliance
  • 10 Gbps Ethernet
– Sony Pictures' Imageworks:
  • Over 1,200 processors
  • Dell and IBM workstations
  • Almost 70 TB of data for The Polar Express
Batch processing
• OpenPBS.org:
  – Portable Batch System
  – Developed by Veridian MRJ for NASA
• Commands (examples below)
  – Submit job scripts
    • Submit interactive jobs
    • Force a job to run
  – List jobs
  – Delete jobs
  – Hold jobs
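A sketch of what driving PBS looks like in practice (flags and output vary across PBS versions; the job script contents are illustrative):

```
# render.pbs -- a minimal job script
#PBS -N render-frames          # job name
#PBS -l nodes=4:ppn=2          # request 4 nodes, 2 processors per node
#PBS -l walltime=01:00:00      # one-hour limit
cd $PBS_O_WORKDIR && ./render frames.txt

$ qsub render.pbs     # submit the job script
$ qsub -I             # submit an interactive job
$ qstat               # list jobs
$ qhold 1234          # hold job 1234
$ qdel 1234           # delete job 1234
$ qrun 1234           # force job 1234 to run (operator command)
```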
Grid computing
• Characteristics
  – Coordinate computing resources that are not subject to centralized control
  – Not application-specific
    • Use general-purpose protocols for accessing resources, authenticating, discovery, …
  – Some form of quality-of-service guarantee
• Capabilities
  – Computational resources: starting, monitoring
  – Storage access
  – Network resources
Open Grid Services Architecture (OGSA)
• Globus Toolkit
• All grid resources are modeled as services
  – Grid Resource Information Protocol (GRIP)
    • Register resources; based on LDAP
  – Grid Resource Access and Management Protocol (GRAM)
    • Allocation & monitoring of resources; based on HTTP
  – GridFTP
    • Data access
Globus services
• Define the service's interface (GWSDL)
  – Interface definition: an extension of WSDL
• Implement the service (Java)
• Define deployment parameters (WSDD)
  – Makes the service available through a Grid-enhanced web server
• Compile & generate a GAR file (Ant)
  – Ant: a Java build tool (Apache project), similar to make
• Deploy the service with Ant
  – Copies key files to a public directory tree
Load Balancing for the Web
Functions of a load balancer
• Load balancing
• Failover
• Planned outage management
Redirection
• Simplest technique: HTTP REDIRECT error code
The exchange:
1. Client sends a request to www.mysite.com
2. www.mysite.com responds: REDIRECT www03.mysite.com
3. Client re-sends the request to www03.mysite.com, which serves it
Redirection
• Trivial to implement
• Successive requests automatically go to the same web server
  – Important for sessions
• Visible to the customer
  – Some don't like it
• Bookmarks will usually tag a specific site
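On the wire, the redirect in the sequence above is just an HTTP status code with a Location header (302 shown here; a 301 works the same way):

```
GET / HTTP/1.1                        client -> www.mysite.com
Host: www.mysite.com

HTTP/1.1 302 Found                    www.mysite.com -> client
Location: http://www03.mysite.com/

GET / HTTP/1.1                        client -> www03.mysite.com
Host: www03.mysite.com

HTTP/1.1 200 OK                       page served by www03
```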
Software load balancer
e.g.: IBM Interactive Network Dispatcher software
• Forwards requests via load balancing
  – Leaves the original source address intact
  – Load balancer is not in the path of outgoing traffic (high bandwidth)
  – Kernel extensions for routing TCP and UDP requests
• Each server accepts connections on its own address and on the dispatcher's address
• The dispatcher changes the MAC address of packets
Software load balancer
The exchange:
1. Client "bobby" sends a request to www.mysite.com (the dispatcher's address)
2. The dispatcher rewrites the MAC address and forwards the packet to a server: src=bobby, dest=www03
3. www03 sends the response directly back to the client, bypassing the dispatcher
Load balancing router
• Routers have been getting smarter
  – Most support packet filtering
  – Add load balancing
• Examples: Cisco LocalDirector, Alteon, F5 BIG-IP
Load balancing router
• Assign one or more virtual addresses to a physical address
  – Incoming requests get mapped to a physical address
• Special assignments can be made per port
  – e.g., all FTP traffic goes to one machine
• Balancing decisions (two are sketched below):
  – Pick the machine with the fewest TCP connections
  – Factor in weights when selecting machines
  – Pick machines round-robin
  – Pick the fastest-connecting machine (SYN/ACK time)
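A toy C sketch of two of these decisions, weighted least-connections and round-robin; the server structure and its fields are hypothetical:

```c
#include <stddef.h>

struct server {
    int active_conns;   /* current number of TCP connections */
    int weight;         /* configured capacity weight, > 0 */
};

/* Weighted least-connections: minimize conns/weight, compared
 * cross-multiplied to avoid floating point. */
size_t pick_least_conns(const struct server *s, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (s[i].active_conns * s[best].weight <
            s[best].active_conns * s[i].weight)
            best = i;
    return best;
}

/* Round-robin: cycle through the servers in order. */
size_t pick_round_robin(size_t n)
{
    static size_t next = 0;
    return next++ % n;
}
```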
High Availability (HA)
High availability (HA)

Class                                    Level      Annual downtime
Continuous                               100%       0
Six nines (carrier-class switches)       99.9999%   30 seconds
Fault tolerant (carrier-class servers)   99.999%    5 minutes
Fault resilient                          99.99%     53 minutes
High availability                        99.9%      8.3 hours
Normal availability                      99-99.5%   44-87 hours
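These downtime figures follow from: annual downtime = (1 - availability) x one year. For example, five nines gives 10^-5 x 525,600 minutes ≈ 5.3 minutes per year, and six nines gives 10^-6 x 31,536,000 seconds ≈ 31.5 seconds.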
Clustering: high availability
• Fault-tolerant design
  – Stratus, NEC, Marathon Technologies
  – Applications run uninterrupted on a redundant subsystem
    • NEC and Stratus have applications running in lockstep synchronization
  – Two identical connected systems
  – If one server fails, the other takes over instantly
• Costly and inefficient
  – But does what it was designed to do
Clustering: high availability
• Availability addressed by many:
  – Sun, IBM, HP, Microsoft, SteelEye LifeKeeper, …
• If one server fails:
  – Fault is isolated to that node
  – Workload is spread over the surviving nodes
  – Allows scheduled maintenance without disruption
  – Nodes may need to take over IP addresses
Example: Windows Server 2003 clustering
• Network load balancing
  – Addresses web-server bottlenecks
• Component load balancing
  – Scales middle-tier software (COM objects)
• Failover support for applications
  – 8-node failover clusters
  – Applications are restarted on a surviving node
  – Shared-disk configuration using SCSI or Fibre Channel
  – Resource group: {disk drive, IP address, network name, service} can be moved during failover
Example: Windows Server 2003 clustering
• Top tier: cluster abstractions
  – Failover manager, resource monitor, cluster registry
• Middle tier: distributed operations
  – Global status update, quorum (keeps track of who's in charge), membership
• Bottom tier: OS and drivers
  – Cluster disk driver, cluster network drivers
  – IP address takeover
Clusters
Architectural models
HA issues
• How do you detect a failure?
• How long does it take to detect?
• How does a dead application move/restart?
• Where does it move to?
Heartbeat network
• Machines need to detect faulty systems
  – "ping" mechanism
• Need to distinguish system faults from network faults
  – Useful to maintain redundant networks
  – Send a periodic heartbeat to test a machine's liveness (sketched below)
  – Watch out for split brain!
• Ideally, use a network with a bounded response time
  – Lucent RCC used a serial-line interconnect
  – Microsoft Cluster Server supports a dedicated "private network"
    • Two network cards connected with a pass-through cable or hub
Failover configuration models
• Active/passive (N+M nodes)
  – M dedicated failover node(s) for N active nodes
• Active/active
  – Failed workload goes to the remaining nodes
Design options for failover
• Cold failover
  – Application restart
• Warm failover
  – Application checkpoints itself periodically (sketched below)
  – Restart from the last checkpointed image
  – May use a write-ahead log (tricky)
• Hot failover
  – Application state is lockstep-synchronized
  – Very difficult, expensive (resources), prone to software faults
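A toy sketch of warm-failover checkpointing: state is saved with an atomic write (write a temp file, then rename), and a restart resumes from the last image. The file names and state structure are made up for illustration.

```c
#include <stdio.h>

struct state { long next_job; };    /* whatever must survive a restart */

static void checkpoint(const struct state *st)
{
    FILE *f = fopen("ckpt.tmp", "wb");
    if (!f) return;
    fwrite(st, sizeof *st, 1, f);
    fclose(f);
    rename("ckpt.tmp", "ckpt");     /* atomic: readers see old or new image */
}

static void restore(struct state *st)
{
    FILE *f = fopen("ckpt", "rb");
    if (f) { fread(st, sizeof *st, 1, f); fclose(f); }
    else    st->next_job = 0;       /* cold start: no checkpoint yet */
}

int main(void)
{
    struct state st;
    restore(&st);                   /* warm restart from the last image */
    for (;;) {
        /* ... process job st.next_job ... */
        st.next_job++;
        if (st.next_job % 100 == 0)
            checkpoint(&st);        /* periodic checkpoint */
    }
}
```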
Design options for failover
With either type of failover…
• Multi-directional failover
  – Failed applications migrate to / restart on available systems
• Cascading failover
  – If the backup system fails, the application can be restarted on another surviving system
System support for HA
• Hot-pluggable devices
  – Minimize downtime for component swapping
• Redundant devices
  – Redundant power supplies
  – Parity on memory
  – Mirroring on disks (or RAID for HA)
  – Switchover of failed components
• Diagnostics
  – On-line serviceability
Shared resources (disk)
• Shared disk
  – Allows multiple systems to share access to disk drives
  – Works well if applications do not generate much disk I/O
  – Disk access must be synchronized
    • Synchronization via a distributed lock manager (DLM)
Shared resources (disk)
• Shared nothing
  – No shared devices
  – Each system has its own storage resources
  – No need to deal with DLMs
  – If machine A needs resources on B, A sends a message to B
    • If B fails, storage requests have to be switched over to a live node
Cluster interconnects
• Traditional WANs and LANs may be too slow to serve as a cluster interconnect
  – Connecting server nodes, storage nodes, I/O channels, even memory pages
• Storage Area Network (SAN)
  – Fibre Channel connectivity to external storage devices
  – Any node can be configured to access any storage through a Fibre Channel switch
• System Area Network (SAN)
  – Switched interconnect to switch cluster resources
  – Low-latency I/O without processor intervention
  – Scalable switching fabric
  – (Compaq, Tandem's ServerNet)
  – Microsoft Windows 2000 supports Winsock Direct for SAN communication
Achieving High Availability
[Diagram: Server A and Server B each connect to two LAN switches (switch A and switch B) carrying heartbeats, and through redundant Fibre Channel switches (Fabric A and Fabric B) to the storage area networks; heartbeats 2 and 3 run over separate paths.]
Achieving High Availability
[Diagram: the same dual-server arrangement built with Ethernet: Server A and Server B connect through redundant Ethernet switches (A' and B') to local area networks used for iSCSI storage, with separate LANs (switch A, switch B) carrying heartbeats 2 and 3.]
HA Storage: RAID
Redundant Array of Independent (Inexpensive) Disks
RAID 0: Performance
• Striping (mapping rule sketched below)
• Advantages:
  – Performance
  – All storage capacity can be used
• Disadvantage:
  – Not fault tolerant

Layout:
  Disk 0: Block 0, Block 2, Block 4
  Disk 1: Block 1, Block 3, Block 5
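The striping rule behind the layout, as a pair of one-line helpers (illustrative only): logical block b lives on disk b mod N at stripe b / N.

```c
/* RAID 0 placement of logical block b across n_disks striped disks */
int raid0_disk(int b, int n_disks)   { return b % n_disks; }
int raid0_stripe(int b, int n_disks) { return b / n_disks; }
```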
RAID 1: HA
• Mirroring
• Advantages:
  – Double read speed
  – No rebuild necessary if a disk fails: just copy
• Disadvantage:
  – Only half the space is usable

Layout:
  Disk 0: Block 0, Block 1, Block 2
  Disk 1: Block 0, Block 1, Block 2 (mirror)
RAID 3 and RAID 4: HA
• Separate parity disk (parity computation sketched below)
  – RAID 3 uses byte-level striping
  – RAID 4 uses block-level striping
• Advantages:
  – Very fast reads
  – High efficiency: low ratio of parity to data
• Disadvantages:
  – Slow random I/O performance
  – Only one I/O at a time for RAID 3

Layout:
  Disk 0: Block 0a, Block 1a, Block 2a
  Disk 1: Block 0b, Block 1b, Block 2b
  Disk 2: Block 0c, Block 1c, Block 2c
  Disk 3: Parity 0, Parity 1, Parity 2
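The parity disk holds the bytewise XOR of the data blocks in each stripe (the standard RAID parity scheme), so any single lost block can be rebuilt from the survivors:

```c
#include <stddef.h>

/* Parity for a 3-data-disk stripe: P = A ^ B ^ C */
void compute_parity(const unsigned char *a, const unsigned char *b,
                    const unsigned char *c, unsigned char *p, size_t len)
{
    for (size_t i = 0; i < len; i++)
        p[i] = a[i] ^ b[i] ^ c[i];
}

/* Rebuild block B after its disk fails: B = P ^ A ^ C */
void rebuild_b(const unsigned char *a, const unsigned char *c,
               const unsigned char *p, unsigned char *b, size_t len)
{
    for (size_t i = 0; i < len; i++)
        b[i] = p[i] ^ a[i] ^ c[i];
}
```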
RAID 5: HA
• Interleaved parity
• Advantages:
  – Very fast reads
  – High efficiency: low ratio of parity to data
• Disadvantages:
  – Slower writes
  – Complex controller

Layout (parity rotates across the disks):
  Disk 0: Block 0a, Block 1a, Block 2a
  Disk 1: Block 0b, Block 1b, Parity 2
  Disk 2: Block 0c, Parity 1, Block 2b
  Disk 3: Parity 0, Block 1c, Block 2c
RAID 1+0
• Combine mirroring and striping
  – Striping across a set of disks
  – Mirroring of the entire set onto another set
The end