Single System Abstractions for Clusters of Workstations
Transcript of Single System Abstractions for Clusters of Workstations
Single System Abstractions for Clusters of Workstations
Bienvenido Vélez
What is a cluster?

A collection of loosely connected self-contained computers cooperating to provide the abstraction of a single one

Transparency is a goal

[Spectrum: massively parallel processor (fine grain parallelism, fast interconnects) … multi-programmed system (coarse grain concurrency, independent nodes)]

Characterized by System Abstraction
Possible System Abstractions

Question

Compare three approaches to provide the abstraction of a single system for clusters of workstations using the following criteria:
• Transparency
• Availability
• Scalability
Contributions

• Improvements to the Microsoft Cluster Service
  + better availability and scalability
• Adaptive Replication
  + automatically adapting replication levels to maintain availability as the cluster grows
Outline

• Comparison of approaches
  + Transparent remote execution (GLUnix)
  + Preemptive load balancing (MOSIX)
  + Highly available servers (Microsoft Cluster Service)
• Contributions
  + Improvements to the MS Cluster Service
  + Adaptive Replication
• Conclusions
GLUnix: Transparent Remote Execution

[Diagram: the user runs "glurun make" on the home node; the startup process (glurun) sends Execute(make, env) to the master daemon on the master node, which selects a remote node; the node daemon there forks and execs make; stdin, stdout/stderr, and signals are relayed through the home node]

• Dynamic load balancing
GLUnix: Virtues and Limitations

• Transparency
  + home node transparency limited by user-level implementation
  + interactive jobs supported
  – special commands for running cluster jobs
• Availability
  + detects and masks node failures
  – master process is a single point of failure
• Scalability
  – master process is a performance bottleneck
MOSIX: Preemptive Load Balancing

[Diagram: node processes migrating among five nodes]

• probabilistic diffusion of load information
• redirects system calls to the home node
MOSIX: Preemptive Load Balancing

[Diagram: loop — delay; exchange local load with a random node; consider migrating a process to the node with minimal cost]

• Keeps load information from a fixed number of nodes
• load = average size of the ready queue
• cost = f(cpu time) + f(communication) + f(migration time)
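The probabilistic diffusion of load information can be sketched as follows. This is a minimal illustration, not MOSIX code: `Node`, `WINDOW`, and the exchange loop are assumed names, and "load" is simplified to a single number.

```python
import random

random.seed(0)  # for reproducibility of the sketch

WINDOW = 4  # fixed number of nodes whose load each node remembers

class Node:
    def __init__(self, name, ready_queue_len):
        self.name = name
        # load = average size of the ready queue (as on the slide)
        self.load = ready_queue_len
        self.known = {}  # node name -> last load value heard

    def exchange(self, other):
        """Exchange local load with a randomly chosen node."""
        self.remember(other.name, other.load)
        other.remember(self.name, self.load)

    def remember(self, name, load):
        self.known[name] = load
        # keep only a fixed number of entries, dropping the oldest
        while len(self.known) > WINDOW:
            self.known.pop(next(iter(self.known)))

nodes = [Node(f"n{i}", load) for i, load in enumerate([8, 1, 3, 5, 2])]
for _ in range(20):
    a, b = random.sample(nodes, 2)
    a.exchange(b)

# every node now has a partial, probabilistically gathered view of cluster load
print(sorted(nodes[0].known.items()))
```

Because each node only ever talks to one random peer per round and bounds its table, the protocol's per-node cost stays constant as the cluster grows, which is the scalability point the slides make.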
MOSIX: Virtues and Limitations

• Transparency
  + limited home node transparency
• Availability
  + masks node failures
  – no process restart
  – preemptive load balancing limits portability and performance
• Scalability
  + low communication overhead
  – flooding and swinging possible
Microsoft Cluster Service (MSCS): Highly Available Server Processes

[Diagram: SQL and Web clients connect to server nodes; each node runs MSCS and shares node/server status with its peers]

• replicated consistent node/server status database
• migrates servers from failed nodes
Microsoft Cluster Service: Hardware Configuration

[Diagram: Web and SQL server nodes exchange status over ethernet and share the quorum, HTML, and RDB disks on a SCSI bus; the shared disks are single points of failure and the shared bus is a bottleneck]
MSCS: Virtues and Limitations

• Transparency
  + server migration transparent to clients
• Availability
  + servers migrated from failed nodes
  – shared disks are single points of failure
• Scalability
  – manual static configuration
  – manual static load balancing
  – shared disk bus is a performance bottleneck
Summary of Approaches

System | Transparency           | Availability                                          | Scalability
GLUnix | home node, limited     | masks failures; single point of failure; no fail-over | load balancing; bottleneck
MOSIX  | home node, transparent | masks failures; no fail-over                          | load balancing
MSCS   | clients                | server fail-over; single point of failure             | bottleneck
Transaction-based Replication

[Diagram: transactions { write[x1], …, write[xn] } operate on logical objects; the replication layer maps each write[x] onto the object's copies at node 1 … node n]
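The idea in the diagram can be sketched as a toy write-all store: a logical write[x] is expanded into writes on every copy of x, and the whole set { write[x1], …, write[xn] } is installed atomically. The class and method names are illustrative, not MSCS APIs.

```python
# Toy sketch of transaction-based replication (write-all, no failures).

class ReplicatedStore:
    def __init__(self, num_nodes):
        # one dict per node plays the role of that node's copy
        self.copies = [dict() for _ in range(num_nodes)]

    def commit(self, writes):
        """Apply a transaction given as {object: value}.

        All writes are staged on every copy first and then installed in
        one step, so readers never observe a partial transaction.
        """
        staged = [dict(copy) for copy in self.copies]
        for obj, value in writes.items():
            for copy in staged:        # replication: write[x] -> all copies
                copy[obj] = value
        self.copies = staged           # atomic install of the transaction

store = ReplicatedStore(num_nodes=3)
store.commit({"x1": 10, "x2": 20})
print(store.copies[2]["x1"])  # every node sees the committed value
```

A real implementation would add two-phase commit and a quorum rule to survive node failures; the sketch only shows the mapping from logical writes to per-copy writes.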
Re-designing MSCS

• Idea: new core resource group fixed on every node
  + special disk resource
  + distributed transaction processing resource
  + transactional replicated file storage resource
• Implement consensus with transactions (El-Abbadi-Toueg algorithm)
  + changes to configuration DB
  + cluster membership service
• Improvements
  + eliminates complex global update and regroup protocols
  + switchover not required for application data
  + provides a new generally useful service: transactional replicated object storage
Re-designed MSCS with Transactional Replicated Object Storage

[Diagram: on each node, the Cluster Service (node manager, resource manager) talks via RPC to Resource Monitors loading resource DLLs; a Transaction Service and a Replicated Storage Service are added; nodes communicate over the network]
Adaptive Replication: Problem

What should a replication service do when nodes are added to the cluster: replication vs. migration?

Goal: maintain availability
• Must alternate migration with replication
• Replication (R) should happen significantly less often than migration (M)
Hypothesis

Replication increases the number of copies of objects.

[Diagram: 2 nodes, each holding copies x and y; 2 nodes added; 4 nodes, each holding copies x and y]

Migration re-distributes objects across all nodes.

[Diagram: 2 nodes holding copies x and y; 2 nodes added; the same copies of x and y spread across 4 nodes]
Simplifying Assumptions

• System keeps the same number of copies k of each object
• System has n nodes
• Initially n = k
• n increases k nodes at a time
• Ignore partitions in computing availability
Conjecture

Highest availability can be obtained if objects are partitioned into q = n / k groups living on disjoint sets of nodes.

Example: k = 3, n = 6, q = 2
[Diagram: one group of objects X′ with 3 copies on the first k nodes; a second group X″ with 3 copies on the remaining k nodes]

Let's call this optimal migration.
Adaptive Replication Necessary

Let each node fail independently with probability p. Ignoring partitions, the availability of the system is:

A(k, n) = 1 − q · p^k

Since optimal migration always increases q = n / k, migration decreases availability (albeit slowly).

Adaptive replication may be necessary to maintain availability.
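The argument can be checked numerically. The helper below is illustrative, assuming p denotes the probability that a node is down and n is a multiple of k; it shows availability eroding as optimal migration grows q.

```python
# Numeric check: with per-node failure probability p and q = n / k
# disjoint groups of k nodes, system availability is A = 1 - q * p**k,
# so migration (which grows q) slowly decreases availability.

def availability(k, n, p):
    q = n // k                 # number of disjoint groups (n a multiple of k)
    return 1 - q * p ** k      # approximation ignoring partitions

p = 0.01   # probability a node is down
k = 3      # copies per object
for n in (3, 6, 12, 24):
    print(n, availability(k, n, p))
```

With p = 0.01 and k = 3 the erosion is tiny per step, which matches the slide's "albeit slowly": it takes many doublings of the cluster before an extra round of replication (increasing k) is actually needed.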
Adaptive Replication: Further Work

• Determine when it matters in real situations
• Relax assumptions
• Formalize arguments
"Home Node" Single System Image

System       | LCM layers supported | Mechanisms Used
Berkeley NOW | NET, CGP, FGP        | active messages; transparent remote execution; message passing API
MOSIX        | NET, CGP             | preemptive load balancing; kernel-to-kernel RPC
MSCS         | CGP                  | node regroup; resource failover; switchover
ParaStation  | NET, FGP             | user level protocol stack with semaphores

Talk focuses on the Coarse Grain (CGP) layer.
GLUnix: Characteristics

• Provides special user commands for managing cluster jobs
• Both batch and interactive jobs can be executed remotely
• Supports dynamic load balancing
MOSIX: Preemptive Load Balancing

[Flowchart: load balance — if no less loaded node exists, return; otherwise select the candidate process p with maximal impact on local load; if p cannot migrate, return; select the target node N that minimizes the cost C[N] of running p there; signal p to consider migration; if p agrees, it migrates to N; otherwise return]
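The decision flow above can be sketched as a single function. The process records and the cost function are illustrative placeholders, not MOSIX internals.

```python
# Sketch of the MOSIX load-balance decision flow described above.

def load_balance(local_load, known_loads, processes, cost):
    """known_loads: {node: load}; processes: candidate process records;
    cost(p, node): estimated cost of running process p on that node."""
    # less loaded node exists?
    lighter = {n: l for n, l in known_loads.items() if l < local_load}
    if not lighter:
        return None
    # select candidate process p with maximal impact on local load
    p = max(processes, key=lambda proc: proc["load"])
    if not p["can_migrate"]:
        return None
    # select target node N that minimizes the cost C[N] of running p there
    target = min(lighter, key=lambda n: cost(p, n))
    # signal p to consider migration; p may still refuse
    if p["consider"](target):
        return (p["name"], target)
    return None

procs = [
    {"name": "a", "load": 5, "can_migrate": True, "consider": lambda n: True},
    {"name": "b", "load": 2, "can_migrate": True, "consider": lambda n: True},
]
decision = load_balance(
    local_load=7,
    known_loads={"n1": 3, "n2": 1},
    processes=procs,
    cost=lambda p, n: {"n1": 3, "n2": 1}[n] + p["load"],
)
print(decision)  # heaviest process paired with the cheapest target
```

Note the two early returns: migration is skipped entirely when no lighter node is known or the best candidate is pinned, which keeps the common case cheap.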
xFS: Distributed Log-based File System

[Diagram: a client accumulates dirty data blocks into a log segment; the segment is striped across a stripe group as data stripes plus a parity stripe; writes are always sequential]
xFS: Virtues and Limitations

+ Exploits aggregate bandwidth of all disks
+ No need to buy expensive RAIDs
+ No single point of failure
– Reliability: relies on accumulating dirty blocks to generate large sequential writes
– Adaptive replication potentially more difficult
Microsoft Cluster Service (MSCS): Goal

[Diagram: an off-the-shelf server application plus a cluster-aware wrapper yields a highly available server application]
MSCS: Abstractions

• Node
• Resource
  + e.g. disks, IP addresses, servers
• Resource dependency
  + e.g. a DBMS depends on the disk holding its data
• Resource group
  + e.g. a server and its IP number
• Quorum resource
  + logs configuration data
  + breaks ties during membership changes
MSCS: General Characteristics

• Global state of all nodes and resources is consistently replicated across all nodes (write-all using an atomic multicast protocol)
• Node and resource failures are detected
• Resources of failed nodes are migrated to surviving nodes
• Failed resources are restarted
MSCS: System Architecture

[Diagram: on each node, the Cluster Service (node manager, resource manager) talks via RPC to Resource Monitors loading resource DLLs that control the resources; nodes communicate over the network]
MSCS: Virtually Synchronous Regroup Operation

[State machine: regroup → Activate → Closing → Pruning → Cleanup 1 → Cleanup 2 → end]

• Activate: determine the nodes in its connected component
• Closing: determine if its component is the primary; elect a new tie-breaker; if this node is the new tie-breaker, broadcast its component as the new membership
• Pruning: if not in the new membership, halt
• Cleanup 1: install the new membership from the new tie-breaker; acknowledge "ready to commit"
• Cleanup 2: if the node owns the quorum disk, log the membership change
MSCS: Primary Component Determination Rule

A node is in the primary component if one of the following holds:
• the node is connected to a majority of the previous membership
• the node is connected to half (>= 2) of the previous members and one of those is a tie-breaker
• the node is isolated, the previous membership had two nodes, and the node owned the quorum resource during the previous membership
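The three-clause rule transcribes directly into code. The function and parameter names below are illustrative; MSCS exposes no such API.

```python
# Sketch of the MSCS primary component determination rule.

def in_primary_component(connected, previous, tie_breaker,
                         owned_quorum_last_time):
    """connected: members of the previous membership this node can reach
    (including itself); previous: the previous membership set."""
    n = len(previous)
    reach = len(connected)
    # 1. connected to a majority of the previous membership
    if reach > n // 2:
        return True
    # 2. connected to exactly half (>= 2) of the previous members,
    #    one of which is the tie-breaker
    if reach * 2 == n and reach >= 2 and tie_breaker in connected:
        return True
    # 3. isolated, previous membership had two nodes, and this node
    #    owned the quorum resource during the previous membership
    if reach == 1 and n == 2 and owned_quorum_last_time:
        return True
    return False

# a 4-node cluster partitions 2/2; only the half holding the tie-breaker survives
prev = {"A", "B", "C", "D"}
print(in_primary_component({"A", "B"}, prev, tie_breaker="B",
                           owned_quorum_last_time=False))  # True
print(in_primary_component({"C", "D"}, prev, tie_breaker="B",
                           owned_quorum_last_time=False))  # False
```

The example shows why the rule is safe: in an even split at most one side can contain the tie-breaker, so at most one component ever declares itself primary.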
MSCS: Switchover

[Diagram: two nodes share disks on a SCSI bus; on node failure, the failed node's disks are switched over to the surviving node]

Alternative: Replication

With switchover, every disk is a single point of failure!
Summary of Approaches

System       | Transparency           | Availability                                            | Performance
Berkeley NOW | home node, limited     | single point of failure; no fail-over                   | load balancing; bottleneck
MOSIX        | home node, transparent | masks failures; no fail-over; tolerates partitions      | load balancing; low msg overhead
MSCS         | server                 | single point of failure; low MTTR; tolerates partitions | bottleneck
System       | LCM layers supported | Mechanisms Used
Berkeley NOW | NET, CGP, FGP        | active messages; transparent remote execution; message passing API
MOSIX        | NET, CGP             | preemptive load balancing; kernel-to-kernel RPC
MSCS         | CGP                  | cluster membership services; resource fail-over
ParaStation  | NET, FGP             | user level protocol stack; network interface hardware
Comparing Approaches: Design Goals

Approach                   | Description                                                   | System
centralized                | processes run to completion once assigned to a processor      | Berkeley NOW
distributed: probabilistic | processes brought offline at source and online at destination | MOSIX
replicated: consistent     | process migrated at any point during execution                | MSCS
Comparing Approaches: Global Information Management

System       | Failure detection                        | Recovery action
Berkeley NOW | detected by master daemon; timeouts      | failed nodes removed from central configuration DB
MOSIX        | detected by individual nodes; timeouts   | failed nodes removed from local configuration DB
MSCS         | detected by individual nodes; heartbeats | failed nodes removed from replicated configuration DB; resources restarted/migrated
Comparing Approaches: Fault-tolerance

System       | Single Points of Failure      | Possible solution
Berkeley NOW | master process                | process pairs
MOSIX        | none                          | N.A.
MSCS         | quorum resource; shared disks | virtual partitions replication algorithm
Comparing Approaches: Load Balancing

Approach   | Description                                                     | System
manual     | sys admin manually assigns processes to nodes                   | MSCS
static     | processes statically assigned to processors                     | —
dynamic    | uses dynamic load information to assign processes to processors | Berkeley NOW
preemptive | migrates processes in the middle of their execution             | MOSIX
Comparing Approaches: Process Migration

Approach                     | Description                                                   | System
none                         | processes run to completion once assigned to a processor      | Berkeley NOW
cooperative shutdown/restart | processes brought offline at source and online at destination | MSCS
transparent                  | process migrated at any point during execution                | MOSIX
Example: k = 3, n = 3

[Diagram: three nodes, each holding one copy of x]

Each letter (e.g. x above) represents a group of objects with copies in the same subset of nodes.
[Taxonomy: redundancy → error-correcting codes (RAID, xFS) and replication; replication → fail-over/failback with switch-over (MSCS), primary copy (HARP), and voting / quorum consensus; voting → voting w/ views (virtual partitions)]