Single System Abstractions for Clusters of Workstations
Transcript of Single System Abstractions for Clusters of Workstations
Single System Abstractions for Clusters of Workstations
Bienvenido Vélez
What is a cluster?

A collection of loosely connected self-contained computers cooperating to provide the abstraction of a single one

Transparency is a goal

[Spectrum: massively parallel processor (fine grain parallelism, fast interconnects) … multi-programmed system (coarse grain concurrency, independent nodes)]

Characterized by System Abstraction
Possible System Abstractions

Question

Compare three approaches to provide the abstraction of a single system for clusters of workstations using the following criteria:
• Transparency
• Availability
• Scalability
Contributions

• Improvements to the Microsoft Cluster Service
  + better availability and scalability
• Adaptive Replication
  + automatically adapting replication levels to maintain availability as the cluster grows
Outline

• Comparison of approaches
  + Transparent remote execution (GLUnix)
  + Preemptive load balancing (MOSIX)
  + Highly available servers (Microsoft Cluster Service)
• Contributions
  + Improvements to the MS Cluster Service
  + Adaptive Replication
• Conclusions
GLUnix: Transparent Remote Execution

[Diagram: the user runs "glurun make" on the home node; the startup process (glurun) sends Execute(make, env) to the master daemon on the master node, which selects a remote node; the node daemon there forks and execs make; stdin, stdout/stderr, and signals are relayed through the home node]

• Dynamic load balancing
GLUnix: Virtues and Limitations

• Transparency
  + home node transparency limited by user-level implementation
  + interactive jobs supported
  – special commands for running cluster jobs
• Availability
  + detects and masks node failures
  – master process is a single point of failure
• Scalability
  – master process is a performance bottleneck
MOSIX: Preemptive Load Balancing

[Diagram: node processes migrating among five nodes]

• probabilistic diffusion of load information
• redirects system calls to the home node
MOSIX: Preemptive Load Balancing

[Diagram: loop — delay; exchange local load with a random node; consider migrating a process to the node with minimal cost]

• Keeps load information from a fixed number of nodes
• load = average size of the ready queue
• cost = f(cpu time) + f(communication) + f(migration time)
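The probabilistic diffusion of load information can be sketched as follows. This is a minimal illustration, not MOSIX code: `Node`, `WINDOW`, and the exchange loop are assumed names, and "load" is simplified to a single number.

```python
import random

random.seed(0)  # for reproducibility of the sketch

WINDOW = 4  # fixed number of nodes whose load each node remembers

class Node:
    def __init__(self, name, ready_queue_len):
        self.name = name
        # load = average size of the ready queue (as on the slide)
        self.load = ready_queue_len
        self.known = {}  # node name -> last load value heard

    def exchange(self, other):
        """Exchange local load with a randomly chosen node."""
        self.remember(other.name, other.load)
        other.remember(self.name, self.load)

    def remember(self, name, load):
        self.known[name] = load
        # keep only a fixed number of entries, dropping the oldest
        while len(self.known) > WINDOW:
            self.known.pop(next(iter(self.known)))

nodes = [Node(f"n{i}", load) for i, load in enumerate([8, 1, 3, 5, 2])]
for _ in range(20):
    a, b = random.sample(nodes, 2)
    a.exchange(b)

# every node now has a partial, probabilistically gathered view of cluster load
print(sorted(nodes[0].known.items()))
```

Because each node only ever talks to one random peer per round and bounds its table, the protocol's per-node cost stays constant as the cluster grows, which is the scalability point the slides make.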
MOSIX: Virtues and Limitations

• Transparency
  + limited home node transparency
• Availability
  + masks node failures
  – no process restart
  – preemptive load balancing limits portability and performance
• Scalability
  + low communication overhead
  – flooding and swinging possible
Microsoft Cluster Service (MSCS): Highly Available Server Processes

[Diagram: SQL and Web clients connect to server nodes; each node runs MSCS and shares node/server status with its peers]

• replicated consistent node/server status database
• migrates servers from failed nodes
Microsoft Cluster Service: Hardware Configuration

[Diagram: Web and SQL server nodes exchange status over ethernet and share the quorum, HTML, and RDB disks on a SCSI bus; the shared disks are single points of failure and the shared bus is a bottleneck]
MSCS: Virtues and Limitations

• Transparency
  + server migration transparent to clients
• Availability
  + servers migrated from failed nodes
  – shared disks are single points of failure
• Scalability
  – manual static configuration
  – manual static load balancing
  – shared disk bus is a performance bottleneck
Summary of Approaches

System | Transparency           | Availability                                          | Scalability
GLUnix | home node, limited     | masks failures; single point of failure; no fail-over | load balancing; bottleneck
MOSIX  | home node, transparent | masks failures; no fail-over                          | load balancing
MSCS   | clients                | server fail-over; single point of failure             | bottleneck
Transaction-based Replication

[Diagram: transactions { write[x1], …, write[xn] } operate on logical objects; the replication layer maps each write[x] onto the object's copies at node 1 … node n]
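The idea in the diagram can be sketched as a toy write-all store: a logical write[x] is expanded into writes on every copy of x, and the whole set { write[x1], …, write[xn] } is installed atomically. The class and method names are illustrative, not MSCS APIs.

```python
# Toy sketch of transaction-based replication (write-all, no failures).

class ReplicatedStore:
    def __init__(self, num_nodes):
        # one dict per node plays the role of that node's copy
        self.copies = [dict() for _ in range(num_nodes)]

    def commit(self, writes):
        """Apply a transaction given as {object: value}.

        All writes are staged on every copy first and then installed in
        one step, so readers never observe a partial transaction.
        """
        staged = [dict(copy) for copy in self.copies]
        for obj, value in writes.items():
            for copy in staged:        # replication: write[x] -> all copies
                copy[obj] = value
        self.copies = staged           # atomic install of the transaction

store = ReplicatedStore(num_nodes=3)
store.commit({"x1": 10, "x2": 20})
print(store.copies[2]["x1"])  # every node sees the committed value
```

A real implementation would add two-phase commit and a quorum rule to survive node failures; the sketch only shows the mapping from logical writes to per-copy writes.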
Re-designing MSCS

• Idea: new core resource group fixed on every node
  + special disk resource
  + distributed transaction processing resource
  + transactional replicated file storage resource
• Implement consensus with transactions (El-Abbadi-Toueg algorithm)
  + changes to configuration DB
  + cluster membership service
• Improvements
  + eliminates complex global update and regroup protocols
  + switchover not required for application data
  + provides a new generally useful service: transactional replicated object storage
Re-designed MSCS with Transactional Replicated Object Storage

[Diagram: on each node, the Cluster Service (node manager, resource manager) talks via RPC to Resource Monitors loading resource DLLs; a Transaction Service and a Replicated Storage Service are added; nodes communicate over the network]
Adaptive Replication: Problem

What should a replication service do when nodes are added to the cluster: replication vs. migration?

Goal: maintain availability
• Must alternate migration with replication
• Replication (R) should happen significantly less often than migration (M)
Hypothesis

Replication increases the number of copies of objects.

[Diagram: 2 nodes, each holding copies x and y; 2 nodes added; 4 nodes, each holding copies x and y]

Migration re-distributes objects across all nodes.

[Diagram: 2 nodes holding copies x and y; 2 nodes added; the same copies of x and y spread across 4 nodes]
Simplifying Assumptions

• System keeps the same number of copies k of each object
• System has n nodes
• Initially n = k
• n increases k nodes at a time
• Ignore partitions in computing availability
Conjecture

Highest availability can be obtained if objects are partitioned into q = n / k groups living on disjoint sets of nodes.

Example: k = 3, n = 6, q = 2
[Diagram: one group of objects X′ with 3 copies on the first k nodes; a second group X″ with 3 copies on the remaining k nodes]

Let's call this optimal migration.
Adaptive Replication Necessary

Let each node fail independently with probability p. Ignoring partitions, the availability of the system is:

A(k, n) = 1 − q · p^k

Since optimal migration always increases q = n / k, migration decreases availability (albeit slowly).

Adaptive replication may be necessary to maintain availability.
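The argument can be checked numerically. The helper below is illustrative, assuming p denotes the probability that a node is down and n is a multiple of k; it shows availability eroding as optimal migration grows q.

```python
# Numeric check: with per-node failure probability p and q = n / k
# disjoint groups of k nodes, system availability is A = 1 - q * p**k,
# so migration (which grows q) slowly decreases availability.

def availability(k, n, p):
    q = n // k                 # number of disjoint groups (n a multiple of k)
    return 1 - q * p ** k      # approximation ignoring partitions

p = 0.01   # probability a node is down
k = 3      # copies per object
for n in (3, 6, 12, 24):
    print(n, availability(k, n, p))
```

With p = 0.01 and k = 3 the erosion is tiny per step, which matches the slide's "albeit slowly": it takes many doublings of the cluster before an extra round of replication (increasing k) is actually needed.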
Adaptive Replication: Further Work

• Determine when it matters in real situations
• Relax assumptions
• Formalize arguments
"Home Node" Single System Image

System       | LCM layers supported | Mechanisms Used
Berkeley NOW | NET, CGP, FGP        | active messages; transparent remote execution; message passing API
MOSIX        | NET, CGP             | preemptive load balancing; kernel-to-kernel RPC
MSCS         | CGP                  | node regroup; resource failover; switchover
ParaStation  | NET, FGP             | user level protocol stack with semaphores

Talk focuses on the Coarse Grain (CGP) layer.
GLUnix: Characteristics

• Provides special user commands for managing cluster jobs
• Both batch and interactive jobs can be executed remotely
• Supports dynamic load balancing
MOSIX: Preemptive Load Balancing

[Flowchart: load balance — if no less loaded node exists, return; otherwise select the candidate process p with maximal impact on local load; if p cannot migrate, return; select the target node N that minimizes the cost C[N] of running p there; signal p to consider migration; if p agrees, it migrates to N; otherwise return]
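The decision flow above can be sketched as a single function. The process records and the cost function are illustrative placeholders, not MOSIX internals.

```python
# Sketch of the MOSIX load-balance decision flow described above.

def load_balance(local_load, known_loads, processes, cost):
    """known_loads: {node: load}; processes: candidate process records;
    cost(p, node): estimated cost of running process p on that node."""
    # less loaded node exists?
    lighter = {n: l for n, l in known_loads.items() if l < local_load}
    if not lighter:
        return None
    # select candidate process p with maximal impact on local load
    p = max(processes, key=lambda proc: proc["load"])
    if not p["can_migrate"]:
        return None
    # select target node N that minimizes the cost C[N] of running p there
    target = min(lighter, key=lambda n: cost(p, n))
    # signal p to consider migration; p may still refuse
    if p["consider"](target):
        return (p["name"], target)
    return None

procs = [
    {"name": "a", "load": 5, "can_migrate": True, "consider": lambda n: True},
    {"name": "b", "load": 2, "can_migrate": True, "consider": lambda n: True},
]
decision = load_balance(
    local_load=7,
    known_loads={"n1": 3, "n2": 1},
    processes=procs,
    cost=lambda p, n: {"n1": 3, "n2": 1}[n] + p["load"],
)
print(decision)  # heaviest process paired with the cheapest target
```

Note the two early returns: migration is skipped entirely when no lighter node is known or the best candidate is pinned, which keeps the common case cheap.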
xFS: Distributed Log-based File System

[Diagram: a client accumulates dirty data blocks into a log segment; the segment is striped across a stripe group as data stripes plus a parity stripe; writes are always sequential]
xFS: Virtues and Limitations

+ Exploits aggregate bandwidth of all disks
+ No need to buy expensive RAIDs
+ No single point of failure
– Reliability: relies on accumulating dirty blocks to generate large sequential writes
– Adaptive replication potentially more difficult
Microsoft Cluster Service (MSCS): Goal

[Diagram: an off-the-shelf server application plus a cluster-aware wrapper yields a highly available server application]
MSCS: Abstractions

• Node
• Resource
  + e.g. disks, IP addresses, servers
• Resource dependency
  + e.g. a DBMS depends on the disk holding its data
• Resource group
  + e.g. a server and its IP number
• Quorum resource
  + logs configuration data
  + breaks ties during membership changes
MSCS: General Characteristics

• Global state of all nodes and resources is consistently replicated across all nodes (write-all using an atomic multicast protocol)
• Node and resource failures are detected
• Resources of failed nodes are migrated to surviving nodes
• Failed resources are restarted
MSCS: System Architecture

[Diagram: on each node, the Cluster Service (node manager, resource manager) talks via RPC to Resource Monitors loading resource DLLs that control the resources; nodes communicate over the network]
MSCS: Virtually Synchronous Regroup Operation

[State machine: regroup → Activate → Closing → Pruning → Cleanup 1 → Cleanup 2 → end]

• Activate: determine the nodes in its connected component
• Closing: determine if its component is the primary; elect a new tie-breaker; if this node is the new tie-breaker, broadcast its component as the new membership
• Pruning: if not in the new membership, halt
• Cleanup 1: install the new membership from the new tie-breaker; acknowledge "ready to commit"
• Cleanup 2: if the node owns the quorum disk, log the membership change
MSCS: Primary Component Determination Rule

A node is in the primary component if one of the following holds:
• the node is connected to a majority of the previous membership
• the node is connected to half (>= 2) of the previous members and one of those is a tie-breaker
• the node is isolated, the previous membership had two nodes, and the node owned the quorum resource during the previous membership
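The three-clause rule transcribes directly into code. The function and parameter names below are illustrative; MSCS exposes no such API.

```python
# Sketch of the MSCS primary component determination rule.

def in_primary_component(connected, previous, tie_breaker,
                         owned_quorum_last_time):
    """connected: members of the previous membership this node can reach
    (including itself); previous: the previous membership set."""
    n = len(previous)
    reach = len(connected)
    # 1. connected to a majority of the previous membership
    if reach > n // 2:
        return True
    # 2. connected to exactly half (>= 2) of the previous members,
    #    one of which is the tie-breaker
    if reach * 2 == n and reach >= 2 and tie_breaker in connected:
        return True
    # 3. isolated, previous membership had two nodes, and this node
    #    owned the quorum resource during the previous membership
    if reach == 1 and n == 2 and owned_quorum_last_time:
        return True
    return False

# a 4-node cluster partitions 2/2; only the half holding the tie-breaker survives
prev = {"A", "B", "C", "D"}
print(in_primary_component({"A", "B"}, prev, tie_breaker="B",
                           owned_quorum_last_time=False))  # True
print(in_primary_component({"C", "D"}, prev, tie_breaker="B",
                           owned_quorum_last_time=False))  # False
```

The example shows why the rule is safe: in an even split at most one side can contain the tie-breaker, so at most one component ever declares itself primary.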
MSCS: Switchover

[Diagram: two nodes share disks on a SCSI bus; on node failure, the failed node's disks are switched over to the surviving node]

Alternative: Replication

With switchover, every disk is a single point of failure!
Summary of Approaches

System       | Transparency           | Availability                                            | Performance
Berkeley NOW | home node, limited     | single point of failure; no fail-over                   | load balancing; bottleneck
MOSIX        | home node, transparent | masks failures; no fail-over; tolerates partitions      | load balancing; low msg overhead
MSCS         | server                 | single point of failure; low MTTR; tolerates partitions | bottleneck
System       | LCM layers supported | Mechanisms Used
Berkeley NOW | NET, CGP, FGP        | active messages; transparent remote execution; message passing API
MOSIX        | NET, CGP             | preemptive load balancing; kernel-to-kernel RPC
MSCS         | CGP                  | cluster membership services; resource fail-over
ParaStation  | NET, FGP             | user level protocol stack; network interface hardware
Comparing Approaches: Design Goals

Approach                   | Description                                                   | System
centralized                | processes run to completion once assigned to a processor      | Berkeley NOW
distributed: probabilistic | processes brought offline at source and online at destination | MOSIX
replicated: consistent     | process migrated at any point during execution                | MSCS
Comparing Approaches: Global Information Management

System       | Failure detection                        | Recovery action
Berkeley NOW | detected by master daemon; timeouts      | failed nodes removed from central configuration DB
MOSIX        | detected by individual nodes; timeouts   | failed nodes removed from local configuration DB
MSCS         | detected by individual nodes; heartbeats | failed nodes removed from replicated configuration DB; resources restarted/migrated
Comparing Approaches: Fault-tolerance

System       | Single Points of Failure      | Possible solution
Berkeley NOW | master process                | process pairs
MOSIX        | none                          | N.A.
MSCS         | quorum resource; shared disks | virtual partitions replication algorithm
Comparing Approaches: Load Balancing

Approach   | Description                                                     | System
manual     | sys admin manually assigns processes to nodes                   | MSCS
static     | processes statically assigned to processors                     | —
dynamic    | uses dynamic load information to assign processes to processors | Berkeley NOW
preemptive | migrates processes in the middle of their execution             | MOSIX
Comparing Approaches: Process Migration

Approach                     | Description                                                   | System
none                         | processes run to completion once assigned to a processor      | Berkeley NOW
cooperative shutdown/restart | processes brought offline at source and online at destination | MSCS
transparent                  | process migrated at any point during execution                | MOSIX
Example: k = 3, n = 3

[Diagram: three nodes, each holding one copy of x]

Each letter (e.g. x above) represents a group of objects with copies in the same subset of nodes.
[Taxonomy: redundancy → error-correcting codes (RAID, xFS) and replication; replication → fail-over/failback with switch-over (MSCS), primary copy (HARP), and voting / quorum consensus; voting → voting w/ views (virtual partitions)]