Single System Abstractions for Clusters of Workstations


Transcript of Single System Abstractions for Clusters of Workstations

Page 1: Single System Abstractions for Clusters of Workstations

Single System Abstractions for Clusters of Workstations

Bienvenido Vélez

Page 2: Single System Abstractions for Clusters of Workstations

What is a cluster?

A collection of loosely connected self-contained computers cooperating to provide the abstraction of a single one.

Transparency is a goal.

Possible System Abstractions

System Abstraction             Characterized by
Massively parallel processor   fine grain parallelism, fast interconnects
Multi-programmed system        coarse grain concurrency, independent nodes

Page 3: Single System Abstractions for Clusters of Workstations

Question

Compare three approaches to providing the abstraction of a single system for clusters of workstations, using the following criteria:

• Transparency
• Availability
• Scalability

Page 4: Single System Abstractions for Clusters of Workstations

Contributions

• Improvements to the Microsoft Cluster Service
  + better availability and scalability

• Adaptive Replication
  + automatically adapting replication levels to maintain availability as the cluster grows

Page 5: Single System Abstractions for Clusters of Workstations

Outline

• Comparison of approaches
  + Transparent remote execution (GLUnix)
  + Preemptive load balancing (MOSIX)
  + Highly available servers (Microsoft Cluster Service)

• Contributions
  + Improvements to the MS Cluster Service
  + Adaptive Replication

• Conclusions

Page 6: Single System Abstractions for Clusters of Workstations

GLUnix: Transparent Remote Execution

[Diagram of "glurun make": the user starts glurun on the home node; the master daemon on the master node issues Execute(make, env) to a remote node selected by the master; the node daemon there forks and execs make; stdin, stdout, stderr, and signals are forwarded between the home node and the remote node.]

• Dynamic load balancing
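
The forwarding pattern in the diagram fits in a few lines. This is a minimal sketch, not GLUnix code: ssh stands in for the GLUnix master/node daemons, and the node name "node7" is invented.

```python
import signal
import subprocess
import sys

# Sketch of the glurun pattern: the home node keeps the user's terminal
# while the job runs remotely (ssh replaces the GLUnix daemons here).
def remote_run(node, argv):
    # The node daemon would fork and exec the job; ssh does so for us.
    proc = subprocess.Popen(["ssh", node] + argv,
                            stdin=sys.stdin, stdout=sys.stdout,
                            stderr=sys.stderr)
    # Forward job-control signals from the home node to the remote job.
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, lambda s, _frame: proc.send_signal(s))
    return proc.wait()

if __name__ == "__main__":
    sys.exit(remote_run("node7", sys.argv[1:] or ["make"]))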

Page 7: Single System Abstractions for Clusters of Workstations

GLUnix: Virtues and Limitations

• Transparency
  + home node transparency limited by user-level implementation
  + interactive jobs supported
  – special commands for running cluster jobs

• Availability
  + detects and masks node failures
  – master process is a single point of failure

• Scalability
  – master process is a performance bottleneck

Page 8: Single System Abstractions for  Clusters of Workstations

node process

1

2 4

3

5

MOSIXPreemptive Load Balancing

• probabilistic diffusion of load information

• redirects system calls to home node

Page 9: Single System Abstractions for Clusters of Workstations

MOSIX: Preemptive Load Balancing

[Flowchart: delay, then exchange local load with a random node, then consider migrating a process to the node with minimal cost; repeat.]

• keeps load information from a fixed number of nodes
• load = average size of the ready queue
• cost = f(cpu time) + f(communication) + f(migration time)
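
A minimal sketch of the probabilistic load diffusion described above; the Node class, WINDOW constant, and diffusion_step helper are invented for illustration (MOSIX implements this in the kernel).

```python
import random

WINDOW = 8  # fixed number of remote loads each node remembers (assumed)

class Node:
    def __init__(self, name, ready_queue_len):
        self.name = name
        self.load = ready_queue_len   # load = average size of ready queue
        self.known = {}               # peer name -> last load heard

    def exchange(self, peer):
        # Both nodes learn each other's current load.
        self.known[peer.name] = peer.load
        peer.known[self.name] = self.load
        # Keep only a fixed number of entries (drop the oldest).
        while len(self.known) > WINDOW:
            self.known.pop(next(iter(self.known)))

def diffusion_step(nodes):
    # Each node exchanges load with one randomly chosen peer.
    for n in nodes:
        n.exchange(random.choice([m for m in nodes if m is not n]))
```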

Page 10: Single System Abstractions for Clusters of Workstations

MOSIX: Virtues and Limitations

• Transparency
  + limited home node transparency

• Availability
  + masks node failures
  – no process restart
  – preemptive load balancing limits portability and performance

• Scalability
  – flooding and swinging possible
  + low communication overhead

Page 11: Single System Abstractions for  Clusters of Workstations

SQL

status

Microsoft Cluster Service (MSCS)Highly available server processes

clients

•replicated consistent node/server status database

• migrates servers from failed nodes

Web

clients

status

MSCS MSCS

Page 12: Single System Abstractions for Clusters of Workstations

Microsoft Cluster Service: Hardware Configuration

[Diagram: nodes exchange status over Ethernet and share a SCSI bus holding the quorum, HTML, and RDB disks serving the Web and SQL servers; the shared disks are single points of failure and the shared bus is a bottleneck.]

Page 13: Single System Abstractions for Clusters of Workstations

MSCS: Virtues and Limitations

Transparency
  + server migration transparent to clients

Availability
  + servers migrated from failed nodes
  – shared disks are single points of failure

Scalability
  – manual static configuration
  – manual static load balancing
  – shared disk bus is a performance bottleneck

Page 14: Single System Abstractions for Clusters of Workstations

Summary of Approaches

System   Transparency            Availability                                            Scalability
GLUnix   home node, limited      masks failures; single point of failure; no fail-over   load balancing; bottleneck
MOSIX    home node, transparent  masks failures; no fail-over                            load balancing
MSCS     clients                 server fail-over; single point of failure               bottleneck

Page 15: Single System Abstractions for Clusters of Workstations

Transaction-based Replication

[Diagram: transactions operate on a logical object x; a write[x] is replicated as { write[x1], …, write[xn] } operating on the copies stored at node 1 … node n.]
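
The picture restated as code. A minimal sketch assuming a hypothetical transaction manager and copy objects; the point is only that the copy writes commit or abort together.

```python
# Hypothetical txn_mgr and copies: each copy.write() is one write[xi].
def replicated_write(txn_mgr, copies, value):
    txn = txn_mgr.begin()
    try:
        for copy in copies:      # expand write[x] into write[x1..xn]
            copy.write(txn, value)
        txn.commit()             # all copies change atomically
    except Exception:
        txn.abort()              # or none do
        raise
```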

Page 16: Single System Abstractions for Clusters of Workstations

Re-designing MSCS

• Idea: new core resource group fixed on every node
  + special disk resource
  + distributed transaction processing resource
  + transactional replicated file storage resource

• Implement consensus with transactions (El-Abbadi-Toueg algorithm); see the sketch after this list
  + changes to configuration DB
  + cluster membership service

• Improvements
  + eliminates complex global update and regroup protocols
  + switchover not required for application data
  + provides a new generally useful service: transactional replicated object storage
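
A minimal sketch of the idea that configuration-DB changes become ordinary transactions against the replicated storage resource. This is not the El-Abbadi-Toueg algorithm itself, and `storage` and its methods are invented names.

```python
# Hypothetical `storage` = the transactional replicated file storage
# resource from the new core resource group.
def update_cluster_config(storage, key, value):
    txn = storage.begin()
    txn.write("config/" + key, value)  # replicated by the storage layer
    txn.commit()                       # replaces the global update protocol
```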

Page 17: Single System Abstractions for  Clusters of Workstations

resource DLLresource DLL

ReplicatedStorage Svc

TransactionService

Cluster Service

Resource

Monitor

RPC

RPC

resourcemanager

nodemanager

Node

network

Re-designed MSCSwith transactional replicated

object storage

Page 18: Single System Abstractions for Clusters of Workstations

Adaptive Replication: Problem

What should a replication service do when nodes are added to the cluster?

replication vs. migration

Goal: Maintain availability

Hypothesis
• Must alternate migration with replication
• Replication (R) should happen significantly less often than migration (M)

Page 19: Single System Abstractions for Clusters of Workstations

Replication increases the number of copies of objects

[Diagram: objects x and y stored on 2 nodes; when 2 nodes are added, replication places new copies of x and y on them, so all 4 nodes hold copies.]

Page 20: Single System Abstractions for Clusters of Workstations

Migration re-distributes objects across all nodes

[Diagram: objects x and y stored on 2 nodes; when 2 nodes are added, migration spreads the existing copies across the 4 nodes without creating new ones.]

Page 21: Single System Abstractions for Clusters of Workstations

Simplifying Assumptions

• System keeps the same number of copies k of each object

• System has n nodes

• Initially n = k

• n increases k nodes at a time

• Partitions are ignored in computing availability

Page 22: Single System Abstractions for  Clusters of Workstations

ConjectureHighest availability can be obtained if objects partitioned in q = n / k groups

living disjoint sets of nodes.

X’ X’X’

Example: k = 3, n = 6, q = 2

X” X”X”q

k

Lets call this optimal migration
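
For concreteness, a small sketch of the layout the conjecture describes (node and group names invented):

```python
# Partition n nodes into q = n // k disjoint groups of k nodes each and
# assign one object group per node group.
def optimal_layout(nodes, k):
    q = len(nodes) // k
    return {f"X{g}": nodes[g * k:(g + 1) * k] for g in range(q)}

# Example from the slide: k = 3, n = 6 gives q = 2 disjoint groups.
print(optimal_layout(["n1", "n2", "n3", "n4", "n5", "n6"], 3))
# {'X0': ['n1', 'n2', 'n3'], 'X1': ['n4', 'n5', 'n6']}
```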

Page 23: Single System Abstractions for Clusters of Workstations

Adaptive Replication Necessary

Let each node fail independently with probability p (i.e., per-node availability is 1 - p).

The availability of the system is then approximately:

A(k, n) = 1 - q * p^k, where q = n / k

Since optimal migration always increases q, migration decreases availability (albeit slowly).

Adaptive replication may be necessary to maintain availability.
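
A quick numeric check of the claim (p = 0.01 is an assumed value):

```python
p, k = 0.01, 3                 # per-node failure probability, copies
for n in (3, 6, 12, 24):
    q = n // k                 # groups after optimal migration
    A = 1 - q * p**k           # approximation from the slide
    print(f"n={n:2d}  q={q}  A={A:.6f}")
# n= 3  q=1  A=0.999999
# n= 6  q=2  A=0.999998
# n=12  q=4  A=0.999996
# n=24  q=8  A=0.999992
```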

Page 24: Single System Abstractions for Clusters of Workstations

Adaptive Replication: Further Work

• determine when it matters in real situations

• relax assumptions

• formalize arguments

Page 25: Single System Abstractions for Clusters of Workstations

“Home Node” Single System Image

Page 26: Single System Abstractions for Clusters of Workstations

System        LCM layers supported   Mechanisms used
Berkeley NOW  NET, CGP, FGP          Active Messages, transparent remote execution, message passing API
MOSIX         NET, CGP               preemptive load balancing, kernel-to-kernel RPC
MSCS          CGP                    node regroup, resource failover, switchover
ParaStation   NET, FGP               user-level protocol stack with semaphores

Talk focuses on the coarse grain (CGP) layer.

Page 27: Single System Abstractions for Clusters of Workstations

GLUnix: Characteristics

• Provides special user commands for managing cluster jobs

• Both batch and interactive jobs can be executed remotely

• Supports dynamic load balancing

Page 28: Single System Abstractions for Clusters of Workstations

MOSIX: Preemptive Load Balancing

[Flowchart of the load-balance step:
1. Select candidate process p with maximal impact on local load.
2. If p cannot migrate, return.
3. If no less loaded node exists, return.
4. Select the target node N that minimizes the cost C[N] of running p there.
5. If migrating to N is OK, signal p to consider migration; otherwise return.]
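
The flowchart restated as a sketch; local_processes, cluster_nodes, cost, acceptable, and signal_consider_migration are invented helpers, not MOSIX kernel interfaces.

```python
def load_balance(local_load):
    # Candidate with maximal impact on local load, if it may migrate.
    candidates = [p for p in local_processes() if p.can_migrate]
    if not candidates:
        return
    p = max(candidates, key=lambda proc: proc.load_impact)
    # Does a less loaded node exist at all?
    targets = [n for n in cluster_nodes() if n.load < local_load]
    if not targets:
        return
    # Target node N minimizing the cost C[N] of running p there.
    target = min(targets, key=lambda n: cost(p, n))
    if acceptable(p, target):
        signal_consider_migration(p, target)
```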

Page 29: Single System Abstractions for Clusters of Workstations

xFS: Distributed Log-based File System

[Diagram: a client accumulates dirty data blocks into a log segment; the segment is striped across a stripe group as data stripes plus a parity stripe, so writes are always sequential.]
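
The striping step in the diagram, sketched below. This is not xFS code; it assumes the segment length divides evenly into the data stripes.

```python
def make_stripe_group(segment: bytes, n_data: int = 3):
    # Cut the log segment into n_data data stripes of equal size.
    size = len(segment) // n_data
    data = [segment[i * size:(i + 1) * size] for i in range(n_data)]
    # XOR parity stripe: any single lost stripe can be rebuilt.
    parity = bytearray(size)
    for stripe in data:
        for i, b in enumerate(stripe):
            parity[i] ^= b
    return data, bytes(parity)   # written sequentially across the group
```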

Page 30: Single System Abstractions for Clusters of Workstations

xFS: Virtues and Limitations

+ Exploits aggregate bandwidth of all disks

+ No need to buy expensive RAIDs

+ No single point of failure

– Reliability: relies on accumulating dirty blocks to generate large sequential writes

– Adaptive replication potentially more difficult

Page 31: Single System Abstractions for  Clusters of Workstations

Microsoft Cluster Service (MSCS) GOAL

Off-the-shelf Server Application

Cluster-aware Server Application

Wrapper

Highly Available

Page 32: Single System Abstractions for Clusters of Workstations

MSCS: Abstractions

• Node

• Resource
  + e.g. disks, IP addresses, servers

• Resource dependency
  + e.g. a DBMS depends on the disk holding its data

• Resource group
  + e.g. a server and its IP address

• Quorum resource
  + logs configuration data
  + breaks ties during membership changes
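
An illustrative encoding of these abstractions; the names and dictionary layout are invented (MSCS keeps this information in its configuration database, not in Python).

```python
# A resource group fails over as a unit; dependencies order start-up.
resources = {
    "sql-disk": {"type": "physical disk", "depends_on": []},
    "sql-ip":   {"type": "IP address",    "depends_on": []},
    "sql-srv":  {"type": "server",        "depends_on": ["sql-disk", "sql-ip"]},
}
resource_group = {"name": "SQL group", "members": sorted(resources)}
quorum_resource = {"type": "quorum disk", "role": "logs config, breaks ties"}
```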

Page 33: Single System Abstractions for Clusters of Workstations

MSCS: General Characteristics

• Global state of all nodes and resources consistently replicated across all nodes (write-all using an atomic multicast protocol)

• Node and resource failures detected

• Resources of failed nodes migrated to surviving nodes

• Failed resources restarted

Page 34: Single System Abstractions for  Clusters of Workstations

resource DLLresource DLL

resourceresource

Cluster Service

Resource

Monitor

RPC

RPC

resourcemanager

nodemanager

Node

network

MSCS System Architecture

Page 35: Single System Abstractions for  Clusters of Workstations

regroup

Activate

Closing

Pruning

Cleanup 1

Cleanup 2

end

• determine nodes in its connected component

• determine if its component is the primary• elect new tie-breaker• if node new tie breaker then broadcast component as new membership

• if not in the new membership halt

• install new membership from new tie breaker• acknowledge “ready to commit”

• if own quorum disk, log membership change

MSCS virtually synchronous regroup operation

Page 36: Single System Abstractions for Clusters of Workstations

MSCS: Primary Component Determination Rule

A node is in the primary component if one of the following holds:

• the node is connected to a majority of the previous membership

• the node is connected to half (>= 2) of the previous members, and one of those is the tie-breaker

• the node is isolated, the previous membership had two nodes, and the node owned the quorum resource during the previous membership
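
The rule transcribes directly into code. A sketch with assumed arguments: connected is the set of previous members this node can reach, including itself, and all arguments are sets or values the caller supplies.

```python
def in_primary_component(connected, prev_members, tie_breaker,
                         owned_quorum_before):
    reached = connected & prev_members
    n = len(prev_members)
    if len(reached) > n / 2:                       # majority of previous membership
        return True
    if len(reached) >= 2 and 2 * len(reached) == n \
            and tie_breaker in reached:            # half, including tie-breaker
        return True
    if len(reached) == 1 and n == 2 and owned_quorum_before:
        return True                                # isolated former quorum owner
    return False
```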

Page 37: Single System Abstractions for  Clusters of Workstations

SCSI

nodefailure

SCSI

MSCS switchover

Alternative: Replication

Every disk asingle pointof failure!

Page 38: Single System Abstractions for Clusters of Workstations

Summary of Approaches

System        Transparency            Availability                                              Performance
Berkeley NOW  home node, limited      single point of failure; no fail-over                     load balancing; bottleneck
MOSIX         home node, transparent  masks failures; no fail-over; tolerates partitions        load balancing; low msg overhead
MSCS          server                  single point of failure; low MTTR; tolerates partitions   bottleneck

Page 39: Single System Abstractions for  Clusters of Workstations

LCM layers supported Mechanisms Used

NET, CGP, FGPactive Messages

transparent remote executionMessage passing API

Berkeley NOW

NET, CGP preemptive load balancingkernel-to-kernel RPCMOSIX

CGP cluster membership servicesresource fail-overMSCS

System

NET, FGP user level protocol stacknetwork interface hardwareParaStation

Comparing ApproachesDesign Goals

Page 40: Single System Abstractions for  Clusters of Workstations

centralizedprocesses run to completiononce assigned to processorBerkeley NOW

distributed : probabilisticprocesses brought offline at

source and online at destination

MOSIX

replicated : consistentprocess migrated at any point

during executionMSCS

Approach DescriptionSystem

Comparing ApproachesGlobal Information Management

Page 41: Single System Abstractions for  Clusters of Workstations

detected by master daemontimeouts

failed nodes removed from central configuration DBBerkeley NOW

detected by individual nodestimeouts

failed nodes removed fromreplicated configuration DB

resources restarted/migrated

MOSIX

detected by individual nodesheartbeats

failed nodes removed fromlocal configuration DB

MSCS

Failure detection Recovery actionSystem

master process process pairsBerkeley NOW

none N.A.MOSIX

quorum resourceshared disks

virtual partitions replication algorithm

MSCS

Single Points of Failure Possible solutionSystem

Comparing ApproachesFault-tolerance

Page 42: Single System Abstractions for  Clusters of Workstations

manual sys admin manually assignsprocesses to nodesMSCS

static processes statically assignedto processors

dynamicuses dynamic load

information to assignprocesses to processors

Berkeley NOW

Approach DescriptionSystem

preemptive migrates processes in themiddle of their executionMOSIX

Comparing Approaches Load Balancing

Page 43: Single System Abstractions for  Clusters of Workstations

none processes run to completiononce assigned to processorBerkeley NOW

cooperativeshutdown/restart

processes brought offline at source and online at

destinationMSCS

transparent process migrated at any pointduring executionMOSIX

Process Migration Approach DescriptionSystem

Comparing ApproachesProcess Migration

Page 44: Single System Abstractions for Clusters of Workstations

Example: k = 3, n = 3

[Diagram: a single group x with copies on all three nodes.]

Each letter (e.g. x above) represents a group of objects with copies in the same subset of nodes.

Page 45: Single System Abstractions for Clusters of Workstations

[Taxonomy of redundancy techniques:

redundancy
  error-correcting codes .............. RAID, xFS
  replication
    fail-over/failback (switch-over) .. MSCS
    primary copy ...................... HARP
    voting (quorum consensus)
      voting w/ views (virtual partitions)]