1

CSE 513 Introduction to Operating Systems

Class 9 - Distributed and Multiprocessor Operating Systems

Jonathan Walpole, Dept. of Comp. Sci. and Eng.

Oregon Health and Science University

2

Why use parallel or distributed systems?

- Speed - reduce time to answer
- Scale - increase size of problem
- Reliability - increase resilience to errors
- Communication - span geographical distance

3

Overview

- Multiprocessor systems
- Multi-computer systems
- Distributed systems

Multiprocessor, multi-computer and distributed architectures

- shared memory multiprocessor
- message passing multi-computer (cluster)
- wide area distributed system

Multiprocessor Systems

6

Multiprocessor systems

- Definition:
  - A computer system in which two or more CPUs share full access to a common RAM
- Hardware implements shared memory among CPUs
- Architecture determines whether access times to different memory regions are the same
  - UMA - uniform memory access
  - NUMA - non-uniform memory access

7

Bus-based UMA and NUMA architectures

Bus becomes the bottleneck as number of CPUs increases

8

Crossbar switch-based UMA architecture

Interconnect cost increases as square of number of CPUs

9

Multiprocessors with 2x2 switches

10

Omega switching network from 2x2 switches

Interconnect suffers contention, but costs less

11

NUMA multiprocessors

- Single address space visible to all CPUs
- Access to remote memory via LOAD and STORE instructions
- Access to remote memory slower than to local memory
- Compilers and OS need to be careful about data placement

12

Directory-based NUMA multiprocessors

(a) 256-node directory-based multiprocessor
(b) Fields of a 32-bit memory address
(c) Directory at node 36

13

Operating systems for multiprocessors

- OS structuring approaches
  - Private OS per CPU
  - Master-slave architecture
  - Symmetric multiprocessing architecture
- New problems
  - multiprocessor synchronization
  - multiprocessor scheduling

14

The private OS approach

- Implications of the private OS approach
  - shared I/O devices
  - static memory allocation
  - no data sharing
  - no parallel applications

15

The master-slave approach

- OS only runs on the master CPU
  - Single kernel lock protects OS data structures
  - Slaves trap system calls and place the process on the scheduling queue for the master
- Parallel applications supported
  - Memory shared among all CPUs
- Single CPU for all OS calls becomes a bottleneck

16

Symmetric multiprocessing (SMP)

- OS runs on all CPUs
  - Multiple CPUs can be executing the OS simultaneously
  - Access to OS data structures requires synchronization
  - Fine-grain critical sections lead to more locks and more parallelism … and more potential for deadlock

17

Multiprocessor synchronization

- Why is it different from single-processor synchronization?
  - Disabling interrupts does not prevent memory accesses, since it only affects “this” CPU
  - Multiple copies of the same data exist in the caches of different CPUs
    - atomic lock instructions do CPU-CPU communication
  - Spinning to wait for a lock is not always a bad idea

18

Synchronization problems in SMPs

TSL instruction is non-trivial on SMPs
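To make the point concrete, the sketch below shows a minimal spinlock built on an atomic test-and-set primitive, using C11 atomics as a stand-in for the hardware TSL instruction. The names and structure are illustrative, not the lecture's own code; the essential property is that the test-and-set is a single atomic read-modify-write that the hardware must make exclusive across all CPUs (for example by locking the bus or the cache line).

    #include <stdatomic.h>

    /* One lock word shared by all CPUs/threads. */
    typedef struct { atomic_flag locked; } spinlock_t;

    #define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

    static void spin_lock(spinlock_t *l) {
        /* test-and-set: atomically set the flag and return its old value.
           Keep retrying until the old value was "clear" (lock was free). */
        while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
            ;   /* spin */
    }

    static void spin_unlock(spinlock_t *l) {
        atomic_flag_clear_explicit(&l->locked, memory_order_release);
    }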

19

Avoiding cache thrashing during spinning

Multiple locks used to avoid cache thrashing
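The figure's approach gives each waiting CPU its own lock word to spin on. A simpler, related trick with the same motivation is test-and-test-and-set: spin on an ordinary read (which is satisfied from the local cache) and attempt the atomic test-and-set only when the lock appears free, so the interconnect is not hammered with read-modify-write cycles. A hedged sketch, independent of the earlier atomic_flag version:

    #include <stdatomic.h>

    typedef struct { atomic_int locked; } tts_lock_t;   /* 0 = free, 1 = held */

    static void tts_lock(tts_lock_t *l) {
        for (;;) {
            /* Inner loop: a plain read that hits the locally cached copy and
               generates no coherence traffic while the lock stays held. */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;
            /* Lock looks free: now do the expensive atomic read-modify-write. */
            if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
                return;
        }
    }

    static void tts_unlock(tts_lock_t *l) {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }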

20

Spinning versus switching

- In some cases the CPU “must” wait
  - the scheduling critical section may be held
- In other cases spinning may be more efficient than blocking
  - spinning wastes CPU cycles
  - switching uses up CPU cycles also
  - if critical sections are short, spinning may be better than blocking
  - static analysis of critical section duration can determine whether to spin or block
  - dynamic analysis can improve performance (a spin-then-block sketch follows)
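One simple dynamic policy is spin-then-block: spin for a bounded number of attempts (cheap if the critical section is short) and fall back to a blocking lock if the owner does not release it in time. A rough sketch using POSIX threads; the bound is an illustrative tuning knob, and glibc's adaptive mutex type implements a similar idea.

    #include <pthread.h>

    #define SPIN_TRIES 1000   /* illustrative bound; real systems tune or adapt it */

    /* Spin briefly in user space, then block in the kernel if the critical
       section turns out to be long. */
    static void spin_then_block_lock(pthread_mutex_t *m) {
        for (int i = 0; i < SPIN_TRIES; i++) {
            if (pthread_mutex_trylock(m) == 0)
                return;                 /* acquired while spinning */
        }
        pthread_mutex_lock(m);          /* give up spinning and block */
    }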

21

Multiprocessor scheduling

- Two-dimensional scheduling decision
  - time (which process to run next)
  - space (which processor to run it on)
- Time sharing approach
  - single scheduling queue shared across all CPUs
- Space sharing approach
  - partition machine into sub-clusters

22

Time sharing

- Single data structure used for scheduling
- Problem - scheduling frequency influences inter-thread communication time

23

Interplay between scheduling and IPC

- Problem with communication between two threads
  - both belong to process A
  - both running out of phase

24

Space sharing

- Groups of cooperating threads can communicate at the same time
  - fast inter-thread communication time

25

Gang scheduling

- Problem with pure space sharing
  - Some partitions are idle while others are overloaded
- Can we combine time sharing and space sharing and avoid introducing scheduling delay into IPC?
- Solution: gang scheduling
  - Groups of related threads scheduled as a unit (gang)
  - All members of a gang run simultaneously on different timeshared CPUs
  - All gang members start and end time slices together

26

Gang scheduling

Multi-computer Systems

28

Multi-computers

- Also known as
  - cluster computers
  - clusters of workstations (COWs)
- Definition: Tightly-coupled CPUs that do not share memory

29

Multi-computer interconnection topologies

(a) single switch
(b) ring
(c) grid
(d) double torus
(e) cube
(f) hypercube

30

Store & forward packet switching

31

Network interfaces in a multi-computer

- Network co-processors may off-load communication processing from the main CPU

32

OS issues for multi-computers

- Message passing performance
- Programming model
  - synchronous vs asynchronous message passing
  - distributed virtual memory
- Load balancing and coordinated scheduling

33

Optimizing message passing performance

- Parallel application performance is dominated by communication costs
  - interrupt handling, context switching, message copying …
- Solution - get the OS out of the loop
  - map the interface board into all processes that need it
  - active messages - give the interrupt handler the address of the user buffer
  - sacrifice protection for performance?

34

CPU / network card coordination

- How to maximize independence between the CPU and the network card while sending/receiving messages?
  - Use send & receive rings and bit-maps
  - one side always sets bits, the other always clears them (see the ring sketch below)
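The sketch below shows the CPU side of such a transmit ring. Each slot carries an ownership flag that plays the role of the bit-map entry: the CPU only ever sets it (after filling the slot) and the network card only ever clears it (after sending), so the two sides never need a shared lock. All names, sizes, and the flag layout are illustrative assumptions.

    #include <stdatomic.h>
    #include <stddef.h>
    #include <string.h>

    #define RING_SLOTS 64
    #define SLOT_BYTES 1536

    struct tx_slot {
        atomic_int owned_by_nic;   /* 0: CPU may fill, 1: NIC may transmit */
        size_t     len;
        char       buf[SLOT_BYTES];
    };

    struct tx_ring {
        struct tx_slot slot[RING_SLOTS];
        unsigned next;             /* CPU's private producer index */
    };

    /* Returns 0 on success, -1 if the message is too big or the ring is full. */
    static int tx_enqueue(struct tx_ring *r, const void *msg, size_t len) {
        struct tx_slot *s = &r->slot[r->next % RING_SLOTS];
        if (len > SLOT_BYTES)
            return -1;
        if (atomic_load_explicit(&s->owned_by_nic, memory_order_acquire))
            return -1;             /* NIC has not consumed this slot yet */
        memcpy(s->buf, msg, len);
        s->len = len;
        /* Hand the slot to the NIC: the CPU sets the bit, the NIC clears it. */
        atomic_store_explicit(&s->owned_by_nic, 1, memory_order_release);
        r->next++;
        return 0;
    }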

35

Blocking vs non-blocking send calls

- Minimum services provided
  - send and receive commands
- These can be blocking (synchronous) or non-blocking (asynchronous) calls

(a) Blocking send call

(b) Non-blocking send call
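As a concrete analogue from the sockets API: send() on a blocking socket normally sleeps until the kernel has accepted the data, while the same call with MSG_DONTWAIT (or on a socket marked O_NONBLOCK) returns immediately, letting the caller overlap computation with communication. A small sketch, assuming an already-connected descriptor fd:

    #include <errno.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    /* Blocking send: the caller is suspended until the kernel accepts the data
       (or an error occurs). */
    static ssize_t send_blocking(int fd, const void *buf, size_t len) {
        return send(fd, buf, len, 0);
    }

    /* Non-blocking send: instead of sleeping when the socket buffer is full,
       the call fails with EAGAIN/EWOULDBLOCK and the caller can do other work
       (and retry later, e.g. after poll() says the socket is writable). */
    static ssize_t send_nonblocking(int fd, const void *buf, size_t len) {
        ssize_t n = send(fd, buf, len, MSG_DONTWAIT);
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return 0;   /* nothing sent yet */
        return n;
    }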

36

Blocking vs non-blocking calls

- Advantages of non-blocking calls
  - ability to overlap computation and communication improves performance
- Advantages of blocking calls
  - simpler programming model

37

Remote procedure call (RPC)

- Goal
  - support execution of remote procedures
  - make remote procedure execution indistinguishable from local procedure execution
  - allow distributed programming without changing the programming model

38

Remote procedure call (RPC)

- Steps in making a remote procedure call
  - client and server stubs are proxies (a minimal hand-written client stub is sketched below)
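To make the proxy idea concrete, here is a hedged sketch of what a hand-written client stub for a remote int add(int, int) might do: marshal the arguments into a request message, send it, block for the reply, and unmarshal the result. The wire format, procedure number, and the connected socket rpc_fd are illustrative assumptions, not the lecture's own example.

    #include <arpa/inet.h>    /* htonl/ntohl: machine-independent wire format */
    #include <stdint.h>
    #include <sys/socket.h>

    #define RPC_PROC_ADD 1    /* illustrative procedure number */

    struct rpc_request { uint32_t proc, arg1, arg2; };
    struct rpc_reply   { uint32_t status, result; };

    /* Client stub: looks like a local call, but ships the work to the server. */
    static int rpc_add(int rpc_fd, int a, int b) {
        struct rpc_request req = {
            .proc = htonl(RPC_PROC_ADD),
            .arg1 = htonl((uint32_t)a),          /* marshal the parameters */
            .arg2 = htonl((uint32_t)b),
        };
        struct rpc_reply rep;

        send(rpc_fd, &req, sizeof req, 0);             /* send request message */
        recv(rpc_fd, &rep, sizeof rep, MSG_WAITALL);   /* block for the reply  */
        return (int)ntohl(rep.result);                 /* unmarshal the result */
    }

In a real stub the binding (here the rpc_fd parameter) would be hidden so that the signature matches the local procedure exactly.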

39

RPC implementation issues

- Cannot pass pointers
  - call by reference becomes copy-restore (at best)
- Weakly typed languages
  - Client stub cannot determine size of reference parameters
  - Not always possible to determine parameter types
- Cannot use global variables
  - may get moved (replicated) to remote machine
- Basic problem - local procedure call relies on shared memory

40

Distributed shared memory (DSM)

- Goal
  - use software to create the illusion of shared memory on top of message passing hardware
  - leverage virtual memory hardware to page fault on non-resident pages
  - service page faults from remote memories instead of from local disk (a user-level sketch follows)
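A user-level approximation of this idea maps the shared region with no access permissions and handles the resulting SIGSEGV: the handler enables access to the faulting page and fills it from a remote node. The sketch below shows only the fault-handling skeleton; fetch_page_from_owner() is a hypothetical placeholder for the message exchange, and a real DSM would also have to track ownership, detect writes, and worry about signal-safety.

    #include <signal.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical: ask the page's current owner for its contents. */
    extern void fetch_page_from_owner(void *page, size_t page_size);

    static void dsm_fault(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)ctx;
        long psz = sysconf(_SC_PAGESIZE);
        char *page = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(psz - 1));

        /* Make the page accessible locally, then fill it from the remote copy. */
        mprotect(page, (size_t)psz, PROT_READ | PROT_WRITE);
        fetch_page_from_owner(page, (size_t)psz);
    }

    /* Map the DSM region with PROT_NONE so every first touch page-faults. */
    static void *dsm_init(size_t len) {
        void *base = mmap(NULL, len, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = dsm_fault;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
        return base;
    }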

41

Distributed shared memory (DSM)

- DSM at the hardware, OS or middleware layer

42

Page replication in DSM systems

Replication

(a) Pages distributed on 4 machines

(b) CPU 0 reads page 10

(c) CPU 1 reads page 10

43

Consistency and false sharing in DSM

44

Strong memory consistency

[Figure: writes W1-W4 issued by processors P1-P4 and reads R1, R2, merged into a single total order]

- Total order enforces sequential consistency
  - intuitively simple for programmers, but very costly to implement
  - not even implemented in non-distributed machines!

45

Scheduling in multi-computer systems

- Each computer has its own OS
  - local scheduling applies
- Which computer should we allocate a task to initially?
  - Decision can be based on load (load balancing)
  - load balancing can be static or dynamic

46

Graph-theoretic load balancing approach


- Two ways of allocating 9 processes to 3 nodes
- Total network traffic is the sum of the arcs cut by node boundaries (computed in the sketch below)
- The second partitioning is better
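The quantity being compared can be computed directly: given the process-communication graph with per-arc traffic weights and an assignment of processes to nodes, sum the weights of the arcs whose endpoints land on different nodes. A small sketch (the graph representation is an assumption):

    #include <stddef.h>

    /* One communication arc between two processes, weighted by traffic. */
    struct arc { int p1, p2, weight; };

    /* node_of[p] is the node that process p is assigned to.
       Returns the total inter-node traffic for this assignment: the sum of
       the weights of all arcs cut by node boundaries. */
    static int cut_traffic(const struct arc *arcs, size_t narcs, const int *node_of) {
        int total = 0;
        for (size_t i = 0; i < narcs; i++)
            if (node_of[arcs[i].p1] != node_of[arcs[i].p2])
                total += arcs[i].weight;
        return total;
    }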

47

Sender-initiated load balancing

- Overloaded nodes (senders) off-load work to underloaded nodes (receivers)

48

Receiver-initiated load balancing

- Underloaded nodes (receivers) request work from overloaded nodes (senders)

Distributed Systems

50

Distributed systems

- Definition: Loosely-coupled CPUs that do not share memory
  - where is the boundary between tightly-coupled and loosely-coupled systems?
- Other differences
  - single vs multiple administrative domains
  - geographic distribution
  - homogeneity vs heterogeneity of hardware and software

51

Comparing multiprocessors, multi-computers and distributed systems

52

Ethernet as an interconnect


- Bus-based vs switched Ethernet

53

The Internet as an interconnect

54

OS issues for distributed systems

- Common interfaces above heterogeneous systems
  - Communication protocols
  - Distributed system middleware
- Choosing suitable abstractions for distributed system interfaces
  - distributed document-based systems
  - distributed file systems
  - distributed object systems

55

Network service and protocol types

56

Protocol interaction and layering

57

Homogeneity via middleware

58

Distributed system middleware models

- Document-based systems
- File-based systems
- Object-based systems

59

Document-based middleware - WWW

60

Document-based middleware

How the browser gets a page (a bare-bones C version follows)
- Asks DNS for IP address
- DNS replies with IP address
- Browser makes connection
- Sends request for specified page
- Server sends file
- TCP connection released
- Browser displays text
- Browser fetches, displays images
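The first five steps map almost one-to-one onto a few socket calls. A bare-bones sketch in C (HTTP/1.0, virtually no error handling; the host and path are placeholders):

    #include <netdb.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        /* Ask DNS for the server's IP address. */
        struct addrinfo hints = { .ai_socktype = SOCK_STREAM }, *res;
        if (getaddrinfo("example.com", "80", &hints, &res) != 0)
            return 1;

        /* Make a TCP connection to the server. */
        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        connect(fd, res->ai_addr, res->ai_addrlen);

        /* Send the request for the specified page. */
        const char *req = "GET /index.html HTTP/1.0\r\nHost: example.com\r\n\r\n";
        send(fd, req, strlen(req), 0);

        /* Server sends the file; print it as it arrives. */
        char buf[4096];
        ssize_t n;
        while ((n = recv(fd, buf, sizeof buf, 0)) > 0)
            fwrite(buf, 1, (size_t)n, stdout);

        close(fd);                 /* TCP connection released */
        freeaddrinfo(res);
        return 0;
    }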

61

File-based middleware

- Design issues
  - Naming and name resolution
  - Architecture and interfaces
  - Caching strategies and cache consistency
  - File sharing semantics
  - Disconnected operation and fault tolerance

62

Naming

(b) Clients with the same view of the name space
(c) Clients with different views of the name space

63

Naming and transparency issues

- Can clients distinguish between local and remote files?
- Location transparency
  - the file name does not reveal the file's physical storage location
- Location independence
  - the file name does not need to be changed when the file's physical storage location changes

64

Global vs local name spaces

- Global name space
  - file names are globally unique
  - any file can be named from any node
- Local name spaces
  - remote files must be inserted in the local name space
  - file names are only meaningful within the calling node
  - but how do you refer to remote files in order to insert them?
    - globally unique file handles can be used to map remote files to local names

65

Building a name space with super-root

- Super-root / machine name approach
  - concatenate the host name to the names of files stored on that host
  - system-wide uniqueness guaranteed
  - simple to locate a file
  - not location transparent or location independent

66

Building a name space using mounting

- Mounting remote file systems
  - exported remote directory is imported and mounted onto a local directory
  - accesses require a globally unique file handle for the remote directory
  - once mounted, file names are location-transparent
    - location can be captured via naming conventions
  - are they location independent?
    - location of file vs location of client?
    - files have different names from different places

67

Local name spaces with mounting

- Mounting (part of) a remote file system in NFS.

68

Nested mounting on multiple servers

69

NFS name space

- Server exports a directory
- mountd: provides a unique file handle for the exported directory
- Client uses RPC to issue an nfs_mount request to the server
- mountd receives the request and checks whether
  - the pathname is a directory
  - the directory is exported to this client

70

NFS file handles

- V-node contains
  - reference to a file handle for mounted remote files
  - reference to an i-node for local files
- File handle uniquely names a remote directory
  - file system identifier: unique number for each file system (in the UNIX super block)
  - i-node number and i-node generation number

[Figure: a v-node points to an i-node (local file) or to a file handle (remote file); the file handle contains the file system identifier, the i-node number, and the i-node generation number]

71

Mounting on-demand

- Need to decide where and when to mount remote directories
- Where? - Can be based on conventions to standardize local name spaces (e.g., /home/username for user home directories)
- When? - boot time, login time, access time, …?
- What to mount when?
  - How long does it take to mount everything?
  - Do we know what everything is?
  - Can we do mounting on-demand?
- An automounter is a client-side process that handles on-demand mounting
  - it intercepts requests and acts like a local NFS server

72

Distributed file system architectures

- Server side
  - how do servers export files?
  - how do servers handle requests from clients?
- Client side
  - how do applications access a remote file in the same way as a local file?
- Communication layer
  - how do clients and servers communicate?

73

Local access architectures

- Local access approach
  - move file to client
  - local access on client
  - return file to server
  - "data shipping" approach

74

Remote access architectures

- Remote access approach
  - leave file on server
  - send read/write operations to server
  - return results to client
  - "function shipping" approach

75

File-level interface

- Accesses can be supported at either file granularity or block granularity
- File-level client-server interface
  - local access model with whole-file movement and caching
  - remote access model with a client-server interface at the system call level
  - client performs remote open, read, write, close calls

76

Block-level interface

- Block-level client-server interface
  - client-server interface at the file system or disk block level
  - server offers a virtual disk interface
  - client file accesses generate block access requests to the server
  - block-level caching of parts of files on the client

77

NFS architecture

- The basic NFS architecture for UNIX systems.

78

NFS server side

- mountd
  - server exports a directory via mountd
  - mountd provides the initial file handle for the exported directory
  - client issues an nfs_mount request via RPC to mountd
  - mountd checks if the pathname is a directory and if the directory is exported to the client
- nfsd: services NFS RPC calls, gets the data from its local file system, and replies to the RPC
  - Usually listening at port 2049
- Both mountd and nfsd use RPC

79

Communication layer: NFS RPC Calls

- NFS / RPC uses XDR and TCP/IP
- fhandle: 64-byte opaque data (in NFS v3)
  - what’s in the file handle?

  Proc.    Input args                     Results
  lookup   dirfh, name                    status, fhandle, fattr
  read     fhandle, offset, count         status, fattr, data
  create   dirfh, name, fattr             status, fhandle, fattr
  write    fhandle, offset, count, data   status, fattr

80

NFS file handles

- V-node contains
  - reference to a file handle for mounted remote files
  - reference to an i-node for local files
- File handle uniquely names a remote directory
  - file system identifier: unique number for each file system (in the UNIX super block)
  - i-node number and i-node generation number

[Figure: a v-node points to an i-node (local file) or to a file handle (remote file); the file handle contains the file system identifier, the i-node number, and the i-node generation number]

81

NFS client side

- Accessing remote files in the same way as accessing local files requires kernel support
  - the v-node interface (a C sketch of the idea follows)

[Figure: read(fd, ...) goes through the process file table to a struct file (mode, vnode, offset); the vnode's v_data and fs_op fields lead to a struct vnode operations table with open, close, read, write, lookup, ...]
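The figure's struct vnode is essentially an object with a per-file-system operations table, which is what lets read(fd, ...) be dispatched either to the local file system or to the NFS client code. A hedged C sketch of the dispatch (field and function names are simplified relative to a real kernel):

    #include <stddef.h>
    #include <sys/types.h>

    struct vnode;

    /* Per-file-system operations table: the local FS and NFS each supply one. */
    struct vnode_ops {
        int     (*open)(struct vnode *vn);
        int     (*close)(struct vnode *vn);
        ssize_t (*read)(struct vnode *vn, void *buf, size_t len, off_t off);
        ssize_t (*write)(struct vnode *vn, const void *buf, size_t len, off_t off);
        int     (*lookup)(struct vnode *dir, const char *name, struct vnode **out);
    };

    struct vnode {
        const struct vnode_ops *fs_op;   /* which file system implements this file */
        void                   *v_data;  /* i-node for local files, file handle for NFS */
    };

    /* Open-file table entry, as in the figure: mode, offset, and the vnode. */
    struct open_file {
        int           mode;
        off_t         offset;
        struct vnode *vn;
    };

    /* read() just forwards to whichever file system owns the vnode. */
    static ssize_t vfs_read(struct open_file *f, void *buf, size_t len) {
        ssize_t n = f->vn->fs_op->read(f->vn, buf, len, f->offset);
        if (n > 0)
            f->offset += n;
        return n;
    }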

82

Caching vs pure remote service

- Network traffic?
  - caching reduces remote accesses ⇒ reduces network traffic
  - caching generates fewer, larger data transfers
- Server load?
  - caching reduces remote accesses ⇒ reduces server load
- Server disk throughput?
  - optimized better for large requests than for random disk blocks
- Data integrity?
  - cache-consistency problem due to frequent writes
- Operating system complexity?
  - simpler for remote service

83

Four places to cache files

- Server’s disk: slow performance
- Server’s memory
  - cache management: how much to cache, replacement strategy
  - still slow due to network delay
- Client’s disk
  - access speed vs server memory?
  - large files can be cached
  - supports disconnected operation
- Client’s memory
  - fastest access
  - can be used by diskless workstations
  - competes with the VM system for physical memory space

84

Cache consistency

- Reflecting changes made to a local cache in the master copy
- Reflecting changes made to the master copy in the local caches

[Figure: a write to Copy 1 is propagated to the master copy, which then updates or invalidates Copy 2]

85

Common update algorithms for client caching

- Write-through: all writes are carried out immediately
  - Reliable: little information is lost in the event of a client crash
  - Slow: cache not useful for writes
- Delayed-write: writes do not immediately propagate to the server
  - batching writes amortizes overhead
  - wait for blocks to fill
  - if data is written and then deleted immediately, it need not be written at all (20-30% of new data is deleted within 30 secs)
- Write-on-close: delay writing until the file is closed at the client
  - semantically meaningful delayed-write policy
  - if the file is open for a short duration, works fine
  - if the file is open for long, susceptible to losing data in the event of a client crash

86

Cache coherence

- How to keep locally cached data up to date / consistent?
- Client-initiated approach
  - check validity on every access: too much overhead
  - check on first access to a file (e.g., file open)
  - check at every fixed time interval
- Server-initiated approach
  - server records, for each client, the (parts of) files it caches
  - server responds to updates by propagation or invalidation
- Disallow caching during concurrent-write or read/write sharing
  - allow multiple clients to cache the file for read-only access
  - flush all client caches when the file is opened for writing

87

NFS – server caching

- Reads
  - use the local file system cache
  - prefetching in UNIX using read-ahead
- Writes
  - write-through (synchronously, no cache)
  - commit on close (standard behaviour in v4)

88

NFS – client caching (reads)

- Clients are responsible for validating cache entries (stateless server)
- Validation by checking the last modification time
  - time stamps are issued by the server
  - automatic validation on open (with the server??)
- A cache entry is considered valid if one of the following is true (sketched in code below):
  - the cache entry is less than t seconds old (3-30 s for files, 30-60 s for directories)
  - the modified time at the server is the same as the modified time on the client
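The freshness rule translates almost directly into code. A hedged sketch of the client-side check; the structure fields and how the server's modification time is obtained are illustrative assumptions:

    #include <stdbool.h>
    #include <time.h>

    /* What the client remembers about a cached file (illustrative fields). */
    struct cache_entry {
        time_t validated_at;   /* when the entry was last checked with the server */
        time_t client_mtime;   /* modification time recorded when the file was cached */
    };

    /* Valid if validated within the last t seconds, or if the server's
       modification time still matches the one we cached. */
    static bool cache_entry_valid(const struct cache_entry *e,
                                  time_t server_mtime, time_t t) {
        time_t now = time(NULL);
        if (now - e->validated_at < t)
            return true;
        return server_mtime == e->client_mtime;
    }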

89

NFS – client caching (writes)

- Delayed writes
  - modified files are marked dirty and flushed to the server on close (or sync)
- Bio-daemons (block input-output)
  - read-ahead requests are done asynchronously
  - write requests are submitted when a block is filled

90

File sharing semantics

- Semantics of file sharing
  - (a) a single processor gives sequential consistency
  - (b) a distributed system may return an obsolete value

91

Consistency semantics for file sharing

- What value do reads see after writes?
- UNIX semantics
  - the value read is the value stored by the last write
  - writes to an open file are visible immediately to others that have the file open
  - easy to implement with one server and no caching
- Session semantics
  - writes to an open file are not visible immediately to others that already have the file open
  - changes become visible on close to sessions started later
- Immutable-shared-files semantics - simple to implement
  - a sharable file cannot be modified
  - file names cannot be reused and file contents may not be altered
- Transactions
  - all changes have the all-or-nothing property
  - W1,R1,R2,W2 is not allowed where P1 = W1;W2 and P2 = R1;R2

92

NFS – file sharing semantics

- Not UNIX semantics!
- Unspecified in the NFS standard
- Not clear because of timing dependencies
- Consistency issues can arise
  - Example: Jack and Jill both have a file cached. Jack opens the file and modifies it, then he closes the file. Jill then opens the file (before t seconds have elapsed) and modifies it as well. Then she closes the file. Are both Jack’s and Jill’s modifications present in the file? What if Jack closes the file after Jill opens it?
- Locking is part of v4 (byte range, leasing)