Joonwon Lee [email protected] Distributed System. 2 Distributed System (DS) –consists of a...

Joonwon [email protected]

Distributed System

2

Distributed System

• Distributed System (DS)

– consists of a collection of autonomous computers linked by a computer network and equipped with distributed system software.

• DS software

– enables computers to coordinate their activities and to share the resources of the system, i.e., hardware, software and data.

• Users of a DS should perceive a single, integrated computing facility even though it may be implemented by many computers in different locations.

3

Characteristics of Distributed Systems

• The following characteristics are primarily responsible for the usefulness of distributed systems: resource sharing, openness, concurrency, scalability, fault tolerance, transparency

– They are not automatic consequences of distribution; system and application software must be carefully designed

4

DESIGN GOALS

• Key design goals

– Performance, Reliability, Consistency, Scalability, Security

• Basic design issues

– Naming

– Communication: optimize the implementation while retaining a high level programming model

– Software structure: structure a system so that new services can be introduced that will interwork fully with existing services

– Workload allocation: deploy the processing, communication and resources for optimum effect in the processing of changing workload

– Consistency maintenance: the maintenance of consistency at reasonable cost

5

Naming

• Distributed systems are based on the sharing of resources and on the transparency of resource distribution

• Names assigned to resources must

– have global meanings that are independent of location

– be supported by a name interpretation system that can translate names to enable programs to access the resources

• Design issue

– design a naming scheme that will scale, and translate names efficiently to meet appropriate performance goals

6

Communication

• Communication between a pair of processes involves:

– transfer of data from the sending process to the receiving process

– synchronization of the receiving process with the sending process, which may be required

• Programming Primitives

• Communication Structure

– Client-Server

– Group Communication

7

Software Structure

• Addition of a new service should be easy

Figure: The main categories of software in a distributed system (applications; open services; distributed programming support; operating system kernel services; computer and network hardware)

8

Workload Allocation

• How is work allocated amongst resources in a DS?

• Workstation-Server Model

– ‘putting the processor cycles near the user’ – good for interactive applications

– capacity of workstation determines the size of largest task that can be performed on behalf of the user

– does not optimize the use of processing and memory resources

– a single user with a large computing task is not able to obtain additional resources

• Some modifications of the workstation-server model

– processor pool model, shared memory multiprocessor

9

Processor Pool Model

• Processor pool model

– allocate processors dynamically to users

– a processor pool usually consists of a collection of low-cost computers

– each processor in a pool has an independent network connection

– processors do not have to be homogeneous

– processors are allocated to processes for their lifetime

• Users

– use a simple computer or X-terminal

– a user’s work can be performed partly or entirely on the pool processors

• examples: Amoeba, Clouds, Plan 9

10

Use of Idle Workstations

• A significant proportion of workstations on a network may be unused or be used for lightweight activities (at some times, especially overnight)

– The idle workstations can be used to run jobs for users who are logged on at other stations and do not have sufficient capacity at their machine

• In Sprite OS

– the target workstation is chosen transparently by the system

– includes a facility for process migration

• NOW (Networks of Workstations)

– MPP is expensive and workstations are not

– the network is getting faster than any other component

– for what?

• network RAM, cooperative file caching, software RAID, parallel computing, etc.

11

Consistency Maintenance

• Update Consistency

– Arises when several processes access and update data concurrently

– changing a data value cannot be performed instantaneously

– desired effect

• the update looks atomic: a related set of changes made by a given process should appear to all other processes as if it was done instantaneously

• Significant because

– many processes share data

– operation of system itself depends on the consistency of file directories managed by file services, naming databases etc

12

Consistency Maintenance (cont’d)

• Replication Consistency

– motivations of data replication

• increased availability and performance

– if data have been copied to several computers and subsequently modified at one or more of them,

• the possibility of inconsistencies arises between the values of data items at different computers

13

Consistency Maintenance (cont’d)

• Cache Consistency

– caching vs replication

– same consistency problem as replication

– examples

• multiprocessor caches

• file caches

• cluster web server

14

User Requirements

• Functionality

– What the system should do for users

• Quality of Service

– issues of performance, reliability and security

• Reconfigurability

– accommodate changes without causing disruption to existing services

15

Distributed File System

1. Introduction

2. The SUN Network File System

3. The Andrew File System

4. The Coda File System

5. The xFS

16

Introduction

• Three practical implementations.

– Sun Network File System

– Andrew File System

– Coda File System

– These systems aim to emulate the UNIX file system interface

• Emulation of a UNIX file system interface

– caching of file data in client computers is an essential design feature, but the conventional UNIX file system offers one-copy update semantics

• one-copy update semantics: file contents seen by all of the concurrent processes are those that they would see if only a single copy of the file contents existed

– These three implementations allow some deviation from one-copy semantics

• the one-copy model has not been strictly adhered to

17

Server Structure

• Connectionless

• Connection-Oriented

• Iterative Server

• Concurrent Server

18

Stateful Server

Figure: client A issues fopen(...) and fread(fp, nbytes) and receives data; the server holds the file descriptor for client A in front of the file system, and the file position is updated at the server.

19

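Stateless Server

Figure: client A holds its own file descriptor and issues fopen(fp, read), fread(., position, .) and fclose(fp); each read carries the position, the server returns data from the file system, and the file position is updated at the client.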
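The difference between the two server styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `StatefulServer`, `StatelessServer` and `StatelessClient`, the in-memory `files` dictionary and the method signatures are hypothetical and do not come from the slides or from any real file server.

```python
class StatefulServer:
    """Hypothetical server that remembers a file position per open handle."""

    def __init__(self, files):
        self.files = files        # name -> bytes
        self.open_files = {}      # handle -> [name, position]
        self.next_handle = 0

    def open(self, name):
        handle = self.next_handle
        self.next_handle += 1
        self.open_files[handle] = [name, 0]
        return handle

    def read(self, handle, nbytes):
        name, pos = self.open_files[handle]
        data = self.files[name][pos:pos + nbytes]
        # The file position is updated here, at the server.
        self.open_files[handle][1] = pos + len(data)
        return data


class StatelessServer:
    """Hypothetical server that keeps no per-client state."""

    def __init__(self, files):
        self.files = files        # name -> bytes

    def read(self, name, position, nbytes):
        # Every request carries all the information needed to serve it.
        return self.files[name][position:position + nbytes]


class StatelessClient:
    """Client side of the stateless protocol: it tracks its own position."""

    def __init__(self, server, name):
        self.server = server
        self.name = name
        self.position = 0

    def read(self, nbytes):
        data = self.server.read(self.name, self.position, nbytes)
        # The file position is updated here, at the client.
        self.position += len(data)
        return data
```

With the stateless server, repeating the same request returns the same bytes, so a retransmitted read is harmless; with the stateful server, a duplicated read advances the position a second time.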

20

The Sun NFS

• provides transparent access to remote files for client programs

• each computer has client and server modules in its kernel

– the client and server relationship is symmetric

• each computer in an NFS can act as both a client and a server

• larger installations may be configured as dedicated servers

• available for almost every major system

21

The Sun NFS (cont’d)

• Design goals with respect to transparency

• Access transparency

– The API is identical to the local OS’s interface. Thus, in a UNIX client, no modifications to existing programs are required for accesses to remote files.

• Location transparency

– each client establishes a file name space by adding remote file systems to its local name space (mount)

– NFS does not enforce a single network-wide file name space.

• each client may see a different name space

22

The Sun NFS (cont’d)

• Failure transparency

– NFS server is stateless and most file access operations are idempotent

– UNIX file operations are translated to NFS operations by an NFS client module

– Stateless and idempotent nature of NFS ensures that failure semantics for remote file access are similar to those for local file access

• Performance transparency

– both the client and server employ caching to achieve satisfactory performance

– For clients, the maintenance of cache coherence is somewhat complex, because several clients may be using and updating the same file

23

The Sun NFS (cont’d)

• Migration transparency

– Mount service

• establish the file name space in client computers

– file systems may be moved between servers, but the remote mount tables in each client must then be separately updated to enable the clients to access the file system in its new location

• migration transparency is not fully achieved by NFS

– Automounter

• runs in each NFS client and enables pathnames to be used that refer to unmounted file systems

24

The Sun NFS (cont’d)

• Replication transparency

– NFS does not support file replication in a general sense

• Concurrency transparency

– UNIX supports only rudimentary locking facilities for concurrency control

– NFS does not aim to improve upon the UNIX approach to the control of concurrent updates to files

25

The Sun NFS (cont’d)

• Scalability

– Scalability of the NFS is limited.

• Due to the lack of replication

– The number of clients that can simultaneously access a shared file is restricted by the performance of the server that holds the file.

• can become a system-wide performance bottleneck for heavily-used files.

26

Implementation of NFS

– User-level client process: process using NFS

– NFS client and server modules communicate using remote procedure calls.

27

The Andrew File System

• Andrew

– a distributed computing environment developed at CMU

• Andrew File System (AFS)

– reflects an intention to support information-sharing on a large scale

– provides transparent access to remote shared files for UNIX programs

– scalability is the most important design goal

– implemented on workstations and servers running BSD4.3 UNIX or Mach

28

The Andrew File System (cont’d)

• Two unusual design characteristics

– whole-file serving

• the entire contents of files are transmitted to client computers by AFS servers.

– whole-file caching

• a copy of a file is stored in a cache on the client’s local disk.

• the cache is permanent, surviving reboots of the client computer.

29

The Andrew File System (cont’d)

• The design strategy is based on some assumptions

– files are small

– reads are much more common than writes (about 6 times)

– sequential access is common and random access is rare

– most files are read and written by only one user

– temporal locality of reference for files is high

• Databases do not fit the design assumptions of AFS

– they are typically shared by many users and are often updated quite frequently

– databases are handled by their own storage mechanisms anyway

30

Implementation

• Some questions about the implementation of AFS

– How does AFS gain control when an open or close system call referring to a file in the shared file space is issued by a client?

– How is the server holding the required file located?

– What space is allocated to cached files in workstations?

– How does AFS ensure that the cached copies of files are up-to-date when files may be updated by several clients?

31

Implementation (cont’d)

– Vice: name given to the server software that runs as a user-level UNIX process in each server computer.

– Venus: a user-level process that runs in each client computer.

32

Cache coherence

• Callback promise

– mechanism for ensuring that cached copies of files are updated when another client closes the same file after updating it.

• Vice supplies a copy of a file to a Venus with a callback promise

– callback promises are stored with the cached files

– state of callback promise: either valid or cancelled

• When a Vice updates a file, it notifies all of the Venus processes to which it has issued callback promises by sending a callback

– a callback is an RPC from a server to a client (i.e., Venus)

• When a Venus receives a callback, it sets the callback promise token for the relevant file to cancelled

33

Cache coherence (cont’d)

• Handling open in Venus

– If the required file is found in the cache, then its token is checked.

• If its value is cancelled, then get a new copy

• If valid, then use it

• Restart of a client computer after a failure

– some callbacks may have been missed

– for each file with a valid token, Venus sends a timestamp to the server

• If timestamp is current, the server responds with valid.

• Otherwise, the server responds with cancelled
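The callback promise mechanism, the open check and the revalidation after a restart can be sketched as a small in-memory model. This is a simplified illustration, not the AFS implementation: the classes `Vice` and `Venus` borrow the names from the slides, but every method name is hypothetical, a version counter stands in for the timestamp, direct method calls stand in for RPC, and the choice to let the writer keep its promise after a store is an assumption.

```python
class Vice:
    """Server side: holds files and the callback promises it has issued."""

    def __init__(self):
        self.files = {}      # name -> (contents, version)
        self.promises = {}   # name -> set of Venus objects holding a promise

    def fetch(self, name, venus):
        # Supplying a copy also issues a callback promise to that client.
        self.promises.setdefault(name, set()).add(venus)
        return self.files[name]

    def store(self, name, contents, writer):
        _, version = self.files.get(name, (None, 0))
        self.files[name] = (contents, version + 1)
        # Send a callback to every other client that holds a promise.
        for venus in self.promises.get(name, set()) - {writer}:
            venus.callback(name)
        self.promises[name] = {writer}   # assumption: the writer keeps its promise
        return version + 1

    def validate(self, name, version):
        # Version counter used here in place of the timestamp (assumption).
        return "valid" if self.files[name][1] == version else "cancelled"


class Venus:
    """Client side: caches whole files together with their promise token."""

    def __init__(self, vice):
        self.vice = vice
        self.cache = {}      # name -> {"contents", "version", "token"}

    def callback(self, name):
        if name in self.cache:
            self.cache[name]["token"] = "cancelled"

    def open(self, name):
        entry = self.cache.get(name)
        if entry is None or entry["token"] == "cancelled":
            contents, version = self.vice.fetch(name, self)
            entry = {"contents": contents, "version": version, "token": "valid"}
            self.cache[name] = entry
        return entry["contents"]

    def close(self, name, contents):
        version = self.vice.store(name, contents, self)
        self.cache[name] = {"contents": contents, "version": version,
                            "token": "valid"}

    def restart(self):
        # Callbacks may have been missed, so revalidate every valid token.
        for name, entry in self.cache.items():
            if entry["token"] == "valid":
                entry["token"] = self.vice.validate(name, entry["version"])
```

In this model a client keeps using its cached copy without contacting the server for as long as the token stays valid, which is the point of the mechanism: the server is contacted on open only after a callback or a failed revalidation.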

34

Cache coherence (cont’d)

• Callback promise renewal interval

– Callback promises must be renewed before an open if a time T (say, 10 minutes) has elapsed without communication from the server for a cached file

– deals with communication failure

35

Update semantics

• For a client C operating on a file F on a server S, the following are guaranteed

• Update semantics for AFS-1

after a successful open: latest(F,S)

after a failed open: failure(S)

after a successful close: updated(F,S)

after a failed close: failure(S)

– latest(F,S): current value of F at C is the same as the value at S

– failure(S): open or close has not been performed at S

– updated(F,S): C’s value of F has been successfully propagated to S

36

Update semantics (2)

• Update semantics for AFS-2:

– currency guarantee for open is slightly weaker

– after a successful open:

latest(F,S,0) or (lostCallback(S,T) and inCache(F) and latest(F,S,T))

– latest(F,S,T): the copy of F seen by the client is no more than T out of date

– lostCallback(S,T): a callback message from S to C has been lost during the last T time units

– inCache(F): F was in the cache at C before open was attempted

37

Update semantics (3)

• AFS does not provide any further concurrency control mechanism

• If clients in different workstations open, write and close the same file concurrently,

– only the updates from the last close remain and all others will be silently lost (no error report)

– clients must implement concurrency control independently if they require it

• When two client processes in the same workstation open a file,

– they share the same cached copy, and updates are performed in the normal UNIX fashion: block-by-block.

38

The Coda File System

• Coda File System

– a descendent of AFS that addresses several new requirements [CMU]

– replication for a large scale system

– improvement in fault-tolerance

– mobile use of portable computers

• Goal

– constant data availability

– provide users with the benefits of a shared file repository, but allow them to rely entirely on local resources when the repository is partially or totally inaccessible

– retain the original goals of AFS with regard to scalability and the emulation of UNIX

39

The Coda File System (cont’d)

• read-write volumes

– can be stored on several servers

– higher throughput of file accesses and a greater degree of fault tolerance

• Support of disconnected operation

– an extension of the mechanism in AFS for caching copies of files at workstations

– enable workstations to operate when disconnected from the network

40

The Coda File System (cont’d)

• Volume storage group (VSG)

– set of servers holding replicas of a file volume

• Available volume storage group (AVSG)

– the subset of the VSG that a client wishing to open a file can currently access

• Callback promise mechanism

– Clients are notified of a change, as in AFS

• Updates instead of invalidations

41

The Coda File System (cont’d)

• Coda version vector (CVV)

– attached to each version of a file

– vector of integers with one element for each server in VSG

• [server-i1, server-i2, . . ., server-ik]

– each element of CVV denotes the number of modifications on the version of the file held at the corresponding server

– Provide information about the update history of each file version to enable inconsistencies to be detected and corrected automatically if updates do not conflict, or with manual intervention if they do

42

The Coda File System (cont’d)

• Repair of inconsistency

– if every element of the CVV at one site is greater than or equal to the corresponding element at all other sites (that CVV dominates the others)

• inconsistency can be automatically repaired

– otherwise, the conflict cannot in general be resolved automatically

• the file is marked as ‘inoperable’, and the owner of the file is informed of the conflict

• needs a manual intervention

43

The Coda File System (cont’d)

• Scenario

– when a modified file is closed, Venus sends to each site in AVSG an update message (new contents of the file and CVV)

– Vice at each site checks CVV

• if consistent, stores the new contents and returns ACK

– Venus increments elements of CVV for the servers that responded positively to the update message, and distributes the new CVV to members of AVSG

44

The Coda File System: Example

• F is a file in a volume replicated at servers S1, S2 and S3

C1 and C2: clients

VSG for F = {S1, S2, S3}

AVSG for C1 = {S1, S2}, AVSG for C2 = {S3}

• Initially, CVVs for F at all three servers are [1, 1, 1]

• C1 modifies F

– CVVs for F at S1 and S2 are [2, 2, 1]

• C2 modifies F

– CVV for F at S3 is [1, 1, 2]

• No CVV dominates all other CVVs

– conflict requiring manual intervention

• Suppose F is not modified in step 3 above. Then [2, 2, 1] dominates [1, 1, 1]. Thus, the version of the file at S1 or S2 should replace that at S3
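The example above can be checked with a short Python sketch of CVV handling. The function names `bump`, `dominates` and `resolve` are hypothetical, servers are identified by list index (S1 is index 0), and dominance is taken to mean "greater than or equal in every element, and not identical", which is what the [2, 2, 1] versus [1, 1, 1] case requires.

```python
def bump(cvv, responders):
    """Increment the CVV elements of the servers that acknowledged an update."""
    return [count + 1 if i in responders else count
            for i, count in enumerate(cvv)]


def dominates(a, b):
    """True if CVV a is >= CVV b in every element and the two differ."""
    return all(x >= y for x, y in zip(a, b)) and a != b


def resolve(cvvs):
    """Return the site whose CVV equals or dominates every other CVV.

    cvvs maps a site name to its CVV. None means that no such site
    exists, i.e. a conflict that needs manual intervention.
    """
    for site, vector in cvvs.items():
        if all(vector == other or dominates(vector, other)
               for other in cvvs.values()):
            return site
    return None


initial = [1, 1, 1]
after_c1 = bump(initial, {0, 1})   # C1 reaches S1 and S2 only
after_c2 = bump(initial, {2})      # C2 reaches S3 only

conflict = resolve({"S1": after_c1, "S2": after_c1, "S3": after_c2})
repairable = resolve({"S1": after_c1, "S2": after_c1, "S3": initial})
```

Here `conflict` is `None`, matching the manual-intervention case, while `repairable` names a site holding [2, 2, 1], whose version should replace the one at S3.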

45

Update semantics

• The currency guarantees offered by Coda when a file is opened at a client are weaker than for AFS

• The guarantees offered by

– successful open

• It provides the most recent copy of file from the current AVSG

• If no server is accessible, a locally cached copy of file is used if available.

– successful close

• The file has been propagated to the currently accessible set of servers

• If no server is available, the file has been marked for propagation at the earliest opportunity.

46

Update semantics (cont’d)

• S: server, S̄: set of servers (the file’s VSG), s: the AVSG for the file seen by a client C

after a successful open: (s ≠ ∅ and (latest(F,s,0) or (latest(F,s,T) and lostCallback(s,T) and inCache(F)))) or (s = ∅ and inCache(F))

after a failed open: (s ≠ ∅ and conflict(F,s)) or (s = ∅ and ¬inCache(F))

after a successful close: (s ≠ ∅ and updated(F,s)) or (s = ∅)

after a failed close: s ≠ ∅ and conflict(F,s)

– conflict(F, s) means that the values of F at some servers in s are currently in conflict

47

Cache coherence

• Venus at each client must detect the following events within T seconds

– enlargement of AVSG

• due to accessibility of a previously inaccessible server

– shrinking of an AVSG

• due to a server becoming inaccessible

– a lost callback

• Multicast messages to VSG

48

xFS

• xFS: Serverless Network File System

– in the papers "A Case for NOW" and “Experience with a ...”

– idea

• file system as a parallel program

• exploit fast LANs

• Cooperative Caching

– use remote memory to avoid going to disk

• manage client memory as a global resource

– much of client memory is not used

– server: get file from client's memory instead of from disk

– better to send a replaced file copy to an idle client than to discard it

49

xFS Cache Coherence

• Write Ownership Cache Coherence

– each node can own a file

– owner has the most up to date copy

– server just keeps track of who "owns" file

– any request to a file is forwarded to the owner

– a file is either

• owned: only one copy exists

• read-only: multiple copies

– to modify a file,

• secure a file as owned

• modify as many times as you want

• if someone else reads the file, send the up-to-date version, and mark the file as read-only

50

xFS Cache Coherence

Figure: state diagram with states invalid, owned and read-only; the transitions are labelled read, write, read by other node and write by other node.
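The write-ownership protocol can be written down as a small transition table. This is a sketch: the three states and the four event labels come from the slides, but the arrow endpoints are an assumption reconstructed from the protocol description (a local write secures ownership, a read by another node demotes the owner to read-only, and a write by another node invalidates the local copy). The names `TRANSITIONS`, `step` and `run` are hypothetical.

```python
# (current state, event) -> next state, for one node's copy of a file.
TRANSITIONS = {
    ("invalid", "read"): "read-only",
    ("invalid", "write"): "owned",
    ("read-only", "write"): "owned",
    ("read-only", "write by other node"): "invalid",
    ("owned", "read by other node"): "read-only",
    ("owned", "write by other node"): "invalid",
}


def step(state, event):
    """Return the next state; events with no entry leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, state="invalid"):
    """Apply a sequence of events starting from the given state."""
    for event in events:
        state = step(state, event)
    return state
```

For example, `run(["write", "write", "read by other node"])` ends in the read-only state: the node secures ownership, modifies the file again without further traffic, and is demoted when another node reads the file.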

51

xFS Software RAID

• Cooperative caching makes availability a nightmare

– any crash will damage a part of a file system

• stripe data redundantly over multiple disks

– software RAID

– reconstruct missing part from remaining parts

– logging makes reconstruction easy
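Reconstructing a missing part from the remaining parts can be illustrated with XOR parity, the scheme used by parity-based RAID levels. The sketch below is a generic illustration rather than the xFS code: the function names are hypothetical, blocks are equal-length byte strings, and the stripe holds a single parity block, so it tolerates the loss of exactly one block.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of several equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe, missing):
    """Rebuild the block at index `missing` from all the surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != missing]
    return xor_blocks(survivors)


stripe = make_stripe([b"abcd", b"efgh", b"ijkl"])
rebuilt = reconstruct(stripe, 1)   # pretend the node holding block 1 crashed
```

Because XOR is its own inverse, the same `reconstruct` call works whether the lost block held data or parity.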

52

xFS Software RAID

• Motivations

– high bandwidth requirements from

• multimedia

• parallel computing

– economical workstations

– high speed network

– let’s learn from RAID

• parallel I/O from inexpensive hard disks

• fault management

• limitations

– single server

– small write problem

53

xFS Software RAID

• Approaches

– stripe each file across multiple file servers

• small file problems

– when the striping unit is too small

• ideal size is 10’s of Kbytes

• two reads and two writes for a write (parity check/build)

– when a file is a striping unit

• parity will consume the same space

• load cannot be spread across servers
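The "two reads and two writes for a write" cost can be made concrete with a parity-update sketch. It assumes XOR parity, where the new parity equals the old parity XOR the old data XOR the new data; the `CountingStore` class and the `small_write` function are hypothetical names used only to count block accesses.

```python
def xor(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


class CountingStore:
    """In-memory block store that counts reads and writes."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.reads = 0
        self.writes = 0

    def read(self, index):
        self.reads += 1
        return self.blocks[index]

    def write(self, index, data):
        self.writes += 1
        self.blocks[index] = data


def small_write(store, index, parity_index, new_data):
    """Overwrite one data block and keep the stripe's parity consistent."""
    old_data = store.read(index)             # read 1
    old_parity = store.read(parity_index)    # read 2
    new_parity = xor(xor(old_parity, old_data), new_data)
    store.write(index, new_data)             # write 1
    store.write(parity_index, new_parity)    # write 2


data = [b"aaaa", b"bbbb", b"cccc"]
parity = xor(xor(data[0], data[1]), data[2])
store = CountingStore(data + [parity])
small_write(store, 1, 3, b"zzzz")
```

A write that covers a whole stripe avoids the two reads, since the parity can then be computed from the new data alone.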

54

xFS Experiences

• Need of a formal method for cache coherence

– it is much more complicated than it looks

• lots of transient states

• 3 formal states => 22 implementation states

– ad hoc test-and-retry leaves unknown errors in place permanently

– no one is sure about the correctness

– software portability is poor

55

xFS Experiences

• Threads in a server

– it is a nice concept but

– it incurs too much concurrency

• too many data races

• the most difficult thing to understand in the world

• difficult to debug

– solution: iterative server

• difficult to design but simple to debug

– less error-prone

• efficient

• RPC

– not suitable for multi-party communication

– need to gather/scatter RPC servers
