Distributed File Systems
Transcript of Distributed File Systems
Dr. Kalpakis
CMSC 621, Advanced Operating Systems. Fall 2003
URL: http://www.csee.umbc.edu/~kalpakis/Courses/621
Distributed File Systems
DFS
A distributed file system is a module that
implements a common file system shared by all nodes in a distributed
system
DFS should offer
network transparency
high availability
key DFS services
file server (store, and read/write files)
name server (map names to stored objects)
cache manager (file caching at clients or servers)
DFS Mechanisms
Mounting
Caching
Hints
Bulk data transfers
DFS mechanisms
mounting
name space = collection of names of stored objects which may or may
not share a common name resolution mechanism
binds a name space to a name (mount point) in another name space
mount tables maintain the map of mount points to stored objects
mount tables can be kept at clients or servers
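The mounting mechanism above can be sketched in a few lines. This is a hypothetical toy model (the table layout, server names, and `resolve` function are all assumptions, not part of any real DFS): a client-side mount table binds remote name spaces to mount points, and resolution walks the pathname component by component, switching name spaces whenever a mount point is crossed.

```python
# Hypothetical mount table: (name space, mount point) -> mounted name space.
# Nested mounts are allowed, as in a real DFS hierarchy.
mount_table = {
    ("local",   "/home"):      ("serverB", "/"),  # /home mounts serverB's root
    ("serverB", "/alice/src"): ("serverC", "/"),  # nested mount inside serverB
}

def resolve(path):
    """Walk `path` one component at a time; crossing a mount point switches
    to the mounted name space. Returns (name space, path within it)."""
    namespace, cur = "local", "/"
    for comp in [c for c in path.split("/") if c]:
        cur = cur.rstrip("/") + "/" + comp
        target = mount_table.get((namespace, cur))
        if target:                       # crossed a mount point
            namespace, cur = target
    return namespace, cur

print(resolve("/home/alice/src/main.c"))  # ('serverC', '/main.c')
print(resolve("/etc/passwd"))             # ('local', '/etc/passwd')
```

Keeping this table at clients (as here) avoids a server round-trip per mount crossing; keeping it at servers centralizes the map but adds traffic, which is exactly the trade-off the slide mentions.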
caching
amortize access cost of remote or disk data over many references
can be done at clients and/or servers
can be main memory or disk caches
helps to reduce delays (disk or network) in accessing stored objects
helps to reduce server loads and network traffic
DFS mechanisms
hints
caching introduces the problem of cache consistency
ensuring cache consistency is expensive
cached info can be used as a hint (e.g. mapping of a name to a stored
object)
bulk data transfers
overhead in executing network protocols is high
network transit delays are small
solution: amortize protocol processing overhead and disk seek times and
latencies over many file blocks
Name Resolution Issues
naming schemes
host:filename
simple and efficient
no location transparency
mounting
single global name space
uniqueness of names requires cooperating servers
Context-aware
partition the name space into contexts
name resolution is always performed with respect to a given context
name servers
single name server
different name servers for different parts of a name space
Caching Issues
Main memory caches
faster access
diskless clients can also use caching
single design for both client and server caches
compete with Virtual Memory manager for physical memory
cannot completely cache large stored objects
block-level caching is complex to implement
cannot be used by portable clients
Disk caches
remove some of drawbacks of the main memory caches
Caching Issues
Writing policy
write-through
every client’s write request is performed at the servers immediately
delayed writing
client’s writes are reflected to the stored objects at servers after some delay
many writes in the cache
writes to short-lived objects are not done at servers
20-30% of new data are deleted within 30 secs
lost data is an issue
delayed writing until file close
most files are open for a short time
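The contrast between the two writing policies can be made concrete with a small sketch. The class and method names are assumptions for illustration only: the same sequence of writes goes through a write-through cache (every write reaches the server) and a delayed-writing cache (only data that survives until the flush is ever sent, so the short-lived object from the 20-30% statistic above costs nothing).

```python
class Server:
    """Toy file server that counts how many write requests reach it."""
    def __init__(self):
        self.writes = 0
        self.store = {}
    def write(self, block, data):
        self.writes += 1
        self.store[block] = data

class Cache:
    """Client cache supporting write-through or delayed writing."""
    def __init__(self, server, write_through):
        self.server = server
        self.write_through = write_through
        self.dirty = {}
    def write(self, block, data):
        if self.write_through:
            self.server.write(block, data)   # reflected at server immediately
        else:
            self.dirty[block] = data         # only the last value survives
    def delete(self, block):
        self.dirty.pop(block, None)          # short-lived data: never sent
    def flush(self):                         # e.g. at file close or timeout
        for block, data in self.dirty.items():
            self.server.write(block, data)
        self.dirty.clear()

for wt in (True, False):
    s = Server()
    c = Cache(s, write_through=wt)
    c.write("b1", "v1"); c.write("b1", "v2")  # overwritten in place
    c.write("b2", "tmp"); c.delete("b2")      # short-lived object
    c.flush()
    print("write-through" if wt else "delayed", "-> server writes:", s.writes)
# write-through -> server writes: 3
# delayed -> server writes: 1
```

The lost-data risk the slide mentions is visible here too: before `flush()`, the delayed cache's only copy of `b1` is in client memory.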
Caching Issues
Approaches to deal with the cache consistency problem
server-initiated
servers inform client cache managers whenever their cached data become
stale
servers need to keep track of who cached which file blocks
client-initiated
clients validate data with servers before using
partially negates caching benefits
disable caching when concurrent-write sharing is detected
concurrent-write sharing: multiple clients have a file open, with at least
one of them open for writing
avoid concurrent-write sharing by using locking
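The detection rule in the third approach is simple bookkeeping on the server. A minimal sketch, with assumed class and attribute names: the server tracks, per file, which clients have it open and in which mode, and declares the file uncacheable exactly while more than one client has it open and at least one is writing.

```python
class FileState:
    """Per-file open-mode bookkeeping at the server."""
    def __init__(self):
        self.readers = set()
        self.writers = set()
    def open(self, client, mode):
        (self.writers if mode == "w" else self.readers).add(client)
    def close(self, client):
        self.readers.discard(client)
        self.writers.discard(client)
    @property
    def cacheable(self):
        clients = self.readers | self.writers
        # concurrent-write sharing: >1 client open, at least one writing
        return not (len(clients) > 1 and self.writers)

f = FileState()
f.open("A", "r"); print(f.cacheable)  # True: single reader
f.open("B", "w"); print(f.cacheable)  # False: reader + writer -> no caching
f.close("A");     print(f.cacheable)  # True: single writer again
```

Sprite (later in these notes) uses exactly this trigger to switch a file into uncacheable mode and back.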
More Caching Consistency Issues
The sequential-write sharing problem
occurs when a client opens a (previously opened) file that has recently
been modified and closed by another client
causes problems
a client may still have (outdated) file blocks in its cache
the other client may not yet have written its modified cached file blocks to
the file server
solutions
associate file timestamps with all cached file blocks; at file open, request
the current file timestamp from the file server
file server asks the client with the modified cached blocks to flush its data to
server when another client opens a file for writing
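The first solution above (timestamped cache blocks) fits in a few lines. This is a hedged sketch with assumed names: each cached block remembers the file timestamp it was read under, and at open the client fetches the server's current timestamp and discards blocks read under an older one.

```python
class Client:
    """Toy client cache for one file: block# -> (data, timestamp_at_read)."""
    def __init__(self):
        self.cache = {}

    def open(self, server_timestamp):
        # Keep only blocks read under the file's current timestamp;
        # anything older was made stale by another client's write+close.
        self.cache = {b: (d, ts) for b, (d, ts) in self.cache.items()
                      if ts == server_timestamp}

c = Client()
c.cache = {0: ("old", 100), 1: ("old", 100)}
c.open(server_timestamp=100)   # file unchanged since last read: cache kept
print(len(c.cache))            # 2
c.open(server_timestamp=101)   # another client modified and closed the file
print(len(c.cache))            # 0 -- outdated blocks discarded
```

The second solution handles the other half of the problem (the writer's dirty blocks): the server would additionally ask the last writer to flush before serving the new open, as sketched later for Sprite.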
Availability Issues
Replication can help increase data availability
but is expensive, due to the extra storage for replicas and the overhead of
keeping the replicas consistent
Main problems
maintaining replica consistency
detecting replica inconsistencies and recovering from them
handle network partitions
placing replicas where needed
keep the rate of deadlocks small and availability high
Availability Issues
Unit of replication
complete file or file block
allows replication of only the data that are needed
replica management is harder (locating replicas, ensuring file protection,
etc)
volume (group) of files
wasteful if many files are not needed
replica management simpler
pack, a subset of the files in a user’s primary pack
mutual consistency among replicas
Let the most current replica = the replica with the highest timestamp in a quorum
Use voting to read/write replicas and keep at least one replica current
Only votes from most current replicas are valid
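The voting scheme above can be sketched as a toy quorum system. Everything here is an assumption for illustration (simple integer timestamps, fixed quorum sizes, hypothetical function names); the key property is r + w > n, so every read quorum intersects every write quorum in at least one current replica.

```python
import random

replicas = [{"ts": 0, "val": None} for _ in range(5)]
N, R, W = 5, 3, 3   # R + W > N guarantees quorum intersection

def write(val, ts):
    """Install (val, ts) at an arbitrary write quorum of W replicas."""
    for rep in random.sample(replicas, W):
        if ts > rep["ts"]:               # never overwrite newer data
            rep["ts"], rep["val"] = ts, val

def read():
    """Collect votes from an arbitrary read quorum of R replicas and
    return the value of the most current one (highest timestamp)."""
    quorum = random.sample(replicas, R)
    current = max(quorum, key=lambda rep: rep["ts"])
    return current["val"]

write("v1", ts=1)
write("v2", ts=2)
print(read())   # always "v2": each read quorum holds a current replica
```

Because only the most current replica's vote decides the read, stale replicas left behind by a partial write are harmless, which is the point of the "only votes from most current replicas are valid" rule.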
Scalability & Semantic Issues
Caching & cache consistency
take advantage of file usage patterns
many widely used and shared files are accessed in read-only mode
data a client needs are often found in another client’s cache
organize client caches and file servers in a hierarchy for each file
implement file servers, name servers, and cache managers as
multithreaded processes
common FS semantics: each read operation returns data due to
the most recent write operation
providing these semantics in DFS is difficult and expensive
NFS
[Figure: NFS architecture. Client side: OS interface over a VFS interface
that dispatches to the Unix FS, other local FSs, or NFS; the NFS client
uses RPC/XDR to cross the network. Server side: RPC/XDR, server routines,
a VFS interface, and the disk, all inside the kernel.]
NFS
Interfaces
file system
virtual file system (VFS)
vnodes uniquely identify objects in the FS
contain mount table info (pointers to parent FS and mounted FS)
RPC and XDR (external data representation)
NFS Naming and Location
Filenames are mapped to the represented objects at first use
mapping is done at the servers by sequentially resolving each
element of a pathname using the vnode information until a file
handle is obtained
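This component-by-component resolution can be sketched with a toy vnode table (the table contents and function name are assumptions, not NFS internals): each pathname element is looked up in the current vnode's directory until the last component yields a file handle.

```python
# Toy server-side name space: vnode -> {component: child vnode or file handle}
namespace = {
    "vnode:/":        {"usr": "vnode:/usr"},
    "vnode:/usr":     {"bin": "vnode:/usr/bin"},
    "vnode:/usr/bin": {"cc": "fh:42"},
}

def lookup(root, path):
    """Resolve `path` one component at a time starting from `root`.
    In NFS each step is a separate LOOKUP request to the server."""
    vnode = root
    for comp in [c for c in path.split("/") if c]:
        vnode = namespace[vnode][comp]
    return vnode   # file handle once the last component is resolved

print(lookup("vnode:/", "/usr/bin/cc"))  # fh:42
```

One lookup per component keeps the server stateless (each request is self-contained) but costs a round-trip per element; the directory-name-lookup cache described below exists precisely to short-circuit these repeated steps.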
NFS Caching
File Caching
read ahead and 8KB file blocks are used
files or file blocks are cached with timestamp of last update
cached blocks are assumed valid for a preset time period
block validation is performed at file open and after timeout at the server
upon detecting an invalid block all blocks of the file are discarded
delayed writing policy with modified blocks flushed to server upon file
close
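The validation rule above can be written out as a small state machine. The constant and field names are assumptions for illustration: within the validity window a cached block is used without contacting the server; after the window expires, its timestamp is checked against the server's, and a mismatch discards the file's blocks.

```python
VALIDITY = 3.0   # assumed preset validity window in seconds

def use_block(entry, now, server_mtime):
    """entry = {'validated_at': t, 'mtime': file mtime when cached}.
    Returns 'cache', 'revalidated', or 'discard'."""
    if now - entry["validated_at"] < VALIDITY:
        return "cache"                       # assumed valid: no server contact
    if entry["mtime"] == server_mtime:       # window expired: ask the server
        entry["validated_at"] = now
        return "revalidated"
    return "discard"                          # stale: drop the file's blocks

e = {"validated_at": 0.0, "mtime": 100}
print(use_block(e, 1.0, server_mtime=100))   # cache
print(use_block(e, 5.0, server_mtime=100))   # revalidated
print(use_block(e, 10.0, server_mtime=101))  # discard
```

Note the consequence the slides draw out: between validations a client can read stale data for up to the validity window, which is the price NFS pays for keeping servers stateless.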
NFS Caching
Directory name lookup caching
directory names ==> vnodes
cached entries are updated upon lookup failure or when new info is
received
File/Directory attribute cache
access to file/dir attributes accounts for 90% of file requests
file attributes are discarded after 3 secs
dir attributes are discarded after 30 secs
dir changes are performed at the server
NFS servers are stateless
Sprite File System
Name space is a single hierarchy of domains
Each server stores one or more domains
Domains have unique prefixes
mount points link domains in single hierarchy
clients maintain prefix table
Sprite FS - Prefix tables
locating files in Sprite
each client finds the longest prefix match in its prefix table and then sends
the remainder of the pathname to the matching server, together with the domain
token from its prefix table
server replies with file token or with a new pathname if the “file” is a remote
link
each client request contains the filename and domain token
when client fails to find matching prefix or fails during a file open
client broadcasts pathname and server with matching domain replies with
domain/file token
entries in prefix table are hints
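The longest-prefix lookup above fits in a few lines. A minimal sketch with assumed table contents and names (the servers and tokens are hypothetical): the client picks the longest matching domain prefix and forwards the remainder of the pathname with that domain's token.

```python
prefix_table = {   # domain prefix -> (server, domain token); entries are hints
    "/":          ("srvA", "tokA"),
    "/users":     ("srvB", "tokB"),
    "/users/doc": ("srvC", "tokC"),
}

def locate(path):
    """Longest-prefix match; returns (server, domain token, remainder).
    On a miss or stale hint, the real system falls back to broadcast."""
    prefix = max((p for p in prefix_table if path.startswith(p)), key=len)
    server, token = prefix_table[prefix]
    remainder = path[len(prefix):].lstrip("/")
    return server, token, remainder

print(locate("/users/doc/thesis.tex"))  # ('srvC', 'tokC', 'thesis.tex')
print(locate("/users/alice/notes"))     # ('srvB', 'tokB', 'alice/notes')
```

Because entries are hints, a stale match is harmless: the contacted server rejects the unknown token, and the broadcast path repopulates the table.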
Sprite FS - Caching
Client-cache in main memory
file block size is 4KB
cache entries are addressed with file token and block#, which allows
blocks to be added without contacting the server
blocks can be accessed without accessing file’s disk map to get block’s disk
address
clients do not cache directories to avoid inconsistencies
servers have main memory caches as well
delayed writing policy is used
Sprite FS - Cache Writing Policy
Observations
BSD
20-30% of new data live less than 30 secs
75% of files are open for less than 0.5 secs
90% of files are open for less than 10 secs
recent study
65-80% of files are open for less than 30 secs
4-27% of new data are deleted within 30 secs
One can reduce traffic by
not updating servers at file close immediately
not updating servers when caches are updated
Sprite Cache Writing Policy
Delayed writing policy
every 5 secs flush client’s cached (modified) blocks to server if they
haven’t been modified within the last 30 secs
flush blocks from server’s cache to disk within 30-60 secs afterwards
replacement policy: LRU
80% of the time blocks are ejected to make room for other blocks
20% of the time to return memory to the VM
cache blocks are unreferenced for about 1hr before ejected
cache misses
40% on reads and 1% on writes
Sprite Cache Consistency
Server initiated
avoid concurrent-write sharing by disabling caching for files open
concurrently for reading and writing
ask client writing file to flush its blocks
inform all other clients that file is not cacheable
file becomes cacheable when all clients close the file again
solve sequential-write sharing using version numbers
each client keeps the version# of file whose blocks it caches
server increments version# each time file is opened for writing
client is informed of file version# at file open
server keeps track of the last writer; server asks the last writer to flush its
cached blocks if the file is opened by another client
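The version-number scheme can be sketched end to end (class names and structure are assumptions; the flush here just drops the toy cache rather than writing blocks back): the server bumps the version on each open-for-write, returns it to the opener, asks the last writer to flush, and clients discard caches whose version is out of date.

```python
class SpriteServer:
    def __init__(self):
        self.version = 1
        self.last_writer = None
    def open(self, client, mode):
        if mode == "w":
            if self.last_writer and self.last_writer is not client:
                self.last_writer.flush()   # sequential-write sharing fix
            self.version += 1              # new version per open-for-write
            self.last_writer = client
        return self.version                # client learns version at open

class SpriteClient:
    def __init__(self, name):
        self.name, self.cached_version, self.blocks = name, None, {}
    def flush(self):
        self.blocks = {}                   # sketch only: real flush writes back
    def open(self, server, mode="r"):
        v = server.open(self, mode)
        if self.cached_version != v:
            self.blocks = {}               # stale cache: discard blocks
        self.cached_version = v

srv = SpriteServer()
a, b = SpriteClient("A"), SpriteClient("B")
a.open(srv, "w"); a.blocks = {0: "x"}      # A writes and caches a block
b.open(srv, "w")                           # B opens for writing
print(a.blocks)                            # {} -- A was asked to flush
```

This handles sequential-write sharing; concurrent-write sharing is handled separately by disabling caching, as on the previous slide.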
Sprite VM and FS Cache Contention
VM and FS compete for physical memory
VM and FS negotiate for physical memory usage
separate pools of blocks using the time of last access to determine
winner; VM is given slight preference (it loses only if a block hasn’t
been referenced for 20 mins)
double caching is a problem
FS marks blocks of newly compiled code with infinite time of last reference
backing files = swapped-out pages (including process state and data
segments)
clients bypass FS cache when reading/writing backing files
CODA
Goals
scalability
availability
disconnected operation
Volume = collection of files and directories on a single server
unit of replication
FS objects have a unique FID which consists of
32-bit volume number
32-bit vnode number
32-bit uniquifier
replicas of a FS object have the same FID
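The three-field FID can be illustrated with simple bit packing (the function names are assumptions; Coda's internal representation is not shown here): three 32-bit fields make one 96-bit identifier, and since replicas share the FID, location information lives entirely outside the name.

```python
def make_fid(volume, vnode, uniquifier):
    """Pack the three 32-bit FID fields into one 96-bit integer."""
    assert all(0 <= x < 2**32 for x in (volume, vnode, uniquifier))
    return (volume << 64) | (vnode << 32) | uniquifier

def split_fid(fid):
    """Recover (volume number, vnode number, uniquifier)."""
    return fid >> 64, (fid >> 32) & 0xFFFFFFFF, fid & 0xFFFFFFFF

fid = make_fid(volume=7, vnode=1234, uniquifier=99)
print(split_fid(fid))   # (7, 1234, 99)
```

The uniquifier lets a vnode number be reused after deletion without old handles silently resolving to the new object, and the volume number is what the replicated Volume Location database maps to servers.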
CODA Location
Volume Location database
replicated at each server
Volume Replication database
replicated at each server
Volume Storage Group (VSG)
Venus
client cache manager
caching in local disk
AVSG=client accessible nodes in VSG
preferred server in AVSG
CODA Caching & Replication
Venus caches files/dirs on demand
from the server in AVSG with the most up-to-date data
on file access
users can indicate caching priorities for files/dirs
users can bracket action sequences
Venus establishes callbacks at the preferred server for each FS object
Server callbacks
server tells client that cached object is invalid
lost callbacks can happen
CODA AVSG Maintenance
Venus tracks changes in AVSG
detects nodes in the VSG that should (or should no longer) be in its AVSG by
periodically probing every node in the VSG
removes a node from the AVSG if an operation on it fails
chooses a new preferred server if needed
Coda Version Vector (CVV)
both for volumes and files/dirs
vector with one entry for each node in VSG indicating the number of
updates of the volume or FS object
Coda Replica Management
State of an object or replica
each modification is tagged with a storeid
update history = sequence of storeids
state is a truncated update history
latest storeid LSID
CVV
Coda Replica Management
comparing replicas A & B leads to one of four cases
LSID-A = LSID-B & CVV-A = CVV-B => strong equality
LSID-A = LSID-B & CVV-A != CVV-B => weak equality
LSID-A != LSID-B & CVV-A >= CVV-B => A dominates B
otherwise => inconsistent
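The four-way comparison can be sketched directly from these rules. Representation and names are assumed (a CVV as one update counter per VSG node); the "B dominates A" branch is the symmetric case of the dominance rule above.

```python
def compare(lsid_a, cvv_a, lsid_b, cvv_b):
    """Classify two replica states by LSID and componentwise CVV order."""
    geq = all(a >= b for a, b in zip(cvv_a, cvv_b))  # CVV-A >= CVV-B
    leq = all(a <= b for a, b in zip(cvv_a, cvv_b))  # CVV-A <= CVV-B
    if lsid_a == lsid_b:
        return "strong equality" if cvv_a == cvv_b else "weak equality"
    if geq:
        return "A dominates B"
    if leq:
        return "B dominates A"
    return "inconsistent"   # CVVs incomparable: concurrent updates happened

print(compare("s1", [2, 1, 1], "s1", [2, 1, 1]))  # strong equality
print(compare("s1", [2, 1, 1], "s1", [2, 1, 0]))  # weak equality
print(compare("s2", [3, 1, 1], "s1", [2, 1, 1]))  # A dominates B
print(compare("s2", [3, 0, 1], "s1", [2, 1, 1]))  # inconsistent
```

Weak equality (same last update, diverged vectors) is what the force operation repairs; the inconsistent case, where neither vector covers the other, is what ends up in a covolume for manual repair.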
when server S receives an update for a replica from client C
it checks the states of S’s and C’s replicas; the check is successful if
for files, it leads to strong equality or dominance
for dirs, it leads to strong equality
Coda Replica Management
When C wants to update a replicated object
phase I
sends the update to every node in its AVSG
each node checks the replica states (cached object vs. replicated object),
informs the client of the result, and performs the update if the check
succeeds
if unsuccessful, the client pauses while the server tries to resolve the
problem automatically; if that fails the client aborts, otherwise it resumes
phase II
client sends updated object state to every site in AVSG
Coda Replica Management
Force operation between servers
happens when Venus informs the AVSG of weak equality among replicas in the AVSG
the server with the dominant replica overwrites the data and state of the
dominated server
for directories, this is done by locking one directory at a time
repair operation
automatic; proceeds in two phases as in an update
migrate operation
moves inconsistent data to a covolume for manual repair
Conflict Resolution
Conflict resolution for files
is done by the user with the repair tool, which bypasses Coda update rules;
inconsistent files are inaccessible to CODA
Conflict resolution for directories
uses the fact that a dir is a list of files
non-automated conflicts
update/update (for attributes)
remove/update
create/create (adding identical files)
all other conflicts can be resolved easily
inconsistent objects and objects without automatic conflict resolution are
placed in covolumes