Distributed File Systems
Transcript of Distributed File Systems
Dr. Kalpakis
CMSC 621, Advanced Operating Systems. Fall 2003
URL: http://www.csee.umbc.edu/~kalpakis/Courses/621
Distributed File Systems
DFS
A distributed file system is a module that
implements a common file system shared by all nodes in a distributed
system
DFS should offer
network transparency
high availability
key DFS services
file server (store, and read/write files)
name server (map names to stored objects)
cache manager (file caching at clients or servers)
DFS Mechanisms
Mounting
Caching
Hints
Bulk data transfers
DFS mechanisms
mounting
name space = collection of names of stored objects which may or may
not share a common name resolution mechanism
binds a name space to a name (mount point) in another name space
mount tables maintain the map of mount points to stored objects
mount tables can be kept at clients or servers
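The mounting mechanism above can be sketched in a few lines. This is a hypothetical toy model (the table layout, server names, and `resolve` function are all assumptions, not part of any real DFS): a client-side mount table binds remote name spaces to mount points, and resolution walks the pathname component by component, switching name spaces whenever a mount point is crossed.

```python
# Hypothetical mount table: (name space, mount point) -> mounted name space.
# Nested mounts are allowed, as in a real DFS hierarchy.
mount_table = {
    ("local",   "/home"):      ("serverB", "/"),  # /home mounts serverB's root
    ("serverB", "/alice/src"): ("serverC", "/"),  # nested mount inside serverB
}

def resolve(path):
    """Walk `path` one component at a time; crossing a mount point switches
    to the mounted name space. Returns (name space, path within it)."""
    namespace, cur = "local", "/"
    for comp in [c for c in path.split("/") if c]:
        cur = cur.rstrip("/") + "/" + comp
        target = mount_table.get((namespace, cur))
        if target:                       # crossed a mount point
            namespace, cur = target
    return namespace, cur

print(resolve("/home/alice/src/main.c"))  # ('serverC', '/main.c')
print(resolve("/etc/passwd"))             # ('local', '/etc/passwd')
```

Keeping this table at clients (as here) avoids a server round-trip per mount crossing; keeping it at servers centralizes the map but adds traffic, which is exactly the trade-off the slide mentions.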
caching
amortize access cost of remote or disk data over many references
can be done at clients and/or servers
can be main memory or disk caches
helps to reduce delays (disk or network) in accessing stored objects
helps to reduce server loads and network traffic
DFS mechanisms
hints
caching introduces the problem of cache consistency
ensuring cache consistency is expensive
cached info can be used as a hint (e.g. mapping of a name to a stored
object)
bulk data transfers
overhead in executing network protocols is high
network transit delays are small
solution: amortize protocol processing overhead and disk seek times and
latencies over many file blocks
Name Resolution Issues
naming schemes
host:filename
simple and efficient
no location transparency
mounting
single global name space
uniqueness of names requires cooperating servers
Context-aware
partition the name space into contexts
name resolution is always performed with respect to a given context
name servers
single name server
different name servers for different parts of a name space
Caching Issues
Main memory caches
faster access
diskless clients can also use caching
single design for both client and server caches
compete with Virtual Memory manager for physical memory
cannot completely cache large stored objects
block-level caching is complex to implement
cannot be used by portable clients
Disk caches
remove some of drawbacks of the main memory caches
Caching Issues
Writing policy
write-through
every client’s write request is performed at the servers immediately
delayed writing
client’s writes are reflected to the stored objects at servers after some delay
many writes in the cache
writes to short-lived objects are not done at servers
20-30% of new data are deleted within 30 secs
lost data is an issue
delayed writing until file close
most files are open for a short time
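The contrast between the two writing policies can be made concrete with a small sketch. The class and method names are assumptions for illustration only: the same sequence of writes goes through a write-through cache (every write reaches the server) and a delayed-writing cache (only data that survives until the flush is ever sent, so the short-lived object from the 20-30% statistic above costs nothing).

```python
class Server:
    """Toy file server that counts how many write requests reach it."""
    def __init__(self):
        self.writes = 0
        self.store = {}
    def write(self, block, data):
        self.writes += 1
        self.store[block] = data

class Cache:
    """Client cache supporting write-through or delayed writing."""
    def __init__(self, server, write_through):
        self.server = server
        self.write_through = write_through
        self.dirty = {}
    def write(self, block, data):
        if self.write_through:
            self.server.write(block, data)   # reflected at server immediately
        else:
            self.dirty[block] = data         # only the last value survives
    def delete(self, block):
        self.dirty.pop(block, None)          # short-lived data: never sent
    def flush(self):                         # e.g. at file close or timeout
        for block, data in self.dirty.items():
            self.server.write(block, data)
        self.dirty.clear()

for wt in (True, False):
    s = Server()
    c = Cache(s, write_through=wt)
    c.write("b1", "v1"); c.write("b1", "v2")  # overwritten in place
    c.write("b2", "tmp"); c.delete("b2")      # short-lived object
    c.flush()
    print("write-through" if wt else "delayed", "-> server writes:", s.writes)
# write-through -> server writes: 3
# delayed -> server writes: 1
```

The lost-data risk the slide mentions is visible here too: before `flush()`, the delayed cache's only copy of `b1` is in client memory.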
Caching Issues
Approaches to deal with the cache consistency problem
server-initiated
servers inform client cache managers whenever their cached data become
stale
servers need to keep track of who cached which file blocks
client-initiated
clients validate data with servers before using
partially negates caching benefits
disable caching when concurrent-write sharing is detected
concurrent-write sharing: multiple clients have a file open, with at least
one of them open for writing
avoid concurrent-write sharing by using locking
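The detection rule in the third approach is simple bookkeeping on the server. A minimal sketch, with assumed class and attribute names: the server tracks, per file, which clients have it open and in which mode, and declares the file uncacheable exactly while more than one client has it open and at least one is writing.

```python
class FileState:
    """Per-file open-mode bookkeeping at the server."""
    def __init__(self):
        self.readers = set()
        self.writers = set()
    def open(self, client, mode):
        (self.writers if mode == "w" else self.readers).add(client)
    def close(self, client):
        self.readers.discard(client)
        self.writers.discard(client)
    @property
    def cacheable(self):
        clients = self.readers | self.writers
        # concurrent-write sharing: >1 client open, at least one writing
        return not (len(clients) > 1 and self.writers)

f = FileState()
f.open("A", "r"); print(f.cacheable)  # True: single reader
f.open("B", "w"); print(f.cacheable)  # False: reader + writer -> no caching
f.close("A");     print(f.cacheable)  # True: single writer again
```

Sprite (later in these notes) uses exactly this trigger to switch a file into uncacheable mode and back.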
More Caching Consistency Issues
The sequential-write sharing problem
occurs when a client opens a (previously opened) file that has recently
been modified and closed by another client
causes problems
a client may still have (outdated) file blocks in its cache
the other client may not yet have written its modified cached file blocks to
the file server
solutions
associate file timestamps with all cached file blocks; at file open, request
the current file timestamp from the file server
file server asks the client with the modified cached blocks to flush its data to
server when another client opens a file for writing
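The first solution above (timestamped cache blocks) fits in a few lines. This is a hedged sketch with assumed names: each cached block remembers the file timestamp it was read under, and at open the client fetches the server's current timestamp and discards blocks read under an older one.

```python
class Client:
    """Toy client cache for one file: block# -> (data, timestamp_at_read)."""
    def __init__(self):
        self.cache = {}

    def open(self, server_timestamp):
        # Keep only blocks read under the file's current timestamp;
        # anything older was made stale by another client's write+close.
        self.cache = {b: (d, ts) for b, (d, ts) in self.cache.items()
                      if ts == server_timestamp}

c = Client()
c.cache = {0: ("old", 100), 1: ("old", 100)}
c.open(server_timestamp=100)   # file unchanged since last read: cache kept
print(len(c.cache))            # 2
c.open(server_timestamp=101)   # another client modified and closed the file
print(len(c.cache))            # 0 -- outdated blocks discarded
```

The second solution handles the other half of the problem (the writer's dirty blocks): the server would additionally ask the last writer to flush before serving the new open, as sketched later for Sprite.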
Availability Issues
Replication can help increase data availability
but is expensive, due to the extra storage for replicas and the overhead of
keeping the replicas consistent
Main problems
maintaining replica consistency
detecting replica inconsistencies and recovering from them
handle network partitions
placing replicas where needed
keep the rate of deadlocks small and availability high
Availability Issues
Unit of replication
complete file or file block
allows replication of only the data that are needed
replica management is harder (locating replicas, ensuring file protection,
etc)
volume (group) of files
wasteful if many files are not needed
replica management simpler
pack, a subset of the files in a user’s primary pack
mutual consistency among replicas
Let the most current replica = the replica with the highest timestamp in a quorum
Use voting to read/write replicas and keep at least one replica current
Only votes from most current replicas are valid
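The voting scheme above can be sketched as a toy quorum system. Everything here is an assumption for illustration (simple integer timestamps, fixed quorum sizes, hypothetical function names); the key property is r + w > n, so every read quorum intersects every write quorum in at least one current replica.

```python
import random

replicas = [{"ts": 0, "val": None} for _ in range(5)]
N, R, W = 5, 3, 3   # R + W > N guarantees quorum intersection

def write(val, ts):
    """Install (val, ts) at an arbitrary write quorum of W replicas."""
    for rep in random.sample(replicas, W):
        if ts > rep["ts"]:               # never overwrite newer data
            rep["ts"], rep["val"] = ts, val

def read():
    """Collect votes from an arbitrary read quorum of R replicas and
    return the value of the most current one (highest timestamp)."""
    quorum = random.sample(replicas, R)
    current = max(quorum, key=lambda rep: rep["ts"])
    return current["val"]

write("v1", ts=1)
write("v2", ts=2)
print(read())   # always "v2": each read quorum holds a current replica
```

Because only the most current replica's vote decides the read, stale replicas left behind by a partial write are harmless, which is the point of the "only votes from most current replicas are valid" rule.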
Scalability & Semantic Issues
Caching & cache consistency
take advantage of file usage patterns
many widely used and shared files are accessed in read-only mode
data a client needs are often found in another client’s cache
organize client caches and file servers in a hierarchy for each file
implement file servers, name servers, and cache managers as
multithreaded processes
common FS semantics: each read operation returns data due to
the most recent write operation
providing these semantics in DFS is difficult and expensive
NFS
[Figure: NFS architecture. Client side: OS interface over a VFS interface
that dispatches to the Unix FS, other local FSs, or NFS; the NFS client
uses RPC/XDR to cross the network. Server side: RPC/XDR, server routines,
a VFS interface, and the disk, all inside the kernel.]
NFS
Interfaces
file system
virtual file system (VFS)
vnodes uniquely identify objects in the FS
contain mount table info (pointers to parent FS and mounted FS)
RPC and XDR (external data representation)
NFS Naming and Location
Filenames are mapped to the represented objects at first use
mapping is done at the servers by sequentially resolving each
element of a pathname using the vnode information until a file
handle is obtained
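This component-by-component resolution can be sketched with a toy vnode table (the table contents and function name are assumptions, not NFS internals): each pathname element is looked up in the current vnode's directory until the last component yields a file handle.

```python
# Toy server-side name space: vnode -> {component: child vnode or file handle}
namespace = {
    "vnode:/":        {"usr": "vnode:/usr"},
    "vnode:/usr":     {"bin": "vnode:/usr/bin"},
    "vnode:/usr/bin": {"cc": "fh:42"},
}

def lookup(root, path):
    """Resolve `path` one component at a time starting from `root`.
    In NFS each step is a separate LOOKUP request to the server."""
    vnode = root
    for comp in [c for c in path.split("/") if c]:
        vnode = namespace[vnode][comp]
    return vnode   # file handle once the last component is resolved

print(lookup("vnode:/", "/usr/bin/cc"))  # fh:42
```

One lookup per component keeps the server stateless (each request is self-contained) but costs a round-trip per element; the directory-name-lookup cache described below exists precisely to short-circuit these repeated steps.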
NFS Caching
File Caching
read ahead and 8KB file blocks are used
files or file blocks are cached with timestamp of last update
cached blocks are assumed valid for a preset time period
block validation is performed at file open and after timeout at the server
upon detecting an invalid block all blocks of the file are discarded
delayed writing policy with modified blocks flushed to server upon file
close
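The validation rule above can be written out as a small state machine. The constant and field names are assumptions for illustration: within the validity window a cached block is used without contacting the server; after the window expires, its timestamp is checked against the server's, and a mismatch discards the file's blocks.

```python
VALIDITY = 3.0   # assumed preset validity window in seconds

def use_block(entry, now, server_mtime):
    """entry = {'validated_at': t, 'mtime': file mtime when cached}.
    Returns 'cache', 'revalidated', or 'discard'."""
    if now - entry["validated_at"] < VALIDITY:
        return "cache"                       # assumed valid: no server contact
    if entry["mtime"] == server_mtime:       # window expired: ask the server
        entry["validated_at"] = now
        return "revalidated"
    return "discard"                          # stale: drop the file's blocks

e = {"validated_at": 0.0, "mtime": 100}
print(use_block(e, 1.0, server_mtime=100))   # cache
print(use_block(e, 5.0, server_mtime=100))   # revalidated
print(use_block(e, 10.0, server_mtime=101))  # discard
```

Note the consequence the slides draw out: between validations a client can read stale data for up to the validity window, which is the price NFS pays for keeping servers stateless.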
NFS Caching
Directory name lookup caching
directory names ==> vnodes
cached entries are updated upon lookup failure or when new info is
received
File/Directory attribute cache
access to file/dir attributes accounts for 90% of file requests
file attributes are discarded after 3 secs
dir attributes are discarded after 30 secs
dir changes are performed at the server
NFS servers are stateless
Sprite File System
Name space is a single hierarchy of domains
Each server stores one or more domains
Domains have unique prefixes
mount points link domains in single hierarchy
clients maintain prefix table
Sprite FS - Prefix tables
locating files in Sprite
each client finds the longest prefix match in its prefix table and then sends
the remainder of the pathname to the matching server, together with the domain
token from its prefix table
server replies with file token or with a new pathname if the “file” is a remote
link
each client request contains the filename and domain token
when client fails to find matching prefix or fails during a file open
client broadcasts pathname and server with matching domain replies with
domain/file token
entries in prefix table are hints
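The longest-prefix lookup above fits in a few lines. A minimal sketch with assumed table contents and names (the servers and tokens are hypothetical): the client picks the longest matching domain prefix and forwards the remainder of the pathname with that domain's token.

```python
prefix_table = {   # domain prefix -> (server, domain token); entries are hints
    "/":          ("srvA", "tokA"),
    "/users":     ("srvB", "tokB"),
    "/users/doc": ("srvC", "tokC"),
}

def locate(path):
    """Longest-prefix match; returns (server, domain token, remainder).
    On a miss or stale hint, the real system falls back to broadcast."""
    prefix = max((p for p in prefix_table if path.startswith(p)), key=len)
    server, token = prefix_table[prefix]
    remainder = path[len(prefix):].lstrip("/")
    return server, token, remainder

print(locate("/users/doc/thesis.tex"))  # ('srvC', 'tokC', 'thesis.tex')
print(locate("/users/alice/notes"))     # ('srvB', 'tokB', 'alice/notes')
```

Because entries are hints, a stale match is harmless: the contacted server rejects the unknown token, and the broadcast path repopulates the table.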
Sprite FS - Caching
Client-cache in main memory
file block size is 4KB
cache entries are addressed with file token and block#, which allows
blocks to be added without contacting the server
blocks can be accessed without accessing file’s disk map to get block’s disk
address
clients do not cache directories to avoid inconsistencies
servers have main memory caches as well
delayed writing policy is used
Sprite FS - Cache Writing Policy
Observations
BSD
20-30% of new data live less than 30 secs
75% of files are open for less than 0.5 secs
90% of files are open for less than 10 secs
recent study
65-80% of files are open for less than 30 secs
4-27% of new data are deleted within 30 secs
One can reduce traffic by
not updating servers at file close immediately
not updating servers when caches are updated
Sprite Cache Writing Policy
Delayed writing policy
every 5 secs flush client’s cached (modified) blocks to server if they
haven’t been modified within the last 30 secs
flush blocks from server’s cache to disk within 30-60 secs afterwards
replacement policy: LRU
80% of the time blocks are ejected to make room for other blocks
20% of the time to return memory to the VM
cache blocks are unreferenced for about 1hr before ejected
cache misses
40% on reads and 1% on writes
Sprite Cache Consistency
Server initiated
avoid concurrent-write sharing by disabling caching for files open
concurrently for reading and writing
ask client writing file to flush its blocks
inform all other clients that file is not cacheable
file becomes cacheable when all clients close the file again
solve sequential-write sharing using version numbers
each client keeps the version# of file whose blocks it caches
server increments version# each time file is opened for writing
client is informed of file version# at file open
server keeps track of the last writer; server asks the last writer to flush its
cached blocks if the file is opened by another client
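The version-number scheme can be sketched end to end (class names and structure are assumptions; the flush here just drops the toy cache rather than writing blocks back): the server bumps the version on each open-for-write, returns it to the opener, asks the last writer to flush, and clients discard caches whose version is out of date.

```python
class SpriteServer:
    def __init__(self):
        self.version = 1
        self.last_writer = None
    def open(self, client, mode):
        if mode == "w":
            if self.last_writer and self.last_writer is not client:
                self.last_writer.flush()   # sequential-write sharing fix
            self.version += 1              # new version per open-for-write
            self.last_writer = client
        return self.version                # client learns version at open

class SpriteClient:
    def __init__(self, name):
        self.name, self.cached_version, self.blocks = name, None, {}
    def flush(self):
        self.blocks = {}                   # sketch only: real flush writes back
    def open(self, server, mode="r"):
        v = server.open(self, mode)
        if self.cached_version != v:
            self.blocks = {}               # stale cache: discard blocks
        self.cached_version = v

srv = SpriteServer()
a, b = SpriteClient("A"), SpriteClient("B")
a.open(srv, "w"); a.blocks = {0: "x"}      # A writes and caches a block
b.open(srv, "w")                           # B opens for writing
print(a.blocks)                            # {} -- A was asked to flush
```

This handles sequential-write sharing; concurrent-write sharing is handled separately by disabling caching, as on the previous slide.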
Sprite VM and FS Cache Contention
VM and FS compete for physical memory
VM and FS negotiate for physical memory usage
separate pools of blocks using the time of last access to determine
winner; VM is given slight preference (it loses only if a block hasn’t
been referenced for 20 mins)
double caching is a problem
FS marks blocks of newly compiled code with infinite time of last reference
backing files = swapped-out pages (including process state and data
segments)
clients bypass FS cache when reading/writing backing files
CODA
Goals
scalability
availability
disconnected operation
Volume = collection of files and directories on a single server
unit of replication
FS objects have a unique FID which consists of
32-bit volume number
32-bit vnode number
32-bit uniquifier
replicas of a FS object have the same FID
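The three-field FID can be illustrated with simple bit packing (the function names are assumptions; Coda's internal representation is not shown here): three 32-bit fields make one 96-bit identifier, and since replicas share the FID, location information lives entirely outside the name.

```python
def make_fid(volume, vnode, uniquifier):
    """Pack the three 32-bit FID fields into one 96-bit integer."""
    assert all(0 <= x < 2**32 for x in (volume, vnode, uniquifier))
    return (volume << 64) | (vnode << 32) | uniquifier

def split_fid(fid):
    """Recover (volume number, vnode number, uniquifier)."""
    return fid >> 64, (fid >> 32) & 0xFFFFFFFF, fid & 0xFFFFFFFF

fid = make_fid(volume=7, vnode=1234, uniquifier=99)
print(split_fid(fid))   # (7, 1234, 99)
```

The uniquifier lets a vnode number be reused after deletion without old handles silently resolving to the new object, and the volume number is what the replicated Volume Location database maps to servers.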
CODA Location
Volume Location database
replicated at each server
Volume Replication database
replicated at each server
Volume Storage Group (VSG)
Venus
client cache manager
caching in local disk
AVSG=client accessible nodes in VSG
preferred server in AVSG
CODA Caching & Replication
Venus caches files/dirs on demand
from the server in AVSG with the most up-to-date data
on file access
users can indicate caching priorities for files/dirs
users can bracket action sequences
Venus establishes callbacks at the preferred server for each FS object
Server callbacks
server tells client that cached object is invalid
lost callbacks can happen
CODA AVSG Maintenance
Venus tracks changes in AVSG
detects nodes in the VSG that should (or should no longer) be in its AVSG by
periodically probing every node in the VSG
removes a node from the AVSG if an operation on it fails
chooses a new preferred server if needed
Coda Version Vector (CVV)
both for volumes and files/dirs
vector with one entry for each node in VSG indicating the number of
updates of the volume or FS object
Coda Replica Management
State of an object or replica
each modification is tagged with a storeid
update history = sequence of storeids
state is a truncated update history
latest storeid LSID
CVV
Coda Replica Management
comparing replicas A & B leads to one of four cases
LSID-A = LSID-B & CVV-A = CVV-B => strong equality
LSID-A = LSID-B & CVV-A != CVV-B => weak equality
LSID-A != LSID-B & CVV-A >= CVV-B => A dominates B
otherwise => inconsistent
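The four-way comparison can be sketched directly from these rules. Representation and names are assumed (a CVV as one update counter per VSG node); the "B dominates A" branch is the symmetric case of the dominance rule above.

```python
def compare(lsid_a, cvv_a, lsid_b, cvv_b):
    """Classify two replica states by LSID and componentwise CVV order."""
    geq = all(a >= b for a, b in zip(cvv_a, cvv_b))  # CVV-A >= CVV-B
    leq = all(a <= b for a, b in zip(cvv_a, cvv_b))  # CVV-A <= CVV-B
    if lsid_a == lsid_b:
        return "strong equality" if cvv_a == cvv_b else "weak equality"
    if geq:
        return "A dominates B"
    if leq:
        return "B dominates A"
    return "inconsistent"   # CVVs incomparable: concurrent updates happened

print(compare("s1", [2, 1, 1], "s1", [2, 1, 1]))  # strong equality
print(compare("s1", [2, 1, 1], "s1", [2, 1, 0]))  # weak equality
print(compare("s2", [3, 1, 1], "s1", [2, 1, 1]))  # A dominates B
print(compare("s2", [3, 0, 1], "s1", [2, 1, 1]))  # inconsistent
```

Weak equality (same last update, diverged vectors) is what the force operation repairs; the inconsistent case, where neither vector covers the other, is what ends up in a covolume for manual repair.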
when server S receives an update for a replica from client C
it checks the states of S’s and C’s replicas; the check is successful if
for files, it leads to strong equality or dominance
for dirs, it leads to strong equality
Coda Replica Management
When C wants to update a replicated object
phase I
sends the update to every node in its AVSG
each node checks the replica states (cached object vs. replicated object),
informs the client of the result, and performs the update if the check
succeeds
if unsuccessful, the client pauses while the server tries to resolve the
problem automatically; if that fails the client aborts, otherwise it resumes
phase II
client sends updated object state to every site in AVSG
Coda Replica Management
Force operation between servers
happens when Venus informs the AVSG of weak equality among replicas in the AVSG
the server with the dominant replica overwrites the data and state of the
dominated server
for directories, this is done by locking one directory at a time
repair operation
automatic; proceeds in two phases as in an update
migrate operation
moves inconsistent data to a covolume for manual repair
Conflict Resolution
Conflict resolution for files
is done by the user with the repair tool, which bypasses Coda update rules;
inconsistent files are inaccessible to CODA
Conflict resolution for directories
uses the fact that a dir is a list of files
non-automated conflicts
update/update (for attributes)
remove/update
create/create (adding identical files)
all other conflicts can be resolved easily
inconsistent objects and objects without automatic conflict resolution are
placed in covolumes