CS 519 Fall 2003
Distributed File Systems
Lecturer: Ricardo Bianchini
File Service
Implemented by a user/kernel process called the file server
A system may have one or several file servers running at the same time
Two models for file service:
  upload/download: files move between server and clients; few operations (read file and write file); simple; requires storage at the client; good when the whole file is accessed
  remote access: files stay at the server; rich interface with many operations; less space needed at the client; efficient for small accesses
Directory Service
Provides naming, usually within a hierarchical file system
Clients can have the same view (global root directory) or different views of the file system (remote mounting)
Location transparency: the location of the file does not appear in the name of the file
  ex: /server1/dir1/file specifies the server but not where the server is located -> the server can move the file in the network without changing the path
Location independence: a single name space that looks the same on all machines; files can be moved between servers without changing their names -> difficult to achieve
Two-Level Naming
Symbolic name (external), e.g. prog.c; binary name (internal), e.g. local i-node number as in Unix
Directories provide the translation from symbolic to binary names
Binary name formats:
  i-node: no cross references among servers
  (server, i-node): a directory on one server can refer to a file on a different server
  capability: specifies the address of the server, the file number, access permissions, etc.
  {binary_name+}: a set of binary names referring to the original file and all of its backups
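For concreteness, a tiny Python sketch (hypothetical names, not from the lecture) of a directory translating symbolic names into (server, i-node) binary names:

    # Hypothetical sketch: a directory maps symbolic (external) names to
    # (server, i-node) binary names, so a directory on one server can refer
    # to files stored on a different server.
    from typing import NamedTuple

    class BinaryName(NamedTuple):
        server: str   # which file server holds the file
        inode: int    # i-node number local to that server

    directory = {
        "prog.c":    BinaryName(server="server1", inode=4711),
        "notes.txt": BinaryName(server="server2", inode=17),
    }

    def lookup(symbolic_name: str) -> BinaryName:
        # translation from symbolic (external) to binary (internal) name
        return directory[symbolic_name]

    print(lookup("prog.c"))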
File Sharing Semantics
UNIX semantics: total ordering of read/write events
  easy to achieve in a non-distributed system
  in a distributed system with one server and multiple clients, with no caching at the client, total ordering is also easily achieved since reads and writes are immediately performed at the server
Session semantics: writes are guaranteed to become visible only when the file is closed
  allows caching at the client with lazy updating -> better performance
  if two or more clients write simultaneously: one file (the last one closed, or one chosen non-deterministically) replaces the other
File Sharing Semantics (cont’d)
Immutable files: only create and read file operations (no write)
  "writing" a file means creating a new one and entering it into the directory in place of the previous one with the same name: an atomic operation
  collision in writing: the last copy wins, or one is chosen non-deterministically
  what happens if the old copy is being read?
Transaction semantics: mutual exclusion on file accesses; either all file operations of a transaction complete or none does. Good for banking systems
File System Properties
Observed in a study by Satyanarayanan (1981):
  most files are small (< 10 KB)
  reading is much more frequent than writing
  most read/write accesses are sequential (random access is rare)
  most files have a short lifetime -> create the file at the client
  file sharing is unusual -> cache at the client
  the average process uses only a few files
Server System Structure
File + directory service: combined or not
Cache directory hints at the client to accelerate path name lookup - directory and hints must be kept coherent
State information about clients at the server:
  stateless server: no client information is kept between requests
  stateful server: the server maintains state information about clients between requests
Stateless vs. Stateful
Stateless server:
  requests are self-contained
  better fault tolerance
  open/close at the client (fewer messages)
  no space reserved for tables, thus no limit on open files

Stateful server:
  shorter messages
  better performance (info kept in memory until close)
  open/close at the server
  file locking possible
  read-ahead possible
Caching
Three possible places: server’s memory, client’s disk, client’s memory
Caching in server’s memory: avoids disk access but still network access
Caching at client’s disk (if available): tradeoff between disk access and remote memory access
Caching at the client in main memory:
  inside each process address space: no sharing at the client
  in the kernel: kernel involvement on hits
  in a separate user-level cache manager: flexible and efficient if paging can be controlled from user level
Server-side caching eliminates the coherence problem. Client-side cache coherence? Next…
Client Cache Coherence in DFS
How to maintain coherence (according to a model, e.g. UNIX semantics or session semantics) of copies of the same file at various clients
Write-through: writes are sent to the server as soon as they are performed at the client -> high traffic; requires cache managers to check (modification time) with the server before they can provide cached content to any client
Delayed write: coalesces multiple writes; better performance but ambiguous semantics
Write-on-close: implements session semantics
Central control: the file server keeps a directory of open/cached files at clients and sends invalidations -> UNIX semantics, but problems with robustness and scalability; invalidation messages are also a problem because clients did not solicit them
File Replication
Multiple copies are maintained, each copy on a separate file server, for multiple reasons:
  increase reliability: the file is accessible even if a server is down
  improve scalability: reduce contention by splitting the workload over multiple servers
Replication transparency:
  explicit file replication: the programmer controls replication
  lazy file replication: copies are made by the server in the background
  group communication: all copies are made at the same time, in the foreground
How should replicas be modified? Next…
Modifying Replicas: Voting Protocol
Updating all replicas using a coordinator works but is not robust (if coordinator is down, no updates can be performed) => Voting: updates (and reads) can be performed if some specified # of servers agree.
Voting protocol:
  A version # (incremented on each write) is associated with each file
  To perform a read, a client has to assemble a read quorum of Nr servers; similarly, a write quorum of Nw servers for a write
  If Nr + Nw > N, then any read quorum will contain at least one copy with the most recently updated file version
  For reading, the client contacts Nr active servers and chooses the file with the largest version #
  For writing, the client contacts Nw active servers asking them to write. The write succeeds if they all say yes.
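For concreteness, a minimal Python sketch of the quorum rule (invented helper names; a real protocol would also derive the new version number from a read quorum and handle servers that refuse or fail):

    # Sketch of quorum-based replication over N servers, each holding
    # (version, data). Nr + Nw > N guarantees every read quorum overlaps
    # every write quorum in at least one replica with the newest version.
    def quorums_ok(n, nr, nw):
        return nr + nw > n

    def quorum_read(replicas, nr):
        quorum = replicas[:nr]                        # assume these Nr replied
        return max(quorum, key=lambda r: r["version"])

    def quorum_write(replicas, nw, data):
        quorum = replicas[:nw]                        # assume these Nw replied
        new_version = max(r["version"] for r in quorum) + 1
        for r in quorum:                              # all must accept
            r["version"], r["data"] = new_version, data
        return True

    replicas = [{"version": 0, "data": None} for _ in range(5)]
    assert quorums_ok(5, nr=2, nw=4)
    quorum_write(replicas, nw=4, data="v1")
    print(quorum_read(replicas, nr=2))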
Modifying Replicas: Voting Protocol (cont'd)
Nr is usually small (reads are frequent), but Nw is usually close to N (we want to make sure all replicas are updated). This makes it hard to achieve a write quorum in the presence of server failures
Voting with ghosts: allows a write quorum to be established when several servers are down, by temporarily creating dummy (ghost) servers (at least one server in the quorum must be real)
Ghost servers are not permitted in a read quorum (they don't have any files)
When a server comes back up, it must first restore its copy by obtaining a read quorum
Network File System (NFSv3)
A stateless DFS from Sun; the only server state is the map of handles to files
An NFS server exports directories
Clients access exported directories by mounting them
Because NFS is stateless, OPEN and CLOSE RPCs are not provided by the server (they are implemented at the client); clients need to block on close until all dirty data are stored on disk at the server
NFS provides file locking (through a separate network lock manager protocol), but UNIX semantics is not achieved due to client caching:
  dirty cache blocks are sent to the server in chunks, every 30 sec or at close
  a timer is associated with each cache block at the client (3 sec for data blocks, 30 sec for directory blocks); when the timer expires, the entry is discarded (if clean, of course)
  when a file is opened, the last modification time at the server is checked
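A rough sketch (hypothetical structure, not the actual NFS client code) of the client-side caching rules above: short timers on cached blocks, and revalidation against the server's modification time on open:

    import time

    DATA_TTL, DIR_TTL = 3.0, 30.0            # seconds, as on the slide

    cache = {}                                # (file, block#) -> (data, cached_at, ttl)
    cached_mtime = {}                         # file -> server mtime seen at caching time

    def get_block(file, block_no):
        entry = cache.get((file, block_no))
        if entry is None:
            return None
        data, cached_at, ttl = entry
        if time.time() - cached_at > ttl:     # timer expired: discard the (clean) entry
            del cache[(file, block_no)]
            return None
        return data

    def open_file(file, server_mtime):
        # on open, compare the server's last-modification time with what we
        # cached; if the server copy is newer, drop this file's cached blocks
        if cached_mtime.get(file) != server_mtime:
            for key in [k for k in cache if k[0] == file]:
                del cache[key]
            cached_mtime[file] = server_mtime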
Recent Research in DFS
Petal & Frangipani (DEC SRC): a two-layer DFS
xFS (Berkeley): a serverless network file system
Petal: Distributed Virtual Disks
A distributed storage system that provides a virtual disk abstraction separate from the physical resource
The virtual disk is globally accessible to all Petal clients on the network
Virtual disks are implemented on a cluster of servers that cooperate to manage a pool of physical disks
Advantages:
  recovers from any single failure
  transparent reconfiguration and expandability
  load and capacity balancing
  a low-level service (lower than a DFS) that handles distribution problems
Petal (figure)
Virtual to Physical Translation
<virtual disk, virtual offset> -> <server, physical disk, physical offset>
Three data structures: virtual disk directory, global map, and physical map
The virtual disk directory and global map are globally replicated and kept consistent
The physical map is local to each server
One level of indirection (virtual disk to global map) is necessary to allow transparent reconfiguration. We'll discuss reconfiguration soon
Virtual to Physical Translation (cont’d)
1. The virtual disk directory translates the virtual disk identifier into a global map identifier
2. The global map determines the server responsible for translating the given offset (a virtual disk may be spread over multiple physical disks). The global map also specifies the redundancy scheme for the virtual disk
3. The physical map at a specific server translates the global map identifier and the offset to a physical disk and an offset within that disk. The physical map is similar to a page table
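A toy Python sketch of the three translation steps (identifiers and the striping rule are invented; the real Petal maps also carry redundancy and epoch information):

    # Step 1: virtual disk directory -> global map id (globally replicated)
    vdisk_directory = {"vdisk0": "gmap0"}

    # Step 2: global map -> server responsible for a given virtual offset
    # (here: trivial striping of 64-block chunks over two servers)
    global_maps = {
        "gmap0": lambda off: "serverA" if (off // 64) % 2 == 0 else "serverB",
    }

    # Step 3: per-server physical map -> (physical disk, physical offset)
    physical_maps = {
        ("serverA", "gmap0"): lambda off: ("disk0", off // 2),
        ("serverB", "gmap0"): lambda off: ("disk1", off // 2),
    }

    def translate(vdisk, offset):
        gmap = vdisk_directory[vdisk]                        # step 1
        server = global_maps[gmap](offset)                   # step 2
        disk, phys = physical_maps[(server, gmap)](offset)   # step 3
        return server, disk, phys

    print(translate("vdisk0", 130))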
Support for Backup
Petal simplifies a client’s backup procedure by providing a snapshot mechanism
Petal generates snapshots of virtual disks using copy-on-write. Creating a snapshot requires pausing the client’s application to guarantee consistency
A snapshot is a virtual disk that cannot be modified
Snapshots require a modification to the translation scheme: the virtual disk directory now translates a virtual disk id into a pair <global map id, epoch #>, where the epoch # is incremented at each snapshot
At each snapshot, a new tuple with a new epoch # is created in the virtual disk directory; the snapshot keeps the old epoch #
All accesses to the virtual disk are made using the new epoch #, so any write to the original disk creates new entries under the new epoch rather than overwriting blocks that belong to the snapshot
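A minimal sketch, with invented names, of how the epoch # gives copy-on-write snapshots: the snapshot keeps the old epoch, new writes go under the new epoch, and reads fall back to older epochs for blocks that have not been rewritten:

    directory = {"vdisk0": ("gmap0", 0)}      # vdisk -> (global map id, epoch #)
    blocks = {}                                # (epoch, block#) -> data

    def write(vdisk, block_no, data):
        _, epoch = directory[vdisk]
        blocks[(epoch, block_no)] = data       # never overwrites older epochs

    def read(vdisk, block_no, epoch=None):
        if epoch is None:
            _, epoch = directory[vdisk]
        for e in range(epoch, -1, -1):         # fall back to older epochs
            if (e, block_no) in blocks:
                return blocks[(e, block_no)]
        return None

    def snapshot(vdisk):
        gmap, epoch = directory[vdisk]
        directory[vdisk] = (gmap, epoch + 1)   # future writes use the new epoch
        return epoch                           # the snapshot is read at the old epoch

    write("vdisk0", 7, "old")
    snap = snapshot("vdisk0")
    write("vdisk0", 7, "new")
    print(read("vdisk0", 7), read("vdisk0", 7, epoch=snap))   # -> new old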
Virtual Disk Reconfiguration
Needed when a new server is added or the redundancy scheme is changed
Steps to perform it all at once (not incrementally) and in the absence of any other activity:
  create a new global map with the desired redundancy scheme and server mapping
  change all virtual disk directories to point to the new global map
  redistribute the data to the servers according to the translation specified in the new global map
The challenge is to perform the reconfiguration incrementally and concurrently with normal client requests
Incremental Reconfiguration
The first two steps are as before; step 3 is done in the background, starting with the translations in the most recent epoch that have not yet been moved
The old global map is used for read translations that are not found in the new global map
A write request only accesses the new global map, to avoid consistency problems
Limitation: the mapping of the entire virtual disk must be changed before any data is moved -> lots of new-global-map misses on reads -> high traffic. Solution: relocate only a portion of the virtual disk at a time. Read requests for the portion being relocated cause misses, but requests to other areas do not
Redundancy with Chained Data Placement
Petal uses chained-declustering data placement
two copies of each data block are stored on neighboring servers
every pair of neighboring servers has data blocks in common
if server 1 fails, servers 0 and 2 will share server 1's read load (not server 3)

  server 0   server 1   server 2   server 3
  d0         d1         d2         d3
  d3         d0         d1         d2
  d4         d5         d6         d7
  d7         d4         d5         d6
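A small sketch (my own helper names) of where a block's two copies land under chained declustering with N servers, matching the table above:

    # Primary copy of block i on server (i mod N); secondary copy on the
    # next server in the chain, so neighboring servers share blocks.
    def placement(block_no, num_servers):
        primary = block_no % num_servers
        secondary = (primary + 1) % num_servers
        return primary, secondary

    N = 4
    for b in range(8):
        p, s = placement(b, N)
        print(f"d{b}: primary on server {p}, secondary on server {s}")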
Chained Data Placement (cont’d)
In case of failure, each server can offload some of its original read load to the next/previous server. Offloading can be cascaded across servers to uniformly balance load
Advantage: with simple mirrored redundancy, the failure of a server would result in a 100% load increase to another server
Disadvantage: less reliable than simple mirroring - if a server fails, the failure of either one of its two neighbor servers will result in data becoming unavailable
In Petal, one copy is called the primary, the other the secondary
Read requests can be serviced by either of the two servers, while write requests must always try the primary first to prevent deadlock (blocks are locked before reading or writing, and writes require access to both servers)
Read Request
The Petal client tries the primary or the secondary server, depending on which one has the shorter queue length. (Each client maintains a small amount of high-level mapping information that is used to route requests to the “most appropriate” servers. If a request is sent to an inappropriate server, the server returns an error code, causing the client to update its hints and retry the request)
The server that receives the request attempts to read the requested data
If not successful, the client tries the other server
Write Request
The Petal client tries the primary server first
The primary server marks the data busy and sends the request to its local copy and to the secondary copy
When both complete, the busy bit is cleared and the operation is acknowledged to the client
If not successful, the client tries the secondary server
If the secondary server detects that the primary server is down, it marks the data element as stale on stable storage before writing to its local disk
When the primary server comes back up, it has to bring all data marked stale up to date during recovery
The procedure is similar if the secondary server is down
Petal Prototype (figure)
Petal Performance - Latency (graph)
A single client generates requests to random disk offsets
Petal Performance - Throughput (graph)
Each of 4 clients makes random requests to a single virtual disk. Failed configuration = one of the 4 servers has crashed
Petal Performance - Scalability (graph)
Frangipani
Petal provides a disk interface -> we still need a file system
Frangipani is a file system designed to take full advantage of Petal
Frangipani's main characteristics:
  All users are given a consistent view of the same set of files
  Servers can be added without changing the configuration of existing servers or interrupting their operation
  Tolerates and recovers from machine, network, and disk failures
  Very simple internally: a set of cooperating machines that use a common store and synchronize access to that store with locks
Frangipani (cont'd)
Petal takes much of the complexity out of Frangipani
Petal provides highly available storage that can scale in throughput and capacity
However, Frangipani improves on Petal, since:
  Petal has no provision for sharing the storage among multiple clients
  Applications use a file-based interface rather than the disk-like interface provided by Petal
Problems with Frangipani on top of Petal:
  Some logging occurs twice (once in Frangipani and once in Petal)
  Cannot use disk location in placing data, because Petal virtualizes the disks
  Frangipani locks entire files and directories, as opposed to individual blocks
Frangipani Structure (figure)
Frangipani: Disk Layout
A Frangipani file system uses only one Petal virtual disk
Petal provides 2^64 bytes of "virtual" disk space
  real disk space is committed only when actually used (written)
Frangipani breaks the disk into regions:
  1st region (1 TB) stores config parameters and housekeeping info
  2nd region (1 TB) stores logs - each Frangipani server uses a portion of this region for its log; there can be up to 256 logs
  3rd region (3 TB) holds allocation bitmaps, describing which blocks in the remaining regions are free; each server locks a different portion
  4th region (1 TB) holds inodes
  5th region (128 TB) holds small data blocks (4 KB each)
  The remainder of the Petal disk holds large data blocks (1 TB each)
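A rough sketch of that layout (the region sizes come from the slide, but the exact offsets printed here are illustrative, not the paper's on-disk format):

    TB = 2 ** 40

    regions = [
        ("config/housekeeping",   1 * TB),
        ("logs (up to 256)",      1 * TB),
        ("allocation bitmaps",    3 * TB),
        ("inodes",                1 * TB),
        ("small blocks (4 KB)", 128 * TB),
    ]

    offset = 0
    for name, size in regions:
        print(f"{name:22s} start={offset:20d} size={size:20d}")
        offset += size
    print(f"large blocks (1 TB)    start={offset} ... up to 2^64")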
Frangipani: File Structure
The first 16 blocks (64 KB) of a file are stored in small blocks
If the file grows larger, the rest is stored in one large (1 TB) block
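A toy sketch (my own helper, assuming the 4 KB small blocks and single 1 TB large block described above) of mapping a file offset to a block:

    KB = 2 ** 10
    SMALL = 4 * KB                 # small block size

    def file_offset_to_block(offset):
        # first 16 small blocks cover offsets below 64 KB;
        # everything beyond that falls into the file's one large block
        if offset < 16 * SMALL:
            return ("small", offset // SMALL, offset % SMALL)
        return ("large", 0, offset - 16 * SMALL)

    print(file_offset_to_block(10 * KB))      # ('small', 2, 2048)
    print(file_offset_to_block(2 ** 20))      # ('large', 0, 983040)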
Frangipani: Dealing with Failures
Write-ahead redo logging of metadata; user data is not logged
Each Frangipani server has its own private log
Only after a log record is written to Petal does the server modify the actual metadata in its permanent locations
If a server crashes, the system detects the failure and another server uses the log to recover. Because the log is on Petal, any server can get to it.
Frangipani: Synchronization & Coherence
Frangipani has a lock for each log segment, allocation bitmap segment, and each file
Multiple-reader/single-writer locks. In case of conflicting requests, the owner of the lock is asked to release or downgrade it to remove the conflict
A read lock allows a server to read data from disk and cache it. If the server is asked to release its read lock, it must invalidate the cache entry before complying
A write lock allows a server to read or write data and cache it. If a server is asked to release its write lock, it must write dirty data to disk and invalidate the cache entry before complying. If a server is asked to downgrade the lock, it must write dirty data to disk before complying
Frangipani: Lock Service
Fully distributed lock service for fault tolerance and scalability
How are locks owned by a failed Frangipani server released?
  The failure of a server is discovered when its "lease" expires. A lease is obtained by the server when it first contacts the lock service; all locks it acquires are associated with the lease. Each lease has an expiration time (30 seconds) after its creation or last renewal, and a server must renew its lease before it expires
  When a server fails, the locks that it owns cannot be released until its log is processed and any pending updates are written to Petal
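A minimal sketch, with invented names, of the lease bookkeeping described above (the real lock service is distributed and considerably more involved):

    import time

    LEASE_SECONDS = 30.0        # expiration time from the slide

    leases = {}                 # server -> time of creation / last renewal
    locks = {}                  # lock name -> owning server

    def renew(server):
        leases[server] = time.time()

    def expired(server):
        return time.time() - leases.get(server, 0) > LEASE_SECONDS

    def reclaim_locks(server, log_recovered):
        # a failed server's locks are released only after its lease expires
        # AND its log has been replayed against Petal
        if expired(server) and log_recovered:
            for name, owner in list(locks.items()):
                if owner == server:
                    del locks[name]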
Frangipani: Performance (graphs)
Frangipani: Scalability (graphs)
xFS (Context & Motivation)
A server-less network file system that works over a cluster of cooperative workstations
Moving away from a central FS is motivated by three factors:
  hardware opportunity: fast switched LANs provide aggregate bandwidth that scales with the number of machines in the network
  user demand is increasing: e.g., multimedia
  limitations of the central FS approach:
    limited scalability
    expensive replication for availability increases complexity and operation latency
xFS (Contribution & Limitations)
A well-engineered approach which takes advantage of several research ideas: RAID, LFS, cooperative caching
A truly distributed network file system (no central bottleneck):
  control processing is distributed across the system at per-file granularity
  storage is distributed using a software RAID and log-based network striping (Zebra)
  cooperative caching uses portions of client memory as a large, global file cache
Limitation: requires machines to trust each other
RAID in xFS
RAID partitions a stripe of data into N-1 data blocks and a parity block (the exclusive-OR of the bits of data blocks)
Data and parity blocks are stored on different storage servers
Provides both high bandwidth and fault tolerance
Traditional RAID drawbacks:
  multiple accesses for small writes
  hardware RAID is expensive (special hardware to compute parity)
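A short sketch making the parity computation concrete (plain Python, bytewise XOR):

    # A stripe = N-1 data blocks + 1 parity block (XOR of the data blocks).
    # Any single lost block can be rebuilt by XOR-ing the survivors with
    # the parity block.
    def parity(blocks):
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                result[i] ^= byte
        return bytes(result)

    data = [b"\x01\x02", b"\x04\x08", b"\xf0\x0f"]
    p = parity(data)
    rebuilt = parity([data[0], data[2], p])    # reconstruct the lost block 1
    assert rebuilt == data[1]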
LFS in xFS
High-performance writes: buffer writes in memory to write them to disk in large, contiguous, fixed-size groups called log segments
Writes are always appended to the log
imap to locate i-nodes: stored in memory and periodically checkpointed to disk
Simple recovery procedure: read the last checkpoint, then roll forward through the later segments in the log, updating the imap and i-nodes
Free disk space management through a log cleaner: coalesces old, partially empty segments into a smaller number of full segments -> cleaning overhead can sometimes be large
Zebra
Combines LFS and RAID: LFS’s large writes make writes to the network RAID efficient
Implements RAID in software
Writes are coalesced into a private per-client log
Log-based striping:
  a log segment is split into log fragments, which are striped over the storage servers
  parity fragment computation is local (no network access)
Deltas stored in the log encapsulate modifications to file system states that must be performed atomically - used for recovery
Metadata and Data Distribution
A centralized FS:
  stores all data blocks on its local disks
  manages the location of metadata
  maintains a central cache of data blocks in its memory
  manages cache-consistency metadata that lists which clients in the system are caching each block (not done in NFS)
xFS: Metadata and Data Distribution
Stores data on storage servers
Splits metadata management among multiple managers that can dynamically alter the mapping from a file to its manager
Uses cooperative caching that forwards data among client caches under the control of the managers
The key design challenge: how to locate data and metadata in such a completely distributed system
xFS: Data Structures (figure)
Manager Map
Allows clients to determine which manager to contact for a file
The manager map is globally replicated (it is small)
Two translations are necessary to allow manager remapping:
  external file name -> file index number (directory)
  index number -> manager (manager map)
The manager map can also be used for coarse-grained workload balancing among managers
The file manager controls disk-location metadata (imap & i-node) and cache-consistency state (the list of clients caching each block, or who has ownership for writing)
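A tiny sketch (invented structures; the index-to-manager assignment here is just a modulo rule, not xFS's actual policy) of the two translations that route a request to a file's manager:

    directory = {"/home/a/paper.tex": 1041, "/home/b/data.bin": 2077}
    manager_map = ["mgr0", "mgr1", "mgr2", "mgr3"]     # small, globally replicated

    def manager_for(path):
        index = directory[path]                        # name -> file index number
        return manager_map[index % len(manager_map)]   # index number -> manager

    print(manager_for("/home/a/paper.tex"))            # -> mgr1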
Read Operation (figure)
Write Operation
Clients buffer writes in their local memory until committed to a stripe group of storage servers
Since xFS uses LFS, a write changes the disk address of the modified block
After a client commits a segment to a storage server, it notifies the modified blocks' managers so they can update their index nodes and imaps
Index nodes and data blocks do not have to be committed simultaneously, because in Zebra the client's log includes a delta that allows the manager's data structures to be reconstructed in the event of a crash
Cache Consistency
Per-block rather than per-file
Ownership-based, similar to a DSM scheme
To modify a block, a client must get ownership from the manager
The manager invalidates any other cached copies of the block, then gives write permission (ownership) to the client
Ownership can be revoked by the manager
The manager keeps the list of clients caching each block
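A compact sketch, with invented names, of this invalidate-then-grant protocol at the manager:

    # Per-block, ownership-based consistency: the manager tracks which
    # clients cache each block and who owns it for writing; granting
    # ownership invalidates every other cached copy first.
    class BlockManager:
        def __init__(self):
            self.cachers = {}      # block -> set of clients caching it
            self.owner = {}        # block -> client holding write ownership

        def record_read(self, block, client):
            self.cachers.setdefault(block, set()).add(client)

        def request_ownership(self, block, client):
            for other in self.cachers.get(block, set()) - {client}:
                self.invalidate(block, other)
            self.cachers[block] = {client}
            self.owner[block] = client

        def invalidate(self, block, client):
            print(f"invalidate block {block} at {client}")

    m = BlockManager()
    m.record_read("b1", "clientA"); m.record_read("b1", "clientB")
    m.request_ownership("b1", "clientA")   # invalidates clientB's copy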
Log cleaner in xFS
Distributed
Relies on utilization status, which is also distributed: maintained by the client that wrote each segment
A leader in each stripe group initiates cleaning and decides which cleaners should clean the stripe group's segments
Each cleaner receives a subset of segments to clean
Cleaners use optimistic concurrency to resolve conflicts between cleaner updates and normal writes
In case of a conflict (because a client is writing a block as it is cleaned), the manager ensures that the client's update takes precedence over the cleaner's update
xFS: Performance (graphs)