COMP 734 – Fall 2011
COMP 734 -- Distributed File Systems
With Case Studies: NFS, Andrew and Google

Distributed File Systems Were Phase 0 of "Cloud" Computing

Phase 0: Centralized Data, Distributed Processing
[Diagram: concerns balanced by a distributed file system:]
- Mobility (user & data)
- Sharing
- Administration costs
- Content management
- Backup
- Security
- Performance???

Distributed File System Clients and Servers

[Diagram: several clients send a request (e.g., read) across the network to a server, which returns a response (e.g., a file block).]

Most distributed file systems use Remote Procedure Calls (RPC).

RPC Structure (Birrell & Nelson)¹

[Figure (Fig. 1, slightly modified): on the caller machine, the client makes a local call; the client stub packs the arguments, and the RPC runtime transmits the call packet over the network and waits. On the callee machine, the RPC runtime receives the packet, the server stub unpacks the arguments, and the server does the work; the server stub packs the result, the runtime transmits the result packet back, and the caller's runtime receives it, the client stub unpacks the result, and the local call returns. The server exports the interface; the client imports it.]

¹ Birrell, A. D. and B. J. Nelson, "Implementing Remote Procedure Calls," ACM Transactions on Computer Systems, Vol. 2, No. 1, February 1984, pp. 39-59.
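To make the stub/runtime division concrete, here is a minimal sketch of the round trip in the figure. The length-prefixed JSON wire format and all names are illustrative assumptions, not Birrell & Nelson's implementation.

    # Minimal RPC sketch: client stub packs args and transmits a call packet;
    # the "callee machine" (a thread here) unpacks, does the work, and sends
    # back a result packet that the stub unpacks for the local return.
    import json, socket, struct, threading

    def recv_exact(sock, n):
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed")
            buf += chunk
        return buf

    def send_packet(sock, obj):
        data = json.dumps(obj).encode()                    # pack args/result
        sock.sendall(struct.pack("!I", len(data)) + data)  # transmit packet

    def recv_packet(sock):
        (n,) = struct.unpack("!I", recv_exact(sock, 4))
        return json.loads(recv_exact(sock, n))             # unpack args/result

    def client_stub(sock, proc, *args):
        send_packet(sock, {"proc": proc, "args": list(args)})  # pack, transmit call
        return recv_packet(sock)["result"]                     # wait; unpack result

    def server_loop(sock, procs):
        while True:
            msg = recv_packet(sock)                        # receive, unpack args
            result = procs[msg["proc"]](*msg["args"])      # server does the work
            send_packet(sock, {"result": result})          # pack, transmit result

    caller, callee = socket.socketpair()                   # stand-in for the network
    threading.Thread(target=server_loop,
                     args=(callee, {"add": lambda a, b: a + b}),
                     daemon=True).start()
    print(client_stub(caller, "add", 2, 3))                # local call -> 5
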
Unix Local File Access Semantics – Multiple Processes Read/Write a File Concurrently

[Diagram: timelines of writes by A and B and a read by C, shown in both orders (A happens before B, and B happens before A).]

- Writes are "atomic"
- Reads always get the atomic result of the most recently completed write

What Happens If Clients Cache File/Directory Content?

[Diagram: multiple clients, each with a client cache, issue read() and write() calls over the network to the server.]

Do the cache consistency semantics match local concurrent access semantics?

File Usage Observation #1: Most Files Are Small

- Unix, 1991: 80% of files < 10 KB
- Windows, 2008: 80% of files < 30 KB

Sources: Mary Baker, et al., "Measurements of a Distributed File System," Proceedings 13th ACM SOSP, 1991, pp. 198-212; Andrew W. Leung, et al., "Measurement and Analysis of Large-Scale Network File System Workloads," Proceedings USENIX Annual Technical Conference, 2008, pp. 213-226.

File Usage Observation #2: Most Bytes Are Transferred from Large Files (and Large Files Are Larger)

- Unix, 1991: 80% of bytes from files > 10 KB
- Windows, 2008: 70% of bytes from files > 30 KB

Sources: Baker et al., 1991; Leung et al., 2008 (cited above).

File Usage Observation #3: Most Bytes Are Transferred in Long Sequential Runs – Most Often the Whole File

- Unix, 1991: 85% of sequential bytes from runs > 100 KB
- Windows, 2008: 60% of sequential bytes from runs > 100 KB

Sources: Baker et al., 1991; Leung et al., 2008 (cited above).

Chronology of Early Distributed File Systems – Designs for Scalability

[Chart: clients supported by one installation (y-axis) versus time (x-axis); annotations mark the product systems and open source (AFS).]

NFS-2 Design Goals

- Transparent file naming
- Scalable -- O(100s) of clients
- Performance approximating local files
- Fault tolerant (client, server, network faults)
- Unix file-sharing semantics (almost)
- No change to Unix C library/system calls

NFS File Naming – Exporting Names

[Diagram: local directory trees on host#1, host#2, and host#3, containing user directories (bob, fred, sam, joe), documents (doc/readme.doc, doc/report.doc), project sources (proj/source/main.c, data.h), and binaries (bin, tools). Each server exports part of its local tree:
  exportfs /bin
  exportfs /usr
  exportfs /usr/proj]

NFS File Naming – Import (mount) Names

[Diagram: the client builds its name space by mounting the exported subtrees at local mount points, after which the remote directories (tools, proj sources main.c and data.h, users sam and joe with doc/readme.doc and report.doc) appear in the client's own tree:]
  mount host#3:/bin /tools
  mount host#1:/usr/proj /local
  mount host#3:/usr /usr

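Conceptually, the client resolves a local path against this mount table before deciding which server to call. A hedged sketch of that lookup, using the three mounts above (the function and table names are mine, not actual NFS client code, which resolves mounts inside the kernel's VFS layer component by component):

    # Longest-prefix match of a local path against the client's mount table.
    MOUNTS = {
        "/tools": ("host#3", "/bin"),
        "/local": ("host#1", "/usr/proj"),
        "/usr":   ("host#3", "/usr"),
    }

    def resolve(path):
        best = None
        for prefix, target in MOUNTS.items():
            if path == prefix or path.startswith(prefix + "/"):
                if best is None or len(prefix) > len(best[0]):
                    best = (prefix, target)
        if best is None:
            return ("local", path)            # not under any mount point
        prefix, (server, remote) = best
        return (server, remote + path[len(prefix):])

    print(resolve("/tools/ls"))               # ('host#3', '/bin/ls')
    print(resolve("/local/source/main.c"))    # ('host#1', '/usr/proj/source/main.c')
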
NFS-2 Remote Procedure Calls

The NFS server is stateless: each call is self-contained and independent.
- Avoids complex state-recovery issues in the event of failures
- Each call has an explicit file handle parameter (see the sketch below)

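Statelessness means the client, not the server, tracks the file position: every call carries a handle, an absolute offset, and a count, so a server that crashes and restarts can serve the next call with no recovery. A minimal sketch (server.call is a stand-in for the RPC layer, not the NFS-2 wire protocol):

    def read_file(server, fh, size, chunk=8192):
        # Each read RPC is self-contained: (handle, offset, count).
        data, offset = b"", 0
        while offset < size:
            block = server.call("read", fh=fh, offset=offset,
                                count=min(chunk, size - offset))
            data += block
            offset += len(block)   # the client advances its own position
        return data
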
NFS-2 Read/Write with Client Cache

[Diagram: an application calls read(fh, buffer, length) into the OS kernel; the kernel file system hands the request to the NFS client, which satisfies it from buffer-cache blocks when it can and otherwise issues RPCs across the network to the NFS server's file system.]

NFS-2 Cache Operation and Consistency Semantics

- Application writes modify blocks in the buffer cache and on the server with a write() RPC ("write-through").
- Consistency validations of cached data compare the last-modified timestamp in the cache with the value on the server. If the server timestamp is later, the cached data is discarded. Note the need for (approximately) synchronized clocks on client and server.
- Cached file data is validated (a getattr() RPC) each time the file is opened.
- Cached data is also validated (a getattr() RPC) if an application accesses it and a time threshold has passed since it was last validated; validations also occur each time a last-modified attribute is returned on an RPC (read, lookup, etc.). The threshold is 3 seconds for files and 30 seconds for directories.
- If the time threshold has NOT passed, the cached file/directory data is assumed valid and the application is given access to it (see the sketch after this list).

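A minimal sketch of that validation rule, assuming a getattr() RPC that returns the server's last-modified time (the class and helper names are mine):

    import time

    FILE_TTL, DIR_TTL = 3.0, 30.0          # thresholds from the slide (seconds)

    class CacheEntry:
        def __init__(self, data, mtime, is_dir=False):
            self.data, self.mtime, self.is_dir = data, mtime, is_dir
            self.last_validated = time.time()

    def cached_access(entry, getattr_rpc):
        ttl = DIR_TTL if entry.is_dir else FILE_TTL
        if time.time() - entry.last_validated < ttl:
            return entry.data              # within threshold: assumed valid, no RPC
        if getattr_rpc() > entry.mtime:    # server copy is newer
            return None                    # discard cached data; caller refetches
        entry.last_validated = time.time() # still current: restart the clock
        return entry.data
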
Andrew Design Goals

- Transparent file naming in a single name space
- Scalable -- O(1000s) of clients
- Performance approximating local files
- Easy administration and operation
- "Flexible" protections (directory scope)
- Clear (non-Unix) consistency semantics
- Security (authentication)

Users Share a Single Name Space

[Diagram: four workstations, each with local /etc, /bin, and /cache, all sharing the same /afs tree with pkg and home directories (e.g., home/smithfd, home/reiter) and subtrees such as doc, proj, tools, and win32.]

Server "Cells" Form a "Global" File System

[Diagram: cells appear under /afs, e.g., /afs/cs.unc.edu, /afs/cern.ch, and /afs/cs.cmu.edu, each with its own home directories (smithfd and reiter; alice and bob; ted and carol) and doc trees.]

Andrew File Access (Open)

[Diagram: open(/afs/usr/fds/foo.c) flows from the application through the OS kernel to the Andrew client, which keeps an on-disk cache and talks to the server over the network.]

(1) open request passed to Andrew client
(2) client checks cache for valid file copy
(3) if not in cache, fetch whole file from server and write to cache; else (4)
(4) when file is in cache, return handle to local file

Andrew File Access (Read/Write)

[Diagram: read(fh, buffer, length) and write(fh, buffer, length) are handled entirely by the local file system; no network traffic.]

(5) read and write operations take place on the local cache copy

Andrew File Access (Close-Write)

[Diagram: close(fh) flows from the application through the OS kernel to the Andrew client, which writes back to the server over the network.]

(6) close request passed to Andrew client
(7) client writes whole file back to server from cache
(8) server copy of file is entirely replaced (a client-side sketch follows)

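Putting steps (1)-(8) together, a hedged sketch of the client side of whole-file caching; the server's fetch_file/store_file API is assumed for illustration, not the Andrew RPC interface:

    import os

    def afs_open(server, cache_dir, path, cache):
        local = cache.get(path)
        if local is None:                          # (2)-(3): miss: fetch whole file
            data = server.fetch_file(path)
            local = os.path.join(cache_dir, path.strip("/").replace("/", "_"))
            with open(local, "wb") as f:
                f.write(data)
            cache[path] = local
        return open(local, "r+b")                  # (4): handle to the local copy

    # (5): reads and writes on the returned handle touch only the on-disk
    # cache copy; there is no network traffic until close.

    def afs_close(server, path, fh):
        fh.seek(0)
        server.store_file(path, fh.read())         # (6)-(8): whole file written back
        fh.close()
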
Andrew Cache Operation and Consistency Semantics

- Directory lookup uses a valid cache copy; directory updates (e.g., create or remove) to the cache "write through" to the server.
- When file or directory data is fetched, the server "guarantees" to notify the client (a callback) before changing the server's copy.
- Cached data is used without checking until a callback is received for it or 10 minutes have elapsed without communication with its server.
- On receiving a callback for a file or directory, the client invalidates the cached copy.
- Cached data can also be revalidated (and new callbacks established) by the client with an RPC to the server; this avoids discarding all cache content after a network partition or client crash. (A bookkeeping sketch follows.)

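A minimal bookkeeping sketch of those rules, with hypothetical names (this is not OpenAFS code):

    import time

    CALLBACK_LIFETIME = 600.0                  # 10 minutes, per the slide

    class Entry:
        def __init__(self, data):
            self.data = data
            self.has_callback = True           # established when data was fetched
            self.last_contact = time.time()

        def usable(self):
            # Use without checking until the callback breaks or 10 minutes
            # pass with no word from the server.
            return (self.has_callback and
                    time.time() - self.last_contact < CALLBACK_LIFETIME)

    def on_callback_break(cache, name):
        if name in cache:                      # server is about to change its copy
            cache[name].has_callback = False   # so invalidate ours

    def revalidate(cache, name, server):
        # After a partition or client crash: re-establish the callback with
        # one RPC instead of discarding the whole cache.
        if server.still_current(name):         # hypothetical validation RPC
            cache[name].has_callback = True
            cache[name].last_contact = time.time()
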
Andrew Benchmark -- Server CPU Utilization

[Graph: Andrew server utilization increases slowly with load; NFS server utilization saturates quickly with load.]

Source: Howard, et al., "Scale and Performance in a Distributed File System," ACM TOCS, Vol. 6, No. 1, February 1988, pp. 51-81.

Google is Really Different....

- Huge datacenters in 30+ worldwide locations; datacenters house multiple server clusters
- Even nearby in Lenoir, NC
- Each datacenter > a football field, with 4-story cooling towers
[Photos: The Dalles, OR (2006); 2007; 2008]

2008
2626COMP 734 – Fall 2011
Google is Really Different….
Each cluster has hundreds/thousands of Linux systems on Ethernet switches
500,000+ total servers
Custom Design Servers

- Clusters of low-cost commodity hardware
- Custom design using high-volume components
- SATA disks, not SAS (high capacity, low cost, somewhat less reliable)
- No "server-class" machines
- Battery power backup

Facebook Enters the Custom Server Race (April 7, 2011)
Announces the Open Compute Project (the Green Data Center)
Google File System Design Goals

- Familiar operations, but NOT the Unix/POSIX standard
- Specialized operation for Google applications: record_append()
- Scalable -- O(1000s) of clients per cluster
- Performance optimized for throughput
- No client caches (big files, little cache locality)
- Highly available and fault tolerant
- Relaxed file consistency semantics; applications are written to deal with consistency issues

File and Usage Characteristics

- Many files are 100s of MB or 10s of GB: results from web crawls, query logs, archives, etc.
- Relatively small number of files (millions per cluster)
- File operations: large sequential (streaming) reads/writes; small random reads (rare random writes)
- Files are mostly "write-once, read-many"
- File writes are dominated by appends, many from hundreds of concurrent processes (e.g., web crawlers)

[Diagram: many processes appending concurrently to one file.]

GFS Basics

- Files are named with a conventional pathname hierarchy (but no actual directory files), e.g., /dir1/dir2/dir3/foobar
- Files are composed of 64 MB "chunks" (Linux typically uses 4 KB blocks); see the addressing sketch below
- Each GFS cluster has many servers (Linux processes): one primary Master Server, several "shadow" Master Servers, hundreds of "chunk" servers
- Each chunk is stored as a normal Linux file; the Linux file system buffer provides caching and read-ahead, and file space is extended as needed up to the chunk size
- Each chunk is replicated (3 replicas by default)
- Chunks are check-summed in 64 KB blocks for data integrity

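The addressing arithmetic implied by those sizes, as a small sketch (constant and function names are mine, not from the GFS source):

    CHUNK_SIZE = 64 * 1024 * 1024      # 64 MB chunks
    CSUM_BLOCK = 64 * 1024             # integrity checksums cover 64 KB blocks

    def locate(offset):
        # Map a byte offset in a file to (chunk index, offset within chunk).
        return offset // CHUNK_SIZE, offset % CHUNK_SIZE

    def checksum_block(offset_in_chunk):
        # Index of the 64 KB checksum block covering this byte of a chunk.
        return offset_in_chunk // CSUM_BLOCK

    chunk, within = locate(200_000_000)
    print(chunk, within, checksum_block(within))   # 2 65782272 1003
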
GFS Protocols for File Reads

Minimizes client interaction with the master:
- Data operations go directly to the chunk servers
- Clients cache chunk metadata until a new open or a timeout (sketch below)

Ghemawat, S., H. Gobioff, and S.-T. Leung, "The Google File System," Proceedings of ACM SOSP 2003, pp. 29-43.
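A hedged sketch of that read path: one master lookup per chunk, cached on the client, then data read directly from any replica (the method names and cache shape are assumptions, not the GFS client library):

    CHUNK_SIZE = 64 * 1024 * 1024
    chunk_cache = {}            # (path, chunk index) -> (handle, replica list)

    def gfs_read(master, path, offset, length):
        # Assumes the read does not cross a chunk boundary.
        index = offset // CHUNK_SIZE
        if (path, index) not in chunk_cache:           # contact master only on a miss
            chunk_cache[(path, index)] = master.find_chunk(path, index)
        handle, replicas = chunk_cache[(path, index)]
        return replicas[0].read(handle, offset % CHUNK_SIZE, length)  # any replica
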
Master Server Functions

- Maintain the file name space (atomic create/delete of names)
- Maintain chunk metadata:
  - Assign an immutable, globally-unique 64-bit identifier
  - Map file names to chunks
  - Track current chunk replica locations (refreshed dynamically from chunk servers)
- Maintain access control data
- Manage replicas and other chunk-related actions:
  - Assign the primary replica and version number
  - Garbage-collect deleted chunks and stale replicas
  - Migrate chunks for load balancing
  - Re-replicate chunks when servers fail
- Exchange heartbeat and state messages with chunk servers

GFS Relaxed Consistency Model

- Writes that are large or cross chunk boundaries may be broken into multiple smaller ones by GFS
- Sequential writes (successful): one-copy semantics*, writes serialized
- Concurrent writes (successful): one-copy semantics, but writes are not serialized in overlapping regions
- Sequential or concurrent writes (failure): replicas may differ; the application should retry

*Informally ("all replicas equal"): there exists exactly one current value at all replicas, and that value is returned for a read of any replica.

GFS Applications Deal with Relaxed Consistency

Writes:
- Retry in case of failure at any replica
- Regular checkpoints after successful sequences
- Include application-generated record identifiers and checksums

Reads:
- Use checksum validation and record identifiers to discard padding and duplicates (see the sketch below)

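A sketch of that reader-side discipline: each record carries an application-generated id and checksum, so readers can drop padding and duplicates. The (record_id, checksum, payload) tuple format is my assumption, not a GFS API:

    import zlib

    def valid_records(records):
        seen = set()
        for record_id, checksum, payload in records:
            if zlib.crc32(payload) != checksum:   # padding or a torn region: skip
                continue
            if record_id in seen:                 # duplicate left by a client retry
                continue
            seen.add(record_id)
            yield payload
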
GFS Chunk Replication (1/2)

[Diagram: a client, the master, and chunk servers C1, C2 (primary), and C3; pushed data sits in LRU buffers at the chunk servers.]

1. Client contacts the master to find the replica locations (C1, C2 (primary), C3) and caches that state.
2. Client picks any chunk server and pushes the data; servers forward the data along the "best" path to the others and ACK back.

GFS Chunk Replication (2/2)

[Diagram: the client's write request goes to the primary (C2); the primary forwards the write order to C1 and C3, collects their ACKs, and returns success or failure to the client.]

3. Client sends the write request to the primary.
4. Primary assigns a write order and forwards it to the replicas.
5. Primary collects ACKs and responds to the client. Applications must retry the write if there is any failure (a client-side sketch follows).

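A hedged sketch of the client-side loop implied by steps 1-5; the master/primary/replica objects and their methods are assumptions for illustration:

    def replicated_write(master, path, index, offset, data, tries=3):
        handle, primary, replicas = master.find_chunk(path, index)
        for _ in range(tries):
            replicas[0].push(handle, data)     # step 2: data flows replica to replica
            if primary.write(handle, offset):  # steps 3-5: primary orders the write,
                return True                    # collects ACKs from all replicas
        return False   # after a failed try, replicas may differ (see slide above)
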
GFS record_append()

- Client specifies only the data content and region size; the server returns the actual offset of the region
- Guaranteed to append at least once atomically
- The file may contain padding and duplicates:
  - Padding if the region size won't fit in the current chunk
  - Duplicates if the append fails at some replicas and the client must retry record_append()
- If record_append() completes successfully, all replicas will contain at least one copy of the region at the same offset (a retry sketch follows)

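A sketch of application-side use (the client API shown is assumed): the server chooses the offset, and the client retries on failure, which is exactly what can leave the duplicates that readers filter out (see the reader sketch above):

    def append_record(gfs, path, payload, tries=5):
        for _ in range(tries):
            offset = gfs.record_append(path, payload)
            if offset is not None:     # appended atomically at least once,
                return offset          # at the same offset in every replica
        raise IOError("record_append failed at some replica on every attempt")
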
GFS Record Append (1/3)

[Diagram: as for a write, data is pushed into LRU buffers at chunk servers C1, C2 (primary), and C3.]

1. Client contacts the master to get replica state (C1, C2 (primary), C3) and caches it.
2. Client picks any chunk server and pushes the data; servers forward it along the "best" path to the others and ACK back.

GFS Record Append (2/3)

[Diagram: the append request goes to the primary (C2), which forwards the write order to C1 and C3; ACKs flow back, and the client receives the assigned offset or a failure.]

3. Client sends the append request to the primary.
4. If the record fits in the last chunk, the primary assigns a write order and an offset and forwards them to the replicas.
5. Primary collects ACKs and responds to the client with the assigned offset. Applications must retry the append if there is any failure.

GFS Record Append (3/3)

[Diagram: the primary and the replicas pad their last chunk; the retried append lands on the next chunk.]

3. Client sends the append request to the primary.
4. If the record overflows the last chunk, the primary and replicas pad the last chunk, and the returned offset points to the next chunk.
5. The client must retry the append from the beginning.
