Frangipani: A Scalable Distributed File System

C. A. Thekkath, T. Mann, and E. K. Lee, Systems Research Center, Digital Equipment Corporation. Frangipani: A Scalable Distributed File System. Presented by: Long Zhang. Slides combine material from a previous course and Frangipani’s original SOSP ’97 slides.


Page 1: Frangipani: A Scalable Distributed File System

C. A. Thekkath, T. Mann, and E. K. Lee
Systems Research Center, Digital Equipment Corporation

Presented by: Long Zhang
Slides combine material from a previous course and Frangipani’s original SOSP ’97 slides.

Page 2: Motivation

Large-scale distributed file systems are hard to administer.

Administration is a problem because of:
- size of installation
- number of components

Page 3: Outline

- Background
- Introduction
- System Structure
- Disk Layout
- Logging and Recovery
- The Lock Service
- Easy Administration
- Performance
- Conclusions
- Questions

Page 4: Background (cont'd)

Original slides: http://ftp.digital.com/pub/Digital/SRC/publications/thekkath/talk/frangipani-sosp.ppt

This paper builds on two related papers:
- Edward K. Lee and Chandramohan A. Thekkath. Petal: Distributed Virtual Disks. Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 84-92, October 1996, Cambridge, Massachusetts, United States.
- Leslie Lamport. The Part-Time Parliament. Technical Report 49, Digital Equipment Corporation, Systems Research Center, 130 Lytton Ave., Palo Alto, CA 94301-1044, September 1989.

Page 5: Related Work

- NFS (Sandberg et al., ’85, SUN)
- VAXClusters (Kronenberg, Levy, & Strecker, ’86, DEC)
- AFS (Howard et al., ’88, CMU)
- Echo (Mann et al., ’94, SRC)
- xFS (Anderson et al., ’95, Berkeley)
- Calypso (Devarakonda, Kish, and Mohindra, ’95, IBM)
- Shillner and Felten (’96, Princeton)

Page 6: Introduction

Many distributed file systems already exist: the VMS Cluster file system, Echo, Calypso, etc.

In general, large-scale distributed file systems are hard to manage. Much of the file system administration work requires human intervention and has to be done manually.

The administration problem is caused by:
- Growing computer installations.
- More disks attached to more machines (components).

Page 7: Introduction – Solution

Frangipani:
- A new scalable distributed file system.
- Two-layer model: built on top of Petal, a distributed storage system.
- Can also be viewed as a cluster file system.

It addresses the administration problem as follows:
- Gives all users a consistent view of the files.
- Frangipani servers can easily be added to an existing installation to improve performance.
- Users can be added without manual configuration.
- Dynamic (hot) backup support.
- Fault tolerance (machine, network, and disk failures).

Page 8: Petal Prototype

[Figure: Petal prototype. Labels: four Petal clients, a switched network, the Petal virtual disk, and three Petal servers, each with disks.]

Page 9: Introduction – Layered structure

[Figure: layered structure. Labels, top to bottom: user programs, Frangipani file servers, distributed lock service, Petal distributed virtual disk, physical disks.]

Page 10: System Structure – Common workstations

[Figure: Petal virtual disk]

Page 11: System Structure – Components

- User programs access Frangipani through the standard operating system call interface (Digital Unix vnode interface).
- The Frangipani file server module runs within the OS kernel.
- Changes to file contents are staged through the local kernel buffer pool and may remain volatile until the next fsync/sync system call.
- Metadata changes are logged in Petal and are guaranteed to be non-volatile (write-ahead redo log, discussed later).
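From a user program's point of view, the staging described above means that written data is not durable until fsync returns. The following is a minimal Python sketch of that general POSIX-style behaviour, not of anything Frangipani-specific; the helper name `durable_write` and the file name are hypothetical.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data and return only after the OS has been asked to persist it."""
    with open(path, "wb") as f:
        f.write(data)          # data sits in the process's own buffer
        f.flush()              # hand it to the kernel buffer pool (still volatile)
        os.fsync(f.fileno())   # ask the kernel to write it through to storage


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "example.txt")
        durable_write(target, b"hello")
        with open(target, "rb") as f:
            print(f.read())
```

Without the `os.fsync` call, the data could still be lost in a crash even though `write` has returned.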

Page 12: Components

- The Frangipani file server module reads and writes Petal virtual disks using the local Petal device driver.
- It exploits Petal’s large virtual address space. More details are in a separate paper.

The lock service provides:
- Multiple-reader/single-writer locks.
- Locks with leases (discussed later).

Page 13: Client/Server configuration

Security issues:
- Any Frangipani machine can read or write any block of the shared Petal virtual disk.
- Eavesdropping is possible on the network interconnecting the Petal and Frangipani machines.

Solution: run the Frangipani, Petal, and lock servers on trusted networks, machines, and operating systems.

Client/server configuration:
- All the servers are interconnected by a private network.
- Remote, untrusted clients talk to Frangipani servers through a separate network and have no access to Petal.
- Bonus: clients can use Frangipani without modifying

Page 14: Client/Server configuration

Page 15: System Structure – Design issues

Why not use an old file system on Petal?
- Petal works with old file systems.
- Traditional file systems such as UFS and AdvFS (the comparison target in the performance section) cannot share a block device.
- The machine that runs the file system can become a bottleneck.

Why choose a two-layer structure?
- The two-layer structure is not unique, e.g. the Universal File Server.
- Modularity.
- Frangipani machines can be added and deleted transparently.
- Consistent backup without halting the system.

Page 16: Design issues (cont'd)

Three aspects of the Frangipani design can be problematic:
- Duplicated logging: some updates are logged by both Petal and Frangipani.
- Frangipani does not use disk location information when placing data.
- Frangipani locks entire files and directories rather than individual blocks.

Page 17: Disk Layout

- Petal provides 2^64 bytes of address space.
- Petal commits and decommits physical space in large chunks of 64 KB.
- The address space is divided into six regions:

- 1st region (1 TB): stores shared configuration parameters and housekeeping information.
- 2nd region (1 TB reserved): stores logs. Each Frangipani server has one; the region is partitioned into 256 logs.
- 3rd region (3 TB): used for allocation bitmaps, which describe which blocks in the remaining regions are free.
- 4th region holds inodes. 1 TB inode space, each

Page 18: Disk Layout (cont'd)

- 5th region holds small data blocks, each 4 KB in size. 2^47 bytes (128 TB) are allocated for it.
- The remainder holds large data blocks, with 1 TB of address space reserved for each large block. This limits the system to 2^24 large files.

Frangipani takes advantage of Petal’s large, sparse disk address space to simplify its data structures.
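The fixed region sizes make address computation simple arithmetic. The sketch below covers only the first four regions and rests on labelled assumptions: the regions are packed back to back starting at address 0 in the order listed, the 1 TB log region is split into 256 equal slots, and inodes are 512 bytes (the figure mentioned on the last discussion slide). All function and constant names are hypothetical.

```python
TB = 1 << 40  # one terabyte, 2**40 bytes

# (name, size) of the first four regions, in the order the slides list them.
# Assumption: regions are contiguous and start at address 0.
REGIONS = [
    ("parameters", 1 * TB),
    ("logs", 1 * TB),
    ("bitmaps", 3 * TB),
    ("inodes", 1 * TB),
]
NUM_LOGS = 256     # the log region is partitioned into 256 logs
INODE_SIZE = 512   # bytes; assumption taken from the "512b inode" remark


def region_start(name: str) -> int:
    """Byte offset at which the named region begins."""
    offset = 0
    for region, size in REGIONS:
        if region == name:
            return offset
        offset += size
    raise KeyError(name)


def log_address(slot: int) -> int:
    """Start of the log slot belonging to one Frangipani server."""
    if not 0 <= slot < NUM_LOGS:
        raise ValueError("log slot out of range")
    return region_start("logs") + slot * (TB // NUM_LOGS)


def inode_address(number: int) -> int:
    """Start of inode `number` inside the inode region."""
    if not 0 <= number < TB // INODE_SIZE:
        raise ValueError("inode number out of range")
    return region_start("inodes") + number * INODE_SIZE


if __name__ == "__main__":
    print(region_start("inodes") // TB)  # 5
    print(TB // NUM_LOGS)                # 4294967296 bytes per log slot
    print(TB // INODE_SIZE)              # 2147483648 inodes fit in 1 TB
```

Because every address is plain arithmetic on a slot or inode number, no lookup table is needed to locate these structures.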

Page 19: Logging and Recovery

Frangipani uses a write-ahead redo log for metadata:
- Metadata is any on-disk data structure other than the contents of an ordinary file.
- Log records are kept on Petal.
- Logs are bounded in size (128 KB).

Data is written to Petal:
- On fsync/sync system calls, or every 30 seconds.
- On lock revocation, or when the log wraps.

Each Frangipani machine has a separate log:
- Reduces contention.
- Allows independent recovery.

Page 20: Logging and Recovery (cont'd)

Frangipani server crashes can be detected in two ways:
- By a client of the failed server.
- When the lock service asks the failed server to return a lock it is holding.

Recovery:
- Generally, recovery is initiated by the lock service.
- The recovery demon takes ownership of the failed server’s log and locks.
- After recovery, it releases all the locks and frees the log.
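Replaying a redo log can be sketched as follows. This is a simplified illustration, not the actual recovery demon: it assumes each log record carries a version number for the block it updates and that a record is applied only when the on-disk block's version is lower, which makes replay safe to repeat. The `LogRecord`, `Block`, and `replay` names are hypothetical, and records here hold whole block contents rather than deltas.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LogRecord:
    block: int     # which metadata block the record updates
    version: int   # version the block has once the update is applied
    data: bytes    # new block contents (a real log would store a delta)


@dataclass
class Block:
    version: int = 0
    data: bytes = b""


def replay(log: List[LogRecord], disk: Dict[int, Block]) -> int:
    """Apply log records in order, skipping updates the disk already reflects."""
    applied = 0
    for rec in log:
        blk = disk.setdefault(rec.block, Block())
        if rec.version > blk.version:   # update never reached the disk
            blk.version = rec.version
            blk.data = rec.data
            applied += 1
    return applied


if __name__ == "__main__":
    disk = {7: Block(version=2, data=b"current")}
    log = [
        LogRecord(7, 1, b"stale"),    # already superseded on disk
        LogRecord(7, 3, b"newer"),    # missing on disk, must be redone
        LogRecord(9, 1, b"created"),  # block never written before the crash
    ]
    print(replay(log, disk))  # 2
    print(replay(log, disk))  # 0
```

The second call applies nothing, which is the property that lets recovery be restarted if the recovery demon itself fails.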

Page 21: Lock Services

Multiple-reader/single-writer lock mechanism:
- A read lock allows a server to read data and cache it.
- A write lock allows a server to read or write data.
- When a write lock is downgraded or released, the server must flush its dirty data to disk.

Locks are moderately coarse-grained:
- There is one lock for each logical segment.
- Each file, directory, or symbolic link is one segment, so a lock protects an entire file or directory.
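The read/write rules above can be sketched as a small single-process lock object. This is only an illustration of the compatibility rules and the flush-on-downgrade requirement: the `SegmentLock` class and its `flush` callback are hypothetical, and a conflicting request simply returns `False` here, whereas the real lock service would ask the current holder to release or downgrade.

```python
from typing import Callable, Optional, Set


class SegmentLock:
    """Multiple-reader/single-writer lock for one segment (file, directory, symlink)."""

    def __init__(self, flush: Callable[[str], None]) -> None:
        self.readers: Set[str] = set()
        self.writer: Optional[str] = None
        self.flush = flush  # called with the server name before a write lock is given up

    def acquire_read(self, server: str) -> bool:
        if self.writer is not None:
            return False            # a writer excludes all readers
        self.readers.add(server)
        return True

    def acquire_write(self, server: str) -> bool:
        if self.writer is not None or self.readers:
            return False            # a writer needs exclusive access
        self.writer = server
        return True

    def downgrade(self, server: str) -> None:
        if self.writer != server:
            raise ValueError("server does not hold the write lock")
        self.flush(server)          # dirty data must reach disk first
        self.writer = None
        self.readers.add(server)

    def release(self, server: str) -> None:
        if self.writer == server:
            self.flush(server)      # same rule when the write lock is released
            self.writer = None
        self.readers.discard(server)


if __name__ == "__main__":
    flushed = []
    lock = SegmentLock(flushed.append)
    print(lock.acquire_write("server-a"))  # True
    print(lock.acquire_read("server-b"))   # False
    lock.downgrade("server-a")
    print(lock.acquire_read("server-b"))   # True
    print(flushed)                         # ['server-a']
```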

Page 22: Lock Services (cont'd)

Deadlock is avoided by globally ordering the locks and acquiring them in two phases:
- Phase one: the server determines which locks it needs. Which file or directory? Read lock or write lock?
- Phase two: the server sorts the locks by inode address and acquires each lock in turn. It then checks whether any objects identified in phase one were modified while their locks were released. If so, the server releases the locks and loops back to phase one.
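The two-phase acquisition can be sketched as a retry loop. The function below is a hedged illustration: the four callbacks (`determine_locks`, `lock`, `unlock`, `still_valid`) and the `max_attempts` bound are hypothetical stand-ins for whatever the file server actually does, and locks are identified only by inode address.

```python
from typing import Callable, List, Set


def acquire_in_order(
    determine_locks: Callable[[], Set[int]],
    lock: Callable[[int], None],
    unlock: Callable[[int], None],
    still_valid: Callable[[Set[int]], bool],
    max_attempts: int = 10,
) -> List[int]:
    """Acquire a set of locks in ascending inode-address order, retrying if stale."""
    for _ in range(max_attempts):
        needed = determine_locks()   # phase one: work out which locks are needed
        order = sorted(needed)       # global order shared by every server
        for addr in order:           # phase two: acquire each lock in turn
            lock(addr)
        if still_valid(needed):      # nothing changed while the locks were not held
            return order
        for addr in reversed(order):
            unlock(addr)             # back off and redo phase one
    raise RuntimeError("could not acquire a stable lock set")


if __name__ == "__main__":
    held: List[int] = []
    attempts = {"n": 0}

    def determine() -> Set[int]:
        attempts["n"] += 1
        # Pretend the directory changed once, so the first answer is stale.
        return {30, 10, 20} if attempts["n"] == 1 else {10, 40}

    result = acquire_in_order(determine, held.append, held.remove,
                              lambda needed: attempts["n"] > 1)
    print(result)  # [10, 40]
```

Since every server sorts by the same key, no two servers can each hold a lock the other is waiting for, which is what rules out deadlock.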

Page 23: Lock Services (cont'd)

The lock service deals with client failure using leases:
- A client obtains a lease together with the lock.
- The client must renew the lease before it expires; otherwise the lock becomes invalid.

Three different implementations (key problem: where to store the lock state?):
- 1st: a single, centralized server. All lock state is kept in the server’s volatile memory.
- 2nd: primary/backup servers. The lock state is stored on a Petal virtual disk, so it can be recovered after a server crash. Poor performance.
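Lease handling can be sketched with an explicit clock. This is a toy model, not the lock service's protocol: the `Lease` class is hypothetical, time is passed in as a number so the behaviour is deterministic, and the 30-second duration in the example is only an illustrative value.

```python
class Lease:
    """A lease that is valid for `duration` seconds after each renewal."""

    def __init__(self, duration: float, now: float) -> None:
        self.duration = duration
        self.expires_at = now + duration

    def is_valid(self, now: float) -> bool:
        return now < self.expires_at

    def renew(self, now: float) -> bool:
        """Extend the lease; refuse if it has already expired."""
        if not self.is_valid(now):
            return False   # the locks covered by this lease are no longer valid
        self.expires_at = now + self.duration
        return True


if __name__ == "__main__":
    lease = Lease(duration=30.0, now=0.0)
    print(lease.renew(now=20.0))     # True
    print(lease.is_valid(now=45.0))  # True
    print(lease.renew(now=60.0))     # False
```

A server that crashes simply stops renewing, so its locks lapse without anyone having to contact it.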

Page 24: Lock Services (cont'd)

- 3rd and final: a set of mutually cooperating lock servers, plus a clerk module linked into each Frangipani server. Result: fully distributed, for fault tolerance and scalable performance.

Highlights of the final implementation:
- The lock servers maintain a lock table for each Frangipani server.
- The clerk module is responsible for communication (via asynchronous messages).
- A small amount of global state information is replicated across all lock servers using Lamport’s Paxos algorithm. (Paxos is also used in Google’s Chubby lock service: http://labs.google.com/papers/chubby.html)

Page 25: Easy Administration (adding/removing servers)

Adding another Frangipani server requires a minimal amount of administrative work. The new server only needs to be told:
- Which Petal virtual disk to use.
- Where to find the lock service.

Removing a Frangipani server is even easier:
- Simply shut the server off.
- After the lease expires, the lock servers invalidate the locks held by the server and initiate the recovery service to run the redo log.

Page 26: Easy Administration – backup

Petal’s snapshot feature provides a convenient way to make a consistent full dump of a Frangipani file system:
- Uses copy-on-write techniques.
- Crash-consistent: a snapshot reflects a coherent state.

To back up a Frangipani file system:
- Take a Petal snapshot.
- Copy it to tape.
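One way to picture a copy-on-write snapshot is a disk that tags every block version with an epoch number. The toy below is an assumption-laden sketch, not Petal's implementation: `SnapshotDisk` is a hypothetical in-memory class, a snapshot just advances the epoch, and the first write to a block after a snapshot keeps the old version alongside the new one.

```python
from typing import Dict, List, Optional, Tuple


class SnapshotDisk:
    """Toy block store where snapshots are cheap and old data is kept on overwrite."""

    def __init__(self) -> None:
        self.epoch = 0
        self.versions: Dict[int, List[Tuple[int, bytes]]] = {}

    def write(self, block: int, data: bytes) -> None:
        history = self.versions.setdefault(block, [])
        if history and history[-1][0] == self.epoch:
            history[-1] = (self.epoch, data)    # same epoch: overwrite in place
        else:
            history.append((self.epoch, data))  # first write since a snapshot: keep the old copy

    def snapshot(self) -> int:
        """Freeze the current state and return an identifier for it."""
        frozen = self.epoch
        self.epoch += 1
        return frozen

    def read(self, block: int, epoch: Optional[int] = None) -> Optional[bytes]:
        """Read the live disk, or the state as of a snapshot epoch."""
        wanted = self.epoch if epoch is None else epoch
        for version_epoch, data in reversed(self.versions.get(block, [])):
            if version_epoch <= wanted:
                return data
        return None


if __name__ == "__main__":
    disk = SnapshotDisk()
    disk.write(1, b"before")
    snap = disk.snapshot()
    disk.write(1, b"after")
    print(disk.read(1, snap))  # b'before'
    print(disk.read(1))        # b'after'
```

The file system keeps running against the live epoch while a backup tool reads the frozen one, which is why no halt is needed during the dump.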

Page 27: Performance – Experimental

Non-volatile memory (NVRAM):
- Solves Frangipani server latency problems.
- Placed between the physical disks and the Petal server.

Ideal testbed:
- 100 Petal nodes (small array controllers).
- 50 Frangipani servers (typical workstations).

Reality:
- Seven 333 MHz DEC Alpha 500 5/333 machines as Petal servers.
- Each has 9 DIGITAL RZ29 disks of 4.3 GB each.
- Connected to a 24-port ATM switch by 155 Mbit/s links.

Page 28: Single Machine Performance

Why AdvFS?
- It is significantly faster than the BSD-derived UFS file system.
- It can stripe files across multiple disks.
- It uses a write-ahead log, like Frangipani.

Frangipani does not use local disks, whereas AdvFS uses locally attached disks.

For the Modified Andrew Benchmark (MAB), the file system is unmounted at the end of each phase, for the same reason as in the tests performed for log-based file systems.

Page 29: Single Machine Performance

Table 1: Modified Andrew Benchmark with unmount operations

Table 2: Frangipani Throughput and CPU Utilization

Page 30: Scaling

[Figure: Frangipani Scaling on Modified Andrew Benchmark. x-axis: Frangipani Machines (1 to 6); y-axis: Elapsed time (secs), 0 to 60; series: Create, Copy, Stat, Scan, Compile.]

Page 31: Scaling (cont'd)

[Figure: Frangipani scaling on Uncached Read. x-axis: Frangipani Machines (1 to 6); y-axis: throughput (MB/s), 0 to 70.]

Page 32: Scaling (cont'd)

[Figure: Frangipani scaling on write. x-axis: Frangipani Machines (1 to 6); y-axis: throughput (MB/s), 15 to 70.]

Page 33: Discussion

- I am a bit worried about its locking granularity. What if we could lock individual blocks rather than files or directories? How would that affect the overall performance of the system?
- Petal uses data replication for high availability. Maintaining consistency among several copies in a distributed system is inherently difficult, so how does Petal deal with this issue?

Page 34: Conclusions

- Frangipani is feasible to build because of its two-layer structure.
- All shared state is on a Petal disk.
- It is easy to add, delete, and recover servers.
- Frangipani servers do not communicate with each other, which makes them simple to design, implement, debug, and test.
- Frangipani performance is comparable to that of a production DIGITAL Unix file system (AdvFS).
- It is still at an early prototype stage; more experience is needed to improve scalability, provide finer-grained locking, etc.
- Applications: the design of Compaq’s VersaStore products predates many of the storage and NAS appliances in the industry today.

Page 35: Discussions

During logging and recovery, each entry in the log is given a monotonically increasing sequence number, and each log record has a version number for the block it updates. These are used to detect the end of the log (if the next entry’s sequence number is less than the current one) or an old record (if the version number in the record is less than the on-disk version number). However, these numbers have to be implemented as some sort of integers in the system. How is their overflow taken care of? I realise that this would take an unusually high number of writes, but wouldn't this potentially be an issue otherwise?

Page 36: Discussions

Petal optionally replicates data for high availability. How does this affect locking and synchronization? When a certain file is to be updated, are its inodes and data blocks simultaneously locked and updated on all the Petal servers on which it exists? Also, since Petal can continue functioning as long as a single disk containing the file is available, isn't it possible that there will be inconsistent versions of the file if any of the servers with replicated data is unavailable at any time? How are files merged in such situations?

Page 37:

1) What is the benefit of using Petal to build Frangipani? And what is the benefit of using a "so-called" virtual disk to provide a large address space?

2) Do you think implementing a cluster file system on top of a disk-based storage structure, like Petal, is better than implementing it directly on top of the file systems of an operating system?

3) The bottleneck of such a system seems to be the network bandwidth, the Petal server throughput, and its disk access time. So why does Frangipani need to be implemented as an operating system module, which reduces both reliability and portability? Implementing it in the user level seems

Page 38:

4) Do you think it is a balanced architecture? The Frangipani server deals with client requests, and its disk acts only as a cache. So it seems the Frangipani server needs a small disk but a fast CPU and network interface, while the Petal server needs large disks and an even faster network interface.

5) The system still needs manual administration when adding or removing either a Frangipani server or a Petal server. Do you think it scales well?

Page 39:

"Only metadata is logged, not user data, so a user has no guarantee that the file system state is consistent from his point of view after a failure." Is it acceptable for users’ data to be inconsistent after a failure, and does any existing distributed file system solve this problem well?

The chunk size in a Petal virtual disk is 64 KB, yet in the file system, Frangipani, there are 4 KB blocks and 512-byte inodes. Does that mean some file operations will have to wait for others?