C. A. Thekkath, T. Mann, and E. K. Lee Systems Research Center Digital Equipment Corporation Frangipani: A Scalable Distributed File System Presented by: Zhiyong (Ricky) Cheng

Page 1:

C. A. Thekkath, T. Mann, and E. K. Lee

Systems Research Center

Digital Equipment Corporation

Frangipani: A Scalable Distributed File System

Presented by: Zhiyong (Ricky) Cheng

Page 2:

fran·gi·pani (fran′ji pan′ē, pän′ē)

From youdictionary.com: any of a genus (Plumeria) of tropical American shrubs and trees of the dogbane family, with large, funnel-shaped flowers and milky sap; specif., a small tree with fragrant, reddish flowers that are used, in Hawaii, to make leis.

Page 3: Outline

- Background
- Introduction
- System Structure
- Disk Layout
- Logging and Recovery
- The Lock Service
- Easy Administration
- Performance
- Conclusions
- Questions

Page 4: Background

About the main author
Personal website: http://research.microsoft.com/users/thekkath/

Chandu Thekkath is the Director of the Platforms and Distributed Systems Group in ISRC within Microsoft Research. Before that he was a Principal Researcher at Microsoft Research in Silicon Valley.

At DEC, Thekkath’s most influential work was the Petal/Frangipani project, done jointly with E. Lee and T. Mann. The system supported a scalable, distributed virtual disk and file system. It was completed (and made public) in 1997, influenced the design of Compaq’s VersaStore products (a self-developed storage virtualization strategy; HP: Storage Apps), and predates many of the storage and NAS (network-attached storage) appliances in the industry today.

Page 5: Background (cont'd)

Original slides: http://ftp.digital.com/pub/Digital/SRC/publications/thekkath/talk/frangipani-sosp.ppt

This paper builds on two related papers:

- Edward K. Lee, Chandramohan A. Thekkath. Petal: Distributed Virtual Disks. Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 84-92, October 1996, Cambridge, Massachusetts, United States.
- Leslie Lamport. The Part-Time Parliament. Technical Report 49, Digital Equipment Corporation, Systems Research Center, 130 Lytton Ave., Palo Alto, CA 94301-1044, September 1989.

Page 6: Introduction – What's the problem?

- Many distributed file systems already exist: VMS Cluster file system, Echo, Calypso, etc.
- Generally, large-scale distributed file systems are hard to manage. Much file system administration work requires human intervention and has to be done manually.
- The administration problem is caused by:
  - Growing computer installations.
  - More disks attached to more machines (components).

Page 7: Introduction – Solution

- Frangipani: a new scalable distributed file system.
  - Two-layer model: built on top of Petal, a distributed storage system.
  - Can also be viewed as a cluster file system.
- It addresses the administration problem as follows:
  - Gives all users a consistent view of files.
  - Frangipani servers can easily be added to an existing installation to improve performance.
  - Users can be added without manual configuration.
  - Dynamic/hot backup support.
  - Fault tolerance (machine, network, and disk failures).

Page 8: Introduction – Layered structure

[Figure: layered structure, top to bottom]
- User programs
- Frangipani file servers
- Petal distributed virtual disk service; distributed lock service
- Physical disks

Page 9: System Structure – Common case

[Figure labels: workstations; Petal virtual disk]

Page 10: System Structure – Components

- User programs access Frangipani through the standard operating system call interface. (Digital Unix vnode interface)
- The Frangipani file server module runs within the OS kernel.
  - Changes to file contents are staged through the local kernel buffer pool and may remain volatile until the next fsync/sync system call.
  - Metadata changes are logged in Petal and are guaranteed to be non-volatile. (Write-ahead redo log, discussed later.)

Page 11: Components (cont'd)

- Frangipani reads and writes Petal virtual disks using the local Petal device driver.
  - Exploits Petal’s large virtual address space.
  - More details are in a separate paper.
- The lock service:
  - Multiple-reader/single-writer locks.
  - Locks with leases (discussed later).

Page 12: Client/Server configuration

- Security issues:
  - Any Frangipani machine can read/write any block of the shared Petal virtual disk.
  - Eavesdropping is possible on the network interconnecting the Petal and Frangipani machines.
  - Solution: run Frangipani, Petal, and the lock servers on trusted networks, machines, and OSs.
- Client/server configuration:
  - All the servers are interconnected by a private network.
  - Remote, untrusted clients talk to Frangipani servers through a separate network (and have no access to Petal).
  - Bonus: clients can use Frangipani without modifying the OS kernel.

Page 13: Client/Server configuration (cont'd)

Page 14: System Structure – Design issues

- Why not use an old file system on Petal?
  - Petal works with old file systems.
  - Traditional file systems such as UFS and AdvFS (the comparison target in the performance section) cannot share a block device.
  - The machine that runs the file system can become a bottleneck.
- Why choose a two-layer structure?
  - The two-layer structure is not unique, e.g. Universal File Server.
  - Modularity: Frangipani machines can be added and deleted transparently.
  - Consistent backup without halting the system.
  - Depends on the design goal of the file system.

Page 15: Design issues (cont'd)

Three aspects of the Frangipani design can be problematic:
- Duplicated logging: updates are sometimes logged by both Petal and Frangipani.
- Frangipani does not use disk location information in placing data.
- Frangipani locks entire files and directories rather than individual blocks.

Page 16: Disk Layout

- Petal provides 2^64 bytes of address space.
- Petal commits/decommits space in large chunks – 64 KB.
- Six regions in the address space:
  - 1st region stores shared configuration parameters and housekeeping information – 1 TB.
  - 2nd region stores logs. Each Frangipani server has one. 1 TB reserved, partitioned into 256 logs.
  - 3rd region is used for allocation bitmaps, which describe which blocks in the remaining regions are free – 3 TB.
  - 4th region holds inodes. 1 TB of inode space, each inode 512 bytes long.

Page 17: Disk Layout (cont'd)

  - 5th region holds small data blocks, each 4 KB in size. 2^47 bytes (128 TB) allocated.
  - The remainder holds large data blocks, with 1 TB reserved for each large block. Limit of 2^24 large files.
- Frangipani takes advantage of Petal’s large, sparse disk address space to simplify its data structures.
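The region sizes above can be checked with a little arithmetic. The sketch below is illustrative only: it assumes the regions sit back to back in the order listed, and it takes the small-block region to be 2^47 bytes (128 TB), the figure given in the Frangipani paper. The names (`REGIONS`, `region_offsets`) are made up for this example.

```python
TB = 2**40
ADDRESS_SPACE = 2**64  # bytes of virtual address space provided by Petal

# (name, size in bytes), assumed to be laid out contiguously in this order
REGIONS = [
    ("parameters", 1 * TB),
    ("logs", 1 * TB),
    ("bitmaps", 3 * TB),
    ("inodes", 1 * TB),
    ("small_blocks", 2**47),
]


def region_offsets(regions):
    """Return {name: (start, size)} for regions placed back to back."""
    offsets = {}
    start = 0
    for name, size in regions:
        offsets[name] = (start, size)
        start += size
    return offsets


layout = region_offsets(REGIONS)
large_start = sum(size for _, size in REGIONS)
layout["large_blocks"] = (large_start, ADDRESS_SPACE - large_start)

inode_count = layout["inodes"][1] // 512            # 512-byte inodes
small_block_count = layout["small_blocks"][1] // 4096  # 4 KB small blocks
large_file_limit = layout["large_blocks"][1] // TB     # 1 TB per large block
log_slot_bytes = layout["logs"][1] // 256              # address space per log

print(inode_count == 2**31, small_block_count == 2**35)
print(large_file_limit, "large files, slightly under 2**24 =", 2**24)
```

Under these assumptions the inode region holds 2^31 inodes, the small-block region holds 2^35 blocks, and the remainder fits slightly fewer than 2^24 large blocks, which matches the limit quoted on the slide.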

Page 18: Logging and Recovery

- Frangipani uses a write-ahead redo log for metadata.
  - Metadata: any on-disk data structure other than the contents of an ordinary file.
  - Log records are kept on Petal.
  - Logs are bounded in size – 128 KB.
- Data is written to Petal:
  - On fsync/sync system calls, or every 30 seconds.
  - On lock revocation, or when the log wraps.
- Each Frangipani machine has a separate log:
  - Reduces contention.
  - Allows independent recovery.

Page 19: Logging and Recovery (cont'd)

- Frangipani server crashes can be detected in two ways:
  - By a client of the failed server.
  - When the lock service asks the failed server to return a lock it is holding.
- Generally, recovery is initiated by the lock service.
  - The recovery demon takes ownership of the failed server’s log and locks.
  - After recovery, it releases all the locks and frees the log.
- Recovery can be carried out on any machine.
  - The log is distributed and available via Petal.
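A minimal sketch of write-ahead redo logging and recovery may help here. It is a toy model, not Frangipani's code: Petal is a plain dictionary, the `Server`, `LogRecord`, and `recover` names are hypothetical, and the per-block version check is an assumption borrowed from the paper's description rather than something stated on these slides.

```python
from dataclasses import dataclass, field


@dataclass
class LogRecord:
    block: str    # hypothetical metadata block name
    value: str    # new contents of the block
    version: int  # version the block has once this update is applied


@dataclass
class Server:
    name: str
    log: list = field(default_factory=list)      # stands in for this server's log on Petal
    pending: dict = field(default_factory=dict)  # updates not yet written in place

    def update_metadata(self, petal, block, value):
        # Write-ahead: the redo record is appended before the block is touched.
        if block in self.pending:
            base = self.pending[block].version
        else:
            base = petal.get(block, ("", 0))[1]
        rec = LogRecord(block, value, base + 1)
        self.log.append(rec)
        self.pending[block] = rec

    def flush(self, petal):
        # Later, logged updates are written to their permanent locations.
        for rec in self.pending.values():
            petal[rec.block] = (rec.value, rec.version)
        self.pending.clear()
        self.log.clear()


def recover(failed, petal):
    """Replay a failed server's log. Any machine could run this, because the log lives on Petal."""
    applied = 0
    for rec in failed.log:
        current = petal.get(rec.block, ("", 0))[1]
        if rec.version > current:  # skip updates that already reached the disk
            petal[rec.block] = (rec.value, rec.version)
            applied += 1
    failed.log.clear()
    failed.pending.clear()
    return applied


petal = {}
a = Server("A")
a.update_metadata(petal, "inode:7", "size=4096")
a.flush(petal)
a.update_metadata(petal, "inode:7", "size=8192")  # logged, then server A "crashes"
applied = recover(a, petal)
print(applied, petal["inode:7"])
```

The version comparison is what makes replay safe to repeat: a record whose update already reached the disk is skipped.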

Page 20: Lock Services

- Multiple-reader/single-writer lock mechanism:
  - A read lock allows a server to read data and cache it.
  - A write lock allows a server to read or write data.
  - When a write lock is downgraded or released, the server must flush its dirty data to disk.
- Locks are moderately coarse-grained:
  - One lock for each logical segment.
  - Each file, directory, or symbolic link is one segment.
  - A lock protects an entire file or directory.
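The rules above can be sketched as a small in-memory lock table. This is only an illustration under simplifying assumptions: `LockTable` and `Server` are hypothetical names, a conflicting request is simply refused (the real lock service asks the current holder to give the lock back), and the "disk" is a dictionary.

```python
class Server:
    def __init__(self, name):
        self.name = name
        self.dirty = {}  # segment -> cached data not yet on disk

    def flush(self, segment, disk):
        # Write back dirty data for one segment, if there is any.
        if segment in self.dirty:
            disk[segment] = self.dirty.pop(segment)


class LockTable:
    def __init__(self, disk):
        self.disk = disk
        self.readers = {}  # segment -> set of server names holding read locks
        self.writer = {}   # segment -> Server holding the write lock

    def acquire(self, server, segment, mode):
        """Grant the lock if it is compatible with current holders."""
        readers = self.readers.setdefault(segment, set())
        writer = self.writer.get(segment)
        if mode == "read" and writer is None:
            readers.add(server.name)
            return True
        if mode == "write" and writer is None and not readers:
            self.writer[segment] = server
            return True
        return False

    def release(self, server, segment):
        if self.writer.get(segment) is server:
            # Giving up a write lock forces dirty data out to disk first.
            server.flush(segment, self.disk)
            del self.writer[segment]
        self.readers.get(segment, set()).discard(server.name)


disk = {}
table = LockTable(disk)
a, b = Server("A"), Server("B")

got_write = table.acquire(a, "file:1", "write")
a.dirty["file:1"] = "new contents"
blocked_read = table.acquire(b, "file:1", "read")  # refused while A writes
table.release(a, "file:1")                         # A flushes, then releases
later_read = table.acquire(b, "file:1", "read")
print(got_write, blocked_read, later_read, disk)
```

Because the writer flushes before its lock goes away, a reader that is granted the lock afterwards always finds the latest data on disk.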

Page 21: Lock Services (cont'd)

- Deadlock is avoided by globally ordering the locks and acquiring them in two phases:
  - Phase one: the server determines which locks it needs. Which file or directory? Read lock or write lock?
  - Phase two: the server sorts the locks by inode address and acquires each lock in turn.
  - The server then checks whether any objects examined in phase one were modified while their locks were released. If so, it releases the locks and loops back to phase one.
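The two-phase acquisition can be sketched as a retry loop. Everything here is a hypothetical stand-in: the callbacks (`determine_locks`, `acquire`, `release`, `read_version`) and the use of a per-object version number to detect modification are assumptions made for the example, not the actual Frangipani interface.

```python
def run_operation(determine_locks, acquire, release, read_version, max_retries=10):
    """Acquire all needed locks in inode-address order, retrying if objects changed."""
    for attempt in range(1, max_retries + 1):
        # Phase one: work out which locks are needed and note what was seen.
        needed = determine_locks()  # {inode_address: "read" or "write"}
        seen = {addr: read_version(addr) for addr in needed}

        # Phase two: a single global order (inode address) rules out deadlock.
        held = []
        for addr in sorted(needed):
            acquire(addr, needed[addr])
            held.append(addr)

        # Recheck: did anything change before its lock was obtained?
        if all(read_version(addr) == seen[addr] for addr in needed):
            return held, attempt
        for addr in reversed(held):
            release(addr)
    raise RuntimeError("could not get a stable set of locks")


# Tiny simulation: another server modifies inode 20 during the first attempt.
versions = {10: 1, 20: 1, 30: 1}
trace = []
interfered = []


def determine_locks():
    return {30: "write", 10: "read", 20: "write"}


def acquire(addr, mode):
    trace.append(("acquire", addr))
    if addr == 10 and not interfered:
        versions[20] += 1  # simulated concurrent change before inode 20 is locked
        interfered.append(True)


def release(addr):
    trace.append(("release", addr))


held, attempts = run_operation(determine_locks, acquire, release, versions.get)
print(held, attempts)
```

In this run the first attempt is abandoned because inode 20 changed before its lock was held, and the second attempt succeeds with the locks taken in address order.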

Page 22: Lock Services (cont'd)

- The lock service deals with client failure using leases.
  - A client obtains a lease together with the lock. The client must renew the lease before it expires; otherwise the lock becomes invalid.
- Three different implementations (key problem: where to store the lock state?):
  - 1st: a single, centralized server. All lock state is kept in the server's volatile memory.
  - 2nd: primary/backup servers. The lock state is stored on a Petal virtual disk, so it can be recovered after a server crash. Poor performance.
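Lease handling can be illustrated with a few lines of bookkeeping. This is a sketch under assumptions: `LeaseTable` is a hypothetical name, time is passed in explicitly so the example is deterministic, and the duration of 30 time units is a placeholder rather than a value taken from these slides.

```python
class LeaseTable:
    def __init__(self, duration):
        self.duration = duration
        self.expiry = {}  # client name -> time at which its lease runs out

    def grant(self, client, now):
        self.expiry[client] = now + self.duration

    def is_valid(self, client, now):
        return client in self.expiry and now < self.expiry[client]

    def renew(self, client, now):
        # A lease can only be renewed while it is still valid.
        if not self.is_valid(client, now):
            return False
        self.expiry[client] = now + self.duration
        return True

    def expired_clients(self, now):
        # Clients whose locks should be treated as invalid.
        return [c for c, t in self.expiry.items() if now >= t]


leases = LeaseTable(duration=30)
leases.grant("fs1", now=0)
renewed_in_time = leases.renew("fs1", now=20)  # lease now runs until t=50
still_valid = leases.is_valid("fs1", now=45)
renewed_late = leases.renew("fs1", now=60)     # too late, lease ran out at t=50
print(renewed_in_time, still_valid, renewed_late, leases.expired_clients(60))
```

A server that stops renewing (because it crashed or was cut off) shows up in `expired_clients`, which is the point at which the lock service can treat its locks as invalid.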

Page 23: Lock Services (cont'd)

  - 3rd and final: a set of mutually cooperating lock servers, and a clerk module linked into each Frangipani server. Result: fully distributed for fault tolerance and scalable performance.
- Highlights of the final implementation:
  - The lock servers maintain a lock table for each Frangipani server. The clerk module is responsible for communication (via asynchronous messages).
  - A small amount of global state information is replicated across all lock servers using Lamport’s Paxos algorithm. (Paxos is also used in Google's Chubby lock service: http://labs.google.com/papers/chubby.html)

Page 24: Easy Administration (adding/removing servers)

- Adding another Frangipani server requires a minimal amount of administrative work. The new server only needs to be told:
  - Which Petal virtual disk to use.
  - Where to find the lock service.
- Removing a Frangipani server is even easier.
  - Simply shut the server off. The lock servers will invalidate the locks held by the server after its lease expires and initiate the recovery service to run the redo log.

Page 25: Easy Administration – backup

- Petal’s snapshot feature provides a convenient way to make a consistent full dump of a Frangipani file system.
  - Uses copy-on-write techniques.
  - Crash-consistent: a snapshot reflects a coherent state.
- Backing up a Frangipani file system:
  - Take a Petal snapshot.
  - Copy it to tape.
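To make the copy-on-write idea concrete, here is a generic sketch of a block store with snapshots. It is not Petal's implementation: `VirtualDisk` and its methods are hypothetical, and the design (saving a block's old contents the first time it is overwritten after a snapshot) is just one common way to do copy-on-write.

```python
_MISSING = object()  # marks a block that did not exist when the snapshot was taken


class VirtualDisk:
    def __init__(self):
        self.live = {}       # block number -> current contents
        self.snapshots = []  # per snapshot: {block number: contents at snapshot time}

    def snapshot(self):
        # Taking a snapshot copies nothing; data is saved lazily on later writes.
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, block, data):
        old = self.live.get(block, _MISSING)
        for saved in self.snapshots:
            saved.setdefault(block, old)  # keep the old contents once per snapshot
        self.live[block] = data

    def read(self, block, snapshot_id=None):
        if snapshot_id is not None:
            saved = self.snapshots[snapshot_id]
            if block in saved:
                value = saved[block]
                return None if value is _MISSING else value
        return self.live.get(block)

    def dump(self, snapshot_id):
        """All blocks as of the snapshot, e.g. the data to copy to tape."""
        blocks = set(self.live) | set(self.snapshots[snapshot_id])
        out = {}
        for b in sorted(blocks):
            value = self.read(b, snapshot_id)
            if value is not None:
                out[b] = value
        return out


disk = VirtualDisk()
disk.write(0, "a")
disk.write(1, "b")
snap = disk.snapshot()
disk.write(1, "B")  # the file system keeps running after the snapshot
disk.write(2, "c")
print(disk.read(1), disk.read(1, snap), disk.dump(snap))
```

The live disk keeps changing while `dump(snap)` still returns the blocks exactly as they were when the snapshot was taken, which is why a backup does not require halting the system.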

Page 26: Performance – Experimental setup

- Non-volatile memory (NVRAM):
  - Solves Frangipani server latency problems.
  - Placed between the physical disks and the Petal server.
- Ideal testbed:
  - 100 Petal nodes (small array controllers).
  - 50 Frangipani servers (typical workstations).
- Reality:
  - Seven 333 MHz DEC Alpha 500 5/333 machines as Petal servers.
  - Each has nine DIGITAL RZ29 disks, 4.3 GB each.
  - Connected to a 24-port ATM switch by 155 Mbit/s links.

Page 27: Single Machine Performance

- Why AdvFS?
  - Significantly faster than the BSD-derived UFS file system.
  - Can stripe files across multiple disks.
  - Uses a write-ahead log, like Frangipani.
- Frangipani does not use local disks, while AdvFS uses locally attached disks.
- For MAB (the Modified Andrew Benchmark), the file system is unmounted at the end of each phase, for the same reason as in the tests performed on log-based file systems.

Page 28: Single Machine Performance

Table 1: Modified Andrew Benchmark with unmount operations

Table 2: Frangipani Throughput and CPU Utilization

Page 29: Scaling

[Figure: Frangipani Scaling on Modified Andrew Benchmark. x-axis: Frangipani Machines (1 to 6); y-axis: Elapsed time (secs), 0 to 60; series: Compile, Scan, Stat, Copy, Create]

Page 30: Scaling (cont'd)

[Figure: Frangipani scaling on Uncached Read. x-axis: Frangipani Machines (1 to 6); y-axis: throughput (MB/s), 0 to 70]

Page 31: Scaling (cont'd)

[Figure: Frangipani scaling on write. x-axis: Frangipani Machines (1 to 6); y-axis: throughput (MB/s), 0 to 70]

Page 32: Conclusions

- Frangipani is feasible to build because of its two-layer structure.
  - All shared state is on a Petal disk: easy to add, delete, and recover servers.
  - Frangipani servers do not communicate with each other: simple to design, implement, debug, and test.
- Frangipani's performance is comparable to that of a production DIGITAL Unix file system (AdvFS).
- Still in the early prototype stage; more experience is needed to improve scalability, finer-grained locking, etc.
- Applications:
  - Influenced the design of Compaq’s VersaStore products.
  - Predates many of the storage and NAS appliances in the industry today.

Page 33: Questions – Two layer structure

What are the advantages of using a two-level Frangipani/Petal approach? It seems that the authors developed Petal first, and then worked to design a file system that works with the storage abstraction provided by Petal. The two-level approach has some limitations, e.g., the inability to use disk location information in placing data, as mentioned at the end of Section 2. Is it possible to combine these two things together?

Why is there even a need for Frangipani? By the looks of Figure 2, it seems like Petal exports all the disks as one giant virtual disk. Couldn't a normal file system (like ext3) be put on top of Petal (much like LVM, but distributed)?

Page 34: Two layer structure (cont'd)

Maybe I'm missing the point of Frangipani... but in Section 7 they talk about adding additional servers to a machine. It sounded like Frangipani is a file system interface you could use to access a Petal virtual disk, so why would you need more than one such interface running on a machine?

Why did the authors stop at only two layers rather than more? Maybe splitting one layer into parts could simplify the concepts of the file system. Would more layers increase the complexity of the system?

Do you believe the two-layered approach is the ideal way to design a distributed file system?

Page 35: Security

In Section 2.2 they talk about security and say that "it would not be sufficient for a Frangipani machine to authenticate itself to Petal as acting on behalf of a particular user." Why not? Is this because Petal has no knowledge of users, and just acts as a disk, or is it something else?

The authors state that even though Frangipani is designed to work well in a "cluster of workstations within a single administrative domain," it could be exported to "untrusted machines outside an administrative domain." How would this affect the administration of the system? Can the current design cope with this? What about security with untrusted machines now part of the network?

The authors list some security measures they could implement (but haven't) and also state that if they did so they could "reach roughly the NFS level of security." What does "roughly" mean?

Page 36: Security (cont'd)

It seems that security is a fairly significant limitation for the presented system, as the authors state that they haven't implemented any security measures. Do you know if there is any follow-up work that solves this problem?

The authors claim that Frangipani can be "exported to untrusted machines using ordinary network file access protocols", but wouldn't the network's file storage be compromised?

There seems to be a lot of (maybe too much) trust in the Frangipani system, e.g. “Frangipani servers trust one another, the Petal servers, and the lock service.” (p2) and “Any Frangipani machine can read or write any block of the shared Petal virtual disk, so Frangipani must run only on machines with trusted operating systems” (p3). Is this very secure?

Page 37: Logging/Recovery

For recovery, how is the recovery demon assigned? Does the fact that only one recovery demon is active mean that there is no partitioning?

The paper states that “only metadata is logged, not user data, so a user has no guarantee that the file system state is consistent from his point of view after a failure. We do not claim these semantics to be ideal, but they are the same as what standard local Unix file systems provide.” (p5) Is it reasonable for the users’ data to be inconsistent after a failure? I don’t think this is reasonable and I don’t believe ‘standard local file systems do that’ is a good excuse. Would it be beneficial to combine this with systems such as the Elephant file system to provide more data security?

Page 38: Performance

How can the authors claim that their system is scalable when all of their performance tests are done on a single server? They did look at the "average time taken by one Frangipani machine" on the modified Andrew Benchmark with up to 6 machines, but how can one claim that a system is scalable with only 6 machines?

The authors promote the reliability of their system (it reliably detects the end of the log, reliably rejects writes with expired leases, etc.), yet they failed to report supporting test data. Is this an oversight or perhaps an overly optimistic prediction on their part?

The performance of Frangipani was about the same as (but slightly worse than) that of AdvFS, even given that Frangipani has five times the bandwidth. How does it compare to other distributed file systems like AFS?

Page 39: Scalability / Consistency

Is Frangipani scalable to a large network environment? Can it handle the communication issues that arise there?

The authors state that Petal "optionally replicates data for high availability." If data is replicated, how can the system guarantee that "changes made to a file or directory on one machine are immediately visible on all others"? What mechanisms does Petal employ to ensure this level of consistency?

Page 40: Contributions / Later work

What are the significant contributions of the Frangipani distributed file system (i.e., what was their new concept)? It seems like they rely heavily on Petal for many of the details. They repeatedly stress the simplicity of their idea (which isn't a bad thing at all) but... what did they do? (Locking system for shared file management vs. a log-based system?)

Frangipani is not very portable. Have there been attempts to develop Frangipani for other systems or to make it more portable?

Page 41:

THANK YOU!