Transcript of:

Garth A. Gibson, David F. Nagle, William Courtright II, Nat Lanza, Paul Mazaitis, Marc Unangst, Jim Zelenka, "NASD Scalable Storage Systems", USENIX 1999 Extreme Linux Workshop, Monterey, CA, June 1999. http://www.pdl.cs.cmu.edu/Publications/publications.html

Page 2

Motivation

• NASD minimizes server-based data movement by separating management and filesystem semantics from store-and-forward copying of data

• Figure 1: Standalone server with attached disks – note the long path that requests and data take through OS layers and through various machines

• Reference implementation of NASD for Linux 2.2, including NASD device code that runs on a workstation or PC masquerading as a storage subsystem or disk drive

• NFS-like distributed file system that uses NASD subsystems or devices

• NASD striping middleware for large striped files

Page 3

Figure 1 -- NetSCSI and NASD

• Figure 1 outlines the data path where clients ask for data and servers forward the request to storage -- the forwarded request is a DMA command to return data directly to the client.

– When the DMA is complete, status is returned to the server, collected, and forwarded to the client

• NASD (the client-drive interaction is sketched in the code below):

– On first access, client contacts server for access checks

– Server grants reusable rights or capabilities

– Clients then present requests directly to storage

– Storage verifies capabilities and directly replies
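Putting the two paths side by side, the NASD pattern looks roughly like the following C sketch. The types and calls here (cap_args_t, capability_t, nasd_fm_lookup, nasd_read) are hypothetical stand-ins for the prototype's RPC interfaces, with placeholder bodies so the example runs:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the real NASD capability types. */
typedef struct { uint64_t object_id; uint32_t rights; uint32_t version; } cap_args_t;
typedef struct { cap_args_t args; uint8_t key[20]; } capability_t;

/* Placeholder for the file-manager RPC: performs access checks once and
 * returns a reusable capability for the object backing `path`. */
static int nasd_fm_lookup(const char *path, capability_t *cap)
{
    (void)path;
    memset(cap, 0, sizeof(*cap));
    cap->args.object_id = 42;   /* made-up object id */
    cap->args.rights    = 0x1;  /* read right        */
    return 0;
}

/* Placeholder for the drive RPC: the drive verifies the capability and
 * replies with object data directly to the client. */
static int nasd_read(const capability_t *cap, uint64_t off, void *buf, size_t len)
{
    (void)cap; (void)off;
    memset(buf, 0, len);
    return 0;
}

int main(void)
{
    capability_t cap;
    char buf[4096];

    /* First access: one synchronous trip to the file manager. */
    if (nasd_fm_lookup("/nasd/foo", &cap) != 0)
        return 1;

    /* All later requests go straight to the drive; the file manager is
     * no longer on the data path while the capability stays valid. */
    if (nasd_read(&cap, 0, buf, sizeof(buf)) != 0)
        return 1;

    puts("read completed via direct client-drive transfer");
    return 0;
}
```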

Page 4

NASD Interface

• Read, write object data

• Read, write object attributes

• Create, resize, remove soft partitions

• Construct copy-on-write version of object

• Logical version number on file can be changed by file manager to revoke capability
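The operation set above could be sketched as a C header along these lines; every name and signature here is illustrative, not the prototype's actual API:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative NASD object attributes (field names are assumptions). */
typedef struct {
    uint64_t object_len;       /* current object size in bytes          */
    uint64_t modify_time;      /* last-modify timestamp                 */
    uint32_t version;          /* logical version number; changing it
                                  revokes outstanding capabilities      */
    uint8_t  fs_specific[256]; /* uninterpreted, filesystem-defined     */
} nasd_attr_t;

/* Read and write object data. */
int nasd_read (uint64_t partition, uint64_t object, uint64_t off, size_t len, void *buf);
int nasd_write(uint64_t partition, uint64_t object, uint64_t off, size_t len, const void *buf);

/* Read and write object attributes. */
int nasd_getattr(uint64_t partition, uint64_t object, nasd_attr_t *attr);
int nasd_setattr(uint64_t partition, uint64_t object, const nasd_attr_t *attr);

/* Create, resize, and remove soft partitions. */
int nasd_part_create(uint64_t partition, uint64_t size);
int nasd_part_resize(uint64_t partition, uint64_t new_size);
int nasd_part_remove(uint64_t partition);

/* Construct a copy-on-write version of an object. */
int nasd_cow_version(uint64_t partition, uint64_t object, uint64_t *new_object);
```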

Page 5

NASD Security

• Security protocol:

– The capability has a public portion, CapArg, and a private key, CapKey

– CapArg specifies what rights are being granted for which object

– CapKey is a keyed message digest of CapArg, computed with a secret key shared only between the file manager and the target drive

– Client sends CapArg with each request and generates a CapKey-keyed digest of the request parameters and CapArg

– Each drive knows its secret keys and receives CapArg with each request

– Can compute client’s CapKey and verify request

– If any field of CapArg or request has been changed, digest comparison will fail

– Scheme protects integrity of requests but does not protect privacy of data
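The digest check described above can be sketched with an ordinary keyed message digest. The example below uses OpenSSL's HMAC-SHA1 purely for illustration; the struct layouts, field names, and the choice of digest are assumptions, not the prototype's actual wire format:

```c
/* cc nasd_cap.c -lcrypto */
#include <openssl/evp.h>
#include <openssl/hmac.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { uint64_t object_id; uint32_t rights; uint32_t version; } cap_args_t;
typedef struct { uint64_t object_id; uint64_t offset; uint64_t length; } request_t;

/* File manager: CapKey = HMAC(drive_secret, CapArg). */
static void make_capkey(const uint8_t *drive_secret, size_t secret_len,
                        const cap_args_t *args, uint8_t capkey[EVP_MAX_MD_SIZE])
{
    unsigned int len = 0;
    HMAC(EVP_sha1(), drive_secret, (int)secret_len,
         (const uint8_t *)args, sizeof(*args), capkey, &len);
}

/* Client: per-request digest = HMAC(CapKey, request || CapArg). */
static void make_request_digest(const uint8_t *capkey, size_t capkey_len,
                                const request_t *req, const cap_args_t *args,
                                uint8_t digest[EVP_MAX_MD_SIZE])
{
    uint8_t msg[sizeof(*req) + sizeof(*args)];
    unsigned int len = 0;
    memcpy(msg, req, sizeof(*req));
    memcpy(msg + sizeof(*req), args, sizeof(*args));
    HMAC(EVP_sha1(), capkey, (int)capkey_len, msg, sizeof(msg), digest, &len);
}

int main(void)
{
    const uint8_t secret[] = "drive-secret-shared-with-file-manager";
    cap_args_t args = { .object_id = 42, .rights = 0x1, .version = 7 };
    request_t  req  = { .object_id = 42, .offset = 0, .length = 4096 };
    uint8_t capkey[EVP_MAX_MD_SIZE], drive_capkey[EVP_MAX_MD_SIZE];
    uint8_t client_digest[EVP_MAX_MD_SIZE], drive_digest[EVP_MAX_MD_SIZE];

    /* File manager issues the capability: CapArg in the clear, CapKey private. */
    make_capkey(secret, sizeof(secret) - 1, &args, capkey);

    /* Client signs its request with CapKey and sends (request, CapArg, digest). */
    make_request_digest(capkey, 20, &req, &args, client_digest);

    /* Drive recomputes CapKey from its secret and the received CapArg, then
     * recomputes the digest; any tampering with CapArg or the request
     * makes the comparison fail. */
    make_capkey(secret, sizeof(secret) - 1, &args, drive_capkey);
    make_request_digest(drive_capkey, 20, &req, &args, drive_digest);

    printf("request %s\n",
           memcmp(client_digest, drive_digest, 20) == 0 ? "verified" : "rejected");
    return 0;
}
```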

Page 6

Filesystems for NASD

• Constructed distributed file system with NFS-like semantics tailored for NASD

• Each file and directory occupies exactly one NASD object; offsets in files are the same as offsets in objects

• File length, last file modify time correspond directly to NASD-maintained object attributes

• Remainder of file attributes stored in uninterpreted section of object’s attributes

• Data-moving operations (read, write) and attribute reads (getattr) are sent directly to the NASD drive

– file attributes are either computed from NASD object attributes (e.g. modify times and object size) or stored in the uninterpreted filesystem-specific attribute

• Other requests are handled by file manager

• Capabilities are piggybacked on file manager’s response to lookup operations
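A small sketch of this attribute mapping, assuming an illustrative layout in which the filesystem packs mode/uid/gid into the uninterpreted attribute region (the real prototype's encoding is not specified here):

```c
#include <stdint.h>
#include <string.h>

/* NASD-maintained object attributes (illustrative layout). */
typedef struct {
    uint64_t object_len;       /* object size  == file size           */
    uint64_t modify_time;      /* object mtime == file mtime          */
    uint8_t  fs_specific[256]; /* uninterpreted, owned by the FS      */
} nasd_attr_t;

/* NFS-style file attributes the client expects. */
typedef struct {
    uint64_t size;
    uint64_t mtime;
    uint32_t mode, uid, gid;   /* not object attributes; FS-specific  */
} file_attr_t;

/* Assumed packing of the filesystem-specific bytes. */
typedef struct { uint32_t mode, uid, gid; } fs_blob_t;

/* getattr goes straight to the drive: file attributes are either computed
 * from object attributes or pulled out of the uninterpreted blob. */
void object_to_file_attr(const nasd_attr_t *oa, file_attr_t *fa)
{
    fs_blob_t blob;
    memcpy(&blob, oa->fs_specific, sizeof(blob));
    fa->size  = oa->object_len;   /* file offset == object offset */
    fa->mtime = oa->modify_time;
    fa->mode  = blob.mode;
    fa->uid   = blob.uid;
    fa->gid   = blob.gid;
}
```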

Page 7

Access to Striped Files and Continuous Media

• NASD-optimized parallel filesystem

• Filesystem manages objects that are not directly backed by data

• They are backed by a storage manager, which redirects clients to the component NASD objects

• NASD PFS supports SIO low-level parallel filesystem interface on top of NASD-NFS files striped using user-level Cheops middleware

• Figure 6
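A sketch of the offset arithmetic that striping middleware like Cheops has to perform: mapping a logical offset in a striped file to a component NASD object and an offset within it. The RAID-0 style layout, stripe unit, and function names are assumptions for illustration:

```c
#include <stdint.h>
#include <stdio.h>

#define STRIPE_UNIT  (64 * 1024)   /* assumed stripe unit: 64 KB        */
#define STRIPE_WIDTH 4             /* file striped across four NASDs    */

typedef struct {
    int      component;  /* which component NASD object                */
    uint64_t offset;     /* offset within that component object        */
} stripe_loc_t;

/* Map a logical offset in the striped file to (component, offset). */
static stripe_loc_t map_offset(uint64_t logical_off)
{
    uint64_t stripe_no = logical_off / STRIPE_UNIT;
    stripe_loc_t loc = {
        .component = (int)(stripe_no % STRIPE_WIDTH),
        .offset    = (stripe_no / STRIPE_WIDTH) * STRIPE_UNIT
                     + logical_off % STRIPE_UNIT,
    };
    return loc;
}

int main(void)
{
    /* A 2 MB sequential read touches every component object in turn. */
    for (uint64_t off = 0; off < 2 * 1024 * 1024; off += STRIPE_UNIT) {
        stripe_loc_t loc = map_offset(off);
        printf("file offset %8llu -> component %d, object offset %llu\n",
               (unsigned long long)off, loc.component,
               (unsigned long long)loc.offset);
    }
    return 0;
}
```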

Page 8

Garth A. Gibson, David F. Nagle, Khalil Amiri, Jeff Butler, Fay W. Chang, Howard Gobioff, Charles Hardin, Erik Riedel, David Rochberg, and Jim Zelenka, "A Cost-Effective, High-Bandwidth Storage Architecture", Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2-7, 1998, San Jose, CA, USA, pages 92-103.

Page 9

Evolution of storage architectures

• Local Filesystem -- Simplest case: application, file management, concurrency control, and low-level storage management are all aggregated on one machine. Data makes one trip over a peripheral area network such as SCSI. Disks offer a fixed-size block abstraction

• Distributed Filesystem -- Intermediate server machine is introduced. Server offers simple file access interface to clients.

• Distributed Filesystem with RAID controller -- Interposes another computer, the RAID controller, between the server and the disks.

• Distributed Filesystem that employs DMA -- Can arrange to DMA data directly to clients rather than copying it through the server. HPSS is an example (although this is not how it is usually employed).

• NASD-based DFS, NASD-Cheops-based DFS

Page 10

Principles of NASD

• Direct transfer -- data moved between drive and client without indirection or store-and-forward through file server

• Asynchronous oversight -- Ability of client to perform most operations without synchronous appeal to the file manager

• Cryptographic integrity -- Drives ensure that commands and data have not been tampered with by generating and verifying cryptographic keyed digests

• Object-based interface -- Drives export variable-length objects instead of fixed-size blocks. This gives drives direct knowledge of the relationships between disk blocks and minimizes security overhead.

Page 11

Prototype Implementation

• NASD prototype drive runs on a 133 MHz, 64 MB DEC Alpha 3000/400 with two Seagate ST52160 disks attached by two 5 MB/s SCSI busses

• Intended to simulate a controller and drive

• NASD system implements its own internal object access, cache, and disk space management modules

• Figure 6 -- Performance for sequential reads and writes

– Sequential bandwidth as a function of request size

– NASD better tuned for disk access on reads that miss cache

– FFS better tuned for cache accesses

– Write performance of FFS due to immediate acknowledgement for writes up to 64KB

Page 12

Scalability

• 13 NASD drives, each linked by OC-3 ATM, and 10 client machines

• Each client issues series of sequential 2MB read requests striped across four NASDs.

• Each NASD can deliver 32MB/s from cache to RPC protocol stack

• DCE RPC cannot push more than 80 Mb/s through a 155 Mb/s ATM link before the receiving client saturates

• Figure 7 demonstrates close to linear scaling up to 10 clients
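A rough sanity check on these numbers: 80 Mb/s is about 10 MB/s per client, so ten clients top out near 100 MB/s in aggregate. Spread over 13 drives, that is well under each drive's 32 MB/s cache bandwidth, so the per-client link, not the drives, is the limiting resource -- consistent with the near-linear scaling in Figure 7.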

Page 13

Computational Requirements

• Table 1 -- number of instructions needed to service a given request size, including all communications (DCE RPC, UDP/IP)

• Overhead mostly due to communications

• Significantly more expensive than Seagate Barracuda

Page 14

Filesystems for NASD

• NFS covered in last paper

• AFS -- lookup operations carried out by parsing directory files locally

• AFS RPCs added to obtain and relinquish capabilities explicitly

• AFS’s sequential consistency provided by breaking callbacks (notifying holders of potentially stale copies) when a write capability is issued

• File manager doesn't know that a write operation has arrived at a drive, so it must tell clients when a write may occur

• No new callbacks on file with outstanding write capability

• AFS enforces per-volume quota on allocated disk space

• File manager allocates space when it issues a capability, and it keeps track of how much is actually written
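A sketch of the quota bookkeeping this implies on the file manager side; the structures and functions below are hypothetical, intended only to show space being reserved when a write capability is issued and reconciled once the actual write size is known:

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-volume quota state kept by the file manager (hypothetical). */
typedef struct {
    uint64_t quota_bytes;     /* per-volume quota                        */
    uint64_t reserved_bytes;  /* space promised via write capabilities   */
    uint64_t used_bytes;      /* space actually written to objects       */
} volume_quota_t;

/* Issue a write capability: reserve the space it could consume and
 * break callbacks so other clients drop potentially stale copies. */
bool issue_write_capability(volume_quota_t *vq, uint64_t max_write_bytes)
{
    if (vq->used_bytes + vq->reserved_bytes + max_write_bytes > vq->quota_bytes)
        return false;                      /* over quota: refuse capability */
    vq->reserved_bytes += max_write_bytes; /* remember the promise          */
    /* break_callbacks(object); -- notify holders of cached copies (omitted) */
    return true;
}

/* Later, when the file manager learns how much was actually written
 * (e.g. from the object's NASD-maintained size attribute), reconcile. */
void reconcile_write(volume_quota_t *vq, uint64_t reserved, uint64_t actually_written)
{
    vq->reserved_bytes -= reserved;
    vq->used_bytes     += actually_written;
}
```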

Page 15

Active Disks

• Provide full application-level programmability of drives

• Customize functionality for data intensive computations

• NASD’s object-based interface provides knowledge of data at the devices without having to use external metadata