© 2010 IBM Corporation
IBM Research
9P Overview
Eric Van Hensbergen, IBM Austin Research Lab ([email protected])
Agenda
• Historical Background (Plan 9 & Inferno)
• 9P Protocol Basics
• Extensions
• Linux Client Code Overview
Historical Background
• Plan 9 from Bell Labs was a distributed operating system developed as a successor to UNIX, starting in the mid-1980s.
• The primary motivation for Plan 9 was to rethink operating systems in light of pervasive networking (networking was added as an afterthought to the original UNIX).
• Plan 9 resources were scattered across a cluster of machines, with each machine having a role (Terminal, CPU Server, Auth Server, File Server).
• Inferno was a commercial venture based on Plan 9 that provided Plan 9’s environment tightly coupled with a virtual machine, on both native and hosted (Linux, BSD, Windows) platforms.
Plan 9 Trivia
• Supported Multiple Hosts, but only 32-bit
  • x86, MIPS, Alpha, SPARC, PowerPC, ARM
• Native Support for UTF-8 from inception
• Own Tool Set (Ken Thompson’s C compilers)
• Some Kernel Stats
  • 37 syscalls
  • 178,738 lines of code amongst all ports (38k lines portable)
  • optional real-time scheduler
• User development environment primarily C and Alef
  • ANSI/POSIX Emulation environment available
• Open sourced (Lucent Public License 1.02)
Plan 9 Core Design Concepts
• All Resources Represented as File Hierarchies
  • System Resources: processes, devices, networking stack
  • System Services: DNS, Window System, Plumbing
  • Application Services: Editor Interfaces, Plumbing
• Namespaces
  • private, per-process by default
  • user manipulatable
  • bind and union directories
• Standard Communication Protocol
  • a standard protocol, 9P, used to access both local and remote resources
Implication of Design Concepts
• Since all resources were exposed as file hierarchies and remote hierarchies could be accessed via 9P
  • remote resources could be accessed as easily as local ones (audio, graphics, network) without specialized protocols for each
• Since namespaces were private and per-process
  • individual users could compose namespaces of local and remote resources, and subsequent applications could access those resources transparently
  • individual applications could do this as well without affecting other applications (each window in the window manager had its own namespace)
9P Protocol Basics
• Based around core Plan 9 System Call I/O operations
  • Local operations degrade to function calls
  • Remote operations closer to proxy operations
• Pure request/response RPC model
• Transport Independent
  • only requires reliable, in-order delivery mechanism
  • can be secured with authentication, encryption, & digesting
• By default, requests are non-cached, avoiding coherence problems and race conditions
• Design stresses keeping things simple, resulting in small and efficient clients and servers
9P Protocol Terms and Structures
• tag - numeric identifier for multiplexing operations
• fid - numeric identifier for file system entities
  • represent transient position in filesystem (directory or files)
  • also represent open files
  • transient fids can be navigated or queried for metadata; open fids can only be used for I/O operations (read, write, close)
• qids
  • qid.type: type of qid (directory, file, etc.)
  • qid.path: unique per-entity identifier
  • qid.version: monotonically increasing file version
• stat - metadata structure (directories or files)
• strings - always size prefixed
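The qid and string encodings can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the field widths (type[1] version[4] path[8], strings with a 2-byte length prefix, little-endian) are taken from the 9P2000 specification, and the names Qid, unpack_qid and pack_string are hypothetical helpers chosen for this example.

```python
import struct
from typing import NamedTuple

QTDIR = 0x80  # qid.type bit that marks a directory (per the 9P2000 specification)

class Qid(NamedTuple):
    type: int     # qid.type: directory, file, etc.
    version: int  # qid.version: file version
    path: int     # qid.path: unique per-entity identifier

def unpack_qid(buf: bytes) -> Qid:
    """Decode a 13-byte wire qid: type[1] version[4] path[8], little-endian."""
    qtype, version, path = struct.unpack("<BIQ", buf[:13])
    return Qid(qtype, version, path)

def pack_string(s: str) -> bytes:
    """Encode a 9P string: 2-byte length prefix followed by UTF-8 bytes."""
    data = s.encode("utf-8")
    return struct.pack("<H", len(data)) + data

# Example values borrowed from the 'test' directory in a v9fs trace.
raw = struct.pack("<BIQ", QTDIR, 0x48513B9D, 0x401A)
qid = unpack_qid(raw)
is_directory = bool(qid.type & QTDIR)
```

Two qids with the same path refer to the same entity on the server; a changed version suggests the file was modified in between.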
9P Basics: Protocol Overview
General message layout: size op tag ...
size Twrite tag fid offset count data
size Rwrite tag count
tag: numeric transaction id for multiplexing
fid: numeric pointer to a path element or open file...
Protocol Specification Available: http://ericvh.github.com/9p-rfc/
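As a rough sketch of the framing shown above, the following Python packs a Twrite and unpacks an Rwrite. The field widths (size[4] type[1] tag[2] fid[4] offset[8] count[4], little-endian) and the type numbers 118/119 come from the 9P2000 specification rather than from the slide; pack_twrite and unpack_rwrite are hypothetical helper names.

```python
import struct

TWRITE, RWRITE = 118, 119  # message type numbers from the 9P2000 specification

def pack_twrite(tag: int, fid: int, offset: int, data: bytes) -> bytes:
    """Build size[4] Twrite tag[2] fid[4] offset[8] count[4] data[count]."""
    body = struct.pack("<BHIQI", TWRITE, tag, fid, offset, len(data)) + data
    # size counts the whole message, including the 4-byte size field itself
    return struct.pack("<I", 4 + len(body)) + body

def unpack_rwrite(msg: bytes):
    """Parse size[4] Rwrite tag[2] count[4]; returns (tag, count)."""
    size, op, tag, count = struct.unpack("<IBHI", msg)
    if size != len(msg) or op != RWRITE:
        raise ValueError("not a well-formed Rwrite")
    return tag, count

request = pack_twrite(tag=1, fid=5, offset=0, data=b"hello world\n")
```

The reply carries the same tag as the request, which is how a client matches responses to outstanding operations on a shared connection.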
9P Basics: Operations
Session Management
– Version: protocol version and capabilities negotiation
– Attach: user identification and session option negotiation
– Auth: user authentication enablement
– Walk: hierarchy traversal and transaction management
– Clunk: forget about a fid
Error Management
– Error: a pending request triggered an error
– Flush: cancel a pending request
Metadata Management
– Stat: retrieve file metadata
– Wstat: write file metadata
File I/O
– Create: atomic create/open
– Open, Read, Write, Close
– Directory read packaged w/read operation (Reads stat information with file list)
– Remove
version
size Tversion tag msize version
size Rversion tag msize version
Initial tag is always (ushort)~0.
msize defines the maximum length in bytes of any single 9P message.
The version string (size prefixed) must always begin with 9P. If the server doesn’t recognize the version, it responds with version=unknown and the client retries until it gets a match.
The version of 9P is specified by the 4 characters after 9P (i.e. 9P2000).
Optional extensions are specified by . specifiers (9P2000.u and 9P2000.L).
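The version exchange can be sketched as follows. This is an illustrative Python fragment under stated assumptions: field widths and type numbers (100/101) follow the 9P2000 specification, and pack_tversion, unpack_rversion and negotiate are hypothetical names, not part of any existing client.

```python
import struct

TVERSION, RVERSION = 100, 101  # type numbers from the 9P2000 specification
NOTAG = 0xFFFF                 # (ushort)~0, the tag used for version messages

def pack_str(s: str) -> bytes:
    data = s.encode("utf-8")
    return struct.pack("<H", len(data)) + data

def pack_tversion(msize: int, version: str) -> bytes:
    """Build size[4] Tversion tag[2] msize[4] version[s]."""
    body = struct.pack("<BHI", TVERSION, NOTAG, msize) + pack_str(version)
    return struct.pack("<I", 4 + len(body)) + body

def unpack_rversion(msg: bytes):
    """Parse size[4] Rversion tag[2] msize[4] version[s]."""
    size, op, tag, msize, slen = struct.unpack_from("<IBHIH", msg, 0)
    if op != RVERSION:
        raise ValueError("not an Rversion")
    return msize, msg[13:13 + slen].decode("utf-8")

def negotiate(offered_msize: int, reply: bytes):
    """Return (msize, version) to use, or None if the server answered 'unknown'."""
    msize, version = unpack_rversion(reply)
    if version == "unknown":
        return None  # caller may retry with a different version string
    return min(offered_msize, msize), version

request = pack_tversion(8192, "9P2000.u")
```

A client that gets None back would typically retry with a plainer version string, for example falling back from 9P2000.u to 9P2000.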
auth
size Tauth tag afid uname aname
size Rauth tag aqid
The client selects an afid to represent the authentication channel for a particular user (identified by uname) and attach parameter (aname).
The auth protocol is not defined by 9P; once it is complete, the afid is presented in a subsequent attach message. The same validated afid may be used for multiple messages with the same uname and aname.
attach
size Tattach tag fid afid uname aname
size Rattach tag qid
Serves as an introduction from the user to the server.
fid is chosen initially by the client
uname identifies the user to the server
aname identifies an attach parameter (optional)
afid identifies a previously negotiated authentication channel (set to (u32int)~0 if the client doesn’t wish to authenticate)
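A minimal sketch of an unauthenticated attach, assuming the base 9P2000 layout size[4] Tattach tag[2] fid[4] afid[4] uname[s] aname[s] from the specification (the 9P2000.u variant adds a numeric uid field that is not shown here). The helper name pack_tattach is hypothetical.

```python
import struct

TATTACH = 104        # type number from the 9P2000 specification
NOFID = 0xFFFFFFFF   # (u32int)~0: "no authentication channel"

def pack_str(s: str) -> bytes:
    data = s.encode("utf-8")
    return struct.pack("<H", len(data)) + data

def pack_tattach(tag: int, fid: int, uname: str, aname: str = "",
                 afid: int = NOFID) -> bytes:
    """Build a Tattach; afid defaults to NOFID, meaning no authentication."""
    body = struct.pack("<BHII", TATTACH, tag, fid, afid)
    body += pack_str(uname) + pack_str(aname)
    return struct.pack("<I", 4 + len(body)) + body

# The client picks the fid (2 here); the server will bind it to the tree root.
request = pack_tattach(tag=0, fid=2, uname="ericvh")
```

After a successful Rattach, the chosen fid refers to the root of the tree selected by aname and can be used as the starting point for walks.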
flush
size Tflush tag oldtag
size Rflush tag
Flush is sent to server to cancel an outstanding operation (specified by oldtag)
Server always sends Rflush.
It is permitted for the server to have already sent a response and still send Rflush.
If the client receives a response before Rflush, it must honor the response.
It is also permitted to Flush a Flush; the server must handle flush requests in order.
A tag may not be reused until all Rflush messages have returned.
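The tag reuse rule can be illustrated with a small client-side bookkeeping sketch. The TagTable class below is hypothetical and only models the rules stated above for one connection: a response that arrives before the Rflush is still honored, and the flushed tag stays reserved until every Rflush that names it has returned. It does not model the wire format or a Tflush whose oldtag is itself a flush.

```python
class TagTable:
    """Tracks which tags are busy on one 9P connection (illustrative only)."""

    def __init__(self):
        self.busy = set()   # tags that must not be reused yet
        self.flushes = {}   # tag of each pending Tflush -> oldtag it cancels

    def send(self, tag):
        if tag in self.busy:
            raise ValueError(f"tag {tag} is still in use")
        self.busy.add(tag)

    def send_flush(self, flush_tag, oldtag):
        self.send(flush_tag)
        self.flushes[flush_tag] = oldtag

    def receive(self, tag):
        """Record a reply; return True if it is a normal response to process."""
        if tag in self.flushes:                      # this reply is an Rflush
            oldtag = self.flushes.pop(tag)
            self.busy.discard(tag)
            if oldtag not in self.flushes.values():  # last Rflush for oldtag
                self.busy.discard(oldtag)
            return False
        if tag in self.flushes.values():             # response beat the Rflush
            return True                              # honor it, keep tag busy
        self.busy.discard(tag)
        return True

table = TagTable()
table.send(1)
table.send_flush(2, oldtag=1)
honored = table.receive(1)   # response arrives first and must be honored
still_busy = 1 in table.busy
table.receive(2)             # Rflush arrives; tag 1 becomes reusable
```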
error
size Rerror tag ename
Rerror sent in response to report errors on other operations.
Plan 9 errors returned as strings from the server.
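A client usually turns an Rerror into a native exception. The sketch below assumes the base 9P2000 layout size[4] Rerror tag[2] ename[s] with type number 107 from the specification; NinePError and raise_if_error are hypothetical names.

```python
import struct

RERROR = 107  # type number from the 9P2000 specification

class NinePError(Exception):
    """Carries the tag of the failed request and the server's error string."""
    def __init__(self, tag: int, ename: str):
        super().__init__(f"tag {tag}: {ename}")
        self.tag = tag
        self.ename = ename

def raise_if_error(msg: bytes) -> None:
    """Raise NinePError if msg is an Rerror; otherwise do nothing."""
    size, op, tag = struct.unpack_from("<IBH", msg, 0)
    if op != RERROR:
        return
    (slen,) = struct.unpack_from("<H", msg, 7)
    ename = msg[9:9 + slen].decode("utf-8")
    raise NinePError(tag, ename)
```

Because base 9P errors are strings, a client on a non-Plan 9 system has to map them to local error codes itself, which is one of the gaps the Unix extensions address with error IDs.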
walk - fid creation and navigation
size Twalk tag fid newfid nwname wname ...
size Rwalk tag nwqid qid ...
New fids are created by a walk with no name arguments (nwname=0); this is also known as a ‘clone’ operation for historical reasons.
Walks with fid=newfid move the fid around the fs hierarchy, following the path specified by the nwname wname(s).
Walks can both create and navigate fids (newfid is navigated).
Partial path resolution failures return nwqid < nwname (with qids for the path elements walked successfully).
dot-dot (..) and dot (.) are treated specially, meaning the parent directory and the current directory respectively.
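The clone and partial-walk behavior can be sketched like this. Field widths, the type numbers 110/111 and the 16-element limit are taken from the 9P2000 specification; pack_twalk, unpack_rwalk and walk_complete are hypothetical helpers for illustration.

```python
import struct

TWALK, RWALK = 110, 111  # type numbers from the 9P2000 specification
MAXWELEM = 16            # most names one Twalk may carry, per the specification

def pack_str(s: str) -> bytes:
    data = s.encode("utf-8")
    return struct.pack("<H", len(data)) + data

def pack_twalk(tag: int, fid: int, newfid: int, names) -> bytes:
    """Build size[4] Twalk tag[2] fid[4] newfid[4] nwname[2] nwname*(wname[s])."""
    if len(names) > MAXWELEM:
        raise ValueError("too many path elements for one Twalk")
    body = struct.pack("<BHIIH", TWALK, tag, fid, newfid, len(names))
    body += b"".join(pack_str(n) for n in names)
    return struct.pack("<I", 4 + len(body)) + body

def unpack_rwalk(msg: bytes):
    """Parse Rwalk; returns (tag, [(type, version, path), ...])."""
    size, op, tag, nwqid = struct.unpack_from("<IBHH", msg, 0)
    if op != RWALK:
        raise ValueError("not an Rwalk")
    qids = [struct.unpack_from("<BIQ", msg, 9 + 13 * i) for i in range(nwqid)]
    return tag, qids

def walk_complete(names, qids) -> bool:
    """True if every requested element was walked (nwqid == nwname)."""
    return len(qids) == len(names)

clone = pack_twalk(tag=0, fid=4, newfid=5, names=[])        # nwname=0: clone
descend = pack_twalk(tag=0, fid=1, newfid=3, names=["test"])
```

When walk_complete returns False, the number of returned qids tells the client which path element failed to resolve.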
clunk - fid reclamation
size Tclunk tag fid
size Rclunk tag
sent when a fid is no longer needed, client may reuse fid as a newfid for other operations
even if clunk returns an error, fid is no longer valid
typically invoked on a close, but also invoked when a transient reference is no longer needed
Entity Operations
• Create, Open, Read, Write, Remove, Stat, Wstat
  • basically what you would think
• Create functions as atomic create/open operation
  • Plan 9 has special open modes for exclusive access, append only, and temporary files.
• No special dirread function, just open & read directory
  • returns integral number of stat structures, one for every file in the directory
• Rename within directory accomplished with Wstat
  • non-directory renames are non-atomic
• Read/Write include offsets in operation
• Wstat can selectively set attributes by using “don’t touch” flag values
9P Packet Trace (from v9fs)
<<< (0x8055650) Tattach tag 0 fid 2 afid -1 uname aname nuname 266594
>>> (0x8055650) Rattach tag 0 qid (0000000000000002 48513969 'd')
<<< (0x8055650) Twalk tag 0 fid 1 newfid 3 nwname 1 'test'
>>> (0x8055650) Rwalk tag 0 nwqid 1 (000000000000401a 48613b9d 'd')
<<< (0x8055650) Tstat tag 0 fid 3
>>> (0x8055650) Rstat tag 0 'test' 'ericvh' 'root' '' q (000000000000401a 48513b9d 'd') m d777 at 1213278479 mt 1213283229 l 0 t 0 d 0 ext ''
<<< (0x8055650) Twalk tag 0 fid 3 newfid 4 nwname 1 'hello.txt'
>>> (0x8055650) Rwalk tag 0 nwqid 1 (000000000000401b 4851379d '')
<<< (0x8055650) Tstat tag 0 fid 4
>>> (0x8055650) Rstat tag 0 'hello.txt' 'ericvh' 'ericvh' '' q (000000000000401b 4851379d '') m 644 at 1213283229 mt 1213283229 l 12 t 0 d 0 ext ''
<<< (0x8055650) Twalk tag 0 fid 4 newfid 5 nwname 0
>>> (0x8055650) Rwalk tag 0 nwqid 0
<<< (0x8055650) Topen tag 0 fid 5 mode 0
>>> (0x8055650) Ropen tag 0 (000000000000401b 4851379d '') iounit 0
<<< (0x8055650) Tstat tag 0 fid 4
>>> (0x8055650) Rstat tag 0 'hello.txt' 'ericvh' 'ericvh' '' q (000000000000401b 4851379d '') m 644 at 1213283229 mt 1213283229 l 12 t 0 d 0 ext ''
<<< (0x8055650) Tread tag 0 fid 5 offset 0 count 8192
>>> (0x8055650) Rread tag 0 count 12 data 68656c6c 6f20776f 726c640a
<<< (0x8055650) Tread tag 0 fid 5 offset 12 count 8192
>>> (0x8055650) Rread tag 0 count 0 data
<<< (0x8055650) Tclunk tag 0 fid 5
>>> (0x8055650) Rclunk tag 0
<<< (0x8055650) Tclunk tag 0 fid 4
>>> (0x8055650) Rclunk tag 0
<<< (0x8055650) Tclunk tag 0 fid 3
>>> (0x8055650) Rclunk tag 0
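To read the trace, it helps to decode the hex payload of the first Rread. The short Python sketch below does only that; decode_rread_data is a hypothetical helper name, and the hex words are copied from the trace.

```python
def decode_rread_data(hex_words: str) -> bytes:
    """Turn the space-separated hex words printed by the trace into raw bytes."""
    return bytes.fromhex(hex_words)

payload = decode_rread_data("68656c6c 6f20776f 726c640a")
text = payload.decode("ascii")
```

The payload is 12 bytes long, matching both the count in the Rread and the length (l 12) reported by Rstat for hello.txt. The second Tread at offset 12 returns count 0, which is how the client learns it has reached end of file.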
Extension Models
• Extend arguments to existing operations to accommodate non-Plan 9 environments
• Provide a single extension operation which encapsulates any extended protocol operations
• Provide a set of complementary operations which provide any extensions (including extensions which are semantic changes to existing operations)
• Provide synthetic file system interfaces which exist either within the hierarchy or within an alternate aname mount
  • can be provided either by the primary server or through a secondary server mounted underneath
Unix Extensions (9P2000.u)
• Existing Support:
  • UID/GID support
  • Error ID support
  • Stat mapping
  • Permissions mapping
  • Symbolic and Hard Links
  • Device Files
• All accomplished via optional extended arguments to existing operations and an extended Stat structure
Future Work: .L extension series
• The 9P protocol is a network mapping of the Plan 9 file system API
• Many mismatches with Linux/POSIX
• Existing .u extension model is clunky
• Developing a more direct mapping to Linux VFS
  • New opcodes which match VFS API
  • Linux native data formats (stat, permissions, etc.)
  • Direct support of extended attributes, locking, etc.
• Should be able to co-exist with legacy 9P and 9P2000.u protocols and servers.
9P Client/Server Support
• Comprehensive list: http://9p.cat-v.org/implementations
  • C, C#, Python, Ruby, Java, TCL, Limbo, Lisp, OCaml, Scheme, PHP and JavaScript
• FUSE Clients (for Linux, BSD, and Mac)
• Native Kernel Support for OpenBSD
• Windows support via Rangboom proprietary client
• Inferno supports native 9P (aka Styx)
• Simple server library available (libixp) (9P2000 only)
• 9P2000.u available in spfs (single threaded) and npfs (multi-threaded)
• golang client and server now available
9P in the Linux Kernel
• Since 2.6.14
• Small Client Code Base
  • include/net/9p - global definitions and interface files
  • fs/9p: VFS Interface ~1500 lines of code
  • net/9p
    • Core: Protocol Handling ~2500 lines of code
    • FD Transport (sockets, etc.): ~1100 lines of code
    • Virtio Transport: ~300 lines of code
    • RDMA Transport: ~700 lines of code
• Small Server Code Base
  • Spfs (standard userspace server): ~7500 lines of code
  • Current KVM-qemu patch: ~1500 lines
9P Linux Kernel Debug
• Enable debug for client side trace (-o debug=0xffff turns all on)
  • 0x001 - display verbose error messages (via syslog)
  • 0x002 - used for more verbose granular debug
  • 0x004 - 9p trace
  • 0x008 - VFS trace
  • 0x010 - marshalling debug
  • 0x020 - RPC debug
  • 0x040 - transport specific debug
  • 0x080 - allocation debug
  • 0x100 - display protocol message debug
  • 0x200 - display FID debug
  • 0x400 - display packet debug
  • 0x800 - display fscache tracing debug
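The debug value is a bitmask, so individual traces can be combined by OR-ing the flags listed above. The sketch below shows the arithmetic in Python; the flag names in the Debug class are hypothetical labels chosen for this example, not the kernel's own identifiers.

```python
from enum import IntFlag

class Debug(IntFlag):
    """Bit values from the list above; member names are illustrative only."""
    ERROR = 0x001      # verbose error messages (via syslog)
    VERBOSE = 0x002    # more verbose granular debug
    TRACE_9P = 0x004   # 9p trace
    VFS = 0x008        # VFS trace
    MARSHAL = 0x010    # marshalling debug
    RPC = 0x020        # RPC debug
    TRANSPORT = 0x040  # transport specific debug
    ALLOC = 0x080      # allocation debug
    MESSAGE = 0x100    # protocol message debug
    FID = 0x200        # FID debug
    PACKET = 0x400     # packet debug
    FSCACHE = 0x800    # fscache tracing debug

# Errors plus the 9p and VFS traces.
mask = Debug.ERROR | Debug.TRACE_9P | Debug.VFS
option = f"debug={int(mask):#x}"

# Every flag listed above, combined.
everything = sum(int(flag) for flag in Debug)
```

The resulting option string is what would be passed after -o at mount time. The listed flags together cover 0xfff, so the 0xffff shown on the slide simply sets some unused bits as well.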
v9fs access modes
• access=user
  • new attach every time a new user tries to access the file system
• access=<uid>
  • single attach and only allows uid=<uid> to access
• access=any
  • single attach and allows all users to access with rights of user who performed initial attach
v9fs transport options
• trans_fd module
  • tcp: normal socket operations
  • unix: mount a named pipe
  • fd: use passed file descriptors for connection (rfdno, wfdno)
• virtio: use virtio channel
• rdma: use infiniband RDMA
v9fs cache modes
• Default is no cache
• cache=loose
  • no attempts are made at consistency, intended for exclusive access, read-only mounts
  • fids aren’t generally clunked in order to hold reference to files
• cache=fscache
  • use FS-Cache for persistent, read-only cache backend
  • EXPERIMENTAL. Hasn’t been fully tested.
• Other options possible in future including path caches (dentry cache) and/or temporal based cache with semantics similar to other distributed file systems.
v9fs other options
• port=<port> - specify TCP port
• uname=<user> - specify user to initially mount as
• aname=<name> - attach argument
• maxdata=<n> - specify maximum single packet size
• noextend - only use vanilla protocol (no .u)
• dfltuid - specify default uid to mount as (.u)
• dfltgid - specify default gid to mount as (.u)
• afid - specify a security channel (only valid for fd transport)
• nodevmap - no special files, make any special files look normal
• cachetag - optional persistent tag signature
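Mount options are passed as a single comma-separated string. As a hedged sketch, the Python below assembles such a string and the corresponding mount argument list without running it. The helper names mount_options and mount_argv, the server address and the mount point are all made up for this example; trans= is assumed here as the v9fs option that selects one of the transports described earlier, and the other keys are taken from the option slides.

```python
def mount_options(**opts) -> str:
    """Join options into 'key=value,flag' form; True means a bare flag."""
    parts = []
    for key, value in opts.items():
        if value is True:
            parts.append(key)                 # e.g. noextend
        elif value is not None and value is not False:
            parts.append(f"{key}={value}")    # e.g. port=5670
    return ",".join(parts)

def mount_argv(server: str, mountpoint: str, options: str):
    """Argument list for a 9p mount (not executed here)."""
    return ["mount", "-t", "9p", "-o", options, server, mountpoint]

# Hypothetical server address and mount point, for illustration only.
opts = mount_options(trans="tcp", port=5670, uname="ericvh",
                     access="user", noextend=True, cachetag=None)
argv = mount_argv("192.0.2.10", "/mnt/9p", opts)
```

Keeping the options in one place like this makes it easy to vary a single setting, for example the cache mode or debug mask, between test runs.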
Typical Regression Process
• Simple mount against spfs file server
• Test with short set of Linux file system benchmarks
  • fsx -N 1000 -R -W testfile
  • echo run | postmark
  • bonnie -s 1
  • dbench -t 60 4
9p server operation
• spfs/npfs: (9P2000.u)
  • ufs -p 5670 -s
    • -p specifies port number
    • -s specifies single user (whoever is running spfs)
    • can also pass -d to see server side trace
    • if using npfs, specify -w to limit number of threads
• patched kvm-qemu (for virtio transport)
  • kvm <other_args> -share /
    • tells kvm to share / over virtio channel to guest
Code Style and Development Goal
• Stick to Linux Coding Style Guidelines (of course)
• Keep It Simple
  • short names
  • limit any use of macro definitions or conditionals (#ifdef)
  • extensions should be kept optional
  • any cache extensions should be kept optional (configurable at mount time)
• send patches for review on:
  • [email protected]
• bug tracking for client on bugzilla.kernel.org
• protocol documentation/updates to
  • http://github.com/ericvh/9p-rfc
Code Review
• http://lxr.linux.no/linux/include/net/9p/
• http://lxr.linux.no/linux/fs/9p/
• http://lxr.linux.no/linux/net/9p/