CS 443 Advanced OS Fabián E. Bustamante, Spring 2005 Lightweight Remote Procedure Call B. Bershad,...
-
Upload
ashlynn-howard -
Category
Documents
-
view
213 -
download
0
Transcript of CS 443 Advanced OS Fabián E. Bustamante, Spring 2005 Lightweight Remote Procedure Call B. Bershad,...
CS 443 Advanced OS
Fabián E. Bustamante, Spring 2005
Lightweight Remote Procedure Call
B. Bershad, T. Anderson, E. Lazowska and H. Levy
U. Of Washington
Appears in SOSP 1989
Presented by: Fabián E. Bustamante
2
Introduction
Granularity of protection mech. used by an OS has significant impact on system’s design & useCapability systems– Fine-grained protection – object exists in its own protection domain– All object live within a single name or address space– A process in one domain can act on an object in another only
through a protected procedure call– Parameter passing simplified by existence of global name space
containing all objects– Problems w/ efficient implementations
In distributed computing, large-grained protection mechanisms– RPC facilitates placement of subsystems in different machines– Absence of global address space ameliorated by automatic stub
generators & sophisticated run-time libraries– Widely used, efficient and convenient model
3
Observation
Small kernel OSs borrows large-grained protection & programming models from distributed computing– Separate component placed in disjoint domains
– Messages used for all inter-domain communication
But, also adopt their control transfer & comm. model– Independent threads exchanging msgs. containing potentially large,
structured values
– However, common case – most comm. in an OS are …• Cross-domain (bet/ domains on same machine) instead of cross-
machine• Simple because complex data structures are concealed behind abstract
system interfaces
– Thus model violates the common case → low performance or bad modularity, commonly the latter
Handle normal and worst cases separately as a rule, because the requirements for the two arequite different: The normal case must be fast. The worst case must make some progress. B. Lampson, “Hints for computer system design.”
4
Motivation
Use & performance of RPC (inside the OS)
Frequency of cross-machine activity– Systems examined
• The V System– Highly decomposed system – all through msg. passing
• Tao, the Firefly OS– Middle-sized kernel responsible for VM, sched. and device
access; rest access through RPC (FS, network protocols, ...)
• Unix/NSF– All local system functions accessed through kernel traps, RPC for
communication w/ FS
Operating System
Operations that cross mach boundaries (%)
V 3%
Taos 5.3%
Sun Unix+NFS 0.6%
5
Motivation
Parameter size & complexity – based on static & dynamic analysis of SRC RPC usage in Taos OS
28 RPC services w/ 366 procedures
Over 4 days and 1.5million cross-domain procedure calls– 95% calls went to 10/112 procedures– 75% to 3/112– Number of bytes transfer - majority < 200B– No data types were recursively defined
6
Motivation
The performance of cross-domain RPC (times in µsec)
Null procedure – void Null() { return;}– Theoretical minimum as a cross-domain operation
• One procedure call• Kernel trap & change of processor’s VM context on call• Kernel trap and context change on return
– Anything above this is overhead
System Processor Null (theoretical min.)
Null (actual) Overhead
Accent PERQ 444 2300 1856
Taos Firefly C-VAX 109 464 355
Mach C-VAX 90 754 664
V 68020 170 730 560
Amoeba 68020 170 800 630
DASH 68020 170 1590 1420
7
Motivation
Overhead – where’s the time going?– Stub overhead – diff. bet/ cross-domain & cross-machine call hidden by
lower layers → general but infrequently needed e.g. 70µsec. to run null stub– Message buffer overhead – message exchange bet/ client/server → 4
copies (through kernel) on call/return (alloc & copy)– Access validation – kernel needs to validate sender both ways– Message transfer – queue/de-queue of msg. – Scheduling – while user sees one abstract thread, there’s a thread per
domain and needs to be handled– Context switch – from client to server and back– Dispatch – receiver thread in server must interpret msg. & dispatch thread
Some optimizations tried– DASH avoids kernel copy by allocating msg. out of a region mapped in both
kernel & user domains– Mach & Taos rely on hand-off scheduling to bypass general sched.– Some systems pass few & small parameters in registers– SRC RPC gives up some safety w/ globally shared buffers, no access
validation, etc
8
Design & Implementation of LRPC
Execution model, programming semantic & large-grained protection model borrowed from RPC
Binding done at the granularity of i/f - a set of procedures – A server module exports an i/f
• LRPC runtime library (server clerk) registers i/f with a name server
– A client binds to the i/f by making an import call to kernel • Kernel notifies server's waiting clerk • Clerk replies with list of PD (procedure descriptor):
– One PD per procedure in the i/f » Entry address in server domain & size of A-stack for arguments and return value
• For each PD– Kernel pairwise allocates in client & server domain a # of A-stacks
» A-stacks are read-write shared by client & server, can be shared among proc.
– Kernel allocates linkage record for the A-stack
• Kernel returns to the client – a Binding Object -- unforgable certificate to access server's interface
(capability-like)
– List of A-stacks for procedures in the i/f
9
Design & Implementation of LRPC
Calling– Client calls user-stub
• Stub puts arguments into A-stack given by kernel at binding • Stub places binding object, A-stack address & proc. id into registers &
traps to kernel • Kernel executes in the context of client thread
– Verifies Binding Object, A-stack & locates linkage record – Puts return address & the current stack pointer into linkage record – Finds an E-stack (Execution stack) of the server – new or from pool– Updates thread's user stack pointer to run off of server's E-stack - Note
that thread is client thread – Reloads processor's VM registers with those of server domain – Performs upcall into server-stub
– Server-stub • Calls the server, which executes with the A-stack and E-stack • When the server returns, trap to the kernel • Kernel does the light weight context switch back to the client address
space – Client-stub again
• Reads the return value from the A-stack • Returns the result to the client
10
Design & Implementation of LRPC
Stub Generation– Two types of stubs automatically generated from Modula2+
definition file • Simple & fast stub in assembly language for most cases – 4x faster• Complex & general in Modular2+ for complex arguments, exception
handling, etc
LRPC on Multiprocessors– Locking mechanism is required for A-stacks
– Further reduced context switch • In single processor, light-weight context switch still incurs big
overheads: vm register updates, TLB misses• Context switch In MP - popular server's context are cached in idle
processors (domain caching)– When client calls server procedure, kernel exchanges caller's processor w/
server's
– Calling thread placed on proc.
– On return, kernel exchanges processors back
11
Design and Implementation of LRPC
Argument Copying– Conventional RPC: 4 times - user-stub/RPC msg/kernel/RPC
msg/server-stub
– LRPC: one - user-stub/A-stack
Copy operations for LRPC vs Message-based RPC
Operation LRPC Msg. passing Restricted msg. passing
Call (mutable parameters) A ABCE ADE
Call (immutable parameters) AE ABCE ADE
Return F BCF BF
Code Copy operation
D From sender/kernel space to receiver/kernel domain
E From message (or A-stack) into server stack
F From message (or A-stack) into client’s results
Code Copy operation
A From client stack to message (or A-stack)
B From sender domain to kernel domain
C From kernel domain to receiver domain
12
Evaluation
Test run on C-VAX Firefly
Null is baseline, others have “typical” parameter sizes
Each point avg. of 10k cross-domain calls
LRPC/MP uses the idle processor optimization
Test Description LRPC/MP LRPC Taos
Null The null cross-domain call 125 157 464
Add A proc. w/ 2 4B arg in & 1 4B arg. out 130 164 480
BigIn A proc. w/ in 200B arg. 173 192 539
BigInOut A proc. w/ in/out 200B arg. 219 227 636
13
Evaluation
Breakdown for the serial (1-proc) Null LRPC on a C-VAX (all in µsec)
Minimum is a timing breakdown for the theoretical minimum cross-domain call
Stub cost – 18 in client’s and 3 in server’s stub
In kernel costs go to binding, validation and linkage
25% is due to TLB misses during virtual address translation – data structures and control sequence designed to reduce it
Operation Minimum LRPC Overhead
Modula2+Procedure call 7
Two kernel traps 36
Two context switches 66
Stub 21
Kernel transfer 27
TOTAL 109 48
14
Evaluation
LRPC avoids locking shared data during call/return to remove contention on shared-memory multiprocessors
Each A-stack queue is guarded by its own lock
Figure shows the number of processors simultaneously making calls – domain caching was disabled (each call required a context switch)
4,000
6,30023,000
15
The uncommon cases
Working well in common case & acceptable in less common ones (just a few examples)Transparency & cross-machine calls– Cross-domain/machine? Early decision – first instruction in stub.
Cost of indirection is negligible in comparisonA-stacks – size and number– PD lists are defined during compilation– If size of arguments is known, A-stack size can be determined
statically, otherwise use a default size = Ethernet packet size– Beyond that, use an out-of-band mem. segment ($$ & infrequent)
Domain termination – e.g. unhandled exception or user action– All resources reclaimed by the OS
• All bindings are revoked• All threads are stopped
– If the terminating domain is a server handling a LRPC request, the outstanding call must return to the client domain
• To handle outstanding threads, you can create a new one to replace captured ones and later kill the captured one upon return
16
Conclusion
LRPC – combining elements from capabilities & RPC systemsAdopts common-case approach to comm.A viable comm. alternative for small-kernel OSs Optimized for comm. b/ protection domains in same machineCombines control transfer & comm. model of capability systems w/ the programming semantics of & large-grained protection model of RPCTechniques– Simple control transfer – client’s thread does the work in servers
domain– Simple data transfer - ~PC passing parameter mechanism (shared
stack)– Simple & highly optimized stubs– Design for concurrency – avoids shared data structure bottlenecks
Implemented in the DEC C-VAX Firefly