Thesis Proposal: Reducing Operating System Intrusion in the Piglet OS

Steve Muir*

19th June 2000

Operating system intrusion, the overhead imposed upon applications by the operating system itself, has been studied extensively in the OS research community. Consideration of the forms and origins of intrusion reveals that a high degree of intrusion is inherent in a conventional OS due to its structure and passive nature. Thus, new OS architectures are required if the degree of intrusion is to be significantly reduced.

Piglet, a new multiprocessor operating system architecture, successfully reduces intrusion by making the OS kernel active rather than passive. This permits much less intrusive protection mechanisms to be employed in the construction of the OS, leading in turn to greatly reduced overheads for invocation of system services and hence increased application performance.

Preliminary results obtained from the current implementation of Piglet show that applications can indeed access system functions with much lower overheads. While these results demonstrate the feasibility of the Piglet architecture, further research is needed to evaluate it more fully. Among the topics to be investigated are:

• the implications of asynchronous vs. synchronous kernel structure
• efficient implementation of user-level services, e.g., network protocols
• which application domains Piglet is most applicable to
• scalability of the Piglet architecture

* This research was supported by NSF grants ANI98-13875 and CDS97-03220-001, and DARPA contracts #N66001-96-C-852, #MDA972-95-1-0013, and #DABT63-95-C-0073. Additional support was provided by the AT&T Foundation, and the Hewlett-Packard and Intel Corporations.

1 Introduction

The primary purpose of a computer system is to run user applications; the operating system provides a supporting infrastructure for applications but does not directly carry out user tasks. The goal of the OS is to share system resources among multiple applications which may be unaware of and/or competing with each other. Hence the overhead imposed by the OS upon those applications must be minimised if the system is to be utilised most efficiently. In particular, the manner in which applications access resources is critical to their performance.

Several alternative OS architectures, e.g., microkernels and vertically-structured operating systems, have been proposed to address this issue. While successfully providing applications with the flexibility to manage resources according to application-specific policies, they do not directly address the overhead imposed by the structure of the operating system. The Piglet architecture aims to provide the same degree of application-specific resource management while also reducing operating system overheads.

1.1 Problem Statement

A number of trends in computing have forced operating systems researchers to focus on the issue of resource management:

• Decreasing cost, and the associated increasing ubiquity, of computer systems has changed the manner in which those systems are used. When machines were expensive the most important performance metric was job throughput: maximising utilisation of the system as a whole was given priority over the requirements of individual 'jobs', e.g., batch and early time-sharing systems.

  Nowadays those resources which were scarce and/or expensive, e.g., CPUs, memory, persistent storage, network bandwidth, have become so cheap and over-provisioned that users expect to have substantial resources dedicated to their applications. Hence, job throughput is no longer as much of a concern to users as job latency, and similarly, the focus of resource management becomes the reduction of access latency rather than increasing utilisation.

• The changing nature of user applications, especially the emergence of multimedia, imposes new requirements upon operating systems.
  The requirements of this new class of applications are often described as soft real-time since they share many similarities with traditional 'hard' real-time systems but failure to meet those requirements is typically less catastrophic.

  General-purpose operating systems, which are known to offer very poor support for hard real-time applications, are also unable to support the soft real-time requirements of current and future applications. In particular, the resource management policies they embody are typically inappropriate for such applications; hence the proposal of application-level resource management by many researchers as a requirement for such systems.

• The predominance of the client-server model of networking coupled with the general increase in networked computing has led to a requirement for servers to be able to handle very high I/O loads, e.g., file (and Web) servers, database management systems. In contrast to the prior two examples, throughput remains the most important performance metric.

  The favourable price/performance ratio of commodity hardware coupled with a general-purpose OS, e.g., the common PC/Linux/Apache web server, makes it attractive as a platform for building such servers. However, in the same way as for soft real-time applications, general-purpose operating systems cannot optimally support such systems due to their imposition of general resource management policies upon the server application. Thus, application-level resource management is also proposed as a solution for this problem domain.

The combination of greater availability of resources and the changing nature of application requirements has therefore made efficient resource management the focus of most recent operating systems research. In particular, the goal of reducing OS intrusion (the impact that the operating system's structure has upon application performance) has led to the adoption of application-level resource management in order to reduce the effect of operating system policies upon applications.

1.1.1 Operating System Intrusion

Resource management can be decomposed into two independent parts: policy and mechanism. The management policy specifies how application requirements are satisfied by the resource, e.g., page replacement policy, while the mechanism defines the manner in which the policy is enforced, e.g., the CPU/OS's page table. The orthogonality of policy and mechanism is somewhat reduced by the fact that any deterministic resource sharing mechanism imposes some constraints upon the policies implemented using that mechanism.

Both the policies and mechanisms of resource management have considerable impact upon an application's behaviour. Inappropriate policies can have a significant negative effect on application performance, while inefficient mechanisms can impose considerable overhead upon applications.
These two aspects of OS intrusion are referred to as policy intrusion and mechanism intrusion respectively.

1.1.2 Policy Intrusion

Policy intrusion arises when the resource management policies used to implement the operating system's resource abstractions conflict with the policies which an application wishes to use. For example, Stonebraker describes in [36] how the least-recently-used (LRU) block replacement policy commonly adopted by general-purpose operating systems for the buffer cache is the worst possible choice for certain access patterns common in database management systems.

In order for applications to make optimal use of resources they must be able to specify application-specific management policies. Patterson [31] describes one possible solution whereby applications give 'hints' to the OS which are used to guide the OS's prefetching and caching strategies; an alternative solution is advocated by the architects of vertically-structured operating systems [10, 19], who propose eliminating resource management policies from the OS altogether.

1.1.3 Mechanism Intrusion

Mechanism intrusion is the term used to describe the overheads imposed upon an application by the mechanisms used to implement the operating system. There are two principal mechanisms which impose significantly upon application performance: protection of the OS kernel by privilege boundaries, and the use of asynchronous interrupts to handle requests from hardware devices.

While privilege boundaries are a convenient way to protect the OS kernel from malicious or malfunctioning applications, they add significant overhead to the cost of invoking OS functions. Asynchronous interrupts, on the other hand, do not directly add overhead to OS services but instead cause unpredictable pauses in application execution when the interrupt handlers pre-empt the executing process.

1.1.4 Uncorrelated Intrusion

Within the set of mechanism intrusions caused by asynchronous interrupts there exists a subset which is of particular interest: those interrupts whose occurrence is completely unrelated to the behaviour of the currently executing application, i.e., uncorrelated intrusions. Examples include periodic timer ticks and interrupts due to background network traffic. Other interrupts, specifically those arising in response to I/O requests initiated by the application, e.g., DMA completion or reception of a response to a network packet, arise due to application actions and so are not considered to be uncorrelated.

Uncorrelated intrusion is particularly significant because, by definition, it cannot be reduced by changing the actions an application performs. Perhaps the clearest demonstration of the problem of uncorrelated intrusion is given by QoS crosstalk, as introduced in [19].
QoS crosstalk arises when the operating system performs work on one application's behalf in violation of QoS guarantees given to another application, e.g., execution of interrupt handlers during an application's CPU timeslice.

1.1.5 Resource Intrusion

Resource intrusion denotes those resources used by the operating system for its own purposes, and thus not available for the use of applications, e.g., disk blocks used to store metadata, memory used to maintain kernel data structures. Resource intrusion differs from policy and mechanism intrusion in several significant aspects:

• In a typical computing environment resources are over-provisioned to such an extent that the impact of resource intrusion becomes negligible.

• The degree of resource intrusion is primarily dependent on the manner in which the OS uses that resource, i.e., algorithms used, rather than the underlying structure of the OS.

Hence the issue of resource intrusion, while not to be wholly ignored, is not further discussed in this proposal.

1.1.6 Evaluating The Effect of OS Intrusion

Within each of the three classes of OS intrusion described above (policy, mechanism and resource) there exist a number of metrics which can be used to quantitatively assess the degree of intrusion presented by a specific OS component. Since the metrics which are of most interest are application-specific no attempt will be made to enumerate an exhaustive list here; however, some examples are given in Section 4.2.

Like most other performance metrics, the cost of intrusion is most readily evaluated using either latency or throughput measurements, i.e., how long an operation takes or the number of operations performed per unit time. For a latency-based metric, e.g., cost to invoke a system call, the cost due to OS intrusion can be calculated from the relation:

$$T = T_{\mathrm{operation}} + T_{\mathrm{intrusion}} \qquad (1)$$

where $T$, $T_{\mathrm{operation}}$, and $T_{\mathrm{intrusion}}$ represent total time, time executing the operation's code itself, and the time spent executing OS code which does not perform any part of the operation, e.g., context switching, respectively.

For a latency-based metric, total time, $T$, can usually be measured using conventional techniques, e.g., microbenchmarks (see Section 4.2), macrobenchmarks such as lmbench [24], or application-level measurements. $T_{\mathrm{operation}}$, the time required to perform the operation, can usually be measured by instrumenting source code to measure only the appropriate parts. Then $T_{\mathrm{intrusion}}$, which may be viewed as the cost of intrusion, is given by the difference between the two measured times.

For throughput-based metrics, the above relationship still holds true, since the cost of intrusion is still fundamentally a time overhead.
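Relation (1) can be illustrated with a minimal sketch. The function names and the sample timings below are hypothetical, chosen only for illustration; they are not measurements from Piglet. The second function covers only the simplest throughput case, in which throughput is assumed to be exactly the reciprocal of the time taken per operation.

```python
def intrusion_cost(t_total, t_operation):
    """Return T_intrusion = T - T_operation, following relation (1)."""
    if t_operation > t_total:
        raise ValueError("operation time cannot exceed total time")
    return t_total - t_operation


def throughput_reduction(t_total, t_operation):
    """Fractional throughput loss due to intrusion (simplest case only).

    Assumption: throughput is 1 / (time per operation), so the
    no-intrusion throughput is 1 / t_operation and the observed
    throughput is 1 / t_total.
    """
    if t_total <= 0:
        raise ValueError("total time must be positive")
    return 1.0 - t_operation / t_total


# Hypothetical timings in microseconds (illustrative, not measured).
t_total = 5.0
t_operation = 4.0

print("intrusion cost:", intrusion_cost(t_total, t_operation))
print("throughput reduction:", throughput_reduction(t_total, t_operation))
```

With these illustrative numbers the intrusion cost is 1.0 microsecond per operation and the throughput reduction is about 20%. Outside this simplest case the no-intrusion throughput cannot be derived from a single per-operation time, so the second function does not apply.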
The most natural way to express the cost of intrusion for a throughput-based metric is a fractional overhead, e.g., OS intrusion reduces network throughput by 20%. However, except in the simplest cases, where the throughput can be calculated directly from the time taken per operation, measuring the throughput in the no-intrusion case is typically much more difficult than for a latency-based metric. Thus, quantifying the cost of intrusion in throughput-based metrics remains an open problem.

1.2 Motivation and Approach

Many examples have been given in previous work of different forms of operating system intrusion, although there has been a greater emphasis on policy intrusion than other forms. As mentioned above, Stonebraker's discussion of the impact of general-purpose OS policies upon DBMS performance introduces several examples, and a wealth of other work exists (see Section 2). Consequently, operating system designers have introduced new operating system architectures, e.g., microkernels and vertically-structured systems, to address these issues.

The microkernel architecture, perhaps best exemplified by Mach [33, 32], facilitates application-level resource management by moving much of the functionality of the OS out of the kernel and into user-level servers. While this permits applications to implement their own resource management policies, the relatively high overhead of communication between applications and servers means that microkernel-based systems often perform poorly: a prime example of the cost of mechanism intrusion.

While some researchers attempted to make the microkernel architecture more attractive by introducing mechanisms for fast IPC, e.g., Liedtke's L3 and L4 microkernels [20, 14] and Sun's Spring nucleus [11], other groups introduced alternative architectures, which can be broadly classified as vertically-structured operating systems, to also reduce policy intrusion without imposing the overhead of communication with application-level servers.

Vertically-structured operating systems, e.g., the University of Cambridge's Nemesis OS [19] and MIT's Exokernel [10, 18], followed the microkernel architecture in so far as the kernel itself only implements a core set of functions.
However, rather than the kernel providing only an IPC mechanism which can be used to implement system functions in user-level servers, the vertically-structured operating system instead implements low-level resource management functions (typically identified as protection, multiplexing, and translation) which applications use as primitives to implement higher-level functions directly, i.e., without the overhead of IPC.

It was stated earlier that a successful operating system architecture must efficiently support multiple applications without adversely affecting their behaviour. While vertically-structured operating systems, and to a lesser extent microkernels, have certainly made substantial progress towards this goal, imposing a much lower degree of policy intrusion by supporting application-level resource management, they have not significantly reduced mechanism intrusion. Since there is a class of applications for which mechanism intrusion is as important as policy intrusion (see Section 5.1), the Piglet architecture is proposed as an evolution of the vertically-structured operating systems, thus providing the same low degree of policy intrusion but also significantly reducing mechanism intrusion.

The Piglet architecture is based upon the observation that mechanism intrusion in existing operating systems arises from the passive nature of the OS kernel itself. The kernel is typically implemented as a privileged library of functions which are executed only in response to application or device requests: system calls and interrupts respectively. In order to guarantee the integrity of the system, elaborate protection mechanisms are used to control entry to the kernel. It is the high cost of these protection mechanisms (see Section 4.2) that constitutes mechanism intrusion.

Piglet reduces mechanism intrusion by changing the nature of the OS kernel from a passive library to an active task. This is accomplished by dedicating one (or possibly more) system CPUs to constantly running the Piglet kernel. Hence the kernel can continuously monitor the state of the system, both hardware devices and other processors, rather than only being invoked on demand.

Thus Piglet reduces mechanism intrusion in two ways: by eliminating device interrupts, since the active kernel continuously polls each device to determine its status; and by eliminating privilege boundaries between applications and the kernel, instead using physical separation across processors to maintain protection, and fast inter-processor communication primitives, specifically shared memory queues, for invocation of kernel functions.

This change in nature of the OS kernel is made feasible by the increasing availability of multiprocessor systems, a trend begun by the addition of small-scale multiprocessing support to commodity CPUs, i.e., Intel's x86 architecture, but which is expected to continue with the introduction of single-chip multiprocessors [4, 16]. Indeed, as the degree of multiprocessing in a system increases to the point where symmetric multiprocessing is expected to be unable to make full use of all available CPUs, new OS architectures will be required to do so. In particular, the Piglet architecture is well suited to those application domains where I/O comprises a large fraction of the workload and thus the cost of mechanism intrusion is significant. An example of such an application domain is the network appliance: web servers, firewalls, active network nodes, etc.

2 Related Work

While the generalised concept of OS intrusion is one of the contributions of this work, both policy and mechanism intrusion have been investigated to some degree by previous researchers. The problem of policy intrusion has been known since at least the early '80s, and specific examples of mechanism intrusion have been studied by various groups.
In response, a number of OS projects which attempt to reduce policy intrusion, and to a lesser extent mechanism intrusion, have been designed and implemented.

2.1 Intrusion

Perhaps the clearest demonstration of the significant negative impact which inappropriate operating system structures can have on application performance was provided by Stonebraker's discussion [36] of how the services provided by a general-purpose OS (UNIX) are unsatisfactory for supporting a high-performance database management system (DBMS). He comes to the conclusion that "A DBMS would prefer a small efficient operating system with only desired services... On the other hand, most general-purpose operating systems offer all things to all people at much higher overhead."

While Stonebraker implies that the needs of any given application can be most efficiently met by a special-purpose operating system, a general-purpose OS which offers low degrees of both policy and mechanism intrusion may also be able to do so. Hence the need for alternative operating system architectures, specifically those which facilitate application-specific resource management with low overhead.

2.1.1 Policy Intrusion

Stonebraker's analysis of the limitations imposed by a general-purpose operating system upon a database management system provides many examples of policy intrusion [describe them here]. Common to every example is the problem of the abstractions implemented by the OS being restricted to policies which interact adversely with those used by the DBMS. Hence, the need to separate resource management mechanisms and policies is clearly demonstrated for many different classes of resource.

Application-specific management of virtual memory has been extensively researched: Appel and Li [1] provide general considerations for implementing application-level VM primitives; Harty and Cheriton [15] discuss an implementation of application-controlled physical memory in the V++ system and its benefits in a transaction processing application; and finally, Hand [13] and Engler [10] describe specific implementations for vertically-structured operating systems. Hand discusses how self-paging is used in the Nemesis OS to support virtual memory without compromising the QoS guarantees of other applications, while Engler describes how the low-level primitives exported by the Aegis exokernel can be used to efficiently support various application-level functions.

User-level network protocols are another area which has been covered by many groups: Cornell's U-Net [38], and the Virtual Interface Architecture [6] derived from it, provide applications with a direct interface to the network adapter; Thekkath et al. [37], and Edwards and Muir [9] describe user-level implementations of network protocols.
The flexibility to implement application-specific network protocols and/or management of network resources has become much more important for both multiservice operating systems and high-performance network appliances.

2.1.2 Mechanism Intrusion

While the need for application-specific resource management appears to have been universally accepted, the reduction of mechanism intrusion is often taken to be unimportant. For example, Engler et al. describe in [10] how Aegis, the first implementation of their exokernel architecture, was heavily optimised to reduce the costs of invoking low-level primitives. Subsequently, Kaashoek et al. reported in [18] that such optimisations are not necessary to leverage the most benefit from the exokernel architecture. While this statement is accepted as generally true, in certain classes of application, e.g., where I/O comprises the bulk of the workload, it appears that there is considerable benefit to be had from reducing mechanism intrusion.

Mogul and Ramakrishnan [27] describe how a conventional OS can become 'livelocked' due to continuously being interrupted by a network interface card. While the solution they propose, temporarily disabling interrupts and switching to a polling mechanism, is very similar to the Piglet architecture, it differs significantly in that their OS switches dynamically between the two schemes while Piglet always uses polling.

In a similar vein to the analysis in Section 4.2, Mogul and Borg [26] evaluated the effect of context switches on cache performance. Dougan et al. [8] discuss possible modifications to the Linux kernel in order to reduce the impact of the OS on cache performance. In particular, they recommend bypassing the cache in certain portions of the kernel, e.g., when zeroing memory pages in the idle task, so as not to pollute the cache.

One of the most prominent examples of mechanism intrusion is the high cost of inter-process communication in the early microkernel architectures, as described below. Subsequently, several OS projects made significant efforts to reduce this cost, e.g., the L3 and L4 microkernels [14, 20] and the Spring microkernel [11]. These systems are described in more detail below.

2.2 Alternative OS Architectures

Many different OS architectures have been designed to reduce both policy and mechanism intrusion. The microkernel architecture was perhaps the first to be designed with application-specific resource management as a primary goal.
The high cost of inter-process communication (IPC) in such systems led to three different subsequent directions in OS research:

• Fast IPC-based microkernels, which retained the same architecture but attempted to reduce the cost of IPC.

• Vertically-structured operating systems, which eliminated that cost by moving functionality from servers into applications themselves.

• Extensible kernels, which permitted application-specific code to be added as extensions to the kernel.

Note that these directions are in fact orthogonal, and thus it is possible for an OS to incorporate ideas from more than one, leading to a 3-dimensional design/feature space.

2.2.1 Microkernels

Early examples of the microkernel architecture include the Bell Labs Multi-Environment Real-Time (MERT) system [21] and the Series/1 Distributed System (SODS) OS [34, 12]. MERT was intended to support multiple OS environments on a single physical machine (much like IBM's VM/370 system [7]), specifically the combination of time-sharing (UNIX) and real-time operating systems. SODS/OS was designed as the building block for a distributed computing environment in which processes were designed to be location-independent, thus permitting transparent migration. Neither MERT nor SODS/OS was designed with efficient resource management as a primary goal.

Both MERT and SODS/OS conform closely to the characteristics of a microkernel architecture, namely provision of system functions by server processes, and the use of messages as the primary IPC mechanism. Like subsequent microkernel projects they both suffered from a high cost of communication; Lycklama [21] states that "the MERT system requires from 5 to 50 percent more system time for the more heavily used system calls", while Hammond [12] observes that "the associated context swapping is extremely expensive on a conventional machine".

Mach [33, 32] perhaps typifies the microkernel architecture. The kernel itself implements only a core set of services necessary to support application-level servers, which in turn provide the system services required by applications. A powerful message-passing mechanism provides the primary IPC mechanism; in practice the high degree of mechanism intrusion this introduced meant that the performance of applications running on top of server-based operating systems, e.g., the 4.3BSD server, was typically much lower than the native environment.

The poor performance of Mach, primarily due to this high cost of IPC, prompted several research groups to address the problem of reducing this cost. Liedtke's L3 [20] and L4 [14] microkernels left the fundamental architecture unchanged but used the minimisation of IPC cost as the overriding priority in the system design; Liedtke [20] reports IPC times more than an order of magnitude faster than Mach. Even so, lmbench [24] results presented by Härtig [14] show that Linux running natively is still faster than the Linux server running on L4.

Sun's Spring OS introduced the concept of 'doors' as a mechanism by which applications can perform control transfers between protection domains. A door is conceptually similar to a system call entry point in a conventional OS, in that it defines a fixed access point by which an untrusted client may invoke a specified function.
However, doors are defined by applications and passed as capabilities to clients, thus permitting those clients to perform fast transfers of control to the server.

Finally, the University of Massachusetts' Spring real-time microkernel [30] (not related to the aforementioned project of the same name) is perhaps most similar to Piglet in that each node in the Spring distributed system dedicates one of four CPUs to handling system tasks. The functionality of this processor is not described in the available documentation, other than to state that it is used to handle administrative tasks and shield applications from device interrupts. This is primarily done to reduce unpredictability in the system, thus assisting the scheduling of jobs with hard real-time requirements.

2.2.2 Vertically-Structured Operating Systems

While fast IPC mechanisms have helped lessen the biggest problem with the microkernel architecture, namely the high overhead of communicating with servers, others still exist, e.g., QoS crosstalk. Hence the development of the vertically-structured operating system as an alternative which does not share the same architectural limitations. The University of Cambridge's Nemesis OS [19] and MIT's exokernel architecture [10, 18] are probably the best-known examples of such operating systems, but other groups have also designed operating systems which loosely fit into the same model, e.g., Stanford's Cache kernel [5].

Nemesis exemplifies the vertically-structured operating system architecture. The kernel exports only the minimum set of functions necessary to support application-level resource management; these are identified by Barham [2] as translation, protection and multiplexing. Nemesis separates control- and data-plane operations: the former, e.g., acquiring access to a region of disk blocks, are typically performed by server tasks, while the latter, e.g., reading/writing those disk blocks, are optimised so as to require little or no interaction with the kernel.

MIT's exokernel architecture is very similar to Nemesis, sharing many of the key features, i.e., exporting low-level resource management primitives and optimising data-plane operations to reduce kernel interaction. Where it differs from Nemesis is in its use of kernel extensibility to minimise the cost of the most common kernel operations. This concept is described in more detail below.

The Cache kernel also provides applications with a set of low-level primitives which support application-level resource management.
However, rather than implementing protection and multiplexing, the Cache kernel instead exports a minimal core set of objects (address spaces, threads, and kernels) upon which applications build services.

2.2.3 Extensible Operating Systems

While vertically-structured operating systems provide application-specific resource management by exporting low-level interfaces which applications can use to implement their own policies, an alternative approach is to allow applications to extend the OS kernel with functions which implement those policies. This principle is embodied in the University of Washington's SPIN kernel [3] and, to a lesser degree, MIT's exokernel and the Synthesis OS [22].

The SPIN system permits applications to construct kernel extensions using Modula-3 [29], a type-safe object-oriented language. These extensions take the form of functions executed in response to kernel events, e.g., a packet being received from a network interface, and can contain almost arbitrary sequences of code. To maintain the integrity of the kernel certain restrictions are enforced, e.g., interrupt handlers must be written in such a manner that the kernel can safely terminate the function if it executes for too long. Thus, by exposing low-level kernel events to applications and allowing them to execute specific functions in response, SPIN permits applications to embed their own resource-management interfaces in the kernel.

The exokernel uses a more restricted form of kernel extension. Rather than allowing applications to directly download arbitrary functions into the kernel, the exokernel instead allows applications to specify in a function-specific manner the behaviour of certain functions. For example, restricted languages are used to generate packet filter functions in order to demultiplex received network packets; similarly, the XN storage manager uses application-specified untrusted deterministic functions (UDFs) to manipulate metadata without the kernel having to be aware of the metadata format.

Finally, much like the exokernel although in fact preceding it, Massalin's Synthesis OS [22] made extensive use of run-time code generation to construct specialised kernel functions. These functions were used for various tasks, including scheduling, interrupt handling, and system call invocation. Synthesis differs primarily from SPIN and the exokernel in that extension is controlled wholly by the kernel itself and not by applications. Thus Synthesis is perhaps most interesting in that it reduces mechanism intrusion but not policy intrusion: the converse approach to that taken by most other contemporary operating systems.

3 The Piglet Architecture

Before describing the Piglet architecture it is helpful to briefly consider the structure of both conventional and vertically-structured operating systems, since both policy and mechanism intrusion are almost entirely due to the fundamental structure of such operating systems.

3.1 Conventional OS Structure

First, consider a conventional OS such as UNIX [23] or Windows NT [35]; Figure 1 shows the logical structure of such an OS running on an n-way symmetric multiprocessor system. All n processors each run different application-level threads and interact with physical devices through the OS.

The heavy dashed line indicates the user/kernel privilege boundary; in a conventional OS this is coincident with the boundary between application and system functions, the latter being indicated by the darker grey shading, while the lightly-shaded area represents resource protection and multiplexing within the OS. Arrows represent control transfers: applications invoke system functions through traps; the OS uses interrupts to interact with applications, either by raising signals or causing a reschedule.
Similarly, the OS controls devices using I/O commands; devices request that the OS execute a specific function by raising an IRQ.

The primary disadvantage of this structure is that the coincidence of the privilege boundary and application programming interface prevents applications from changing resource-management policies which are embedded within the OS, thus leading to policy intrusion.

[Figure 1: Structure of a Conventional OS. Diagram labels: apps on CPU 1 to CPU n; OS containing system functions and protection and multiplexing; user/kernel privilege boundary; network interface. Control transfers: trap, interrupt, I/O, IRQ.]

3.2 Vertical OS Structure

The benefits of separating resource management mechanisms and policies have been clearly demonstrated for many different classes of resource, as described earlier. Vertically-structured operating systems attempt to provide application-level resource management by restructuring the operating system as shown in Figure 2. As before, n processors each run application threads and interact with physical devices through the OS.

Such operating systems attempt to reduce policy intrusion by providing a degree of separation between resource management mechanisms and policies. The OS provides mechanisms (translation, protection and multiplexing) to share resources between applications; applications use those resources according to their own application-specific policies (within the constraint imposed by the OS mechanisms).

In order to provide these properties the vertically-structured OS separates the privilege boundary and application programming interface. The interface presented by the OS at the privilege boundary is much lower-level, the OS only performing protection and multiplexing of resources. Applications can be programmed directly to these lower-level interfaces, or they can access resources through higher-level interfaces provided by a libOS [18].

By providing low-level interfaces to access resources vertical operating systems effectively reduce policy intrusion since an application is free to implement its own policies using those interfaces. However, the fact that the OS provides lower-level primitives means that a function which a conventional OS provides as a single system call may require the invocation of several primitives in a vertical OS.

As an example, a vertical OS typically provides packet-level operations on the network interface e.g., send an Ethernet frame, while a conventional OS provides system calls which may send many such packets. Thus, the cost of mechanism intrusions becomes particularly important in a vertical OS: if the overhead of invoking low-level primitives is too high then the benefits of reducing policy intrusion may be compromised.

[Figure 2: Structure of a Vertical OS. Diagram labels: apps on CPU 1 to CPU n, each including system functions; OS providing protection and multiplexing; network interface. Control transfers: trap, interrupt, I/O, IRQ.]

On the other hand, the structure of a vertical OS may actually reduce mechanism intrusions, especially system calls, in certain circumstances. Because each application communicates with its own libOS, certain operations which must be protected in a conventional OS can be unprivileged in a libOS, and hence not require a privilege-crossing to invoke.

3.3 The Piglet Architecture

While conventional and vertical OS architectures differ fundamentally in their placement of the privilege boundary, they both use the same control transfer mechanisms i.e., traps and interrupts. The Piglet architecture offers a placement of the privilege boundary similar to vertical operating systems but uses different control transfer mechanisms, as shown in Figure 3. n - 1 CPUs run application threads while the nth runs the lightweight device kernel (LDK); applications interact with physical devices only through the LDK.

Piglet derives many of its primary properties from the fact that the OS is now an active rather than passive entity:

• Protection between applications and the OS is enforced by physical rather than logical (privilege) boundaries.

• Applications invoke OS services by sending a message to the LDK rather than executing service functions directly.

• The OS continually polls devices, eliminating the need for them to use interrupts in order to guarantee timely service. Consequently, application CPUs are never interrupted by devices.

• Elimination of interrupts significantly simplifies the design and implementation of the OS.

[Figure 3: Structure of Piglet. Diagram labels: apps on CPU 1 to CPU n-1, each including system functions; LDK on CPU n providing protection and multiplexing; network interface. Control transfers: poll, interrupt, I/O.]

Applications running on Piglet thus incur very low mechanism intrusions: system calls are replaced by a simple protocol to post a message to a shared-memory queue, and device interrupts are eliminated.

3.4 The Lightweight Device Kernel

As stated above, the Piglet architecture is centred around a system CPU executing the lightweight device kernel. The behaviour of the LDK is conceptually very simple: it continuously polls every device and application for service requests i.e., operations which must be performed on behalf of the requester.
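The polling behaviour described above can be illustrated with a small, single-threaded Python model. It is only a sketch of the control flow, not the LDK implementation: the names Source, post, poll, and ldk_poll_once are hypothetical, and ordinary in-process queues stand in for real devices and for the shared-memory queues.

```python
from collections import deque


class Source:
    """A pollable request source: a device or an application queue (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.pending = deque()

    def post(self, request):
        # Called by the device or application side to leave work for the LDK.
        self.pending.append(request)

    def poll(self):
        # Non-blocking: return one pending request, or None if there is none.
        return self.pending.popleft() if self.pending else None


def ldk_poll_once(sources, handle):
    """One pass of the polling loop: visit every source exactly once."""
    served = 0
    for src in sources:
        request = src.poll()
        if request is not None:
            handle(src.name, request)
            served += 1
    return served


# Usage: one device and one application each leave a request for the LDK.
nic = Source("nic")
app = Source("app-1")
nic.post("rx-packet")
app.post("send-request")
log = []
served = ldk_poll_once([nic, app], lambda name, req: log.append((name, req)))
idle = ldk_poll_once([nic, app], lambda name, req: log.append((name, req)))
```

In this model the time taken by one call to ldk_poll_once corresponds to the polling period, which is why the cost of visiting each source matters so much in what follows.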

Although simple, the structure of the LDK is critical to the performance of Piglet. If Piglet is to offer quality-of-service guarantees to applications and devices then the distribution (mean, variance, best-case, etc.) of the polling period must be tightly controlled.

Three main factors contribute to the reduction of the polling period. Firstly, the complexity of the LDK is drastically reduced by the elimination of device interrupts (except for exceptional cases e.g., hardware failure), since algorithms do not have to be designed to cope with the possibility of an interrupt occurring at unpredictable times. Secondly, the use of lock-free synchronisation in the shared-memory queues used to communicate with applications allows the LDK itself to be lock-free. Finally, the use of run-time code generation to specialise the polling loop and other critical functions e.g., packet filters, packet header generators, reduces the cost of invoking those functions.

3.5 Application Invocation of OS Services

Since the LDK and applications run on different CPUs applications cannot directly affect the LDK, instead communicating through lock-free shared-memory queues (the lock-free protocol used is a variant on those described by Michael and Scott [25]). Most system services are invoked by the application posting a request to the appropriate queue; the exceptions are described below (Section 3.5.1).

Execution of OS services by the LDK on behalf of the application rather than by the application itself has two primary benefits. Firstly, applications do not need to cross the privilege boundary to post a request to the shared-memory queue. Protection is instead enforced by physical separation and having the LDK check the validity of requests. Secondly, invoking a system service is asynchronous rather than synchronous.
Once the application has posted the request it continues executing concurrently with processing of the request.

Both these benefits reduce the overhead of invoking a system service: in addition to eliminating the cost of the privilege crossing, the application also performs less work since it only has to post a message to the LDK rather than execute the service function itself. However, the asynchronous nature of service invocations means that it is generally not possible for the invoker to determine, at the point of invocation, what the outcome will be. This is a significant difference from the conventional OS model whereby all system calls return a value to indicate success or failure. Piglet therefore uses a different programming model for service invocation:

• Failure to complete a service request is an exceptional event; the common case is that the request is successful. Thus Piglet uses an exception mechanism to indicate a failed service request.

• A service request cannot fail due to lack of resources. Applications must provide the LDK with all the resources necessary to execute the service request e.g., memory pages, network buffers.

• Applications can annotate a service request to indicate that failure should not raise an exception; this is equivalent to discarding the return value of a system call invocation.

• If the application cannot proceed further until it knows that the service request has completed successfully it can annotate the request to raise a `successful completion' exception, then wait for an exception.

The last construction allows applications to construct synchronous service invocations from asynchronous primitives. In particular, library operating systems could present synchronous APIs such that applications need not be aware of the asynchronous nature of Piglet's underlying primitive operations.

3.5.1 Directly-Invoked Services

Although most system services are invoked as described above, some must be invoked in the conventional manner i.e., by the application crossing the privilege boundary and executing OS code. Broadly speaking, such services are those which directly affect local CPU state, must be privileged, and cannot be efficiently executed by a remote CPU e.g., TLB management.

3.5.2 Application-LDK Communication

As stated above, the shared-memory queues via which applications communicate with the LDK use lock-free synchronisation to support concurrent access. Lock-free synchronisation is important primarily so that applications cannot cause the LDK to block.

The algorithm used in the current implementation of the LDK supports multiple concurrent writers and a single reader and so can be used as-is for the application-to-LDK queues; for queues used in the reverse direction the application must impose its own scheme for controlling concurrent access to the queue.

3.6 Exceptions and Asynchronous Scheduling

Exceptions in the Piglet architecture are functionally equivalent to UNIX signals. They serve two main purposes: notification of service request failure and notification of events specified by the application.
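The programming model of Section 3.5, in which requests are posted without a return value and outcomes are reported only through exceptions, can be sketched as a toy Python model. Everything here is an illustrative assumption rather than Piglet's interface: the class LDKModel, its methods, and the ok flag (which merely simulates whether a request would succeed) are hypothetical, and the model is single-threaded, so wait() simply runs the LDK side inline.

```python
from collections import deque


class LDKModel:
    """Toy, single-threaded model of asynchronous service invocation (not Piglet's API)."""

    def __init__(self):
        self.requests = deque()    # application-to-LDK queue
        self.exceptions = deque()  # LDK-to-application notifications

    def post(self, op, ok=True, on_failure=True, on_success=False):
        # The application returns immediately; there is no return value.
        # 'ok' only simulates whether the LDK will be able to complete 'op'.
        self.requests.append((op, ok, on_failure, on_success))

    def poll(self):
        # LDK side: process every pending request, raising exceptions as annotated.
        while self.requests:
            op, ok, on_failure, on_success = self.requests.popleft()
            if not ok and on_failure:
                self.exceptions.append(("failed", op))
            elif ok and on_success:
                self.exceptions.append(("completed", op))

    def wait(self):
        # Synchronous wrapper: let the LDK run, then take the next exception if any.
        self.poll()
        return self.exceptions.popleft() if self.exceptions else None


ldk = LDKModel()
ldk.post("send-a")                              # common case: succeeds silently
ldk.post("send-b", ok=False, on_failure=False)  # failure suppressed by annotation
ldk.post("send-c", ok=False)                    # failure raises an exception
ldk.post("send-d", on_success=True)             # 'successful completion' requested
first = ldk.wait()
second = ldk.wait()
third = ldk.wait()
```

A library operating system could wrap post() followed by wait() in a single call to present the synchronous interface mentioned in Section 3.5.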
Both kinds of notification are dictated by the asynchronous nature of a service request invocation: it cannot return a value to indicate success or failure, nor can it block the application. An application which needs to wait for an event does so by notifying the LDK which events should cause it to be unblocked and then forcing a reschedule.

Scheduling is also handled differently in the Piglet architecture than in conventional and vertical operating systems. Rather than invoking a scheduler function every time a CPU is rescheduled the LDK instead continuously updates the scheduler state. For each thread currently running on a CPU the LDK generates a specialised reschedule function which switches contexts directly to the next thread to be executed on that CPU (as in Synthesis [22]). This function can be changed concurrently with the thread's execution as other threads are blocked and restarted.

4 Preliminary Results

At this point some preliminary results have been gathered to evaluate both the impact of OS intrusion in a conventional operating system, and the benefit of the Piglet architecture in reducing that impact. While these results represent only two components in that assessment they appear to support the key notions presented earlier, specifically that OS intrusion is a significant factor in conventional operating systems, and that the Piglet architecture can successfully reduce intrusion.

4.1 Evaluation Methodology

The majority of results presented below (all except the measurement of round-trip times) were gathered by instrumenting the Piglet/Linux kernel with instructions to log the value of the timestamp counter upon entry to and exit from functions of interest. This allows a profile to be constructed detailing the execution path followed in response to various system events. For reasons of conciseness, full details are omitted here, but are given elsewhere [28].

All tests were conducted using the Piglet/Linux hybrid kernel derived from Linux version 2.0.30 but updated with current drivers for the network cards used. The test machine was a dual 200MHz Pentium Pro with 96MB of EDO DRAM connected to an identical machine running Linux via an isolated Fast Ethernet (100Mb/s) hub using 3Com 3c905 NICs.

4.2 The Cost of Mechanism Intrusion

As discussed earlier (Section 3), mechanism intrusions are inherent in conventional and vertical OSes due to the passive nature of the OS. A passive OS is protected from applications by privilege boundaries; applications run in user (unprivileged) mode and must switch to supervisor/kernel (privileged) mode to execute OS functions.
The overhead of crossing the privilege boundary is an example of a mechanism intrusion: the structure of the OS imposes an additional invocation cost upon applications.

The passivity of the conventional OS also causes mechanism intrusions in the form of device interrupts. A device interrupt occurring during the execution of an application affects the application in two primary ways: it consumes CPU cycles and contaminates the CPU's caches. Both costs will be analysed in turn.

4.2.1 Interrupt Period and Latency

Table 1 shows various time-based metrics calculated from the Linux kernel event log and application timestamp counter readings. The interrupt period, Period, represents the total application CPU time consumed by a single interrupt.

                                            Time (µs)
Form of intrusion                 Period   Work   Overhead   Latency
Network interrupt (upComplete)      44.3   27.5       15.8       7.6
System call (sendto())              32.3   24.8        7.5       4.1

Table 1: Measured time costs of OS intrusion

                                    Cache misses
Form of intrusion                 I-cache   D-cache
Timer interrupt                      37.2      79.0
Trivial system call (getpid())        9.9      37.0

Table 2: Measured cache pollution of OS intrusion

While the interrupt period is the simplest measure of OS intrusion it fails to take account of the useful work done by the OS within that period. Judging exactly what is `useful' work is somewhat subjective but we consider it to be all code whose execution directly contributes to the progress of some application. This includes system call, device driver, and bottom-half functions (`soft' interrupts), but excludes first-level interrupt/trap handlers, context switches, etc. As an example, Table 1 shows the total time spent in the device driver's receive function and Linux's network protocol stack processing the received packet (the Work column). The difference between this figure and the interrupt period gives us the overhead of an interrupt, shown in the Overhead column.

The overhead can be further subdivided into the time between the device signalling the interrupt and the handler being called by the OS (shown as Latency in Table 1), and the time taken to return from the interrupt context to the application (the difference between Overhead and Latency, not shown).

4.2.2 The Impact of Interrupts on CPU Caches

A method similar to that described earlier was used to measure the cache pollution due to interrupts, recording performance monitoring counter values rather than the timestamp counter in the profiles gathered.
The necessary steps were taken to ensure that both instruction and data caches were purged of kernel code before the interrupt occurred; again, fuller details are available elsewhere [28].

The most useful metric for measuring the effect of an interrupt on the CPU's caches is the number of the application's cache lines replaced during handling of the interrupt. This is equivalent to the number of cache misses taken during execution of the interrupt handler, with one complication. Cache lines prefetched by the CPU can be accessed without a cache miss, thus introducing an error between the measured number of cache misses and the actual number of cache blocks replaced. However, in the Intel CPUs used in this evaluation only I-cache blocks can be prefetched. Mean values from a sample of 1000 measurements are shown in Table 2.

What is clear from these results is that an interrupt has a non-negligible impact upon the application executing when it occurs, both in CPU cycles consumed and cache contamination. While the network interrupt is more complex than most other interrupts e.g., timers, and thus may not be representative of the time to execute those handlers, it is not unreasonable to assume that a better estimate lies somewhere between that figure of 40µs and the lower-bound of 15.8µs given by the overhead.

This is not the only effect of an interrupt upon the application however, since the number of cache lines evicted by the interrupt handler is also significant. On a system with 8kB L1 caches and 32-byte cache lines (hence 256 lines in each cache), the fraction of the cache affected by the interrupt is almost one third for the D-cache. The I-cache impact is harder to estimate due to the inaccuracy of measurement (as discussed above) but is at least 15%. The effect of this cache contamination is to increase the effective time penalty of taking an interrupt, since the application must now reload the caches.

Of course, the actual performance impact of interrupts upon applications depends upon the interrupt load upon the system. These measurements do show though that a system operating under a high interrupt load of several thousand or tens of thousands of interrupts per second would spend the majority of its time handling those interrupts. An example of this phenomenon, receive livelock due to network interrupts, was described by Mogul and Ramakrishnan [27] (see Section 2.1.2).

4.2.3 System Calls

The second row of Table 1 describes the costs of making a sendto() system call in the Linux OS. The time costs were determined using the same instrumentation mechanism as described above, and the same quantities are presented (period, work, overhead, latency).
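As a rough check on the load and cache arguments above, the Period column of Table 1 and the miss counts of Table 2 can be turned into fractions with a few lines of Python. The function names are illustrative, the event rates are example values rather than measurements, and the model assumes every event costs exactly the measured period.

```python
# Period column of Table 1, in microseconds.
PERIOD_US = {"network interrupt": 44.3, "sendto() system call": 32.3}


def cpu_fraction(period_us, events_per_second):
    """Fraction of one CPU consumed at a given event rate, capped at saturation."""
    return min(1.0, period_us * 1e-6 * events_per_second)


def cache_fraction(lines_replaced, cache_bytes=8 * 1024, line_bytes=32):
    """Fraction of an L1 cache replaced, for 8kB caches with 32-byte lines."""
    return lines_replaced / (cache_bytes // line_bytes)


irq_10k = cpu_fraction(PERIOD_US["network interrupt"], 10_000)  # example rate
irq_20k = cpu_fraction(PERIOD_US["network interrupt"], 20_000)  # example rate
dcache = cache_fraction(79.0)  # Table 2, timer interrupt, D-cache misses
icache = cache_fraction(37.2)  # Table 2, timer interrupt, I-cache misses
```

Under these assumptions 10,000 network interrupts per second consume about 44% of a CPU and 20,000 about 89%, while the measured misses correspond to roughly 31% of the D-cache and 15% of the I-cache.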
Cache effects were determined in a similar manner to those for an interrupt i.e., flushing the L1 caches of OS code and data before the event of interest occurs, but in this case the trivial system call, getpid(), was invoked rather than waiting for an interrupt. By measuring the cache effects of the trivial system call we can isolate the overheads of the above steps.

These measurements show that the time-domain overheads (shown in the Overhead and Latency columns) of making a system call in Linux are approximately half those of taking an interrupt. This is most likely due to the fact that interrupt handlers incur all the costs of a system call but also have to communicate with the external interrupt controllers, thus incurring the cost of slow I/O operations.

What is most surprising about these measurements is the large number of D-cache misses incurred even by the trivial system call. Given the nature of the steps described above, and specifically the memory references involved, it appears that the Linux kernel's system call invocation mechanism is somewhat poorly structured. However, one can still conclude that the time cost is sufficiently high that an application making many system calls will lose a considerable amount of time as overhead.

    ping muffin2 -f -c 100000 -s 64 --use-tsc 200 --histogram >/dev/null

    where:
      muffin2          name of target machine
      -f               flood-ping mode
      -c 100000        send 100000 packets
      -s 64            add payload of 64 bytes
      --use-tsc 200    use timestamp counter for timing, 200 cycles per µs
      --histogram      generate a histogram of timing frequencies
      >/dev/null       redirect output to /dev/null

Figure 4: Command used to measure round-trip latencies.

[Figure 5: Distribution of round-trip times as a function of payload size. Horizontal axis: round-trip time/µs (0 to 600). Vertical axis: frequency (0 to 60000). Histograms: piglet-64, piglet-256, piglet-1024, linux-64, linux-256, linux-1024.]

4.3 The Benefits of the Piglet Architecture

In order to experimentally evaluate key components of the Piglet architecture a hybrid Piglet/Linux kernel was constructed. Adding the functionality of Piglet to an existing Linux kernel allowed for evaluation of that functionality without having to build supporting OS infrastructure from scratch.

The key aspects of Piglet that need to be evaluated experimentally are its reduction of OS intrusion and efficient support for application-level resource management. In order to do so a user-level ICMP network protocol stack was implemented and used to compare the cost of packet processing with that of a conventional OS.

Two different types of comparison were made between the Piglet user-level protocol stack and the Linux kernel protocol stack. Application-level performance was compared using the ping application to measure round-trip latency. A more detailed analysis of the overheads of packet transmit and receive was performed using the same instrumentation used to measure the cost of mechanism intrusion, as described above.

                  Modal RTT/µs
Payload/bytes    Linux   Piglet
64                 199      161
256                253      215
1024               468      413

Table 3: Modal round-trip times

4.3.1 Measurement of Round-Trip Latency

Measurement of Piglet's effectiveness in reducing system call overhead was performed using the ping application. ping was chosen because of its simplicity, allowing the impact of system calls to be easily isolated. Several modifications were made to the program:

• Support for the Piglet user-space network interface was added. This entailed replacing socket calls with the equivalent calls to the Piglet user-space library.

• The CPU's timestamp counter was used for measuring round-trip time (RTT) since this can be read directly from user-space, while gettimeofday() requires a system-call invocation.

• A profiling option was added to log the value in the timestamp counter at various points in execution.

• An option was added to generate a histogram of the RTT distribution.

ping was invoked as shown in Figure 4, with the addition of the --piglet parameter when running the Piglet tests. Three different payload sizes were used, and the experiment was repeated several times to ensure that results were representative. The histograms generated are shown in Figure 5, with the label piglet-XXX attached to the histogram for Piglet with XXX payload bytes, and similarly for linux-XXX.

The histograms show the observed frequency against RTT for the given payload size. The position and shape of the histograms are more significant than the frequency values themselves.
Modal RTTs i.e., the position of the most-frequently observed RTT (highest column in the histogram), are shown in Table 3 as an aid to comparison, with the leftmost histograms having the smallest RTTs. Taller, narrower histograms indicate smaller variance in the distribution of round-trip times.
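The comparison that Table 3 is intended to support can be made explicit by subtracting the Piglet column from the Linux column. The following short Python sketch does only that arithmetic on the table's values; the variable and function names are illustrative.

```python
# Table 3: payload size in bytes -> (Linux modal RTT, Piglet modal RTT), in microseconds.
MODAL_RTT_US = {64: (199, 161), 256: (253, 215), 1024: (468, 413)}


def rtt_savings(table):
    """Absolute reduction in modal RTT (Linux minus Piglet) per payload size."""
    return {size: linux - piglet for size, (linux, piglet) in table.items()}


savings = rtt_savings(MODAL_RTT_US)
# Reduction relative to the Linux figure for the same payload size.
relative = {size: savings[size] / MODAL_RTT_US[size][0] for size in savings}
```

The absolute saving is 38µs for the two smaller payloads and 55µs for 1024 bytes, which is about 19% of the Linux modal RTT at 64 bytes and about 12% at 1024 bytes.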

The most significant point to take from Figure 5 is that the round-trip times for the Piglet system are lower than those for the Linux system by approximately 38µs for 64-byte payloads, and up to 55µs for 1024-byte payloads. There is some dependence upon payload size because the Piglet user-space protocol stack performs DMA of the packet payload (not headers) directly from application memory while Linux copies into a kernel buffer from which DMA is performed; the cost of copying is negligible for 64-byte payloads, however, so 38µs can be taken as a representative figure. The semantic implications of performing DMA from application memory will be addressed in future work.

4.3.2 Comparison of System-Call Overheads

In order to perform a more detailed analysis we used the event logs generated by the instrumentation described in Section 4.1. Representative sections of the log for a single round-trip were extracted and displayed graphically in Figure 6 for both Linux and Piglet.

The graphical representation shows the relative value of the cycle counter corresponding to various events in the sending and reception of a single packet. The counter value when the top-level function used to send the echo packet is called is used as the zero point; the got_packet event indicates when the reply packet was received by the application.

The unshaded boxes to the right of the time axis represent functions executed in the application, while the shaded boxes to the left represent kernel functions. Stacked boxes represent nested function calls. Each dotted line indicates that the event with the given label occurred at the specified time. For the Linux trace certain events have been omitted since they are not relevant to the analysis; the complete Linux trace is given in [28].

The Piglet trace represents the execution of the LDK processor on the leftmost line.
Since the Piglet trace is more concise every event is represented in Figure 6.

[Figure 6: Execution traces of ping. Panels: (a) Linux, with Kernel and Application columns; (b) Piglet, with LDK, Kernel and Application columns. Vertical axis: cycle counter relative to begin_send. Labelled events include begin_send, end_send, begin_select, end_select, got_packet, and entry to and exit from sys_sendto, boomerang_tx, vortex_interrupt, netif_rx, net_bh, ip_rcv, sys_recvfrom, boomerang_rx and user_tx_poll.]

What is most obvious from these traces is the difference in the time taken to send a packet i.e., from begin_send to end_send. While Linux takes ~6300 cycles, Piglet only requires 230. This major difference is due to the application in Piglet only having to post a message to the shared-memory queue rather than executing the system call itself.

The latency between begin_send and the packet having been passed to the network interface (Lboomerang_tx) is also much lower in Piglet: 1400 cycles as opposed to 5000. This corresponds to a contribution of an extra 18µs (with a 200MHz CPU clock) to the RTT. One factor in the low latency in Piglet is that the delay between the application posting the message and the LDK processing that message is only 170 cycles.

While a small part of the reduced time spent in boomerang_tx is due to an optimisation in the Piglet network driver (not stalling the NIC's DMA engine if it is known to be already stalled), the biggest factor is the structure of Linux's protocol stack. Because Linux implements a much higher-level interface (BSD sockets) it executes multiple levels of functions (sys_sendto, the generic socket layer; inet_sendmsg, the Internet domain socket family; raw_sendto, the Internet domain raw socket; ip_build_xmit, which builds IP headers for the packet) before finally passing the packet to the network interface. The substantial overhead of these multiple layers of abstraction contributes the largest amount to the system call cost in Linux.

While not affecting the round-trip time, the traces also provide an example of mechanism intrusion due to device interrupts. In Linux the first vortex_interrupt event is caused by the NIC raising an interrupt to inform the OS that the packet has been sent; in Piglet the application does not suffer this intrusion because the LDK polls the NIC to determine when packet transmission is complete.

After sending the packet, the ping application calls select() to block until the reply is received. Since the time when the kernel enters the corresponding top-level function is not recorded, it has been estimated using the measurements from Section 4.2.3. Notification that the packet has been received and transferred into memory by the NIC occurs in the second vortex_interrupt event for Linux, and the boomerang_rx event for Piglet. The time between that event and select() returning to the application is greater in Piglet because the LDK must send an inter-processor interrupt (IPI) to the Linux kernel to indicate that a packet was received.

The select() call could be removed in both the Linux and Piglet tests. For Linux it is unnecessary since the recvfrom() function blocks if no packet is available; if the call to select() is removed then the Linux RTT decreases by ~20µs. In the Piglet environment it should be replaced with operations that use the Piglet exception and scheduling mechanisms (see Section 3.6); that should also lead to a reduction in RTT.

Finally, on returning from select() the application receives the packet; in Linux this entails making a recvfrom() system call, while in Piglet it removes a packet descriptor from a shared-memory queue.
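The cycle counts quoted in this section can be related to the microsecond figures elsewhere in the evaluation by dividing by the clock rate of the test machine, 200 cycles per µs. The Python sketch below applies that conversion to the counts given in the text; the names are illustrative and the inputs are the rounded figures quoted above, not the raw trace.

```python
CYCLES_PER_US = 200  # 200MHz Pentium Pro, as passed to ping via --use-tsc 200


def cycles_to_us(cycles):
    """Convert a timestamp-counter difference to microseconds."""
    return cycles / CYCLES_PER_US


send_linux_us = cycles_to_us(6300)            # begin_send to end_send, Linux
send_piglet_us = cycles_to_us(230)            # begin_send to end_send, Piglet
tx_latency_gap_us = cycles_to_us(5000 - 1400) # extra latency to the NIC in Linux
post_to_ldk_us = cycles_to_us(170)            # message posted to LDK processing it
```

On these figures the application spends about 31.5µs in the Linux send path against about 1.15µs in Piglet, and the 3600-cycle difference in latency to the network interface is the 18µs quoted above.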
The time difference for this final step (end_select to got_packet, 21.0µs vs. 3.5µs, a difference of 17.5µs) again demonstrates the costs of system call overhead and multiple levels of abstraction.

This analysis demonstrates two key points. Firstly, the overhead of a system call, 1500 cycles, is significant: approximately 25% of the total cycles executed for sendto(). Secondly, implementing general-purpose interfaces in the OS leads to inefficiency, the central argument used by proponents of vertical operating systems.

5 Characterisation of a Solution

The primary benefits claimed for the Piglet architecture are twofold: firstly, it can support a model of application-level resource management equivalent to that provided by vertically-structured operating systems; and secondly, it offers a much lower degree of mechanism intrusion while doing so. Therefore, in order to demonstrate the viability of the Piglet architecture as an alternative to existing operating systems several properties must be investigated. These are:

• Functionality - can Piglet support the same application models as existing operating systems?

• Performance - how does application performance under Piglet compare to alternative operating systems?

• Scalability - how are the behaviour and properties of Piglet affected by changing system parameters?

Each of these aspects is described in more detail below. Before considering these details though it is important to first determine in which application domains the Piglet architecture is most attractive.

5.1 Application Domains

The key factor in selecting possible application domains for Piglet is that dedicating one CPU to running the lightweight device kernel must provide sufficient benefit to applications, through reduction of OS intrusion, to compensate for the fact that those applications may only execute on n - 1, rather than all n, CPUs. This rule of thumb has two consequences: applications with a high proportion of I/O in their workload will benefit more from Piglet than those which are computationally intensive, and the architecture becomes more attractive as n, the number of CPUs, increases, since most applications do not scale linearly with n [16], and thus the penalty of giving up one processor becomes less significant.

In light of these considerations and other factors, the application domain which appears to derive most benefit from the Piglet architecture is network appliances. Network appliances are systems whose primary purpose is not computation but rather the provision of a network service e.g., routers, firewalls, active network nodes, web/file servers. Piglet is particularly suited to hosting such systems since their workloads are typically dominated by I/O.
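The rule of thumb above can be made concrete with a deliberately idealised model, sketched in Python below. The model is an assumption introduced purely for illustration and is not taken from the evaluation: application scaling is assumed to follow Amdahl's law with a parallel fraction p, Piglet is assumed to remove OS intrusion from the application CPUs entirely, and the function names are hypothetical.

```python
def amdahl_speedup(n, parallel_fraction):
    """Speedup on n CPUs under Amdahl's law (an assumed scaling model)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)


def break_even_intrusion(n, parallel_fraction):
    """Fraction of application CPU time lost to OS intrusion at which n - 1
    intrusion-free CPUs match n intruded CPUs (idealised break-even point)."""
    return 1.0 - amdahl_speedup(n - 1, parallel_fraction) / amdahl_speedup(n, parallel_fraction)


linear_4 = break_even_intrusion(4, 1.0)   # perfectly linear scaling, 4 CPUs
amdahl_4 = break_even_intrusion(4, 0.9)   # 90% parallel, 4 CPUs
amdahl_8 = break_even_intrusion(8, 0.9)   # 90% parallel, 8 CPUs
```

In this model the break-even intrusion level is 1/n for a perfectly scalable application, and it falls both as scaling becomes less than linear and as n grows, which is the qualitative behaviour the rule of thumb describes.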
Other factors also make network appliances an attractive application domain, such as the increased interest in using cheap commodity hardware, including the small-scale multiprocessor systems at which Piglet is targeted, rather than expensive special-purpose devices, as the network appliance platform.

5.2 Evaluation of the Piglet Architecture

The three key aspects of Piglet which must be evaluated have been specified above as functionality, performance, and scalability.

5.2.1 Functionality

If the Piglet architecture is to be considered as an alternative to other operating system structures it must at least provide the same functionality as those systems. Since Piglet represents an evolution of the vertically-structured operating system the foremost requirement is that it supports application-level resource management.

5.2.2 Performance

Although the preliminary results obtained for the ping application show that Piglet can greatly reduce the overhead of invoking OS services, those results represent only one facet of the performance analysis. Hence, additional evaluation is necessary, including both further microbenchmarking e.g., measurement of network throughput attainable, and measurement of application-level performance.

5.2.3 Scalability

Many of the performance benefits of Piglet arise from the use of polling as the primary communication mechanism. While the preliminary results show that polling can indeed provide low-latency communication it is important to demonstrate the scalability of the architecture i.e., how many clients, both user-level processes and physical devices, Piglet can support.

6 Plan of Action & Roadmap

The following tasks must be completed in order to effectively address the requirements described in Section 5. Each task description is accompanied by the time expected to be required to complete it.

1. Implementation of the scheduling and event handling mechanism. As the counterpart to the shared-memory communication mechanism this is essential in order to investigate the performance of the Piglet architecture (2-3 weeks).

   Completion of this component of the work will help define the interface application programs use to invoke system services. In particular, the implications of asynchronous communication upon application design have to be considered.

2. Completion of the user-level protocol stack. The current user-level protocol stack only provides functionality equivalent to UNIX `raw' sockets; a TCP stack has been partially implemented but requires further development to be suitable for performing throughput measurements and other microbenchmarks.
A more complete protocol stack is also a requirementto support application-level evaluation (2{3 weeks).As well as providing a basis for performance evaluation, an implementa-tion of a user-space protocol stack must also address issues not just ofperformance but also safety and security. For example, unprivileged ap-plications should not be able to send TCP packets with arbitrary headers,but how can this be prevented without embedding substantial amounts ofthe TCP stack into the OS kernel?3. Evaluation of application-level performance under Piglet. As describedearlier (Section 5.1), Piglet is believed to be particularly appropriate as a27

network appliance operating system. Hence the applications whose per-formance under Piglet is to be evaluated should be drawn from the cor-responding application domain. At this time exactly which applicationsshould be evaluated has not been determined, but it is expected to includetwo of the following:� Firewall - requires support for low latency and high throughput �lter-ing and forwarding of network packets. Most existing systems used assoftware �rewalls e.g., Linux, perform these tasks in-kernel; in Pigletthey would be performed by a user-level application, thus testing theability of Piglet to o�er e�cient access to network packets.� Active network node - similar functional requirements to a �rewall,but typically implemented as a user-level application rather thankernel extension. Thus may provide a fairer basis for comparison,and has the additional advantage of being able to both draw on andprovide data for local active network research e.g., PLANet [17].� Web server/web cache - also depends upon low latency, high through-put network access but di�ers in that tra�c sent is likely to be muchgreater than tra�c received, thus emphasising di�erent functions.There is substantial existing work on web server/cache performance,but taking into account the e�ects of �lesystem performance makecomparisions harder.Porting of the selected applications to the Piglet environment and sub-sequent experimental evaluation is expected to take approximately 2{3months.4. Characterisation of the Lightweight Device Kernel. 
The scalability of thePiglet architecture is to be assessed using two methods: experimentally,by measuring the response of the test systems under varying loads; andstatistically, by measuring certain properties of the LDK e.g., polling la-tency and service time, and formulating a simple analytical model (3{4weeks).As part of this characterisation, alternative kernel structures, speci�callythe association of message queues with applications and resources, are tobe considered. For example, in order to reduce the number of queueswhich must be polled it is possible to replace per-resource applicationqueues with a single per-application queue, or even a per-processor queue.While attractive in terms of reducing the number of distinct queues tobe polled, other implications e.g., e�ect upon QoS guarantees, must beconsidered.Completion of the above experimental work is thus expected to take a totalperiod of 4{6 months. Some writing up of each component is to be performedconcurrently and has been taken into account in the estimated completion times.A �nal period of 2{3 months is expected to be su�cient to complete the remain-ing written sections of the thesis. 28
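To make the kind of "simple analytical model" mentioned in task 4 more concrete, the following Python sketch estimates how polling latency might grow with the number of queues under the three queue structures discussed there. It is an illustration only, not the model to be developed: the names `queue_count` and `PollingModel`, the round-robin sweep, the fixed per-queue poll cost, and the interpretation of "per-resource application queues" as one queue per (application, resource) pair are all assumptions introduced here, and the numeric parameters are placeholders rather than measurements.

```python
from dataclasses import dataclass


def queue_count(structure: str, n_apps: int, n_resources: int, n_processors: int) -> int:
    """Number of queues the kernel must poll under each (assumed) structure."""
    if structure == "per-resource":
        # Assumption: one queue for every (application, resource) pair.
        return n_apps * n_resources
    if structure == "per-application":
        return n_apps
    if structure == "per-processor":
        return n_processors
    raise ValueError(f"unknown queue structure: {structure}")


@dataclass
class PollingModel:
    """Hypothetical round-robin polling model; all times in microseconds."""
    poll_cost_us: float     # assumed cost of checking one queue for work
    service_time_us: float  # assumed cost of servicing one pending request

    def sweep_time_us(self, n_queues: int, busy_fraction: float) -> float:
        """Expected time for one pass over all queues, where busy_fraction
        is the probability that a polled queue holds a request."""
        if not 0.0 <= busy_fraction <= 1.0:
            raise ValueError("busy_fraction must lie in [0, 1]")
        return n_queues * (self.poll_cost_us + busy_fraction * self.service_time_us)

    def mean_wait_us(self, n_queues: int, busy_fraction: float) -> float:
        """A request arriving at a uniformly random time waits, on average,
        half a sweep before its queue is next polled."""
        return 0.5 * self.sweep_time_us(n_queues, busy_fraction)

    def worst_case_wait_us(self, n_queues: int) -> float:
        """A request arriving just after its queue was polled, with every
        queue busy, waits one full sweep."""
        return self.sweep_time_us(n_queues, 1.0)


if __name__ == "__main__":
    model = PollingModel(poll_cost_us=0.25, service_time_us=2.0)
    for structure in ("per-resource", "per-application", "per-processor"):
        n = queue_count(structure, n_apps=8, n_resources=4, n_processors=2)
        print(structure, n, model.mean_wait_us(n, 0.5), model.worst_case_wait_us(n))
```

Under these assumptions the wait grows linearly with the number of queues, which is why collapsing per-resource queues into per-application or per-processor queues looks attractive. The sketch deliberately ignores the QoS implications noted in task 4, which a shared queue would complicate.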

References

[1] Appel, A. W., and Li, K. Virtual memory primitives for user programs. In Proc. of the 4th Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (Oct. 1991), pp. 95–109.
[2] Barham, P. R. Devices in a Multi-Service Operating System. PhD thesis, University of Cambridge, July 1996.
[3] Bershad, B., et al. Extensibility, safety and performance in the SPIN operating system. In Proc. of the 15th ACM Symp. on Operating Systems Principles (Dec. 1995), pp. 267–284.
[4] Burger, D., and Goodman, J. Billion-transistor architectures. IEEE Computer (Sept. 1997), 46–47.
[5] Cheriton, D. R., and Duda, K. J. A caching model of operating system functionality. In Proc. of the 1st Symp. on Operating Systems Design and Implementation (Nov. 1994), pp. 179–193.
[6] Compaq Computer Corp., Intel Corporation, Microsoft Corporation. Virtual Interface Architecture Specification Version 1.0, 1997. www.viarch.org.
[7] Creasy, R. J. The origin of the VM/370 time-sharing system. IBM Journal of Research and Development 25, 5 (Sept. 1981), 483–490.
[8] Dougan, C., et al. Optimizing the idle task and other MMU tricks. In Proc. of the 3rd Symp. on Operating Systems Design and Implementation (Feb. 1999), pp. 229–236.
[9] Edwards, A., and Muir, S. Experiences implementing a high-performance TCP in user-space. In Proc. of ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (Sept. 1995), pp. 196–205.
[10] Engler, D. R., et al. Exokernel: An operating system architecture for application-level resource management. In Proc. of the 15th ACM Symp. on Operating Systems Principles (Dec. 1995), pp. 251–266.
[11] Hamilton, G., and Kougiouris, P. The Spring nucleus: A microkernel for objects. In Proc. of the USENIX Summer 1993 Technical Conference (June 1993), pp. 147–160.
[12] Hammond, R. A. Experiences with the Series/1 Distributed System. In Proc. of the 21st IEEE Computer Society Int'l Conference (COMPCON 80) (Fall 1980), pp. 585–589.
[13] Hand, S. M. Self-paging in the Nemesis operating system. In Proc. of the 3rd Symp. on Operating Systems Design and Implementation (Feb. 1999), pp. 73–86.
[14] Härtig, H., et al. The performance of µ-kernel based systems. In Proc. of the 16th ACM Symp. on Operating Systems Principles (Dec. 1997), pp. 66–77.
[15] Harty, K., and Cheriton, D. R. Application-controlled physical memory using external page-cache management. In Proc. of the 5th Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (Oct. 1992), pp. 187–199.
[16] Hennessy, J. L., and Patterson, D. A. Computer Architecture: A Quantitative Approach, 2nd ed. Morgan Kaufmann, 1996, ch. 8.
[17] Hicks, M., et al. PLANet: An active internetwork. In Proc. of the 18th IEEE Computer and Communication Society INFOCOM Conference (Mar. 1999), pp. 1124–1133.
[18] Kaashoek, M. F., et al. Application performance and flexibility on Exokernel systems. In Proc. of the 16th ACM Symp. on Operating Systems Principles (Oct. 1997), pp. 52–65.
[19] Leslie, I., et al. The design and implementation of an operating system to support distributed multimedia applications. IEEE Journal on Selected Areas in Communications 14, 7 (Sept. 1996), 1280–1297.
[20] Liedtke, J. Improving IPC by kernel design. In Proc. of the 14th ACM Symp. on Operating Systems Principles (Dec. 1993), pp. 175–188.
[21] Lycklama, H., and Bayer, D. L. The MERT operating system. The Bell System Technical Journal 57, 6 (July/August 1978), 2049–2086.
[22] Massalin, H. An Efficient Implementation of Fundamental Operating System Services. PhD thesis, Columbia University, 1992.
[23] McKusick, M. K., et al. The Design and Implementation of the 4.4BSD UNIX Operating System. Addison-Wesley Publishing Company, 1996.
[24] McVoy, L., and Staelin, C. lmbench: Portable tools for performance analysis. In Proc. of the USENIX 1996 Annual Technical Conference (Jan. 1996), pp. 279–294.
[25] Michael, M. M., and Scott, M. L. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proc. of the 15th Annual Symposium on Principles of Distributed Computing (May 1996), pp. 267–275.
[26] Mogul, J. C., and Borg, A. The effect of context switches on cache performance. In Proc. of the 4th Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (Apr. 1991), pp. 75–84.
[27] Mogul, J. C., and Ramakrishnan, K. K. Eliminating receive livelock in an interrupt-driven kernel. ACM Transactions on Computer Systems 15, 3 (Aug. 1997), 217–252.
[28] Muir, S., and Smith, J. Piglet: a Low-Intrusion Vertical Operating System. Technical Report MS-CIS-00-04, Distributed Systems Lab, University of Pennsylvania, Jan. 2000.
[29] Nelson, G. Systems Programming with Modula-3. Prentice-Hall, Apr. 1991.
[30] Niehaus, D., et al. Architecture and OS support for predictable real-time systems. Tech. rep., Department of Computer Science, University of Massachusetts, Mar. 1992.
[31] Patterson, R. H., et al. Informed prefetching and caching. In Proc. of the 15th ACM Symp. on Operating Systems Principles (Dec. 1995), pp. 79–95.
[32] Rashid, R., et al. Mach: A foundation for open systems. In Proc. of the 2nd Workshop on Workstation Operating Systems (WWOS-II) (Sept. 1989), pp. 109–113.
[33] Rashid, R., et al. Mach: A system software kernel. In Proc. of the 34th IEEE Computer Society Int'l Conference (COMPCON 89) (Feb. 1989), pp. 176–178.
[34] Sincoskie, W. D., and Farber, D. J. The Series/1 Distributed System: Description and comments. In Proc. of the 21st IEEE Computer Society Int'l Conference (COMPCON 80) (Fall 1980), pp. 579–584.
[35] Solomon, D. Inside Windows NT. Microsoft Press, 1998.
[36] Stonebraker, M. Operating system support for database management. Communications of the ACM 24, 7 (July 1981), 412–417.
[37] Thekkath, C. A., et al. Implementing network protocols at user level. In Proc. of ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (Sept. 1993), pp. 64–73.
[38] von Eicken, T., et al. U-Net: A user-level network interface for parallel and distributed computing. In Proc. of the 15th ACM Symp. on Operating Systems Principles (Dec. 1995), pp. 40–53.
