
Proteus: Efficient Resource Use in Heterogeneous Architectures

Sankaralingam Panneerselvam and Michael M. Swift
Computer Sciences Department, University of Wisconsin–Madison

{sankarp, swift}@cs.wisc.edu

Abstract

Current processors provide a variety of different processing units to improve performance and power efficiency. For example, ARM's big.LITTLE, AMD's APUs, and Oracle's M7 provide heterogeneous processors, on-die GPUs, and on-die accelerators. However, the performance experienced by programs on these accelerators can be highly variable due to issues like contention from multiprogramming or thermal constraints. In these systems, the decision of where to execute a task must consider not only stand-alone performance but also current system conditions and the program's performance goals, such as throughput, latency, or real-time deadlines.

We built Proteus, a kernel extension and runtime library, to perform scheduling and handle task placement in such dynamic heterogeneous systems. Proteus enables programs to perform task placement decisions by choosing the right processing units with the libadept runtime, since programs are best suited to determine how to achieve their performance goals. System conditions, such as load on accelerators, are exposed by the OpenKernel, the kernel component of Proteus, to libadept, enabling programs to make an informed decision. While placement is determined by libadept, the OpenKernel is also responsible for resource management, including resource allocation and enforcing isolation among applications. When integrated with StarPU, a runtime system for heterogeneous architectures, Proteus performs 1.5-2x better than StarPU's native scheduling policies in a shared heterogeneous environment.

1. Introduction

Systems with heterogeneous processors and a variety of accelerators are becoming common as a solution to the power wall [36, 62]. Major vendors like Intel and AMD have started packing the CPU and GPU on the same chip [2, 9]. ARM's big.LITTLE architecture [3] packs processors with different microarchitectures on the same die. IBM's WireSpeed processor [32] and Oracle's M7 [49] are provisioned with accelerators for specific tasks like XML parsing, compression, encryption, and query processing.

In addition to static heterogeneity from different processing units, future systems will exhibit dynamic heterogeneity, where the performance observed by programs on these processing units can vary over time for several reasons. First, contention for accelerators [34, 56] alters the performance benefits for programs. Given the maturity of programming models [4, 11, 14], many programs will try to leverage accelerators and compete for access to them. The growth in virtualization also promotes contention, as accelerators may be shared not just by multiple programs but also by multiple guest operating systems [34]. Second, power and thermal limits can prevent all processing units from being used at full power [38], and processing units' performance (e.g., processor p-states) can vary as a result. More generally, future chips will have more Dark Silicon [31], where only a fraction of the logic can be used concurrently. Finally, data copy overhead can greatly diminish the performance benefits offered by certain accelerators [58].

Building systems that accommodate programs with different metrics, such as throughput for servers, latency for interactive applications, deadlines for media applications, or minimum resource usage for cloud applications, becomes extremely difficult in such a highly dynamic environment. However, a major trend that can be leveraged in such heterogeneous systems is that the same program task—a parallel region of code or encrypting a data chunk—can often be executed on different processing units with different performance or energy efficiency. For example, a parallel region can be run on a GPU or multi-core CPUs, and encryption can be performed using a crypto accelerator or AES instructions. Programs designed to be adaptable can thus be scheduled on alternate processing units to achieve their desired goals.

Current operating systems treat accelerators such as GPUs as external devices, and hence provide little support for applications to decide where best to run their code. Similarly, there is little support for heterogeneous processors (e.g., big.LITTLE) today. Researchers have proposed system schedulers for heterogeneous cores (e.g., [57]) and coordinated CPU/GPU scheduling (e.g., [34, 47, 56]). These systems modify the OS kernel to address contention for resources, but do not support applications that can use multiple different processing units, such as running code either on a CPU or a GPU, or using an accelerator.

Conversely, application runtimes for heterogeneous systems automatically place application tasks on a processing unit from user mode [17, 21]. They offload user-defined functional blocks, such as encrypting a chunk of data, compressing an image, or a parallel computation, to the best accelerator or processing unit available. However, existing runtimes assume static heterogeneity, where the performance of a processing unit does not vary over time due to contention from other applications or power limits.

The goal of our work is to efficiently execute programs on dynamically heterogeneous systems. This requires a system that adapts to current system conditions, such as load on different processing units. We propose that the operating system alone should manage shared resources, such as accelerators, by performing resource allocation, monitoring, and isolation. Programs then dynamically adapt to available resources by placing tasks on suitable processing units to achieve their own performance goals.

We built Proteus, a decentralized system that extends Linux by separating resource management from task placement decisions, as a first step toward this goal. The OpenKernel is a kernel component that provides resource management: allocating resources, enforcing scheduling policies over all processing units, and publishing usage information, such as accelerator utilization, that enables programs to adapt to system conditions.

The libadept runtime layer is the user-mode component of Proteus that makes task placement decisions. It presents a simple interface to execute tasks and acts as an abstraction layer for the processing units present. The runtime also builds a performance model of program tasks on different processing units. At runtime, libadept combines this model with system state information from the OpenKernel to predict a task's performance on different processing units. Applications can specify a performance goal that the runtime uses to select the best unit for the task to achieve the application's goal, such as low latency or high throughput.

Applications that benefit more from using a specific processing unit will continue to use it even under contention, while those that benefit only slightly will migrate to other processing units [53]. As a result, even with fully decentralized placement decisions, applications with the highest speedup will receive access to the processing unit.

While designed for stand-alone use, Proteus can also act as an execution model for other programming systems. For example, we modified StarPU [21], which distributes tasks among heterogeneous processing units but is not aware of contention, to use Proteus to make task placement decisions based on system conditions. This enables existing applications written for StarPU to run on our system without any modifications.

We performed an extensive set of experiments with GPUs and emulated heterogeneous CPUs on a variety of workloads. We show that Proteus enables StarPU to perform 1.5-2x better than StarPU's native scheduling policies, which illustrates that Proteus improves on systems that rely on static performance estimates. We also show that the performance of Proteus's decentralized design is comparable to a centralized scheduler with a global view of the system. Proteus preserves isolation among applications even though it allows applications to make task placement decisions. Finally, we show that applications can achieve customized goals in Proteus.

2. Design

The major design goals for Proteus are:

1. Adaptability. A program should continuously adapt to changing system conditions when choosing where to execute code.

2. Efficiency. Programs that achieve higher speedups on a processing unit should receive more access to it.

3. Isolation. Accelerators should be treated as shared resources, and isolation among applications using them should be governed by a global policy irrespective of the runtimes used by the applications.

4. Unintrusive. Minimal OS changes should be required.

We designed Proteus as a decentralized architecture separating resource management (how much access a process gets to a resource) from task placement (where tasks run) decisions, as shown in Figure 1. The former is left to the kernel, which is responsible for isolation, while the latter is pushed closer to applications. The application runtime identifies the best processing unit to execute each task. Unlike other systems that perform resource management and task placement in the kernel [34, 56, 57], Proteus divides the functionality for the following reasons:

• Moving placement to applications simplifies the kernel, which only needs to handle scheduling [30].

• Selecting the best processing unit requires a performance model of the application. Moving this to applications allows them to benefit from application-specific models.

• Applications have the best knowledge of their scheduling goals, such as real time for media applications, and hence are best equipped to choose among processing units with different performance characteristics [6, 18].

• Some accelerators can be accessed directly from user mode [32], so invoking a centralized system can increase latency [47].

Terminology. (i) We use the words application and program interchangeably. (ii) The term processing units refers to all compute units, including CPUs, GPUs, and accelerators. (iii) A task in Proteus is a coarse-grained functional block, such as a parallel region or a function (e.g., encrypting a block of data), that executes independently and to completion. (iv) We use the term scheduling to mean selecting the next task to execute on a processing unit and deciding how long that task may run (where possible). We use placement to mean selecting the processing unit for a task. Many systems, including Linux, have a per-core scheduler and a separate placement mechanism for assigning threads to cores.

2.1 Platform Requirements

We designed Proteus to target heterogeneous systems with multiprogrammed workloads, such as hand-held devices [19], cloud servers hosting virtual machines [1], and shared clusters [41].


Figure 1. Proteus architecture. (Diagram: in user space, applications using accelerators link the libadept runtime with its stubs and profiler; in kernel space, the OpenKernel contains accelerator agents (H-Core, CPUs, GPU, crypto) and the accelerator monitor; the hardware comprises small cores, a big core, GPUs, and a crypto accelerator.)

Heterogeneous. Proteus targets architectures that provide heterogeneous processing units in the form of discrete or on-chip accelerators, as well as single-ISA and multi-ISA heterogeneous processors. Target systems include unified CPU/GPU chips, such as AMD's APUs [2], systems with discrete GPUs, and systems with on-chip accelerator devices [32, 61].

Multiprogrammed. Without multiprogramming, a single application can take complete control of the hardware and thus assume the heterogeneity is static and predictable. In contrast, multiprogramming leads to contention and variable performance as applications come and go.

2.2 Resource Management

Proteus promotes shared resources like accelerators to first-class entities through its resource management support in the kernel. The major responsibilities of the kernel layer in Proteus are resource allocation, enforcing isolation, and exposing system state information about processing units. We call this kernel component the OpenKernel because of the transparency it provides to applications. The OpenKernel consists of accelerator agents, which act as resource managers, and the accelerator monitor, which publishes usage information from the agents.

2.2.1 Accelerator Agents

In order to support a wide variety of devices with different characteristics, Proteus attaches an agent to each unique type of processing unit. The agent acts as a sub-scheduler for a pool of similar processing units, such as small CPU cores. Resource allocation decisions are based on a policy supported by every agent. Policy types (such as share-based) and policy information (such as a higher share for important applications) are provided to the agent by administrators. Policy enforcement allows Proteus to provide isolation even in the presence of a malicious application that could overload an accelerator device. Irrespective of the runtime an application uses to offload tasks to a processing unit (OpenCL or CUDA for a GPU, for instance), the task has to reach the device through the agent in Proteus. After policy enforcement (like throttling after exhausting shares), the agent uses the device driver to launch tasks on accelerator devices. For asymmetric CPU cores, it creates a worker thread affinitized to each core, to which an application enqueues tasks. Proteus only exposes the ability to select a type of processing unit for a task, similar to processor affinity, but does not allow user-mode code to make scheduling or resource allocation decisions.

Agents are also responsible for sharing state information about the processing units they manage, such as their usage, with applications. The agent tracks information like the utilization of a processing unit by every application, the average size of tasks, and the number of applications. Agents provide this information to the accelerator monitor, which publishes it to applications. Applications can use this information to predict when they can use the device and for how long.

2.2.2 Accelerator Monitor

The accelerator monitor is a centralized kernel service that publishes state information from the various agents and enables applications to determine the best processing unit to run a task. Some information, such as the total utilization of a processing unit, is published globally to all applications. Other information, such as an application's expected share of a unit, is private to a process. The per-application information reflects the priority, share, or scheduling guarantees of an application. For example, processing units that have been reserved for exclusive use by one application will show up as busy for all other applications.

2.3 Task Placement

Proteus follows other runtimes for heterogeneous architectures [4, 11, 21, 52] and employs a task-based programming interface. The programmer or compiler generates task implementations for multiple processing units (not necessary for single-ISA systems). There are several runtimes [4, 11], research systems [29, 44], and initiatives like HSAIL [8] that allow code to be written once and then compiled into binaries optimized for different compute devices. For more varied architectures, such as fixed-function accelerators, a programmer may have to code multiple implementations.

The user-mode runtime layer of Proteus is contained in the libadept library and provides three key services. First, the runtime builds a performance model for every program task on each processing unit, which allows it to predict performance. Second, it provides a simple interface for placing tasks on the best available processing unit. Finally, it allows applications to specify performance goals that guide the placement decision. These three services combined enable the runtime to make smart task placement decisions in order to achieve application-specific goals.

Performance model. The runtime builds a model to predict the performance of program tasks on different processing units with the help of a task profiler. The model is built by executing the program's tasks on different processing units and extracting information, like the processing time per unit of data, from the task execution time and the amount of data the task operated on. In addition, the profiler measures the overhead of using a processing unit, such as dispatch time and data copying. These measurements form a model that can be used to predict the performance of a task on any processing unit in the system. We later discuss applications that do not fit this model in § 3.2. We should also note that the model output is the stand-alone task performance. The model must be combined with system state information to predict the actual latency to finish a task on a processing unit.

Accelerator stubs. Proteus abstracts different processing units into a single procedural interface, so application developers reason about invoking a task but not about where the task should run. Applications invoke the stub interface to launch a task. Stubs use information from the OpenKernel and the performance model from the profiler to predict the performance of the task on all available processing units. The processing unit that yields the best predicted performance is chosen, and the stub passes data to the implementation, handling data migration for processing units that lack coherent memory, such as GPUs.

Optimizer. The optimizer is a background service in the runtime and runs as part of the application. It acts as a long-term scheduler [55] aiming to achieve performance goals set by the user, like a minimum throughput guarantee. The optimizer either requests more or fewer resources from the OS, or notifies the application to increase or decrease its workload based on the resources available. For example, a compression algorithm may reduce its compression factor to keep up with the rate of incoming data, a latency-sensitive application may choose to give up output quality [37, 51] by opting for the processing unit with the shortest queuing delay, while a throughput-oriented program may be willing to wait in order to process a larger volume of data.

3. Implementation

We implemented Proteus as an extension to the Linux 3.4.4 kernel. The code consists of accelerator agents for the CPU and GPU, the accelerator monitor, and the libadept shared library linked into user-mode applications. The OpenKernel and the libadept runtime consist of around 2,000 and 3,400 lines of code, respectively. In this section we describe the implementation of Proteus, beginning with how agents schedule tasks, followed by user-mode task placement.

3.1 Accelerator Agents

We implemented accelerator agents for GPUs and heterogeneous CPUs (with both standard and fast cores). These agents enforce a scheduling policy that reorders requests to achieve policy goals, and they export device usage information to the accelerator monitor. Table 1 shows the details exposed by the agents. Proteus does not yet handle directly accessible accelerators, but disengaged schedulers [47] could be extended to implement the agent functionality for such devices.

Agent  Details Published                           Visibility
CPU    # active threads at each priority level     Global
       Schedule time ratio for each priority       Global
       # standard cores per application            Local
GPU    Current utilization by an application       Local
       Maximum share allowed for an application    Local
       Average task size                           Global
       # active applications                       Global
       Runqueue status of each priority queue      Global

Table 1. Details published by agents.

GPU agent. The GPU agent manages all GPUs present and currently relies on vendor-specific kernel drivers to submit tasks to the GPU units. GPU drivers are usually closed binaries that we cannot extend with agent functionality. We therefore place the GPU agent functionality in a separate kernel-mode component that receives scheduling information from an administrator and enforces the scheduling policy. The runtime calls the GPU agent before offloading any tasks, and the agent can stall tasks to enforce the scheduling policy. To track device usage, libadept passes task execution times to the agent after each task completes. The runtime's assistance could be avoided if the agent were better integrated with the GPU driver; additional support, such as interfaces to add new scheduling policies, might be needed.

The agent calculates the utilization for each application from the information passed by the runtime. Short-lived processes do not accumulate much utilization (current utilization is zero), so agents also report the average task size on a processing unit and the number of applications using the device. We implement three different policies in the agent: FIFO, priority-based, and share-based. The FIFO policy is the default policy in GPU drivers and exposes a simple FIFO queue for scheduling tasks from different applications.
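To make this accounting concrete, the C sketch below shows the kind of bookkeeping an agent might perform per application; the paper describes the behavior but not the code, so the structure and names here are hypothetical.

/* Hypothetical per-application accounting in a GPU agent. libadept
 * reports each task's execution time on completion; the agent
 * aggregates busy time per window and an average task size. */
struct app_stats {
    unsigned long busy_us;      /* busy time in the current window */
    unsigned long tasks;        /* tasks completed so far */
    unsigned long avg_task_us;  /* running average task size */
};

void agent_task_done(struct app_stats *s, unsigned long exec_us)
{
    s->busy_us += exec_us;
    s->tasks++;
    s->avg_task_us = (s->avg_task_us * (s->tasks - 1) + exec_us) / s->tasks;
}

/* Utilization over the last accounting window, in percent. */
unsigned agent_utilization(const struct app_stats *s, unsigned long window_us)
{
    return (unsigned)(100 * s->busy_us / window_us);
}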

The FIFO model cannot provide equal utilization if tasks have different execution times: a long task can monopolize the device, because current GPU drivers only support context switching at kernel (i.e., task) granularity. The priority-based policy limits monopolization by allowing higher-priority applications to run before lower-priority ones. However, current GPU drivers do not support preemption, so low latency cannot be guaranteed while a long task is running.

The share-based policy provides strong performance isolation among applications. This policy defines three application classes: CUSTOM, NORMAL, and BACKGROUND. In the CUSTOM class, designed for applications with strict performance requirements, applications reserve an absolute share of one or more GPUs, and the total shares on a device cannot exceed 100. Possession of 50 shares on a GPU guarantees the application a minimum of 50% of the time on the device. Shares offered by this policy are specific to a processing unit. The shares unused by the CUSTOM class are given to applications in the NORMAL class, where the remaining shares are distributed equally among applications; for example, if CUSTOM applications hold 40 shares, each of two NORMAL applications receives 30. Finally, BACKGROUND applications use the GPUs only when applications of the other two classes are not using the device.

The share-based policy is the default policy in Proteus. The agent throttles applications that use more than their share. Thus, following a long task, an application will be blocked from running more tasks to let other applications use the GPU. Since the shares are proportional [63], the share for a device is always normalized to the total shares of the active applications using it.
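The normalization and throttling decision can be summarized in a few lines of C; this is our sketch of the stated rule, not the agent's actual code, and all names are illustrative.

/* Sketch of the share-based throttling decision. Shares are
 * proportional: an application's entitlement is its shares divided
 * by the total shares of currently active applications. */
int gpu_agent_may_run(unsigned my_shares, unsigned active_shares,
                      unsigned long my_busy_us, unsigned long window_us)
{
    /* Fraction of the device this application is entitled to. */
    double entitled = (double)my_shares / (double)active_shares;
    /* Fraction of the accounting window it has actually consumed. */
    double used = (double)my_busy_us / (double)window_us;
    /* Block further tasks once usage exceeds entitlement; the
     * application unblocks as the window slides forward. */
    return used <= entitled;
}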

CPU agent. The CPU agent supports two classes of CPU cores, standard and fast, under the assumption that parallel code will run better on more standard cores, while mostly sequential code will run better on one or a few fast cores. The agent leverages Linux's native CFS scheduler [5] for scheduling on CPU cores. It also publishes state information like the number of threads and the ratio of time obtained by threads at each priority level. This information helps predict contention on fast cores. The share-based policy described above, for strong performance guarantees, can be implemented by dynamically mapping the number of shares to a nice priority value, leveraging the fact that every priority level has a fixed time-slice ratio relative to other levels.

Additionally, for standard cores, the agent provides hints on the number of standard cores a program can use without contending with other parallel programs. The calculation is based on a min-funding [64] policy: all cores are allocated to applications based on their priority or share, and unused cores are redistributed. For example, a single-threaded application uses only one core, and the other cores it could claim are instead given to parallel applications. This additional information enables parallel applications to tune their parallelism accordingly.

Accelerator monitor. The accelerator monitor aggregates information from all agents and exposes it to applications. It shares two pages of data with every Proteus application, which has read-only access to them. The global data page is mapped into all Proteus processes and holds global information, such as the average task size on a GPU device. The private page holds application-specific information, such as the application's maximum allowed utilization for each processing unit.
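The paper does not specify the layout of these two pages; a minimal C sketch of what they might contain, with hypothetical field names and sizes, is:

#define MAX_GPUS   4
#define NUM_PRIOS 40

/* Hypothetical layout of the read-only pages the monitor shares with
 * each process; the paper describes the kind of data, not the format. */
struct monitor_global_page {                 /* mapped into all processes */
    unsigned gpu_utilization[MAX_GPUS];      /* total utilization, % */
    unsigned gpu_avg_task_us[MAX_GPUS];      /* average task size */
    unsigned gpu_active_apps[MAX_GPUS];      /* # applications using it */
    unsigned cpu_threads_at_prio[NUM_PRIOS]; /* # active threads */
};

struct monitor_private_page {                /* one mapping per process */
    unsigned my_gpu_utilization[MAX_GPUS];   /* this app's current use */
    unsigned my_max_share[MAX_GPUS];         /* share allowed by policy */
    unsigned my_standard_cores;              /* hint from the CPU agent */
};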

The monitor provides a simple interface that takes as parameters the visibility type, an information identifier, and the new value to be published. When an agent's state changes, such as when new tasks are submitted or existing ones complete, it notifies the monitor. Agents calculate utilization information at periodic intervals of 25 ms, which is then pushed to the monitor. We currently do not synchronize access between the monitor and applications. However, the data consists entirely of scalar values used as scheduling hints, so races are benign.

3.2 libadept Adaptive Runtime

Every Proteus application links to libadept, which consists of the task profiler, the optimizer, and stubs.

Performance model. The runtime includes a profiler that is responsible for building and maintaining a performance model for application tasks. The profiler predicts task execution time from the amount of data processed by the task. If the model parameters, like processing time per unit of data, are not available (at the start of a new application), the runtime executes the application's initial tasks on different processing units, similar to the techniques employed in Qilin [45]. The task execution time divided by the amount of data passed to the task gives the processing time per unit of data for that processing unit. We assume that task execution time is proportional to the amount of data the task uses. The model is saved for use across program invocations. For short-lived applications, the model parameters are calibrated over multiple invocations of the application.

To detect whether the performance model must be updated due to program changes, the profiler constantly compares the predicted task execution time against the actual execution time. If the variation exceeds 25% for more than 10 predictions, the calibration process is repeated to recalculate the model parameters. libadept also runs a generic profiling task during system startup to measure the data copy latency and bandwidth for all processing units in the system. The data copy overhead is set to zero for devices with coherent access to memory (e.g., CPUs), because no explicit copy is required.
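The 25% and 10-prediction thresholds suggest logic along the following lines; this is a sketch of the described behavior, not libadept's actual code.

/* Sketch of the profiler's drift detection: recalibrate when the
 * prediction error exceeds 25% on more than 10 predictions. */
struct task_model {
    double us_per_byte;     /* processing time per unit of data */
    int    mispredictions;  /* predictions off by more than 25% */
};

void model_observe(struct task_model *m, double predicted_us,
                   double actual_us, double bytes)
{
    double err = (actual_us - predicted_us) / actual_us;
    if (err < 0)
        err = -err;
    if (err > 0.25 && ++m->mispredictions > 10) {
        /* Recalibrate from the latest observation. */
        m->us_per_byte = actual_us / bytes;
        m->mispredictions = 0;
    }
}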

The current implementation of libadept does not account for contention on the I/O bus, so a data copy is modeled as a fixed overhead plus a per-byte cost. We also found that our current model does not suit workloads with data-dependent execution times, like Sphyraena (Table 4), which executes SQL select queries: execution time depends on the selectivity of the operators, which our model does not consider. We are investigating more mature profilers that could provide better models [33, 43, 48].

Accelerator stubs. The stub interface is responsible for finding the right processing unit for a given task. It requires multiple implementations of the task for different processing units. The code can be identical for single-ISA architectures or when a compiler can generate code for multiple architectures. Currently, we rely on OpenCL [11] to generate versions of a task for GPUs and CPUs, but application developers can also provide hand-optimized implementations for particular processing units. Task setup entails registering a task identifier with a task object that points to a set of implementations.

Applications launch tasks with the Accelerate() function, which takes a task identifier and a boolean denoting a synchronous or an asynchronous offload. This function uses the model from the profiler and the utilization information from the monitor to choose the best processing unit for the task. The actual task latency is predicted as:

    Latency = Data Copy + Predicted Task Latency * (100 / Utilization)

The utilization factor implicitly captures the running time as well as the waiting time of the task. The profiler provides the predicted latency and the agent provides the utilization. The stub chooses the current utilization or the maximum utilization value (Table 1) based on the policy (FIFO or share-based) employed by the agent. The maximum utilization reflects the share given to the application.
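In code, the prediction reduces to a one-line function; the sketch and the example numbers below are ours, not the paper's.

/* Predicted end-to-end latency of a task on one processing unit.
 * utilization_pct is the percentage of the unit available to this
 * application (current or maximum utilization, per the policy). */
double predict_latency_us(double copy_us, double task_us,
                          double utilization_pct)
{
    return copy_us + task_us * (100.0 / utilization_pct);
}

For example, a task predicted at 10 ms with a 2 ms copy on a GPU of which the application may use 50% is predicted at 2 + 10 * 2 = 22 ms, so a coherent CPU predicted at 18 ms with no copy would be chosen instead.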

Dispatching a task onto a GPU involves informing the GPU agent about the task start, managing any data movement, and finally invoking the OpenCL or CUDA runtime to submit the task to the kernel driver. For processing units with coherent memory, no movement is required. For those without coherence, such as discrete GPUs, stubs explicitly migrate data to and from the GPU's memory when necessary. libadept includes the data movement cost, based on profiler measurements, when predicting task performance.

To dispatch tasks onto CPU cores, libadept starts worker threads for each CPU class in the system (e.g., standard and fast) during program startup, and dispatches tasks to workqueues serviced by the worker threads. The runtime sets the affinity mask of these threads to the CPUs in the class.
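A minimal sketch of this per-class worker setup, assuming a hypothetical blocking workqueue_pop() helper, could look as follows; it is illustrative rather than libadept's actual implementation.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

struct task { void (*fn)(void *); void *args; };
struct workqueue;                                        /* opaque FIFO */
extern struct task *workqueue_pop(struct workqueue *q);  /* blocking */

/* Worker loop: pull the next task for this CPU class and run it. */
static void *worker_main(void *arg)
{
    struct workqueue *q = arg;
    for (;;) {
        struct task *t = workqueue_pop(q);
        t->fn(t->args);
    }
    return NULL;
}

/* Start one worker for a CPU class and pin it to that class's cores. */
static void start_class_worker(struct workqueue *q, const cpu_set_t *cores)
{
    pthread_t tid;
    pthread_create(&tid, NULL, worker_main, q);
    pthread_setaffinity_np(tid, sizeof(cpu_set_t), cores);
}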

For asynchronous tasks, stubs return an event handle to the application and queue the task to an internal scheduler that maintains a central queue. Tasks are processed in order (running the prediction step) and then pushed onto per-processing-unit task queues; the application, though, must enforce data dependencies between tasks. Task execution may not happen immediately upon queuing to the central queue: the scheduler keeps the number of pending tasks for a processing unit low to maintain predictor accuracy by avoiding a long delay between selecting a unit and executing a task.

Optimizer. Proteus enables applications to target different performance goals, such as throughput guarantees, minimum execution time (best performance), or energy efficiency. The optimizer expects two inputs: the expected performance (e.g., 100 MB/Sec or 10 Frames/Sec) and a tracker function that measures application performance. The optimizer spawns a thread that periodically (every 500 ms) invokes the tracker function. libadept implements trackers that record the amount of data processed per second and the number of tasks run per second. Applications can choose from these or provide a custom performance tracker function.

The optimizer operates in three phases when the tracker function indicates that an application is not meeting its goal. First, the optimizer tries to spread tasks across more processing units; applications can thus offload multiple (asynchronous) tasks at once to different units. Second, it attempts to increase shares on different processing units by calling into the OS (agents). This phase expects the application to be in the CUSTOM class, and is important if applications need to meet strict performance goals (e.g., throughput or soft real-time). If the second phase does not succeed, the optimizer notifies the application via a registered callback function, and the application can modify its workload. For example, a game could reduce its frame rate or resolution. When more resources become available (when other applications exit), the application automatically receives them, since the shares are proportional.

The three phases are performed in reverse if an application exceeds its goal, allowing the application to increase its workload, for example by raising the frame rate. Applications without a goal are run by default for best performance (minimum execution time), and the optimizer does not run in this case. Applications belonging to other scheduling classes can still leverage the optimizer, but consistent performance cannot be guaranteed: their performance is mostly best-effort, with the option of reducing fidelity for increased performance.
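A minimal sketch of the optimizer's control loop under the three phases just described might look as follows; the paper gives the behavior, not this code, and every name here is hypothetical.

/* Sketch of one 500 ms optimizer tick for an under-performing app. */
typedef struct app app_t;
struct app {
    double (*tracker)(app_t *);    /* measures achieved performance */
    void (*on_shortfall)(app_t *); /* registered fidelity callback */
    double goal;                   /* e.g., 15 frames/sec */
};

extern int spread_tasks_wider(app_t *a);   /* hypothetical helper */
extern int request_more_shares(app_t *a);  /* hypothetical helper */

void optimizer_tick(app_t *a)
{
    if (a->tracker(a) >= a->goal)
        return;                    /* goal met; nothing to do */
    if (spread_tasks_wider(a))     /* phase 1: use more units */
        return;
    if (request_more_shares(a))    /* phase 2: CUSTOM class only */
        return;
    a->on_shortfall(a);            /* phase 3: app reduces fidelity */
}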

/*** libadept ***/
void Accelerate(Task *task, bool sync) {
    /* 1. Iterate through all processing units.
       2. Calculate the actual task latency on all units.
       3. Let punit be the processing unit offering
          minimum latency. */
    Schedule(task, punit);
}

/*** Application ***/
void TaskGPU(void *args) { /* Task logic on GPU */ }
void TaskCPU(void *args) { /* Task logic on CPU */ }

int main() {
    ...
    Task *task = InitializeAcceleratorTask(SIMD, TaskGPU,
                                           TaskCPU, args);
    Accelerate(task, ASYNC);
    WaitForTask(task);
    ...
}

Listing 1. Program using libadept.

Data movement. When the optimizer chooses a processing unit for a task, the stub must migrate dependent data to the destination accelerator. Switching between different processing units hurts performance, as data moves back and forth. This can occur when a task is at the break-even point between two processing units, so that a small change in utilization can push the task to switch processors. It can also occur when there are rapid changes in the utilization of a unit and predictions of performance are inaccurate.

Proteus implements mechanisms to avoid both causes of task migration. For tasks near the break-even point, Proteus dampens movement with a speedup threshold: tasks only migrate if the expected speedup is high enough to quickly amortize the cost of data movement. Thus, small performance improvements that require expensive copies are avoided. In addition, for subsequent task offloads, stubs make an aggregate decision that determines where to dispatch the next group of tasks, rather than making the dispatch decision for each task before switching to the new unit. In this task aggregation, the group size grows with the data movement overhead.

Stability. Rapid changes in utilization can occur when multiple applications simultaneously make an offload decision: they may all run tasks on the same processing unit, suffer poor performance, and then all migrate away from the unit. This ping-pong problem is common to distributed control systems, such as in routing [42]. The major source of the problem is staleness in the published information [28]. Proteus reduces staleness by making the agents update the utilization information at regular intervals rather than waiting for tasks to complete. Proteus also takes a two-step approach to resolve the ping-pong problem. First, libadept identifies the problem by detecting that the predicted runtime differs from the actual runtime; this indicates that the utilization obtained from the monitor was wrong. Applications then resolve the issue without any communication among themselves. Second, on detecting the problem, libadept temporarily uses the actual utilization from the last task instead of the published utilization. Thus, the first applications to use the unit run quickly and observe high utilization, while those arriving later experience queuing delay from congestion and hence lower utilization. Applications with higher delays will tend to choose a different processing unit, while those with less delay stay put.
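The speedup-threshold dampening can be summarized in a few lines of C; this is our sketch of the stated rule, with illustrative names and an assumed threshold parameter.

/* Sketch of migration dampening: move a task's data to a new unit
 * only if the predicted speedup amortizes the copy cost quickly. */
int should_migrate(double cur_latency_us, double new_latency_us,
                   double copy_cost_us, double threshold /* e.g., 1.5 */)
{
    /* Charge the one-time copy to the first task on the new unit. */
    double effective = new_latency_us + copy_cost_us;
    return cur_latency_us / effective >= threshold;
}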

Name         System Configuration             Accelerators
asym config  12 cores in 2 Intel Xeon X5650   2 fast cores (one per socket) at 62.5% duty cycle;
             chips, 12 GB RAM                 remaining 10 cores at 37.5% duty cycle
simd config  12 cores in 2 Intel Xeon X5650   GPU-Bulky (GPU-B): NVIDIA GeForce 670 with 2 GB GDDR5 RAM;
             chips, 12 GB RAM                 GPU-Wimpy (GPU-W): NVIDIA GeForce 650 with 512 MB GDDR5 RAM

Table 2. System configurations.

Applications  Task Size Ratio  Task Speedup Ratio
                               GPU-B   GPU-W
AES           1                32      22
LBM           3.33             1.4     0.5
DXT           5.6              16      4.5
lavaMD        20               27      7.5
Grep          33               10      3.3
Histogram     83               12      4

Table 3. GPU application characteristics. Speedups are relative to the CPU alone, and size is relative to AES.

Programmer effort. The stub exposes new interfaces for programmers to write their applications against. We created this simple task-based programming model to prototype a complete system, from the programming model down to the system software. Though writing a program using the new interfaces takes little effort, as shown in Listing 1, it still requires the program to be rewritten for the new interface. To avoid this extra work, we have started integrating the libadept runtime with StarPU [21], an existing runtime for heterogeneous architectures. libadept is used as an execution model: the ability of stubs to combine the performance model with system state to predict task performance has been ported to StarPU's dispatch mechanism. We did not change StarPU's interfaces for application development, so all programs already written for the StarPU model can leverage libadept and the OpenKernel without any code change. We should note that the task implementations could rely on other runtimes [11, 27, 52] or be hand-optimized code for a particular accelerator.

4. Evaluation

We answer four major questions through our evaluation: (a) How beneficial is adaptation in a shared environment? (b) How good is the performance of a decentralized system? (c) Can Proteus isolate the impact of malicious or non-cooperative applications? (d) Can Proteus provide application-specific performance? Finally, we also evaluate the overhead and accuracy of individual components.

Name               Description
Blackscholes [24]  Mathematical model of a financial market
Dedup [24]         Deduplication
Pbzip [12]         File compression
AES [16]           AES-128-ECB mode encryption, OpenCL
LBM [60]           Fluid dynamics simulation, OpenCL
DXT [10]           Image compression, OpenCL
lavaMD [26]        Particle simulation, OpenCL
Grep [59]          Search for a string in a set of files, OpenCL and OMP for CPU
Histogram [59]     Frequency of dictionary words in a list of files, OpenCL and OMP for CPU
Sphyraena [22]     Select queries on a sqlite database, CUDA
EncFS [56]         FUSE-based encrypted file system, CUDA
Truecrack [15]     Password cracker, CUDA
x264 [46]          Video encoder, OpenCL

Table 4. Workloads.

4.1 Experimental Methods

Table 2 lists the configurations we use for our experiments. We disable Turbo Boost and hyperthreading to avoid performance variability and to enable control over power usage.

Platform. Processors with a mix of CPU types were not available, so we emulate an asymmetric CPU (asym config) using Intel's clock-modulation feature [39] (on our platform, DVFS cannot be applied to a single core) to slow down all cores but the fast ones. Similar mechanisms have been used in past research to emulate asymmetric processors [20]. We emulate powerful cores by setting all but two cores to run at a 37.5% duty cycle and the remaining two at 62.5%, so the speedup of a fast core over a slow core is 1.6x [3]. The simd config configuration models a collection of GPU devices and multi-core CPUs. Applications access these processing units through OpenCL, using the NVIDIA OpenCL SDK for GPUs and the Intel OpenCL SDK for CPUs.

The GPU workloads are run simultaneously for all experiments performed on simd config. Performance (system throughput) is measured as the number of tasks executed over a period of 60 seconds. We restart a workload if it finishes before the experiment is completed. We allow early workload completion only for the results shown in Figure 3 and Figure 6. We normalize the number of tasks executed based on the task size ratio shown in Table 3.

Workloads. We run the workloads listed in Table 4 in a multi-programmed environment to create contention for hardware resources. We select workloads with specific characteristics, like tasks suitable for fast cores, varying task sizes, and real applications, to demonstrate Proteus's capabilities and to exercise different forms of heterogeneity.

Figure 2. Contention for fast cores. (Bar chart: speedup relative to serial for Dedup, Blackscholes, and Pbzip under the Always-Fast, Asym-Aware, Proteus, and Proteus+ configurations.)

Dedup, Pbzip, and Blackscholes are parallel applications with tasks that can benefit from running on powerful CPU cores. We use six workloads for GPUs, described in Table 4. These workloads demonstrate the effect of task size and implementation style. For OpenCL programs, we compile to both GPU and CPU code; the same set of applications was also ported to the StarPU runtime. For CUDA programs, we use a separate implementation for the CPU. Table 3 shows the speedup for these applications when running on GPUs and their average task sizes (relative execution time). The speedup shown is relative to using all 12 CPUs at full performance. We report the average of five runs; variation was below 2% unless stated otherwise.

4.2 Adaptation

A major benefit of Proteus is the ability to dynamically adapt to run-time conditions. We evaluate how well applications adapt to a set of common scenarios.

Contention for fast cores. Proteus allows applications to accelerate important code regions on a small number of fast cores. We concurrently run multiple parallel applications that have tasks suitable for a fast CPU: the compression phase in Dedup and Pbzip, and the financial calculation in Blackscholes. The speedup of a fast core over a regular core is 1.6x. Using asym config, we compare three configurations: (a) Always-Fast, a compile-time static policy that always runs all tasks on the fast cores; (b) Asym-Aware, which modifies the Linux scheduler to execute tasks on normal cores but migrate them to a powerful core if it becomes idle; and (c) Proteus, where the stub decides where to run tasks based on the achievable speedup. To demonstrate the ability of Proteus to prioritize fast cores for applications with better speedup, we configured Pbzip to receive a speedup of 2.65x on the faster cores by modifying the scheduler to raise the duty cycle to 100% only when Pbzip runs on a faster core. The Proteus+ configuration runs Pbzip at this higher speedup.

Figure 2 shows performance normalized to serial versions of the applications running on a regular core. The overall performance of Proteus and Asym-Aware is comparable, though the two configurations perform differently on Dedup and Blackscholes because of the variation in task sizes for Dedup. Always-Fast performs poorly because it under-utilizes the regular CPU cores while waiting for the fast CPUs. Proteus+ runs the application with the best speedup, Pbzip, on the fast cores, which gives that application a 27% speedup compared to Proteus; a corresponding performance drop can be observed for Blackscholes. We show the high-speedup setup only for the Proteus+ configuration, since performance in that setup was similar for the other configurations.

Figure 3. Proteus task placement with FIFO policy. (Timeline: which of GPU-B, GPU-W, and the CPUs each application—Histogram, lavaMD, Grep, DXT, AES, LBM—uses over 100 seconds.)

Contention for accelerators. A key goal of Proteus is to manage contention for a shared accelerator. In current systems, applications cannot choose between processing units and thus must wait for the contended accelerator. Proteus, though, allows applications to use other processing units if they are faster than waiting. As a result, applications with a large benefit from an accelerator tend to use it preferentially, while applications with less benefit migrate their work to other processing units. Task size also plays a role with the FIFO policy: large tasks dominate the usage of the bulky GPU. We ran all the OpenCL GPU workloads concurrently on simd config using the FIFO policy.

In the Native system, all applications offload tasks to GPU-B, which leaves GPU-W and the CPUs idle. In contrast, Proteus achieves 1.8x better performance because it makes use of GPU-W and the CPUs to run tasks as well. To explain these results, Figure 3 plots the processing unit used by each application from a run where we launch all applications at the same time. LBM uses the CPUs exclusively because it gets low performance on the GPUs. In contrast, Histogram and Grep always use GPU-B because of their relatively larger tasks. This causes long delays for other applications, and hence lavaMD and DXT move to GPU-W. When Histogram completes, lavaMD switches to GPU-B and shares the device with Grep. AES, on the other hand, switches execution between the CPU and the GPUs, since the amount of data to encrypt varies with every task offloaded by the program. For smaller data, AES runs on the CPU to avoid costly data copies. The three column stacks labeled FIFO in Figure 5 show the percentage of time each application gets on different processing units. The native stack at the left shows the default behavior, where all tasks are offloaded to GPU-B alone.

Figure 4. Performance (throughput) of different systems, relative to Native: Native, c-sched, Proteus, StarPU DMDA, StarPU Eager, and StarPU Proteus. (Bar chart.)

Proteus as an execution model. We integrated Proteus as the execution engine for StarPU, a runtime for heterogeneous architectures, to analyze the impact of contention awareness in the runtime. The results are shown in Figure 4.

We ran the experiment described above, running all six workloads simultaneously, on four different systems, for a total of six configurations. c-sched is a simple centralized system that we describe further in Section 4.3. We tested two policies with StarPU [13]: DMDA offloads each task to the best-performing processing unit, and Eager employs a task-stealing approach where the worker threads of any processing unit can run a task as long as an implementation for that unit is available. All six workloads were rewritten to run on the StarPU runtime.

StarPU+Proteus outperforms the native policies of StarPU (Eager and DMDA) by 1.5x and 2x, respectively. Ideally, StarPU on Proteus should perform similarly to c-sched and Proteus. However, when LBM with StarPU runs on the CPUs, it suffers performance degradation due to interference from worker threads in the StarPU runtime, which reduces the throughput of that configuration. The performance of the other applications was similar to that of c-sched and Proteus.

4.3 Decentralized vs. Centralized

We want to evaluate the performance of Proteus against a centralized system to understand how much performance is lost due to the decentralized architecture.

Centralized scheduler. We want a simple system that behaves like real centralized systems such as PTask or Pegasus, but without the complexity and kernel modifications of those systems. We built a centralized scheduler (c-sched) that is aware of tasks from all applications and can make perfect task placement decisions. c-sched is built on top of libadept and maintains task information from all applications on a shared page (a page mapped into every application's address space). c-sched maintains a global queue per processing unit rather than per-application queues. This gives c-sched a global view of the system, allowing it to predict the wait time for every processing unit accurately. However, it expects applications to be cooperative, and the system can easily be gamed by providing erroneous speedup values.

Figure 5. Percentage of time spent on different devices (GPU-B, GPU-W, CPUs) by Histogram, Grep, LBM, lavaMD, DXT, and AES under various policies: Native & StarPU, FIFO Proteus, Shares Native & StarPU, and Shares Proteus. The Native & StarPU and Shares Native & StarPU columns use only GPU-B. The Shares Assigned column is the input.

Short-lived applications. Task placement for short-lived applications is hard without global knowledge of the system, since the utilization information can fluctuate. We generated such applications, which start, run a small task for 0.5-2 ms, and then exit, using small tasks with varying inputs from the AES and DXT workloads. We ran multiple such applications periodically and measured the overall throughput of the system in tasks processed per second. The c-sched system should provide the best performance, since it has global knowledge of the system. The performance of Proteus is only 3.5% lower than c-sched, whereas the native system, which offloads all tasks to GPU-B, is 20% slower than c-sched. Proteus performs well even without global knowledge of all workloads by using the average task size and average number of applications reported by the agents rather than utilization information. This shows that Proteus's decentralized design performs nearly as well as a centralized system.

Varying mix of workloads. We also compare Proteus on a more varied set of workloads, created by varying three parameters: task size (large or small), speedup (high or low), and longevity (long-running or short-lived). We generated eight synthetic workloads with manually constructed performance models for the different configurations. All eight workloads were run, and system throughput was measured. We compared the results from c-sched, which could see all tasks and select the best unit for each, against Proteus. Overall, Proteus performed similarly to c-sched, whereas the native system performed 40% worse than c-sched. Despite the lack of global knowledge, Proteus achieves performance equal to an omniscient global scheduler.

4.4 Preserving Isolation

Proteus performs placement only in user mode, but leaves scheduling and policy enforcement in the kernel to protect against malicious applications. To demonstrate this, we manually allotted CUSTOM shares on GPU-B in the ratio of 25:25:25:10:10:5 to Histogram, Grep, DXT, lavaMD, LBM, and AES. This share ratio shows that (a) applications with large tasks (Histogram) can be constrained, (b) applications with small tasks (LBM) can receive their guaranteed share, and (c) isolation is achieved even in the presence of varied task sizes.

Proteus applications. The three stacks on the right, labeled Shares Proteus in Figure 5, show the portion of each processing unit used by each application. Under the FIFO-based policy, Histogram gets the major portion of GPU-B because of its large task sizes. With share-based scheduling, Histogram, Grep, and DXT evenly share GPU-B, while lavaMD uses GPU-W, since it yields better performance than its guaranteed 10 shares on GPU-B. Three applications (Histogram, Grep, and DXT) receive more than their assigned 25% of GPU-B (30% each) because other applications decided to offload tasks to GPU-W or the CPUs; the three active applications thus enjoy equal shares of GPU-B (see the sketch below). Likewise, applications with lower shares of GPU-B, such as lavaMD and LBM, offload tasks to GPU-W and the CPUs, respectively, since the performance is better than on their 10% of GPU-B. In summary, Proteus guarantees isolation by enforcing shares and makes a best effort to improve performance by migrating tasks to alternate processing units.

StarPU applications. The leftmost stack in Figure 5 shows the utilization received by each application when no policy is enforced. StarPU uses the DMDA policy, which makes it similar to the native system where all tasks run on GPU-B. The Shares Native & StarPU column toward the right shows the behavior when the share-based policy is enforced by the agents. There are two key points: (a) the native system and StarPU by themselves do not provide isolation, and (b) as the fourth stack shows, isolation is provided even when applications are not built using libadept.
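The redistribution at work here can be captured in a few lines: an application's effective fraction of a device is its assigned shares divided by the total shares of the applications currently active on that device, so shares left idle by applications that offloaded elsewhere are recycled. This is a sketch under our own naming; the agents' actual share accounting may differ.

  /* Effective fraction of a device for `app`, counting only active
     applications in the denominator. */
  double effective_fraction(const int shares[], const int active[],
                            int napps, int app)
  {
      int total = 0;
      for (int i = 0; i < napps; i++)
          if (active[i])
              total += shares[i];
      return total ? (double)shares[app] / total : 0.0;
  }

  /* Figure 5's setting: shares 25:25:25:10:10:5. Once lavaMD, LBM and
     AES leave GPU-B, each remaining application gets 25 / (25+25+25),
     i.e. 33%, in line with the ~30% observed. */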

4.5 Application-Specific Goals

The libadept library allows applications to set their own performance goals through an optimizer that guides offload decisions. We demonstrate Proteus's ability to support application-specific performance goals by running three applications (x264, Sphyraena, and EncFS) with their own goals in the CUSTOM class. We also run Truecrack in the BKGRND class. The stand-alone performance of each application is shown in Table 5. The goal of each application is listed below.

• x264: Frame rate of 15 Frames/Sec, where encoding quality can be given up for performance.
• EncFS: Bandwidth of 275 MB/Sec on sequential reads.
• Sphyraena: Minimum throughput of 30 Queries/Sec.

We expect Proteus to adapt automatically, by finding the right set of processing units, adjusting shares, or calling back to applications to alter their configuration, without any manual intervention from the applications in achieving their goals. We started the applications with no shares and let libadept acquire shares on their behalf. The adaptation of the system is shown as a timeline in Figure 6; a sketch of the optimizer loop follows Table 5.

           Truecrack   Sphyraena   EncFS (read)   x264
  GPU-B    320 W/S     38 Q/S      340 MB/S       15 F/S
  GPU-W    230 W/S     13 Q/S      190 MB/S       -
  CPU      15 W/S      0.2 Q/S     13 MB/S        -

Table 5. Stand-alone performance of CUDA workloads. W/S = Words/Sec; Q/S = Queries/Sec; F/S = Frames/Sec.
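The adaptation sequence can be viewed as a periodic control loop. The C sketch below is a simplified rendering under assumed names (measure_rate, try_additional_pu, request_more_shares, and the fidelity callback are hypothetical helpers); the real optimizer's interfaces and escalation policy are richer than this.

  struct app {
      double goal_rate;                      /* e.g., FPS or Queries/Sec */
      void (*fidelity_cb)(struct app *, double achieved, double goal);
  };

  /* Hypothetical helpers, provided elsewhere by the runtime. */
  double measure_rate(struct app *a);
  int try_additional_pu(struct app *a);      /* nonzero on success */
  int request_more_shares(struct app *a);    /* nonzero on success */

  void optimizer_tick(struct app *a)
  {
      double achieved = measure_rate(a);
      if (achieved >= a->goal_rate)
          return;                            /* goal met: nothing to do */
      if (try_additional_pu(a))              /* spread tasks wider */
          return;
      if (request_more_shares(a))            /* grow the allocation */
          return;
      if (a->fidelity_cb)                    /* last resort: ask the app
                                                to trade quality for rate */
          a->fidelity_cb(a, achieved, a->goal_rate);
  }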

The first phase, until time 40, shows the ability of the runtime to let applications adapt themselves by reducing fidelity to achieve their desired performance goals. x264 can trade off its output quality for better performance (FPS). During this initial phase, x264 and Sphyraena were run with goals of 15 FPS and 30 Queries/Sec, respectively.

On application start-up, libadept allocates default shares for each application. As the optimizer runs, it detects the lag in performance and attempts to use multiple GPUs. x264, being a synchronous application, cannot leverage multiple GPUs; Sphyraena, however, spreads its tasks across both GPU-B and GPU-W. x264 gets around 80 shares on GPU-B, and Sphyraena gets 20 shares on GPU-B plus all of GPU-W, before the per-device shares are exhausted. Around the 5th second, the runtime notifies both applications to reduce fidelity to improve performance. x264 adapts by reducing quality and reaches up to 13 FPS. Sphyraena, on the other hand, does not adapt, and its performance is limited to 22 Queries/Sec.

EncFS cannot run at all in the first phase, since all CUSTOM shares are consumed. We started EncFS after x264 completed, which marks the second phase (from the 45th second). Sphyraena continues to run (we reduced its goal to 20 Queries/Sec) along with EncFS. These two goals can be accommodated within the capacity of the system, and the optimizer adjusts the shares so that both applications achieve them: 85 shares of GPU-B were given to EncFS, and the remainder of GPU-B plus all of GPU-W were given to Sphyraena. The Truecrack application in the BKGRND class uses the accelerators only when they are idle; otherwise it resorts to running on the CPUs.

4.6 Overhead and Accuracy

We separately measured the overhead of Proteus's mechanisms and the accuracy of its profiler.

Overheads. The primary overhead in Proteus comes from stubs, which must decide where to dispatch tasks. The overhead of stubs ranged from 1µs when choosing between fast and regular CPU cores to 2µs when selecting among different GPUs. Aggregation can reduce this cost by changing the dispatch decision less often.

Task profiler accuracy. We measured the difference between the task profiler's predicted run time and the actual task latency, including the wait time. Across all our experiments, the prediction error is between 8% and 16%. The error came from two sources. First, Proteus predicts that task size is a linear function of input data size, which is not true for all applications (e.g., Sphyraena). Second, we found that the data copy latency to the GPU varied due to contention for the PCIe bus.
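As a reference point, the following sketch shows one way to realize such a linear (affine) model with an ordinary least-squares fit; the text does not specify the fitting method, so the fit routine here is our assumption, and the linearity assumption itself is exactly what breaks for applications like Sphyraena.

  /* Predicted task time as an affine function of input size:
     t_us = a * bytes + b, fitted from (bytes, us) observations. */
  struct linear_model { double a, b; };

  static double predict_us(const struct linear_model *m, double bytes)
  {
      return m->a * bytes + m->b;
  }

  static void fit(struct linear_model *m, const double bytes[],
                  const double us[], int n)
  {
      double sx = 0, sy = 0, sxx = 0, sxy = 0;
      for (int i = 0; i < n; i++) {
          sx  += bytes[i];
          sy  += us[i];
          sxx += bytes[i] * bytes[i];
          sxy += bytes[i] * us[i];
      }
      double d = n * sxx - sx * sx;          /* zero if all inputs equal */
      m->a = d ? (n * sxy - sx * sy) / d : 0.0;
      m->b = n ? (sy - m->a * sx) / n : 0.0;
  }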


Figure 6. CUDA application performance over time: Truecrack (Words/Sec), EncFS (bandwidth, MB/Sec), Sphyraena (Queries/Sec), and x264 (Frames/Sec), broken down by processing unit (GPU-B, GPU-W, CPUs).

To understand the importance of accurate performance predictions, we built an analysis tool to observe the impact of profiler error on task placement. For applications with 10x speedup, as the error rate increased from 5% to 100%, the probability of making an incorrect decision increased from 0.5% to 5%. For the same range of error rates, the error probability varies from 2.3% to 25% for applications with 2x speedup. This shows that profiler prediction error has little impact on placement for tasks with higher speedups. A similar analysis based on task sizes showed that the impact of profiler error is low for larger tasks.

Reducing data movement. Proteus reduces the amount of data movement by limiting the frequency with which tasks move between processing units, using task aggregation and the speedup threshold. When we randomly varied the utilization between 65% and 75%, LBM's preference flipped between GPU-B and the CPUs and performance suffered. Compared to a system without the threshold and aggregation mechanisms, the speedup threshold improves performance by 5%, and task aggregation improves performance by an additional 60%. A sketch of this damping logic appears at the end of this subsection.

Ping-pong problem. We also investigated the impact of the ping-pong problem by comparing Proteus against our c-sched centralized scheduler. Because it has knowledge of all applications, c-sched does not suffer from the ping-pong problem. For the contention experiments described in § 4.2, we compared the task movements of Proteus to those of c-sched and found that both systems had similar amounts of task movement: because of varied task sizes, applications saw different processing-unit utilization and hence made different scheduling decisions. To force a ping-pong problem, we ran five copies of the same Grep workload, which offloaded tasks of the same size. Proteus's ping-pong avoidance mechanism helped stabilize task placement sooner, resulting in the same task throughput as c-sched. Without the mechanism, the system took 8-10 task offloads to stabilize, compared to 2-3 with ping-pong prevention.
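A sketch of the speedup-threshold check implied above: migrate a task stream only when the predicted gain clears a threshold, so small fluctuations in utilization cannot flip the preferred device. The 1.2 threshold value and the function name are illustrative assumptions, not Proteus's actual constants.

  #define SPEEDUP_THRESHOLD 1.2   /* require >= 20% predicted gain */

  /* Returns nonzero only if the alternative device is predicted to be
     at least SPEEDUP_THRESHOLD times faster than the current one, so
     utilization noise (e.g., the 65-75% swings above) does not cause
     ping-ponging between devices. */
  static int should_migrate(double cur_pu_time_us, double alt_pu_time_us)
  {
      return alt_pu_time_us > 0.0 &&
             cur_pu_time_us / alt_pu_time_us >= SPEEDUP_THRESHOLD;
  }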

5. Related Work

Runtimes for heterogeneous architectures. Many systems provide runtime layers to aid applications in using heterogeneous units (e.g., StarPU [21], Merge [44], Harmony [29], Lithe [54], and others). Like these runtimes, Proteus automatically selects where to run a task. Unlike them, Proteus supports multiprogrammed systems where there may be contention for processing units. In addition, Proteus exposes system-wide resource usage to applications to enable them to self-select processing units. Runtimes for databases [25, 35] help with task scheduling on heterogeneous architectures, but they assume that system resources are not shared with other applications.

Other systems abstract away the presence of heterogeneous processing units. For example, PTask [56] provides an abstraction for GPU programming that aids in scheduling and data movement. Proteus differs by exposing multiple types of processing units, not just GPUs, and by exposing usage to applications so they can self-select a place to execute; in contrast, PTask manages allocation and scheduling in the kernel.

Resource scheduling. Many past systems provide resource management for heterogeneous systems (e.g., PTask [56], Barrelfish [23], Helios [50], Pegasus [34]). In contrast to these systems, where the OS transparently handles scheduling and resource allocation, Proteus cooperatively involves the application: it exposes processing-unit utilization via the accelerator monitor to allow applications to predict where to run their code, instead of making the choice for them.

Other systems provide contention-aware scheduling (Grand Central Dispatch [7], scheduler activations [18]), where applications or the system can adjust the degree of parallelism. Proteus enables a similar outcome, but over longer time scales. Unlike scheduler activations, Proteus does not notify applications of changes in resource allocation, but instead allows applications to monitor their own utilization.

Adaptive systems. Proteus is inspired by much previous work in the area of adaptive systems. Odyssey [51] was built as an adaptation platform for mobile systems; Proteus employs a similar architecture for shared heterogeneous systems and uses alternate processing units as a mode of adaptation. Application Heartbeats [37] tunes applications according to available resources, but it does not handle global resource management; we consider the heartbeats work orthogonal to Proteus. Similar adaptive techniques have been employed in databases [40], where better interfaces between the OS and the database are designed for better performance.

6. Conclusion


Heterogeneity in various forms will be prevalent in future architectures, and maturing programming models will make it easier to program accelerators. Proteus exposes the available heterogeneity with the OpenKernel, but uses libadept to conceal it from application programmers. The kernel retains control over resource management but gives application-level code the flexibility to execute wherever is best for the application. The decentralized design employed by Proteus performs well and at the same time does not require extensive changes to the kernel. As future work, we plan to explore how to better handle power-limited heterogeneous systems.

References

[1] Amazon EC2. https://aws.amazon.com/ec2/.

[2] AMD Kaveri. http://www.amd.com/us/products/desktop/processors/a-series/Pages/nextgenapu.aspx.

[3] ARM big.LITTLE Processing. http://www.arm.com/products/processors/technologies/biglittleprocessing.php.

[4] C++ AMP: Language and Programming Model. http://download.microsoft.com/download/4/0/E/40EA02D8-23A7-4BD2-AD3A-0BFFFB640F28/CppAMPLanguageAndProgrammingModel.pdf.

[5] CFS Scheduler. https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt.

[6] Frame Rate Target Control. http://www.amd.com/en-us/innovations/software-technologies/technologies-gaming/frtc.

[7] Grand Central Dispatch. http://developer.apple.com/library/ios/#documentation/Performance/Reference/GCD_libdispatch_Ref/Reference/reference.html.

[8] HSA Intermediate Language. https://hsafoundation.app.box.com/s/m6mrsjv8b7r50kqeyyal.

[9] Intel Sandy Bridge. http://software.intel.com/en-us/blogs/2011/01/13/a-look-at-sandy-bridge-integrating-graphics-into-the-cpu.

[10] NVIDIA OpenCL SDK. http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html.

[11] OpenCL - The open standard for parallel programming of heterogeneous systems. https://www.khronos.org/opencl/.

[12] Parallel Implementation of bzip2. http://compression.ca/pbzip2/.

[13] Task Scheduling Policy. http://starpu.gforge.inria.fr/doc/html/Scheduling.html.

[14] The OpenACC Application Program Interface. http://www.openacc-standard.org/.

[15] Truecrack. https://code.google.com/p/truecrack/.

[16] Advanced Encryption Standard (AES). http://www.csrc.nist.gov/publications/fips/fips197/fips-197.pdf, 2001.

[17] Qualcomm MARE: Enabling Applications for Heterogeneous Mobile Devices. https://developer.qualcomm.com/downloads/whitepaper-qualcomm-mare-enabling-applications-heterogeneous-mobile-devices, 2014.

[18] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy. Scheduler activations: Effective kernel support for the user-level management of parallelism. ACM TOCS, 10(1):53-79, Feb. 1992.

[19] A. Frumusanu. The Samsung Exynos 7420 Deep Dive - Inside a Modern 14nm SoC. http://www.anandtech.com/show/9330/exynos-7420-deep-dive/2.

[20] M. Annavaram, E. Grochowski, and J. Shen. Mitigating Amdahl's law through EPI throttling. In Proc. 32nd ISCA, pages 298-309, June 2005.

[21] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. In Proc. 15th Euro-Par, pages 863-874, Aug. 2009.

[22] P. Bakkum and K. Skadron. Accelerating SQL database operations on a GPU with CUDA. In Proc. 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU '10), pages 94-103, 2010.

[23] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: A new OS architecture for scalable multicore systems. In Proc. 22nd SOSP, Oct. 2009.

[24] C. Bienia and K. Li. PARSEC 2.0: A new benchmark suite for chip-multiprocessors. In Proc. 5th Workshop on Modeling, Benchmarking and Simulation, June 2009.

[25] S. Breß and G. Saake. Why it is time for a HyPE: A hybrid query processing engine for efficient GPU coprocessing in DBMS. Proc. VLDB Endow., 6(12).

[26] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Proc. IISWC, pages 44-54, 2009.

[27] L. Dagum and R. Menon. OpenMP: An industry standard API for shared-memory programming. IEEE Computational Science and Engineering, 5(1), January-March 1998.

[28] M. Dahlin. Interpreting stale load information. IEEE Transactions on Parallel and Distributed Systems, 11, Oct. 2001.

[29] G. F. Diamos and S. Yalamanchili. Harmony: An execution model and runtime for heterogeneous many core systems. In Proc. 17th HPDC, pages 197-200, 2008.

[30] D. R. Engler, M. F. Kaashoek, and J. O'Toole, Jr. Exokernel: An operating system architecture for application-level resource management. In Proc. 15th SOSP, pages 251-266, 1995.

[31] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger. Dark silicon and the end of multicore scaling. In Proc. ISCA, June 2011.

[32] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development, 54(1):3:1-3:11, January-February 2010.

[33] D. Grewe and M. F. P. O'Boyle. A static task partitioning approach for heterogeneous systems using OpenCL. In Proc. 20th CC/ETAPS, pages 286-305, 2011.

[34] V. Gupta, K. Schwan, N. Tolia, V. Talwar, and P. Ranganathan. Pegasus: Coordinated scheduling for virtualized accelerator-based systems. In Proc. USENIX ATC, 2011.

[35] M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-oblivious parallelism for in-memory column-stores. Proc. VLDB Endow.

[36] M. D. Hill and M. R. Marty. Amdahl's law in the multicore era. IEEE Computer, pages 33-38, July 2008.

[37] H. Hoffmann, S. Sidiroglou, M. Carbin, S. Misailovic, A. Agarwal, and M. Rinard. Dynamic knobs for responsive power-aware computing. In Proc. 16th ASPLOS, pages 199-212, 2011.

[38] Intel Corp. Intel Turbo Boost Technology 2.0. http://www.intel.com/content/www/us/en/architecture-and-technology/turbo-boost/turbo-boost-technology.html.

[39] Intel Corp. Thermal protection and monitoring features: A software perspective. http://www.intel.com/cd/ids/developer/asmo-na/eng/downloads/54118.htm, 2005.

[40] J. Giceva, T.-I. Salomie, A. Schüpbach, G. Alonso, and T. Roscoe. COD: Database / operating system co-design. In Proc. CIDR, 2013.

[41] S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, and D. Brooks. Profiling a warehouse-scale computer. In Proc. 42nd ISCA, pages 158-169, 2015.

[42] A. Khanna and J. Zinky. The revised ARPANET routing metric. In Proc. ACM SIGCOMM, 1989.

[43] K. Kofler, I. Grasso, B. Cosenza, and T. Fahringer. An automatic input-sensitive approach for heterogeneous task partitioning. In Proc. 27th ICS, 2013.

[44] M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng. Merge: A programming model for heterogeneous multi-core systems. In Proc. 13th ASPLOS, pages 287-296, 2008.

[45] C.-K. Luk, S. Hong, and H. Kim. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In Proc. 42nd MICRO, 2009.

[46] E. Marth and G. Marcus. Parallelization of the x264 encoder using OpenCL. http://li5.ziti.uni-heidelberg.de/x264gpu/.

[47] K. Menychtas, K. Shen, and M. L. Scott. Disengaged scheduling for fair, protected access to fast computational accelerators. In Proc. 19th ASPLOS, 2014.

[48] N. Mishra, H. Zhang, J. D. Lafferty, and H. Hoffmann. A probabilistic graphical model-based approach for minimizing energy under performance constraints. In Proc. 20th ASPLOS, pages 267-281, 2015.

[49] T. P. Morgan. Oracle cranks up the cores to 32 with Sparc M7 chip. http://www.enterprisetech.com/2014/08/13/oracle-cranks-cores-32-sparc-m7-chip/, Aug. 2014. EnterpriseTech.

[50] E. B. Nightingale, O. Hodson, R. McIlroy, C. Hawblitzel, and G. Hunt. Helios: Heterogeneous multiprocessing with satellite kernels. In Proc. 22nd SOSP, Oct. 2009.

[51] B. D. Noble, M. Satyanarayanan, D. Narayanan, J. E. Tilton, J. Flinn, and K. R. Walker. Agile application-aware adaptation for mobility. In Proc. 16th SOSP, 1997.

[52] NVIDIA, Inc. CUDA Toolkit 4.1. http://www.developer.nvidia.com/cuda-toolkit-41, 2011.

[53] A. Odlyzko. Paris metro pricing for the Internet. In Proc. 1st ACM Conference on Electronic Commerce (EC '99), pages 140-147, 1999.

[54] H. Pan, B. Hindman, and K. Asanovic. Composing parallel software efficiently with Lithe. In Proc. PLDI, pages 376-387, 2010.

[55] S. Peter, A. Schüpbach, P. Barham, A. Baumann, R. Isaacs, T. Harris, and T. Roscoe. Design principles for end-to-end multicore schedulers. In Proc. 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar '10), 2010.

[56] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel. PTask: Operating system abstractions to manage GPUs as compute devices. In Proc. 23rd SOSP, pages 233-248, 2011.

[57] J. C. Saez, M. Prieto, A. Fedorova, and S. Blagodurov. A comprehensive scheduler for asymmetric multicore systems. In Proc. EuroSys, 2010.

[58] M. Shoaib Bin Altaf and D. Wood. LogCA: A performance model for hardware accelerators. IEEE Computer Architecture Letters, PP(99):1-1, 2014.

[59] M. Silberstein, B. Ford, I. Keidar, and E. Witchel. GPUfs: Integrating a file system with GPUs. In Proc. 18th ASPLOS, pages 485-498, 2013.

[60] J. Stratton, C. Rodrigues, I. Sung, N. Obeid, L. Chang, N. Anssari, G. Liu, and W. Hwu. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing, 2012.

[61] N. Sun and C.-C. Lin. Using the cryptographic accelerators in the UltraSPARC T1 and T2 processors. http://www.oracle.com/technetwork/server-storage/archive/a11-014-crypto-accelerators-439765.pdf, Nov. 2007.


[62] M. B. Taylor. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Proc. DAC, 2012.

[63] C. A. Waldspurger and W. E. Weihl. Lottery scheduling: Flexible proportional-share resource management. In Proc. 1st OSDI, pages 1-11, Nov. 1994.

[64] C. A. Waldspurger and W. E. Weihl. An object-oriented framework for modular resource management. In Proc. 5th International Workshop on Object Orientation in Operating Systems (IWOOOS '96), 1996.
