
TimeGraph: GPU Scheduling for Real-Time Multi-Tasking Environments

Shinpei Kato†‡, Karthik Lakshmanan†, Ragunathan (Raj) Rajkumar†, and Yutaka Ishikawa‡

†Department of Electrical and Computer Engineering, Carnegie Mellon University
‡Department of Computer Science, The University of Tokyo

Abstract

The Graphics Processing Unit (GPU) is now commonly used for graphics and data-parallel computing. As more and more applications are accelerated on the GPU in multi-tasking environments where multiple tasks access the GPU concurrently, operating systems must provide prioritization and isolation capabilities in GPU resource management, particularly in real-time setups.

We present TimeGraph, a real-time GPU scheduler at the device-driver level for protecting important GPU workloads from performance interference. TimeGraph adopts a new event-driven model that synchronizes the GPU with the CPU to monitor GPU commands issued from user space and to control GPU resource usage in a responsive manner. TimeGraph supports two priority-based scheduling policies to address the trade-off between response times and throughput introduced by the asynchronous and non-preemptive nature of GPU processing. Resource reservation mechanisms are also employed to account for and enforce GPU resource usage, preventing misbehaving tasks from exhausting GPU resources. Prediction of GPU command execution costs is further provided to enhance isolation.

Our experiments using OpenGL graphics benchmarks demonstrate that TimeGraph maintains the frame-rates of primary GPU tasks at the desired level even in the face of extreme GPU workloads, whereas these tasks become nearly unresponsive without TimeGraph support. Our findings also include that the performance overhead imposed by TimeGraph can be limited to 4-10%, and that its event-driven scheduler improves throughput by about 30 times over the existing tick-driven scheduler.

1 Introduction

The Graphics Processing Unit (GPU) is the burgeoning platform for high-performance graphics and data-parallel computing: its peak performance now exceeds 1000 GFLOPS, roughly 10 times that of traditional microprocessors. User-end windowing systems, for instance, use GPUs to present a livelier interface that significantly improves the user experience through 3-D windows, high-quality graphics, and smooth transitions. Recent 3-D browser and desktop applications in particular, such as SpaceTime, Web3D, 3D-Desktop, Compiz Fusion, BumpTop, Cooliris, and Windows Aero, are all intriguing possibilities for future user interfaces. GPUs are also leveraged in various domains of general-purpose GPU (GPGPU) processing to facilitate data-parallel compute-intensive applications.

Real-time multi-tasking support is a key requirement for such emerging GPU applications. For example, users could launch multiple GPU applications concurrently on their desktop computers, including games, video players, web browsers, and live messengers, all sharing the same GPU. In such a case, quality-aware soft real-time applications like games and video players should be prioritized over live messengers and any other applications accessing the GPU in the background. Other examples include GPGPU-based cloud computing services, such as Amazon EC2, where virtual machines sharing GPUs must be prioritized and isolated from each other. More generally, important applications must be well isolated from others for quality and security reasons on GPUs: user-space programs can create arbitrary sets of GPU commands and access the GPU directly through generic I/O system calls, meaning that malicious and buggy programs can easily cause the GPU to be overloaded. Thus, GPU resource management that consolidates prioritization and isolation capabilities plays a vital role in real-time multi-tasking environments.

GPU resource management is usually supported at the operating-system level, while GPU program code itself, including GPU commands, is generated through libraries, compilers, and runtime frameworks. In particular, it is the device driver that transfers GPU commands from the CPU to the GPU, regardless of whether they produce graphics or GPGPU workloads. Hence, a robust GPU device driver is of significant impact for many GPU applications. Unfortunately, existing GPU device drivers [1, 5, 7, 19, 25] are not tailored to support real-time multi-tasking environments: they either accelerate one particular high-performance application in the system or provide fairness among applications.

[Figure 1: Decrease in performance of the OpenArena application competing with different GPU applications. The bar chart plots the frame-rate of OpenArena, relative to standalone execution, when run with the Engine and Clearspd workloads under the NVIDIA and Nouveau drivers on GeForce 9800 GT and GeForce GTX 285 cards.]

We have conducted a preliminary evaluation to examine the performance of existing GPU drivers, (i) the NVIDIA proprietary driver [19] and (ii) the Nouveau open-source driver [7], in multi-tasking environments, using two different NVIDIA graphics cards, (i) GeForce 9800 GT and (ii) GeForce GTX 285, with the Linux 2.6.35 kernel as the underlying operating system. It should be noted that the NVIDIA driver evaluated on Linux is also expected to closely match the performance of the Windows driver (WDDM [25]), as they share about 90% of their code [23]. Figure 1 shows the relative decrease in performance (frame-rate) of an OpenGL game (OpenArena) competing with two GPU-accelerated programs (Engine and Clearspd [15]), respectively. The Engine program represents a regularly-behaved GPU workload, while the Clearspd program produces a GPU command bomb causing the GPU to be overloaded, which represents a malicious or buggy program. To achieve the best possible performance, this preliminary evaluation assigns the highest CPU (nice) priority to the OpenArena application as an important application. As observed in Figure 1, the performance of the important OpenArena application drops significantly due to the competing GPU applications. This highlights the fact that GPU resource management in the current state of the art is woefully inadequate, lacking prioritization and isolation capabilities for multiple GPU applications.

Contributions: We propose, design, and implement TimeGraph, a GPU scheduler that provides prioritization and isolation capabilities for GPU applications in soft real-time multi-tasking environments. We address a core challenge for GPU resource management posed by the asynchronous and non-preemptive nature of GPU processing. Specifically, TimeGraph adopts an event-driven scheduler model that synchronizes the GPU with the CPU in a responsive manner, using GPU-to-CPU interrupts, to schedule non-preemptive GPU commands for the asynchronously-operating GPU. Under this event-driven model, TimeGraph supports two scheduling policies to prioritize tasks on the GPU, which address the trade-off between response times and throughput. TimeGraph also employs two resource reservation policies to isolate tasks on the GPU, which provide different levels of quality of service (QoS) at the expense of different levels of overhead. To the best of our knowledge, this is the first work that enables GPU applications to be prioritized and isolated in real-time multi-tasking environments.

Organization: The rest of this paper is organized as follows. Section 2 introduces our system model, including the scope and limitations of TimeGraph. Section 3 provides the system architecture of TimeGraph. Section 4 and Section 5 describe the design and implementation of the TimeGraph GPU scheduling and reservation mechanisms, respectively. In Section 6, the performance of TimeGraph is evaluated, and its capabilities are demonstrated. Related work is discussed in Section 7. Our concluding remarks are provided in Section 8.

2 System Model

Scope and Limitations: We assume a system composed of a generic multi-core CPU and an on-board GPU. We do not manipulate any GPU-internal units, and hence GPU commands are not preempted once they are submitted to the GPU. TimeGraph is independent of libraries, compilers, and runtime engines. The principles of TimeGraph are therefore applicable to different GPU architectures (e.g., NVIDIA Fermi/Tesla and ATI Stream) and programming frameworks (e.g., OpenGL, OpenCL, CUDA, and HMPP). Currently, TimeGraph is designed and implemented for Nouveau [7], available in the Gallium3D [15] OpenGL software stack, which is also planned to support OpenCL. Moreover, TimeGraph has been ported to the PSCNV open-source driver [22] packaged in the PathScale ENZO suite [21], which supports CUDA and HMPP. This paper is, however, focused on OpenGL workloads, given the currently-available set of open-source solutions: Nouveau and Gallium3D.

Driver Model: TimeGraph is part of the device driver, which is the interface through which user-space programs submit GPU commands to the GPU. We assume that the device driver is designed based on the Direct Rendering Infrastructure (DRI) [14] model that is adopted in most UNIX-like operating systems as part of the X Window System. Under the DRI model, user-space programs are allowed to access the GPU directly to render frames without using windowing protocols, while they still use the windowing server to blit the rendered frames to the screen. GPGPU frameworks require no such windowing procedures, and hence their model is simpler.

In order to submit GPU commands to the GPU, user-space programs must be allocated GPU channels, which conceptually represent separate address spaces on the GPU. For instance, the NVIDIA Fermi and Tesla architectures support 128 channels. Our GPU command submission model for each channel is shown in Figure 2.

[Figure 2: GPU command submission model. The diagram shows the User Push Buffer holding GPU command groups (header/data pairs) and the Indirect Buffer, a ring buffer in the Kernel Push Buffer whose 64-bit packets (40-bit offset and 24-bit size) locate those command groups; the ring is bounded by GET and PUT pointers read by the GPU command dispatch unit.]

Each channel uses two types of kernel-space buffers: the User Push Buffer and the Kernel Push Buffer. The User Push Buffer is mapped onto the address space of the corresponding task, where GPU commands are pushed from the user space. GPU commands are usually grouped as non-preemptive regions to match user-space atomicity assumptions. The Kernel Push Buffer, meanwhile, is used for kernel primitives, such as host-device synchronization, GPU initialization, and GPU mode setting.

While user-space programs push GPU commands into the User Push Buffer, they also write packets, each of which is a (size and address) tuple locating a certain GPU command group, into a specific ring buffer part of the Kernel Push Buffer, called the Indirect Buffer. The driver configures the command dispatch unit on the GPU to read this buffer for command submission. The ring buffer is controlled by GET and PUT pointers, which start from the same place. Every time packets are written to the buffer, the driver moves the PUT pointer to the tail of the packets, and sends a signal to the GPU command dispatch unit to download the GPU command groups located by the packets between the GET and PUT pointers. The GET pointer is then automatically updated to the same place as the PUT pointer. Once these GPU command groups are submitted to the GPU, the driver does not manage them any longer, and continues to submit the next set of GPU command groups, if any. Thus, the Indirect Buffer plays the role of a command queue.
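The Indirect Buffer mechanics above can be sketched in a few lines of C. This is an illustrative model, not Nouveau driver code: the structure names, the ring capacity, and the ring-full policy are assumptions; only the 64-bit packet layout (40-bit offset, 24-bit size) and the GET/PUT discipline come from the text.

```c
#include <stdint.h>

/* Hypothetical sketch of the Indirect Buffer: each ring entry is a
 * 64-bit packet locating one GPU command group in the User Push Buffer. */
#define IB_ENTRIES 64u              /* ring capacity: an assumption */

struct ib_ring {
    uint64_t entry[IB_ENTRIES];
    uint32_t get;                   /* advanced by the GPU after reading  */
    uint32_t put;                   /* advanced by the driver on writes   */
};

/* Encode one packet: low 40 bits = offset, high 24 bits = size. */
static uint64_t ib_pack(uint64_t offset, uint32_t size)
{
    return (offset & ((1ULL << 40) - 1)) | ((uint64_t)size << 40);
}

/* Queue one command group; returns 0 on success, -1 if the ring is full.
 * A real driver would then signal the dispatch unit, which consumes the
 * packets between GET and PUT and advances GET up to PUT. */
static int ib_push(struct ib_ring *ib, uint64_t offset, uint32_t size)
{
    uint32_t next = (ib->put + 1) % IB_ENTRIES;
    if (next == ib->get)            /* full: GPU has not caught up yet */
        return -1;
    ib->entry[ib->put] = ib_pack(offset, size);
    ib->put = next;
    return 0;
}
```

Keeping one slot unused between PUT and GET (the `next == get` test) is a common way to distinguish a full ring from an empty one without a separate counter.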

Each GPU command group may include multiple GPU commands. Each GPU command is composed of a header and data. The header contains methods and the data size, while the data contains the values being passed to the methods. Methods represent GPU instructions, some of which are shared between compute and graphics, while others are specific to each. We assume that the device driver does not preempt on-the-fly GPU command groups once they are offloaded onto the GPU. GPU command execution is out-of-order within the same GPU channel. The GPU channels are switched automatically by the GPU engines.

[Figure 3: TimeGraph system architecture. TimeGraph resides in the device driver, hooking the PushBuf interface (queuing) and the IRQ handler (notify), and comprises the GPU command scheduler, the GPU reserve manager (isolation management), and the GPU command profiler (execution cost prediction). User-space components, such as the X server, 2D driver, and the OpenGL/OpenCL/CUDA/HMPP libraries and runtimes, are unmodified.]

Our driver model described above is based on the Direct Rendering Manager (DRM) [6], and especially targets the NVIDIA Fermi and Tesla architectures, but it can also be used for other architectures with minor modifications.

3 TimeGraph System Architecture

The architecture of TimeGraph and its interaction with the rest of the software stack is illustrated in Figure 3. No modification is required for user-space programs, and GPU command groups can be generated through existing software frameworks. However, TimeGraph needs to communicate with a specific interface, called PushBuf, in the device-driver space. The PushBuf interface enables the user space to submit GPU command groups stored in the User Push Buffer. TimeGraph uses this PushBuf interface to queue GPU command groups. It also uses the IRQ handler prepared for GPU-to-CPU interrupts to dispatch the next available GPU command groups.

TimeGraph is composed of the GPU command scheduler, the GPU reserve manager, and the GPU command profiler. The GPU command scheduler queues and dispatches GPU command groups based on task priorities. It also coordinates with the GPU reserve manager to account for and enforce the GPU execution times of tasks. The GPU command profiler supports prediction of GPU command execution costs to avoid reservation overruns. Two scheduling policies are supported to address the trade-off between response times and throughput:

• Predictable-Response-Time (PRT): This policy minimizes priority inversion on the GPU to provide predictable response times based on priorities.

• High-Throughput (HT): This policy increases total throughput, allowing additional priority inversion.

[Figure 4: Diagram of the PushBuf interface and the IRQ handler with the TimeGraph scheme. The PushBuf path: get buffer object, scheduling-policy decision (run, or sleep on wait), lock mutex, activate buffer, reservation check (if requested, reservation-policy decision; on wait, deactivate buffer, unlock mutex, stop accounting, and sleep), submit command group, start accounting, set up interrupt, deactivate buffer, unlock mutex. The IRQ path: enter IRQ handler, and if the GPU is idle, wake up the next task.]

It also supports two GPU reservation policies that address the trade-off between isolation and throughput:

• Posterior Enforcement (PE): This policy enforces GPU resource usage after GPU command groups are completed, without sacrificing throughput.

• Apriori Enforcement (AE): This policy enforces GPU resource usage before GPU command groups are submitted, using prediction of GPU execution costs, at the expense of additional overhead.

In order to unify multiple tasks into a single reserve, the TimeGraph reservation mechanism provides the Shared reservation mode. In particular, TimeGraph creates a special Shared reserve instance with the PE policy when loaded, called Background, which serves all GPU-accelerated tasks that do not belong to any specific reserves. The detailed design and implementation of GPU scheduling and GPU reservation will be described in Section 4 and Section 5, respectively.

Figure 4 shows a high-level diagram of the PushBuf interface and the IRQ handler, where the modifications introduced by TimeGraph are highlighted by bold frames. This diagram is based on the Nouveau implementation, but most GPU drivers should have similar control flows.

The PushBuf interface first acquires the buffer object associated with the incoming GPU command group. It then applies the scheduling policy to determine whether this GPU command group can execute on the GPU. If it should not be dispatched immediately, the corresponding task goes to sleep. Otherwise, the User Push Buffer object is activated for command submission, holding the mutex lock to ensure that the GPU command group is located in a place accessible from the GPU, though the necessity of this procedure depends on the driver implementation. TimeGraph next checks whether GPU reservation is requested for this task. If so, it applies the reservation policy to verify the GPU resource usage of this task. If the task overruns, TimeGraph unwinds the buffer activation, and suspends the task until its resource budget becomes available. The task will be rescheduled when it is woken up, since some higher-priority tasks may arrive by then. Finally, if the GPU command group is qualified by the scheduling and reservation policies, it is submitted to the GPU. As the reservation policies need to track GPU resource usage, TimeGraph starts accounting for the GPU execution time of this task. It then configures the GPU command group to generate an interrupt to the CPU upon completion, so that TimeGraph can dispatch the next GPU command group. After deactivating the buffer and unlocking the mutex, the PushBuf interface returns.

The IRQ handler receives an interrupt notifying the completion of the current GPU command group, whereupon TimeGraph stops accounting for the GPU execution time and, if the GPU is idle, wakes up the next task to execute on the GPU based on the scheduling policy.
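The start/stop accounting pair described above amounts to timestamping each command group at submission and charging the elapsed time against the task's budget at the completion interrupt. A minimal sketch, with structure and function names that are assumptions rather than driver code:

```c
/* Illustrative per-task GPU-time accounting: a budget in microseconds
 * is debited by the time between submission and the completion IRQ. */
struct gpu_account {
    long budget_us;   /* remaining reserve budget; negative => overrun */
    long start_us;    /* timestamp taken when the group was submitted  */
};

/* Called on the PushBuf path, just before command submission. */
static void account_start(struct gpu_account *a, long now_us)
{
    a->start_us = now_us;
}

/* Called from the IRQ handler on the completion interrupt. */
static void account_stop(struct gpu_account *a, long now_us)
{
    a->budget_us -= now_us - a->start_us;
}
```

A reservation policy such as PE would then inspect `budget_us` after the fact: a negative value means the task overran its reserve and must be suspended until the budget is replenished.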

Specification: System designers may use a specification primitive to activate the TimeGraph functionality, which is inspired by the Redline system [31]. For each application, system designers can specify the scheduling parameters as <name:sched:resv:prio:C:T>, where name is the application name, sched is its scheduling policy, resv is its reservation policy, prio is its priority, and the pair of C and T indicates that the application task is allowed to execute on the GPU for C microseconds every T microseconds. The specification is a text file (/etc/timegraph.spec), and TimeGraph reads it every time a new GPU channel is allocated to a task. If there is a matching entry based on the application name associated with the task, the specification is applied to the task. Otherwise, the task is assigned the lowest GPU priority and the Background reserve.
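For concreteness, a specification file following the format above might look as follows. The application names, priorities, and budgets here are purely illustrative and not taken from the paper:

```
# /etc/timegraph.spec (hypothetical entries)
# <name:sched:resv:prio:C:T> -- C microseconds of GPU time every T microseconds
<mplayer:PRT:PE:2:10000:40000>
<openarena:HT:AE:*:25000:50000>
```

The "*" in the second entry's prio field would request automatic GPU priority assignment (AGPA), described next.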

Priority Assignment: While system designers may assign static GPU priorities in their specification, TimeGraph also supports automatic GPU priority assignment (AGPA), which is enabled by using a wild-card "*" entry in the prio field. TimeGraph provides a user-space daemon that executes periodically to identify the task with the foreground window through a window programming interface, such as the _NET_ACTIVE_WINDOW and _NET_WM_PID properties in the X Window System. TimeGraph receives the foreground-task information via a system call, and assigns the highest priority to this task among those running under the AGPA mechanism. These tasks execute at the default static GPU priority level; hence, different tasks can be prioritized over them by assigning higher static GPU priorities. AGPA is, however, not available if the above window programming interface is not supported. TimeGraph instead provides another user-space tool for system designers to assign priorities. For instance, designers can provide an optimal priority assignment based on reserve periods [13], as widely adopted in real-time systems.

Admission Control: In order to achieve predictable service in overloaded situations, TimeGraph provides an admission control scheme that forces a new reserve to be a background reserve, so that currently active reserves continue to execute in a predictable manner. TimeGraph provides a simple interface where designers specify the limit on total GPU resource usage, from 0 to 100%, in a text file (/etc/timegraph.ac). The usage is computed by a traditional resource-reservation model based on the C and T of each reserve [26].
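Under the classical resource-reservation model cited above, each reserve (C, T) contributes utilization C/T, and a new reserve is admitted only if the total stays within the configured limit. A sketch of that test, with names that are assumptions:

```c
#include <stdbool.h>

/* One GPU reserve: C microseconds of GPU time every T microseconds. */
struct reserve { long c_us, t_us; };

/* Utilization-based admission test: admit the candidate reserve only if
 * the summed utilization of all active reserves plus the candidate stays
 * within `limit` (a fraction, e.g. 0.8 for an 80% cap). A rejected
 * reserve would be demoted to the Background reserve instead. */
static bool admit(const struct reserve *active, int n,
                  struct reserve cand, double limit)
{
    double u = (double)cand.c_us / (double)cand.t_us;
    for (int i = 0; i < n; i++)
        u += (double)active[i].c_us / (double)active[i].t_us;
    return u <= limit;
}
```

For example, with one active reserve of 10 ms per 40 ms (25%) and an 80% cap, a candidate asking for 20 ms per 40 ms (50%) fits, while one asking for 30 ms per 40 ms (75%) does not.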

4 TimeGraph GPU Scheduling

The goal of the GPU command scheduler is to queue and dispatch non-preemptive GPU command groups in accordance with task priorities. To this end, TimeGraph contains a wait queue to stall tasks. It also manages a GPU-online list, a list of pointers to the GPU command groups currently executing on the GPU.

The GPU-online list is used to check whether there are currently-executing GPU command groups when a GPU command group enters the PushBuf interface. If the list is empty, the corresponding task is inserted into it, and the GPU command group is submitted to the GPU. Otherwise, the task is inserted into the wait queue to be scheduled. The scheduling policies supported by TimeGraph will be presented in Section 4.1.

Management of the GPU-online list requires information about when GPU command groups complete. TimeGraph adopts an event-driven model that uses GPU-to-CPU interrupts to notify the completion of each GPU command group, rather than the tick-driven model adopted in previous work [1, 5]. Upon every interrupt, the corresponding GPU command group is removed from the GPU-online list. Our GPU-to-CPU interrupt setting and handling mechanisms will be described in Section 4.2.

4.1 Scheduling Policies

TimeGraph supports two GPU scheduling policies. The Predictable-Response-Time (PRT) policy is intended for tasks that should behave on a timely basis without affecting important tasks. This policy is predictable in the sense that GPU command groups are scheduled based on task priorities to make high-priority tasks responsive on the GPU. The High-Throughput (HT) policy, on the other hand, is suitable for tasks that should execute as fast as possible. There is a trade-off: the PRT policy protects tasks from interference at the expense of throughput, while the HT policy achieves high throughput for one task but may block others. For instance, desktop-widget, browser-plugin, and video-player tasks would use the PRT policy, while 3-D game and interactive 3-D interfacing tasks can use the HT policy.

[Figure 5: Example of GPU scheduling in TimeGraph: (a) PRT scheduling and (b) HT scheduling. Each timeline shows task arrivals and command submissions from high-, medium-, and low-priority tasks on the CPU, the resulting executions on the GPU, GPU-to-CPU interrupts, and the wait queue.]

PRT Scheduling: The PRT policy forces any GPU command group to wait for the completion of the preceding GPU command group, if any. Specifically, a new GPU command group arriving at the device driver can be submitted to the GPU immediately if the GPU-online list is empty. Otherwise, the corresponding task must sleep in the wait queue. The highest-priority task in the wait queue, if any, is woken up upon every interrupt from the GPU.

Figure 5 (a) indicates how three tasks with different priorities, high-priority (HP), medium-priority (MP), and low-priority (LP), are scheduled on the GPU under the PRT policy. When the MP task arrives, its GPU command group can execute on the GPU, since no GPU command groups are executing. Given that the GPU and CPU operate asynchronously, the MP task can arrive again while its previous GPU command group is executing. This time, however, the MP task is queued, because the GPU is not idle according to the PRT policy. Even the next HP task is queued for the same reason, since further higher-priority tasks may arrive soon. The specific set of GPU commands appended at the end of every GPU command group by TimeGraph generates an interrupt to the CPU, and the TimeGraph scheduler is invoked accordingly to wake up the highest-priority task in the wait queue. Hence, the HP task is next chosen to execute on the GPU rather than the MP task. In this manner, the next instance of the LP task and the second instance of the HP task are scheduled in accordance with their priorities.

Given that the arrival times of GPU command groups are not known a priori, and each GPU command group is non-preemptive, we believe that the PRT policy is the best possible approach to provide predictable response times. However, it inevitably incurs the overhead of making a scheduling decision at every GPU command group boundary, as shown in Figure 5 (a).

HT Scheduling: The HT policy reduces this scheduling overhead by compromising predictable response times slightly. It allows GPU command groups to be submitted to the GPU immediately if (i) the currently-executing GPU command group was submitted by the same task, and (ii) no higher-priority tasks are ready in the wait queue. Otherwise, they must suspend in the same manner as under the PRT policy. Upon an interrupt, the highest-priority task in the wait queue is woken up only when the GPU-online list is empty (i.e., the GPU is idle).

Figure 5 (b) depicts how the same set of GPU command groups used in Figure 5 (a) is scheduled under the HT policy. Unlike under the PRT policy, the second instance of the MP task can submit its GPU command group immediately, because the currently-executing GPU command group was issued by the same task. These two GPU command groups of the MP task can execute successively without producing idle time. The same is true for the two GPU command groups of the HP task. Thus, the HT policy favors throughput-oriented tasks, but the HP task is blocked by the MP task for a longer interval. This is a trade-off, and if priority inversion is critical, the PRT policy is more appropriate.
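The submission rules of the two policies reduce to a small decision function evaluated on the PushBuf path. The sketch below is a model of that logic only, with structure names and the priority encoding (larger = higher) as assumptions:

```c
#include <stdbool.h>

enum gpu_policy { POLICY_PRT, POLICY_HT };

/* Minimal scheduler state visible on the submission path. */
struct gpu_state {
    bool online_list_empty;  /* no command group currently on the GPU   */
    int  current_owner;      /* task id owning the on-GPU command group */
    int  top_waiting_prio;   /* highest priority in the wait queue,
                                or -1 if the wait queue is empty        */
};

/* May `task` (with priority `prio`) submit now, or must it sleep in the
 * wait queue until the completion interrupt wakes it? */
static bool can_submit(enum gpu_policy p, const struct gpu_state *g,
                       int task, int prio)
{
    if (g->online_list_empty)
        return true;            /* GPU idle: submit immediately         */
    if (p == POLICY_PRT)
        return false;           /* PRT: always wait for the interrupt   */
    /* HT: run ahead only behind this task's own work, and only if no
     * higher-priority task is already waiting. */
    return g->current_owner == task && g->top_waiting_prio < prio;
}
```

This makes the trade-off explicit: under PRT every non-idle submission sleeps (a scheduling point per command group boundary), while under HT a task can chain its own command groups back-to-back and avoid the GPU idle gap between them.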

4.2 Interrupt Setting and Handling

In order to provide an event-driven model, TimeGraph configures the GPU to generate an interrupt to the CPU upon the completion of each GPU command group. A scheduling point thus occurs at every GPU command group boundary. We now describe how the interrupt is generated. For simplicity of description, we focus here on the NVIDIA GPU architecture.

Completion Notifier: The NVIDIA GPU provides the NOTIFY command to generate an interrupt from the GPU to the CPU. TimeGraph puts this command at the end of each GPU command group. However, the interrupt is not launched immediately when the NOTIFY command is operated, but when the next command is dispatched. TimeGraph therefore adds the NOP command after the NOTIFY command as a dummy command. We also need to consider that GPU commands execute out of order on the GPU. If the NOTIFY command is operated before all commands in the original GPU command group are operated, the generated interrupt is not timely at all. TimeGraph hence adds the SERIALIZE command right before the NOTIFY command, which forces the GPU to stall until all on-the-fly commands complete. There is no need to add another SERIALIZE command after the NOTIFY command, since we know that no tasks other than the current task can use the GPU until TimeGraph is invoked upon the interrupt.

Interrupt Association: All interrupts from the GPU caught in the IRQ handler are relayed to TimeGraph. When TimeGraph receives an interrupt, it first references the head of the GPU-online list to obtain the task information associated with the corresponding GPU command group. TimeGraph next needs to verify whether this interrupt was truly generated by the commands that TimeGraph inserted at the end of the GPU command group, given that user-space programs may also use the NOTIFY command. In order to recognize the right interrupt, TimeGraph further adds the SET_REF command before the SERIALIZE command, which instructs the GPU to write a specified sequence number to a particular GPU register. This number is maintained per task, and is simply incremented by TimeGraph. TimeGraph reads this GPU register when an interrupt is received. If the register value is less than the expected sequence number associated with the corresponding GPU command group, this interrupt should be ignored, since it must have been caused by some earlier command, before the SET_REF command was operated. Another SERIALIZE command also needs to be added before the SET_REF command to ensure in-order command execution. As a consequence, TimeGraph inserts the following commands at the end of each GPU command group: SERIALIZE, SET_REF, SERIALIZE, NOTIFY, NOP.
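The five-command tail described above can be sketched as a small helper. The mnemonic strings stand in for the real NVIDIA command encodings; the function name and tuple representation are ours, purely for illustration.

```python
def append_completion_notifier(cmd_group, seq_no):
    """Append the five commands TimeGraph places at the end of each
    GPU command group (illustrative sketch, not real command encodings)."""
    tail = [
        ("SERIALIZE",),       # drain in-flight commands before stamping
        ("SET_REF", seq_no),  # write the per-task sequence number to a register
        ("SERIALIZE",),       # ensure SET_REF lands before NOTIFY
        ("NOTIFY",),          # request a GPU-to-CPU interrupt
        ("NOP",),             # dummy dispatch that actually fires the interrupt
    ]
    return list(cmd_group) + tail
```

On completion, the IRQ handler would compare the register value against the expected sequence number and discard interrupts stamped by earlier commands.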

Task Wake-Up: Once the interrupt is verified, TimeGraph removes the GPU command group at the head of the GPU-online list. If the corresponding task is scheduled under the PRT policy, TimeGraph wakes up the highest-priority task in the wait queue, and inserts its GPU command group into the GPU-online list. If the task is assigned the HT policy, meanwhile, TimeGraph wakes up the highest-priority task in the same manner as the PRT policy, only when the GPU-online list is empty.

5 TimeGraph GPU Reservation

TimeGraph provides GPU reservation mechanisms to regulate GPU resource usage for tasks scheduled under the PRT policy. Each task is assigned a reserve that is represented by capacity C and period T. Budget e is the amount of time for which a task is entitled to execute. TimeGraph uses a popular rule for budget consumption and replenishment used in real-time systems [20, 26]. Specifically, the budget is decreased by the amount of time consumed on the GPU, and is replenished by at most capacity C once every period T. However, we need different reservation policies than previous work due to the asynchronous and non-preemptive nature of GPU processing, as we will describe in Section 5.1. Our GPU resource accounting and enforcement mechanisms will be described in Section 5.2. TimeGraph further supports prediction of GPU execution costs for strict isolation; Section 5.3 will describe our approach to GPU execution cost prediction.

[Figure 6: Example of GPU reservation in TimeGraph. (a) PE reservation; (b) AE reservation. Each panel plots the budget (from 0 up to capacity C) against GPU time across four arrivals and three period boundaries T, with "other commands" interleaved; panel (b) also marks the predicted cost.]

5.1 Reservation Policies

TimeGraph supports two GPU reservation policies. The Posterior Enforcement (PE) policy aims at lightweight reservation, allowing tasks to overrun their reserves to an extent. The Apriori Enforcement (AE) policy reduces reserve overruns by predicting GPU execution costs a priori, at the expense of additional overhead. We recommend that the PE policy be used primarily when isolation is required, and the AE policy be used only if extremely time-critical applications are concurrently executed on the GPU.

PE Reservation: The PE policy permits GPU command groups to be submitted to the GPU if the budget is greater than zero; otherwise, the task goes to sleep until the budget is replenished. The budget can become negative when the task overruns its reservation. The overrun penalty is, however, imposed on the next budget replenishment: the budget for the next period is given by e = min(C, e + C).

Figure 6 (a) shows how four GPU command groups of the same task are enforced under the PE policy. The budget is initialized to C. When the second GPU command group completes, the budget is negative. Hence, the third GPU command group must wait for the budget to be replenished, even though the GPU remains idle. Because GPU reservation operates under the PRT policy, the fourth GPU command group is blocked while another GPU command group is executing, even though the budget is greater than zero.

AE Reservation: For each GPU command group submission, the AE policy first predicts the GPU execution cost x. The GPU command group can be submitted to the GPU only if the predicted cost is no greater than the budget; otherwise, the task goes to sleep until the budget is replenished. The next replenishment amount depends on the predicted cost x and the currently-remaining budget e. If the predicted cost x is no greater than the capacity C, the budget for the next period is bounded by e = C to avoid transient overload. Otherwise, it is set to e = min(x, e + C). The task can be woken up only when e ≥ x.

Figure 6 (b) depicts how the same set of four GPU command groups used in Figure 6 (a) is controlled under the AE policy. For simplicity of description, we assume for now that prediction of GPU execution costs is perfectly accurate; Section 5.3 will describe how to predict GPU execution costs in practice. Unlike the PE policy, the second GPU command group is not submitted to the GPU, as its budget is less than the predicted cost, but is submitted later when the budget is replenished to e = min(x, e + C) > x. The fourth GPU command group also needs to wait until the budget is sufficiently replenished. However, unlike for the second GPU command group, the replenished budget is bounded by C, since x < C. This avoids transient overload.
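The PE and AE budget rules above can be written down directly. The following Python sketch restates the formulas from this section; the function names are ours, and real accounting would of course operate on measured GPU times rather than abstract numbers.

```python
def replenish_pe(e, C):
    # PE: an overrun (negative budget e) is charged against the next
    # replenishment, so the task gets strictly less than C back.
    return min(C, e + C)

def replenish_ae(e, C, x):
    # AE: when the predicted cost x fits within the capacity C, the next
    # budget is bounded by C to avoid transient overload; otherwise the
    # budget accumulates until it can cover x.
    if x <= C:
        return C
    return min(x, e + C)

def may_submit_ae(e, x):
    # AE admits a command group only when the predicted cost fits the budget.
    return e >= x
```

For example, a PE task that overran by 3 time units with capacity 10 receives only 7 units in the next period, while an AE task facing a predicted cost of 25 with capacity 10 keeps accumulating budget over several periods until e ≥ 25.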

Shared Reservation: TimeGraph allows multiple tasks to share a single reserve under the Shared mode. When some task creates a Shared reserve, other tasks can join it. The Shared mode can be used together with both the PE and AE policies. The Shared mode is useful when users want to cap the GPU resource usage of multiple tasks to a certain range. There is no need to adjust the capacity and period for each task. It can also reduce the overhead of reservation, since it only needs to manage one reserve for multiple tasks.

5.2 Accounting and Enforcement

GPU execution times are accounted for in the PushBuf interface and the IRQ handler, as illustrated in Figure 4. TimeGraph saves CPU timestamps when GPU command groups start and complete. Specifically, when each GPU command group is qualified to be submitted to the GPU, TimeGraph records the current CPU time as its start time in the PushBuf interface; at some later point, when TimeGraph is notified of the completion of this GPU command group, the current CPU time is recorded as its finish time in the IRQ handler. The difference between the start time and the finish time is accounted as the execution time of this GPU command group, and is subtracted from the budget.

Enforcement works differently for the PE and AE policies. In the PushBuf interface, the AE policy predicts the execution cost x of each GPU command group based on the idea presented in Section 5.3, while the PE policy always assumes x = 0. Then, both policies compare the budget e and the cost x. Only if e > x is satisfied can the GPU command group be submitted to the GPU. Otherwise, the corresponding task is suspended until the budget is replenished. It should be noted that this enforcement mechanism is very different from traditional CPU reservation mechanisms [20, 26] that use timers or ticks to suspend tasks: since GPU command groups are non-preemptive, we need to perform enforcement at GPU command group boundaries. TimeGraph however still uses timers to replenish the budget periodically. Every time the budget is replenished, it compares e and x again. If e > x is satisfied, the task is woken up, but it needs to be rescheduled, as illustrated in Figure 4.

5.3 Command Profiling

TimeGraph contains the GPU command profiler to predict GPU execution costs for AE reservation. Each GPU command is composed of the header and data, as shown in Figure 2. We hence parse the methods and the data sizes from the headers.

We now explain how to predict GPU execution costs from these pieces of information. GPU applications tend to repeatedly create GPU command groups with the same methods and data sizes, since they use the same set of API functions, e.g., OpenGL, and each function likely generates the same sequence of GPU commands in terms of methods and data sizes, while data values vary widely. Given that GPU execution costs depend highly on methods and data sizes, but not on data values, we propose a history-based prediction approach.

TimeGraph manages a history table to record GPU command group information. Each record consists of a GPU command group matrix and the average GPU execution cost associated with this matrix. The rows and columns of the matrix contain the methods and their data sizes, respectively. TimeGraph also attaches a flag to each GPU command group, indicating whether it hits some record. When the methods and the data sizes of a GPU command group are obtained from the remapped User Push Buffer, TimeGraph looks at the history table. If there exists a record that contains exactly the same GPU command group matrix, i.e., the same set of methods and data sizes, TimeGraph uses the average GPU execution cost stored in this record, and the flag is set. Otherwise, the flag is cleared, and TimeGraph uses the worst-case GPU execution cost among all the records. Upon the completion of the GPU command group, TimeGraph references the flag attached to the corresponding task. If the flag is set, it updates the average GPU execution cost of the record with the actual execution time of this GPU command group. Otherwise, it inserts a new record in which the matrix has the methods and data sizes of this GPU command group, and the average GPU execution cost is initialized with its actual execution time. The size of the history table is configurable by designers. If the total number of records exceeds the table size, the least-recently-used (LRU) record is removed.
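The lookup, update, and LRU-eviction logic above can be sketched in a few lines. This is a hypothetical model: we key records by the multiset of (method, data size) pairs rather than an actual matrix, and the class and method names are ours.

```python
from collections import OrderedDict

class CostPredictor:
    """Sketch of the history-based GPU cost predictor described above."""

    def __init__(self, max_records=100):
        self.table = OrderedDict()   # key -> (average cost, sample count)
        self.max_records = max_records

    @staticmethod
    def key(cmd_group):
        # Stand-in for the "GPU command group matrix": methods with data sizes.
        return tuple(sorted((method, size) for method, size in cmd_group))

    def predict(self, cmd_group):
        k = self.key(cmd_group)
        if k in self.table:
            self.table.move_to_end(k)       # mark record as recently used
            return self.table[k][0], True   # (predicted cost, hit flag)
        # Miss: fall back to the worst case among all records.
        worst = max((avg for avg, _ in self.table.values()), default=0.0)
        return worst, False

    def update(self, cmd_group, actual_cost, hit):
        k = self.key(cmd_group)
        if hit:
            avg, n = self.table[k]          # refine the running average
            self.table[k] = ((avg * n + actual_cost) / (n + 1), n + 1)
        else:
            self.table[k] = (actual_cost, 1)
            if len(self.table) > self.max_records:
                self.table.popitem(last=False)  # evict the LRU record
        self.table.move_to_end(k)
```

A caller would invoke predict() in the PushBuf path to obtain x and the hit flag, then update() from the IRQ handler with the measured execution time.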

Preemption Impact: Even the same GPU command group may consume very different GPU execution times. For example, if reusable texture data is cached, a graphics operation is much faster. We observe that GPU execution times can vary when GPU contexts (channels) are switched. Hence, TimeGraph checks for GPU context switches at every scheduling point. If the context has been switched, TimeGraph will not update the average GPU execution cost, since the context switch may have affected the actual GPU execution time. Instead, it saves the difference between the actual GPU execution time and the average GPU execution cost as the preemption impact, and keeps updating the average preemption impact. A single preemption cost is measured beforehand when TimeGraph is loaded. The preemption impact is then added to the predicted cost.
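The preemption-impact bookkeeping can be modeled as follows. This is an assumed structure for illustration only; the class name, the running-average scheme, and the observe/predict split are ours, not TimeGraph's.

```python
class PreemptionAwarePredictor:
    """Sketch: fold context-switch deviations into a running preemption
    impact instead of polluting the per-record average cost."""

    def __init__(self):
        self.preemption_impact = 0.0   # running average of observed deviations
        self.n = 0

    def observe(self, actual, avg, context_switched):
        # On a context switch, skip the average-cost update and fold the
        # deviation (actual - avg) into the running preemption impact.
        if context_switched:
            self.preemption_impact = (
                (self.preemption_impact * self.n + (actual - avg)) / (self.n + 1))
            self.n += 1
            return avg      # average cost left unchanged for this record
        return actual       # caller may update the record's average normally

    def predict(self, avg):
        # Predicted cost = average cost plus the preemption impact.
        return avg + self.preemption_impact
```

In this model, a group that normally averages 10 µs but took 12 µs across a context switch contributes 2 µs of preemption impact to future predictions.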

6 Evaluation

We now provide a detailed quantitative evaluation of TimeGraph on the NVIDIA GeForce 9800 GT graphics card with the default frequency and 1 GB of video memory. Our underlying platform is the Linux 2.6.35 kernel running on an Intel Xeon E5504 CPU with 4 GB of main memory. While our evaluation and discussion are focused on this graphics card, similar performance benefits from TimeGraph have also been observed with other graphics cards, viz., the GeForce GTX 285 and GTX 480.

As primary 3-D graphics benchmarks, we use the Phoronix Test Suite [24], which executes the OpenGL 3-D games OpenArena, World of Padman, Urban Terror, and Unreal Tournament 2004 (UT2004) in the demo mode based on the test profile, producing various GPU-intensive workloads. We also use MPlayer as a periodic workload. In addition, the Gallium3D Engine demo program is used as a regularly-behaved workload, and the Gallium3D Clearspd demo program, which exploits a GPU command bomb, is used as a misbehaving workload. Furthermore, we use SPECviewperf 11 [28] to evaluate the throughput of different GPU scheduling models. The screen resolution is set to 1280 × 1024. The scheduling parameters are loaded from the pre-configured TimeGraph specification file. The maximum number of records in the history table for GPU execution cost prediction is set to 100.

6.1 Prioritization and Isolation

We first evaluate the prioritization and isolation properties achieved by TimeGraph. As described in Section 3, TimeGraph automatically assigns priorities. CPU nice priorities are always effective, while GPU priorities are effective only when TimeGraph is activated. The priority levels are aligned between the GPU and CPU. We use the PRT policy for the X server to prevent it from affecting primary applications, but it is scheduled at the highest GPU/CPU priority, since it should still be responsive to blit the rendered frames to the screen.

Coarse-grained Performance: Figure 7 shows the performance of the 3-D games while the Engine widget concurrently shares the GPU. We use the HT policy for the 3-D games, while the Engine widget is assigned the PRT policy under TimeGraph. As shown in Figure 7, with GPU priority support TimeGraph improves the performance of the 3-D games by about 11% for OpenArena, 27% for World of Padman, 22% for Urban Terror, and 2% for UT2004. Further performance isolation is obtained by GPU reservation, capping the GPU resource usage of the Engine widget. Our experiment assigns the Engine widget a reserve of 2.5 ms every 25 ms to retain its GPU resource usage at 10%. As compared to the case without GPU reservation support, the performance of the 3-D games is improved by 2∼21% under PE reservation, and by 4∼36% under AE reservation. Thus, the AE policy provides better performance for the 3-D games at the expense of more conservative scheduling of the Engine widget with prediction.

Figure 8 presents the results from a setup similar to the above experiments, where the Clearspd bomb generates a heavily-competing workload instead of the Engine widget. The performance benefit resulting from assigning higher GPU priorities to the games under the HT policy is clearer in this setup. Even without GPU reservation support, TimeGraph enables the 3-D games to run about 3∼6 times faster than the vanilla Nouveau driver, though they still face a performance loss of about 24∼52% as compared to the previous setup, where the Engine widget contends with the 3-D games. Regulating the GPU resource usage of the Clearspd bomb through GPU reservation limits this performance loss to within 3%. In particular, the AE policy yields improvements of up to 5% over the PE policy.

Extreme Workloads: In order to evaluate the capabilities of TimeGraph in the face of extreme workloads, we execute the 3-D games with five instances of the Clearspd bomb. In this case, the cap of each individual reserve is correspondingly decreased to 0.5 ms every 25 ms so that the total cap of the five Clearspd-bomb tasks remains aligned with 2.5 ms every 25 ms. As there are multiple Clearspd-bomb tasks, we also evaluate an additional setup where a single PE reserve of 2.5 ms every 25 ms runs in the Shared reservation mode. As shown in Figure 9, the 3-D games are nearly unresponsive without TimeGraph support due to the scaled-up GPU workload, whereas TimeGraph can isolate the performance of the 3-D games even under such an extreme circumstance. In fact, the performance impact is reduced to 7∼20% by using GPU priorities, and leveraging GPU reservation results in nearly no performance loss, similar to the results in Figure 8. The Shared reservation mode also provides slightly better performance with PE reserves.

[Figure 7: Performance of the 3-D games competing with a single instance of the Engine widget. Average frame rate (fps) of OpenArena, World of Padman, Urban Terror, and UT2004 under: no TimeGraph support; HT & PRT scheduling; HT & PRT scheduling with PE reservation; HT & PRT scheduling with AE reservation.]

[Figure 8: Performance of the 3-D games competing with a single instance of the Clearspd bomb. Same axes and configurations as Figure 7.]

[Figure 9: Performance of the 3-D games competing with five instances of the Clearspd bomb. Same axes as Figure 7, with an additional HT & PRT scheduling with Shared PE reservation configuration.]

[Figure 10: Performance regulation by GPU reservation for the 3-D game and the 3-D widget. Frame rate (fps) versus the reserve size of the Engine widget (%), for the Engine widget and OpenArena contending with each other, with and without AE reservation for the Engine widget.]

[Figure 11: Performance of the Engine widget competing with five instances of the Clearspd bomb. Frame rate (fps) over 120 seconds of elapsed time under: no TimeGraph support; PRT scheduling; PRT scheduling with PE, AE, and Shared PE reservation.]

Performance Regulation: We next demonstrate the effectiveness of TimeGraph in regulating the frame-rate of each task by changing the size of its GPU reserve. Figure 10 shows the performance of the OpenArena game and the Engine widget contending with each other. The solid lines indicate a setup where the PE policy is assigned to both applications, while the dotted lines indicate a setup where the AE policy is assigned to the Engine widget instead. GPU reservation is configured so that the total GPU resource usage of the two applications is capped at 90%, and the remaining 10% is available for the X server. Assigning the AE policy to the Engine widget slightly improves the performance of the OpenArena game, while it brings a performance penalty for the Engine widget itself due to the overhead of predicting GPU execution costs. In either case, however, TimeGraph successfully regulates the frame-rate in accordance with the size of the GPU reserve. From this experiment, we conclude that it is desirable to assign the OpenArena game a GPU reserve with C/T = 60∼80% and the Engine widget one with C/T = 10∼30%, given that this configuration provides both applications with an acceptable frame-rate over 25 fps.

[Figure 12: Interference among three widget instances. Frame rate (fps) of Engine #1, Engine #2, and Engine #3 over 120 seconds of elapsed time under: (a) no TimeGraph support; (b) PRT scheduling; (c) PRT scheduling and PE reservation.]

Fine-grained Performance: The 3-D games demonstrate highly variable frame-rate workloads, while 3-D widgets often exhibit nearly constant frame-rates. In order to study the behavior of TimeGraph on both of these categories of applications, we look at the variability of frame-rate over time for the Engine widget contending with five instances of the Clearspd bomb, as shown in Figure 11. The total GPU resource usage of the Clearspd-bomb tasks is capped at 2.5 ms every 25 ms through GPU reservation, and a higher priority is given to the Engine widget. These results show that GPU reservation can provide stable frame-rates over time for the Engine widget. Since the Engine widget is not as GPU-intensive as the 3-D games, it is affected more by the Clearspd bomb overloading the GPU when GPU reservation is not applied. The benefits of GPU reservation are therefore more clearly observed.

Interference Issues: We now evaluate the interference among regularly-behaved concurrent 3-D widgets. Figure 12 (a) shows a chaotic behavior arising from executing three instances of the Engine widget concurrently, with different CPU priorities but without TimeGraph support. Although the Engine widget by itself is a very regular workload, when competing with more instances of itself, the GPU resource usage exhibits high variability and unpredictability. Figure 12 (b) illustrates the improved behavior under TimeGraph using the PRT policy, where we assign the high, medium, and low GPU priorities to Engine #1, Engine #2, and Engine #3 respectively, using the user-space tool presented in Section 3. TimeGraph successfully provides predictable response times for the three tasks according to their priorities. Further performance isolation can be achieved by GPU reservation, exploiting different sizes of GPU reserves: (i) 15 ms every 25 ms for Engine #1, (ii) 5 ms every 50 ms for Engine #2, and (iii) 5 ms every 100 ms for Engine #3, as shown in Figure 12 (c). The PE policy is used here. Since the Engine widget has a non-trivial dependence on the CPU, the absolute performance is lower than expected for smaller reserves.

[Figure 13: Performance of MPlayer competing with five instances of the Clearspd bomb. Frame rate (fps) over 200 seconds of elapsed time, with no TimeGraph support and with PRT scheduling plus PE reservation.]

Periodic Behavior: For evaluating the impact on applications with periodic activity, we execute MPlayer in the foreground while five instances of the Clearspd bomb contend for the GPU. We use an H.264-compressed video, with a frame size of 1920×800 and a frame rate of 24 fps, which uses X-Video acceleration on the GPU. As shown in Figure 13, the video playback experience is significantly disturbed without TimeGraph support. When TimeGraph assigns a PE reserve of 10 ms every 40 ms to MPlayer, and a PE reserve of 5 ms every 40 ms to the Clearspd-bomb tasks in the Shared reservation mode, the playback experience is significantly improved and closely follows the ideal frame-rate of 24 fps for video playback. This illustrates the benefits of TimeGraph for interactivity, where performance isolation plays a vital role in determining user experience.

6.2 GPU Execution Cost Prediction

We now evaluate the history-based prediction of GPU execution costs for realizing GPU reservation with the AE policy. The effectiveness of AE reservation relies highly on GPU execution cost prediction. Hence, it is important to identify the types of applications for which we can predict GPU execution costs more precisely. Figure 14 shows both actual GPU execution costs and prediction errors for the 3-D graphics applications used in earlier experiments: Engine, Clearspd, and OpenArena. Since these applications issue a very large number of GPU command groups, we focus on a snapshot between the 5000th and the 10000th GPU command groups of each application. In the following discussion, positive errors denote pessimistic predictions. Figure 14 (a) shows that the history-based prediction approach employed by TimeGraph performs within a 15% error margin for the Engine widget, which uses a reasonable number of methods. For the Clearspd bomb, which issues a very limited set of methods while producing extreme workloads, TimeGraph can predict GPU execution costs within a 7% error margin, as shown in Figure 14 (b). On the other hand, the results of GPU execution cost prediction under OpenArena, provided in Figure 14 (c), show that only about 65% of the observed GPU command groups have predicted GPU execution costs within a 20% error margin. Such unpredictability arises from the inherently dynamic nature of complex computer graphics, like abrupt scene changes. The actual penalty of misprediction is, however, suffered only once per reserve period, and is hence not expected to be significant for reserves with long periods.

[Figure 14: Errors for GPU execution cost prediction. For each of (a) the Engine widget, (b) the Clearspd bomb, and (c) the OpenArena game, the actual GPU cost (µs) and the prediction error (µs) are plotted over the 5000th to 10000th GPU command groups.]

In addition to the presented method, we have also explored static approaches using pre-configured values for predicted costs. Our experiments show that such static approaches perform worse than the presented dynamic approach, largely due to the dynamically changing and non-stationary nature of application workloads.

GPU execution cost prediction plays a vital role in real-time setups, where it is unacceptable for low-priority tasks to cause even the slightest interference to high-priority tasks. The above experimental results show that our prediction approach tends to fail for complex interactive applications like OpenArena. However, we expect the structure of real-time applications to be less dynamic and more regular, like the Engine and Clearspd tasks. GPU reservation with the AE policy for complex applications like OpenArena would require support from the application program itself, since their behavior is not easily predictable from historic execution results. Otherwise, the PE policy is preferred for its low overhead.

6.3 Overhead and Throughput

In order to quantify the performance overhead imposed by TimeGraph, we measure the standalone performance of the 3-D game benchmarks. Figure 15 shows that assigning the HT policy to both the games and the X server incurs about 4% performance overhead for the games. This small overhead is attributed to the fact that TimeGraph is still invoked upon every arrival and completion of a GPU command group. It is interesting to see that assigning the PRT policy to the X server increases the overhead for the games up to about 10%, even though the games use the HT policy. As the X server is used to blit the rendered frames to the screen, it can lower the frame-rate of a game if it is blocked by the game itself. On the other hand, assigning the PRT policy to both the X server and the game adds a non-trivial overhead of about 17∼28%, largely due to queuing delays and the cost of scheduling and dispatching every GPU command group, as TimeGraph needs to be invoked for each submission. We however conjecture that such overhead is inevitable for GPU scheduling at the device-driver level to shield important GPU applications from performance interference. Without TimeGraph support, the performance of high-priority GPU applications could significantly decrease in the presence of competing GPU workloads (as shown in Figure 1), which affects the system performance more than the maximum scheduling overhead of 28% introduced by TimeGraph. As a consequence, TimeGraph is evidently beneficial in real-time multi-tasking environments.

[Figure 15: Performance overheads of TimeGraph. Average frame rate (fps) of the four 3-D games under: no timing support; HT scheduling for both game and Xorg; HT scheduling; PRT scheduling; and PRT scheduling with reservation.]

[Figure 16: Throughput of event-driven and tick-driven schedulers in TimeGraph. Frame rate (fps) for the SPECviewperf 11 Maya benchmarks (hand shaded, hand shaded HQ, hand wireframe, squid shaded, squid shaded HQ, toy store shaded, toy store wireframe, wolf fur select, wolf fur select HQ, wolf shaded, wolf shaded HQ) with no TimeGraph support, the event-driven model, and the tick-driven model.]

Finally, we compare the throughput of two conceivable GPU scheduling models: (i) the event-driven scheduler adopted in TimeGraph, and (ii) the tick-driven scheduler presented in previous work, called GERM [1, 5]. The TimeGraph event-driven scheduler is invoked only when GPU command groups arrive and complete, while the GERM tick-driven scheduler is invoked at every tick, configured to be 1 ms (Linux jiffies). Figure 16 shows the results of SPECviewperf benchmarking with the Maya viewset created for 3-D computer graphics, where the PRT policy is used for GPU scheduling. Surprisingly, the TimeGraph event-driven scheduler obtains about 15∼30 times better scores than the tick-driven scheduler for most test cases. According to our analysis, this difference arises from the fact that many GPU command groups can arrive in a very short interval. The GERM tick-driven scheduler reads a particular GPU register every tick to verify whether the current GPU command group has completed. Suppose that there are 30 GPU command groups with a total execution time of less than 1 ms. The tick-driven scheduler takes at least 30 ms to complete these GPU command groups, because the GPU register is read only once per tick, while the event-driven scheduler could complete them in 1 ms, as GPU-to-CPU interrupts are used. Hence, non-trivial overheads are imposed on the tick-driven scheduler.
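The arithmetic above can be captured in a minimal model. This is purely illustrative: it assumes, as in the example, that the tick-driven scheduler retires at best one command group per tick, while the event-driven scheduler is bounded only by actual execution time.

```python
def tick_driven_completion_ms(n_groups, tick_ms=1.0):
    # Completion is detected only at tick boundaries, so at best one
    # command group is retired per tick, regardless of how short it runs.
    return n_groups * tick_ms

def event_driven_completion_ms(total_exec_ms):
    # Interrupts fire at actual completion, so the total time tracks
    # the aggregate execution time of the command groups.
    return total_exec_ms
```

With 30 command groups totaling under 1 ms of execution, this model yields 30 ms versus 1 ms, consistent with the roughly 30x throughput gap observed in Figure 16.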

7 Related Work

GPU Scheduling: The Graphics Engine Resource Manager (GERM) [1, 5] aims for GPU multi-tasking support similar to TimeGraph. The resource management concepts of TimeGraph and GERM are, however, fundamentally different. TimeGraph focuses on prioritization and isolation among competing GPU applications, while fairness is the primary concern for GERM. Since fair resource allocation cannot shield particular important tasks from interference in the face of extreme workloads, as reported in [31], TimeGraph addresses this problem for GPU applications through priority and reservation support. The approaches to synchronizing the GPU with the CPU also differ between TimeGraph and GERM. TimeGraph is based on an event-driven model that uses GPU-to-CPU interrupts, whereas GERM adopts a tick-driven model that polls a particular GPU register. As demonstrated in Section 6.3, the tick-driven model can become unresponsive when many GPU commands arrive in a short interval, which could likely happen for graphics and compute-intensive workloads, while TimeGraph remains responsive even in such cases. Hence, TimeGraph is more suitable for real-time applications. In addition, TimeGraph can predict GPU execution costs a priori, taking into account both methods and data sizes, while GERM estimates them a posteriori, using only data sizes. Since GPU execution costs depend heavily not only on data sizes but also on methods, we claim that TimeGraph computes GPU execution costs more precisely. However, additional computation overheads are required for prediction. TimeGraph therefore provides lightweight reservation with the PE policy, without prediction, to address this trade-off. Furthermore, TimeGraph falls entirely inside the device driver, while GERM is spread across the device driver and a user-space library. Hence, GERM could require major modifications for different runtime frameworks, e.g., OpenGL, OpenCL, CUDA, and HMPP.

The Windows Display Driver Model (WDDM) [25] is a GPU driver architecture for Microsoft Windows. While it is proprietary, GPU priorities seem, in our experience, to be supported, but they are not explicitly exposed to the user space as a first-class primitive. Apparently, there is no GPU reservation support. Moreover, since NVIDIA shares more than 90% of its driver code between Linux and Windows [23], WDDM eventually suffers from the same performance interference as demonstrated in Figure 1.

VMGL [11] supports virtualization of the OpenGL APIs for graphics applications running inside a Virtual Machine (VM). It passes graphics requests from guest OSes to a VMM host, but GPU resource management is left to the underlying device driver. The GPU-accelerated Virtual Machine (GViM) [8] virtualizes the GPU at the level of abstraction for GPGPU applications, such as the CUDA APIs. However, since this solution sits 'above' the device driver layer, GPU resource management is coarse-grained and functionally limited. VMware's Virtual GPU [3] enables GPU virtualization at the I/O level. Hence, it operates faster, and its usage is not limited to GPGPU applications. However, multi-tasking support with prioritization, isolation, or fairness is not provided. TimeGraph could coordinate with these GPU virtualization systems to provide predictable response times and isolation.

CPU Scheduling: TimeGraph shares the concepts of priority and reservation, which have been well studied by the real-time systems community [13, 26], but there is a fundamental difference from these traditional studies: TimeGraph is designed to address an arbitrarily-arriving non-preemptive GPU execution model, whereas the real-time systems community has typically considered a periodic preemptive CPU execution model. Several bandwidth-preserving approaches [12, 29, 30] for an arbitrarily-arriving model exist, but a non-preemptive model has not been studied much yet. The concepts of priority and reservation have also been considered in the operating systems literature [4, 9, 10, 17, 20, 31]. Specifically, batch scheduling has a constraint similar to GPU scheduling in that non-preemptive regions disturb predictable responsiveness [27]. These previous works are, however, mainly focused on synchronous on-chip CPU architectures, whereas TimeGraph addresses these scheduling problems for asynchronous on-board GPU architectures, where explicit synchronization between the GPU and the CPU is required.

Disk Scheduling: Disk devices are similar to GPUs in that they operate with non-preemptive regions off the chip. Disk scheduling for real-time and interactive systems [2, 16] has therefore considered priority and reservation support for non-preemptive operation. However, the GPU is typically a coprocessor independent of the CPU, with its own set of execution contexts, registers, and memory devices, while the disk is more dedicated to I/O. Hence, TimeGraph uses completely different mechanisms to realize prioritization and isolation than these previous works. In addition, TimeGraph needs to address the trade-off between predictable response times and throughput, since synchronizing the GPU and the CPU incurs overhead, while disk I/O is inherently synchronous with read and write operations.

8 Conclusions

This paper has presented TimeGraph, a GPU scheduler to support real-time multi-tasking environments. We developed an event-driven model to schedule GPU commands in a responsive manner. This model allowed us to propose two GPU scheduling policies, Predictable Response Time (PRT) and High Throughput (HT), which address the trade-off between response times and throughput. We also proposed two GPU reservation policies, Posterior Enforcement (PE) and Apriori Enforcement (AE), which provide an essential design knob for choosing the level of isolation and throughput. Our detailed evaluation demonstrated that TimeGraph can protect important GPU applications even in the face of extreme GPU workloads, while providing high throughput, in real-time multi-tasking environments. TimeGraph is open-source software, and may be downloaded from our website at http://rtml.ece.cmu.edu/projects/timegraph/.

In future work, we will elaborate the coordination of GPU and CPU resource management schemes to further consolidate prioritization and isolation capabilities for the entire system. We are also interested in coordinating video memory and system memory management schemes. Exploration of other models for GPU scheduling is another interesting direction of future work. For instance, modifying the current API to introduce non-blocking interfaces could improve throughput at the expense of modifications to legacy applications. Scheduling overhead and blocking time may also be reduced by implementing a real-time satellite kernel [18] on microcontrollers present in modern GPUs. Finally, we will tackle the problem of mapping application-level specifications, such as frame-rates, into priority and reservation properties at the operating-system level.

References

[1] Bautin, M., Dwarakinath, A., and Chiueh, T. Graphics Engine Resource Management. In Proc. MMCN (2008).

[2] Dimitrijevic, Z., Rangaswami, R., and Chang, E. Design and Implementation of Semi-preemptible IO. In Proc. USENIX FAST (2003).

[3] Dowty, M., and Sugerman, J. GPU Virtualization on VMware's Hosted I/O Architecture. ACM SIGOPS Operating Systems Review 43, 3 (2009), 73-82.

[4] Duda, K., and Cheriton, D. Borrowed-Virtual-Time (BVT) Scheduling: Supporting Latency-Sensitive Threads in a General-Purpose Scheduler. In Proc. ACM SOSP (1999), pp. 261-276.

[5] Dwarakinath, A. A Fair-Share Scheduler for the Graphics Processing Unit. Master's thesis, Stony Brook University, 2008.

[6] Faith, R. The Direct Rendering Manager: Kernel Support for the Direct Rendering Infrastructure. Precision Insight, Inc., 1999.

[7] Freedesktop. Nouveau Open-Source Driver. http://nouveau.freedesktop.org/.

[8] Gupta, V., Gavrilovska, A., Tolia, N., and Talwar, V. GViM: GPU-accelerated Virtual Machines. In Proc. ACM HPCVirt (2009), pp. 17-24.

[9] Jones, M., Rosu, D., and Rosu, M.-C. CPU Reservations and Time Constraints: Efficient, Predictable Scheduling of Independent Activities. In Proc. ACM SOSP (1997), pp. 198-211.

[10] Krasic, C., Saubhasik, M., and Goel, A. Fair and Timely Scheduling via Cooperative Polling. In Proc. ACM EuroSys (2009), pp. 103-116.

[11] Lagar-Cavilla, H., Tolia, N., Satyanarayanan, M., and de Lara, E. VMM-Independent Graphics Acceleration. In Proc. ACM VEE (2007), pp. 33-43.

[12] Lehoczky, J., Sha, L., and Strosnider, J. Enhanced Aperiodic Responsiveness in Hard Real-Time Environments. In Proc. IEEE RTSS (1987), pp. 261-270.

[13] Liu, C., and Layland, J. Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment. Journal of the ACM 20 (1973), 46-61.

[14] Martin, K., Faith, R., Owen, J., and Akin, A. Direct Rendering Infrastructure, Low-Level Design Document. Precision Insight, Inc., 1999.

[15] Mesa3D. Gallium3D. http://www.mesa3d.org/.

[16] Molano, A., Juvva, K., and Rajkumar, R. Real-Time Filesystems: Guaranteeing Timing Constraints for Disk Accesses in RT-Mach. In Proc. IEEE RTSS (1997), pp. 155-165.

[17] Nieh, J., and Lam, M. SMART: A Processor Scheduler for Multimedia Applications. In Proc. ACM SOSP (1995).

[18] Nightingale, E., Hodson, O., McIlroy, R., Hawblitzel, C., and Hunt, G. Helios: Heterogeneous Multiprocessing with Satellite Kernels. In Proc. ACM SOSP (2009).

[19] NVIDIA Corporation. Proprietary Driver. http://www.nvidia.com/page/drivers.html.

[20] Oikawa, S., and Rajkumar, R. Portable RK: A Portable Resource Kernel for Guaranteed and Enforced Timing Behavior. In Proc. IEEE RTAS (1999), pp. 111-120.

[21] PathScale Inc. ENZO. http://www.pathscale.com/.

[22] PathScale Inc. PSCNV. https://github.com/pathscale/pscnv.

[23] Phoronix. NVIDIA Developer Talks Openly About Linux Support. http://www.phoronix.com/scan.php?page=article&item=nvidia_qa_linux&num=2.

[24] Phoronix. Phoronix Test Suite. http://www.phoronix-test-suite.com/.

[25] Pronovost, S., Moreton, H., and Kelley, T. Windows Display Driver Model (WDDM v2) And Beyond. In Windows Hardware Engineering Conference (2006).

[26] Rajkumar, R., Lee, C., Lehoczky, J., and Siewiorek, D. A Resource Allocation Model for QoS Management. In Proc. IEEE RTSS (1997), pp. 298-307.

[27] Roussos, K., Bitar, N., and English, R. Deterministic Batch Scheduling Without Static Partitioning. In Proc. JSSPP (1999), pp. 220-235.

[28] SPEC. SPECviewperf. http://www.spec.org/gwpg/gpc.static/vp11info.html.

[29] Sprunt, B., Lehoczky, J., and Sha, L. Exploiting Unused Periodic Time for Aperiodic Service using the Extended Priority Exchange Algorithm. In Proc. IEEE RTSS (1988), pp. 251-258.

[30] Spuri, M., and Buttazzo, G. Efficient Aperiodic Service under Earliest Deadline Scheduling. In Proc. IEEE RTSS (1994), pp. 2-11.

[31] Yang, T., Liu, T., Berger, E., Kaplan, S., and Moss, J. E. B. Redline: First Class Support for Interactivity in Commodity Operating Systems. In Proc. USENIX OSDI (2008), pp. 73-86.