
Anatomy of GPU Memory System for Multi-Application Execution

Adwait Jog 1,2   Onur Kayiran 2,4   Tuba Kesten 2   Ashutosh Pattnaik 2
Evgeny Bolotin 3   Niladrish Chatterjee 3   Stephen W. Keckler 3,5   Mahmut T. Kandemir 2   Chita R. Das 2

1 The College of William and Mary   2 The Pennsylvania State University   3 NVIDIA   4 AMD Research   5 The University of Texas at Austin

[email protected], (onur, tzk133, ashutosh, kandemir, das)@cse.psu.edu, (ebolotin, nchatterjee, skeckler)@nvidia.com

ABSTRACT

As GPUs make headway in the computing landscape spanning mobile platforms, supercomputers, cloud and virtual desktop platforms, supporting concurrent execution of multiple applications in GPUs becomes essential for unlocking their full potential. However, unlike CPUs, multi-application execution in GPUs is little explored. In this paper, we study the memory system of GPUs in a concurrently executing multi-application environment. We first present an analytical performance model for many-threaded architectures and show that the common use of misses-per-kilo-instruction (MPKI) as a proxy for performance is not accurate without considering the bandwidth usage of applications. We characterize the memory interference of applications and discuss the limitations of existing memory schedulers in mitigating this interference. We extend the analytical model to multiple applications and identify the key metrics to control various performance metrics. We conduct extensive simulations using an enhanced version of GPGPU-Sim targeted for concurrently executing multiple applications, and show that memory scheduling decisions based on MPKI and bandwidth information are more effective in enhancing system throughput compared to the traditional FR-FCFS and the recently proposed RR FR-FCFS policies.

1. INTRODUCTION

Graphics Processing Units (GPUs) are becoming an inevitable part of heterogeneous computing systems because of their ability to accelerate applications with abundant data-level parallelism. The computing trajectory of GPUs has evolved from traditional graphics rendering to accelerating general purpose and high performance computing applications, and of late to cloud and virtual desktop computing. Two trends have driven this trajectory. First, GPU resources are growing rapidly with each technology generation [2, 4, 6], providing the workhorse for efficient computation. Second, advances in software and virtualization technology for GPUs, such as NVIDIA GRID [3] and the OpenStack IaaS framework, have made the transition possible.

GPU virtualization is required for concurrent access to GPU resources by multiple applications, potentially originating from different users. This can be facilitated by spatial as well as temporal allocation of GPU resources. The current NVIDIA GRID [3] and other cloud providers support virtualization by time multiplexing. Spatial resource sharing has yet to evolve because, unlike CPUs, GPUs were traditionally designed to execute only a single application at a time. However, it has been shown recently that executing only a single application at a time may not effectively utilize the available computing resources in state-of-the-art GPUs [35], thereby making a compelling case for multi-application execution [22, 35]. Thus, supporting multi-application execution is essential both from a performance and a utility (adoption in cloud environments) perspective.

Unlike CPU-based architectures, where resource allocation and scheduling for multi-application execution have been studied extensively, only a few recent papers (e.g., [5, 17, 22, 35, 44]) have scratched the surface of multi-application execution in the context of GPUs. Among them, only a few works [22] have addressed the problem of multi-application interference in the GPU memory system. Therefore, many of the design issues are still little understood. In this paper, we focus on the interactions of multiple applications in the GPU memory system, and specifically attempt to answer the following questions: (i) How do we characterize the interactions between multiple applications in the GPU memory system? (ii) What are the limitations of traditional application-agnostic FR-FCFS [37, 38, 46] and RR FR-FCFS scheduling [22] in the context of throughput-oriented GPU platforms? (iii) Is it possible to push the performance envelope further with an efficient scheduling mechanism? (iv) How do we explore the design space and develop analytical performance models to find appropriate knobs for guiding the scheduling decisions? In this context, this paper makes the following contributions:

• Contrary to the common use of misses-per-kilo-instruction (MPKI) as a metric for gauging memory intensity and as a proxy for application performance, we show that a model based on both MPKI and the achieved DRAM bandwidth gauges memory intensity and estimates the performance of GPU applications more accurately.

• We perform a detailed analysis of application characteristics and interactions among multiple applications in GPUs to classify applications and understand the scheduling design space based on this classification.

• We develop a simple analytical model to demonstrate that L2-MPKI and bandwidth utilization can be used as two control knobs to optimize performance metrics: instruction throughput (IT) and weighted speedup (WS), respectively. Based on this analytical model, we develop two memory scheduling schemes, ITS and WEIS, which are customized to improve IT and WS, respectively. We show that the proposed solutions remain effective with different core-partitioning configurations and scale to running up to three applications concurrently.

• We qualitatively and quantitatively compare our schemes with the traditional FR-FCFS and the recently proposed round-robin RR FR-FCFS [22] schedulers. Across 25 representative workloads, ITS improves IT by 34% and 8% over FR-FCFS and RR FR-FCFS, respectively; and WEIS improves WS by 10% and 5% over FR-FCFS and RR FR-FCFS, respectively.

• We believe this is the first paper that conducts an in-depth evaluation of GPU memory systems in a multi-application environment. In this context, we have developed a GPU Concurrent Application suite (GCA) and a parallel workload simulation framework by extending GPGPU-Sim [8], a cycle-accurate GPU simulator. We will open-source the GCA framework for fostering future research.

[Figure 1: Overview of our baseline architecture capable of executing multiple applications. SMs, each with a private L1 cache, are partitioned among Apps 1..N and connected via an on-chip network to memory partitions, each containing an L2 slice and a memory controller (MC).]

2. BACKGROUND

2.1 Baseline Architecture

In this paper, we consider a generic NVIDIA-like GPU as our baseline, where multiple cores, also called streaming multiprocessors (SMs)¹, are connected to multiple memory controllers (MCs) via an on-chip network, as shown in Figure 1. Each MC is part of a memory partition that also contains a slice of the L2 cache for faster data access. The details of our baseline configuration are shown later in Table 2.

¹ In this work, we use the terms "core" and "SM" interchangeably.

Single Application Scheduling: CUDA uses computational kernels to take advantage of parallel regions in the application. Each application may include multiple kernels. GPUs execute all kernels of an application sequentially, i.e., one kernel at a time. Each kernel is organized as blocks of cooperative thread arrays (CTAs) that are executed in parallel on the whole GPU. During kernel launch, the CTA scheduler initiates scheduling of the CTAs related to that kernel and tries to distribute them evenly [8].

Multiple Application Scheduling: In this paper, we simultaneously execute kernels from different applications. We use a kernel-to-SMs allocation scheme where we distribute SMs evenly in a spatial manner based on the number of applications, as shown in Figure 1. For example, if our GPU consists of 30 SMs that need to be partitioned among two applications, we assign the first 15 SMs to the first application and the rest of the SMs to the second application. As the focus of this work is memory (not caches), unless otherwise specified, we equally partition SMs and L2 cache across concurrently executing applications. We also evaluate our schemes with two different SM-partitioning techniques (i.e., 10-20 and 20-10) to demonstrate the robustness of our model with respect to core partitioning in Section 8.4.

Memory Scheduling: The most widely used memory scheduler in GPUs is first-ready FCFS (FR-FCFS) [37, 38, 46], which is implemented in hardware. This scheme is targeted at improving DRAM row hit rates, so the request prioritization order is: 1) row-hit requests are prioritized over other requests; then 2) older requests are prioritized over younger ones.
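For concreteness, the sketch below (our own illustration, not from the paper; the `Request` fields and the open-row bookkeeping are assumptions) shows the FR-FCFS selection order described above: row-hit requests first, then the oldest request.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical request record; field names are illustrative only.
struct Request {
    uint64_t arrival_time;  // cycle at which the request entered the MC queue
    uint32_t bank;          // targeted DRAM bank
    uint32_t row;           // targeted DRAM row
};

// FR-FCFS: prefer row-hit requests; among equal candidates, pick the oldest.
// open_row[b] holds the currently open row of bank b.
int PickFRFCFS(const std::vector<Request>& queue,
               const std::vector<uint32_t>& open_row) {
    int best = -1;
    bool best_is_hit = false;
    for (int i = 0; i < static_cast<int>(queue.size()); ++i) {
        bool hit = (queue[i].row == open_row[queue[i].bank]);
        if (best == -1 ||
            (hit && !best_is_hit) ||                                   // 1) row hits first
            (hit == best_is_hit &&
             queue[i].arrival_time < queue[best].arrival_time)) {      // 2) then oldest
            best = i;
            best_is_hit = hit;
        }
    }
    return best;  // index of the selected request, or -1 if the queue is empty
}
```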

2.2 Evaluation Metrics and Application Suite

Typically, the performance of a single application is captured by its instruction throughput (IT). When multiple applications execute concurrently, instruction throughput measures the raw machine throughput and is given by $IT = \sum_{i=1}^{N} IPC_i$, where there are N co-running applications and $IPC_i$ is the number of committed instructions per cycle of the i-th application. Note that this metric only considers IPC throughput, without taking fairness into account. In this work, we use this metric to evaluate the pure machine performance of GPUs without considering the fairness aspect.

For evaluating system throughput, we use Weighted Speedup (WS), which indicates how many jobs are executed per unit time: $WS = \sum_{i=1}^{N} SD_i$, where $SD_i$ is the slowdown of the i-th application, given by $SD_i = IPC_i / IPC_i^{alone}$, and $IPC_i^{alone}$ is the IPC of the i-th application when running alone. Assuming there is no constructive interference among applications, the maximum value of WS is equal to the number of applications.

We also show the impact of our schemes on Harmonic Speedup (HS), which not only measures system performance but also has a notion of fairness [28], and is given by $HS = 1 / \left( \sum_{i=1}^{N} \frac{1}{SD_i} \right)$. The Average Normalized Turnaround Time (ANTT) metric is the reciprocal of HS.

Application Suite: For experimental evaluations, we use a wide range of GPGPU applications implemented in CUDA. These applications are chosen from Rodinia [10], Parboil [41], CUDA SDK [8], and SHOC [11], and are listed in Table 1. In total we study 25 applications.
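To make the relationships between these metrics concrete, the helper below (our own illustration; it assumes per-application IPCs from the shared run and from alone runs are available) computes IT, WS, HS, and ANTT exactly as defined above.

```cpp
#include <vector>

struct Metrics { double it, ws, hs, antt; };

// ipc_shared[i] : IPC of application i when co-running
// ipc_alone[i]  : IPC of application i when running alone
Metrics ComputeMetrics(const std::vector<double>& ipc_shared,
                       const std::vector<double>& ipc_alone) {
    Metrics m{0.0, 0.0, 0.0, 0.0};
    double inv_sd_sum = 0.0;
    for (size_t i = 0; i < ipc_shared.size(); ++i) {
        double sd = ipc_shared[i] / ipc_alone[i];  // slowdown SD_i (at most 1 under contention)
        m.it += ipc_shared[i];                     // IT  = sum of co-running IPCs
        m.ws += sd;                                // WS  = sum of SD_i
        inv_sd_sum += 1.0 / sd;
    }
    m.hs = 1.0 / inv_sd_sum;                       // HS   = 1 / (sum of 1/SD_i)
    m.antt = inv_sd_sum;                           // ANTT = reciprocal of HS
    return m;
}
```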

3. PERFORMANCE CHARACTERIZATION OF MANY-THREADED ARCHITECTURES

In this section, we revisit a performance model for many-threaded architectures proposed by Guz et al. [18], and classify our applications based on this model.

3.1 A Model for Many-threaded Architectures

Recent works have shown that bandwidth is usually the critical bottleneck in many-threaded architectures like GPUs [23, 24, 26, 43]. The considered model [18] shows that the performance of many-threaded architectures is directly proportional to the bandwidth that the application receives from DRAM (attained DRAM bandwidth). In this model, application performance is expressed as

$$P = \frac{BW}{b_{reg} \times r_m \times L_{miss}} \quad (1)$$



where
P = performance [Operations/second (Ops)]
BW = attained DRAM bandwidth [Bytes/second (Bps)]
$r_m$ = the ratio of the number of memory instructions to the number of total instructions
$b_{reg}$ = the operand size [Bytes]
$L_{miss}$ = cache miss rate

This generic model assumes a single memory space, a single-level cache, and DRAM. As a result, the miss rate is used for quantifying DRAM accesses. In addition, it ignores the case of multiple memory spaces and the case of scratchpad usage, e.g., the case where the scratchpad can provide an additional source of bandwidth. The operand size represents the amount of data fetched to/from DRAM on each cache miss, which is equal to the cache line size.

In this model, as the numerator in (1) is BW, performance is directly proportional to the bandwidth attained by the application. We express the denominator as

$$b_{reg} \times r_m \times L_{miss} = b_{reg} \times \frac{i_{mem}}{i_{tot}} \times \frac{c_{miss}}{c_{tot}} = b_{reg} \times \underbrace{\frac{c_{miss}}{i_{tot}}}_{MPI} \times \underbrace{\frac{i_{mem}}{c_{tot}}}_{1} = \underbrace{b_{reg} \times MPI}_{\text{Bytes to transfer from DRAM to commit an instruction}} \quad (2)$$

where
$i_{mem}$ = the number of memory instructions
$i_{tot}$ = the number of total instructions
$c_{miss}$ = the number of cache misses
$c_{tot}$ = the number of cache accesses
MPI = the number of cache misses per instruction

Thus, from (1) and (2), we note that performance is directly proportional to the achieved DRAM bandwidth, and is inversely proportional to the size of the datum that needs to be fetched to/from DRAM in order to commit an instruction. Since we consider a system with two levels of cache with a constant cache-line size, the overall performance becomes inversely proportional to the number of L2 misses that need to be served for committing one thousand instructions (L2 misses per kilo-instruction, L2MPKI), as given in (3). In the rest of the paper, we use the term MPKI to represent L2MPKI.

$$P \propto \frac{BW}{MPKI} \quad (3)$$
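As an illustration of how (3) can be applied, the helper below is our own sketch: since the model only gives IPC up to a proportionality constant, it normalizes both the model estimate and the simulated IPC to their respective maxima before computing the mean absolute relative error (MARE). Note that Figure 2 in the paper instead normalizes to the maximum achievable IPC of the architecture.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Compare Equation (3) against simulated IPC values (illustrative only).
double ModelMARE(const std::vector<double>& sim_ipc,   // simulated IPC per application
                 const std::vector<double>& bw,        // attained DRAM bandwidth per application
                 const std::vector<double>& mpki) {    // L2-MPKI per application
    std::vector<double> model(sim_ipc.size());
    double sim_max = 0.0, model_max = 0.0;
    for (size_t i = 0; i < sim_ipc.size(); ++i) {
        model[i] = bw[i] / mpki[i];                    // P is proportional to BW / MPKI
        sim_max = std::max(sim_max, sim_ipc[i]);
        model_max = std::max(model_max, model[i]);
    }
    double err = 0.0;
    for (size_t i = 0; i < sim_ipc.size(); ++i) {
        double s = sim_ipc[i] / sim_max;               // normalized simulated IPC
        double p = model[i] / model_max;               // normalized model IPC
        err += std::fabs(s - p) / s;                   // absolute relative error per application
    }
    return err / sim_ipc.size();                       // mean absolute relative error (MARE)
}
```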

We validate this model by simulating the applications from Table 1 using the GPGPU-Sim simulator. Figure 2 shows the observed IPC from the simulator and the IPC values calculated by using Equation (3) with the simulated BW and MPKI values. For all 25 applications, the mean absolute relative error (MARE) is 10.3%. For many (21 out of 25) applications, the MARE is only 4.2%. However, for the other 4 applications (MM, HS, BP, and SAD), the MARE is higher (42.1%), because these applications make extensive use of the software-managed scratchpad memory. As a result, their performance is not only dependent on the achieved DRAM bandwidth, but is also driven by the additional scratchpad bandwidth, which is not captured in our model. For cases that involve usage of scratchpad memory, our model underestimates the actual IPC obtained by the application. We conducted a similar analysis on real GPU hardware².

² NVIDIA K20m, CUDA capability 3.5, Driver 6.0

[Figure 2: Application performance obtained via simulation and our model, for all 25 applications. IPC is normalized with respect to the maximum achievable IPC supported by our architecture.]

[Figure 3: Absolute relative error between IPCs obtained from real hardware (NVIDIA Kepler K20m) and our model.]

Figure 3 shows the absolute relative error for each application on real hardware. We omit the results for GUPS, MUM, and QTC, because we could not execute them faithfully on real hardware. For the remaining 22 applications, we observe the MARE to be 9.5%.

3.2 Application Characterization

Several previous works (e.g., [12, 27, 28, 30, 36]) have 1) characterized the memory intensity of applications primarily based on their last-level cache MPKI values, and 2) used MPKI as a proxy for performance. In this work, we show that considering only MPKI is not sufficient for either purpose in GPUs.

Table 1 summarizes MPKI and the ratio between attained DRAM bandwidth and peak bandwidth (BW/C) for our applications, where C is the peak memory bandwidth. We list them in descending order of MPKI. First, Table 1 shows that considering only MPKI values is not sufficient for estimating the memory intensity of a particular application. High MPKI levels may not necessarily lead to very high DRAM bandwidth utilization, as utilization is also a function of the inherent compute-to-bandwidth ratio of the application. For example, although QTC has the third highest MPKI among our applications, there are 15 other applications in our suite with higher DRAM bandwidth utilization than QTC.

Second, as we showed in Section 3.1, performance depends not only on MPKI but also on BW. For the applications shown in Table 1, the correlation between MPKI and IPC is only -44.4%. While MPKI is the number of misses that need to be served to commit 1000 instructions, it lacks information about the rate at which these misses are served by DRAM. In other words, L2-MPKI is a good measure of the bandwidth demand of an application, but performance is a function of both the demanded and the achieved bandwidth.



Table 1: Application characteristics: (A) MPKI: L2 cache misses per kilo-instruction. (B) BW/C: the ratio of attained bandwidth to the peak bandwidth of the system.

GPGPU Application | Abbr. | MPKI | BW/C (in %)
Random Access | GUPS | 57.1 | 93.0
MUMmerGPU [33] | MUM | 22.6 | 79.7
Quality Threshold Clustering [11] | QTC | 7.9 | 15.3
Breadth-First Search [33] | BFS2 | 5.3 | 11.6
Needleman-Wunsch [10] | NW | 5.1 | 16.7
Lulesh [25] | LUH | 4.7 | 35.6
Reduction [11] | RED | 2.8 | 68.8
Exclusive Parallel Prefix Sum [11] | SCAN | 2.7 | 45.0
Scalar Product [33] | SCP | 2.7 | 86.3
Fluid Dynamics [10] | CFD | 2.3 | 27.2
Fast Walsh Transform [33] | FWT | 2.2 | 45.1
BlackScholes | BLK | 1.6 | 67.6
Speckle Reducing Anisotropic Diffusion [10] | SRAD | 1.6 | 36.4
LIBOR Monte Carlo [33] | LIB | 1.1 | 25.4
JPEG Decoding | JPEG | 1.1 | 29.8
3D Stencil | 3DS | 1.0 | 42.3
Convolution Separable [33] | CONS | 1.0 | 43.5
2D Histogram [41] | HISTO | 0.6 | 18.8
Matrix Multiplication [41] | MM | 0.5 | 5.1
Backpropagation [10] | BP | 0.4 | 16.5
Hotspot [10] | HS | 0.4 | 6.1
Sum of Absolute Differences [41] | SAD | 0.1 | 5.6
Neural Networks [33] | NN | 0.1 | 0.3
Ray Tracing [33] | RAY | 0.1 | 1.8
Stream Triad [11] | TRD | 0.1 | 1.4

4. ANALYZING MEMORY SYSTEM INTERFERENCE

In this section, we build a foundation and draw key observations towards designing a more efficient, application-aware memory scheduler for GPUs. We start by presenting the nature of interference among concurrent applications in the GPU memory system, then discuss the inefficiencies of existing memory schedulers, and finally draw initial insights for designing a better memory scheduling technique.

4.1 The Problem: Application Interference

When multiple applications are co-scheduled on the same GPU hardware, they interfere at various levels of the memory hierarchy, such as the interconnect, the caches, and memory. As the memory system is the critical bottleneck for a large number of GPU applications, explicitly addressing contention in the memory system is essential. We find that an uncoordinated allocation of GPU resources, especially memory system resources, can lead to significant performance degradation both in terms of IT and WS. To demonstrate this, consider Figure 4, which shows the impact on WS and IT when BLK is co-scheduled with three different applications (GUPS, QTC, NN). The workloads formed are denoted by BLK_GUPS, BLK_QTC, and BLK_NN, respectively. Let us first consider the impact on WS in Figure 4a, where we also show the breakdown of WS in terms of the slowdowns (SD) experienced by individual applications. The slowdowns of the applications in the workload are denoted by SD-App-1 and SD-App-2 for the first and the second applications, respectively. Note that when the applications do not interfere with each other, both SD-App-1 and SD-App-2 are equal to 1, leading to a weighted speedup of 2. This figure demonstrates three different memory interference scenarios:

[Figure 4: Different performance slowdowns obtained when BLK is co-scheduled with three different applications: GUPS, QTC, and NN. The memory scheduling policy is FR-FCFS. (a) Effect on Weighted Speedup. (b) Effect on Instruction Throughput.]

(1) In BLK_GUPS, both applications slow each other down significantly; (2) in BLK_QTC, the slowdowns of BLK and QTC are very different, with the slowdown of QTC being much higher than that of BLK; and (3) in BLK_NN, the slowdown of both applications is negligible. In these different scenarios, the degradation in WS is also very different. We observe significant degradation in WS in the first (90%) and second (54%) cases, whereas it is negligible (2%) in the third case.

Figure 4b shows the IT degradation when BLK is co-scheduled with the same three applications versus the case where BLK and the other application in the workload are executed sequentially. We observe similar interference trends in BLK_GUPS and BLK_NN: in the former case, we observe significant degradation in IT, while the latter exhibits negligible degradation. Interestingly, the IT degradation in BLK_QTC is not as significant as its WS degradation. From this discussion, we conclude that concurrently executing applications are not only susceptible to significant levels of destructive interference in memory, but also that the degree to which they impact each other can depend strongly on the considered performance metric.

Analysis: As shown in Section 3, the performance of a GPU application is a function of both BW and MPKI. As we seek insights on the impact of each metric on performance, let us first discuss the memory bandwidth component. Since both BLK and GUPS have very high bandwidth demands (Table 1), when they are co-scheduled, the performance of both applications degrades significantly. We observe exactly this behaviour in Figure 4a, where neither application receives the bandwidth share needed to achieve its stand-alone performance level. Similarly, the available DRAM bandwidth is not sufficient in the BLK_QTC case, and BLK hurts QTC performance quite significantly by getting most of the available bandwidth. In the third case, BLK_NN, the bandwidth is sufficient for both applications, resulting in negligible performance slowdowns for both. This indicates that limited DRAM bandwidth is one of the important causes of application interference and of the differing application slowdowns.

However, considering only the BW of each application would not explain why BLK_QTC experiences only a slight degradation in IT. We also need to consider the second parameter that affects performance, MPKI, whose values are listed in Table 1. In our example, there is a significant difference in the MPKI of BLK and QTC (BLK MPKI < QTC MPKI). From (3), we know that an application with low BW and high MPKI will achieve lower performance than an application with high BW and low MPKI. Therefore, in BLK_QTC, BLK contributes much more towards IT than QTC. As BLK does not slow down significantly (Figure 4a), the total IT reduction is low. This discussion confirms that both BW and MPKI are key parameters for a better understanding of the performance characteristics and scheduling considerations for concurrently executing applications in GPUs.



[Figure 5: Different performance slowdowns experienced when different memory scheduling schemes (FR-FCFS, Prior_App1, Prior_App2, Round-Robin) are employed, for the workloads HISTO_TRD, BLK_QTC, and BLK_NN. (a) Effect on Weighted Speedup. (b) Effect on Instruction Throughput.]

4.2 Limitations of Existing Memory Schedulers

In this section, we discuss the limitations of three memory schedulers. We consider the baseline FR-FCFS [37, 38, 46] scheduler, which targets improving DRAM row hit rates and prioritizes row-hit requests over any other request. In addition, we explore two other schemes: a) a Prior App scheduler that statically prioritizes the requests of only one of the co-scheduled applications, and b) the recently proposed round-robin (RR) FR-FCFS GPU memory scheduler [22], which gives equal priority to all concurrently executing applications in the system without considering their properties. None of these schedulers sacrifices locality; instead of picking memory requests in FCFS order after servicing row-hit requests (as done in FR-FCFS), Prior App i always prefers memory requests from the ith application, and Round-Robin arbitrates between applications in round-robin order. Figure 5 shows the effect of these three memory schedulers on weighted speedup and instruction throughput for two of the workloads already shown in Figure 4a (BLK_QTC and BLK_NN), and an additional workload, HISTO_TRD.

Limitations of FR-FCFS: When multiple applications are co-scheduled, FR-FCFS still optimizes for row-hit locality and does not consider individual application properties while making scheduling decisions. Because of such application-unawareness, the FCFS nature of the scheduler allows a highly memory-demanding application to get a larger bandwidth share, as that application introduces more requests into the memory controller queue. Therefore, as shown in Figure 5, in BLK_QTC we observe that BLK gets a higher bandwidth share, causing large slowdowns in QTC. Moreover, Figure 5 demonstrates that the best performing scheduling strategy for improving either of the performance metrics is prioritizing one of the applications throughout the entire execution.

Limitations of Prior App i: We observe in HISTO_TRD that prioritizing TRD over HISTO provides the best IT and WS among all the considered scheduling strategies. However, in BLK_QTC, prioritizing one application over the other does not improve both performance metrics. In BLK_NN, since both applications attain their uncontested bandwidth demands, prioritizing one over the other does not impact performance. Even though prioritizing one application over another provides the best result, the challenge is to determine which application to prioritize. One way of doing so is to profile the workload and employ a static priority mechanism throughout the execution. However, such a strategy is often hard to realize. Another mechanism is to switch priorities between applications during run-time, which is similar to RR FR-FCFS [22].

Limitations of RR FR-FCFS: As discussed above, in order to optimize WS and IT in BLK_QTC, we should prioritize different applications. Since RR switches priorities between these two applications, it achieves a balance between improving both metrics. However, in HISTO_TRD, although employing the RR mechanism leads to slightly better IT and WS over FR-FCFS, it is far from the case where we prioritize TRD.

Based on the above discussion, we make two key observations that guide us in developing an application-conscious scheduling strategy.

Observation 1: Prioritizing lower-MPKI applications improves IT. In workloads where we observe significant slowdowns in either of the applications, prioritizing the application with the lower MPKI improves IT. For example, in HISTO_TRD, TRD has lower MPKI than HISTO and therefore Prior App 2 yields better IT. Similarly, in BLK_QTC, BLK has lower MPKI than QTC, thus Prior App 1 provides better IT.

Observation 2: Prioritizing lower-BW applications improves WS. In workloads where we observe significant slowdowns in either of the applications, prioritizing the application with the lower BW improves WS. For example, in HISTO_TRD, TRD has lower BW than HISTO and therefore Prior App 2 yields better WS. Similarly, in BLK_QTC, QTC has lower BW than BLK, thus Prior App 2 provides better WS.

5. A PERFORMANCE MODEL FOR CONCURRENTLY EXECUTING APPLICATIONS

In order to provide a theoretical background for the observations we made for optimizing IT and WS in Section 4.2, we extend the model discussed in Section 3 to two applications, and analyze the effect of concurrent execution on the performance metrics when memory bandwidth is the system bottleneck. We also show how this model guides us in developing memory scheduling algorithms targeted at optimizing each metric separately, and explain our findings based on our model through examples.

The notation we use in our model is given below:
$P_i^{alone}$ = performance of application i when it runs alone
$BW_i^{alone}$ = bandwidth attained by application i when it runs alone
$P_i$ = performance of application i
$BW_i$ = bandwidth attained by application i
C = peak bandwidth of the system
$\varepsilon$ = infinitesimal bandwidth given to an application
$MPKI_i$ = MPKI of application i



[Figure 6: An illustrative example showing IT and WS for two applications running together under three scheduling strategies (Prior App 1, Prior App 2, Round-robin). The shaded boxes represent system and application properties: the peak memory bandwidth is C = 50 units; Applications 1 and 2 use 30 and 40 units of bandwidth, respectively, when they execute alone, and their MPKIs are 20 and 5, respectively.]

When applications run alone, they cannot attain more bandwidth than is available in the system; thus $BW_i^{alone} \le C\ \forall i$. Similarly, for concurrent execution, the cumulative bandwidth consumption cannot exceed the peak bandwidth; thus $\sum_{i=1}^{N} BW_i \le C$. Also, we have $P_i, BW_i, MPKI_i \ge 0\ \forall i$.

Let us analyze the case where two applications are executed concurrently³. Assuming that performance is limited by the memory bandwidth, an application cannot consume more bandwidth without getting a larger share of the other application's bandwidth. Thus, in a two-application scenario, $BW_1 + BW_2 = C$, assuming no bandwidth is wasted.

We assume that at time $t = t_0$ we have $P_i \propto \frac{BW_i}{MPKI_i}\ \forall i$, and that at $t = t_1$ (where $t_1 > t_0$) we give an additional $\varepsilon$ bandwidth to the first application by taking it from the other. Thus, at $t = t_1$, we have $P'_1 \propto \frac{BW_1 + \varepsilon}{MPKI_1}$ and $P'_2 \propto \frac{BW_2 - \varepsilon}{MPKI_2}$, where $P'_i$ is the performance of application i at $t = t_1$.

³ Our model can be easily extended to analyze more than two applications. We show the scalability of our model for three applications in Section 8.4.

5.1 Analyzing Instruction Throughput

In order to have higher IT at $t = t_1$ than at $t = t_0$,

$$P'_1 + P'_2 > P_1 + P_2 \quad (4)$$

$$\frac{BW_1 + \varepsilon}{MPKI_1} + \frac{BW_2 - \varepsilon}{MPKI_2} > \frac{BW_1}{MPKI_1} + \frac{BW_2}{MPKI_2} \quad (5)$$

Simplifying (5) yields

$$\varepsilon\,(MPKI_2 - MPKI_1) > 0 \quad (6)$$

$$\Longrightarrow\ MPKI_2 > MPKI_1, \text{ if } \varepsilon > 0 \quad (7)$$

$$MPKI_1 > MPKI_2, \text{ if } \varepsilon < 0 \quad (8)$$

From (7) and (8), we find that we should prioritize the application with the lower MPKI in order to optimize IT.

Illustrative Example: Figure 6 shows an example of the optimal scheduling strategy for maximizing IT. In this example, we consider three different scheduling strategies: 1) prioritizing the first application, 2) prioritizing the second application, and 3) the RR scheduler. When these applications run alone, based on their BW and MPKI values and on (3), the first and the second applications achieve performances of 1.5 (①) and 8 units (②), respectively. If these applications are co-scheduled on the GPU and we prioritize the first application throughout the execution, that application will get 30 units of bandwidth, since it demands 30 units when it runs alone. We assume that MPKI does not change significantly across executions, as we do not perform any cache-related optimization in this work, for simplicity. Because the first application's MPKI is 20, its performance will be 1.5 units (③). The remaining 20 units of bandwidth will be used by the second application, as the peak bandwidth is 50 units (C = 50). Because its MPKI is 5, its performance will be 4 units (④). Similarly, if we prioritize the second application, it will get 40 units of bandwidth, as it demands 40 units when running alone, and the first application will use the remaining 10 units. Thus, the performances of the first and the second applications will be 0.5 (⑤) and 8 units (⑥), respectively. Finally, if we employ the round-robin scheduler, and assuming both applications have similar row-buffer localities, they get equal shares of the available bandwidth (25 units each), leading to 1.25 (⑦) and 5 (⑧) units of performance for the first and the second applications, respectively. Based on these individual application performance values calculated using our model, IT (the sum of individual IPCs) for this workload is 5.5 (⑨), 8.5 (⑩), and 6.25 units (⑪) when we prioritize the first application, prioritize the second application, and employ the RR scheduler, respectively. These results are consistent with our model, which suggests that prioritizing the application with the lower MPKI provides the best IT. Intuitively, as per (3), the same BW provided to the application with the lower MPKI, which is the second application in this example, translates to higher IPC. Thus, prioritizing the application with the lower MPKI provides higher IT for the system.
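The arithmetic of this example can be reproduced directly from (3); the short program below is our own rendering of the Figure 6 numbers (C = 50, alone bandwidths of 30 and 40 units, MPKIs of 20 and 5) and prints the IT and WS values quoted in the text for the three strategies.

```cpp
#include <cstdio>

int main() {
    const double C = 50.0;                       // peak bandwidth (units)
    const double bw_alone[2] = {30.0, 40.0};     // alone bandwidth demands
    const double mpki[2] = {20.0, 5.0};          // L2-MPKI of each application
    const double p_alone[2] = {bw_alone[0] / mpki[0], bw_alone[1] / mpki[1]};  // 1.5 and 8

    // Bandwidth splits under the three strategies of Figure 6.
    struct Split { const char* name; double bw[2]; } splits[] = {
        {"Prioritize App 1", {bw_alone[0], C - bw_alone[0]}},   // 30 / 20
        {"Prioritize App 2", {C - bw_alone[1], bw_alone[1]}},   // 10 / 40
        {"Round-robin",      {C / 2.0, C / 2.0}},               // 25 / 25
    };

    for (const auto& s : splits) {
        double p1 = s.bw[0] / mpki[0];                  // per-application performance, Eq. (3)
        double p2 = s.bw[1] / mpki[1];
        double it = p1 + p2;                            // instruction throughput
        double ws = p1 / p_alone[0] + p2 / p_alone[1];  // weighted speedup
        std::printf("%-17s P1=%.2f P2=%.2f IT=%.2f WS=%.2f\n", s.name, p1, p2, it, ws);
    }
    return 0;  // Prints IT = 5.5, 8.5, 6.25 and WS = 1.5, 1.33, 1.45, as in Figure 6.
}
```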

We observe in Figure 5b that prioritizing the application with the lower MPKI results in higher IT for HISTO_TRD and BLK_QTC. Since the application interference in BLK_NN is not significant, it does not benefit from prioritization.

5.2 Analyzing Weighted Speedup

In order to have higher WS at $t = t_1$ than at $t = t_0$,

$$\frac{P'_1}{P_1^{alone}} + \frac{P'_2}{P_2^{alone}} > \frac{P_1}{P_1^{alone}} + \frac{P_2}{P_2^{alone}} \quad (9)$$

$$\frac{\frac{BW_1+\varepsilon}{MPKI_1}}{\frac{BW_1^{alone}}{MPKI_1}} + \frac{\frac{BW_2-\varepsilon}{MPKI_2}}{\frac{BW_2^{alone}}{MPKI_2}} > \frac{\frac{BW_1}{MPKI_1}}{\frac{BW_1^{alone}}{MPKI_1}} + \frac{\frac{BW_2}{MPKI_2}}{\frac{BW_2^{alone}}{MPKI_2}} \quad (10)$$

Simplifying (10) yields

$$\varepsilon\,(BW_2^{alone} - BW_1^{alone}) > 0 \quad (11)$$

$$\Longrightarrow\ BW_2^{alone} > BW_1^{alone}, \text{ if } \varepsilon > 0 \quad (12)$$

$$BW_1^{alone} > BW_2^{alone}, \text{ if } \varepsilon < 0 \quad (13)$$

From (12) and (13), we find that we should prioritize the application with the lower $BW^{alone}$ to optimize weighted speedup.

Illustrative Example: We continue the example shown in Figure 6 with the optimal scheduling strategy for maximizing WS. We obtain WS = 1.5 (⑫), 1.33 (⑬), and 1.45 units (⑭) if we prioritize the first application, prioritize the second application, and employ the RR scheduler, respectively.



As discussed above, prioritizing the second application provides the best IT. However, we also observe that prioritizing the application with the lowest $BW^{alone}$, which is the first application in this example, yields the best WS, which is consistent with our model. Intuitively, the same amount of bandwidth provided to the application with the lower $BW^{alone}$ (the first application in this example) translates to a lower performance degradation relative to its uncontested performance. Since WS is the sum of slowdowns, prioritizing the application with the lower $BW^{alone}$ provides higher WS for the system.

We observe in Figure 5a that prioritizing the application with the lower $BW^{alone}$ results in higher WS for HISTO_TRD and BLK_QTC. BLK_NN is not a bandwidth-limited workload and thus does not benefit from prioritization.

However, the problem with the approach that prioritizes the application with the lower $BW^{alone}$ is that it is difficult to obtain $BW^{alone}$ of an application without offline profiling, as pointed out by Subramanian et al. [42]. They also propose an approximate method to obtain the alone performance of an application in a multi-application environment during run-time. However, doing so requires halting the execution of one of the applications to approximately calculate the alone performance of the other application, which might cause a drop in WS. Also, it does not completely eliminate application interference. Furthermore, this sampling has to be done frequently to capture execution phases with completely different behaviours. Thus, instead of approximating $P_i^{alone}$ or $BW_i^{alone}$, we slightly change our WS optimization condition, which leads to a solution that is much easier to implement and employ during run-time. Our approximation does not use the alone performance of an application; instead, it compares $P'_i$ with $P_i$. Mathematically, we have

$$\frac{P'_1}{P_1} + \frac{P'_2}{P_2} > \frac{P_1}{P_1} + \frac{P_2}{P_2} \quad (14)$$

$$\frac{BW_1 + \varepsilon}{MPKI_1} \cdot \frac{MPKI_1}{BW_1} + \frac{BW_2 - \varepsilon}{MPKI_2} \cdot \frac{MPKI_2}{BW_2} > 2 \quad (15)$$

Simplifying (15) yields

$$\varepsilon\,(BW_2 - BW_1) > 0 \quad (16)$$

$$\Longrightarrow\ BW_2 > BW_1, \text{ if } \varepsilon > 0 \quad (17)$$

$$BW_1 > BW_2, \text{ if } \varepsilon < 0 \quad (18)$$

From (17) and (18), we find that we can prioritize the application with the lower BW to improve relative weighted speedup.

We will later demonstrate in Section 8 that this approximation leads to better weighted speedup. It is also consistent with the observation made by Kim et al. [27] that preferring the application with the least attained bandwidth can improve weighted speedup.

Discussion: We showed in Figure 6 that optimizing both IT and WS with the same memory scheduling strategy might not be possible. We also showed a similar scenario in Figure 5, where BLK_QTC prefers a different application to be prioritized in order to achieve the best IT or the best WS. The key reason behind this is the properties of the applications that form the workload. In workloads where the same application has both the lower MPKI and the lower $BW^{alone}$, that application can be prioritized to optimize both IT and WS. However, in workloads where one application has the lower MPKI but the other demands less bandwidth, the optimal scheduling strategy differs between our two performance metrics. In such scenarios, RR achieves a good balance between IT and WS, because it gives equal priority to both applications (assuming both applications have similar row-buffer localities). However, in scenarios where prioritizing only one application is optimal for both IT and WS, RR would be far from optimal in terms of performance.

6. MECHANISM AND IMPLEMENTATION DETAILS

We provided two key observations in Section 4.2 and presented a theoretical background for them in Section 5. Based on these observations, we propose two memory scheduling schemes: a) the Instruction Throughput Targeted Scheme (ITS), and b) the Weighted Speedup Targeted Scheme (WEIS).

1) Instruction Throughput Targeted Scheme (ITS): This scheme aims to improve IT based on our observation that the application with the lower MPKI should be prioritized. In order to determine that application, we periodically (every 1024 cycles⁴) calculate two metrics during run-time: 1) the number of L2 misses for each application, locally at each memory partition, and 2) the number of instructions committed by each application. The information regarding committed instructions is propagated to the MCs. We calculate the MPKI of all applications locally at each MC using an arithmetic unit. Then, using a comparator, we determine the application with the lowest MPKI and prioritize the memory requests generated by that application in the MC.

2) Weighted Speedup Targeted Scheme (WEIS): This scheme aims to improve WS based on our observation that the application with the lower BW should be prioritized. In order to determine that application, we periodically calculate the amount of data transferred over the DRAM bus for each application, locally at each memory partition. Then, using a comparator, we determine the application with the lowest BW and prioritize the memory requests generated by that application in the MC.

Implementation of the Priority Mechanism: Our priority mechanism takes advantage of the existing FR-FCFS memory scheduler implementation. However, after serving the requests that generate DRAM row-buffer hits, instead of picking requests in FCFS order, we pick the oldest request from the highest-priority application. When there is no request from the highest-priority application in the MC queue, we pick the oldest request originating from the application with the next highest priority.

⁴ We also evaluated other sampling window sizes (256, 1024, and 2048 cycles). The difference in overall average performance is less than 1%, implying that the sampling window size does not have a significant impact on our design.
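A simplified sketch of the per-MC decision logic is given below. It is our own illustration of the description above (counter and function names are ours): each sampling epoch, the MC recomputes per-application MPKI (for ITS) or bytes transferred over the DRAM bus (for WEIS), marks the application with the smallest value as highest priority, and then serves row hits first followed by the oldest request of the highest-priority application.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Per-application counters maintained locally at one memory controller.
// Field names are illustrative; committed-instruction counts are assumed to be
// propagated to the MCs by the cores once per sampling epoch.
struct AppCounters {
    uint64_t l2_misses = 0;          // L2 misses seen at this memory partition
    uint64_t instructions = 0;       // instructions committed by this application
    uint64_t bytes_transferred = 0;  // data moved over the DRAM bus for this application
};

enum class Scheme { ITS, WEIS };

// Returns the id of the application to prioritize for the next epoch.
// ITS  : lowest MPKI (misses per kilo-instruction)
// WEIS : lowest attained bandwidth (bytes transferred this epoch)
int PickPriorityApp(const std::vector<AppCounters>& apps, Scheme scheme) {
    auto key = [&](const AppCounters& a) {
        if (scheme == Scheme::ITS)
            return 1000.0 * a.l2_misses / std::max<uint64_t>(a.instructions, 1);
        return static_cast<double>(a.bytes_transferred);
    };
    int best = 0;
    for (int i = 1; i < static_cast<int>(apps.size()); ++i)
        if (key(apps[i]) < key(apps[best])) best = i;
    return best;
}

// Request selection then follows the modified FR-FCFS order: after row-hit
// requests are drained, pick the oldest request of the highest-priority
// application; if it has none queued, fall back to the next application in
// priority order.
```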

7. INFRASTRUCTURE AND EVALUATION METHODOLOGY

Infrastructure: Most prior GPU research (e.g., [7, 9, 21, 23, 24, 26]) is focused on improving the performance of a single GPU application and is evaluated using benchmarks originating from different application suites (Section 3.2). However, to investigate the research issues in the context of multiple applications, these applications need to be executed concurrently on the same GPU platform. This is a non-trivial task, because it involves building a framework that can launch existing CUDA applications in parallel without significant changes to the source code.



Table 2: Key configuration parameters of the simulated GPU configuration. See GPGPU-Sim v3.2.1 [16] for the full list.

Core Features: 1400 MHz core clock, 30 cores (streaming multiprocessors), SIMT width = 32 (16 × 2), greedy-then-oldest (GTO) dual warp scheduler [34]
Resources / Core [16, 39, 40]: 16KB shared memory, 16KB register file, max. 1536 threads (48 warps, 32 threads/warp)
Private Caches / Core [16, 39, 40]: 16KB 4-way L1 data cache, 12KB 24-way texture cache, 8KB 2-way constant cache, 2KB 4-way I-cache, 128B cache block size
Shared L2 Cache: 16-way, 128KB/memory channel (768KB in total), 128B cache block size
Features: Memory coalescing and inter-warp merging enabled; immediate post-dominator based branch divergence handling
Memory Model: 6 GDDR5 memory controllers (MCs), FR-FCFS scheduling (256 max. requests/MC), 8 DRAM banks/MC, 4 bank-groups/MC, 924 MHz memory clock, 88.8 GB/sec peak memory bandwidth; global linear address space interleaved among partitions in chunks of 256 bytes [15]; reads and writes handled with equal priority [8, 16]; Hynix GDDR5 timing [19]: tCL = 12, tRP = 12, tRC = 40, tRAS = 28, tCCD = 2, tRCD = 12, tRRD = 6, tCDLR = 5, tWR = 12
Interconnect [16]: 1 crossbar/direction (30 cores, 6 MCs), 1400 MHz interconnect clock, islip VC and switch allocators

In order to do so, we develop a new framework, called the GPU Concurrent Application framework (GCA). This framework takes advantage of CUDA streams. A stream is a series of memory operations and kernel launches that must execute sequentially; different streams, however, can execute in parallel. Our GCA framework creates a separate stream for each application and issues all of its associated commands to that stream. As many legacy CUDA codes use synchronous memory transfer operations (e.g., cudaMemcpy()), our framework modifies the source code to change them to asynchronous CUDA API calls (e.g., cudaMemcpyAsync()). To ensure correct execution of multiple streams, our framework also adds appropriate synchronization constructs (e.g., cudaStreamSynchronize()) to the source code at the correct places. After these steps are performed, GCA is ready to execute multiple streams/applications either on real GPU hardware or on the simulator. In this paper, we use GCA to concurrently execute workloads on GPGPU-Sim, which is already capable of concurrently running multiple CUDA streams. As part of the initial package, our suite contains 300 ($\binom{25}{2}$) 2-application workloads and 2300 ($\binom{25}{3}$) 3-application workloads.
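The transformation GCA applies can be pictured with the host-side CUDA fragment below; it is a minimal sketch of our own (kernel names, buffer sizes, and launch configurations are placeholders), showing one stream per application, cudaMemcpyAsync() in place of synchronous copies, and cudaStreamSynchronize() before results are consumed.

```cpp
#include <cuda_runtime.h>

// Placeholder kernels standing in for two co-scheduled applications.
__global__ void app1_kernel(float* d) { /* ... */ }
__global__ void app2_kernel(float* d) { /* ... */ }

int main() {
    const size_t bytes = 1 << 20;
    float *h1, *h2, *d1, *d2;
    cudaMallocHost((void**)&h1, bytes);  cudaMallocHost((void**)&h2, bytes);  // pinned host buffers,
    cudaMalloc((void**)&d1, bytes);      cudaMalloc((void**)&d2, bytes);      // needed for async copies

    cudaStream_t s1, s2;                 // one stream per application
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Each application's copies and launches are issued to its own stream,
    // so the two command sequences can proceed concurrently.
    cudaMemcpyAsync(d1, h1, bytes, cudaMemcpyHostToDevice, s1);
    app1_kernel<<<256, 256, 0, s1>>>(d1);
    cudaMemcpyAsync(h1, d1, bytes, cudaMemcpyDeviceToHost, s1);

    cudaMemcpyAsync(d2, h2, bytes, cudaMemcpyHostToDevice, s2);
    app2_kernel<<<256, 256, 0, s2>>>(d2);
    cudaMemcpyAsync(h2, d2, bytes, cudaMemcpyDeviceToHost, s2);

    // Synchronize each stream before its results are used on the host.
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);  cudaStreamDestroy(s2);
    cudaFree(d1);  cudaFree(d2);
    cudaFreeHost(h1);  cudaFreeHost(h2);
    return 0;
}
```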

The initial GCA package also includes guidelines for adding new CUDA code to the suite. We will open-source the GCA package for public use to foster future research.

Evaluation Methodology: We simulate both two- and three-application workloads on GPGPU-Sim [8], a cycle-accurate GPU simulator. Table 2 provides the details of the simulation configuration. The GCA framework launches CUDA applications on GPGPU-Sim and executes until the point where all applications have completed at least once. To achieve this, the GCA framework relaunches the faster-running application(s) until the slowest application completes its execution. We collect the statistics of each individual application when it finishes its respective execution, so that the amount of work done by individual applications across different runs is consistent. This methodology is consistent with prior work [35]. We only simulate application kernels and do not model the performance of data transfers between the CPU and the GPU. Our DRAM contention model is the one that comes with the default GPGPU-Sim distribution, which is validated across many workloads [8].

[Figure 7: The effect of FR-FCFS, RR, and ITS on BW1 − BW2 and MPKI1 − MPKI2 for four representative workloads (NW_SCP, BLK_MUM, QTC_SCP, BLK_LUH). x-axis: L2-MPKI difference between the applications; y-axis: bandwidth-utilization (BW/C) difference between the applications. Regions A, B, and C mark BLK_MUM, BLK_LUH, and NW_SCP, respectively.]

Workload Classification: We evaluate 100 two-application workloads and classify them based on two criteria. The first classification is based on the MPKI difference between the applications in the workload: if this difference is greater than 10, the workload belongs to Class-A-MPKI; if it is less than 1, the workload belongs to Class-C-MPKI; otherwise it belongs to Class-B-MPKI. The second classification is based on the BW/C (bandwidth-utilization) difference between the applications in the workload: if this difference is greater than 50%, the workload belongs to Class-A-BW; if it is less than 25%, the workload belongs to Class-C-BW; otherwise it belongs to Class-B-BW. The intuition behind this classification method is that workloads with a high MPKI difference and a high BW difference are more likely to benefit from ITS and WEIS, respectively.
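Since the classification is purely threshold-based, it can be expressed as a small helper; the sketch below is our own rendering of the thresholds stated above.

```cpp
#include <cmath>

enum class WorkloadClass { A, B, C };

// MPKI-based class: A if |MPKI1 - MPKI2| > 10, C if < 1, otherwise B.
WorkloadClass ClassifyByMPKI(double mpki1, double mpki2) {
    double diff = std::fabs(mpki1 - mpki2);
    if (diff > 10.0) return WorkloadClass::A;
    if (diff < 1.0)  return WorkloadClass::C;
    return WorkloadClass::B;
}

// BW-based class: A if |BW1/C - BW2/C| > 50%, C if < 25%, otherwise B.
// bw_util1 and bw_util2 are the BW/C values from Table 1, expressed in percent.
WorkloadClass ClassifyByBW(double bw_util1, double bw_util2) {
    double diff = std::fabs(bw_util1 - bw_util2);
    if (diff > 50.0) return WorkloadClass::A;
    if (diff < 25.0) return WorkloadClass::C;
    return WorkloadClass::B;
}
```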

8. EXPERIMENTAL RESULTS

In this section, we evaluate six memory scheduling mechanisms: 1) the baseline FR-FCFS, 2) the recently proposed RR FR-FCFS [22], 3) ITS, 4) WEIS, 5) a static mechanism that always prioritizes the lowest-MPKI application in the workload, Prior App Low MPKI, and 6) a static mechanism that always prioritizes the application in the workload with the lowest $BW^{alone}$, Prior App Low BW. Note that the static mechanisms require offline profiling, which is difficult to employ during run-time; we use these static schemes as comparison points for ITS and WEIS, respectively. We use equal core partitioning across SMs for the applications in a workload, and show sensitivity to different core-partitioning schemes in Section 8.4. We use the geometric mean (GM) to report average performance results. For each proposed scheme, we first demonstrate how effective the algorithm is at modulating the bandwidth given to each application, and then we show the performance results.

8.1 Evaluation of ITS

How ITS Works: Figure 7 shows the effect of FR-FCFS, RR FR-FCFS, and ITS on four representative workloads. The x-axis shows MPKI1 − MPKI2, and the y-axis shows BW1 − BW2.



[Figure 8: IT results normalized with respect to FR-FCFS for 25 representative workloads, grouped into Class_A_MPKI, Class_B_MPKI, and Class_C_MPKI, with per-class and overall geometric means. Schemes shown: RR, WEIS, ITS, and Prior App (Low MPKI).]

In BLK_MUM, shown in region A, MUM has higher MPKI than BLK, and BLK attains higher BW than MUM, so all the points inside A are in the second quadrant. ITS prioritizes BLK due to its relatively lower MPKI; thus the difference between the BW attained by BLK and MUM increases with respect to FR-FCFS. We observe an interesting case with RR. Since BLK already attains higher bandwidth with FR-FCFS, we would expect MUM to find more opportunity to utilize DRAM with RR. However, the RR mechanism employed by Jog et al. [22] preserves row-locality while scheduling requests. Therefore, although this mechanism is expected to give an equal share of bandwidth to both applications, it provides more opportunity to the application with higher row-locality to schedule its requests. In other words, RR gives the applications an equal opportunity to activate rows, which lets the application with higher row-locality schedule more requests because more of its requests are served per active row-buffer. Exactly this phenomenon is observed in A with RR: BLK has 4× higher row-locality than MUM, so RR gives BLK an even higher opportunity to schedule its requests compared to FR-FCFS. ITS, however, unilaterally prefers BLK due to its consistently lower MPKI.

In BLK_LUH, shown in region B, we observe very similar trends to A. However, as opposed to A, RR reduces the gap between the BW achieved by LUH and BLK, since both applications have similar row-localities. In NW_SCP, shown in region C, NW has higher MPKI than SCP, and SCP attains higher bandwidth than NW, so all the points inside C are in the fourth quadrant. ITS prioritizes SCP due to its relatively lower MPKI; thus SCP achieves even more BW compared to FR-FCFS. We observe almost the same behavior with QTC_SCP as well. Note that the MPKI values of the applications in a workload do not change across schemes, as MPKI is an application property and each application has its own L2 cache partition.

ITS Performance: Figure 8 shows the instruction throughput of RR, WEIS, ITS, and Prior App Low MPKI normalized with respect to FR-FCFS, using 25 representative workloads that span the MPKI-based workload classes, chosen from our pool of 100 workloads. We also show the average IT of these workloads, including the individual GM for each workload class. As expected, in BLK_MUM, previously shown in A (Figure 7), IT improves by 15% with RR and by 30% with ITS. We observe that employing WEIS provides improvements over FR-FCFS, but results in slightly lower average performance than RR. This is expected, because WEIS is not targeted at optimizing IT. With ITS, we observe 34% and 8% average IT improvements over FR-FCFS and RR, respectively, across the 25 workloads.

Figure 9: Effect of FR-FCFS, RR, and WEIS on WS and BW1 − BW2. (x-axis: bandwidth utilization (BW/C) difference between applications; y-axis: performance improvement over FR-FCFS; workloads: BLK_MM, BLK_NW, NW_FWT, BLK_3DS, 3DS_RED.)

Class-B-MPKI and Class-C-MPKI workloads also gain moderate performance improvements, of 7% and 12% over FR-FCFS, respectively. As shown in Figure 7, MPKI does not change significantly across schemes; thus, the dynamism of ITS does not provide extra IT benefits over Prior App (Low MPKI).
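To make the arbitration described above concrete, the following is a minimal, illustrative sketch of an ITS-style scheduling decision (not the simulator implementation): in each quantum, the application with the lower L2 MPKI is preferred, and FR-FCFS is applied among its pending requests. The Request class, the pick_next_request helper, and the example values are hypothetical.

from dataclasses import dataclass

@dataclass
class Request:
    app_id: int    # application that issued this memory request
    row: int       # DRAM row targeted by the request
    arrival: int   # arrival order, used for FCFS tie-breaking

def pick_next_request(queue, l2_mpki, open_row):
    # ITS-style arbitration: prefer the application with the lowest
    # current L2 MPKI, then apply FR-FCFS among its pending requests
    # (row-buffer hits first, oldest request otherwise).
    if not queue:
        return None
    apps_present = {r.app_id for r in queue}
    preferred = min(apps_present, key=lambda a: l2_mpki[a])
    candidates = [r for r in queue if r.app_id == preferred]
    row_hits = [r for r in candidates if r.row == open_row]
    pool = row_hits if row_hits else candidates
    return min(pool, key=lambda r: r.arrival)

# Example: application 0 (low MPKI, e.g., BLK) is preferred over
# application 1 (high MPKI, e.g., MUM); among its requests, the
# row-buffer hit is scheduled first.
queue = [Request(1, row=7, arrival=0),
         Request(0, row=3, arrival=1),
         Request(0, row=7, arrival=2)]
print(pick_next_request(queue, l2_mpki={0: 5.0, 1: 40.0}, open_row=7))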

8.2 Evaluation of WEIS
How WEIS Works: Figure 9 shows the effect of FR-FCFS, RR, and WEIS on five representative workloads. The x-axis shows BW1 − BW2, and the y-axis shows the normalized WS improvement over FR-FCFS. WEIS attempts to reduce the difference between the BW attained by the applications; therefore, in the figure we expect WEIS to push the workloads towards the y-axis. Also, as it is expected to improve WS, it should push the workloads upwards. In NW_FWT, which is shown in D, the BW of FWT is higher than that of NW. WEIS prefers NW as it attains lower BW, which pushes this workload upwards and towards the y-axis. The RR mechanism degrades WS for NW_FWT for reasons similar to the row-locality effects pointed out earlier (FWT has 13× higher row-locality than NW).

In BLK_3DS (E) and BLK_MM (F), both RR and WEIS push the workload towards the y-axis while also improving WS. We observe that the trends under RR and WEIS are similar in 3DS_RED and NW_FWT, and also in BLK_NW and BLK_3DS.

WEIS Performance: Figure 10 shows the WS of RR, WEIS, ITS, and Prior App (Low BW) normalized with respect to FR-FCFS, using 25 representative workloads that span the BW-based workload classes, chosen from our pool of 100 workloads. We also show the average WS of these workloads, including the individual GM for each workload class. We observe that employing ITS provides improvements over FR-FCFS, but results in lower average performance than RR. This is expected, because ITS is not targeted at optimizing WS. With WEIS, we observe 10% and 5% average WS improvements over FR-FCFS and RR, respectively, across the 25 workloads. These numbers are 14% and 3% for Class-A-BW workloads, because workloads whose applications attain strikingly different BW are more likely to benefit from WEIS compared to FR-FCFS.


Figure 10: WS results normalized with respect to FR-FCFS for 25 representative workloads. (Bars: RR, ITS, WEIS, and Prior App (Low BW); y-axis: normalized weighted speedup; workloads grouped into Class_A_BW, Class_B_BW, and Class_C_BW, with per-class and overall geometric means.)

Figure 11: Summary IT and WS results for 100 workloads, normalized with respect to FR-FCFS. (a) Normalized IT (RR, WEIS, ITS, Prior App (Low MPKI)). (b) Normalized WS (RR, ITS, WEIS, Prior App (Low BW)).

Class-B-BW and Class-C-BW workloads also gain performance improvements, of 5% and 8% over FR-FCFS, respectively. In Class-C-BW workloads, WEIS performs much better than Prior App (Low BW). This is because the average BW difference is not significant, so it is unlikely that the same application consistently achieves lower bandwidth than the other. Therefore, unilaterally prioritizing one application, as Prior App (Low BW) does, may lead to sub-optimal performance.
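Analogously, here is a minimal sketch of the WEIS-style decision described above, assuming a per-application counter of bandwidth attained in the current quantum; pick_preferred_app and the example numbers are hypothetical, and the actual hardware bookkeeping may differ.

def pick_preferred_app(attained_bw):
    # WEIS-style decision: in each quantum, prefer the application that
    # has attained the least DRAM bandwidth so far, so that the gap
    # BW1 - BW2 between co-running applications shrinks over time.
    return min(attained_bw, key=attained_bw.get)

# Example inspired by NW_FWT: FWT has attained much more bandwidth in
# the current quantum, so NW's requests get priority next.
attained_bw = {"NW": 12.0, "FWT": 48.0}   # bandwidth attained so far (e.g., GB/s)
print(pick_preferred_app(attained_bw))    # -> NW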

8.3 Performance Summary
We evaluate ITS and WEIS for 100 workloads and observe in Figure 11 that our conclusions from the previous discussions hold for a wide range of workloads.

Figure 12: HS results for 100 workloads, normalized with respect to FR-FCFS. (Bars: RR and WEIS; y-axis: normalized harmonic speedup; workloads grouped into BW classes A, B, and C, plus overall.)

In Figure 12, we report Harmonic Speedup (HS) for the 100 workloads in order to gauge WEIS with a metric that balances performance and fairness [28]. Across all classes of workloads, we consistently observe better HS compared to FR-FCFS and RR. On average, RR and WEIS achieve 4% and 8% higher HS than FR-FCFS, respectively.
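For reference, the evaluation metrics can be computed from alone-run and co-run IPCs as in the short sketch below, assuming the standard definitions of IT, WS, and HS (cf. [28]); the numbers are illustrative only.

def instruction_throughput(ipc_shared):
    # IT: sum of per-application IPCs in the shared (co-scheduled) run.
    return sum(ipc_shared)

def weighted_speedup(ipc_alone, ipc_shared):
    # WS: sum of per-application normalized IPCs (shared / alone).
    return sum(s / a for a, s in zip(ipc_alone, ipc_shared))

def harmonic_speedup(ipc_alone, ipc_shared):
    # HS: harmonic mean of per-application speedups; balances
    # throughput and fairness.
    n = len(ipc_alone)
    return n / sum(a / s for a, s in zip(ipc_alone, ipc_shared))

# Illustrative two-application example (numbers are made up).
alone  = [2.0, 1.0]   # IPC of each application when run alone
shared = [1.5, 0.6]   # IPC of each application when co-scheduled
print(instruction_throughput(shared))              # 2.1
print(round(weighted_speedup(alone, shared), 2))   # 1.35
print(round(harmonic_speedup(alone, shared), 2))   # 0.67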

8.4 Scalability Analysis
Application Scalability: We evaluate ITS and WEIS in the scenario where three applications execute concurrently. We observe that the impact of our schemes is even higher, as memory interference increases significantly among three applications. Figure 14 shows the normalized IT and WS improvements with ITS and WEIS, respectively, for 10 workloads. We observe a significant IT improvement (27%) in GUPS_SCP_HISTO, as ITS prefers HISTO because its MPKI is significantly lower than that of the other applications in the workload. For WEIS, we observe trends similar to those discussed before.

Core Partitioning: We evaluate three core partitioning configurations: (10,20), (20,10), and the baseline (15,15).

Figure 14: Evaluation of ITS and WEIS with three GPU applications. (a) Evaluation of ITS (normalized instruction throughput). (b) Evaluation of WEIS (normalized weighted speedup).

Figure 13 shows the normalized IT improvements of ITS for JPEG_GUPS over FR-FCFS when ITS is used in each of these configurations. We observe that the improvements for JPEG_GUPS are significant in all three configurations. However, if fewer cores are assigned to GUPS, a highly memory-demanding application (the ITS (20,10) case), the negative interference effect on JPEG is reduced.

Figure 13: Core partitioning results. (Normalized IT for JPEG_GUPS under ITS with (10,20), (15,15), and (20,10) core partitions.)

Therefore, the relative IT improvement with ITS (20,10) is lower than with ITS (15,15). In the case of ITS (10,20), the alone IPC of JPEG is lower because JPEG is assigned fewer cores, which leaves less room for JPEG IPC improvement compared to the baseline ITS (15,15) case. These results indicate that although core partitioning mechanisms affect the magnitude of interference, the problem remains significant. We leave orchestrated memory and core resource partitioning schemes as future work.

9. RELATED WORK
To the best of our knowledge, this is the first work that analyzes the interactions of multiple applications in the GPU memory system via both experiments and a mathematical model.

GPU and SoC Memory Scheduling: Prior work on memory scheduling for GPUs has dealt with a single-application context only. Yuan et al. [45] proposed an arbitration mechanism in the NoC to restore lost row-buffer locality and thereby enable a simple in-order DRAM memory scheduler. Lakshminarayana et al. [29] explored a DRAM scheduling policy that essentially chooses between Shortest Job First (SJF) and FR-FCFS [38,46]. Chatterjee et al. [9] proposed a warp-aware memory scheduling mechanism that reduces DRAM latency divergence within a warp by avoiding interleaved servicing of memory requests from different warps. The benefits from the above schedulers are orthogonal to our schemes, and some of these mechanisms can be adapted as secondary arbitration criteria among requests of the currently prioritized application in our scheduler.


In the SoC space, Jeong et al. [20] proposed allowing the GPU to consume only the bandwidth required to maintain a certain real-time QoS level for graphics applications. Ausavarungnirun et al. [7] proposed a memory scheduling technique for CPU-GPU architectures. However, the overriding motivation of such prior work is to obtain the lowest possible latency for CPU requests without degrading the bandwidth utilization of the channel.

CPU Memory Schedulers: The impact of memory scheduling on multicore CPU systems has been a topic of significant interest in recent years [1,14,27,28,31,32]. Ebrahimi et al. [13] proposed parallel application memory scheduling, which explicitly manages inter-thread memory interference to improve performance. The Thread Cluster Memory Scheduler (TCM) [28] is particularly relevant because it not only advocated prioritizing latency-sensitive applications, but also identified that the main source of unfairness is the interference between different bandwidth-intensive applications. To improve performance and fairness, TCM ranks bandwidth-intensive threads based on their relative bank-level parallelism and row-buffer locality, and periodically shuffles the priority of the threads. TCM is an advancement over the ATLAS [27], PARBS [32], and STFM [31] mechanisms, which do not distinguish between bandwidth-intensive threads while improving performance and fairness.

We share the same objectives as CPU schedulers like TCM, but the motivations and considerations behind our proposed schedulers, as well as the implementation and derived insight, are significantly different. First, prior CPU memory schedulers concentrated only on single-threaded or modestly multi-threaded/multi-programmed workloads, while we demonstrate the benefits of our schemes for multiple massively-threaded applications running on a fundamentally different architecture (SIMT). Second, the analysis in Sec. 3 establishes a different set of metrics from TCM, viz. MPKI and attained bandwidth, to guide memory scheduling at the application level (as opposed to single threads as in TCM). This is partly due to the use of TLP in GPU programs to hide memory latency, as opposed to ILP and MLP in CPUs. Third, in contrast to earlier papers, we demonstrate that the same scheduler cannot achieve the best of both aggregate throughput and fairness. Consequently, a practical implementation can use runtime settings to choose between throughput-optimized and fairness-optimized scheduling based on the application domain (e.g., high aggregate throughput for HPC applications vs. QoS guarantees for virtualized GPUs). The final differentiator between TCM and our mechanisms is complexity. TCM needs to track each thread's MLP, bank-level parallelism (BLP), and row-buffer locality (RBL), and requires an expensive insertion-sort-like procedure to shuffle the ranks of high-MLP applications. In contrast, we only require the L2 MPKI and currently sustained bandwidth information for each application, and a few simple comparisons in each time quantum. Scaling TCM's policies to the many thousands of concurrent threads in a GPU would be challenging in a GDDR5 memory controller that has to support multi-gigabit command issue rates.

Concurrent execution of multiple applications on GPUs: Adriaens et al. [5] proposed spatial partitioning of SM resources across concurrent applications and presented a variety of heuristics for dividing SM resources across applications. Pai et al. [35] proposed elastic kernels that allow fine-grained control over their resource usage. None of these works addressed the problem of contention in the memory system. Moreover, we show that the memory interference problem remains significant regardless of SM-partitioning mechanisms, and we believe our proposed schemes are complementary to core resource partitioning techniques. Gregg et al. [17] presented KernelMerge, a runtime framework to understand and investigate concurrency issues for OpenCL applications. Wang et al. [44] proposed context funneling, which allows kernels from different programs to execute concurrently. Our work presents a new GCA framework that consists of a large number of CUDA workloads and also provides the flexibility to add new CUDA codes to the framework without much effort.

10. CONCLUSIONS
We present an in-depth analysis of the GPU memory system in a multiple-application domain. We show that co-scheduled applications can interfere significantly in the GPU memory system, leading to substantial loss in overall performance. To address this problem, we developed an analytical model indicating that L2-MPKI and attained bandwidth are the two knobs that can be used to drive memory scheduling decisions for achieving better performance.

11. REFERENCES
[1] 3rd JILP Workshop on Computer Architecture Competitions (Memory Scheduling Championship). http://www.cs.utah.edu/~rajeev/jwac12/.
[2] AMD Radeon R9 290X. http://www.amd.com/us/press-releases/Pages/amd-radeon-r9-290x-2013oct24.aspx.
[3] NVIDIA GRID. http://www.nvidia.com/object/grid-boards.html.
[4] NVIDIA GTX 780-Ti. http://www.nvidia.com/gtx-700-graphics-cards/gtx-780ti/.
[5] J. Adriaens, K. Compton, N. S. Kim, and M. Schulte. The Case for GPGPU Spatial Multitasking. In HPCA, 2012.
[6] S. R. Agrawal, V. Pistol, J. Pang, J. Tran, D. Tarjan, and A. R. Lebeck. Rhythm: Harnessing Data Parallel Hardware for Server Workloads. SIGARCH Comput. Archit. News.
[7] R. Ausavarungnirun, K. K.-W. Chang, L. Subramanian, G. H. Loh, and O. Mutlu. Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems. In ISCA, 2012.
[8] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In ISPASS, 2009.
[9] N. Chatterjee, M. O'Connor, G. H. Loh, N. Jayasena, and R. Balasubramonian. Managing DRAM Latency Divergence in Irregular GPGPU Applications. In SC, 2014.
[10] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In IISWC, 2009.
[11] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter. The Scalable Heterogeneous Computing (SHOC) Benchmark Suite. In GPGPU, 2010.
[12] R. Das, O. Mutlu, T. Moscibroda, and C. R. Das. Aergia: Exploiting Packet Latency Slack in On-chip Networks. ACM SIGARCH Computer Architecture News, 2010.
[13] E. Ebrahimi, R. Miftakhutdinov, C. Fallin, C. J. Lee, J. A. Joao, O. Mutlu, and Y. N. Patt. Parallel Application Memory Scheduling. In MICRO, 2011.
[14] S. Ghose, H. Lee, and J. F. Martínez. Improving Memory Scheduling via Processor-side Load Criticality Information. In ISCA, 2013.
[15] GPGPU-Sim v3.2.1. Address mapping.
[16] GPGPU-Sim v3.2.1. GTX 480 Configuration.
[17] C. Gregg, J. Dorn, K. Hazelwood, and K. Skadron. Fine-grained Resource Sharing for Concurrent GPGPU Kernels. In HotPar, 2012.
[18] Z. Guz, E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson, and U. C. Weiser. Many-core vs. Many-thread Machines: Stay Away from the Valley. IEEE Comput. Archit. Lett.
[19] Hynix. Hynix GDDR5 SGRAM Part H5GQ1H24AFR Revision 1.0.
[20] M. K. Jeong, M. Erez, C. Sudanthi, and N. Paver. A QoS-aware Memory Controller for Dynamically Balancing GPU and CPU Bandwidth Use in an MPSoC. In DAC, 2012.
[21] W. Jia, K. A. Shaw, and M. Martonosi. Characterizing and Improving the Use of Demand-fetched Caches in GPUs. In ICS, 2012.
[22] A. Jog, E. Bolotin, Z. Guz, M. Parker, S. W. Keckler, M. T. Kandemir, and C. R. Das. Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications. In GPGPU, 2014.
[23] A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das. Orchestrated Scheduling and Prefetching for GPGPUs. In ISCA, 2013.
[24] A. Jog, O. Kayiran, N. C. Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das. OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance. In ASPLOS, 2013.
[25] I. Karlin, A. Bhatele, J. Keasler, B. Chamberlain, J. Cohen, Z. DeVito, R. Haque, D. Laney, E. Luke, F. Wang, D. Richards, M. Schulz, and C. Still. Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application. In IPDPS, 2013.
[26] O. Kayiran, A. Jog, M. T. Kandemir, and C. R. Das. Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs. In PACT, 2013.
[27] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter. ATLAS: A Scalable and High-performance Scheduling Algorithm for Multiple Memory Controllers. In HPCA, 2010.
[28] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. In MICRO, 2010.
[29] N. B. Lakshminarayana, J. Lee, H. Kim, and J. Shin. DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function. Computer Architecture Letters, 2012.
[30] X. Lin and R. Balasubramonian. Refining the Utility Metric for Utility-based Cache Partitioning. In WDDD, 2011.
[31] O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In MICRO, 2007.
[32] O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems. In ISCA, 2008.
[33] NVIDIA. CUDA C/C++ SDK Code Samples, 2011.
[34] NVIDIA. Fermi: NVIDIA's Next Generation CUDA Compute Architecture, 2011.
[35] S. Pai, M. J. Thazhuthaveetil, and R. Govindarajan. Improving GPGPU Concurrency with Elastic Kernels. In ASPLOS, 2013.
[36] M. K. Qureshi and Y. N. Patt. Utility-based Cache Partitioning: A Low-overhead, High-performance, Runtime Mechanism to Partition Shared Caches. In MICRO, 2006.
[37] S. Rixner. Memory Controller Optimizations for Web Servers. In MICRO, 2004.
[38] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory Access Scheduling. In ISCA, 2000.
[39] T. G. Rogers, M. O'Connor, and T. M. Aamodt. Cache-Conscious Wavefront Scheduling. In MICRO, 2012.
[40] T. G. Rogers, M. O'Connor, and T. M. Aamodt. Divergence-Aware Warp Scheduling. In MICRO, 2013.
[41] J. A. Stratton, C. Rodrigues, I. J. Sung, N. Obeid, L. W. Chang, N. Anssari, G. D. Liu, and W. W. Hwu. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. Technical Report IMPACT-12-01, University of Illinois at Urbana-Champaign, March 2012.
[42] L. Subramanian, V. Seshadri, Y. Kim, B. Jaiyen, and O. Mutlu. MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems. In HPCA, 2013.
[43] K. Wang, X. Ding, R. Lee, S. Kato, and X. Zhang. GDM: Device Memory Management for GPGPU Computing. In SIGMETRICS, 2014.
[44] L. Wang, M. Huang, and T. El-Ghazawi. Exploiting Concurrent Kernel Execution on Graphic Processing Units. In HPCS, 2011.
[45] G. Yuan, A. Bakhoda, and T. Aamodt. Complexity Effective Memory Access Scheduling for Many-core Accelerator Architectures. In MICRO, 2009.
[46] W. K. Zuravleff and T. Robinson. Controller for a Synchronous DRAM that Maximizes Throughput by Allowing Memory Requests and Commands to be Issued Out of Order. U.S. Patent Number 5,630,096, Sept. 1997.