Application-aware Memory System for Fair and Efficient Execution of Concurrent
GPGPU Applications
Adwait Jog1, Evgeny Bolotin2, Zvika Guz2,a, Mike Parker2,b,
Steve Keckler2,3, Mahmut Kandemir1, Chita Das1
Penn State1, NVIDIA2, UT Austin3,
now at (Samsunga, Intelb)
GPGPU Workshop @ ASPLOS 2014
Era of Throughput Architectures
GPUs are scaling: number of CUDA cores, DRAM bandwidth
GTX 275 (Tesla): 240 cores, 127 GB/sec
GTX 480 (Fermi): 448 cores, 139 GB/sec
GTX 780 Ti (Kepler): 2880 cores, 336 GB/sec
Prior Approach (Looking Back)
Execute one kernel at a time
Works well if the kernel has enough parallelism
[Diagram: SM-1 through SM-X all running a single application, sharing cache, interconnect, and memory]
Current Trend
What happens when kernels do not have enough threads?
Execute multiple kernels (from the same application/context) concurrently
Current architectures (Fermi, Kepler) support this feature
Future Trend (Looking Forward)
[Diagram: SMs partitioned among applications: SM-1..SM-A run Application-1, SM-A+1..SM-B run Application-2, up to SM-X running Application-N; all share cache, interconnect, and memory]
We study execution of multiple kernels from
multiple applications (contexts)
Why Multiple Applications (Contexts)?
Improves overall GPU throughput
Improves portability of multiple old apps (with limited thread-scalability) on newer scaled GPUs
Supports consolidation of multiple-user requests onto the same GPU
We study two application scenarios:
1. One application runs alone on a 60-SM GPU (Alone_60)
[Diagram: SM-1 through SM-60 all running a single application, sharing cache, interconnect, and memory]
2. Co-scheduling two apps, assuming equal partitioning: 30 SM + 30 SM
[Diagram: SM-1..SM-30 run Application-1, SM-31..SM-60 run Application-2; both share cache, interconnect, and memory]
Metrics
Instruction Throughput (sum of IPCs): IPC(App1) + IPC(App2) + ... + IPC(AppN)
Weighted Speedup, with co-scheduling:
Speedup(App-N) = Co-scheduled IPC(App-N) / Alone IPC(App-N)
Weighted Speedup = sum of speedups of ALL apps
Best case: Weighted Speedup = N (number of apps)
With destructive interference: Weighted Speedup can be between 0 and N
Time-slicing (running alone): Weighted Speedup = 1 (baseline)
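The two metrics can be sketched directly; the IPC values below are hypothetical placeholders, not measurements from this work:

```python
# Sketch of the two metrics; the IPC values are made-up placeholders,
# not results from this work.

def instruction_throughput(ipcs):
    # Instruction throughput: sum of per-application IPCs.
    return sum(ipcs)

def weighted_speedup(coscheduled_ipcs, alone_ipcs):
    # Speedup(App-i) = co-scheduled IPC / alone IPC; weighted speedup sums them.
    return sum(c / a for c, a in zip(coscheduled_ipcs, alone_ipcs))

coscheduled = [300.0, 450.0]  # hypothetical co-scheduled IPCs for App1, App2
alone = [500.0, 500.0]        # hypothetical Alone_60 IPCs

print(instruction_throughput(coscheduled))   # 750.0
print(weighted_speedup(coscheduled, alone))  # ~1.5 (speedups 0.6 + 0.9)
```

With N co-scheduled apps the best case is N, and time-slicing yields 1, matching the baseline above.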
Outline
Introduction and motivation
Positives and negatives of co-scheduling multiple applications
Understanding inefficiencies in memory-subsystem
Proposed DRAM scheduler for better performance and fairness
Evaluation
Conclusions
Positives of co-scheduling multiple apps
Weighted Speedup = 1.4, when HIST is concurrently executed with DGEMM
40% improvement over running alone (time-slicing)
Gain in weighted speedup (application throughput)
[Charts: weighted speedup of HIST+DGEMM (about 1.4) vs. the baseline of 1; per-application speedups of HIST (Alone_60 vs. co-scheduled with DGEMM) and of DGEMM (Alone_60 vs. co-scheduled with HIST)]
Negatives of co-scheduling multiple apps (1)
(A) Fairness
[Chart: HIST speedup when co-scheduled with DGEMM vs. with GUPS]
Unequal performance degradation indicates unfairness in the system
Negatives of co-scheduling multiple apps (2)
(B) Weighted speedup (application throughput)
GAUSS+GUPS: only 2% improvement in weighted speedup over running alone
With destructive interference, weighted speedup can be between 0 and 2 (and can even fall below the baseline of 1)
Summary: Positives and Negatives
[Chart: weighted speedup (stacked 1st-app and 2nd-app portions) vs. the baseline of 1 for 14 two-app workloads: hist_gauss, hist_gups, hist_bfs, hist_3ds, hist_dgemm, gauss_gups, gauss_bfs, gauss_3ds, gauss_dgemm, gups_bfs, gups_3ds, gups_dgemm, bfs_3ds, bfs_dgemm]
Highlighted workloads exhibit unfairness (imbalance in red-green portions) and low throughput
Naïve coupling of 2 apps is probably not a good idea
Outline
Introduction and motivation
Positives and negatives of co-scheduling multiple applications
Understanding inefficiencies in memory-subsystem
Proposed DRAM scheduler for better performance and fairness
Evaluation
Conclusions
Primary Sources of Inefficiencies
Application interference at many levels: L2 caches, interconnect, DRAM (primary focus of this work)
[Diagram: SMs partitioned among Applications 1..N, sharing cache, interconnect, and memory]
Bandwidth Distribution
Bandwidth-intensive applications (e.g. GUPS) take the majority of memory bandwidth
[Chart: percentage of peak DRAM bandwidth split into 1st-app, 2nd-app, wasted, and idle portions, with alone_30 and alone_60 bars for comparison; shown for HIST, GAUSS, GUPS, BFS, 3DS, and DGEMM as the 1st app]
Red portion is the fraction of wasted DRAM cycles during which data is not transferred over the bus
Revisiting Fairness and Throughput
[Chart: weighted speedup per workload, stacked by 1st-app and 2nd-app portions, vs. the baseline of 1]
Imbalance in green and red portions indicates unfairness
Current Memory Scheduling Schemes
Agnostic to the different requirements of memory requests coming from different applications
Leads to unfairness and sub-optimal performance
Primarily focus on improving DRAM efficiency
Commonly Employed Memory Scheduling Schemes
App-1 and App-2 issue requests R1, R2, R3 to Row-1, Row-2, Row-3
Simple FCFS: serves requests in arrival order, with frequent row switches; low DRAM page hit rate
Out-of-order (FR-FCFS): serves row hits first, with fewer row switches; high DRAM page hit rate
Both schedulers are application agnostic! (App-2 suffers)
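A minimal single-bank sketch of the two application-agnostic policies; the request tuples and helper names are illustrative, not GPGPU-Sim code:

```python
# Simplified single-bank model. A request is an (app_id, row) tuple;
# this is an illustrative sketch, not the simulator's implementation.

def fcfs(queue):
    # Simple FCFS: serve strictly in arrival order.
    return list(queue)

def fr_fcfs(queue, open_row=None):
    # FR-FCFS: repeatedly prefer the oldest request to the open row
    # (a "row hit"); otherwise take the oldest request and switch rows.
    pending, order = list(queue), []
    while pending:
        hit = next((r for r in pending if r[1] == open_row), None)
        chosen = hit if hit is not None else pending[0]
        pending.remove(chosen)
        open_row = chosen[1]
        order.append(chosen)
    return order

def row_switches(order):
    # Count row activations (page misses) in a service order.
    switches, open_row = 0, None
    for _, row in order:
        if row != open_row:
            switches += 1
            open_row = row
    return switches

# App-1 alternates requests to Row-1 and Row-2; App-2 issues one to Row-3.
queue = [(1, 'R1'), (1, 'R2'), (1, 'R1'), (1, 'R2'), (1, 'R1'), (1, 'R2'), (2, 'R3')]
print(row_switches(fcfs(queue)))     # 7 row switches: low page hit rate
print(row_switches(fr_fcfs(queue)))  # 3 row switches: high page hit rate
```

Note that under both policies App-2's lone R3 is served last, behind all of App-1's requests, which is the unfairness the slide points out.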
Outline
Introduction and motivation
Positives and negatives of co-scheduling multiple applications
Understanding inefficiencies in memory-subsystem
Proposed DRAM scheduler for better performance and fairness
Evaluation
Conclusions
Proposed Application-Aware Scheduler
As an example of adding application-awareness: instead of FCFS, schedule requests in round-robin fashion across applications
Preserve the page hit rates
Proposal: FR-FCFS (baseline) to FR-(RR)-FCFS (proposed)
Improves fairness
Improves performance
Proposed Application-Aware FR-(RR)-FCFS Scheduler
App-1 and App-2 issue requests R1, R2, R3 to Row-1, Row-2, Row-3
[Timelines: baseline FR-FCFS serves all of App-1's row hits (the R1s, then the R2s) before App-2's R3; proposed FR-(RR)-FCFS still batches row hits but gives App-2's R3 a round-robin turn instead of delaying it to the end, with the same number of row switches]
App-2 is scheduled after App-1 in round-robin order
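A minimal single-bank sketch of the round-robin idea (illustrative Python, not GPGPU-Sim code): requests are (app_id, row) tuples, row hits are still drained first so page hit rates are preserved, and the application turn rotates on each row switch:

```python
# Illustrative sketch of FR-(RR)-FCFS on one bank: serve row hits first
# (preserving page hit rates); on a row switch, the next application in
# round-robin order gets to open its row. A request is (app_id, row).

def fr_rr_fcfs(queue, apps=(1, 2)):
    pending, order, open_row, turn = list(queue), [], None, 0
    while pending:
        hit = next((r for r in pending if r[1] == open_row), None)
        if hit is not None:
            chosen = hit  # row hit: keep the open page busy
        else:
            # Row switch: give the turn to the next app with pending requests.
            for i in range(len(apps)):
                app = apps[(turn + i) % len(apps)]
                mine = [r for r in pending if r[0] == app]
                if mine:
                    break
            chosen = mine[0]
            turn = (apps.index(app) + 1) % len(apps)
        pending.remove(chosen)
        open_row = chosen[1]
        order.append(chosen)
    return order

def row_switches(order):
    # Count row activations (page misses) in a service order.
    switches, open_row = 0, None
    for _, row in order:
        if row != open_row:
            switches += 1
            open_row = row
    return switches

queue = [(1, 'R1'), (1, 'R1'), (1, 'R1'), (1, 'R2'), (1, 'R2'), (1, 'R2'), (2, 'R3')]
order = fr_rr_fcfs(queue)
print([r[1] for r in order])   # ['R1', 'R1', 'R1', 'R3', 'R2', 'R2', 'R2']
print(order.index((2, 'R3')))  # 3: App-2 is served mid-stream rather than last
print(row_switches(order))     # 3 row activations, same as FR-FCFS on this queue
```

The design point of the slide: fairness improves because App-2's request no longer waits behind every App-1 row hit, while the page hit rate stays at the baseline level.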
DRAM Page Hit-Rates
[Chart: DRAM page hit rates (30% to 90%) per workload for FR-FCFS vs. FR-RR-FCFS]
Same page hit rates as baseline (FR-FCFS)
Outline
Introduction and motivation
Positives and negatives of co-scheduling multiple applications
Understanding inefficiencies in memory-subsystem
Proposed DRAM scheduler for better performance and fairness
Evaluation
Conclusions
Simulation Environment
GPGPU-Sim (v3.2.1)
Kernels from multiple applications are issued to different concurrent CUDA Streams
14 two-application workloads considered with varying memory demands
Baseline configuration similar to scaled-up version of GTX480
60 SMs, 32-SIMT lanes, 32-threads/warp
16KB L1 (4-way, 128B cache block) + 48KB SharedMem per SM
6 partitions/channels (Total Bandwidth: 177.6 GB/sec)
Improvement in Fairness
Fairness Index = max(r1, r2), where
r1 = Speedup(App1) / Speedup(App2)
r2 = Speedup(App2) / Speedup(App1)
[Chart: fairness index per workload for FR-FCFS vs. FR-RR-FCFS; lower is better]
On average 7% improvement (up to 49%) in fairness
Significantly reduces the negative impact of BW-sensitive applications (e.g. GUPS) on the overall fairness of the GPU system
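The fairness index defined above can be sketched directly; the speedup values below are hypothetical, not results from this work:

```python
# Fairness index from the slide: the larger of the two relative-speedup
# ratios. A perfectly fair co-schedule gives 1.0; larger means more unfair.

def fairness_index(speedup_app1, speedup_app2):
    r1 = speedup_app1 / speedup_app2
    r2 = speedup_app2 / speedup_app1
    return max(r1, r2)

# Hypothetical speedups (co-scheduled IPC / alone IPC) for two apps.
print(fairness_index(0.9, 0.9))  # 1.0: both apps degrade equally (fair)
print(fairness_index(0.9, 0.3))  # ~3.0: App-2 suffers far more (unfair)
```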
Improvement in Performance (Normalized to FR-FCFS)
[Charts: normalized instruction throughput and normalized weighted speedup per workload, normalized to FR-FCFS]
On average 10% improvement (up to 64%) in instruction throughput, and up to 7% improvement in weighted speedup
Significantly reduces the negative impact of BW-sensitive applications (e.g. GUPS) on the overall performance of the GPU system
Bandwidth Distribution with Proposed Scheduler
[Chart: percentage of peak bandwidth (1st app, 2nd app, wasted, idle) for HIST, GAUSS, and 3DS as the 1st app, comparing alone_30, alone_60, FR-FCFS with GUPS, and FR-RR-FCFS with GUPS]
Lighter applications get a better share of DRAM bandwidth
Conclusions
Naïve coupling of applications is probably not a good idea: co-scheduled applications interfere in the memory subsystem, leading to sub-optimal performance and fairness
Current DRAM schedulers are agnostic to applications: they treat all memory requests equally
An application-aware memory system is required for enhanced performance and superior fairness
Thank You!
Questions?