Application-aware Memory System for Fair and Efficient Execution of Concurrent
GPGPU Applications
Adwait Jog1, Evgeny Bolotin2, Zvika Guz2,a, Mike Parker2,b,
Steve Keckler2,3, Mahmut Kandemir1, Chita Das1
Penn State1, NVIDIA2, UT Austin3,
now at (Samsunga, Intelb)
GPGPU Workshop @ ASPLOS 2014
Era of Throughput Architectures
GPUs are scaling: number of CUDA cores, DRAM bandwidth
GTX 275 (Tesla): 240 cores, 127 GB/sec
GTX 480 (Fermi): 448 cores, 139 GB/sec
GTX 780 Ti (Kepler): 2880 cores, 336 GB/sec
Prior Approach (Looking Back)
Execute one kernel at a time
Works well if the kernel has enough parallelism
[Diagram: SM-1 through SM-X all running a single application, sharing cache, interconnect, and memory]
Current Trend
What happens when kernels do not have enough threads?
Execute multiple kernels (from the same application/context) concurrently
Current architectures (Fermi, Kepler) support this feature
Future Trend (Looking Forward)
[Diagram: SMs partitioned among applications: SM-1..SM-A run Application-1, SM-A+1..SM-B run Application-2, up to SM-X running Application-N; all share cache, interconnect, and memory]
We study execution of multiple kernels from
multiple applications (contexts)
Why Multiple Applications (Contexts)?
Improves overall GPU throughput
Improves portability of multiple old apps (with limited thread-scalability) on newer scaled GPUs
Supports consolidation of multiple-user requests onto the same GPU
We study two application scenarios:
1. One application runs alone on a 60-SM GPU (Alone_60)
[Diagram: SM-1 through SM-60 all running a single application, sharing cache, interconnect, and memory]
2. Co-scheduling two apps, assuming equal partitioning: 30 SM + 30 SM
[Diagram: SM-1..SM-30 run Application-1, SM-31..SM-60 run Application-2; both share cache, interconnect, and memory]
Metrics
Instruction Throughput (sum of IPCs): IPC(App1) + IPC(App2) + ... + IPC(AppN)
Weighted Speedup, with co-scheduling:
Speedup(App-N) = Co-scheduled IPC(App-N) / Alone IPC(App-N)
Weighted Speedup = sum of speedups of ALL apps
Best case: Weighted Speedup = N (number of apps)
With destructive interference: Weighted Speedup can be between 0 and N
Time-slicing (running alone): Weighted Speedup = 1 (baseline)
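The two metrics can be sketched directly; the IPC values below are hypothetical placeholders, not measurements from this work:

```python
# Sketch of the two metrics; the IPC values are made-up placeholders,
# not results from this work.

def instruction_throughput(ipcs):
    # Instruction throughput: sum of per-application IPCs.
    return sum(ipcs)

def weighted_speedup(coscheduled_ipcs, alone_ipcs):
    # Speedup(App-i) = co-scheduled IPC / alone IPC; weighted speedup sums them.
    return sum(c / a for c, a in zip(coscheduled_ipcs, alone_ipcs))

coscheduled = [300.0, 450.0]  # hypothetical co-scheduled IPCs for App1, App2
alone = [500.0, 500.0]        # hypothetical Alone_60 IPCs

print(instruction_throughput(coscheduled))   # 750.0
print(weighted_speedup(coscheduled, alone))  # ~1.5 (speedups 0.6 + 0.9)
```

With N co-scheduled apps the best case is N, and time-slicing yields 1, matching the baseline above.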
Outline
Introduction and motivation
Positives and negatives of co-scheduling multiple applications
Understanding inefficiencies in memory-subsystem
Proposed DRAM scheduler for better performance and fairness
Evaluation
Conclusions
Positives of co-scheduling multiple apps
Weighted Speedup = 1.4, when HIST is concurrently executed with DGEMM
40% improvement over running alone (time-slicing)
Gain in weighted speedup (application throughput)
[Charts: weighted speedup of HIST+DGEMM (about 1.4) vs. the baseline of 1; per-application speedups of HIST (Alone_60 vs. co-scheduled with DGEMM) and of DGEMM (Alone_60 vs. co-scheduled with HIST)]
Negatives of co-scheduling multiple apps (1)
(A) Fairness
[Chart: HIST speedup when co-scheduled with DGEMM vs. with GUPS]
Unequal performance degradation indicates unfairness in the system
Negatives of co-scheduling multiple apps (2)
(B) Weighted speedup (application throughput)
GAUSS+GUPS: only 2% improvement in weighted speedup over running alone
With destructive interference, weighted speedup can be between 0 and 2 (and can even fall below the baseline of 1)
Summary: Positives and Negatives
[Chart: weighted speedup (stacked 1st-app and 2nd-app portions) vs. the baseline of 1 for 14 two-app workloads: hist_gauss, hist_gups, hist_bfs, hist_3ds, hist_dgemm, gauss_gups, gauss_bfs, gauss_3ds, gauss_dgemm, gups_bfs, gups_3ds, gups_dgemm, bfs_3ds, bfs_dgemm]
Highlighted workloads exhibit unfairness (imbalance in red-green portions) and low throughput
Naïve coupling of 2 apps is probably not a good idea
Outline
Introduction and motivation
Positives and negatives of co-scheduling multiple applications
Understanding inefficiencies in memory-subsystem
Proposed DRAM scheduler for better performance and fairness
Evaluation
Conclusions
Primary Sources of Inefficiencies
Application interference at many levels: L2 caches, interconnect, DRAM (primary focus of this work)
[Diagram: SMs partitioned among Applications 1..N, sharing cache, interconnect, and memory]
Bandwidth Distribution
Bandwidth-intensive applications (e.g. GUPS) take the majority of memory bandwidth
[Chart: percentage of peak DRAM bandwidth split into 1st-app, 2nd-app, wasted, and idle portions, with alone_30 and alone_60 bars for comparison; shown for HIST, GAUSS, GUPS, BFS, 3DS, and DGEMM as the 1st app]
Red portion is the fraction of wasted DRAM cycles during which data is not transferred over the bus
Revisiting Fairness and Throughput
[Chart: weighted speedup per workload, stacked by 1st-app and 2nd-app portions, vs. the baseline of 1]
Imbalance in green and red portions indicates unfairness
Current Memory Scheduling Schemes
Agnostic to the different requirements of memory requests coming from different applications
Leads to unfairness and sub-optimal performance
Primarily focus on improving DRAM efficiency
Commonly Employed Memory Scheduling Schemes
App-1 and App-2 issue requests R1, R2, R3 to Row-1, Row-2, Row-3
Simple FCFS: serves requests in arrival order, with frequent row switches; low DRAM page hit rate
Out-of-order (FR-FCFS): serves row hits first, with fewer row switches; high DRAM page hit rate
Both schedulers are application agnostic! (App-2 suffers)
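A minimal single-bank sketch of the two application-agnostic policies; the request tuples and helper names are illustrative, not GPGPU-Sim code:

```python
# Simplified single-bank model. A request is an (app_id, row) tuple;
# this is an illustrative sketch, not the simulator's implementation.

def fcfs(queue):
    # Simple FCFS: serve strictly in arrival order.
    return list(queue)

def fr_fcfs(queue, open_row=None):
    # FR-FCFS: repeatedly prefer the oldest request to the open row
    # (a "row hit"); otherwise take the oldest request and switch rows.
    pending, order = list(queue), []
    while pending:
        hit = next((r for r in pending if r[1] == open_row), None)
        chosen = hit if hit is not None else pending[0]
        pending.remove(chosen)
        open_row = chosen[1]
        order.append(chosen)
    return order

def row_switches(order):
    # Count row activations (page misses) in a service order.
    switches, open_row = 0, None
    for _, row in order:
        if row != open_row:
            switches += 1
            open_row = row
    return switches

# App-1 alternates requests to Row-1 and Row-2; App-2 issues one to Row-3.
queue = [(1, 'R1'), (1, 'R2'), (1, 'R1'), (1, 'R2'), (1, 'R1'), (1, 'R2'), (2, 'R3')]
print(row_switches(fcfs(queue)))     # 7 row switches: low page hit rate
print(row_switches(fr_fcfs(queue)))  # 3 row switches: high page hit rate
```

Note that under both policies App-2's lone R3 is served last, behind all of App-1's requests, which is the unfairness the slide points out.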
Outline
Introduction and motivation
Positives and negatives of co-scheduling multiple applications
Understanding inefficiencies in memory-subsystem
Proposed DRAM scheduler for better performance and fairness
Evaluation
Conclusions
Proposed Application-Aware Scheduler
As an example of adding application-awareness: instead of FCFS, schedule requests in round-robin fashion across applications
Preserve the page hit rates
Proposal: FR-FCFS (baseline) to FR-(RR)-FCFS (proposed)
Improves fairness
Improves performance
Proposed Application-Aware FR-(RR)-FCFS Scheduler
App-1 and App-2 issue requests R1, R2, R3 to Row-1, Row-2, Row-3
[Timelines: baseline FR-FCFS serves all of App-1's row hits (the R1s, then the R2s) before App-2's R3; proposed FR-(RR)-FCFS still batches row hits but gives App-2's R3 a round-robin turn instead of delaying it to the end, with the same number of row switches]
App-2 is scheduled after App-1 in round-robin order
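A minimal single-bank sketch of the round-robin idea (illustrative Python, not GPGPU-Sim code): requests are (app_id, row) tuples, row hits are still drained first so page hit rates are preserved, and the application turn rotates on each row switch:

```python
# Illustrative sketch of FR-(RR)-FCFS on one bank: serve row hits first
# (preserving page hit rates); on a row switch, the next application in
# round-robin order gets to open its row. A request is (app_id, row).

def fr_rr_fcfs(queue, apps=(1, 2)):
    pending, order, open_row, turn = list(queue), [], None, 0
    while pending:
        hit = next((r for r in pending if r[1] == open_row), None)
        if hit is not None:
            chosen = hit  # row hit: keep the open page busy
        else:
            # Row switch: give the turn to the next app with pending requests.
            for i in range(len(apps)):
                app = apps[(turn + i) % len(apps)]
                mine = [r for r in pending if r[0] == app]
                if mine:
                    break
            chosen = mine[0]
            turn = (apps.index(app) + 1) % len(apps)
        pending.remove(chosen)
        open_row = chosen[1]
        order.append(chosen)
    return order

def row_switches(order):
    # Count row activations (page misses) in a service order.
    switches, open_row = 0, None
    for _, row in order:
        if row != open_row:
            switches += 1
            open_row = row
    return switches

queue = [(1, 'R1'), (1, 'R1'), (1, 'R1'), (1, 'R2'), (1, 'R2'), (1, 'R2'), (2, 'R3')]
order = fr_rr_fcfs(queue)
print([r[1] for r in order])   # ['R1', 'R1', 'R1', 'R3', 'R2', 'R2', 'R2']
print(order.index((2, 'R3')))  # 3: App-2 is served mid-stream rather than last
print(row_switches(order))     # 3 row activations, same as FR-FCFS on this queue
```

The design point of the slide: fairness improves because App-2's request no longer waits behind every App-1 row hit, while the page hit rate stays at the baseline level.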
DRAM Page Hit-Rates
[Chart: DRAM page hit rates (30% to 90%) per workload for FR-FCFS vs. FR-RR-FCFS]
Same page hit rates as baseline (FR-FCFS)
Outline
Introduction and motivation
Positives and negatives of co-scheduling multiple applications
Understanding inefficiencies in memory-subsystem
Proposed DRAM scheduler for better performance and fairness
Evaluation
Conclusions
Simulation Environment
GPGPU-Sim (v3.2.1)
Kernels from multiple applications are issued to different concurrent CUDA Streams
14 two-application workloads considered with varying memory demands
Baseline configuration similar to scaled-up version of GTX480
60 SMs, 32-SIMT lanes, 32-threads/warp
16KB L1 (4-way, 128B cache block) + 48KB SharedMem per SM
6 partitions/channels (Total Bandwidth: 177.6 GB/sec)
Improvement in Fairness
Fairness Index = max(r1, r2), where
r1 = Speedup(App1) / Speedup(App2)
r2 = Speedup(App2) / Speedup(App1)
[Chart: fairness index per workload for FR-FCFS vs. FR-RR-FCFS; lower is better]
On average 7% improvement (up to 49%) in fairness
Significantly reduces the negative impact of BW-sensitive applications (e.g. GUPS) on the overall fairness of the GPU system
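The fairness index defined above can be sketched directly; the speedup values below are hypothetical, not results from this work:

```python
# Fairness index from the slide: the larger of the two relative-speedup
# ratios. A perfectly fair co-schedule gives 1.0; larger means more unfair.

def fairness_index(speedup_app1, speedup_app2):
    r1 = speedup_app1 / speedup_app2
    r2 = speedup_app2 / speedup_app1
    return max(r1, r2)

# Hypothetical speedups (co-scheduled IPC / alone IPC) for two apps.
print(fairness_index(0.9, 0.9))  # 1.0: both apps degrade equally (fair)
print(fairness_index(0.9, 0.3))  # ~3.0: App-2 suffers far more (unfair)
```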
Improvement in Performance (Normalized to FR-FCFS)
[Charts: normalized instruction throughput and normalized weighted speedup per workload, normalized to FR-FCFS]
On average 10% improvement (up to 64%) in instruction throughput, and up to 7% improvement in weighted speedup
Significantly reduces the negative impact of BW-sensitive applications (e.g. GUPS) on the overall performance of the GPU system
Bandwidth Distribution with Proposed Scheduler
[Chart: percentage of peak bandwidth (1st app, 2nd app, wasted, idle) for HIST, GAUSS, and 3DS as the 1st app, comparing alone_30, alone_60, FR-FCFS with GUPS, and FR-RR-FCFS with GUPS]
Lighter applications get a better share of DRAM bandwidth
Conclusions
Naïve coupling of applications is probably not a good idea: co-scheduled applications interfere in the memory subsystem, leading to sub-optimal performance and fairness
Current DRAM schedulers are agnostic to applications: they treat all memory requests equally
An application-aware memory system is required for enhanced performance and superior fairness
Thank You!
Questions?