Post on 29-Dec-2015
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
Jaekyu Lee
Nagesh B. Lakshminarayana
Hyesoon Kim
Richard Vuduc
Introduction
Many-Thread Aware Prefetching Mechanisms (MICRO-43)
General-purpose GPUs (GPGPUs) are becoming popular
  High-performance capability (NVIDIA GeForce GTX 580: 1.5 TFLOPS)
  Many cores with large-scale multi-threading and SIMD units
CUDA programming model
  SIMT (Single Instruction, Multiple Threads)
  Hierarchy of thread groups: thread, thread block
[Figure: a core with a SIMD execution unit, shared memory, and a memory request buffer, connected to DRAM]
Memory Latency Problem
Tolerating memory latency is critical in CPUs
  Many techniques have been proposed: caches, prefetching, multi-threading, etc.
GPGPUs have employed multi-threading
Memory latency is critical in GPGPUs as well
  Limited thread-level parallelism
    Application behavior: algorithmically, a lack of parallelism
    Resource constraints: # registers per thread, # threads per block, shared memory usage per block
Multi-threading Example
Legend: C = computation, M = memory access, D = instruction dependent on memory
Example 1: Enough threads
  4 active threads (T0 to T3)
  When a thread issues a memory access, the core switches to the next thread
  The other threads' computation covers the memory latency: no stall
Example 2: Not enough threads
  2 active threads (T0, T1)
  After switching from T0 to T1, no other thread is ready, so the core stalls for the rest of the memory latency (stall cycles)
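The two examples can be approximated with a back-of-the-envelope model. The sketch below is not from the talk: it assumes round-robin scheduling, free thread switches, and that every thread computes for the same number of cycles before issuing one memory request. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Cycles the core sits idle while one thread waits on memory.
 * Simplified model (an assumption, not the simulator's behavior):
 * each thread computes for `compute_cycles`, then issues one request
 * that takes `mem_latency` cycles; switching threads costs nothing. */
int stall_cycles(int active_threads, int compute_cycles, int mem_latency)
{
    assert(active_threads >= 1 && compute_cycles >= 0 && mem_latency >= 0);
    /* Work available from the other threads while this one waits. */
    int covered = (active_threads - 1) * compute_cycles;
    return mem_latency > covered ? mem_latency - covered : 0;
}
```

With a 6-cycle latency and 2 cycles of computation per thread, `stall_cycles(4, 2, 6)` gives 0 (Example 1) and `stall_cycles(2, 2, 6)` gives 4 (Example 2).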
Prefetching in GPGPUs
Problem: when multi-threading is not enough, we need other mechanisms to hide memory latency
Other solutions
  Caching (NVIDIA Fermi)
  Prefetching (in this talk)
Many prefetchers have been proposed for CPUs
  Stride, stream, Markov, CDP, GHB, helper thread, etc.
Question: will the existing mechanisms work in GPGPUs?
Characteristic #1. Many Threads
Problem #1. Training of prefetcher
  Accesses from many threads are interleaved
  Thread ID indexing
    Reduced effective prefetcher size
    Scalability
[Figure: prefetching in a CPU (a prefetcher trained by 1 or 2 threads) vs. prefetching in a GPGPU (a prefetcher trained by many threads)]
Characteristic #2. Data Level Parallelism
Problem #2. Short thread lifetime
  Due to parallelization, each thread in a parallel program is shorter
  Removes prefetching opportunities
[Figure: in a sequential thread, a prefetch issued one memory latency ahead of the demand access is useful; a parallel thread is created and terminated within too short a lifetime, so there is no opportunity to prefetch]
Characteristic #3. SIMT
Problem #3. Single-Configuration Many-Threads (SCMT)
  Too many threads are controlled together
  Prefetch degree: # of prefetches per trigger
    Prefetch degree 1: < cache size
    Prefetch degree 2: >> cache size
Problem #4. Amplified negative effects
  One useless prefetch request per thread adds up to many useless prefetches
[Figure: with prefetch degree 1, the prefetches from all threads fit in the prefetch cache; with prefetch degree 2, they overflow it and cause capacity misses]
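A rough way to see the SCMT effect is to multiply the per-thread prefetch degree by the number of threads sharing the configuration. The sketch below is illustrative only: it assumes every prefetch targets a distinct cache line (no coalescing or merging), and the helper names, thread count, and line size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of prefetched data when every thread issues `degree` prefetches,
 * assuming each prefetch brings in its own cache line. */
size_t prefetch_footprint(size_t threads, size_t degree, size_t line_bytes)
{
    return threads * degree * line_bytes;
}

/* 1 if the footprint fits in the prefetch cache, 0 if it overflows. */
int fits_in_prefetch_cache(size_t threads, size_t degree,
                           size_t line_bytes, size_t cache_bytes)
{
    assert(line_bytes > 0 && cache_bytes > 0);
    return prefetch_footprint(threads, degree, line_bytes) <= cache_bytes;
}
```

For example, with 192 threads and 64-byte lines (made-up numbers), degree 1 needs 12,288 bytes and fits in a 16 KB prefetch cache, while degree 2 needs 24,576 bytes and does not. Changing one shared setting by a single step moves every thread at once.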
Goal
Design hardware/software prefetching mechanisms for GPGPU applications
  Step 1. Prefetcher for many-thread architectures: Many-Thread Aware Prefetching Mechanisms
    (Scalability, short thread lifetime)
    * H/W prefetcher: in this talk; S/W prefetcher: in the paper
  Step 2. Feedback mechanism to reduce negative effects: Prefetch Throttling
    (SCMT, amplified negative effects)
Many-Thread Aware Hardware Prefetcher
  (Conventional) Stride prefetcher
  Promotion table for the stride prefetcher (Scalability)
  Inter-Thread prefetcher (Short thread lifetime)
  Decision logic
[Figure: block diagram. The Promotion Table is looked up with (PC, ADDR); the IP prefetcher and the Stride prefetcher are looked up with (PC, ADDR, TID); the Stride prefetcher feeds the Promotion Table through stride promotion; the Decision Logic selects the prefetch address]
Solving Scalability Problem
Problem #1. Training of prefetcher (Scalability)
Stride Promotion
  Threads have similar (or even the same) access patterns (SIMT)
  Without promotion, the table is occupied by redundant entries
  With promotion, we can manage storage effectively
  Reduces training time by using earlier threads' information

Conventional Stride Table (redundant entries)
PC    TID  STRIDE
0x1a  1    65536
0x1a  3    65536
0x1a  10   65536
0x1a  7    65536
…     …    …

Promotion Table (after promotion)
PC    STRIDE
0x1a  65536
…     …
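The promotion step can be sketched as a scan over the per-thread stride table. This is a minimal illustration, not the hardware design from the talk: the struct layout follows the table columns above, while `try_promote` and its `threshold` parameter (how many threads must agree before promoting) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the conventional (per-thread) stride table. */
struct stride_entry {
    unsigned pc;      /* address of the load instruction */
    unsigned tid;     /* thread that trained this entry */
    long     stride;  /* stride learned by that thread */
};

/* Returns 1 and writes the shared stride to *promoted when at least
 * `threshold` entries for `pc` agree on one stride; returns 0 otherwise.
 * `threshold` is an assumed tuning knob, not a value from the talk. */
int try_promote(const struct stride_entry *table, size_t n,
                unsigned pc, size_t threshold, long *promoted)
{
    assert((table != NULL || n == 0) && promoted != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (table[i].pc != pc)
            continue;
        size_t agree = 0;  /* entries with the same PC and the same stride */
        for (size_t j = 0; j < n; ++j)
            if (table[j].pc == pc && table[j].stride == table[i].stride)
                ++agree;
        if (agree >= threshold) {
            *promoted = table[i].stride;
            return 1;
        }
    }
    return 0;
}
```

Once a (PC, stride) pair is promoted, the per-thread entries for that PC are redundant and their slots can be reused, and threads that arrive later can prefetch without retraining.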
Solving Short Thread Lifetime Problem
Problem #2. Short thread lifetime
  Highly parallelized code often reduces prefetching opportunities

Sequential code (loop: a prefetch can run one memory latency ahead of the demand access)
for (ii = 0; ii < 100; ++ii) {
    prefetch(A[ii+D]);
    prefetch(B[ii+D]);
    C[ii] = A[ii] + B[ii];
}

Parallel code (no loop, few instructions, no opportunity)
// there are 100 threads
__global__ void KernelFunction(…) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int varA = aa[tid];
    int varB = bb[tid];
    C[tid] = varA + varB;
}
Inter-Thread Prefetching
Instead, we can prefetch for other threads
  Inter-Thread Prefetching (IP)
  In CUDA, the memory index is a function of the thread id

// there are 100 threads
__global__ void KernelFunction(…) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int next_tid = tid + 32;
    prefetch(aa[next_tid]);
    prefetch(bb[next_tid]);
    int varA = aa[tid];
    int varB = bb[tid];
    C[tid] = varA + varB;
}
[Figure: threads T0 to T3 prefetch the data that T32 to T35 will access, and T32 to T35 prefetch for T64 to T66; the memory accesses in other threads are covered by these prefetches]
IP Pattern Detection in Hardware
Detecting strides across threads
Launch prefetch requests

IP Table (initially empty)
PC    Addr 1  TID 1  Addr 2  TID 2  Train  Delta
-     -       -      -       -      -      -

Req 1 (PC:0x1a Addr:400 TID:3)
PC    Addr 1  TID 1  Addr 2  TID 2  Train  Delta
0x1a  400     3      -       -      -      -

Req 2 (PC:0x1a Addr:1100 TID:10)
PC    Addr 1  TID 1  Addr 2  TID 2  Train  Delta
0x1a  400     3      1100    10     -      -

Req 3 (PC:0x1a Addr:200 TID:1)
PC    Addr 1  TID 1  Addr 2  TID 2  Train  Delta
0x1a  400     3      1100    10     √      100

Delta = Addr ∆ / TID ∆
  Delta (Req1-Req2) = (1100 - 400) / (10 - 3) = 100
  Delta (Req3-Req1) = (200 - 400) / (1 - 3) = 100
  Delta (Req3-Req2) = (200 - 1100) / (1 - 10) = 100
All three deltas are the same: we found a pattern

Req 4 (PC:0x1a Addr:2100 TID:1)
  Prefetch (addr + stride): Addr: 2100, Stride: 100
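The training sequence above can be expressed as a small state machine over one IP-table row. This is a sketch under assumptions, not the hardware described in the talk: the struct mirrors the table columns, `seen` is an added bookkeeping field, and `ip_train`, `ip_prefetch_addr`, and `per_tid_delta` are hypothetical names. The replacement policy and the handling of a failed match are deliberately simplistic.

```c
#include <assert.h>
#include <stddef.h>

/* One IP-table row, following the columns on the slide. */
struct ip_entry {
    unsigned pc;
    long addr1, addr2;
    int  tid1, tid2;
    int  seen;     /* requests recorded so far (0, 1 or 2); bookkeeping only */
    int  trained;  /* the "Train" bit */
    long delta;    /* address difference per unit of thread id */
};

/* Addr delta divided by TID delta; clears *ok when that is not an integer. */
static long per_tid_delta(long a, int ta, long b, int tb, int *ok)
{
    long dt = (long)tb - ta;
    if (dt == 0 || (b - a) % dt != 0) { *ok = 0; return 0; }
    return (b - a) / dt;
}

/* Feeds one demand request to the row; returns 1 once the row is trained. */
int ip_train(struct ip_entry *e, unsigned pc, long addr, int tid)
{
    assert(e != NULL);
    if (e->seen == 0 || e->pc != pc) {        /* allocate or replace the row */
        e->pc = pc; e->addr1 = addr; e->tid1 = tid;
        e->seen = 1; e->trained = 0; e->delta = 0;
        return 0;
    }
    if (e->seen == 1) {                       /* record the second request */
        e->addr2 = addr; e->tid2 = tid; e->seen = 2;
        return 0;
    }
    if (e->trained)
        return 1;
    int ok = 1;                               /* third request: compare deltas */
    long d12 = per_tid_delta(e->addr1, e->tid1, e->addr2, e->tid2, &ok);
    long d13 = per_tid_delta(e->addr1, e->tid1, addr, tid, &ok);
    long d23 = per_tid_delta(e->addr2, e->tid2, addr, tid, &ok);
    if (ok && d12 == d13 && d13 == d23) { e->trained = 1; e->delta = d12; }
    return e->trained;
}

/* Prefetch target for a later request, following "addr + stride" on the slide. */
long ip_prefetch_addr(const struct ip_entry *e, long addr)
{
    assert(e != NULL && e->trained);
    return addr + e->delta;
}
```

Feeding the three requests from the slide, (400, TID 3), (1100, TID 10), and (200, TID 1), trains the row with a delta of 100; a later request to address 2100 then yields a prefetch to 2200.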
MT-aware Hardware Prefetcher
Decision logic
  Promotion table > Stride prefetcher > IP prefetcher
  Stride behavior within a thread is more common
  Entries in the Promotion table have been trained for a longer time
[Figure: block diagram. Cycle 1: Promotion Table lookup with (PC, ADDR); cycle 2: IP prefetcher lookup with (PC, ADDR, TID); cycle 3: Stride prefetcher lookup with (PC, ADDR, TID); the Stride prefetcher feeds the Promotion Table through stride promotion; the Decision Logic outputs the prefetch address]

Promotion Table  IP Table     Stride Prefetcher  Action
(1st cycle)      (2nd cycle)  (3rd cycle)
HIT              HIT          Not accessed       Generate stride prefetch requests
HIT              MISS         Not accessed       Generate stride prefetch requests
MISS             HIT          Not accessed       Generate IP prefetch requests
MISS             MISS         Accessed           Generate stride prefetch requests, if hit; update Promotion Table, if necessary
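The action table can be read as a short chain of conditionals. The sketch below follows the table rows literally, so the IP table is consulted before the per-thread stride prefetcher; the enum, struct, and function names are hypothetical, and the Promotion Table update in the last row is left out.

```c
#include <assert.h>
#include <stddef.h>

enum pref_action { PREF_NONE, PREF_STRIDE, PREF_IP };

/* Lookup results for one request; stride_hit only matters when the
 * Promotion Table and the IP table both miss. */
struct lookup {
    int promotion_hit;
    int ip_hit;
    int stride_hit;
};

/* Picks which prefetcher generates requests, row by row as in the table. */
enum pref_action decide(const struct lookup *l)
{
    assert(l != NULL);
    if (l->promotion_hit)
        return PREF_STRIDE;   /* rows 1 and 2: promoted stride wins */
    if (l->ip_hit)
        return PREF_IP;       /* row 3 */
    return l->stride_hit ? PREF_STRIDE : PREF_NONE;   /* row 4 */
}
```

A request that misses in all three structures produces no prefetch in this sketch; it would only train the tables.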
Design GPGPU Prefetch Throttling
Need GPGPU-specific metrics to identify whether prefetching is effective
  Extension of feedback-directed prefetching for CPUs [Srinath07]
Useful prefetches: accurate and timely
Harmful prefetches: inaccurate or too early
Some late prefetches can be tolerated
  By multi-threading
  Less harmful
Throttling Metrics
Merged memory requests
  A new request with the same address as an existing entry
  Inside a core (in the MSHR)
  In CPUs, merges mean late prefetches
  In GPGPUs, merges indicate accuracy (due to massive multi-threading)
Early block eviction from the prefetch cache
  Due to capacity misses, regardless of accuracy
Periodic updates
  To cope with runtime behavior
Heuristic for Prefetch Throttling
Throttle degree
  Varies from 0 (prefetch all) to 5 (no prefetch)
  Default: 2

Early Eviction  Merge Ratio  Action         Note
High            High         NO prefetch    Too aggressive
Medium          -            LESS prefetch
Low             High         MORE prefetch
Low             Low          NO prefetch    Inaccurate *
High            Low          NO prefetch    Inaccurate

* Ideal case (accurate and perfect timing) will have low early eviction and low merge ratio.
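The heuristic table maps onto a periodic update of the throttle degree. The sketch below is an illustration under assumptions: the talk gives the degree range (0 to 5, default 2) and the actions, but not the thresholds that separate Low, Medium, and High or the step size, so the metrics are passed in already classified and LESS/MORE are taken to mean one step. A Medium merge ratio with Low early eviction is not covered by the table and is treated like Low here.

```c
#include <assert.h>

enum level { LOW, MEDIUM, HIGH };
enum { THROTTLE_MIN = 0,      /* prefetch all */
       THROTTLE_MAX = 5,      /* no prefetch */
       THROTTLE_DEFAULT = 2 };

/* Returns the throttle degree for the next period, given the current
 * degree and the two metrics measured over the last period. */
int next_throttle(int degree, enum level early_eviction, enum level merge_ratio)
{
    assert(degree >= THROTTLE_MIN && degree <= THROTTLE_MAX);
    if (early_eviction == HIGH)
        return THROTTLE_MAX;                              /* NO prefetch */
    if (early_eviction == MEDIUM)                         /* LESS prefetch */
        return degree < THROTTLE_MAX ? degree + 1 : THROTTLE_MAX;
    if (merge_ratio == HIGH)                              /* MORE prefetch */
        return degree > THROTTLE_MIN ? degree - 1 : THROTTLE_MIN;
    return THROTTLE_MAX;                                  /* Low/Low: NO prefetch */
}
```

Starting from `THROTTLE_DEFAULT`, a period with low early eviction and a high merge ratio moves the degree to 1 (more prefetching), while a period with medium early eviction moves it to 3 (less prefetching).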
Outline
Motivation
Step 1. Many-Thread Aware Prefetching
Step 2. Prefetch Throttling
Evaluation
Conclusion
Evaluation Methodology
MacSim simulator
  A cycle-accurate, in-house simulator
  A trace-driven simulator (traces from GPUOcelot [Diamos10])
Baseline
  14 cores (8-wide SIMD), frequency: 900 MHz
  16 banks / 8 channels, 1.2 GHz memory frequency, 900 MHz bus, FR-FCFS
  NVIDIA G80 architecture
14 memory-intensive benchmarks
  CUDA SDK, Merge, Rodinia, and Parboil
  Stride, MP (massively parallel), and uncoalesced types
  Non-memory-intensive benchmarks (in the paper)
Evaluation Methodology
Prefetch
  Stream, Stride, and GHB prefetchers evaluated
  16 KB cache per core (results for other sizes are in the paper)
  Prefetch distance: 1, degree: 1 (the optimal configuration)
Results
  Hardware prefetcher
  Software prefetcher (in the paper)
Results: MT Hardware Prefetching
GHB/Stride do not work in mp-type and uncoal-type benchmarks
IP (Inter-Thread Prefetching) can be effective
Stride Promotion improves the performance of a few benchmarks
[Figure: speedup (y-axis, 0.5 to 3) of GHB, Stride, Stride+Promotion, and Stride+IP for black, conv, mersenne, monte, pns, scalar, stream, backprop, cell, ocean, bfs, cfd, linear, sepia, and AVG, grouped into stride-type, mp-type, and uncoal-type. Annotation: 15% over Stride]
Results: MT-HWP with Throttling
GHB+F improves performance
MT-HWP+T eliminates the negative effect (stream)
* Feedback mechanism is more effective in software prefetching
[Figure: speedup (y-axis, 0.5 to 3) of GHB, GHB+F, StridePC, StridePC+T, MT-HWP, and MT-HWP+T for black, conv, mersenne, monte, pns, scalar, stream, backprop, cell, ocean, bfs, cfd, linear, sepia, and AVG, grouped into stride-type, mp-type, and uncoal-type. Annotation: 15% over Stride + Throttling]
Conclusion
Memory latency is an important problem in GPGPUs as well.
GPGPU prefetching has four problems: scalability, short thread lifetime, SCMT, and amplified negative effects
Goal: design hardware/software prefetchers
  Step 1. Many-thread aware prefetcher (promotion, IP)
  Step 2. Prefetch throttling
The MT-aware hardware prefetcher shows a 15% performance improvement over the stride prefetcher, and prefetch throttling removes all the negative effects.
Future work
  Study other many-thread architectures
  Other programming models, architectures with caches
THANK YOU!
NVIDIA Fermi Result
[Figure: speedup (y-axis, 0 to 2.5) of HP and HP+Throttle for black, conv, mersenne, monte, pns, scalar, stream, backprop, ocean, bfs, cfd, linear, and AVG, grouped into STRIDE, MP, and UNCOALESCED types]
Different Prefetch Cache Size
[Figure: speedup (y-axis, 0.9 to 1.5) of MT-HWP, MT-HWP+T, MT-SWP, and MT-SWP+T for prefetch cache sizes of 1K, 2K, 4K, 8K, 16K, 32K, 64K, and 128K]
Software MT Prefetcher Results
[Figure: speedup (y-axis, 0.0 to 4.0) of Register, Stride, MT-SWP, and MT-SWP+Throttle for black, conv, mersenne, monte, pns, scalar, stream, backprop, cell, ocean, bfs, cfd, linear, sepia, and AVG, grouped into stride-type, mp-type, and uncoal-type]
Hardware prefetcher without TID
[Figure: speedup (y-axis, 0.6 to 1.4) of Stride, StridePC, Stream, and GHB for black, conv, mersenne, monte, pns, scalar, stream, backprop, cell, ocean, bfs, cfd, linear, sepia, and AVG, grouped into stride-type, mp-type, and uncoal-type]
Hardware prefetcher with TID
[Figure: speedup (y-axis, 0.6 to 1.6) of Stride, StridePC, Stream, and GHB for black, conv, mersenne, monte, pns, scalar, stream, backprop, cell, ocean, bfs, cfd, linear, sepia, and AVG, grouped into stride-type, mp-type, and uncoal-type]
Benefit Because of Few Threads?

black  conv  mersenne  monte  pns  scalar  stream
12     12    8         16     8    16      16

backprop  cell  ocean  bfs  cfd  linear  sepia
16        16    16     16   6    16      24

Some benchmarks have enough threads, but they still cannot fully hide memory latency.
[Figure: speedup (y-axis, 0.5 to 3.5) of GHB, Stride, Stride+Promotion, and Stride+IP for black, conv, mersenne, monte, pns, scalar, stream, backprop, cell, ocean, bfs, cfd, linear, sepia, and AVG, grouped into stride-type, mp-type, and uncoal-type]
Inter-Thread Prefetching
IP may not be useful in some cases
Case 1. Demand requests have already been generated
  Threads are not executed in a strict sequential order (out-of-order execution among threads)
  Redundant prefetches: the requests will be merged in the memory system. Less harmful.
Case 2. Out-of-array-range effect
  The last thread in a block generates a request for another thread, which is mapped to a different core
  Unless an inter-core merge occurs in the DRAM controller, these are useless prefetches