CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS...
Transcript of CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS...
![Page 1: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/1.jpg)
Stephen Jones, GTC 2017
CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES
![Page 2: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/2.jpg)
2
The art of doing more with less
![Page 3: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/3.jpg)
3
RULE #1: DON’T TRY TOO HARDPerf
orm
ance
Time
Peak Performance
![Page 4: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/4.jpg)
4
RULE #1: DON’T TRY TOO HARDPerf
orm
ance
Time
Peak Performance
Unre
alist
ic E
ffort
/Rew
ard
![Page 5: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/5.jpg)
5
RULE #1: DON’T TRY TOO HARDPerf
orm
ance
Time
Peak Performance
![Page 6: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/6.jpg)
6
RULE #1: DON’T TRY TOO HARDPerf
orm
ance
Time
Peak Performance
Reduce this time
Don’t waste this time
Get onthis curve
![Page 7: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/7.jpg)
7
RULE #1: DON’T TRY TOO HARDPerf
orm
ance
Time
Peak Performance
Trough ofdespair
Point ofdiminishing
returns
Prematureexcitement
Wait, it’sgoing slower??
Hire anintern
Here be ninjas
Most peoplegive up here
4 weeks andthis is it?
![Page 8: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/8.jpg)
8
PERFORMANCE CONSTRAINTS
Memory75%
Occupancy10%
Instruction2%
Divergence3%
Compute Intensity10%
![Page 9: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/9.jpg)
9
PERFORMANCE CONSTRAINTS
CPU <> GPU Transfer
Coalescence
Cache Inefficiency
Register Spilling
Divergent Access
Occupancy
Instruction
Divergence
Compute Intensity
Chart Title
![Page 10: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/10.jpg)
10
MEMORY ORDERS OF MAGNITUDE
CPUDRAMGDRAML2 CacheL1$SM
150GB/sec
16GB/sec
300GB/sec
2,000GB/sec
20,000GB/sec
regs
shmem
PCIe
bus
regs
shmem
regs
shmem
![Page 11: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/11.jpg)
11
TALK BREAKDOWN
1. Why Didn’t I Think Of That?
2. CPU Memory to GPU Memory (the PCIe Bus)
3. GPU Memory to the SM
4. Registers & Shared Memory
5. Occupancy, Divergence & Latency
6. Weird Things You Never Thought Of (and probably shouldn’t try)
In no particular order
![Page 12: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/12.jpg)
12
WHERE TO BEGIN?
![Page 13: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/13.jpg)
13
THE OBVIOUS
Start with the Visual Profiler
NVIDIA Visual Profiler
![Page 14: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/14.jpg)
14
CPU <> GPU DATA MOVEMENT
![Page 15: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/15.jpg)
15
PCI ISSUES
regs
shmem
regs
shmem
regs
shmem
PCIe
bus
16GB/sec
Moving data over the PCIe bus
![Page 16: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/16.jpg)
16
PIN YOUR CPU MEMORY
CPU MemoryGPU Memory
DataCopy
![Page 17: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/17.jpg)
17
PIN YOUR CPU MEMORY
CPU MemoryGPU Memory
DataDMA
Controller
![Page 18: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/18.jpg)
18
PIN YOUR CPU MEMORY
CPU MemoryGPU Memory
Swap
DMA
Controller Data
![Page 19: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/19.jpg)
19
PIN YOUR CPU MEMORY
CPU MemoryGPU Memory
Data
DMA
Controller
Pinned
Copy of
Data
CPU allocates & pins page then copies locally before DMA
![Page 20: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/20.jpg)
20
GPU Memory
PIN YOUR CPU MEMORY
CPU Memory
User
Pinned
Data
DMA
Controller
cudaHostAlloc( &data, size, cudaHostAllocMapped );cudaHostRegister( &data, size, cudaHostRegisterDefault );
![Page 21: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/21.jpg)
21
PIN YOUR CPU MEMORY
![Page 22: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/22.jpg)
22
REMEMBER: PCIe GOES BOTH WAYS
![Page 23: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/23.jpg)
23
Operations in a single stream are ordered
But hardware can copy and compute at the same time
STREAMS & CONCURRENCY
ComputeCopy data
to Host
Copy data
to GPU
Time
SingleStream
Hiding the cost of data transfer
![Page 24: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/24.jpg)
24
STREAMS & CONCURRENCY
ComputeCopy data
to Host
Copy data
to GPU
Time
WorkCopy
back
Copy
up
WorkCopy
back
Copy
up Saved TimeStream 2
Stream 1
SingleStream
![Page 25: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/25.jpg)
25
STREAMS & CONCURRENCY
8streams
2streams
1stream
Can keep on breaking work into smaller chunks and saving time
![Page 26: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/26.jpg)
26
SMALL PCIe TRANSFERS
PCIe is designed for large data transfers
But fine-grained copy/compute overlap prefers small transfers
So how small can we go?
8Toomany
2 1
![Page 27: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/27.jpg)
27
APPARENTLY NOT THAT SMALL
![Page 28: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/28.jpg)
28
FROM GPU MEMORY TO GPU THREADS
![Page 29: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/29.jpg)
29
FEEDING THE MACHINE
regs
shmem
regs
shmem
regs
shmem
PCIe
bus
From GPU Memory to the SMs
![Page 30: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/30.jpg)
30
USE THE PARALLEL ARCHITECTURE
Cache is sized to servicesets of 32 requests at a time
L2 Cache Line
Threads runin groups of 32
High-speed GPU memoryworks best with linear access
Hardware is optimized to use all SIMT threads at once
![Page 31: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/31.jpg)
31
VECTORIZE MEMORY LOADS
T0-T32
int
Multi-Word as well as Multi-Thread
![Page 32: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/32.jpg)
32
VECTORIZE MEMORY LOADS
T0-T15
T16-T31
int2
Fill multiple cache lines in a single fetch
![Page 33: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/33.jpg)
33
VECTORIZE MEMORY LOADS
T0-T7
T8-T15
T16-T23
T24-T31
int4
Fill multiple cache lines in a single fetch
![Page 34: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/34.jpg)
34
VECTORIZE MEMORY LOADS
![Page 35: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/35.jpg)
35
DO MULTIPLE LOADS PER THREAD
__global__ void copy(int2 *input,int2 *output, int max) {
int id = threadIdx.x +blockDim.x * blockIdx.x;
if( id < max ) {output[id] = input[id];
}}
__global__ void copy(int2 *input,int2 *output, int max,int loadsPerThread) {
int id = threadIdx.x +blockDim.x * blockIdx.x;
for(int n=0; n<loadsPerThread; n++) {if( id >= max ) {break;
}output[id] = input[id];id += blockDim.x * gridDim.x;
}}
One copy per threadMaximum overhead
Multiple copies per threadAmortize overhead
Multi-Thread, Multi-Word AND Multi-Iteration
![Page 36: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/36.jpg)
36
“MAXIMAL” LAUNCHES ARE BEST
![Page 37: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/37.jpg)
37
COALESCED MEMORY ACCESS
1 2 3 4Coalesced: Sequential memory accesses are adjacent
Uncoalesced: Sequential memory accesses are unassociated1
2
3
4
It’s not just good enough to use all SIMT threads
![Page 38: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/38.jpg)
38
SIMT PENALTIES WHEN NOT COALESCED
x = data[threadIdx.x] x = data[rand()]
Single 32-wide operation 32 one-wide operations
![Page 39: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/39.jpg)
39
SCATTER & GATHER
1
2
3
4
1 2 3 4
1
2
3
4
1 2 3 4
Scattering
Reading randomlyWriting sequentially
Gathering
Reading sequentiallyWriting randomly
![Page 40: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/40.jpg)
40
AVOID SCATTER/GATHER IF YOU CAN
![Page 41: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/41.jpg)
41
AVOID SCATTER/GATHER IF YOU CAN
![Page 42: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/42.jpg)
42
SORTING MIGHT BE AN OPTION
If reading non-sequential data is expensive, is it worth sorting it to make it sequential?
1 2 3 4
Coalesced Read
1 2 3 4Sort
1 2 3 4
2 4 1 3
Gathering
Slow Fast
![Page 43: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/43.jpg)
43
SORTING MIGHT BE AN OPTION
Even if you’re only going to read it twice, then yes!
![Page 44: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/44.jpg)
44
PRE-SORTING TURNS OUT TO BE GOOD
![Page 45: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/45.jpg)
45
DATA LAYOUT: “AOS vs. SOA”
Array-of-Structures
#define NPTS 1024 * 1024
struct Coefficients_AOS {double u[3];double x[3][3];double p;double rho;double eta;
};
Coefficients_AOS gridData[NPTS];
#define NPTS 1024 *1024
struct Coefficients_SOA {double u[3][NPTS];double x[3][3][NPTS];double p[NPTS];double rho[NPTS];double eta[NPTS];
};
Coefficients_SOA gridData;
Structure-of-Arrays
Single-thread code prefers arrays of structures, for cache efficiency
SIMT code prefers structures of arrays, for execution & memory efficiency
Sometimes you can’t just sort your data
![Page 46: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/46.jpg)
46
DATA LAYOUT: “AOS vs. SOA”
#define NPTS 1024 * 1024
struct Coefficients_AOS {double u[3];double x[3][3];double p;double rho;double eta;
};
Coefficients_AOS gridData[NPTS];
u0 u1 u2
x00 x01 x02
x10 x11 x12
x20 x21 x22
p
rho
eta
Structure Definition Conceptual Layout
![Page 47: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/47.jpg)
47
SOA: STRIDED ARRAY ACCESS
u0 u1 u2
x00 x01 x02
x10 x11 x12
x20 x21 x22
p
rho
eta
Conceptual Layout
Array-of-Structures Memory Layout
double u0 = gridData[threadIdx.x].u[0];
GPU reads data one element at a time, but in parallel by 32 threads in a warp
![Page 48: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/48.jpg)
48
AOS: COALESCED BUT COMPLEX
u0 u1 u2
x00 x01 x02
x10 x11 x12
x20 x21 x22
p
rho
eta
Conceptual Layout
Array-of-Structures Memory Layout
GPU reads data one element at a time, but in parallel by 32 threads in a warp
double u0 = gridData.u[0][threadIdx.x];
Structure-of-Arrays Memory Layout
![Page 49: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/49.jpg)
49
BLOCK-WIDE LOAD VIA SHARED MEMORY
Read data linearly as bytes. Use shared memory to convert to struct
Block copies datato shared memory
Device Memory
Shared Memory
![Page 50: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/50.jpg)
50
BLOCK-WIDE LOAD VIA SHARED MEMORY
Read data linearly as bytes. Use shared memory to convert to struct
Threads which own the datagrab it from shared memory
Device Memory
Shared Memory
![Page 51: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/51.jpg)
51
CLEVER AOS/SOA TRICKS
![Page 52: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/52.jpg)
52
CLEVER AOS/SOA TRICKSHelps for any data size
![Page 53: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/53.jpg)
53
HANDY LIBRARY TO HELP YOU
Trove – A utility library for fast AOS/SOA access and transpositionhttps://github.com/bryancatanzaro/trove
![Page 54: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/54.jpg)
54
(AB)USING THE CACHE
![Page 55: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/55.jpg)
55
MAKING THE MOST OF L2-CACHE
L2 cache is fast but small: GDRAML2 Cache
300GB/sec
2,000GB/sec
Architecture L2 Cache
Size
Total
Threads
Cache Bytes
per Thread
Kepler 1536 KB 30,720 51
Maxwell 3072 KB 49,152 64
Pascal 4096 KB 114,688 36
![Page 56: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/56.jpg)
56
TRAINING DEEP NEURAL NETWORKS
![Page 57: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/57.jpg)
57
LOTS OF PASSES OVER DATA
FFT
3x3
convolution
5x5
convolution
7x7
convolution
+
W1
W2
W3
Cat!
![Page 58: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/58.jpg)
58
MULTI-RESOLUTION CONVOLUTIONS
Pass 1 : 3x3
Pass 2: 5x5
Pass 3: 7x7
![Page 59: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/59.jpg)
59
TILED, MULTI-RESOLUTION CONVOLUTION
Do 3 passes per-tile Each tile sized to fit in L2 cache
Pass 1 : 3x3
Pass 2: 5x5
Pass 3: 7x7
![Page 60: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/60.jpg)
60
LAUNCHING FEWER THAN MAXIMUM THREADS
![Page 61: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/61.jpg)
61
SHARED MEMORY: DEFINITELY WORTH IT
![Page 62: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/62.jpg)
62
USING SHARED MEMORY WISELY
Shared memory arranged into “banks” for concurrent SIMT access▪ 32 threads can read simultaneously so long as into separate banks
Shared memory has 4-byte and 8-byte “bank” sizes
![Page 63: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/63.jpg)
63
STENCIL ALGORITHM
Many algorithms have high data re-use: potentially good for shared memory
“Stencil” algorithms accumulate data from neighbours onto a central point▪ Stencil has width “W” (in the above case, W=5)
Adjacent threads will share (W-1) items of data – good potential for data re-use
![Page 64: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/64.jpg)
64
STENCILS IN SHARED MEMORY
![Page 65: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/65.jpg)
65
SIZE MATTERS
![Page 66: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/66.jpg)
66
PERSISTENT KERNELS
Avoid multiple kernel launches by caching in shared memory instead of L2
void tiledConvolution() {convolution<3><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);convolution<5><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);convolution<7><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
}
__global__ void convolutionShared(int *data, int count, int sharedelems) {extern __shared__ int shdata[];shdata[threadIdx.x] = data[threadIdx.x + blockDim.x*blockIdx.x];__syncthreads();
convolve<3>(threadIdx.x, shdata, sharedelems);__syncthreads();convolve<5>(threadIdx.x, shdata, sharedelems);__syncthreads();convolve<7>(threadIdx.x, shdata, sharedelems);
}
Separate kernellaunches withL2 re-use
Single kernellaunch withpersistent kernel
Revisiting the tiled convolutions
![Page 67: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/67.jpg)
67
PERSISTENT KERNELS
![Page 68: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/68.jpg)
68
OPERATING DIRECTLY FROM CPU MEMORY
Can save memory copies. It’s obvious when you think about it ...
ComputeCopy data
to Host
Copy data
to GPU
ComputeRead f
rom
CPU
Wri
te t
o
CPU
Compute only begins when 1st copyhas finished. Task only ends when2nd copy has finished.
Compute begins after first fetch.Uses lots of threads to coverhost-memory access latency. Takes advantage of bi-directional PCI.
![Page 69: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/69.jpg)
69
OPERATING DIRECTLY FROM CPU MEMORY
![Page 70: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/70.jpg)
70
OCCUPANCY AND REGISTER LIMITATIONS
Register file is bigger than shared memory and L1 cache!
Occupancy can kill you if you use too many registers
Often worth forcing fewer registers to allow more blocks per SM
But watch out for math functions!
Function float double
log 7 18
cos 16 28
acos 6 18
cosh 7 10
tan 15 28
erfc 14 22
exp 7 10
log10 6 18
normcdf 16 26
cbrt 8 20
sqrt 6 12
rsqrt 5 12
y0 20 30
y1 22 30
fdivide 11 20
pow 11 24
grad. desc. 14 22
__launch_bounds__(maxThreadsPerBlock,minBlocksPerMultiprocessor)
__global__ void compute() {y = acos(pow(log(fdivide(tan(cosh(erfc(x))), 2)), 3);
}
![Page 71: CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. 2 The art of doing more with less. 3 RULE #1: DON’T TRY TOO HARD](https://reader035.fdocuments.us/reader035/viewer/2022071006/5fc3865e7824f278f2400581/html5/thumbnails/71.jpg)
THANK YOU!