CUDA Optimization with NVIDIA® Nsight Visual Studio Edition...
![Page 1: CUDA Optimization with NVIDIA® Nsight Visual Studio Edition 3on-demand.gputechconf.com/gtc/2013/presentations/S... · During the session we will \ use Nsight Visual Studio Edition](https://reader030.fdocuments.us/reader030/viewer/2022040405/5e95cb96d2e8e07d07579400/html5/thumbnails/1.jpg)
CUDA Optimization with NVIDIA® Nsight™ Visual Studio Edition 3.0
Julien Demouth, NVIDIA
What Will You Learn?
An iterative method to optimize your GPU code
A way to conduct that method with Nsight VSE
APOD Method, Session S3008, Cliff Woolley
https://developer.nvidia.com/content/assess-parallelize-optimize-deploy
What Does the Application Do?
It does not matter!!!
We care about memory accesses, instructions, latency, …
Companion code (with a different input file)
https://github.com/jdemouth/nsight-gtc2013
Our Method
Trace your application
Identify the hot spot and profile it
Identify the performance limiter
— Memory Bandwidth
— Instruction Throughput
— Latency
Optimize the code
Iterate
Our Environment
We use
— Nvidia Tesla K20c (GK110, SM 3.5), ECC OFF,
— Microsoft Windows 7 x64,
— Microsoft Visual Studio 2010 SP1,
— CUDA 5.0,
— Driver 310.34,
— Nvidia Nsight 3.0.
ITERATION 1
Trace the Application
CUDA Launch Summary
spmv_kernel_v0 is a hot spot, let’s start here!!!
| Kernel | Time | Speedup |
| --- | --- | --- |
| Original version | 457.1ms | |
Profile the Most Expensive Kernel
CUDA Launches
Identify the Main Limiter
Is it limited by the memory bandwidth?
Is it limited by the instruction throughput?
Is it limited by latency?
Memory Bandwidth
Utilization of DRAM Bandwidth: 37.67%
We are not limited by the memory bandwidth
Instruction Throughput
Instructions Per Clock (IPC): 0.04
We are not limited by instruction throughput
Latency
First two things to check:
— Occupancy
— Memory accesses (coalesced/uncoalesced accesses)
Other things to check (if needed):
— Control flow efficiency (branching, idle threads)
— Divergence
— Bank conflicts in shared memory
Latency
Occupancy: 47.58% Achieved / 50% Theoretical
Eligible Warps per Active Cycle: >4.7 on average
On GK110, 4 Eligible Warps are enough: Not an issue
Latency
Memory Accesses:
— Load: 22 Transactions per Request
— Store: 8 Transactions per Request
We have too many uncoalesced accesses!!!
Where Do Those Accesses Happen?
CUDA Source Profiler:
— Find where most of the uncoalesced requests happen
Tip: Sort “L2 Global Transactions Executed”
Access Pattern
Double precision numbers: 64-bit
Per Warp:
— Up to 32 L1 Transactions / Ideal case: 2 Transactions
— Up to 32 L2 Transactions / Ideal case: 8 Transactions
[Figure: access pattern of Thread 0, Thread 1, and Thread 2 against three L2 Transactions (32B each) and one L1 Transaction (128B)]
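The per-warp bounds on this slide follow from simple arithmetic. The sketch below assumes 32 threads per warp and uses the sizes shown on the slide (64-bit values, 128B L1 transactions, 32B L2 transactions):

```latex
\begin{aligned}
\text{bytes requested per warp} &= 32 \times 8\,\text{B} = 256\,\text{B} \\
\text{ideal L1 transactions}    &= 256\,\text{B} \,/\, 128\,\text{B} = 2 \\
\text{ideal L2 transactions}    &= 256\,\text{B} \,/\, 32\,\text{B} = 8 \\
\text{worst case}               &= 32 \ \text{transactions (every thread touches a different line or segment)}
\end{aligned}
```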
Access Pattern
Next iteration:
Idea: Use the Read-only cache (LDG load)
— On Fermi: use a texture, or configure 48KB for L1
[Figure: access pattern of Thread 0, Thread 1, and Thread 2 against three L2 Transactions (32B each) and one L1 Transaction (128B)]
First Modification: Use __ldg
We change the source code:
It is slower: 625.8ms
| Kernel | Time | Speedup |
| --- | --- | --- |
| Original version | 457.1ms | |
| LDG to load A | 625.8ms | 0.73x |
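The code change shown on the slide is not preserved in this transcript. As a rough sketch of what routing the matrix loads through the read-only cache could look like, the kernel below assumes a CSR matrix and one thread per row. The kernel name `spmv_ldg_a`, the array names, the CSR layout, and the launch helper are all assumptions made for illustration, not the talk's code:

```cuda
#include <cuda_runtime.h>

// Hypothetical CSR SpMV (y = A * x), one thread per row.
// The matrix arrays are read with __ldg (read-only data cache, SM 3.5+).
__global__ void spmv_ldg_a(int num_rows,
                           const int* row_ptr,   // CSR row offsets (num_rows + 1 entries)
                           const int* col_idx,   // column index of each nonzero
                           const double* vals,   // value of each nonzero
                           const double* x,
                           double* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;

    double sum = 0.0;
    for (int it = row_ptr[row]; it < row_ptr[row + 1]; ++it) {
        // Each nonzero of A is used once per product, so these loads
        // give the cache little to reuse.
        int    col = __ldg(&col_idx[it]);
        double a   = __ldg(&vals[it]);
        sum += a * x[col];
    }
    y[row] = sum;
}

// Launch helper (hypothetical); all pointers must be device-accessible.
void launch_spmv_ldg_a(int num_rows, const int* row_ptr, const int* col_idx,
                       const double* vals, const double* x, double* y)
{
    const int block = 128;
    const int grid  = (num_rows + block - 1) / block;
    spmv_ldg_a<<<grid, block>>>(num_rows, row_ptr, col_idx, vals, x, y);
}
```

Error checking is omitted for brevity; a real build would need `-arch=sm_35` or newer for `__ldg`.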
First Modification: Use __ldg
Less L2 to SM traffic: 857.1MB transferred (it was 906.2MB)
First Modification: Use __ldg
Why does 6% less traffic lead to 37% performance loss?
Instruction Efficiency (Eligible Warps per Active Cycle):
The average number of Eligible Warps dropped below 1
First Modification: Use __ldg
There are already “a lot” of Active Warps per Cycle
First Modification: Use __ldg
Warps cannot issue because they have to wait
Warps wait for Texture in 91.1% of the cases
First Modification: Use __ldg
The loads compete for the cache too much
— Low hit rate: 7.7%
Texture requests introduce too much latency
Things to check in those cases:
— Texture Hit Rate: Low means no reuse
— Issue Efficiency and Stall Reasons
It was actually expected: GPU caches are not CPU caches!!!
First Modification: Use __ldg
Other accesses may benefit from LDGs
Memory blocks accessed several times by several threads
How can we detect it?
— Source code analysis
— There is no way to detect it from Nsight
First Modification: Use __ldg
We change the source code
— In y = Ax, we use __ldg when loading x
It’s faster: 403.4ms
| Kernel | Time | Speedup |
| --- | --- | --- |
| Original version | 457.1ms | |
| LDG to load A | 625.8ms | 0.73x |
| LDG to load X | 403.4ms | 1.13x |
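A minimal sketch of this variant, under the same kind of assumptions (CSR storage, one thread per row, hypothetical names `spmv_ldg_x` and `launch_spmv_ldg_x`): only the gather from `x` goes through `__ldg`, since entries of `x` can be requested by many rows, while the matrix arrays use ordinary loads.

```cuda
#include <cuda_runtime.h>

// Hypothetical CSR SpMV (y = A * x), one thread per row.
// Only x is read through the read-only data cache (SM 3.5+).
__global__ void spmv_ldg_x(int num_rows,
                           const int* row_ptr,
                           const int* col_idx,
                           const double* vals,
                           const double* x,
                           double* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;

    double sum = 0.0;
    for (int it = row_ptr[row]; it < row_ptr[row + 1]; ++it) {
        // Several rows may reference the same column, so x[col] can be reused.
        sum += vals[it] * __ldg(&x[col_idx[it]]);
    }
    y[row] = sum;
}

// Launch helper (hypothetical); all pointers must be device-accessible.
void launch_spmv_ldg_x(int num_rows, const int* row_ptr, const int* col_idx,
                       const double* vals, const double* x, double* y)
{
    const int block = 128;
    const int grid  = (num_rows + block - 1) / block;
    spmv_ldg_x<<<grid, block>>>(num_rows, row_ptr, col_idx, vals, x, y);
}
```

`x` must not be written while the kernel runs, because `__ldg` is only valid for data that stays read-only for the kernel's lifetime.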
First Modification: Use __ldg
Much less L2 to SM traffic: 774MB (it was 906.2MB)
Good hit rate in Texture Cache: 83%
ITERATION 2
CUDA Launch Summary
spmv_kernel_v2 is still a hot spot, so we profile it
Identify the Main Limiter
Is it limited by the memory bandwidth?
Is it limited by the instruction throughput?
Is it limited by latency?
Identify the Main Limiter
We are still limited by latency
— Low DRAM utilization: 36.48%
— Low IPC: 0.06
Identify the Main Limiter
We are not limited by the Occupancy
— We have > 6 Eligible Warps per Active Cycle
We are limited by uncoalesced accesses: 48.92% of Replays
Second Strategy: Change Memory Accesses
4 consecutive threads load 4 consecutive elements
Per Warp:
— Up to 8 L1 Transactions / Ideal case: 2 Transactions
— Up to 8 L2 Transactions / Ideal case: 8 Transactions
[Figure: Threads 0, 1, 2, 3; Threads 4, 5, 6, 7; and Threads 8, 9, 10, 11 against three L2 Transactions (32B each) and one L1 Transaction (128B)]
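One way to express "4 consecutive threads load 4 consecutive elements" is to assign a small group of threads to each row and combine their partial sums at the end. The sketch below is an assumption about how that could be written, not the talk's kernel: the names `spmv_group` and `launch_spmv_4_threads_per_row`, the CSR layout, and the use of `__shfl_down_sync` (a CUDA 9+ intrinsic; the CUDA 5.0 toolchain used in the talk predates it) are all illustrative choices.

```cuda
#include <cuda_runtime.h>

// Hypothetical CSR SpMV where TPR consecutive threads share one row.
// TPR must be a power of two <= 32 and blockDim.x a multiple of 32.
template <int TPR>
__global__ void spmv_group(int num_rows, const int* row_ptr, const int* col_idx,
                           const double* vals, const double* x, double* y)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int row  = tid / TPR;   // row handled by this group of threads
    int lane = tid % TPR;   // position of the thread inside its group

    double sum = 0.0;
    if (row < num_rows) {
        // Consecutive lanes read consecutive nonzeros of the same row.
        for (int it = row_ptr[row] + lane; it < row_ptr[row + 1]; it += TPR)
            sum += vals[it] * __ldg(&x[col_idx[it]]);
    }

    // Tree reduction inside each group; all 32 threads of the warp take part.
    for (int offset = TPR / 2; offset > 0; offset /= 2)
        sum += __shfl_down_sync(0xffffffffu, sum, offset, TPR);

    if (lane == 0 && row < num_rows) y[row] = sum;
}

// Launch helper (hypothetical); all pointers must be device-accessible.
void launch_spmv_4_threads_per_row(int num_rows, const int* row_ptr,
                                   const int* col_idx, const double* vals,
                                   const double* x, double* y)
{
    const int TPR = 4, block = 128;
    const int rows_per_block = block / TPR;
    const int grid = (num_rows + rows_per_block - 1) / rows_per_block;
    spmv_group<TPR><<<grid, block>>>(num_rows, row_ptr, col_idx, vals, x, y);
}
```

With 4 doubles per group, each group's load fits in one 32B segment when the row start is aligned, which is where the "up to 8" transactions per warp comes from (8 groups of 4 threads).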
Second Strategy: Change Memory Accesses
It’s much faster: 161.7ms
| Kernel | Time | Speedup |
| --- | --- | --- |
| Original version | 457.1ms | |
| LDG to load A | 625.8ms | 0.73x |
| LDG to load X | 403.4ms | 1.13x |
| Coalescing with 4 Threads | 161.7ms | 2.83x |
Second Strategy: Change Memory Accesses
We have far fewer Transactions per Request: 5.51 (LD)
Second Strategy: Change Memory Accesses
Much less traffic from L2: 230.5MB (it was 774MB)
Much less DRAM traffic: 210.1MB (it was 503.1MB)
ITERATION 3
CUDA Launch Summary
spmv_kernel_v3 is still a hot spot, so we profile it
Identify the Main Limiter
Is it limited by the memory bandwidth?
Is it limited by the instruction throughput?
Is it limited by latency?
Identify the Main Limiter
We are still limited by latency
— Low DRAM utilization: 37.67%
— Low IPC: 0.31
Latency
Occupancy: 57.80% Achieved / 62.50% Theoretical
Eligible Warps per Active Cycle: ~3 on average
We need 4 warps on GK110, so ~3 could be an issue
Latency
Occupancy is limited by the number of registers
We change the number of registers with __launch_bounds__
__launch_bounds__(BLOCK_SIZE, MIN_BLOCKS)
It does not really help
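For readers unfamiliar with the qualifier, here is a small sketch of how `__launch_bounds__` is attached to a kernel and how the resulting register count can be read back. The kernel `scale_kernel`, the helper names, and the values chosen for `BLOCK_SIZE` and `MIN_BLOCKS` are placeholders for illustration; the talk does not give the values it used.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed launch configuration: threads per block
#define MIN_BLOCKS 4    // assumed target: resident blocks per multiprocessor

// The qualifier sits between the return type and the kernel name. It asks the
// compiler to limit registers so that MIN_BLOCKS blocks of BLOCK_SIZE threads
// can be resident on a multiprocessor at once.
__global__ void __launch_bounds__(BLOCK_SIZE, MIN_BLOCKS)
scale_kernel(int n, double a, double* v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= a;
}

// Registers per thread that the compiler ended up using for the kernel.
int registers_per_thread()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, scale_kernel);
    return attr.numRegs;
}

// Launch helper (hypothetical); v must be device-accessible.
void launch_scale(int n, double a, double* v)
{
    // The block size must not exceed the bound promised to the compiler.
    scale_kernel<<<(n + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(n, a, v);
}
```

Tightening the bound can push the compiler to spill registers to local memory, which may be one reason a higher theoretical occupancy does not translate into a faster kernel.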
Latency
Memory Accesses:
— Load: 5.51 Transactions per Request
— Store: 2 Transactions per Request
We still have too many uncoalesced accesses
Latency
We still have too many uncoalesced accesses
— Nearly 70% of Instruction Serialization (Replays)
— Stall Reasons: 48.1% due to Data Requests
Where Do Those Accesses Happen?
Same lines of code as before
What Can We Do?
In our kernel: 4 threads per row of the matrix A
New approach: 1 warp of threads per row of the matrix A
[Figure: current kernel, with Threads 0, 1, 2, 3; Threads 4, 5, 6, 7; and Threads 8, 9, 10, 11 against three L2 Transactions (32B each) and one L1 Transaction (128B), compared with the new approach, with Threads 0, 1, 2, 3, …, 31 (some possibly idle) against three L2 Transactions (32B each)]
One Warp Per Row
It’s faster: 140.4ms
| Kernel | Time | Speedup |
| --- | --- | --- |
| Original version | 457.1ms | |
| LDG to load A | 625.8ms | 0.73x |
| LDG to load X | 403.4ms | 1.13x |
| Coalescing with 4 Threads | 161.7ms | 2.83x |
| 1 Warp per Row | 140.4ms | 3.26x |
One Warp Per Row
Far fewer Transactions Per Request: 1.37 (LD) / 1 (ST)
ITERATION 4
One Warp Per Row
spmv_kernel_v4 is the hot spot
One Warp Per Row
DRAM utilization: 40.64%
IPC: 1.57
We are still limited by latency
One Warp Per Row
Occupancy and memory accesses are OK (not shown)
Control Flow Efficiency: 86.59%
Only 72.5% threads active in the expensive loop
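A back-of-the-envelope model can help explain why so many threads sit idle when a full warp is assigned to rows that hold fewer than 32 nonzeros. The host-side function below is a simplification invented for illustration (it is not a metric Nsight reports, and `active_thread_fraction` is a hypothetical name): it assumes each group of `tpr` threads steps through its row `tpr` elements at a time, and that a warp keeps looping until its longest row is finished.

```cuda
#include <algorithm>
#include <cstddef>
#include <vector>

// Fraction of thread-iterations that do useful work in the row loop when
// `tpr` threads share a row (tpr must divide the warp size of 32).
double active_thread_fraction(const std::vector<int>& row_lengths, int tpr)
{
    const int warp_size = 32;
    const std::size_t rows_per_warp = warp_size / tpr;
    long long useful = 0, issued = 0;

    for (std::size_t first = 0; first < row_lengths.size(); first += rows_per_warp) {
        int trips = 0;  // loop trips of the warp = trips of its longest row
        for (std::size_t r = first;
             r < first + rows_per_warp && r < row_lengths.size(); ++r) {
            int n = row_lengths[r];
            useful += n;                                // one thread-iteration per nonzero
            trips = std::max(trips, (n + tpr - 1) / tpr);
        }
        issued += static_cast<long long>(trips) * warp_size;
    }
    return issued == 0 ? 1.0 : static_cast<double>(useful) / static_cast<double>(issued);
}
```

For two rows of 12 nonzeros each, the model gives 0.375 with 32 threads per row and 0.75 with 16 threads per row, which suggests why a narrower group might be worth trying.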
One Half Warp Per Row
It is faster: 114.7ms
| Kernel | Time | Speedup |
| --- | --- | --- |
| Original version | 457.1ms | |
| LDG to load A | 625.8ms | 0.73x |
| LDG to load X | 403.4ms | 1.13x |
| Coalescing with 4 Threads | 161.7ms | 2.83x |
| 1 Warp per Row | 140.4ms | 3.26x |
| ½ Warp per Row | 114.7ms | 3.99x |
ITERATION 5
One Half Warp Per Row
DRAM utilization: 49.79%
IPC: 1.34
We are still limited by latency
One Half Warp Per Row
Memory accesses are good enough
Occupancy could be an issue: ~3.2 Eligible Warps per Cycle
— Occupancy is limited by registers
But forcing register count does not improve performance
One Half Warp Per Row
Branch divergence induces latency
We have 23.1% of divergent branches
One Half Warp Per Row
We fix branch divergence
It is faster: 91.2ms
| Kernel | Time | Speedup |
| --- | --- | --- |
| Original version | 457.1ms | |
| LDG to load A | 625.8ms | 0.73x |
| LDG to load X | 403.4ms | 1.13x |
| Coalescing with 4 Threads | 161.7ms | 2.83x |
| 1 Warp per Row | 140.4ms | 3.26x |
| ½ Warp per Row | 114.7ms | 3.99x |
| No divergence | 91.2ms | 5.01x |
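The transcript does not say how the divergence was removed. One common pattern, shown here purely as an assumption and not as the change made in the talk, is to give every thread of a warp the same loop trip count with a warp vote, and to mask out the threads that have run past the end of their row. The names `spmv_group_uniform` and `launch_spmv_half_warp_per_row` are hypothetical, and `__any_sync` / `__shfl_down_sync` are CUDA 9+ intrinsics.

```cuda
#include <cuda_runtime.h>

// Hypothetical CSR SpMV with TPR threads per row and a warp-uniform loop.
// TPR must be a power of two <= 32 and blockDim.x a multiple of 32.
template <int TPR>
__global__ void spmv_group_uniform(int num_rows, const int* row_ptr,
                                   const int* col_idx, const double* vals,
                                   const double* x, double* y)
{
    const unsigned full_mask = 0xffffffffu;
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int row  = tid / TPR;
    int lane = tid % TPR;

    int it = 0, end = 0;  // empty range for threads past the last row
    if (row < num_rows) { it = row_ptr[row] + lane; end = row_ptr[row + 1]; }

    double sum = 0.0;
    // The vote returns the same value to all threads of the warp, so the
    // loop branch itself no longer diverges.
    while (__any_sync(full_mask, it < end)) {
        bool active = it < end;
        // Inactive threads skip the loads and contribute zero.
        double a  = active ? vals[it] : 0.0;
        double xv = active ? __ldg(&x[col_idx[it]]) : 0.0;
        sum += a * xv;
        it += TPR;
    }

    // Tree reduction inside each group of TPR threads.
    for (int offset = TPR / 2; offset > 0; offset /= 2)
        sum += __shfl_down_sync(full_mask, sum, offset, TPR);

    if (lane == 0 && row < num_rows) y[row] = sum;
}

// Launch helper (hypothetical); all pointers must be device-accessible.
void launch_spmv_half_warp_per_row(int num_rows, const int* row_ptr,
                                   const int* col_idx, const double* vals,
                                   const double* x, double* y)
{
    const int TPR = 16, block = 128;
    const int rows_per_block = block / TPR;
    const int grid = (num_rows + rows_per_block - 1) / rows_per_block;
    spmv_group_uniform<TPR><<<grid, block>>>(num_rows, row_ptr, col_idx, vals, x, y);
}
```

Whether this helps depends on how the compiler handles the guarded loads; the divergent-branch counter in the profiler is the way to check.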
One Half Warp Per Row
DRAM utilization: 62.57%
IPC: 1.56
We achieve a much better bandwidth
So Far
We have successively:
— Improved caching using __ldg (use with care)
— Improved coalescing
— Improved control flow efficiency
— Improved branching
Our new kernel is 5x faster than our first implementation
Nsight helped us a lot
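The speedup figures quoted throughout are consistent with dividing the original kernel time by the new kernel time, which can be checked against the reported timings:

```latex
\text{speedup} = \frac{T_{\text{original}}}{T_{\text{new}}}, \qquad
\frac{457.1\,\text{ms}}{625.8\,\text{ms}} \approx 0.73, \qquad
\frac{457.1\,\text{ms}}{161.7\,\text{ms}} \approx 2.83, \qquad
\frac{457.1\,\text{ms}}{91.2\,\text{ms}} \approx 5.01
```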
ITERATION 6
Next Kernel
We are satisfied with the performance of spmv_kernel
We move to the next kernel: jacobi_smooth
What Have You Seen?
An iterative method to optimize your GPU code
— Trace your application
— Identify the hot spot and profile it
— Identify the performance limiter
— Optimize the code
— Iterate
A way to conduct that method with Nsight VSE