Study: Persistent Threads Style Programming Model for GPU … · 2013-08-23 · Study: Persistent...
A Study of Persistent Threads Style Programming Model for GPU Computing
Kshitij Gupta [/shi/ /tij/]
UC Davis GTC 2012 | San Jose
A Study of Persistent Threads Style GPU Programming for GPGPU Workloads
Kshitij Gupta, Jeff A. Stuart, John D. Owens
UC Davis InPar 2012 | San Jose
Outline
GPGPU Programming (“nonPT”)
Limitations
Introduction to “PT”
Use Cases
Observations/Discussion
NOTE: (CUDA | OpenCL) Terminology
We will use CUDA terminology, but the same discussion can be extended to OpenCL.

| CUDA | OpenCL |
| --- | --- |
| Thread | Work item |
| Warp | -- |
| Thread block | Work group |
| Grid | Index space |
| Local memory | Private memory |
| Shared memory | Local memory |
| Global memory | Global memory |
| Scalar core | Processing element |
| Multi-processor (SM) | Compute unit |

Preliminaries: GPGPU Programming Hierarchy
[Figure, built up over five slides. Labels: DRAM; GPU with SM_0 to SM_3; SM0 with scalar processors P0, P1, ..., P7 (SIMD); Warp of P0_0 ... P7_0 through P0_X ... P7_X (SIMT); Block of warps W0, W1, ..., WN; blocks B0, B1, ..., BM (SPMD). Each added level is annotated "Virtualize".]
Preliminaries: Workload Evolution
[Figure, built up over two slides; each stage is drawn with its input and output buffers (ibuff, obuff).]
(a) Pre-2006: Discrete cores (Core A, Core B, Core C)
(b) 2006: Stream programming: CUDA architecture with unified cores, along with 'C for CUDA' (Unified Shader Core; Kernel A, Kernel B, Kernel C)
(c) Today: A sample of irregular workload patterns (i), (ii), (iii), including sync points
Core GPGPU Programming Characteristics (& Limitations)
1. Host-Device Interface: Master-slave processing; Kernel size
2. Device-side Properties: Lifetime of a Block; Hardware Scheduler; Block State
3. Memory Consistency: Intra-block; Inter-block
4. Kernel Invocations: Producer-consumer; Spawning kernels; Irregular workloads
Preliminaries: Example – Image Processing (outside Pixar HQ)
[Figure: example image annotated with the labels Grid, Kernel, Block, and Thread.]
A sample image of 128x128 pixels divided into 16 blocks; vertical motion blur; processed on a 4-SM GPU.
[Figure: Input Image, Output Image, and a GPU with SMs 0 1 2 3.]
nonPT illustration
A sample image of 128x128 pixels divided into 16 blocks; vertical motion blur; processed on a 4-SM GPU.
Software view (block IDs):
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
Hardware view (SM assigned to each block):
0 1 2 3
2 3 1 0
1 0 3 2
3 2 0 1
16 blocks mapped to the 4 SMs in random order.
[Figure: Input Image and Output Image shown in both views; GPU with SMs 0 1 2 3.]
Persistent Threads: Properties
1. Maximal Launch – A kernel uses only as many threads as can be concurrently scheduled on the SMs
2. Software, not hardware, schedules work
PT illustration
[Figure: the nonPT illustration of the 128x128 image (16 blocks, vertical motion blur, 4-SM GPU) repeated, with the PT case added alongside: 4 SMs, 4 thread groups, numbered 0 1 2 3.]
Persistent Threads: Properties
1. Maximal Launch – A kernel uses only as many threads as can be concurrently scheduled on the SMs
• Thread-group (vs. thread-block)
• Upper bound: maximal launch
• Lower bound: 1
2. Software, not hardware, schedules work
• Work-queues
• Several optimizations possible
• In this paper: single global FIFO
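The two properties above can be sketched in a few lines. What follows is a minimal, sequential Python simulation rather than GPU code: `maximal_launch`, `run_persistent`, and the `resident_groups_per_sm` parameter are hypothetical names introduced here for illustration, and the round-robin loop only stands in for thread groups that would run concurrently. On real hardware the number of resident groups per SM depends on the kernel's resource usage.

```python
from collections import deque


def maximal_launch(num_sms, resident_groups_per_sm):
    """Upper bound on thread groups: all groups that can be resident at once.

    resident_groups_per_sm is an assumed input, not something computed here.
    """
    return num_sms * resident_groups_per_sm


def run_persistent(work_items, num_groups, process):
    """Each persistent thread group keeps pulling work from one global FIFO."""
    queue = deque(work_items)
    handled = {g: [] for g in range(num_groups)}
    while queue:
        # Round-robin over groups stands in for concurrent execution.
        for g in range(num_groups):
            if not queue:
                break
            item = queue.popleft()  # software, not hardware, picks the work
            handled[g].append(process(item))
    return handled


# The 16-block image on a 4-SM GPU, assuming one resident group per SM.
groups = maximal_launch(num_sms=4, resident_groups_per_sm=1)
assignment = run_persistent(range(16), groups, process=lambda block_id: block_id)
```

In this sketch four thread groups are launched once and each ends up handling four of the 16 blocks, instead of 16 blocks being handed to the hardware scheduler.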
Common Communication Patterns
Linear
Diagonal
Zig-Zag
Scanline
Wavefront
Pinwheel
Checker
...
Use Cases
• μ-kernel benchmarks
• Workload consists of FMAs
• Nvidia GeForce GTX295
PT Use Cases

| # | Use Case | Scenario | Advantage of Persistent Threads |
| --- | --- | --- | --- |
| 1 | CPU-GPU Synchronization | Kernel A produces a variable amount of data that must be consumed by Kernel B | nonPT implementations require a round-trip communication to the host to launch Kernel B with the exact number of blocks corresponding to work items produced by Kernel A. |
| 2 | Load Balancing | Traversing an irregularly-structured, hierarchical data structure | PT implementations build an efficient queue to allow a single kernel to produce a variable amount of output per thread and load balance those outputs onto threads for further processing. |
| 3 | Maintaining Active State | A kernel accumulates a single value across a large number of threads, or Kernel A wants to pass data to Kernel B through shared memory or registers | Because a PT kernel processes many more items per block than a nonPT kernel, it can effectively leverage shared memory across a larger block size for an application like a global reduction. |
| 4 | Global Synchronization | Global synchronization within a kernel across workgroups | In a nonPT kernel, synchronizing across blocks within a kernel is not possible because blocks run to completion and cannot wait for blocks that have not yet been scheduled. The PT model ensures that all blocks are resident and thus allows global synchronization. |

Use Case #1: CPU-GPU Synchronization
Scenario: Kernel A produces a variable amount of data that must be consumed by Kernel B
Advantage of Persistent Threads: nonPT implementations require a round-trip communication to the host to launch Kernel B with the exact number of blocks corresponding to work items produced by Kernel A.
[Figure: (a) nonPT and (b) PT execution diagrams. Labels: CPU, GPU-kP, GPU-kC, GPU'-kC, data, param, read-back barrier; marked intervals t_rb, t_cpu, t_launch.]
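The difference between the two diagrams can be made concrete with a small host-side model. This is a hedged Python sketch, not a measurement: `nonpt_pipeline`, `pt_pipeline`, and the batch sizes are hypothetical, and a "round trip" is simply counted each time the host would have to read back the producer's item count before sizing the consumer launch.

```python
import math


def nonpt_pipeline(batches, threads_per_block):
    """nonPT: after every producer run the host reads back the item count
    and launches the consumer with exactly enough blocks."""
    host_round_trips = 0
    consumer_blocks = []
    for produced in batches:
        host_round_trips += 1  # read-back of the produced-item count
        consumer_blocks.append(math.ceil(produced / threads_per_block))
    return host_round_trips, consumer_blocks


def pt_pipeline(batches, num_groups):
    """PT: a fixed set of resident thread groups consumes whatever was
    produced, so the host never needs to know the count."""
    host_round_trips = 0
    items_per_group = []
    for produced in batches:
        # Group g takes items g, g + num_groups, g + 2 * num_groups, ...
        items_per_group.append(
            [len(range(g, produced, num_groups)) for g in range(num_groups)]
        )
    return host_round_trips, items_per_group


trips_nonpt, blocks = nonpt_pipeline([100, 7, 300], threads_per_block=64)
trips_pt, shares = pt_pipeline([100, 7, 300], num_groups=4)
```

Under these assumptions the nonPT version pays one host round trip per producer-consumer step, while the PT version pays none and instead spreads each batch over the same four groups.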
![Page 23: Study: Persistent Threads Style Programming Model for GPU … · 2013-08-23 · Study: Persistent Threads Style Programming Model for GPU Computing Author: Kshitij Gupta, Jeff Stuart](https://reader034.fdocuments.us/reader034/viewer/2022050514/5f9e0446ec970e68360d8c9d/html5/thumbnails/23.jpg)
Use Case #1: CPU-GPU Synchronization
Use Case #2: Load Balancing/Irregular Parallelism
Scenario: Traversing an irregularly-structured, hierarchical data structure
Advantage of Persistent Threads: PT implementations build an efficient queue to allow a single kernel to produce a variable amount of output per thread and load balance those outputs onto threads for further processing.
Use Case #2: Workload Illustration – Tree(s)
[Figure: (a) Full Tree and (b) Tilted Tree, each growing from its Initial Inputs; the vertical axis is the # of levels.]
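A queue-driven traversal of the two tree shapes can be sketched as follows. This is an illustrative Python simulation under stated assumptions: `traverse`, `full_tree`, and `tilted_tree` are hypothetical helpers, nodes are `(level, index)` pairs, and the tilted tree is modeled as one where only the leftmost node of each level keeps branching, which is one plausible reading of the figure rather than the benchmark's exact definition.

```python
from collections import deque


def traverse(roots, children, num_groups):
    """Persistent groups pull nodes from a shared queue and push children back."""
    queue = deque(roots)
    processed = [0] * num_groups
    g = 0
    while queue:
        node = queue.popleft()
        processed[g] += 1
        queue.extend(children(node))  # variable amount of output per node
        g = (g + 1) % num_groups      # the next group takes the next node
    return processed


def full_tree(depth):
    """Every node above the last level has two children."""
    def children(node):
        level, idx = node
        if level >= depth:
            return []
        return [(level + 1, 2 * idx), (level + 1, 2 * idx + 1)]
    return children


def tilted_tree(depth):
    """Assumed shape: only the leftmost node of each level has children."""
    def children(node):
        level, idx = node
        if level >= depth or idx != 0:
            return []
        return [(level + 1, 0), (level + 1, 1)]
    return children


full_counts = traverse([(0, 0)], full_tree(3), num_groups=4)
tilted_counts = traverse([(0, 0)], tilted_tree(3), num_groups=4)
```

Because every group draws from the same queue, the per-group counts stay within one node of each other for both shapes, even though the tilted tree produces far less work per level.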
![Page 26: Study: Persistent Threads Style Programming Model for GPU … · 2013-08-23 · Study: Persistent Threads Style Programming Model for GPU Computing Author: Kshitij Gupta, Jeff Stuart](https://reader034.fdocuments.us/reader034/viewer/2022050514/5f9e0446ec970e68360d8c9d/html5/thumbnails/26.jpg)
Use Case #2: Workload – Complete Tree
![Page 27: Study: Persistent Threads Style Programming Model for GPU … · 2013-08-23 · Study: Persistent Threads Style Programming Model for GPU Computing Author: Kshitij Gupta, Jeff Stuart](https://reader034.fdocuments.us/reader034/viewer/2022050514/5f9e0446ec970e68360d8c9d/html5/thumbnails/27.jpg)
Use Case #2: Workload – Tilted Tree
Use Case #3: Maintaining Active State
Scenario: A kernel accumulates a single value across a large number of threads, or Kernel A wants to pass data to Kernel B through shared memory or registers
Advantage of Persistent Threads: Because a PT kernel processes many more items per block than a nonPT kernel, it can effectively leverage shared memory across a larger block size for an application like a global reduction.
[Figure: (a) nonPT: Kernel-X1, Kernel-X2, Kernel-X3; (b) PT: GPU Kernel-X'.]
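The reduction case can be illustrated with a short Python sketch. It is a sequential model, not GPU code: `nonpt_reduce` and `pt_reduce` are hypothetical names, and the `partial` list only stands in for state that a persistent group would keep in shared memory or registers.

```python
def nonpt_reduce(values, block_size):
    """nonPT: one partial result per block, each written out before a
    later step combines them."""
    partials = [
        sum(values[i:i + block_size]) for i in range(0, len(values), block_size)
    ]
    return sum(partials), len(partials)


def pt_reduce(values, num_groups):
    """PT: each resident group strides over many items and keeps a running
    total in its own state until the end."""
    partial = [0] * num_groups  # stands in for per-group on-chip state
    for g in range(num_groups):
        for i in range(g, len(values), num_groups):
            partial[g] += values[i]
    return sum(partial), partial


data = list(range(16))
total_nonpt, num_partials = nonpt_reduce(data, block_size=2)
total_pt, group_totals = pt_reduce(data, num_groups=4)
```

Both versions compute the same sum; the PT version ends with only as many partial results as there are resident groups, because each group accumulated many items before writing anything out.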
![Page 29: Study: Persistent Threads Style Programming Model for GPU … · 2013-08-23 · Study: Persistent Threads Style Programming Model for GPU Computing Author: Kshitij Gupta, Jeff Stuart](https://reader034.fdocuments.us/reader034/viewer/2022050514/5f9e0446ec970e68360d8c9d/html5/thumbnails/29.jpg)
Use Case #3: Workload Illustration – Reduction
![Page 30: Study: Persistent Threads Style Programming Model for GPU … · 2013-08-23 · Study: Persistent Threads Style Programming Model for GPU Computing Author: Kshitij Gupta, Jeff Stuart](https://reader034.fdocuments.us/reader034/viewer/2022050514/5f9e0446ec970e68360d8c9d/html5/thumbnails/30.jpg)
Use Case #3: Workload – Reduction
Use Case #4: Global Synchronization
Scenario: Global synchronization within a kernel across workgroups
Advantage of Persistent Threads: In a nonPT kernel, synchronizing across blocks within a kernel is not possible because blocks run to completion and cannot wait for blocks that have not yet been scheduled. The PT model ensures that all blocks are resident and thus allows global synchronization.
[Figure: (a) nonPT: Kernel-X1, Kernel-X2, Kernel-X3, with cross-block barrier #1 and cross-block barrier #2; (b) PT: GPU Kernel-X' with PT barrier #1 and PT barrier #2.]
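The residency argument can be tried out on the CPU with ordinary threads. This Python sketch uses `threading.Barrier` as a stand-in for a PT global barrier; `run_phases` is a hypothetical helper, and host threads are only an analogy for resident thread groups.

```python
import threading


def run_phases(num_groups, num_phases):
    """All groups are started up front, so a barrier between phases can
    complete; every group waits until the others finish the same phase."""
    barrier = threading.Barrier(num_groups)  # plays the role of a PT barrier
    log = []
    lock = threading.Lock()

    def group(gid):
        for phase in range(num_phases):
            with lock:
                log.append((phase, gid))  # work belonging to this phase
            barrier.wait()                # no group starts the next phase early

    threads = [
        threading.Thread(target=group, args=(g,)) for g in range(num_groups)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


log = run_phases(num_groups=4, num_phases=3)
phases = [phase for phase, _ in log]
```

If fewer than `num_groups` threads were ever started, `barrier.wait()` would never return, which mirrors why blocks that are not yet scheduled make a cross-block barrier impossible in the nonPT model.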
![Page 32: Study: Persistent Threads Style Programming Model for GPU … · 2013-08-23 · Study: Persistent Threads Style Programming Model for GPU Computing Author: Kshitij Gupta, Jeff Stuart](https://reader034.fdocuments.us/reader034/viewer/2022050514/5f9e0446ec970e68360d8c9d/html5/thumbnails/32.jpg)
Use Case #4: Global Synchronization
Portability & Usability
# Use Case Occupancy Scheduling Comments
1 CPU-GPU Synchronization ----- ----- indirect; CPU-GPU workload partitioning
2 Load Balancing/Irregular Parallelism ----- non-trivial when sophisticated queuing structures (local + global) and work stealing/donation optimizations are used
3 Maintaining Active State different kernel organization and partitioning strategies
4 Global Synchronization ----- hard to debug as occupancy changes
Looking Ahead…

| # | Use Case | Discussion |
| --- | --- | --- |
| 1 | CPU-GPU Synchronization | Less of an issue on future consumer systems, but still a problem for HPC. We expect this to be addressed in the future. |
| 2 | Load Balancing/Irregular Parallelism | Provide support for queues. |
| 3 | Maintaining Active State | Very hard to solve! |
| 4 | Global Synchronization | Kernel launch might be cheaper than synchronizing across an entire chip in future-generation hardware. |

Looking Ahead…
Return-on-Investment
Power?
Modifications?
Native support would not require a complete re-making of the underlying hardware
Small changes could lead to reasonable gains
Augment existing APIs?
Kshitij Gupta www.kshitijgupta.com
Thank You!