Transcript of "Introduction to Dynamic Parallelism" (NVIDIA GTC 2012)
Source: developer.download.nvidia.com/GTC/PDF/GTC2012/...Improving...
Introduction to Dynamic Parallelism
Stephen Jones NVIDIA Corporation
Improving Programmability
- Dynamic Parallelism
- Occupancy
- Simplify CPU/GPU Divide
- Library Calls from Kernels
- Batching to Help Fill GPU
- Dynamic Load Balancing
- Data-Dependent Execution
- Recursive Parallel Algorithms
What is Dynamic Parallelism?
The ability to launch new grids from the GPU:
- Dynamically
- Simultaneously
- Independently
Fermi: only the CPU can generate GPU work. Kepler: the GPU can generate work for itself.
What Does It Mean?
From GPU as co-processor to autonomous, dynamic parallelism.
The Simplest Parallel Program

```
for i = 1 to N
    for j = 1 to M
        convolution(i, j)
    next j
next i
```
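The fixed-bound case above needs no dynamic parallelism at all: the N × M loop nest maps directly onto a 2-D grid, one thread per (i, j) pair. A minimal sketch, assuming a hypothetical per-element `convolution` device function:

```cuda
__device__ void convolution(int i, int j)
{
    /* hypothetical per-element work */
}

// One thread per (i, j) pair of the N x M iteration space.
__global__ void convolveAll(int N, int M)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N && j < M)
        convolution(i, j);
}

// Host launch, rounding the grid up to cover the full N x M space:
// dim3 block(16, 16);
// dim3 grid((M + 15) / 16, (N + 15) / 16);
// convolveAll<<< grid, block >>>(N, M);
```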
The Simplest Impossible Parallel Program

```
for i = 1 to N
    for j = 1 to x[i]
        convolution(i, j)
    next j
next i
```
With a data-dependent inner bound x[i], Fermi leaves only two bad alternatives:
- Bad alternative #1: Oversubscription — launch an N × max(x[i]) grid and waste threads on the padding.
- Bad alternative #2: Serialisation — launch N threads and let each loop over its own x[i] serially.
The Now-Possible Parallel Program
Serial program:

```
for i = 1 to N
    for j = 1 to x[i]
        convolution(i, j)
    next j
next i
```

CUDA program (now possible with dynamic parallelism):

```cuda
__global__ void convolution(int x[])
{
    for (int j = 1; j <= x[blockIdx.x]; j++)
        kernel<<< ... >>>(blockIdx.x, j);
}

convolution<<< N, 1 >>>(x);
```
Data-Dependent Parallelism
CUDA today vs. CUDA on Kepler: computational power allocated to regions of interest.
Dynamic Work Generation
Fixed grid: statically assign a conservative worst-case grid.
Dynamic grid: start from a coarse initial grid, then dynamically assign performance where accuracy is required.
Mapping Compute to the Problem
Library Calls & Nested Parallelism
LU decomposition on Fermi — CPU code drives every step, with interleaved memcpys:

```
dgetrf(N, N)
{
    for j = 1 to N
        for i = 1 to 64
            idamax<<<>>>
            memcpy
            dswap<<<>>>
            memcpy
            dscal<<<>>>
            dger<<<>>>
        next i
        memcpy
        dlaswap<<<>>>
        dtrsm<<<>>>
        dgemm<<<>>>
    next j
}
```

LU decomposition on Kepler — the CPU code shrinks to a single call:

```
dgetrf(N, N)
{
    dgetrf<<<>>>
    synchronize();
}
```

GPU code (Kepler) — the loop nest moves into the device-side dgetrf:

```
dgetrf(N, N)
{
    for j = 1 to N
        for i = 1 to 64
            idamax<<<>>>
            dswap<<<>>>
            dscal<<<>>>
            dger<<<>>>
        next i
        dlaswap<<<>>>
        dtrsm<<<>>>
        dgemm<<<>>>
    next j
}
```
Batched & Nested Parallelism
CPU-Controlled Work Batching (multiple LU-decompositions, pre-Kepler):
- CPU programs are limited by a single point of control
- Can run at most 10s of threads
- CPU is fully consumed with controlling launches
Each CPU control thread steps its batch through dgetf2, dswap, dtrsm, dgemm in turn.
(Algorithm flow simplified for illustrative purposes.)
Batched & Nested Parallelism
Batching via Dynamic Parallelism (batched LU-decomposition, Kepler):
- Move top-level loops to the GPU
- Run thousands of independent tasks
- Release the CPU for other work
A single CPU control thread launches many GPU control threads, each of which drives its own dgetf2, dswap, dtrsm, dgemm sequence.
(Algorithm flow simplified for illustrative purposes.)
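The batching pattern above can be sketched as a launcher kernel in which each block's thread 0 drives one decomposition. The child kernels here (`dgetf2_kernel` and friends) are hypothetical stand-ins, not real CUBLAS entry points, and the per-stage launch configurations are placeholders:

```cuda
// Hypothetical child kernels standing in for the LU stages.
__global__ void dgetf2_kernel(double *A, int n);
__global__ void dswap_kernel(double *A, int n);
__global__ void dtrsm_kernel(double *A, int n);
__global__ void dgemm_kernel(double *A, int n);

// One GPU control thread per matrix: block b owns matrices[b].
// Child launches issued by the same thread go into the same stream,
// so the four stages run in order for each matrix, while the
// matrices themselves proceed independently of one another.
__global__ void batchedLU(double **matrices, int n)
{
    if (threadIdx.x == 0) {
        double *A = matrices[blockIdx.x];
        dgetf2_kernel<<< 1, 64 >>>(A, n);
        dswap_kernel <<< 1, 64 >>>(A, n);
        dtrsm_kernel <<< 1, 64 >>>(A, n);
        dgemm_kernel <<< 1, 256 >>>(A, n);
        cudaDeviceSynchronize();   // wait for this matrix's pipeline
    }
}

// Host side: one launch covers the whole batch.
// batchedLU<<< numMatrices, 1 >>>(matrices, n);
```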
Familiar Syntax
CUDA from CPU:

```cuda
void main()
{
    float *data;
    do_stuff(data);

    A<<< ... >>>(data);
    B<<< ... >>>(data);
    C<<< ... >>>(data);

    cudaDeviceSynchronize();
    do_more_stuff(data);
}
```

CUDA from GPU:

```cuda
__global__ void B(float *data)
{
    do_stuff(data);

    X<<< ... >>>(data);
    Y<<< ... >>>(data);
    Z<<< ... >>>(data);

    cudaDeviceSynchronize();
    do_more_stuff(data);
}
```
Reminder: Dependencies in CUDA
CPU code:

```cuda
void main()
{
    float *data;
    do_stuff(data);
    A<<< ... >>>(data);
    B<<< ... >>>(data);
    C<<< ... >>>(data);
    cudaDeviceSynchronize();
    do_more_stuff(data);
}
```

On the GPU, A, B and C go into the same stream and therefore run in order: A → B → C.
Nested Dependencies
The CPU code is unchanged, but B now launches X, Y and Z from the GPU:

```cuda
__global__ void B(float *data)
{
    do_stuff(data);
    X<<< ... >>>(data);
    Y<<< ... >>>(data);
    Z<<< ... >>>(data);
    cudaDeviceSynchronize();
    do_more_stuff(data);
}
```

CPU code:

```cuda
void main()
{
    float *data;
    do_stuff(data);
    A<<< ... >>>(data);
    B<<< ... >>>(data);
    C<<< ... >>>(data);
    cudaDeviceSynchronize();
    do_more_stuff(data);
}
```

The dependency chain is now A → B (which contains X → Y → Z) → C.
Programming Model Basics
CUDA Runtime syntax & semantics:
- Launch is per-thread
- Sync includes all launches by any thread in the block
- cudaDeviceSynchronize() does not imply __syncthreads()
- Asynchronous launches only (note the bug in this program!)

Code example:

```cuda
__device__ float buf[1024];

__global__ void dynamic(float *data)
{
    int tid = threadIdx.x;
    if(tid % 2)
        buf[tid/2] = data[tid] + data[tid+1];
    __syncthreads();

    if(tid == 0) {
        launch<<< 128, 256 >>>(buf);
        cudaDeviceSynchronize();
    }
    __syncthreads();

    cudaMemcpyAsync(data, buf, 1024);
    cudaDeviceSynchronize();
}
```
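One reading of the flagged bug: every thread in the block issues its own cudaMemcpyAsync, queueing up to 1024 redundant copies (and the three-argument call also elides the size/kind arguments of the real API). A corrected sketch under that reading, issuing the copy once from thread 0:

```cuda
__device__ float buf[1024];

__global__ void dynamic(float *data)
{
    int tid = threadIdx.x;
    if (tid % 2)
        buf[tid/2] = data[tid] + data[tid+1];
    __syncthreads();

    if (tid == 0) {
        launch<<< 128, 256 >>>(buf);
        cudaDeviceSynchronize();            // thread 0 waits for the child grid

        // Issue the copy once, from one thread only.
        cudaMemcpyAsync(data, buf, 1024 * sizeof(float),
                        cudaMemcpyDeviceToDevice);
        cudaDeviceSynchronize();            // wait for the async copy
    }
    __syncthreads();                        // the rest of the block waits here
}
```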
Example 1: Simple Library Calls
Flow: the CPU launches the kernel → per-block data generation → call of a 3rd-party library → the 3rd-party library executes its own launch → parallel use of the result.

```cuda
__global__ void libraryCall(float *a, float *b, float *c)
{
    // All threads generate data
    createData(a, b);
    __syncthreads();

    // Only one thread calls library
    if(threadIdx.x == 0) {
        cublasDgemm(a, b, c);
        cudaDeviceSynchronize();
    }

    // All threads wait for dgemm
    __syncthreads();

    // Now continue
    consumeData(c);
}
```
Example 1: Simple Library Calls — things to notice
- Sync before launch to ensure all data is ready
- Per-thread execution semantics
- Single call to the external library function (the launch is performed by the external library, but we synchronize in our own kernel)
- cudaDeviceSynchronize() by the launching thread
- __syncthreads() before consuming data
Example 2: Parallel Recursion
Simple example: Quicksort
- Typical divide-and-conquer algorithm
- Recursively partition-and-sort data
- Entirely data-dependent execution
- Notoriously hard to do efficiently on Fermi
(Diagram: the array 3 2 6 3 9 1 4 2 5 1 8 7 9 2 5 8 is partitioned around a pivot, each partition is partitioned again, and so on until, eventually, the data is fully sorted.)
Example 2: Parallel Recursion
Flow: select a pivot value → for each element, retrieve its value → store it left if value < pivot, right if value >= pivot → once all elements are placed, recurse the sort into the left-hand and right-hand subsets.

```cuda
__global__ void qsort(int *data, int l, int r)
{
    int pivot = data[0];
    int *lptr = data+l, *rptr = data+r;

    // Partition data around pivot value
    partition(data, l, r, lptr, rptr, pivot);

    // Launch next stage recursively
    if(l < (rptr-data))
        qsort<<< ... >>>(data, l, rptr-data);
    if(r > (lptr-data))
        qsort<<< ... >>>(data, lptr-data, r);
}
```
Example 2: Parallel Recursion — with streams
Achieve concurrency by launching the left- and right-hand sorts in separate streams:

```cuda
__global__ void qsort(int *data, int l, int r)
{
    int pivot = data[0];
    int *lptr = data+l, *rptr = data+r;

    // Partition data around pivot value
    partition(data, l, r, lptr, rptr);

    // Now the recursive launch part.
    // Use streams this time!
    cudaStream_t s1, s2;
    cudaStreamCreateWithFlags(&s1, ...);
    cudaStreamCreateWithFlags(&s2, ...);

    int rx = rptr-data, lx = lptr-data;
    if(l < rx) qsort<<< ..., 0, s1 >>>(data, l, rx);
    if(r > lx) qsort<<< ..., 0, s2 >>>(data, lx, r);
}
```

Compare the simplicity of this recursion to the complexity of the equivalent program on Fermi.
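The elided stream-creation flags above would plausibly be cudaStreamNonBlocking, which the device runtime requires for device-side streams; the elided launch configurations are a free choice (a single 256-thread block here). A fleshed-out version of the launch section under those assumptions:

```cuda
__global__ void qsort(int *data, int l, int r);   // as on the slide

__device__ void launch_children(int *data, int l, int r,
                                int *lptr, int *rptr)
{
    // Device-side streams must be created non-blocking.
    cudaStream_t s1, s2;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);

    int rx = rptr - data, lx = lptr - data;
    if (l < rx) qsort<<< 1, 256, 0, s1 >>>(data, l, rx);
    if (r > lx) qsort<<< 1, 256, 0, s2 >>>(data, lx, r);

    // Destroying the handles does not cancel or wait for the work;
    // it simply releases them once the child grids drain.
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```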
Basic Rules
Programming model — manifestly the same as CUDA:
- Launch is per-thread
- Sync is per-block
- CUDA primitives are per-block (cannot pass streams/events to children)
- cudaDeviceSynchronize() != __syncthreads()
- Events allow inter-stream dependencies
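The last rule (events for inter-stream dependencies) can be sketched inside a kernel as follows; `producer` and `consumer` are hypothetical child kernels, and note that the device runtime only supports events created with cudaEventDisableTiming:

```cuda
__global__ void producer(float *buf);   // hypothetical child kernels
__global__ void consumer(float *buf);

__global__ void pipeline(float *buf)
{
    if (threadIdx.x == 0) {
        cudaStream_t s1, s2;
        cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
        cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);

        cudaEvent_t done;
        cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

        producer<<< 1, 256, 0, s1 >>>(buf);
        cudaEventRecord(done, s1);          // mark end of producer in s1
        cudaStreamWaitEvent(s2, done, 0);   // s2 will not start until then
        consumer<<< 1, 256, 0, s2 >>>(buf);
    }
}
```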
Execution Rules
Execution model:
- Each block runs CUDA independently
- All launches & copies are async
- Constants are set from the host
- Textures/surfaces are bound only from the host
- ECC errors are reported at the host
Memory Consistency Rules
Memory model:
- Launch implies membar (the child sees the parent's state at time of launch)
- Sync implies invalidate (the parent sees the child's writes after sync)
- Texture changes by a child are visible to the parent after sync (i.e. sync == tex cache invalidate)
- Constants are immutable
- Local & shared memory are private: they cannot be passed as child kernel args
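A sketch of the last rule, with a hypothetical `child` kernel: global memory may be passed down to a child grid, but pointers into local or shared memory may not:

```cuda
__global__ void child(int *ptr);   // hypothetical child kernel

__device__ int g_buf[256];         // global: visible to child grids

__global__ void parent()
{
    __shared__ int s_buf[256];     // shared: private to this block
    int l_buf[4];                  // local: private to this thread

    child<<< 1, 256 >>>(g_buf);    // OK: global memory
    // child<<< 1, 256 >>>(s_buf); // ILLEGAL: shared memory pointer
    // child<<< 1, 256 >>>(l_buf); // ILLEGAL: local memory pointer
}
```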