Introduction to GPU Programming
Volodymyr (Vlad) Kindratenko Innovative Systems Laboratory @ NCSA
Institute for Advanced Computing Applications and Technologies (IACAT)
V. Kindratenko, Introduction to GPU Programming (part II), December 2010, The American University in Cairo, Egypt
Part II
• GPU programming model
• Hands-on: Mandelbrot set fractal renderer
– Reference implementation
– GPU implementation
CUDA Programming Model
• A CUDA kernel is executed by an array of threads
– All threads run the same code (SPMD)
– Each thread has an ID that it uses to compute memory addresses and make control decisions
    …
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
    …
• Threads are arranged as a grid of thread blocks
– Threads within a block have access to a segment of shared memory
[Figure: a grid of thread blocks, Thread Block 0 through Thread Block N-1, each with its own shared memory]
Kernel Invocation Syntax
    vecAdd<<<32, 512>>>(devPtrA, devPtrB, devPtrC);
• <<<32, 512>>> sets the grid and thread block dimensionality: 32 thread blocks of 512 threads each
    int i = blockIdx.x * blockDim.x + threadIdx.x;
• blockIdx.x – block ID within the grid
• blockDim.x – number of threads per block
• threadIdx.x – thread ID within a thread block
[Figure: a grid of thread blocks, Thread Block 0 through Thread Block N-1, each with its own shared memory]
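The index arithmetic above can be checked on the host without a GPU. The plain C sketch below emulates a 1D launch with two nested loops. The names global_index and count_unique_indices are illustrative assumptions and not part of CUDA.

```c
/* Hypothetical helper: the same arithmetic as
   blockIdx.x * blockDim.x + threadIdx.x in a 1D launch. */
static int global_index(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* Emulates a 1D launch of grid_dim blocks with block_dim threads each:
   every emulated thread marks the element it would own.
   visited must hold grid_dim * block_dim entries.
   Returns the number of elements that were marked exactly once. */
static int count_unique_indices(int grid_dim, int block_dim,
                                unsigned char *visited)
{
    int total = grid_dim * block_dim;
    int unique = 0;
    for (int i = 0; i < total; i++)
        visited[i] = 0;
    for (int b = 0; b < grid_dim; b++)
        for (int t = 0; t < block_dim; t++)
            visited[global_index(b, block_dim, t)]++;
    for (int i = 0; i < total; i++)
        if (visited[i] == 1)
            unique++;
    return unique;
}
```

With 32 blocks of 512 threads, the indices run from 0 through 16383 with no gaps or duplicates, so each thread owns exactly one array element.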
Mapping Threads to the Hardware
• Blocks of threads are transparently assigned to SMs
– A block of threads executes on one SM & does not migrate
– Several blocks can reside concurrently on one SM
• Blocks must be independent
– Any possible interleaving of blocks should be valid
– Blocks may coordinate but not synchronize
– Thread blocks can run in any order
[Figure: the same kernel grid of 8 blocks (Block 0 through Block 7) scheduled over time on a device that runs 2 blocks at a time and on a device that runs 4 blocks at a time]
Each block can execute in any order relative to other blocks.
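The independence requirement can be illustrated on the host. In this plain C sketch (an emulation, with the hypothetical names run_block and run_grid), each emulated block reads and writes only its own slice of the data, so running the blocks in ascending or descending order produces the same output.

```c
/* Work of one emulated thread block: squares its own slice of the input.
   It touches only the elements owned by this block. */
static void run_block(int block_idx, int block_dim,
                      const float *in, float *out)
{
    for (int t = 0; t < block_dim; t++) {
        int i = block_idx * block_dim + t;
        out[i] = in[i] * in[i];
    }
}

/* Runs all blocks in ascending order (reverse == 0)
   or in descending order (reverse != 0). */
static void run_grid(int grid_dim, int block_dim, int reverse,
                     const float *in, float *out)
{
    for (int k = 0; k < grid_dim; k++) {
        int b = reverse ? grid_dim - 1 - k : k;
        run_block(b, block_dim, in, out);
    }
}
```

If a block depended on a result written by another block, the two orders could give different answers, which is exactly what the rule above forbids.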
Slide is courtesy of NVIDIA
CUDA Programming Model
• A kernel is executed as a grid of thread blocks
– Grid of blocks can be 1- or 2-dimensional
– Thread blocks can be 1-, 2-, or 3-dimensional
• Different kernels can have different grid/block configurations
• Threads from the same block have access to shared memory, and their execution can be synchronized
Slide is courtesy of NVIDIA
[Figure: the host launches Kernel 1 and Kernel 2 on the device as Grid 1 and Grid 2; Grid 1 is a 3×2 arrangement of blocks, Block (0, 0) through Block (2, 1); Block (1, 1) is expanded to show its threads, labeled with 3D indices such as (0,0,0), (3,1,0), and (3,0,1)]
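For multi-dimensional blocks, a thread's (x, y, z) index is commonly flattened into a single number with x varying fastest. The plain C sketch below mirrors that convention. The struct dim3_t and the function flat_thread_id are hypothetical stand-ins for CUDA's dim3, threadIdx, and blockDim, chosen so the arithmetic can be tried on the host.

```c
/* Hypothetical stand-in for CUDA's dim3: holds either the extents
   of a block or the index of a thread inside it. */
typedef struct { int x, y, z; } dim3_t;

/* Flattens a 3D thread index within a block of the given dimensions.
   x varies fastest, then y, then z. */
static int flat_thread_id(dim3_t idx, dim3_t dim)
{
    return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}
```

Assuming a 4×2×2 block, thread (3,1,0) maps to 7 and thread (0,0,1) maps to 8, so the first z-layer occupies IDs 0 through 7.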
GPU Memory Hierarchy
• Global (device) memory
– Accessible by all threads as well as host (CPU)
– Data lifetime is from allocation to deallocation
[Figure: host memory alongside Device 0 memory and Device 1 memory; data is moved between them with cudaMemcpy()]
GPU Memory Hierarchy
• Global (device) memory
[Figure: Thread Block 0 through Thread Block N-1 of Kernel 0 and of Kernel 1 all access the same per-device global memory]
GPU Memory Hierarchy
• Local storage
– Each thread has its own local storage
– Mostly registers (managed by the compiler)
– Data lifetime = thread lifetime
• Shared memory
– Each thread block has its own shared memory
– Accessible only by threads within that block
– Data lifetime = block lifetime
[Figure: a thread block with its per-block shared memory; each thread has its own per-thread local memory]
GPU Memory Hierarchy
• 1D grid
– 2 thread blocks
• 1D block
– 2 threads
[Figure: a grid of 2 thread blocks (block 0 and block 1), each with its own shared memory and 2 threads (thread 0 and thread 1) with their own registers; global memory, constant memory, and host memory sit outside the grid]
GPU Memory Hierarchy
Memory    Location  Cached  Access  Scope                   Lifetime
Register  On-chip   N/A     R/W     One thread              Thread
Local     Off-chip  No      R/W     One thread              Thread
Shared    On-chip   N/A     R/W     All threads in a block  Block
Global    Off-chip  No      R/W     All threads + host      Application
Constant  Off-chip  Yes     R       All threads + host      Application
Texture   Off-chip  Yes     R       All threads + host      Application
[Figure: the host (CPU, chipset, DRAM) and the device; device DRAM holds local, global, constant, and texture memory; each GPU multiprocessor has registers, shared memory, and constant and texture caches]
Porting Mandelbrot set fractal renderer to CUDA
• Source is in ~/tutorial/src2
– fractal.c – reference C implementation
– Makefile – make file
– fractal.cu.reference – CUDA implementation for reference
Getting started
• cd tutorial/src2
• make cpu
• ./fractal_cpu
• make convert
• copy fractal.bmp to your desktop
• display fractal.bmp on your desktop
Reference C Implementation
    void makefractal_cpu(unsigned char *image, int width, int height,
                         double xupper, double xlower,
                         double yupper, double ylower)
    {
        int x, y;
        /* Step between neighboring pixels in the complex plane */
        double xinc = (xupper - xlower) / width;
        double yinc = (yupper - ylower) / height;

        /* Visit every pixel, row by row */
        for (y = 0; y < height; y++) {
            for (x = 0; x < width; x++) {
                image[y*width+x] = iter((xlower + x*xinc), (ylower + y*yinc));
            }
        }
    }
Reference C Implementation
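    /* (a, b) is the constant c; (c_x, c_y) holds the current orbit value z. */
    static inline unsigned char iter(double a, double b)
    {
        unsigned char i = 0;
        double c_x = 0, c_y = 0;
        double c_x_tmp, c_y_tmp;
        double D = 4.0;  /* squared escape radius */

        /* i is an unsigned char: if the orbit has not diverged after
           255 steps, the final i++ wraps it to 0, so such points return 0. */
        while ((c_x*c_x+c_y*c_y < D) && (i++ < 255)) {
            c_x_tmp = c_x * c_x - c_y * c_y;
            c_y_tmp = 2* c_y * c_x;
            c_x = a + c_x_tmp;
            c_y = b + c_y_tmp;
        }
        return i;
    }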
The Mandelbrot set is generated by iterating the complex function $z^2 + c$, where $c$ is a constant:
$z_1 = (z_0)^2 + c$
$z_2 = (z_1)^2 + c$
$z_3 = (z_2)^2 + c$
and so forth. The sequence $z_0, z_1, z_2, \ldots$ is called the orbit of $z_0$ under iteration of $z^2 + c$. We stop iterating when the orbit starts to diverge, or when a maximum number of iterations is reached.
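As a quick check of the iteration rule, the plain C sketch below counts how many steps an orbit starting at $z_0 = 0$ takes to leave the disc of radius 2. The function name escape_steps and its int counter are illustrative choices that differ from the slide's iter(), which packs the count into an unsigned char.

```c
/* Iterates z <- z^2 + c from z = 0 for c = (cr, ci).
   Returns the number of steps taken before |z|^2 >= 4,
   or max_steps if the orbit stays bounded that long. */
static int escape_steps(double cr, double ci, int max_steps)
{
    double zr = 0.0, zi = 0.0;
    int n = 0;
    while (zr * zr + zi * zi < 4.0 && n < max_steps) {
        double zr_next = zr * zr - zi * zi + cr;  /* real part of z^2 + c */
        zi = 2.0 * zr * zi + ci;                  /* imaginary part of z^2 + c */
        zr = zr_next;
        n++;
    }
    return n;
}
```

For c = 1 the orbit is 0, 1, 2, 5, … and the loop stops after two steps, because |z|² reaches 4. For c = 0 and c = -1 the orbit stays bounded, so the function returns max_steps.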
CUDA Kernel Implementation
    __global__ void makefractal_gpu(unsigned char *image)
    {
        // One thread block per pixel: the block index is the pixel coordinate
        int x = blockIdx.x;
        int y = blockIdx.y;
        // The grid dimensions are the image dimensions
        int width = gridDim.x;
        int height = gridDim.y;
        // Region of the complex plane to render
        double xupper=-0.74624, xlower=-0.74758,
               yupper=0.10779, ylower=0.10671;
        double xinc = (xupper - xlower) / width;
        double yinc = (yupper - ylower) / height;
        image[y*width+x] = iter((xlower + x*xinc), (ylower + y*yinc));
    }
CUDA Kernel Implementation
    inline __device__ unsigned char iter(double a, double b)
    {
        unsigned char i = 0;
        double c_x = 0, c_y = 0;
        double c_x_tmp, c_y_tmp;
        double D = 4.0;

        while ((c_x*c_x+c_y*c_y < D) && (i++ < 255)) {
            c_x_tmp = c_x * c_x - c_y * c_y;
            c_y_tmp = 2* c_y * c_x;
            c_x = a + c_x_tmp;
            c_y = b + c_y_tmp;
        }
        return i;
    }
Host Code
    int width = 1024;
    int height = 768;
    unsigned char *image = NULL;
    unsigned char *devImage;

    // Allocate the image on the host and on the device
    image = (unsigned char*)malloc(width*height*sizeof(unsigned char));
    cudaMalloc((void**)&devImage, width*height*sizeof(unsigned char));

    // One thread block per pixel, one thread per block
    dim3 dimGrid(width, height);
    dim3 dimBlock(1);
    makefractal_gpu<<<dimGrid, dimBlock>>>(devImage);

    // Copy the rendered image back to the host
    cudaMemcpy(image, devImage, width*height*sizeof(unsigned char), cudaMemcpyDeviceToHost);

    free(image);
    cudaFree(devImage);
A Few Examples
• Example 1
– xupper=-0.74624
– xlower=-0.74758
– yupper=0.10779
– ylower=0.10671
– CPU time: 2.27 sec
– GPU time: 0.29 sec
• Example 2
– xupper=-0.754534912109
– xlower=-0.757077407837
– yupper=0.060144042969
– ylower=0.057710774740
– CPU time: 1.5 sec
– GPU time: 0.25 sec
Lab/Homework Exercises
• Exercise 1: Modify fractal code to improve efficiency
– hint: launch multiple threads per block
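One possible reading of the hint, sketched in plain C: give each block a tile of threads and round the grid size up so the tiles cover the whole image. The names blocks_needed and pixel_in_range are hypothetical, and the 16×16 tile mentioned below is an assumed block size, not one prescribed by the exercise.

```c
/* Number of blocks of size block_dim needed to cover n elements
   (ceiling division). */
static int blocks_needed(int n, int block_dim)
{
    return (n + block_dim - 1) / block_dim;
}

/* Guard a kernel would apply: threads in a partially filled tile
   must skip pixels that fall outside the image. */
static int pixel_in_range(int x, int y, int width, int height)
{
    return x < width && y < height;
}
```

For the 1024×768 image in the host code, 16×16 blocks would give a 64×48 grid. Each thread would then compute its pixel as x = blockIdx.x * blockDim.x + threadIdx.x (and likewise for y), and the image width and height could no longer be read from gridDim alone.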
Documentation
• NVIDIA’s documentation
– http://developer.nvidia.com/object/gpucomputing.html
– Programming Guide
– Best Practices Guide
– Reference Manual
• CUDA C SDK Code Samples
– http://developer.nvidia.com/object/cuda_3_2_downloads.html
• Books
– David Kirk, Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2010
– Jason Sanders, Edward Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley, 2010