GPU Programming
Lecture 2: CUDA C Basics
Miaoqing Huang, University of Arkansas
Spring 2015
Outline
Evolution of NVIDIA GPUs
CUDA Basics
Detailed Steps
  Device Memories and Data Transfer
  Kernel Functions and Threading
Architecture of G80 GPU
- 128 streaming processors in 16 streaming multiprocessors
Architecture of GT200 GPU (e.g., GeForce GTX 295)
- 240 streaming processors in 30 streaming multiprocessors
Architecture of Fermi GPU (e.g., GeForce GTX 480, Tesla C2075)
[Figure: Fermi streaming multiprocessor (SM) - instruction cache, dual warp schedulers with dual dispatch units, a 32,768 x 32-bit register file, 32 CUDA cores, 16 load/store (LD/ST) units, 4 special function units, an interconnect network, and 64 KB of shared memory / L1 cache. Each CUDA core pairs an FP unit and an INT unit behind a dispatch port, operand collector, and result queue. The full 16-SM Fermi GPU adds a shared L2 cache, the DRAM interfaces, the GigaThread engine, and the host interface.]
- 512 Fermi streaming processors in 16 streaming multiprocessors
Architecture of Kepler GPU (e.g., Tesla K20)
- 2,880 streaming processors in 15 streaming multiprocessors
Architecture of Kepler Streaming Multiprocessor
- Each streaming multiprocessor contains 192 single-precision cores and 64 double-precision cores
Comparison among Three Architectures

                              G80            GT200          Fermi
Transistors                   681 million    1.4 billion    3.0 billion
CUDA Cores                    128            240            512
Double Precision Capability   None           30 FMAs/clk    256 FMAs/clk
Single Precision Capability   128 MADs/clk   240 MADs/clk   512 FMAs/clk
Special Function Units / SM   2              2              4
Shared Memory / SM            16 KB          16 KB          48 KB / 16 KB
L1 Cache / SM                 None           None           16 KB / 48 KB
L2 Cache                      None           None           768 KB
ECC Support                   No             No             Yes
Multiply-Add (MAD): Result = (A x B) + C, where the product A x B is first rounded (extra digits truncated) before the addition, which is rounded again.
Fused Multiply-Add (FMA): Result = (A x B) + C, where all digits of the intermediate product are retained and only a single rounding is applied to the final result.
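To see the difference, a small worked example (the values and the three-significant-digit decimal format are assumed here for illustration, not taken from the slides): let A = 1.23, B = 4.56, C = -5.60.

    A x B = 5.6088                                   (exact product)
    MAD:  round(5.6088) = 5.61;  5.61 + (-5.60)   = 0.0100
    FMA:  5.6088 + (-5.60) = 0.0088, rounded once = 0.00880

The exact result is 0.0088, so the single rounding of FMA preserves accuracy that MAD loses to the intermediate truncation.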
Fermi vs. Kepler: Compute Capability of Fermi and Kepler GPUs

                                       FERMI        FERMI        KEPLER       KEPLER
                                       GF100        GF104        GK104        GK110
Compute Capability                     2.0          2.1          3.0          3.5
Threads / Warp                         32           32           32           32
Max Warps / Multiprocessor             48           48           64           64
Max Threads / Multiprocessor           1536         1536         2048         2048
Max Thread Blocks / Multiprocessor     8            8            16           16
32-bit Registers / Multiprocessor      32768        32768        65536        65536
Max Registers / Thread                 63           63           63           255
Max Threads / Thread Block             1024         1024         1024         1024
Shared Memory Configurations (bytes)   16K/48K      16K/48K      16K/32K/48K  16K/32K/48K
Max X Grid Dimension                   2^16 - 1     2^16 - 1     2^32 - 1     2^32 - 1
Hyper-Q                                No           No           No           Yes
Dynamic Parallelism                    No           No           No           Yes
Compute Capability
- The compute capability of a device is defined by a major revision number and a minor revision number
- Devices with the same major revision number are of the same core architecture
  - Kepler architecture: 3.x
  - Fermi architecture: 2.x
  - Prior devices: 1.x
- "NVIDIA CUDA C Programming Guide (v6.5)"
  - Appendix A lists all CUDA-enabled devices along with their compute capability
  - Appendix G gives the technical specifications of each compute capability
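In practice, the compute capability of the installed device can be queried from the CUDA runtime; a minimal sketch (standard cudaGetDeviceProperties call, with output formatting of our own choosing):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // major/minor are the two revision numbers, e.g., 3.5 for a GK110-based device
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}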
Outline
Evolution of NVIDIA GPUs
CUDA Basics
Detailed Steps
  Device Memories and Data Transfer
  Kernel Functions and Threading
Data Parallelism: Square Matrix Multiplication Example

[Figure: matrices M, N, and P, each of size WIDTH x WIDTH]

- P = M x N, where each matrix is of size WIDTH x WIDTH
- The approach in C

for (i = 0; i < WIDTH; i++) {
    for (j = 0; j < WIDTH; j++) {
        ......
        P[i][j] = ......;
        ......
    }
}

- The approach in CUDA (the straightforward way)
  - One thread calculates one element of P
  - Issue WIDTH x WIDTH threads simultaneously
The Heterogeneous Platform
- A typical GPGPU-capable platform consists of
  - One or more microprocessors (CPUs) - the host
  - One or more GPUs - the device
- GPU devices are connected to the CPUs through a PCIe 2.0 bus
- A CUDA program consists of
  - The code running on the CPU - the host part
  - The code running on the GPU - the device part
[Figure: example heterogeneous platform - CPUs linked by QPI (25.6 GB/s), GPUs attached over 16x PCIe 2.0 (5.8 GB/s unidirectional CPU-GPU), with on-board GPU memory bandwidths of 102 GB/s and 144 GB/s]
CUDA Program Structure
Serial Code (host)
    . . .
Parallel Kernel (device):  KernelA<<< nBlk, nTid >>>(args);    // executes as Grid 0
Serial Code (host)
    . . .
Parallel Kernel (device):  KernelB<<< nBlk, nTid >>>(args);    // executes as Grid 1
- Integrated host + device application C program
  - Sequential or modestly parallel parts in host C code
  - Highly parallel parts in device SPMD kernel C code
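A minimal sketch of this structure in code (the kernel names, launch sizes, and synchronization calls are our own illustration, not from the slides):

#include <cuda_runtime.h>

// Hypothetical kernels standing in for the highly parallel phases
__global__ void KernelA(float* data) { /* ... parallel work, executes as Grid 0 ... */ }
__global__ void KernelB(float* data) { /* ... parallel work, executes as Grid 1 ... */ }

int main(void) {
    float* d_data;
    cudaMalloc((void**)&d_data, 1024 * sizeof(float));   // assumed problem size

    /* ... serial host code: read and prepare input ... */

    KernelA<<<4, 256>>>(d_data);      // nBlk = 4 blocks, nTid = 256 threads per block
    cudaDeviceSynchronize();          // wait for Grid 0 to finish

    /* ... serial host code between the two kernels ... */

    KernelB<<<4, 256>>>(d_data);      // launches Grid 1
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}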
CUDA Thread Organization
[Figure: the host launches Kernel 1 as Grid 1 (blocks (0,0), (1,0), (0,1), (1,1)) and Kernel 2 as Grid 2 on the device; an expanded Block (1, 1) shows its threads labeled Thread (0,0,0) through Thread (3,1,0). Courtesy: NVIDIA]
- A kernel is implemented as a grid of threads
- Threads in a grid are further decomposed into blocks
  - A grid can be up to three-dimensional
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution
  - Efficiently sharing data through shared memory
- A block can be one-, two-, or three-dimensional
- Each block and each thread has its own ID
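These IDs are what device code uses to decide which data each thread works on; a small sketch (our own example, not from the slides) combining the built-in variables into a unique global index for a one-dimensional launch:

// Each thread derives a unique global index from its block ID, the block size,
// and its thread ID within the block.
__global__ void WhoAmI(int* out, int n) {
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (global_id < n)          // the last block may extend past the data
        out[global_id] = global_id;
}

Launched as WhoAmI<<<4, 256>>>(out, n), this creates a grid of 4 blocks x 256 threads = 1024 threads, each writing its own index.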
Matrix Multiplication: the main() function
int main(void) {
    // 1. Allocate and initialize the matrices M, N, P
    //    I/O to read the input matrices M and N
    ......

    // 2. M * N on the processor
    MatrixMultiplication(M, N, P, width);

    // 3. I/O to write the output matrix P
    //    Free matrices M, N, P
    ......
    return 0;
}
Matrix Multiplication: A Simple Host Version in C
void MatrixMultiplication(float* M, float* N, float* P, int width)
{
    for (int i = 0; i < width; i++) {
        for (int j = 0; j < width; j++) {
            float sum = 0;
            for (int k = 0; k < width; k++) {
                float a = M[i*width + k];
                float b = N[k*width + j];
                sum += a * b;
            }
            P[i*width + j] = sum;
        }
    }
}
[Figure: row-major layout - the 4 x 4 matrix with elements M0,0 through M3,3 is stored as a one-dimensional array row after row, so element Mi,j (row i, column j) sits at offset i*WIDTH + j]
Matrix Multiplication: Move computation to the device
void MatrixMultiplication(float* M, float* N, float* P, int width)
{
    int size = width*width*sizeof(float);
    float *Md, *Nd, *Pd;
    ......
    // 1. Allocate device memory for M, N, and P
    //    Copy M and N to the allocated device memory locations

    // 2. Kernel invocation code - to have the device perform
    //    the actual matrix multiplication

    // 3. Copy P from the device memory
    //    Free device matrices
}
Outline
Evolution of NVIDIA GPUs
CUDA Basics
Detailed Steps
  Device Memories and Data Transfer
  Kernel Functions and Threading
Memory Hierarchy on Device

- Memory hierarchy on the device
  - Global Memory
    - Main means of communicating between host and device
    - Long latency access
  - Shared Memory
    - Short latency
  - Registers
    - Per-thread local variables

[Figure: CUDA device memory model - the host exchanges data with per-grid global memory; each block has its own shared memory; each thread has its own registers]

- Data access
  - Device code can read/write per-thread registers, per-block shared memory, and per-grid global memory
  - Host code can transfer data to/from per-grid global memory
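A short sketch (our own illustration, not from the slides) of where each of these memory spaces shows up in device code:

#define TILE 256    // assumed block size for this sketch

__global__ void ScaleVector(float* g_data, int n) {
    __shared__ float tile[TILE];                        // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // 'i' lives in per-thread registers

    if (i < n)
        tile[threadIdx.x] = g_data[i];                  // read from per-grid global memory
    __syncthreads();                                    // every thread in the block reaches this point

    if (i < n)
        g_data[i] = 2.0f * tile[threadIdx.x];           // write the result back to global memory
}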
CUDA Device Memory Allocation

- cudaMalloc()
  - Allocates an object in the device Global Memory
  - Requires two parameters
    1. Address of a pointer to the allocated object
    2. Size of the allocated object

float* Md;    // d indicates a device data
int size = width*width*sizeof(float);

cudaMalloc((void**)&Md, size);
cudaFree(Md);
- cudaFree()
  - Frees an object from the device Global Memory
  - Requires one parameter: a pointer to the freed object
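Both calls return a cudaError_t status; a hedged sketch of the same allocation with error checking added (the checking style is ours, not part of the slides; assumes <stdio.h>, <stdlib.h>, and <cuda_runtime.h> are included):

float* Md = NULL;                          // d indicates a device data
int size = width * width * sizeof(float);

cudaError_t err = cudaMalloc((void**)&Md, size);
if (err != cudaSuccess) {
    // cudaGetErrorString turns the status code into a readable message
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}

/* ... use Md on the device ... */

cudaFree(Md);                              // release the device allocation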
CUDA Host-Device Data Transfer

- cudaMemcpy()
  - Memory data transfer
  - Requires four parameters
    1. Pointer to destination
    2. Pointer to source
    3. Number of bytes copied
    4. Type of transfer
  - Types of transfer
    - cudaMemcpyHostToHost
    - cudaMemcpyHostToDevice
    - cudaMemcpyDeviceToHost
    - cudaMemcpyDeviceToDevice
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
Matrix Multiplication: After Integrating the Data Transfer
void MatrixMultiplication(float* M, float* N, float* P, int width)
{
    int size = width*width*sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);

    // 2. Kernel invocation code -- to be shown later

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
Index (i.e., coordinates) of Block and Thread
[Figure: a block of threads arranged along the X, Y, and Z axes from origin O, each labeled with its (tx, ty, tz) coordinates, e.g., Thread (0,0,0) through Thread (3,1,0); alongside, a 4 x 4 matrix M whose element Mi,j is handled by one thread]
- Each block and each thread is assigned an index, i.e., blockIdx and threadIdx
  - blockIdx.x, blockIdx.y, blockIdx.z
  - threadIdx.x, threadIdx.y, threadIdx.z
- threadIdx.y --> row, threadIdx.x --> column
Define the Dimension of Grid and Block
- Use the pre-defined type dim3 to define the dimensions of the grid and the block
- Use gridDim and blockDim to get the dimensions of the grid and the block

dim3 dimGrid(width_g, height_g, depth_g);
dim3 dimBlock(width_b, height_b, depth_b);

- Total number of threads issued
  - width_g x height_g x depth_g x width_b x height_b x depth_b
- Launch the device computation threads
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
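For a concrete feel (example numbers of our own, not from the slides): with WIDTH = 1000 and a 16 x 16 block, the grid needs ceil(1000/16) = 63 blocks in each dimension:

int width = 1000;                                       // assumed matrix width
dim3 dimBlock(16, 16, 1);                               // 256 threads per block
dim3 dimGrid((width + dimBlock.x - 1) / dimBlock.x,     // ceiling division -> 63 blocks in x
             (width + dimBlock.y - 1) / dimBlock.y,     // 63 blocks in y
             1);

// 63 x 63 x 256 = 1,016,064 threads are issued; the threads that fall outside
// the 1000 x 1000 matrix must be masked off with a bounds check in the kernel.
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);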
Kernel Function Specification
// Matrix multiplication kernel -- per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;

    for (int k = 0; k < width; ++k) {
        float Melement = Md[ty*width + k];
        float Nelement = Nd[k*width + tx];
        Pvalue += Melement * Nelement;
    }

    Pd[ty*width + tx] = Pvalue;
}
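Note that this kernel indexes P with threadIdx only, so it is correct only when the whole matrix fits inside a single block; a hedged sketch of the multi-block variant, folding in blockIdx and the bounds check from the launch example above (our extension, not shown in the slides):

__global__ void MatrixMulKernelMultiBlock(float* Md, float* Nd, float* Pd, int width)
{
    // Global row and column of the P element handled by this thread
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < width && col < width) {       // mask off threads outside the matrix
        float Pvalue = 0;
        for (int k = 0; k < width; ++k)
            Pvalue += Md[row*width + k] * Nd[k*width + col];
        Pd[row*width + col] = Pvalue;
    }
}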
CUDA Function Declarations

                                   Executed on    Only callable from
__device__ float DeviceFunc()      device         device
__global__ void KernelFunc()       device         host
__host__   float HostFunc()        host           host
- __global__ defines a kernel function
  - Must be a void function
- __device__ and __host__ can be used together
  - Generates two versions of the same function during compilation
- For functions executed on the device
  - __global__ functions do not support recursion
  - __device__ functions only support recursion in device code compiled for devices of compute capability 2.x and higher
  - No static variable declarations inside the function
  - No variable number of arguments
  - No indirect function calls through pointers
- All functions are host functions by default
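A small sketch (our own example) of the combined qualifiers, which makes the compiler generate both a host and a device version of the same helper:

// Compiled twice: once for the host, once for the device
__host__ __device__ float clampf(float x, float lo, float hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

__global__ void ClampKernel(float* d, int n, float lo, float hi) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = clampf(d[i], lo, hi);    // calls the device version
}
// Ordinary host code can call clampf() as well, using the host version.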