CUDA Basics - PRACE
Transcript of CUDA Basics - PRACE
July 6, 2016
CUDA Basics
July 6, 2016 Slide 2
CUDA Kernels
Parallel portion of application: executes as a kernel
Entire GPU executes the kernel with many threads
CUDA threads: lightweight, fast switching, 1000s execute simultaneously
July 6, 2016 Slide 3
CUDA Kernels: Parallel Threads
A kernel is a function executed on the GPU as an array of threads in parallel
All threads execute the same code, but can take different paths
Each thread has an ID, used to select input/output data and to make control decisions:

float x = input[threadIdx.x];
float y = func(x);
output[threadIdx.x] = y;
© NVIDIA Corporation 2013
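The fragment above can be embedded in a complete kernel; a minimal sketch, assuming a one-block launch (the function body of func and all names are illustrative):

```cuda
// A stand-in for func(x) from the slide; any per-element function works here.
__device__ float func(float x) { return 2.0f * x; }

// All threads run the same code; the thread ID selects the data element.
__global__ void apply_func(const float* input, float* output)
{
    float x = input[threadIdx.x]; // each thread reads the element matching its ID
    float y = func(x);
    output[threadIdx.x] = y;      // ...and writes the matching output element
}
```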
July 6, 2016 Slides 4-7
CUDA Kernels: Subdivide into Blocks
Threads are grouped into blocks
Blocks are grouped into a grid
A kernel is executed as a grid of blocks of threads
[Figure: grid of thread blocks mapped onto the GPU and its DRAM]
July 6, 2016 Slide 8
Kernel Execution
[Figure: execution-model mapping: CUDA thread → CUDA core; CUDA thread block → CUDA Streaming Multiprocessor; CUDA kernel grid → GPU]
July 6, 2016 Slide 9
Thread blocks allow scalability
Blocks can execute in any order, concurrently or sequentially
This independence between blocks gives scalability: a kernel scales across any number of SMs

Kernel grid launch: Block 0, Block 1, ..., Block 7
Device with 2 SMs: SM 0 runs blocks 0, 2, 4, 6; SM 1 runs blocks 1, 3, 5, 7
Device with 4 SMs: SM 0 runs blocks 0, 4; SM 1 runs blocks 1, 5; SM 2 runs blocks 2, 6; SM 3 runs blocks 3, 7
July 6, 2016 Slide 10
Scale Kernel
GPU version:

__global__ void scale(float alpha, float* A, float* C, int m)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < m)
        C[i] = alpha * A[i];
}

CPU version for comparison:

void scale(float alpha, float* A, float* C, int m)
{
    for (int i = 0; i < m; ++i)
        C[i] = alpha * A[i];
}
July 6, 2016 Slide 11
Getting data in and out with Unified Memory
The GPU has separate memory, but transfers can be managed by the runtime
Allocate memory with cudaMallocManaged; free it with cudaFree
July 6, 2016 Slide 12
Define dimensions of thread block
dim3 blockDim(size_t blockDimX, size_t blockDimY, size_t blockDimZ)

On JURECA (Tesla K80):
Max. dimensions of a block: 1024 x 1024 x 64
Max. number of threads per block: 1024

Example:
// Create 3D thread block with 512 threads
dim3 blockDim(16, 16, 2);
July 6, 2016 Slide 13
Define dimensions of grid
dim3 gridDim(size_t gridDimX, size_t gridDimY, size_t gridDimZ)

On JURECA (Tesla K80):
Max. dimensions of a grid: 2147483647 x 65535 x 65535

Example:
// Dimension of problem: nx x ny = 1000 x 1000
dim3 blockDim(16, 16); // z = 1 can be omitted
int gx = (nx % blockDim.x == 0) ? nx / blockDim.x : nx / blockDim.x + 1;
int gy = (ny % blockDim.y == 0) ? ny / blockDim.y : ny / blockDim.y + 1;
dim3 gridDim(gx, gy);

Watch out! Integer division truncates: round the number of blocks up so the grid covers the whole problem.
July 6, 2016 Slide 14
Call the kernel
The call returns immediately! → The kernel executes asynchronously

Example (assumes m is divisible by blockDim):
scale<<<m/blockDim, blockDim>>>(alpha, a_gpu, c_gpu, m);

kernel<<<int gridDim, int blockDim>>>([arg]*)
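Because the launch is asynchronous, errors only surface later; a common checking sketch (assumes the scale kernel and pointers from the slides above):

```cuda
scale<<<(m + 255) / 256, 256>>>(alpha, a_gpu, c_gpu, m);

// The launch itself only reports configuration errors:
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

// Block until the kernel has finished; this also surfaces execution errors:
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```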
July 6, 2016 Slide 15
Calling the kernel
Define dimensions of thread block
Define dimensions of grid
Call the kernel
dim3 blockDim(size_t blockDimX, size_t blockDimY, size_t blockDimZ)
dim3 gridDim(size_t gridDimX, size_t gridDimY, size_t gridDimZ)
kernel<<<dim3 gridDim, dim3 blockDim>>>([arg]*)
July 6, 2016 Slide 16
Free device memory
cudaFree(void* pointer)
Example:
// Free the device memory pointed to by a_gpu
cudaFree(a_gpu);
July 6, 2016 Slide 17
Exercise
CudaBasics/exercises/tasks/scale_vector
Compile with nvcc -o scale_vector scale_vector_um.cu
July 6, 2016 Slide 18
Getting data in and out
The GPU has separate memory:
Allocate memory on device
Transfer data from host to device
Transfer data from device to host
Free device memory
July 6, 2016 Slide 19
Allocate memory on device
cudaMalloc(T** pointer, size_t nbytes)
Example:
// Allocate a vector of 2048 floats on the device
float* a_gpu;
int n = 2048;
cudaMalloc(&a_gpu, n * sizeof(float)); // &a_gpu: address of the pointer; sizeof(float): size of a float
July 6, 2016 Slide 20
Copy from host to device
cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind dir)
Example:
// Copy vector of floats a of length n=2048 to a_gpu on device
cudaMemcpy(a_gpu, a, n * sizeof(float), cudaMemcpyHostToDevice);
July 6, 2016 Slide 21
Copy from device to host
cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind dir)
Example:
// Copy vector of floats a_gpu of length n=2048 to a on host
cudaMemcpy(a, a_gpu, n * sizeof(float), cudaMemcpyDeviceToHost);

Note the argument order (destination comes first) and the changed direction flag.
July 6, 2016 Slide 22
Unified Virtual Address Space (UVA)
Requires a 64-bit application and a device of compute capability ≥ 2.0
cudaMalloc*(...) and cudaHostAlloc(...) return UVA pointers
cudaMemcpy*(..., cudaMemcpyDefault) infers the transfer direction from the pointers
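A minimal sketch of relying on UVA (assumes a 64-bit build; names are illustrative):

```cuda
float a[2048];
float* a_gpu;
int n = 2048;
cudaMalloc(&a_gpu, n * sizeof(float));   // returns a UVA pointer
// With UVA the runtime can tell a host pointer from a device pointer,
// so the direction can be left as cudaMemcpyDefault:
cudaMemcpy(a_gpu, a, n * sizeof(float), cudaMemcpyDefault);
```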
July 6, 2016 Slide 23
Getting data in and out
Allocate memory on device
cudaMalloc(void** pointer, size_t nbytes)

Transfer data between host and device
cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind dir)
dir = cudaMemcpyHostToDevice
dir = cudaMemcpyDeviceToHost

Free device memory
cudaFree(void* pointer)
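The three steps above, combined with the scale kernel, give the full explicit-transfer pattern; a minimal sketch (error handling omitted):

```cuda
#include <cstdio>

__global__ void scale(float alpha, float* A, float* C, int m)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < m)
        C[i] = alpha * A[i];
}

int main()
{
    const int n = 2048;
    float a[n], c[n];
    for (int i = 0; i < n; ++i) a[i] = 1.0f;

    // 1. Allocate device memory
    float *a_gpu, *c_gpu;
    cudaMalloc(&a_gpu, n * sizeof(float));
    cudaMalloc(&c_gpu, n * sizeof(float));

    // 2. Copy host -> device, run the kernel, copy device -> host
    //    (the device-to-host copy implicitly waits for the kernel)
    cudaMemcpy(a_gpu, a, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(2.0f, a_gpu, c_gpu, n);
    cudaMemcpy(c, c_gpu, n * sizeof(float), cudaMemcpyDeviceToHost);

    // 3. Free device memory
    cudaFree(a_gpu);
    cudaFree(c_gpu);

    printf("c[0] = %f\n", c[0]);
    return 0;
}
```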
July 6, 2016 Slide 24
Exercise Scale Vector
Allocate memory on device
cudaMalloc(T** pointer, size_t nbytes)

Transfer data between host and device
cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind dir)
dir = cudaMemcpyHostToDevice
dir = cudaMemcpyDeviceToHost

Free device memory
cudaFree(void* pointer)

Define dimensions of thread block
dim3 blockDim(size_t blockDimX, size_t blockDimY, size_t blockDimZ)

Define dimensions of grid
dim3 gridDim(size_t gridDimX, size_t gridDimY, size_t gridDimZ)

Call the kernel
kernel<<<dim3 gridDim, dim3 blockDim>>>([arg]*)
July 6, 2016 Slide 25
Exercise
CudaBasics/exercises/tasks/jacobi_w_explicit_transfer
Compile with make jacobi.