CUDA Basics - PRACE


Transcript of CUDA Basics - PRACE

Page 1: CUDA Basics - PRACE

July 6, 2016

Member of the Helmholtz Association

CUDA Basics

Page 2: CUDA Basics - PRACE


CUDA Kernels

The parallel portion of an application executes as a kernel; the entire GPU executes the kernel with many threads.

CUDA threads are lightweight, switch fast, and thousands of them execute simultaneously.

Page 3: CUDA Basics - PRACE


CUDA Kernels: Parallel Threads

A kernel is a function executed on the GPU as an array of threads in parallel

All threads execute the same code, but can take different paths

Each thread has an ID, used to select input/output data and to make control decisions:

float x = input[threadIdx.x];
float y = func(x);
output[threadIdx.x] = y;


Pages 4-7: CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks.
Blocks are grouped into a grid.
A kernel is executed as a grid of blocks of threads.

[Figure: threads grouped into blocks, blocks grouped into a grid, and the grid executing on a GPU backed by DRAM.]

Page 8: CUDA Basics - PRACE


Kernel Execution

[Figure: the software hierarchy maps onto the hardware.]

CUDA thread -> CUDA core
CUDA thread block -> CUDA Streaming Multiprocessor
CUDA kernel grid -> entire GPU (many SMs)
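How large these hardware resources are on a given card can be queried at runtime. A minimal sketch using the CUDA runtime's device-query API (device 0 assumed; error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("%s: %d SMs, max %d threads per block\n",
           prop.name, prop.multiProcessorCount, prop.maxThreadsPerBlock);
    return 0;
}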

Page 9: CUDA Basics - PRACE


Thread blocks allow scalability

Blocks can execute in any order, concurrently or sequentially.

This independence between blocks gives scalability: a kernel scales across any number of SMs.

[Figure: the same kernel grid of 8 blocks (Block 0 ... Block 7) launched on a device with 2 SMs, where the blocks run in four waves of two, and on a device with 4 SMs, where they run in two waves of four.]

Page 10: CUDA Basics - PRACE


Scale Kernel

GPU version: one thread per element, with the thread's global index selecting its element.

__global__ void scale(float alpha, float* A, float* C, int m)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < m)               // guard: the grid may contain more threads than elements
        C[i] = alpha * A[i];
}

CPU version for comparison: a sequential loop over all elements.

void scale(float alpha, float* A, float* C, int m)
{
    for (int i = 0; i < m; ++i)
        C[i] = alpha * A[i];
}

Page 11: CUDA Basics - PRACE


Getting data in and out with Unified Memory

The GPU has separate memory, but with Unified Memory the transfers can be managed by the runtime.

Allocate memory with cudaMallocManaged; free it with cudaFree.
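A minimal sketch of the complete Unified Memory flow, reusing the scale kernel from slide 10 (error checking omitted; the 256-thread block size is an arbitrary choice):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float alpha, float* A, float* C, int m)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < m)
        C[i] = alpha * A[i];
}

int main()
{
    int n = 2048;
    float *a, *c;
    cudaMallocManaged(&a, n * sizeof(float));    // one allocation, visible to host and device
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i)
        a[i] = (float)i;                         // initialize directly on the host
    scale<<<(n + 255) / 256, 256>>>(2.0f, a, c, n);
    cudaDeviceSynchronize();                     // wait for the kernel before reading c on the host
    printf("c[1] = %f\n", c[1]);
    cudaFree(a);                                 // managed memory is freed with cudaFree
    cudaFree(c);
    return 0;
}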

Page 12: CUDA Basics - PRACE


Define dimensions of thread block

On JURECA (Tesla K80):
Max. dimensions of a block: 1024 x 1024 x 64
Max. number of threads per block: 1024

dim3 blockDim(size_t blockDimX, size_t blockDimY, size_t blockDimZ)

Example:
// Create 3D thread block with 512 threads
dim3 blockDim(16, 16, 2);

Page 13: CUDA Basics - PRACE


Define dimensions of grid

On JURECA (Tesla K80): Max. dim. of a grid: 2147483647 x 65535 x 65535

dim3 gridDim(size_t gridDimX, size_t gridDimY, size_t gridDimZ)

Example:
// Dimension of problem: nx x ny = 1000 x 1000
dim3 blockDim(16, 16);  // no need to write z = 1

int gx = (nx % blockDim.x == 0) ? nx / blockDim.x : nx / blockDim.x + 1;
int gy = (ny % blockDim.y == 0) ? ny / blockDim.y : ny / blockDim.y + 1;
dim3 gridDim(gx, gy);

Watch out! The grid size must be rounded up; truncating the division would leave the last elements uncovered.
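A common equivalent idiom does the rounding up in a single integer expression:

// Same result as the conditionals above: integer ceiling division
int gx = (nx + blockDim.x - 1) / blockDim.x;
int gy = (ny + blockDim.y - 1) / blockDim.y;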

Page 14: CUDA Basics - PRACE


Call the kernel

kernel<<<int gridDim, int blockDim>>>([arg]*)

The call returns immediately! → The kernel executes asynchronously.

Example:
// assumes m is a multiple of blockDim (both ints here)
scale<<<m/blockDim, blockDim>>>(alpha, a_gpu, c_gpu, m);
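If the host needs the result next, it must wait explicitly. A minimal sketch, assuming the device pointers from the scale example:

scale<<<(m + 255) / 256, 256>>>(alpha, a_gpu, c_gpu, m);  // grid rounded up to cover all m elements
cudaDeviceSynchronize();  // block the host until the kernel has finished

A cudaMemcpy on the default stream also waits for preceding kernels, which is why the explicit-transfer examples on the later slides work without a separate synchronization call.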

Page 15: CUDA Basics - PRACE


Calling the kernel

Define dimensions of thread block

Define dimensions of grid

Call the kernel

dim3 blockDim(size_t blockDimX, size_t blockDimY, size_t blockDimZ)

dim3 gridDim(size_t gridDimX, size_t gridDimY, size_t gridDimZ)

kernel<<<dim3 gridDim, dim3 blockDim>>>([arg]*)
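Putting the three steps together for the 2D problem from slide 13, a sketch with a hypothetical kernel process and problem sizes nx, ny:

dim3 blockDim(16, 16);                             // 256 threads per block
dim3 gridDim((nx + blockDim.x - 1) / blockDim.x,   // round up in both dimensions
             (ny + blockDim.y - 1) / blockDim.y);
process<<<gridDim, blockDim>>>(data, nx, ny);      // 'process' and 'data' are hypothetical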

Page 16: CUDA Basics - PRACE


Free device memory

cudaFree(void* pointer)

Example:
// Free the device memory pointed to by a_gpu
cudaFree(a_gpu);

Page 17: CUDA Basics - PRACE


Exercise

CudaBasics/exercises/tasks/scale_vector

Compile with nvcc -o scale_vector scale_vector_um.cu

Page 18: CUDA Basics - PRACE


Getting data in and out

The GPU has separate memory:

Allocate memory on device
Transfer data from host to device
Transfer data from device to host
Free device memory

Page 19: CUDA Basics - PRACE


Allocate memory on device

cudaMalloc(T** pointer, size_t nbytes)

Example:
// Allocate a vector of 2048 floats on the device
float* a_gpu;
int n = 2048;
cudaMalloc(&a_gpu, n * sizeof(float));  // pass the address of the pointer; sizeof(float) gives the size of one element

Page 20: CUDA Basics - PRACE


Copy from host to device

cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind dir)

Example:
// Copy vector of floats a of length n=2048 to a_gpu on the device
cudaMemcpy(a_gpu, a, n * sizeof(float), cudaMemcpyHostToDevice);

Page 21: CUDA Basics - PRACE


Copy from device to host

cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind dir)

Example:
// Copy vector of floats a_gpu of length n=2048 back to a on the host
cudaMemcpy(a, a_gpu, n * sizeof(float), cudaMemcpyDeviceToHost);

Note the argument order (destination first) and the changed direction flag.

Page 22: CUDA Basics - PRACE


Unified Virtual Address Space (UVA)

Requires a 64-bit system and compute capability 2.0 or higher.

cudaMalloc*(...) and cudaHostAlloc(...) return UVA pointers.
cudaMemcpy*(..., cudaMemcpyDefault) infers the transfer direction from the pointers.
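With UVA the direction no longer needs to be spelled out. A sketch reusing a and a_gpu from the earlier slides:

// The runtime infers host-to-device from the two UVA pointers
cudaMemcpy(a_gpu, a, n * sizeof(float), cudaMemcpyDefault);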

Page 23: CUDA Basics - PRACE


Getting data in and out

Allocate memory on device:
cudaMalloc(void** pointer, size_t nbytes)

Transfer data between host and device:
cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind dir)
dir = cudaMemcpyHostToDevice
dir = cudaMemcpyDeviceToHost

Free device memory:
cudaFree(void* pointer)
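Putting it all together, a minimal sketch of the explicit-transfer workflow around the scale kernel from slide 10 (error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float alpha, float* A, float* C, int m)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < m)
        C[i] = alpha * A[i];
}

int main()
{
    const int n = 2048;
    float a[n], c[n];
    for (int i = 0; i < n; ++i)
        a[i] = (float)i;

    float *a_gpu, *c_gpu;                            // 1. allocate device memory
    cudaMalloc(&a_gpu, n * sizeof(float));
    cudaMalloc(&c_gpu, n * sizeof(float));

    cudaMemcpy(a_gpu, a, n * sizeof(float), cudaMemcpyHostToDevice);  // 2. host -> device
    scale<<<(n + 255) / 256, 256>>>(2.0f, a_gpu, c_gpu, n);
    cudaMemcpy(c, c_gpu, n * sizeof(float), cudaMemcpyDeviceToHost);  // 3. device -> host (waits for the kernel)

    cudaFree(a_gpu);                                 // 4. free device memory
    cudaFree(c_gpu);

    printf("c[1] = %f\n", c[1]);
    return 0;
}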

Page 24: CUDA Basics - PRACE


Exercise Scale Vector

Allocate memory on device:
cudaMalloc(T** pointer, size_t nbytes)

Transfer data between host and device:
cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind dir)
dir = cudaMemcpyHostToDevice
dir = cudaMemcpyDeviceToHost

Free device memory:
cudaFree(void* pointer)

Define dimensions of thread block:
dim3 blockDim(size_t blockDimX, size_t blockDimY, size_t blockDimZ)

Define dimensions of grid:
dim3 gridDim(size_t gridDimX, size_t gridDimY, size_t gridDimZ)

Call the kernel:
kernel<<<dim3 gridDim, dim3 blockDim>>>([arg]*)

Page 25: CUDA Basics - PRACE


Exercise

CudaBasics/exercises/tasks/jacobi_w_explicit_transfer

Compile with make jacobi.