Lecture 11: GPU programming

David Bindel

4 Oct 2011

Logistics

- Matrix multiply results are ready
  - Summary on assignments page
  - My version (and writeup) on CMS
- HW 2 due Thursday
- Still working on project 2!
- Start thinking about possible projects...

Matrix multiply outcome

[Figure: matrix multiply results; one axis runs from 0 to 800]

HW 2 comments

- Due Thursday night – don’t wait until the last minute!
  - This is not meant to be a hard assignment ...
  - ... but leave time to get confused and ask questions.
- Three basic tasks:
  - OpenMP: Parallelize by adding pragmas to code
  - MPI: Fill in missing communication routine
  - Both: Report on some performance experiments
- You can debug on your own computer
  - Need recent gcc to get OpenMP support
  - Need an MPI implementation – I recommend OpenMPI
  - Make sure to test with 1, 2, and 4 processes
- Make sure timings are done on the cluster worker nodes!

HW 2: Ghost cells revisited

[Figure: 1D mesh with global node indices 0-10; P0, P1, and P2 each hold local indices 0-4]
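The index labels in the figure (11 global nodes, 5 local nodes on each of three processors) are consistent with neighbouring processors overlapping by two nodes, since 3*5 - 2*2 = 11. The sketch below encodes that reading; the overlap of two and the names `NLOCAL`, `OVERLAP`, `local_to_global`, and `is_ghost` are assumptions for illustration, not taken from the assignment code.

```c
#include <assert.h>

/* Assumed layout: NLOCAL local nodes per processor, with neighbouring
 * processors sharing OVERLAP nodes (one ghost cell on each side). */
enum { NLOCAL = 5, OVERLAP = 2 };

/* Global index of local node ilocal on processor p. */
int local_to_global(int p, int ilocal)
{
    assert(p >= 0 && ilocal >= 0 && ilocal < NLOCAL);
    return p * (NLOCAL - OVERLAP) + ilocal;
}

/* Nonzero if the local node is a ghost cell, i.e. a copy of a node that a
 * neighbouring processor is responsible for updating. */
int is_ghost(int p, int nproc, int ilocal)
{
    int left_ghost  = (p > 0 && ilocal == 0);
    int right_ghost = (p < nproc - 1 && ilocal == NLOCAL - 1);
    return left_ghost || right_ghost;
}
```

Under this assumption P1's local node 0 is global node 3, which P0 owns as its local node 3, so P1 must receive that value from P0 after each step.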

Notes on timing

- Different notions of time:
  - clock() – processor time
  - omp_get_wtime() and MPI_Wtime() – wall-clock time
  - clock_gettime() – depends!
- I/O generally does not count toward processor time
- Generally care about wall clock time

Notes on timing

- Timer resolution is limited!
  - omp_get_wtick() – timer resolution in OpenMP
  - MPI_Wtick() – same in MPI
- Do enough steps to get reasonable timings
- When reporting time vs size, it’s reasonable to look at time/step
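A minimal sketch of the arithmetic behind "do enough steps": given the timer tick and a rough guess at the time per step, estimate how many steps keep one tick below a chosen fraction of the measured time. The helper names are hypothetical, and `wall_time` uses C11 `timespec_get` only as a stand-in for `omp_get_wtime()` or `MPI_Wtime()`.

```c
#include <assert.h>
#include <time.h>

/* Wall-clock time in seconds (stand-in for omp_get_wtime / MPI_Wtime). */
double wall_time(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double) ts.tv_sec + 1e-9 * (double) ts.tv_nsec;
}

/* Average time per step from one timed run of nsteps steps. */
double time_per_step(double elapsed, long nsteps)
{
    assert(nsteps > 0);
    return elapsed / (double) nsteps;
}

/* Smallest step count for which one timer tick is at most rel_err of the
 * total measured time, given an estimate of the time per step. */
long min_steps(double tick, double step_estimate, double rel_err)
{
    assert(tick > 0 && step_estimate > 0 && rel_err > 0);
    double x = tick / (rel_err * step_estimate);
    long n = (long) x;
    if ((double) n < x) ++n;   /* round up */
    return n < 1 ? 1 : n;
}
```

A typical use would record `wall_time()` before and after a loop of `nsteps` steps and report `time_per_step(t1 - t0, nsteps)`.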

... and now on to the main event ...

Some history

- Late 80s-early 90s: “golden age” for supercomputing
  - Companies: Thinking Machines, MasPar, Cray
  - Relatively fast processors (vs memory)
  - Lots of academic interest and development
  - But got hard to compete with commodity hardware
    - Scientific computing is not a market driver!
- 90s-early 2000s: age of the cluster
  - Beowulf, grid computing, etc.
  - “Big iron” also uses commodity chips (better interconnect)
- Past few years
  - CPU producers move to multicore
  - High-end graphics becomes commodity HW
    - Gaming is a market driver!
  - GPU producers realize their many-core designs can apply to general purpose computing

Thread design points

- Threads on desktop CPUs
  - Implemented via lightweight processes (for example)
  - General system scheduler
  - Thrashing when more active threads than processors
- An alternative approach
  - Hardware support for many threads / CPU
    - Modest example: hyperthreading
    - More extreme: Cray MTA-2 and XMT
  - Hide memory latency by thread switching
  - Want many more independent threads than cores
- GPU programming
  - Thread creation / context switching are basically free
  - Want lots of threads (thousands for efficiency?!)

General-purpose GPU programming

- Old GPGPU model: use texture mapping interfaces
  - People got good performance!
  - But too clever by half
- CUDA (Compute Unified Device Architecture)
  - More natural general-purpose programming model
  - Initial release in 2007; now in version 3.0
- OpenCL
  - Relatively new (late 2009); in Apple’s Snow Leopard
  - Open standard (Khronos group) – includes NVidia, ATI, etc
- And so on: DirectCompute (MS), Brook+ (Stanford/AMD), Rapidmind (Waterloo (Sh)/Rapidmind/Intel?)

Today: C for CUDA (more available examples)

Compiling CUDA

- nvcc is the driver
  - Builds on top of g++ or other compilers
- nvcc driver produces CPU and PTX code
  - PTX (Parallel Thread eXecution)
    - Virtual machine and ISA
    - Compiles down to binary for target
- Can compile in device emulation mode for debug
  - nvcc -deviceemu
  - Can use native debug support
  - Can access data across host/device boundaries
  - Can call printf from device code

CUDA programming

    do_something_on_cpu();
    some_kernel<<<nBlk, nTid>>>(args);
    do_something_else_on_cpu();
    cudaThreadSynchronize();

- Highly parallel kernels run on device
  - Vaguely analogous to parallel sections in OpenMP code
- Rest of the code on host (CPU)
- C + extensions to program both host code and kernels

Thread blocks

- Monolithic thread array partitioned into blocks
  - Blocks have 1D or 2D numeric identifier
  - Threads within blocks have 1D, 2D, or 3D identifier
  - Identifiers help figure out what data to work on
- Threads within a block cooperate via shared memory, atomic ops, barriers
- Threads in different blocks cannot cooperate
  - ... except for implied global barrier from host
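As an illustration of how identifiers pick out data, the sketch below does on the host the index arithmetic a kernel would do with `blockIdx`, `blockDim`, and `threadIdx` for a 2D grid of 2D blocks covering a row-major matrix. The `dim3_t` struct and the function names are hypothetical stand-ins, not CUDA API.

```c
#include <assert.h>

/* Host stand-in for CUDA's built-in index types. */
typedef struct { int x, y, z; } dim3_t;

/* Matrix row handled by one thread (y direction of the grid). */
int global_row(dim3_t blockIdx, dim3_t blockDim, dim3_t threadIdx)
{
    return blockIdx.y * blockDim.y + threadIdx.y;
}

/* Matrix column handled by one thread (x direction of the grid). */
int global_col(dim3_t blockIdx, dim3_t blockDim, dim3_t threadIdx)
{
    return blockIdx.x * blockDim.x + threadIdx.x;
}

/* Offset of (row, col) in a row-major nrows-by-ncols matrix, or -1 when the
 * thread falls outside the matrix and should do no work. */
int entry_offset(int row, int col, int nrows, int ncols)
{
    assert(nrows > 0 && ncols > 0);
    if (row < 0 || col < 0 || row >= nrows || col >= ncols) return -1;
    return row * ncols + col;
}
```

The out-of-range case matters because the grid usually covers slightly more than the matrix when the dimensions are not multiples of the block size.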

Memory access

- Registers are registers; per thread
- Shared memory is small, fast, on-chip; per block
- Global memory is large uncached off-chip space
  - Also accessible by host

Also runtime support for texture memory and constant memory.

Basic usage

1. Perform any needed allocations
2. Copy data from host to device
3. Invoke kernel
4. Copy results from device to host
5. Clean up allocations

Device memory management

    h_data = malloc(size);
    // ... initialize h_data on host ...
    cudaMalloc((void**) &d_data, size);
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
    // ... invoke kernel ...
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    free(h_data);

Notes:
- Don’t dereference h_data on device or d_data on host!
- Can also copy host-to-host, device-to-device
- Kernel invocation is asynchronous with CPU; cudaMemcpy is synchronous (can synchronize kernels with cudaThreadSynchronize)

CUDA function declarations

    __device__ float device_func();
    __global__ void kernel_func();
    __host__ float host_func();

- __global__ for kernel (must return void)
- __device__ functions called and executed on device
- __host__ functions called and executed on host
- __device__ and __host__ can be used together

Restrictions on device functions

- No taking the address of a __device__ function
- No recursion
- No static variables inside the function
- No varargs

Kernel invocation

Kernels called with an execution configuration:

    __global__ void kernel_func(...);
    dim3 dimGrid(100, 50);       // 5000 thread blocks
    dim3 dimBlock(4, 8, 8);      // 256 threads per block
    size_t sharedMemBytes = 64;
    kernel_func<<<dimGrid, dimBlock, sharedMemBytes>>>(...);

- Can write integers (1D layouts) for first two arguments
- Third argument is optional (defaults to zero)
- Optional fourth argument for stream of execution
  - Used to specify asynchronous execution across kernels
- Kernel can fail if you request too many resources
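To make "too many resources" concrete, here is a host-side sketch that checks an execution configuration against the limits of a compute capability 1.x device such as the G80. The limits are hard-coded assumptions for illustration (real code would query them with `cudaGetDeviceProperties`), and `dim3_t`, `launch_config_ok`, and `total_threads` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Host stand-in for CUDA's dim3. */
typedef struct { unsigned x, y, z; } dim3_t;

/* Assumed compute capability 1.x limits. */
static const unsigned long MAX_THREADS_PER_BLOCK = 512;
static const unsigned long MAX_BLOCK_XY = 512, MAX_BLOCK_Z = 64;
static const unsigned long MAX_GRID_XY = 65535;
static const unsigned long MAX_SHARED_BYTES = 16384;

/* Returns 1 if the configuration fits the assumed limits, else 0. */
int launch_config_ok(dim3_t grid, dim3_t block, size_t shared_bytes)
{
    unsigned long threads = (unsigned long) block.x * block.y * block.z;
    if (threads == 0 || threads > MAX_THREADS_PER_BLOCK) return 0;
    if (block.x > MAX_BLOCK_XY || block.y > MAX_BLOCK_XY) return 0;
    if (block.z > MAX_BLOCK_Z) return 0;
    if (grid.x == 0 || grid.x > MAX_GRID_XY) return 0;
    if (grid.y == 0 || grid.y > MAX_GRID_XY) return 0;
    if (grid.z != 1) return 0;              /* grids are at most 2D here */
    if (shared_bytes > MAX_SHARED_BYTES) return 0;
    return 1;
}

/* Total threads launched by a valid configuration. */
unsigned long total_threads(dim3_t grid, dim3_t block, size_t shared_bytes)
{
    assert(launch_config_ok(grid, block, shared_bytes));
    return (unsigned long) grid.x * grid.y
         * ((unsigned long) block.x * block.y * block.z);
}
```

The configuration on this slide (a 100 x 50 grid of 4 x 8 x 8 blocks with 64 bytes of shared memory) passes these checks, while a 32 x 32 block would not. Register pressure is another common cause of launch failure that this sketch ignores.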

Example: Vector addition

    __global__ void
    VecAdd(const float* A, const float* B, float* C, int N)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < N) C[i] = A[i] + B[i];
    }

    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    int threadsPerBlock = 256;
    int blocksPerGrid = (N+255) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A,d_B,d_C,N);
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
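The block count `(N+255) / threadsPerBlock` is a ceiling division, and the `if (i < N)` guard is what makes the extra threads in the last block harmless. The host-side sketch below emulates the launch with two loops standing in for `blockIdx.x` and `threadIdx.x`; the function names are hypothetical and the code is plain C, not CUDA.

```c
#include <assert.h>

/* Blocks needed to cover n items: ceiling of n / threads_per_block. */
int blocks_for(int n, int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Emulates the VecAdd launch on the host. Returns the number of emulated
 * threads that fail the bounds guard and therefore do no work. */
int vec_add_emulated(const float* A, const float* B, float* C, int n,
                     int threads_per_block)
{
    int idle = 0;
    int nblocks = blocks_for(n, threads_per_block);
    for (int block = 0; block < nblocks; ++block) {
        for (int thread = 0; thread < threads_per_block; ++thread) {
            int i = threads_per_block * block + thread;
            if (i < n) C[i] = A[i] + B[i];
            else ++idle;
        }
    }
    return idle;
}
```

For example, 1000 elements with 256 threads per block need 4 blocks, and the last 24 threads of the final block are idle.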

Shared memory

Size known at compile time

    __global__ void kernel(...)
    {
        __shared__ float x[256];
        ...
    }

    kernel<<<nb,bs>>>(...);

Size known at kernel launch

    __global__ void kernel(...)
    {
        extern __shared__ float x[];
        ...
    }

    kernel<<<nb,bs,bytes>>>(...);

Synchronize access with barrier.

Example: Butterfly reduction

- On input (step 0): $2^b$ numbers
- At step $i$, entry $j$ becomes the sum over all inputs whose indices agree with $j$ in the last $b - i$ bits
- On output (step $b$): $2^b$ copies of the sum

Example: Butterfly reduction

    __global__ void sum_reduce(int* x)
    {
        // B is a compile time constant power of 2
        int i = threadIdx.x;
        __shared__ int sum[B];
        sum[i] = x[i]; __syncthreads();
        for (int bit = B/2; bit > 0; bit /= 2) {
            int inbr = (i + bit) % B;
            int t = sum[i] + sum[inbr]; __syncthreads();
            sum[i] = t; __syncthreads();
        }
    }

    sum_reduce<<<1,N>>>(d_x);  // one block of N threads; requires N == B
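The reduction pattern can be checked on the host. In the sketch below, each pair of inner loops plays the role of all B threads between two barriers: every "thread" reads, then every "thread" writes, which is what the two `__syncthreads()` calls per step enforce. The value B = 8, the function name, and the final copy back into `x` are additions for illustration and do not appear in the kernel on this slide.

```c
#include <assert.h>

enum { B = 8 };   /* number of emulated threads; must be a power of 2 */

/* Host emulation of the reduction: on return, every entry of x holds the
 * sum of the original B entries. */
void sum_reduce_emulated(int* x)
{
    int sum[B], t[B];
    assert(B > 0 && (B & (B - 1)) == 0);
    for (int i = 0; i < B; ++i) sum[i] = x[i];
    for (int bit = B / 2; bit > 0; bit /= 2) {
        for (int i = 0; i < B; ++i)            /* read phase */
            t[i] = sum[i] + sum[(i + bit) % B];
        for (int i = 0; i < B; ++i)            /* write phase */
            sum[i] = t[i];
    }
    for (int i = 0; i < B; ++i) x[i] = sum[i]; /* copy back for inspection */
}
```

Dropping the barrier between the read and write phases would let one thread overwrite `sum[i]` while a neighbour still needs the old value, which is why the emulation keeps the temporary array `t`.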

General picture: CUDA extensions

- Type qualifiers:
  - global
  - device
  - shared
  - local
  - constant
- Keywords (threadIdx, blockIdx)
- Intrinsics (__syncthreads)
- Runtime API (memory, symbol, execution management)
- Function launch

Libraries and languages

The usual array of language tools exist:
- CUBLAS, CUFFT, CUDA LAPACK bindings (commercial)
- CUDA-accelerated libraries (e.g. in Trilinos)
- Bindings to CUDA from Python, Java, etc

Hardware picture (G80)

- 128 processors execute threads
- Thread Execution Manager issues threads
- Parallel data cache / shared memory per processor
- All have access to device memory
  - Partitioned into global, constant, texture spaces
  - Read-only caches to texture and constant spaces

HW thread organization

- Single Instruction, Multiple Thread
  - A warp of threads executes physically in parallel (one warp == 32 parallel threads)
  - Blocks are partitioned into warps by consecutive thread ID
  - Best efficiency when all threads in warp do same operation
    - Conditional branches reduce parallelism: serially execute all paths taken
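A small host-side model of the divergence cost, assuming a simple if/else: a warp needs one pass if all 32 threads agree on the condition and two passes if both paths are taken. The predicate functions and `branch_passes` are hypothetical names for illustration; real hardware behaviour has more detail than this count.

```c
#include <assert.h>

enum { WARP_SIZE = 32 };

/* Warp that a thread belongs to, by consecutive thread ID. */
int warp_of(int tid)
{
    assert(tid >= 0);
    return tid / WARP_SIZE;
}

/* Number of serialized passes one warp needs for an if/else whose
 * condition for thread tid is pred(tid): 1 if uniform, 2 if divergent. */
int branch_passes(int warp, int (*pred)(int tid))
{
    int taken = 0, not_taken = 0;
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        if (pred(warp * WARP_SIZE + lane)) taken = 1;
        else not_taken = 1;
    }
    return taken + not_taken;
}

/* Diverges inside every warp: neighbouring threads take different paths. */
int even_thread(int tid) { return tid % 2 == 0; }

/* Uniform inside every warp: the condition depends only on the warp index. */
int even_warp(int tid) { return warp_of(tid) % 2 == 0; }
```

Branching on `even_warp` does the same amount of total work as branching on `even_thread`, but no warp has to execute both paths.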

Memory architecture

- Shared memory is divided into 16 banks of 32-bit words
- Each bank services one address per cycle
- Conflicting accesses are serialized
- Stride 1 (or odd stride): no bank conflicts
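The stride rule can be checked by counting. The sketch below assumes that successive 32-bit words map to successive banks and that thread k of a half-warp reads word k*stride; it returns the largest number of threads that land in one bank. `conflict_degree` is a hypothetical helper, not a CUDA call.

```c
#include <assert.h>

enum { NUM_BANKS = 16, HALF_WARP = 16 };

/* Worst-case number of threads in a half-warp hitting the same bank when
 * thread k reads 32-bit word k*stride. A result of 1 means no conflicts;
 * a result of d means the access is serialized into d steps. */
int conflict_degree(int stride)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    assert(stride > 0);
    for (int k = 0; k < HALF_WARP; ++k) {
        int bank = (k * stride) % NUM_BANKS;
        if (++hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}
```

Odd strides are conflict-free because an odd number shares no factor with 16, so the 16 threads land in 16 distinct banks. A stride of 16 sends every thread to bank 0.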

Batch memory access: coalescing

- Coalescing is a coordinated read by half-warp
  - Read contiguous region (64, 128, or 256 bytes)
  - Starting address for region a multiple of region size
  - Thread k in half-warp accesses element k of the block being read
  - Not all threads need to participate
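The conditions above can be written as a predicate on the 16 addresses a half-warp requests. This is a sketch of the rules exactly as listed on the slide, with 4-, 8-, or 16-byte elements giving the 64-, 128-, or 256-byte regions; the names `is_coalesced` and `NOT_PARTICIPATING` are hypothetical, and later hardware relaxes these rules.

```c
#include <assert.h>
#include <stddef.h>

enum { HALF_WARP = 16 };

/* Marker for a thread that issues no access. */
#define NOT_PARTICIPATING ((size_t) -1)

/* addr[k] is the byte address requested by thread k of the half-warp.
 * Returns 1 if the access pattern satisfies the coalescing conditions. */
int is_coalesced(const size_t addr[HALF_WARP], size_t word_size)
{
    size_t region = HALF_WARP * word_size;   /* 64, 128, or 256 bytes */
    size_t base = 0;
    int have_base = 0;
    assert(word_size == 4 || word_size == 8 || word_size == 16);
    for (int k = 0; k < HALF_WARP; ++k) {
        size_t offset = (size_t) k * word_size;
        size_t start;
        if (addr[k] == NOT_PARTICIPATING) continue;
        if (addr[k] < offset) return 0;
        start = addr[k] - offset;             /* region start implied by thread k */
        if (start % region != 0) return 0;    /* region must be aligned */
        if (have_base && start != base) return 0;  /* all threads, one region */
        base = start;
        have_base = 1;
    }
    return 1;
}
```

A unit-stride read of floats starting at an address that is a multiple of 64 passes; the same read shifted by one element, or a stride-2 read, does not.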

The usual picture

- Performance is potentially quite complicated!
  - ... and memory is important.
- Fortunately, there are profiling tools included
  - Unfortunately, I have yet to play with them!

Resources

Besides the basic NVidia documentation, see:
- http://developer.nvidia.com/object/cuda_training.html
- http://courses.ece.illinois.edu/ece498/al/
- http://gpgpu.org/developer
