CS179 GPU Programming
Intro to CUDA: Part II
Lecture originally by Luke Durant and Tamas Szalay
Source: courses.cms.caltech.edu/cs101gpu/2013/lec10_cuda_intro_2.pdf


Today – More CUDA
- More overview
- How to use CUDA in programs
- Matrix multiplication, with code
- Compiling CUDA


CUDA Summary

What is CUDA?
- A different interface to the underlying hardware
- Functions to interface host and device (memory copies, etc.)
- A library to simplify hardware interaction

Kernels are small programs/functions:
- A thread executes a kernel
- A block executes a group of threads (of the same kernel)
  - All on one multiprocessor; threads in a block can share some data
- A grid executes multiple blocks (also of the same kernel)
  - Blocks are scheduled arbitrarily; there is no thread safety between blocks
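
A minimal example (hypothetical, not from the slides) to make the hierarchy concrete: a grid of blocks, each block a group of threads, all executing the same kernel on their own indices.

```cuda
#include <cuda_runtime.h>

// Kernel: each thread increments one array element
__global__ void addOne(float* data, int n)
{
    // Unique global index from the thread's position within its block and grid
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    const int n = 1024;
    float* d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Launch a grid of 4 blocks x 256 threads; the blocks may run in any order
    addOne<<<4, 256>>>(d_data, n);

    cudaFree(d_data);
    return 0;
}
```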


By Analogy

Global memory, shared memory, constant memory… it gets confusing – think of the analogy with graphics:
- Kernels –> shaders
- Global memory –> buffer objects
  - CUDA can access global memory as textures too
- Grid –> single render call

Things shaders do not have:
- Shared memory
- Arbitrary read-write (scattering)
  - In shaders, in/out arrays are indexed automatically
- Thread block division, threadIdx, blockIdx


CUDA Layers

[Figure: diagram of the CUDA software stack – application code over the libraries (CUBLAS, CUFFT), the runtime, and the driver]


CUDA Layers
- Rarely need to use the driver API
- First labs will concentrate on using the runtime
  - Sufficient for most things
- Later labs will use the libraries briefly: CUBLAS, CUFFT (see the sketch below)
  - They can even handle CPU-GPU memory transfer
  - Fast!
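
For a taste of the library layer, a minimal CUFFT call might look like this (a sketch; the plan parameters are illustrative):

```cuda
#include <cufft.h>

// Run an in-place forward 1D complex-to-complex FFT on data already on the device
void forwardFFT(cufftComplex* d_data, int nx)
{
    cufftHandle plan;
    cufftPlan1d(&plan, nx, CUFFT_C2C, 1);               // one transform of size nx
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  // execute in place
    cufftDestroy(plan);
}
```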


CUFFT Benchmark

[Figure: CUFFT benchmark results, plotted with and without host-device memory transfer time]


Using CUDA

Notice that CUFFT is much slower with memory transfer:
- PCIe 2.0 is 0.5 GB/s per lane, e.g. x16 is 8 GB/s
- We still have scheduling overhead
  - Need to transfer some data to start a grid, for example
- Want to copy data back and forth as little as possible

We will only be using CUDA synchronously, though async interfaces exist.
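
To put those numbers in perspective (back-of-the-envelope, not from the slide): a 1024x1024 float matrix is 4 MB, so one copy over a x16 link at 8 GB/s takes roughly 4 MB / 8 GB/s ≈ 0.5 ms – which can easily exceed the runtime of a small kernel, hence the emphasis on minimizing transfers.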


Common Program Flow

Programming the GPU is all about memory:
- Minimize global memory access and host/device transfer

Consider the matrix example from last lecture:
- Copy the input matrices to the graphics card
- Start the kernel grid
- Each block copies sub-matrices into shared memory and multiplies them
- The result is copied back onto the host machine

Let's do this in detail.


Matrix Multiplication

Computing A x B = C, with inner dimension wA:
- Calculate each sub-matrix Csub as the product of two long rectangular strips of A and B
- Each strip is multiplied as Csub-sized blocks and the results accumulated
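
In symbols (using 16-wide tiles along the inner dimension $w_A$), each $16 \times 16$ block of $C$ is accumulated tile by tile:

$$C_{\mathrm{sub}} = \sum_{k=0}^{w_A/16 - 1} A_{\mathrm{sub},k} \, B_{\mathrm{sub},k}$$

where $A_{\mathrm{sub},k}$ and $B_{\mathrm{sub},k}$ are the $k$-th $16 \times 16$ tiles of the row strip of $A$ and the column strip of $B$ feeding this block.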


Matrix Multiplication

Want sub-matrices as large as possible:
- Each thread block computes one sub-matrix
- Each thread computes a single element of Csub
- Maximum threads/block is 512, so choose Csub to be 16x16 (256 threads)
- Grid dimensions are then the dimensions of C divided by 16

But how do we step through the A, B sub-matrices?
- Simple – with a big for loop in the kernel, loading each pair Asub, Bsub into shared memory


Memory Benefits

So why are we doing it this way again? Pretend each thread just computes an element of C by stepping along the entire length of A and B:
- We have ~1 global memory access per arithmetic instruction
- A global memory access costs around 400 clock cycles; a multiplication costs around 10
- This is very bad!

We have a fixed number of arithmetic instructions, so we want to reduce memory accesses instead.


Memory Benefits

If we instead load Asub and Bsub into shared memory and multiply them into Csub there:
- Loading takes 256 global accesses per sub-matrix
- But we get 16x16x16 arithmetic operations out of them
- Which effectively corresponds to a 16x speedup!
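
Where the factor of 16 comes from, spelled out (a back-of-the-envelope check, not on the slide): naively each thread makes $2 w_A$ global loads to compute its element of $C$; in the tiled version there are $w_A/16$ loop iterations and each thread loads only 2 elements per iteration, so

$$\frac{2 w_A}{2 \, w_A / 16} = 16$$

times fewer global memory accesses per thread.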


Matrix Code: Setup
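
A minimal sketch of this setup step, modeled on the CUDA Programming Guide's matrix-multiply example (function and variable names here are hypothetical):

```cuda
#include <cuda_runtime.h>

// Host-side setup: allocate device (global) memory and copy the inputs over.
// A is hA x wA, B is wA x wB, C is hA x wB, all row-major floats.
void matMulSetup(const float* A, const float* B,
                 int hA, int wA, int wB,
                 float** d_A, float** d_B, float** d_C)
{
    size_t sizeA = (size_t)hA * wA * sizeof(float);
    size_t sizeB = (size_t)wA * wB * sizeof(float);
    size_t sizeC = (size_t)hA * wB * sizeof(float);

    // Allocate global memory on the device for both inputs and the output
    cudaMalloc((void**)d_A, sizeA);
    cudaMalloc((void**)d_B, sizeB);
    cudaMalloc((void**)d_C, sizeC);

    // Copy the input matrices from host memory to device global memory
    cudaMemcpy(*d_A, A, sizeA, cudaMemcpyHostToDevice);
    cudaMemcpy(*d_B, B, sizeB, cudaMemcpyHostToDevice);
}
```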


Matrix Code: Launch

Note the <<<dimGrid, dimBlock>>> syntax used to launch the kernel
- Can pass values, pointers to global memory, etc. as arguments
- Will talk more about the syntax in recitation
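
A sketch of the launch under the same assumptions (16x16 blocks, matrix dimensions divisible by 16; the kernel name Muld follows the CUDA Programming Guide example):

```cuda
// Forward declaration of the kernel (defined on the next slides)
__global__ void Muld(const float* A, const float* B, int wA, int wB, float* C);

void matMulLaunch(const float* d_A, const float* d_B, float* d_C,
                  int hA, int wA, int wB)
{
    dim3 dimBlock(16, 16);                           // 256 threads per block
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);  // one block per 16x16 Csub

    // <<<dimGrid, dimBlock>>> launches the grid; plain values and device
    // pointers are passed as ordinary arguments
    Muld<<<dimGrid, dimBlock>>>(d_A, d_B, wA, wB, d_C);
}
```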


Matrix Code: Loading

Note that each thread loads only one element of each sub-matrix, indexing shared memory via threadIdx (threadIdx.x, threadIdx.y)
- Thus the threads need to be synchronized before the loaded tiles are used
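
A sketch of the kernel, modeled on the CUDA Programming Guide's example (the name Muld and the exact indexing are assumptions). Each thread loads one element of Asub and one of Bsub into shared memory, then the block synchronizes; the multiply phase from the next slide is included so the kernel is complete:

```cuda
#define BLOCK_SIZE 16

__global__ void Muld(const float* A, const float* B, int wA, int wB, float* C)
{
    // Shared-memory tiles for the current pair of sub-matrices
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * BLOCK_SIZE + ty;   // this thread's row in C
    int col = blockIdx.x * BLOCK_SIZE + tx;   // this thread's column in C

    float Csub = 0.0f;  // accumulator for one element of C

    // The "big for loop": step through the strips of A and B tile by tile
    for (int k0 = 0; k0 < wA; k0 += BLOCK_SIZE) {
        // Each thread loads exactly one element of each sub-matrix,
        // indexing shared memory via threadIdx
        As[ty][tx] = A[row * wA + (k0 + tx)];
        Bs[ty][tx] = B[(k0 + ty) * wB + col];

        // Wait until the whole tile is loaded before anyone reads it
        __syncthreads();

        // Multiply the two tiles out of shared memory
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize again before the next iteration overwrites the tiles
        __syncthreads();
    }

    // Each thread writes its single element of C back to global memory
    C[row * wB + col] = Csub;
}
```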


Matrix Code: Multiplication

Note that we need to synchronize again after the tile multiplication, before the next pair of sub-matrices overwrites shared memory (the second __syncthreads() in the sketch above).


Matrix Multiplication

This is all confusing – sub-parallelization! Read over the CUDA Programmer's Guide:
- It covers matrix multiplication in greater detail
- Plus, it explains how to do everything


CUDA Code

A couple of CUDA language features to keep an eye out for:
- Special identifiers: __device__, __shared__
- function<<< … >>>() kernel launch syntax
- Special data types: dim3, but also float2…float4, etc. (like vec2…vec4 in GLSL)
- CUDA runtime functions start with "cuda"; driver functions start with just "cu"

Again, more on coding specifics in recitation.
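
A toy sketch (not from the slides; all names are illustrative) touching each of these features:

```cuda
#include <cuda_runtime.h>

__device__ float bias;                 // __device__: lives in device global memory

__global__ void fill(float4* out)      // float4: built-in vector type, like GLSL vec4
{
    __shared__ float tile[256];        // __shared__: per-block shared memory
    tile[threadIdx.x] = bias;
    __syncthreads();
    out[threadIdx.x] = make_float4(tile[threadIdx.x], 0.0f, 0.0f, 1.0f);
}

void run(float4* d_out)                // d_out must point to 256 float4s on the device
{
    dim3 grid(1), block(256);          // dim3: grid/block dimension type
    fill<<<grid, block>>>(d_out);      // <<< >>> launch syntax
    cudaThreadSynchronize();           // runtime API calls start with "cuda"
}
```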


Compiling CUDA

What happens when you compile? Some of the code has to be built for the GPU…
- nvcc is NVIDIA's CUDA compiler
- It extracts and compiles the C code intended for the device
- Then it calls the main compiler (gcc, or cl if using Windows) on the remainder (the host code)
- It typically operates on files ending in .cu

Setting up Makefiles can be a pain, so we do it for you.
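
A typical invocation looks like this (flags here are minimal; real Makefiles add more):

```sh
# nvcc splits the device code out of the .cu file, compiles it,
# and forwards the host-side C/C++ to gcc (or cl on Windows)
nvcc -o matmul matmul.cu
```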


Emulation Mode

CUDA programs can also be compiled/linked against an emulation library:
- Some of you may need to do this, if your personal computer doesn't support CUDA
- Two problems:
  - Very, very slow (obviously)
  - Synchronous and deterministic – if you have multithreading bugs, you might not see them
- Still, rather useful for testing/debugging

Download the SDK and try it out – compile with 'make emu=1' to use emulation.
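
If you compile by hand rather than through the SDK makefiles, device emulation is selected with an nvcc flag (assuming a toolkit of this era that still ships emulation):

```sh
nvcc -deviceemu -o matmul matmul.cu
```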


Homework
- Grab the CUDA programming manual; check out the table of contents so you know what's in it
- Read the matrix multiplication code and understand it conceptually
- Coding details will be explained in recitation