CS179: GPU Programming Lecture 11: Lab 5 Recitation.


Transcript of CS179: GPU Programming Lecture 11: Lab 5 Recitation.

Page 1.

CS179: GPU Programming
Lecture 11: Lab 5 Recitation

Page 2.

Today
- Monte-Carlo Integration
- Recap on CUBLAS/CURAND
- Reductions
- Optimizing a reduction

Page 3.

Monte-Carlo Integration
- Integration is a common tool in computational math
  - Oftentimes used for finding areas
- Integration is hard on a computer
  - Difficult to do analytically
- Integration is sometimes analytically impossible
  - Can't integrate exp(x^2) analytically

Page 4.

Monte-Carlo Integration
- Could use a discrete Riemann sum
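
For comparison, a minimal host-side sketch of a discrete Riemann sum (illustrative only, not part of the Lab 5 code), approximating the integral of exp(x^2) on [0, 1]:

#include <cmath>
#include <cstdio>

// Approximate the integral of exp(x^2) over [0, 1] with a left Riemann sum.
int main() {
    const int n = 1000000;          // number of subintervals
    const double a = 0.0, b = 1.0;  // integration bounds
    const double dx = (b - a) / n;

    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double x = a + i * dx;      // left endpoint of subinterval i
        sum += exp(x * x) * dx;     // area of one rectangle
    }
    printf("Approximate integral: %f\n", sum);
    return 0;
}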

Page 5.

Monte-Carlo Integration
- What if there's no predefined function?
  - Ex.: Area of union of shapes

Page 6.

Monte-Carlo Integration
- Solution: Monte-Carlo integration
  - Saturate bounded space with sample points
  - Check if each point is in any shape
  - Area = (# of points in a shape / # of points total) * area of space
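
A minimal host-side sketch of this idea (illustrative only; the shape, bounds, and RNG here are placeholders, not the Lab 5 setup), estimating the area of a unit circle inside the square [-1, 1] x [-1, 1]:

#include <cstdio>
#include <cstdlib>

int main() {
    const int nPts = 1000000;   // number of sample points
    int inShape = 0;

    for (int i = 0; i < nPts; i++) {
        // Saturate the bounding square with uniform random points.
        double x = 2.0 * rand() / RAND_MAX - 1.0;
        double y = 2.0 * rand() / RAND_MAX - 1.0;
        if (x * x + y * y <= 1.0)   // is the point in the shape?
            inShape++;
    }

    // area ~ (# of points in the shape / # of points total) * area of space
    const double boundingArea = 4.0;    // the 2 x 2 square
    printf("Estimated area: %f (exact: pi)\n",
           (double)inShape / nPts * boundingArea);
    return 0;
}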

Page 7.

Monte-Carlo Integration
- Lab 5: Given N spheres in a bounded space, find the volume of their union
- Possible to do analytically…
  - But very difficult! Spheres have random positions, areas of intersection, etc.
- Makes good use of Monte-Carlo integration
  - Easy to check if a point is in any of the spheres
  - Easy to use CURAND to generate lots of points!

Page 8.

Lab 5
- Remember: CURAND has host API and device API
  - You will use both!
- volumeCUBLAS: uses host API with CUBLAS
- volumeCUDA: uses device API with reduction kernel

Page 9.

Lab 5: volumeCUBLAS
- Allocate necessary memory
  - Need memory for points
  - Need memory for 1 bool per point (is the point in any sphere?)
- Use CURAND host API to generate lots of points
  - Create, seed, generate, destroy
- Use CheckPointsK kernel to see if each point is in a sphere
  - You must write this kernel!
- Get total # of points in a sphere using cublasDasum
  - cublasDasum(int n, double *src, int stride)
- Free initialized memory
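
A rough sketch of this flow. The function name, kernel launch parameters, bounding volume, and the use of a double flag per point (so cublasDasum can sum it) are assumptions for illustration, and scaling the generated points to the bounded space is omitted:

#include <cuda_runtime.h>
#include <curand.h>
#include <cublas.h>   // legacy CUBLAS API, matching the cublasDasum signature above

double volumeCUBLAS_sketch(int nPts /*, sphere data, ... */) {
    float  *d_points;     // 3 floats per point
    double *d_inSphere;   // 1.0 if the point is in any sphere, else 0.0
    cudaMalloc(&d_points,   3 * nPts * sizeof(float));
    cudaMalloc(&d_inSphere, nPts * sizeof(double));

    // CURAND host API: create, seed, generate, destroy
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniform(gen, d_points, 3 * nPts);
    curandDestroyGenerator(gen);

    // CheckPointsK (the kernel you write) fills d_inSphere with 0.0 / 1.0:
    // CheckPointsK<<<blocks, threadsPerBlock>>>(d_points, d_inSphere, ...);

    // Sum the per-point flags to count points inside any sphere.
    cublasInit();
    double inSphere = cublasDasum(nPts, d_inSphere, 1);
    cublasShutdown();

    cudaFree(d_points);
    cudaFree(d_inSphere);

    // volume ~ (points in a sphere / total points) * volume of bounded space
    const double boundedVolume = 1.0;   // placeholder bounding volume
    return inSphere / nPts * boundedVolume;
}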

Page 10.

Lab 5: volumeCUDA
- Allocate memory for data
  - Now, we also need memory for curandStates!
- Generate lots of points using CURAND device API
  - Call GenerateRandom3K kernel -- but you must fill in the kernel!
- Check if points are in a sphere
  - Same as volumeCUBLAS
- Use reduction to sum vector
  - More on this later…
- Free memory
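
A hedged sketch of what the CURAND device API part might look like. The signature below is illustrative, not the exact GenerateRandom3K prototype from the lab, and scaling to the bounded space is omitted:

#include <curand_kernel.h>

// Each thread initializes one curandState, then fills points with a
// grid-stride loop. The states array must hold one state per thread.
__global__ void GenerateRandom3K_sketch(curandState *states, float3 *points,
                                        unsigned long long seed, int nPts) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // One RNG state per thread; use tid as the sequence number.
    curand_init(seed, tid, 0, &states[tid]);
    curandState localState = states[tid];

    for (int i = tid; i < nPts; i += blockDim.x * gridDim.x) {
        // curand_uniform returns a float in (0, 1].
        points[i] = make_float3(curand_uniform(&localState),
                                curand_uniform(&localState),
                                curand_uniform(&localState));
    }
    states[tid] = localState;   // save the state back for reuse
}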

Page 11.

Lab 5: Kernels
- PointInSphere: Checks if a point is in a given sphere
  - Do this first! Should be easy geometry
- CheckPointsK: Checks if a point is in any sphere
  - Copy spheres to shared memory, then iterate through spheres
  - Remember to make sure array entry is non-NULL
- GenerateRandom3K: Generates lots of float3 points
  - Use CURAND device API
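
For the geometry, the test is just "squared distance to the center <= radius squared". A hedged sketch, using a made-up Sphere layout and made-up signatures (the actual Lab 5 structs and prototypes may differ):

// Hypothetical sphere layout for illustration only.
struct Sphere {
    float x, y, z, r;
};

// A point is inside a sphere iff its squared distance to the center
// is at most r^2.
__device__ bool PointInSphere(float3 p, const Sphere &s) {
    float dx = p.x - s.x, dy = p.y - s.y, dz = p.z - s.z;
    return dx * dx + dy * dy + dz * dz <= s.r * s.r;
}

// CheckPointsK sketch: stage the spheres in shared memory, then test each
// point against every sphere. Launch with nSpheres * sizeof(Sphere) bytes
// of dynamic shared memory.
__global__ void CheckPointsK_sketch(const float3 *points, double *inSphere,
                                    const Sphere *spheres, int nSpheres,
                                    int nPts) {
    extern __shared__ Sphere sSpheres[];

    // Cooperatively copy the spheres into shared memory.
    for (int i = threadIdx.x; i < nSpheres; i += blockDim.x)
        sSpheres[i] = spheres[i];
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < nPts;
         i += blockDim.x * gridDim.x) {
        double hit = 0.0;
        for (int s = 0; s < nSpheres; s++) {
            if (PointInSphere(points[i], sSpheres[s])) {
                hit = 1.0;   // in at least one sphere
                break;
            }
        }
        inSphere[i] = hit;
    }
}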

Page 12.

Reduction
- Iteratively reduces array via reduce function (ex. addition)

Page 13.

Reduction
- Start with size = nPts / 2
- Repeatedly call the reduction kernel, halving size each time
- With main loop in host, device code is very simple…
  - Just need to add element i and element i + size for each thread
- Alternatively, could build loop into device code, and call kernel only once
- Once size == 1, we should have summed up all elements
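
A minimal sketch of the host-driven variant described above (names are illustrative; assumes nPts is a power of two):

// Each call adds element i and element i + size for every i < size.
__global__ void SumReduceStepK(double *data, int size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size)
        data[i] += data[i + size];
}

// Host loop: start at size = nPts / 2 and halve until size reaches 1.
void sumReduce(double *d_data, int nPts, int threadsPerBlock) {
    for (int size = nPts / 2; size >= 1; size /= 2) {
        int blocks = (size + threadsPerBlock - 1) / threadsPerBlock;
        SumReduceStepK<<<blocks, threadsPerBlock>>>(d_data, size);
    }
    // d_data[0] now holds the sum of all nPts elements.
}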

Page 14.

Reduction
- Lots of optimizations to make!
  - Avoiding thread divergence
  - Contiguous memory accesses
  - Avoiding shared memory bank conflicts
- More we haven't discussed yet…
  - Unrolling loops
  - Templates
  - And more!

Page 15.

Optimizations
- Avoiding thread divergence
  - Avoid branches that take different paths for threads in the same warp
    - e.g. if (threadIdx.x % 2 == 0)
  - Instead, group by warps
    - e.g. if (threadIdx.x / WARP_SIZE == 0)
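
Here is the classic way this shows up in a shared-memory block reduction (an illustration, not the Lab 5 kernel); assumes blockDim.x is a power of two and the kernel is launched with blockDim.x * sizeof(double) bytes of dynamic shared memory:

__global__ void blockSumK(const double *in, double *out) {
    extern __shared__ double sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        // Divergent version: only every (2*s)-th thread is active, so
        // threads within the same warp take different paths:
        //   if (tid % (2 * s) == 0)
        //       sdata[tid] += sdata[tid + s];

        // Warp-friendly version: active threads have consecutive tids,
        // so each warp is either fully active or fully idle.
        unsigned int index = 2 * s * tid;
        if (index < blockDim.x)
            sdata[index] += sdata[index + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

The warp-friendly indexing still causes shared memory bank conflicts, which is what the sequential addressing on the next pages fixes.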


Page 16.

Optimizations
- Contiguous memory accesses
  - Memory is linear, can't swap dimensions
  - Need to address non-sequential accesses…
- Shared memory banks
  - Also solved by sequential addressing!


Page 17.

Optimizations
- Example in reduction kernel: reversed loop indexing
  - Before: for (int i = 1; i < max_size; i *= 2) { … }
  - After:  for (int i = max_size / 2; i > 0; i /= 2) { … }
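
Put together with sequential addressing in shared memory, the reduction step looks roughly like this (a sketch, not the exact Lab 5 kernel; assumes blockDim.x is a power of two):

__global__ void reduceSequentialK(const double *in, double *out) {
    extern __shared__ double sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Reversed loop: active threads 0..i-1 read contiguous locations,
    // so accesses are sequential and bank-conflict free.
    for (unsigned int i = blockDim.x / 2; i > 0; i /= 2) {
        if (tid < i)
            sdata[tid] += sdata[tid + i];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}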


Page 18.

Optimizations
- Unrolling loops
  - Basic idea: once the reduction size drops below 32, only part of a single warp is doing useful work, and the rest of the threads are wasted
  - Unrolling the last iterations of the loop saves useless work

Page 19.

Optimizations
- Unrolling loops example:

Before:

for (int i = max_size / 2; i > 0; i /= 2) {
    sdata[tid] += sdata[tid + i];
}

After (stop the loop at the warp boundary and unroll the last iterations):

for (int i = max_size / 2; i > 32; i /= 2) {
    sdata[tid] += sdata[tid + i];
}
if (tid < 32) {
    // Last 32 elements: unrolled, no loop overhead.
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    // etc…
}

Page 20.

Optimizations
- Advanced unrolling: templates
  - Exploit compiler to handle some conditions at compile-time
  - Use templated functions (like in C++)
- Ex.:

template<unsigned int blockSize>
__global__ void kernel(…) {
    if (blockSize >= 512) { /* some reduction code */ }
    else if (blockSize >= 256) { /* some reduction code */ }
    // etc…
}

- Then, call the templated function on host:

kernel<blockSize><<<gridSize, blockSize>>>(…);

Page 21.

Optimizations
- Works well with a switch statement (note the break after each case, so only one kernel is launched):

switch (numThreads) {
    case 512:
        kernel<512><<<gridSize, blockSize>>>(…); break;
    case 256:
        kernel<256><<<gridSize, blockSize>>>(…); break;
    case 128:
        kernel<128><<<gridSize, blockSize>>>(…); break;
    // etc…
}