Advanced Topics in Numerical Analysis: High Performance Computing
Intro to GPGPU
Georg Stadler, Dhairya Malhotra
Courant Institute, NYU
Spring 2019, Monday, 5:10–7:00PM, WWH #1302
April 8, 2019
Outline
Organization issues
Final projects
Computing on GPUs
Organization
Scheduling:
- Homework assignment #4 due next Monday
Topics today:
- Final project overview/discussion
- More GPGPU programming (several examples)
- Algorithms: image filtering (convolution), parallel scan, bitonic sort
Outlook for next week(s):
- Distributed memory programming (MPI)
Final projects
- Final projects! Pitch/discuss your final project to/with us. We're available Tuesday (tomorrow) 5-6pm and Thursday 11-12:30 in WWH #1111 or over Slack.
- We would like to (more or less) finalize project groups and topics in the next week.
- Final projects are in teams of 1-3 people (2 preferred!)
- We posted suggestions for final projects. More ideas on the next slides. Also, take a look at the HPC projects we collected from the first homework assignment.
- Final project presentations (max 10min each) in the week of May 20/21. You are also required to hand in a short paper with your results, as well as the git repo with the code.
Final projects
Final project examples (from example list):
- Parallel multigrid
- Image denoising
- Adaptive finite volumes
- Parallel k-means
- Fluid mechanics simulation
- Data partitioning using parallel octrees
Final projects
Final project examples (more examples):
- Parallelizing a DFT sub-calculation (Tkatchenko-Scheffler dispersion energies and forces)
- Parallelizing a neural network color transfer method for images
- Parallel all-pairs shortest paths via Floyd-Warshall
- Fast CUDA kernels for ResNet inference
- ... Take an existing serious code and speed it up/parallelize it
- ...
Review of Last Class
- CUDA programming model: GPU architecture, memory hierarchy, thread hierarchy.
- Shared memory: fast, low-latency, shared within a thread block, 48KB - 128KB (depending on compute capability)
  - avoid bank conflicts within a warp.
- Synchronization
  - `__syncthreads()`: all threads in a block
  - `__syncwarp()`: all threads in a warp
- Reduction on GPUs
Hiding Latency
- All operations have latency
- CPUs hide latency using out-of-order computation and branch prediction; reduce latency of memory accesses using caches.
- GPUs hide latency using parallelism:
  - execute warp-1 (threads 0-31)
  - when warp-1 stalls, start executing warp-2 (threads 32-63), and so on ...
Occupancy Calculator
Get resource usage for kernel functions (compiler flag: -Xptxas -v)
Example:
    # nvcc -std=c++11 -Xcompiler "-fopenmp" -Xptxas -v reduction.cu
    ptxas info : Compiling entry function '_Z16reduction_kernelPdPKdl' for 'sm_30'
    ptxas info : Function properties for _Z16reduction_kernelPdPKdl
        0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info : Used 28 registers, 8192 bytes smem, 344 bytes cmem[0]

$$\text{Occupancy} = \frac{\text{number of threads per SM}}{\text{maximum number of threads per SM}}$$

- Calculate occupancy for your code:
  - https://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
  - web version: https://xmartlabs.github.io/cuda-calculator
- Improve occupancy to improve performance
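The occupancy ratio can also be estimated by hand from the ptxas numbers. The following is a minimal C++ sketch of that arithmetic, not the calculator's actual method: it ignores register and shared-memory allocation granularity. The names `SmLimits` and `estimate_occupancy` are hypothetical, the per-SM limits are assumed values for compute capability 3.0 (sm_30) that should be checked against the hardware in use, and the block size is an assumption because the slide does not state one.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM hardware limits. The defaults are ASSUMED values for compute
// capability 3.0 (sm_30); verify them for your GPU before relying on them.
struct SmLimits {
    int max_threads = 2048;    // resident threads per SM
    int max_blocks = 16;       // resident thread blocks per SM
    int registers = 65536;     // 32-bit registers per SM
    int shared_bytes = 49152;  // shared memory per SM, in bytes
};

// Simplified occupancy estimate (hypothetical helper): the number of blocks
// that fit on one SM is limited by threads, registers, and shared memory.
// Allocation granularity is deliberately ignored in this sketch.
double estimate_occupancy(int threads_per_block, int regs_per_thread,
                          int smem_per_block,
                          const SmLimits& sm = SmLimits{}) {
    assert(threads_per_block > 0 && regs_per_thread > 0 && smem_per_block >= 0);
    // Limit from the thread and block counts.
    int blocks = std::min(sm.max_blocks, sm.max_threads / threads_per_block);
    // Limit from the register file.
    blocks = std::min(blocks, sm.registers / (regs_per_thread * threads_per_block));
    // Limit from shared memory (only if the kernel uses any).
    if (smem_per_block > 0) {
        blocks = std::min(blocks, sm.shared_bytes / smem_per_block);
    }
    return static_cast<double>(blocks * threads_per_block) / sm.max_threads;
}
```

With the values reported above (28 registers, 8192 bytes of shared memory) and an assumed block size of 1024 threads, `estimate_occupancy(1024, 28, 8192)` gives 1.0 under these limits: two blocks fit, and they fill all 2048 thread slots.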
Device Management (Multiple GPUs)
- Get number of GPUs: cudaGetDeviceCount(int *count)
- Set the current GPU: cudaSetDevice(int device)
- Get current GPU: cudaGetDevice(int *device)
- Get GPU properties: cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
Streams
- execute multiple tasks in parallel, either on separate GPUs or on the same GPU.
- useful for executing several independent small tasks where each task does not have sufficient parallelism.

    // create streams
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    // launch two kernels in parallel, one per stream
    kernel<<<1, 64, 0, stream1>>>();
    kernel<<<1, 128, 0, stream2>>>();

    // synchronize: wait for the work in each stream to finish
    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);
Image Filtering
Convolution: read a k × k block of the image, multiply by the filter weights, and sum.
figure from: GPU Computing: Image Convolution - Jan Novak, Gabor Liktor, Carsten Dachsbacher
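The per-pixel operation described above can be written as a short sequential C++ reference, which is a useful baseline for checking a GPU kernel. This is a sketch under assumptions: the function name `convolve_valid` is hypothetical, the image is stored row-major, only output pixels whose k × k window lies fully inside the image are computed, and the weights are applied without flipping (as in the "multiply and sum" description).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference for k x k image filtering (hypothetical helper).
// Each output pixel is the weighted sum of the k x k input block whose
// top-left corner is at the same (row, col). Output size is
// (height - k + 1) x (width - k + 1).
std::vector<float> convolve_valid(const std::vector<float>& image,
                                  std::size_t height, std::size_t width,
                                  const std::vector<float>& weights,
                                  std::size_t k) {
    assert(image.size() == height * width && weights.size() == k * k);
    assert(k >= 1 && k <= height && k <= width);
    const std::size_t out_h = height - k + 1;
    const std::size_t out_w = width - k + 1;
    std::vector<float> out(out_h * out_w, 0.0f);
    for (std::size_t r = 0; r < out_h; ++r) {
        for (std::size_t c = 0; c < out_w; ++c) {
            float sum = 0.0f;
            // Read the k x k block, multiply by the filter weights, and sum.
            for (std::size_t i = 0; i < k; ++i) {
                for (std::size_t j = 0; j < k; ++j) {
                    sum += weights[i * k + j] * image[(r + i) * width + (c + j)];
                }
            }
            out[r * out_w + c] = sum;
        }
    }
    return out;
}
```

On a GPU, the two outer loops would typically map to the thread grid, with one thread per output pixel.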
Image Filtering
Using shared memory as a cache to minimize global memory reads:
- read a 32 × 32 block of the original image from main memory
- compute convolution in shared memory
- write back result sub-block (excluding halo)
figure from: GPU Computing: Image Convolution - Jan Novak, Gabor Liktor, Carsten Dachsbacher
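The three tiling steps can be illustrated with a sequential C++ sketch in which each loop iteration over tiles plays the role of one thread block and a small local buffer stands in for shared memory. This is an illustration of the idea, not the course's kernel: the name `convolve_tiled` and the `tile` parameter are hypothetical, and only output pixels with a full k × k window inside the image are produced.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Tiled k x k filtering (hypothetical helper). Each (r0, c0) iteration mimics
// one thread block: load a tile x tile input block (halo included) into a
// local cache, compute from the cache, write back only the interior results.
std::vector<float> convolve_tiled(const std::vector<float>& image,
                                  std::size_t height, std::size_t width,
                                  const std::vector<float>& weights,
                                  std::size_t k, std::size_t tile = 32) {
    assert(image.size() == height * width && weights.size() == k * k);
    assert(k >= 1 && k <= tile && k <= height && k <= width);
    const std::size_t out_h = height - k + 1;
    const std::size_t out_w = width - k + 1;
    const std::size_t step = tile - k + 1;   // output pixels per tile side
    std::vector<float> out(out_h * out_w, 0.0f);
    std::vector<float> cache(tile * tile);   // stands in for shared memory
    for (std::size_t r0 = 0; r0 < out_h; r0 += step) {
        for (std::size_t c0 = 0; c0 < out_w; c0 += step) {
            // Step 1: read the input tile, clipped at the image border.
            const std::size_t th = std::min(tile, height - r0);
            const std::size_t tw = std::min(tile, width - c0);
            for (std::size_t r = 0; r < th; ++r)
                for (std::size_t c = 0; c < tw; ++c)
                    cache[r * tile + c] = image[(r0 + r) * width + (c0 + c)];
            // Step 2: compute the convolution from the cached tile only.
            for (std::size_t r = 0; r + k <= th; ++r) {
                for (std::size_t c = 0; c + k <= tw; ++c) {
                    float sum = 0.0f;
                    for (std::size_t i = 0; i < k; ++i)
                        for (std::size_t j = 0; j < k; ++j)
                            sum += weights[i * k + j] * cache[(r + i) * tile + (c + j)];
                    // Step 3: write back the result sub-block (no halo).
                    out[(r0 + r) * out_w + (c0 + c)] = sum;
                }
            }
        }
    }
    return out;
}
```

Each input pixel is read from the image once per tile that contains it, instead of once per output pixel that uses it. In a CUDA kernel the cache would be a `__shared__` array and a `__syncthreads()` would separate the load step from the compute step.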
Sorting
Comparison-based sorting algorithms: bubble sort $O(N^2)$, sample sort $O(N \log N)$, merge sort $O(N \log N)$
Bitonic merge sort $O(N \log^2 N)$
- great for small to medium problem sizes.
- sorting network: a simple deterministic algorithm based on compare-and-swap.
- sequence of $\log N$ bitonic merge operations.
Bitonic Sort
Bitonic merge: $O(N \log N)$ cost for each merge operation
- divide-and-conquer algorithm on bitonic sequences.
- Bitonic sequence: a sequence that changes monotonicity exactly once.
- if the bitonic sequence is larger than the block size, then read and write directly from global memory; otherwise read/write from shared memory
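The compare-and-swap structure of the sorting network can be sketched sequentially in C++. This is a minimal illustration, assuming the input length is a power of two; the function name `bitonic_sort` is hypothetical. The outer loop runs over the log N merge stages, the middle loop over the steps of one bitonic merge, and the innermost loop is the part that a GPU would run with one thread per element.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Bitonic sort as a sorting network (hypothetical helper).
// Requires a.size() to be a power of two.
void bitonic_sort(std::vector<int>& a) {
    const std::size_t n = a.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    // k: length of the sequences produced by the current merge stage.
    for (std::size_t k = 2; k <= n; k *= 2) {
        // j: distance between the two elements of each compare-and-swap.
        for (std::size_t j = k / 2; j > 0; j /= 2) {
            // All iterations of this loop are independent of each other.
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    // Blocks of length k alternate between ascending and
                    // descending order, so neighbors form bitonic sequences.
                    const bool ascending = (i & k) == 0;
                    if ((a[i] > a[partner]) == ascending) {
                        std::swap(a[i], a[partner]);
                    }
                }
            }
        }
    }
}
```

The pattern of comparisons depends only on the array length, never on the data, which is what makes the algorithm deterministic and a good fit for lockstep execution within a warp.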
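Parallel Scan (within thread-block)
Reduction tree (input in the first row; each following row holds the pairwise sums of the row above):
    -2  1  2  0 -2  0  1 -3  4 -4 -3  2 -5  4 -2  1
    -1  2 -2 -2  0 -1 -1 -1
     1 -4 -1 -2
    -3 -3
    -6
Scan tree (root in the first row; the last row is the inclusive prefix sum of the input):
    -6
    -3 -6
     1 -3 -4 -6
    -1  1 -1 -3 -3 -4 -5 -6
    -2 -1  1  1 -1 -1  0 -3  1 -3 -6 -4 -9 -5 -7 -6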
Construct scan tree: start from the root of the reduction tree. Right child: copy the parent's value. Left child: the parent's value minus the value of the right sibling in the reduction tree.
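The two-tree construction can be checked with a small sequential C++ sketch. It assumes a power-of-two input length and computes an inclusive prefix sum; the function name `tree_scan` is hypothetical. The loop over each tree level is the part a thread block would execute in parallel, with a synchronization between levels.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Inclusive prefix sum via a reduction tree followed by a scan tree
// (hypothetical helper). Requires input.size() to be a power of two.
std::vector<int> tree_scan(const std::vector<int>& input) {
    const std::size_t n = input.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    // Reduction tree: reduce[0] is the input, each next level holds the
    // pairwise sums of the previous one, and the last level is the total.
    std::vector<std::vector<int>> reduce{input};
    while (reduce.back().size() > 1) {
        const std::vector<int>& prev = reduce.back();
        std::vector<int> next(prev.size() / 2);
        for (std::size_t i = 0; i < next.size(); ++i) {
            next[i] = prev[2 * i] + prev[2 * i + 1];
        }
        reduce.push_back(std::move(next));
    }
    // Scan tree: start at the root and expand one level at a time.
    std::vector<int> scan = reduce.back();
    for (std::size_t d = reduce.size() - 1; d-- > 0;) {
        const std::vector<int>& level = reduce[d];
        std::vector<int> children(level.size());
        for (std::size_t p = 0; p < scan.size(); ++p) {
            // Right child: copy the parent's value.
            children[2 * p + 1] = scan[p];
            // Left child: parent minus the right sibling's reduction value.
            children[2 * p] = scan[p] - level[2 * p + 1];
        }
        scan = std::move(children);
    }
    return scan;
}
```

Applied to the 16 input values on this slide, the function reproduces the last row of the scan tree.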
Libraries
Optimized libraries:
- cuBLAS for linear algebra
- cuFFT for Fast Fourier Transform
- cuDNN for Deep Neural Networks
cuBLAS Demo!
Summary
- Calculating occupancy: higher is better
  - useful for debugging performance bottlenecks
- Miscellaneous:
  - managing multiple GPUs
  - executing multiple streams in parallel
- Algorithms
  - Image filtering
  - Parallel scan
  - Bitonic sort
- Libraries: cuBLAS, cuFFT, cuDNN