Algorithm Engineering „ GPGPU“
![Page 1: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/1.jpg)
Algorithm Engineering
„GPGPU“
Stefan Edelkamp
![Page 2: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/2.jpg)
Graphics Processing Units
GPGPU = (GP)²U
• General Purpose Programming on the GPU
• „Parallelism for the masses“
• Applications: Fourier transformation, model checking, bioinformatics; see CUDA Zone
![Page 3: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/3.jpg)
Programming the Graphics Processing Unit with CUDA
![Page 4: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/4.jpg)
Overview
• Cluster / Multicore / GPU comparison
• Computing on the GPU
• GPGPU languages
• CUDA
• Small Example
![Page 6: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/6.jpg)
Cluster / Multicore / GPU
Cluster system
• many unique systems
• each one has
  • one (or more) processors
  • internal memory
  • often an HDD
• communication over the network
  • slow compared to internal communication
  • no shared memory
[Figure: three nodes, each with CPU, RAM, and HDD, connected through a switch]
![Page 7: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/7.jpg)
Cluster / Multicore / GPU
Multicore systems
• multiple CPUs
• RAM
• external memory on HDD
• communication over RAM
[Figure: CPU1 to CPU4 sharing one RAM and one HDD]
![Page 8: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/8.jpg)
Cluster / Multicore / GPU
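System with a Graphics Processing Unit
• many (240) parallel processing units
• hierarchical memory structure
  • RAM
  • video RAM
  • shared RAM
• communication over the PCI bus
[Figure: CPU with RAM and hard disk drive, connected over the PCI bus to the graphics card, which holds the GPU with SRAM and VRAM]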
![Page 10: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/10.jpg)
Computing on the GPU
Hierarchical execution
• Groups
  • executed sequentially
• Threads
  • executed in parallel
  • lightweight (creation / switching nearly free)
• one kernel function executed by each thread
[Figure: Group 0]
![Page 11: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/11.jpg)
Computing on the GPU
Hierarchical memory
• Video RAM
  • 1 GB
  • comparable to RAM
• Shared RAM in the GPU
  • 16 KB
  • comparable to registers
  • parallel access by threads
[Figure: graphics card with GPU, SRAM, and video RAM]
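The two memory levels above can be illustrated with a minimal sketch in which each group copies its slice from video RAM into shared RAM once and then sums it there. The names `blockSum`, `sumOnGpu`, and `kThreads` are illustrative assumptions, not part of the slides, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // threads per group; assumed to be a power of two

// Each group sums its slice of `in` in shared RAM and writes one partial sum.
__global__ void blockSum(const int *in, int *partial, int n) {
    __shared__ int cache[kThreads];                 // on-chip shared RAM
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    cache[threadIdx.x] = (id < n) ? in[id] : 0;     // single read from video RAM
    __syncthreads();
    // Tree reduction: halve the number of active threads in every round.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];
}

// Hypothetical host wrapper: adds up the per-group partial sums on the CPU.
long long sumOnGpu(const std::vector<int> &host) {
    int n = static_cast<int>(host.size());
    if (n == 0) return 0;
    int groups = (n + kThreads - 1) / kThreads;
    int *in_d = nullptr, *partial_d = nullptr;
    cudaMalloc(&in_d, n * sizeof(int));
    cudaMalloc(&partial_d, groups * sizeof(int));
    cudaMemcpy(in_d, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    blockSum<<<groups, kThreads>>>(in_d, partial_d, n);
    std::vector<int> partial(groups);
    cudaMemcpy(partial.data(), partial_d, groups * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(in_d);
    cudaFree(partial_d);
    long long total = 0;
    for (int p : partial) total += p;
    return total;
}
```

Calling `sumOnGpu` on a vector of 1000 ones would launch four groups; only the first read of each element touches video RAM, all later additions run in shared RAM.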
![Page 12: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/12.jpg)
Example architecture: G200, e.g. in the GTX 280
![Page 13: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/13.jpg)
Example problems
![Page 14: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/14.jpg)
Ranking and unranking with parity
![Page 15: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/15.jpg)
2-Bit BFS
![Page 16: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/16.jpg)
1-Bit BFS
![Page 17: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/17.jpg)
Sliding-tile puzzle
![Page 18: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/18.jpg)
Some Results…
![Page 19: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/19.jpg)
Further results …
![Page 21: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/21.jpg)
GPGPU Languages
• RapidMind
  • supports multicore, ATI, NVIDIA, and Cell
  • C++ analysed and compiled for the target hardware
• Accelerator (Microsoft)
  • library for .NET languages
• BrookGPU (Stanford University)
  • supports ATI, NVIDIA
  • own language, a variant of ANSI C
![Page 23: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/23.jpg)
CUDA
Programming language
• similar to C
• file suffix .cu
• own compiler called nvcc
• can be linked to C
![Page 24: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/24.jpg)
CUDA
• C++ code: compile with GCC
• CUDA code: compile with nvcc
• link both with ld
• result: executable
![Page 25: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/25.jpg)
CUDA
Additional variable types
• dim3
• int3
• char3
![Page 26: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/26.jpg)
CUDA
Different types of functions
• __global__: invoked from the host
• __device__: called from the device
Different types of variables
• __device__: located in VRAM
• __shared__: located in SRAM
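A minimal sketch of how these qualifiers could appear together in one file. The names `offset_d`, `shifted`, `shiftAll`, and `shiftOnGpu` are made up for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;          // threads per group (assumption)

__device__ int offset_d = 10;          // variable located in VRAM

// __device__ function: callable only from code that runs on the GPU.
__device__ int shifted(int x) { return x + offset_d; }

// __global__ function (kernel): invoked from the host, executed on the device.
__global__ void shiftAll(int *data, int n) {
    __shared__ int tile[kThreads];     // per-group scratch space in SRAM
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n) {
        tile[threadIdx.x] = data[id];  // each thread touches only its own slot
        data[id] = shifted(tile[threadIdx.x]);
    }
}

// Hypothetical host wrapper around the kernel.
std::vector<int> shiftOnGpu(std::vector<int> host) {
    int n = static_cast<int>(host.size());
    if (n == 0) return host;
    int *data_d = nullptr;
    cudaMalloc(&data_d, n * sizeof(int));
    cudaMemcpy(data_d, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    shiftAll<<<(n + kThreads - 1) / kThreads, kThreads>>>(data_d, n);
    cudaMemcpy(host.data(), data_d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(data_d);
    return host;
}
```

The staging through `tile` is only there to show the `__shared__` qualifier; a kernel this simple would not need it.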
![Page 27: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/27.jpg)
CUDA
Calling the kernel function
• name<<<dim3 grid, dim3 block>>>(...)
• grid dimensions (groups)
• block dimensions (threads)
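As a hedged illustration of the launch syntax, the sketch below uses a two-dimensional grid and block to write each cell's linear index into a matrix. `fillIndex` and `fillOnGpu` are hypothetical names and the 16 x 16 block shape is an arbitrary choice.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Every thread handles one matrix cell and stores its row-major index there.
__global__ void fillIndex(int *m, int width, int height) {
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    if (col < width && row < height)          // groups at the border overshoot
        m[row * width + col] = row * width + col;
}

// Hypothetical host wrapper: returns the filled width x height matrix.
std::vector<int> fillOnGpu(int width, int height) {
    std::vector<int> host(static_cast<size_t>(width) * height);
    if (host.empty()) return host;
    int *m_d = nullptr;
    cudaMalloc(&m_d, host.size() * sizeof(int));
    dim3 block(16, 16);                       // 256 threads per group, z defaults to 1
    dim3 grid((width + block.x - 1) / block.x,    // round up so the grid
              (height + block.y - 1) / block.y);  // covers the whole matrix
    fillIndex<<<grid, block>>>(m_d, width, height);
    cudaMemcpy(host.data(), m_d, host.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(m_d);
    return host;
}
```

Unspecified `dim3` components default to 1, which is why no explicit z value is needed here.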
![Page 28: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/28.jpg)
CUDA
Memory handling
• cudaMalloc(...): allocates VRAM
• cudaMemcpy(...): copies memory
• cudaFree(...): frees VRAM
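The three calls can be tried without any kernel. The following sketch copies a buffer to VRAM and back and checks the return codes; the helper name `roundTrip` is an assumption for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Copies `src` host -> VRAM -> host and returns what came back.
// Returns an empty vector if any CUDA call fails.
std::vector<float> roundTrip(const std::vector<float> &src) {
    std::vector<float> dst(src.size());
    if (src.empty()) return dst;
    size_t bytes = src.size() * sizeof(float);   // sizes are given in bytes
    float *buf_d = nullptr;
    if (cudaMalloc(&buf_d, bytes) != cudaSuccess) return {};
    bool ok =
        cudaMemcpy(buf_d, src.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(dst.data(), buf_d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(buf_d);                             // release VRAM in every case
    if (!ok) dst.clear();
    return dst;
}
```

Note that the last argument of `cudaMemcpy` names the direction of the copy, and that all sizes count bytes, not elements.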
![Page 29: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/29.jpg)
CUDA
Distinguish threads
• gridDim: number of groups
• blockDim: number of threads per group
• blockIdx: id of the group (starting with 0)
• threadIdx: id of the thread within its group (starting with 0)
• Id = blockDim.x*blockIdx.x+threadIdx.x
![Page 31: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/31.jpg)
CUDA
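Sequential version on the CPU:

void inc(int *a, int b, int N) {
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b;
}

int main() {
    ...
    inc(a, b, N);
}

Parallel version in CUDA:

__global__ void inc(int *a, int b, int N) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;  // global thread index
    if (id < N)                                      // the last group may overshoot N
        a[id] = a[id] + b;
}

int main() {
    ...
    int *a_d;
    cudaMalloc(&a_d, N * sizeof(int));                            // allocate VRAM
    cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);  // copy input to the device
    dim3 dimBlock(blocksize, 1, 1);                               // unused dimensions are 1, not 0
    dim3 dimGrid((N + blocksize - 1) / blocksize, 1, 1);          // round up to cover all N elements
    inc<<<dimGrid, dimBlock>>>(a_d, b, N);
    cudaMemcpy(a, a_d, N * sizeof(int), cudaMemcpyDeviceToHost);  // copy the result back
    cudaFree(a_d);
}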
![Page 32: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/32.jpg)
Realworld Example
LTL model checking
• traversing an implicit graph G=(V,E)
• vertices are called states
• edges are represented by transitions
• duplicate removal needed
![Page 33: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/33.jpg)
Realworld Example
External model checking
• generate the graph with external BFS
• each BFS layer needs to be sorted
GPUs have proven to be fast at sorting
![Page 34: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/34.jpg)
Realworld Example
Challenges
• millions of states in one layer
• huge state size
• fast access only in SRAM
• elements need to be moved
![Page 35: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/35.jpg)
Realworld Example
Solutions:
• Gpuqsort
  • Qsort optimized for GPUs
  • intensive swapping in VRAM
• Bitonic-based sorting
  • fast for subgroups
  • concatenating groups is slow
![Page 36: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/36.jpg)
Realworld Example
Our solution
• states S presorted by hash H(S)
• each bucket sorted in SRAM by a group
[Figure: VRAM and SRAM]
![Page 37: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/37.jpg)
Realworld Example
Our solution
• order given by (H(S), S)
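The slides do not show the implementation, so the following is only a rough sketch of the second step under strong assumptions: states are 32-bit integers, the hash `hashOf` is an arbitrary multiplicative hash chosen for illustration, the presorting into buckets has already happened, and every bucket holds at most `kBucket` states. Each group loads one bucket into SRAM and sorts it by (H(S), S) with an odd-even transposition sort, which stands in for whatever in-bucket sort the authors used. All names are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBucket = 256;   // states per bucket = threads per group (assumption)

// Hypothetical hash; the slides do not specify H.
__host__ __device__ unsigned hashOf(unsigned s) { return (s * 2654435761u) >> 24; }

// Order given by (H(S), S): compare hashes first, then the states themselves.
__host__ __device__ bool before(unsigned a, unsigned b) {
    unsigned ha = hashOf(a), hb = hashOf(b);
    return ha < hb || (ha == hb && a < b);
}

// One group sorts one bucket in shared RAM.
__global__ void sortBuckets(unsigned *states, int n) {
    __shared__ unsigned bucket[kBucket];
    int base = blockIdx.x * kBucket;
    int len = n - base;                      // the last bucket may be partial
    if (len > kBucket) len = kBucket;
    int t = threadIdx.x;
    if (t < len) bucket[t] = states[base + t];
    __syncthreads();
    // Odd-even transposition sort: len phases of disjoint neighbour swaps.
    for (int phase = 0; phase < len; ++phase) {
        int i = 2 * t + (phase & 1);
        if (i + 1 < len && before(bucket[i + 1], bucket[i])) {
            unsigned tmp = bucket[i];
            bucket[i] = bucket[i + 1];
            bucket[i + 1] = tmp;
        }
        __syncthreads();
    }
    if (t < len) states[base + t] = bucket[t];
}

// Hypothetical host wrapper: sorts every consecutive run of kBucket states.
std::vector<unsigned> sortBucketsOnGpu(std::vector<unsigned> host) {
    int n = static_cast<int>(host.size());
    if (n == 0) return host;
    unsigned *states_d = nullptr;
    cudaMalloc(&states_d, n * sizeof(unsigned));
    cudaMemcpy(states_d, host.data(), n * sizeof(unsigned), cudaMemcpyHostToDevice);
    sortBuckets<<<(n + kBucket - 1) / kBucket, kBucket>>>(states_d, n);
    cudaMemcpy(host.data(), states_d, n * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(states_d);
    return host;
}
```

If the buckets were filled by hash range beforehand, concatenating the sorted buckets would yield the whole layer in (H(S), S) order, so duplicates end up adjacent and can be removed in one pass.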
![Page 38: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/38.jpg)
Realworld Example
Results
![Page 39: Algorithm Engineering „ GPGPU“](https://reader033.fdocuments.us/reader033/viewer/2022050821/56816386550346895dd4712c/html5/thumbnails/39.jpg)
Questions???
Programming the GPU