Transcript of CScADS presentation: Ben de Waal (NVIDIA), "GPU"
Source: cscads.rice.edu/deWaal--NVIDIA--GPUs.pdf
GPU
Ben de Waal, Summer 2008
Agenda
Quick Roadmap
A few observations
And a few positions

© NVIDIA Corporation 2007
GPUs are Great at Graphics
Hellgate: London © 2005-2006 Flagship Studios, Inc. Licensed by NAMCO BANDAI Games America, Inc.
Crysis © 2006 Crytek / Electronic Arts
Full Spectrum Warrior: Ten Hammers © 2006 Pandemic Studios, LLC. All rights reserved. © 2006 THQ Inc. All rights reserved.
GPUs are Great at Other Things!
An expanding trend over the last few years
Successful applications in many areas
Computational geometry, biology, chemistry, physics, finance…
Computer vision
Database management
Signal processing
Physics simulation
…
dim3 DimGrid(100, 50);        // 5000 thread blocks
dim3 DimBlock(4, 8, 8);       // 256 threads per block
size_t SharedMemBytes = 64;   // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);
Standard C Programming / New Architecture for Computing

[Diagram, shown twice: a Thread Execution Manager dispatches work to Control/ALU pairs; a Parallel Data Cache holds Shared Data; DRAM holds particle data P1,V1 … P5,V5; each thread computes P' = P + V * t.]

C for the GPU: CUDA & GPU Computing
New Applications, Unprecedented Performance
70M CUDA GPUs
Heterogeneous Computing
[Diagram: four CPUs alongside four GPUs]
60K CUDA Developers
Oil & Gas, Finance, Medical, Biophysics, Numerics, Audio, Video, Imaging
GeForce GTX 280 Parallel Computing Architecture
Thread Scheduler
[Diagram: a Thread Scheduler over an array of SM cores, with Atomic/Tex/L2 units above multiple memory partitions.]
CUDA Terminology: Grids, Blocks, and Threads
[Diagram: the CPU launches a sequence of kernels. Kernel 1 executes on the GPU device as Grid 1, a 3x2 array of blocks, Block (0, 0) through Block (2, 1). Kernel 2 executes as Grid 2; within it, Block (1, 1) is a 5x3 array of threads, Thread (0, 0) through Thread (4, 2).]

Programmer partitions problem into a sequence of kernels.
A kernel executes as a grid of thread blocks
A thread block is an array of threads that can cooperate
Threads within the same block synchronize and share data in Shared Memory
Execute thread blocks on multithreaded multiprocessor SM cores
CUDA Programming Model: Thread Memory Spaces
Each kernel thread can read:
Thread Id per thread
Block Id per block
Constants per grid
Texture per grid
[Diagram: a kernel thread program, written in C, reads its Thread Id and Block Id and uses Registers and Local Memory per thread.]
Each thread can read and write:
Registers per thread
Local memory per thread
Shared memory per block
Global memory per grid
Host CPU can read and write:
Constants per grid
Texture per grid
Global memory per grid
[Diagram: Shared Memory per block; Constants, Texture, and Global Memory per grid.]
Trends and Observations
Core counts generally double per family
Low end has substantially fewer cores than high end
Ranges from 8 to 100s
Memory hierarchy will likely remain
Evolving:
Processor expressiveness
Ease of programming
Reducing performance cliffs
Hierarchical scheduling & partitioning
Nested parallelism
Heterogeneous computing:
Algorithms vary
Run them on the most suitable processor
Autotuning – Super Languages
One possible extreme outcome:
People program in an expressive enough language that maps fairly cleanly onto the installed base of processors
Programmer driven
Just very simple machine translation needed
As an example, CUDA’s programming paradigm also scales with CPU cores
Data parallel
Memory hierarchy is explicit
i.e. it reflects an architectural superset of several different designs
Heterogeneous Tuning Space
Cache hit architectures
Like traditional CPUs
Thread driven execution
NUMA / Cost of global coherence
Cache cliffs (hits, misses, aliasing, etc.)
Scalar / Vector (SIMD)
Cache miss architectures
Like many GPUs
Data driven execution
Wide range of cores
NUMA / sometimes no global coherence
Memory technology exposure (banks, etc.)
Vector / Scalar
Autotuning – Really Smart Code
Another extreme outcome:
Genetic programming style autotuners
Evolves optimal code for any (local) architecture
Potential to find a diamond in the state of Texas
Somehow still generalize
Good news: It’s parallelizable!
Detour: Circuit Synthesis
Similar Problem
Remarkable success
Remarkable exploitation
Genetic Programming III, John R. Koza et al., 1999, Chapter 25
Autotuning
Both extremes seem to be too good to be true
We’ll probably end up in the middle
Programmer will do some of the parameterization
Identify blocks
Memory tradeoffs
Serial code
Autotuners explore a smaller space
Composition is key
Tuners likely need access to complete code base
Need a powerful/expressive enough IL that isn't source
Allow investment
Client side must be smart upfront, or binary ships its own brains
Smart client: can have IL logic for the local system, supplied perhaps by IHVs
Smart binary: more flexible, but may not understand the target
Seems desirable for IL to include high level expression
Compiling CUDA
[Diagram: NVCC compiles a C/C++ CUDA application into CPU code plus PTX code (the virtual target); a PTX-to-Target Translator then produces target code for each GPU.]
Virtual to Target ISA Translation
[Diagram: PTX code, e.g.
ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
mad.f32 $f1,$f5,$f3,$f1;
flows through the PTX-to-Target Translator into raw target code
(0x103c8009 0x0fffffff 0xd00e0609 0xa0c00780 0x100c8009 0x00000003 0x21000409 0x07800780)
for each GPU.]

Parallel Thread eXecution (PTX)
Virtual Machine and ISA
Distribution format for applications
Install-time translation
"fat binary" caches target-specific versions
Target-specific translation optimizes for:
ISA differences
Resource allocation
Performance
Interesting Architectures
Do more on GPUs
Millions out there
Compact, well suited for server farms
Plenty of tuning parameters
A very hard problem
Represents many of the issues many-core CPUs are going to face
It's like the future, today
Interesting Architectures
Heterogeneous Tuning
Figuring out how to divide work appropriately among asymmetrical cores
E.g. partitioning a problem to map serial code onto an aggressive out-of-order mono-core CPU, plus parallel parts of the problem onto a plenty-core GPU.