7/25/2019 How a GPU Works - Kayvon Fatahalian
Kayvon Fatahalian
15-462 (Fall 2011)
How a GPU Works
Today
1.
Review: the graphics pipeline
2. History: a few old GPUs
3. How a modern GPU works (and why it is so fast)
4. Closer look at a real GPU design
Part 1:
The graphics pipeline
Vertex processing
v0
v1
v2
v3
v4
v5
Vertices are transformed into screen space
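Each vertex is transformed to screen space independently of the others. A minimal sketch of that transform, assuming NumPy and a combined model-view-projection matrix (the function and argument names here are illustrative, not from the slides):

```python
import numpy as np

def vertices_to_screen(positions, mvp, width, height):
    """Transform object-space vertices to screen space.

    positions: (N, 3) array of vertex positions
    mvp:       4x4 model-view-projection matrix
    Returns an (N, 2) array of pixel coordinates.
    """
    n = positions.shape[0]
    homo = np.hstack([positions, np.ones((n, 1))])  # homogeneous coords (N, 4)
    clip = homo @ mvp.T                             # clip space
    ndc = clip[:, :3] / clip[:, 3:4]                # perspective divide -> [-1, 1]
    x = (ndc[:, 0] * 0.5 + 0.5) * width             # viewport transform
    y = (ndc[:, 1] * 0.5 + 0.5) * height
    return np.stack([x, y], axis=1)
```

Because each output row depends only on the corresponding input row, every vertex can be processed in parallel — the property the rest of the lecture builds on.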
EACH VERTEX IS TRANSFORMED INDEPENDENTLY
Primitive processing
v0
v1
v2
v3
v4
v5
v0
v3
v5
Then organized into primitives that are clipped and culled
Rasterization
Primitives are rasterized into pixel fragments
Fragment processing
Fragments are shaded to compute a color at each pixel
Pixel operations
Fragments are blended into the frame buffer at pixel locations (the z-buffer determines visibility)
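The depth test that decides visibility can be sketched in a few lines (an illustrative model, not real hardware; here "smaller z is closer" and a passing fragment simply replaces the pixel, with no alpha blending):

```python
def depth_test_and_write(frame, zbuf, x, y, frag_z, frag_color):
    """Write a fragment into the frame buffer if it passes the depth test.

    frame, zbuf: dicts keyed by (x, y) pixel location (toy stand-ins for
    the color buffer and z-buffer). Smaller z means closer to the camera
    (one common convention; real GPUs make the comparison configurable).
    """
    if frag_z < zbuf[(x, y)]:
        zbuf[(x, y)] = frag_z          # record the new nearest depth
        frame[(x, y)] = frag_color     # simple replace, no blending
        return True
    return False
```

Run twice at the same pixel, the nearer fragment wins regardless of the order the fragments arrive in.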
Pipeline entities
v0, v1, v2, v3, v4, v5
Vertices → Primitives → Fragments → Pixels
Graphics pipeline
Vertex Generation → Vertex Processing → Primitive Generation → Primitive Processing → Fragment Generation → Fragment Processing → Pixel Operations
Memory buffers: vertex data buffers, textures, output image
(vertex streams and primitive streams connect the stages)
Part 2:
Graphics architecture
What's so important about independent computations?
Silicon Graphics RealityEngine (1993)
A "graphics supercomputer": dedicated hardware for each pipeline stage (vertex generation and processing, primitive generation and processing, fragment generation and processing)
Pre-1999 PC 3D graphics accelerators
(3dfx Voodoo, NVIDIA RIVA TNT)
Vertex generation and vertex processing ran on the CPU; the accelerator handled clip/cull/rasterize, texturing, and pixel operations.
GPU* circa 1999
[Diagram: the full pipeline — vertex generation through pixel operations — now runs on the GPU; the CPU only feeds it commands]
Direct3D 10 programmability: 2006
NVIDIA GeForce 8800
[Diagram: 16 programmable cores shared by all shader stages, plus fixed-function texture units, clip/cull/rasterize hardware, pixel-op units, and a scheduler]
Part 3:
How a shader core works
GPUs are fast
Intel Core i7 (quad core): ~100 GFLOPS peak, 730 million transistors
AMD Radeon HD 5870: ~2.7 TFLOPS peak, 2.2 billion transistors
A diffuse reflectance shader
Shader programming model:
Fragments are processed independently, but there is no explicit parallel programming.
Independent logical sequence of control per fragment. ***
Big Guy, lookin' diffuse
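The shader source on this slide appears only as an image and did not survive extraction. As an illustrative stand-in (not the slides' exact HLSL), per-fragment diffuse (Lambertian) shading amounts to scaling the texture color by the clamped dot product of the normal and light direction:

```python
def diffuse_shade(kd, normal, light_dir):
    """Lambertian (diffuse) reflectance for one fragment.

    kd:        surface albedo sampled from a texture, as (r, g, b)
    normal:    unit surface normal
    light_dir: unit vector toward the light
    """
    n_dot_l = sum(n * l for n, l in zip(normal, light_dir))
    intensity = max(0.0, min(1.0, n_dot_l))   # clamp to [0, 1]
    return tuple(c * intensity for c in kd)
```

Note there is no loop over fragments here: the shader describes what happens to one fragment, and the hardware supplies the parallelism.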
Compile shader
<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)
Execute shader
<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)
CPU-style cores
Fetch/
Decode
Execution
Context
ALU
(Execute)
Data cache
(a big one)
Out-of-order control logic
Fancy branch predictor
Memory pre-fetcher
Slimming down
Fetch/
Decode
Execution
Context
ALU
(Execute)
Idea #1:
Remove components that
help a single instruction
stream run fast
Two cores (two fragments in parallel)
Fetch/Decode
Execution
Context
ALU
(Execute)
Fetch/Decode
Execution
Context
ALU
(Execute)
<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)
Four cores (four fragments in parallel)
[Diagram: four cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context]
Sixteen cores (sixteen fragments in parallel)
16 cores = 16 simultaneous instruction streams
Instruction stream sharing
But ... many fragments should be able to share an instruction stream!
<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)
Fetch/
Decode
Execution
Context
ALU
(Execute)
Add ALUs
Fetch/
Decode
Idea #2:
Amortize the cost/complexity of managing an instruction stream across many ALUs
SIMD processing
Ctx Ctx Ctx Ctx
Ctx Ctx Ctx Ctx
Shared Ctx Data
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
Modifying the shader
Fetch/
Decode
Ctx Ctx Ctx Ctx
Ctx Ctx Ctx Ctx
Shared Ctx Data
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)
Fetch/
Decode
Ctx Ctx Ctx Ctx
Ctx Ctx Ctx Ctx
Shared Ctx Data
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
New compiled shader:
Processes eight fragments in parallel using vector ops on vector registers
<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul  vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul  vec_o0, vec_r0, vec_r3
VEC8_mul  vec_o1, vec_r1, vec_r3
VEC8_mul  vec_o2, vec_r2, vec_r3
VEC8_mov  o3, l(1.0)
Fetch/
Decode
Ctx Ctx Ctx Ctx
Ctx Ctx Ctx Ctx
Shared Ctx Data
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
1 2 3 4
5 6 7 8
<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul  vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul  vec_o0, vec_r0, vec_r3
VEC8_mul  vec_o1, vec_r1, vec_r3
VEC8_mul  vec_o2, vec_r2, vec_r3
VEC8_mov  o3, l(1.0)
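The VEC8 ops amortize one instruction across eight fragments. The same idea can be sketched in NumPy (illustrative code, not from the slides): the per-fragment shader becomes one sequence of vector operations over arrays of eight fragments.

```python
import numpy as np

def vec8_diffuse(kd, normals, light_dir):
    """Shade eight fragments with one "instruction stream" of vector ops.

    kd:        (8, 3) per-fragment albedo
    normals:   (8, 3) per-fragment unit normals
    light_dir: (3,) unit vector toward the light (uniform for the group)
    """
    n_dot_l = normals @ light_dir            # one dot product across 8 lanes
    intensity = np.clip(n_dot_l, 0.0, 1.0)   # the VEC8_clmp step
    return kd * intensity[:, None]           # the VEC8_mul step
```

Each NumPy call plays the role of one VEC8 instruction: a single operation applied to all eight lanes at once.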
16 cores = 128 ALUs, 16 simultaneous instruction streams
128 [ vertices / primitives / fragments / CUDA threads / OpenCL work items ] in parallel
But what about branches?
[Diagram: eight ALUs stepping in lockstep over time; when fragments take different branch paths, lanes not on the current path are masked off, so ALU time is wasted on divergent branches]
! Coherent execution*** (admittedly fuzzy definition): when processing of different
entities is similar, and thus can share resources for efficient execution
- Instruction stream coherence: different fragments follow the same sequence of logic
- Memory access coherence:
  Different fragments access similar data (avoid memory transactions by reusing data in cache)
  Different fragments simultaneously access contiguous data (enables efficient, bulk-granularity transactions)
! Divergence: lack of coherence
- Usually used in reference to instruction streams (divergent execution does not make full use of SIMD processing)
*** Do not confuse this use of the term coherence with cache coherence protocols
GPUs share instruction streams across many fragments
In modern GPUs: 16 to 64 fragments share an instruction stream.
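The cost of divergence can be sketched with a toy masked-execution model (illustrative Python, not how hardware is implemented): all lanes in a group step through both sides of a branch, and a per-lane mask decides which lanes commit results on each side.

```python
def simd_branch(values, cond, then_fn, else_fn):
    """Execute a branch over a SIMD group via masking.

    Every lane steps through BOTH paths; the mask selects which lanes
    commit on each path. When lanes diverge, total work is the sum of
    both paths -- that is the cost of divergence.
    """
    mask = [cond(v) for v in values]
    out = list(values)
    # "then" path: only masked-on lanes commit results.
    for i, v in enumerate(values):
        if mask[i]:
            out[i] = then_fn(v)
    # "else" path: only masked-off lanes commit results.
    for i, v in enumerate(values):
        if not mask[i]:
            out[i] = else_fn(v)
    return out
```

If all lanes agree on the branch, one of the two loops commits nothing and hardware can skip that path entirely — coherent execution at work.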
Stalls!
Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.
Recall: diffuse reflectance shader
Texture access:
Latency of 100s of cycles
Recall: CPU-style core
ALU
Fetch/Decode
Execution
Context
OOO exec logic
Branch predictor
Data cache
(a big one: several MB)
CPU-style memory hierarchy
CPU cores run efficiently when data is resident in cache
(caches reduce latency, provide high bandwidth)
ALU
Fetch/Decode
Execution
contexts
OOO exec logic
Branch predictor
L1 cache
(32 KB)
L2 cache
(256 KB)
L3 cache
(8 MB)
shared across cores
Processing Core (several cores per chip)
Stalls!
Texture access latency = 100s to 1000s of cycles
We've removed the fancy caches and logic that help avoid stalls.
Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.
But we have LOTS of independent fragments.
(Way more fragments to process than ALUs)
Idea #3:
Interleave processing of many fragments on a single core to avoid stalls caused by high-latency operations.
Hiding shader stalls
Throughput!
[Diagram: four groups of eight fragments (Frag 1–8, 9–16, 17–24, 25–32) interleaved on one core; when one group stalls on a texture fetch, the core switches to another runnable group, keeping the ALUs busy until all four groups are done]
Increase run time of one group
to increase throughput of many groups
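The throughput argument above can be made concrete with a toy timing model (illustrative, under simplifying assumptions: each group has one stall of fixed length, and switching groups is free):

```python
def time_with_interleaving(n_groups, compute_clocks, stall_clocks):
    """Estimate clocks to finish n_groups of fragments on one core that
    switches to another runnable group whenever one stalls.

    Each group needs compute_clocks of ALU work plus one stall of
    stall_clocks (e.g., a texture fetch). A stall is hidden to the extent
    that the other groups' compute work can cover it.
    """
    total_compute = n_groups * compute_clocks
    # Portion of the stall that the other groups' compute cannot cover.
    exposed_stall = max(0, stall_clocks - (n_groups - 1) * compute_clocks)
    return total_compute + exposed_stall
```

With one group, a 100-clock stall is fully exposed; with enough interleaved groups, the same stall disappears behind other groups' work — each group finishes later, but the core's total throughput goes up.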
Storing contexts
Fetch/
Decode
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
Pool of context storage
128 KB
Eighteen small contexts (maximal latency hiding)
Fetch/
Decode
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
Twelve medium contexts
Fetch/
Decode
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
Four large contexts (low latency hiding ability)
Fetch/
Decode
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
1 2
3 4
My chip!
16 cores
8 mul-add ALUs per core
(128 total)
16 simultaneousinstruction streams
64 concurrent (but interleaved)instruction streams
512 concurrent fragments
= 256 GFLOPs (@ 1GHz)
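The 256 GFLOPs figure follows from counting mul-add ALUs, since each mul-add retires 2 FLOPs per clock. A quick sanity check:

```python
def peak_gflops(cores, alus_per_core, clock_ghz, flops_per_alu=2):
    """Peak throughput in GFLOPs.

    flops_per_alu defaults to 2 because a fused mul-add counts as two
    floating-point operations per clock.
    """
    return cores * alus_per_core * flops_per_alu * clock_ghz
```

For the fictitious chip above: 16 cores × 8 ALUs × 2 FLOPs × 1 GHz = 256 GFLOPs; the "enthusiast" version on the next slide (32 cores × 16 ALUs) lands near 1 TFLOP the same way.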
My enthusiast chip!
32 cores, 16 ALUs per core (512 total) = 1 TFLOP (@ 1 GHz)
Summary: three key ideas for high-throughput execution
1. Use many slimmed-down cores, run them in parallel
2. Pack cores full of ALUs (by sharing instruction stream overhead across groups of fragments)
   Option 1: Explicit SIMD vector instructions
   Option 2: Implicit sharing managed by hardware
3. Avoid latency stalls by interleaving execution of many groups of fragments
   When one group stalls, work on another group
Putting the three ideas into practice:
A closer look at a real GPU
NVIDIA GeForce GTX 480 (Fermi)
! NVIDIA-speak:
480 stream processors (CUDA cores)
SIMT execution
! Generic speak:
15 cores
2 groups of 16 SIMD functional units per core
NVIDIA GeForce GTX 480 core
= SIMD function unit,
control shared across 16 units
(1 MUL-ADD per clock)
Shared scratchpad memory
(16+48 KB)
Execution contexts
(128 KB)
Fetch/
Decode
Groups of 32 fragments share an instruction stream
Up to 48 groups are simultaneously interleaved
Up to 1536 individual contexts can be stored
Source: Fermi Compute Architecture Whitepaper
CUDA Programming Guide 3.1, Appendix G
NVIDIA GeForce GTX 480 core
= SIMD function unit,
control shared across 16 units
(1 MUL-ADD per clock)
Shared scratchpad memory
(16+48 KB)
Execution contexts
(128 KB)
Fetch/
Decode
Fetch/
Decode
The core contains 32 functional units
Two groups are selected each clock
(decode, fetch, and execute two instruction streams in parallel)
Source: Fermi Compute Architecture Whitepaper
CUDA Programming Guide 3.1, Appendix G
NVIDIA GeForce GTX 480 SM
= CUDA core
(1 MUL-ADD per clock)
Shared scratchpad memory
(16+48 KB)
Execution contexts
(128 KB)
Fetch/
Decode
Fetch/
Decode
The SM contains 32 CUDA cores
Two warps are selected each clock
(decode, fetch, and execute two warps in parallel)
Up to 48 warps are interleaved, totaling over 1500 CUDA threads
Source: Fermi Compute Architecture Whitepaper
CUDA Programming Guide 3.1, Appendix G
NVIDIA GeForce GTX 480
There are 15 of these things on the GTX 480!
That's 23,000 fragments!
(or 23,000 CUDA threads!)
Looking Forward
Current and future GPU architectures
! Bigger and faster (more cores, more FLOPS)
2 TFLOPs today, and counting
! Addition of (select) CPU-like features
More traditional caches
! Tight integration with CPUs (CPU+GPU hybrids)
See AMD Fusion
! What fixed-function hardware should remain?
Recent trends
! Support for alternative programming interfaces
Accelerate non-graphics applications using GPU (CUDA, OpenCL)
! How does the graphics pipeline abstraction change to enable more advanced
real-time graphics?
Direct3D 11 adds three new pipeline stages
Global illumination algorithms Credit: Bratincevic
Credit: NVIDIA
Ray tracing:
for accurate reflections, shadows
Credit: Ingo Wald
Alternative shading structures (e.g., deferred shading)
For more efficient scaling to many lights (1000 lights, [Andersson 09])
Simulation
Cinematic scene complexity
Motion blur
Thanks!
Relevant CMU Courses for students interested in high performance graphics:
15-869: Graphics and Imaging Architectures (my special topics course)
15-668: Advanced Parallel Graphics (Treuille)
15-418: Parallel Architecture and Programming (spring semester)