7/25/2019 How a GPU Works - Kayvon Fatahalian
Kayvon Fatahalian
15-462 (Fall 2011)
How a GPU Works
Today
1.
Review: the graphics pipeline
2. History: a few old GPUs
3. How a modern GPU works (and why it is so fast)
4. Closer look at a real GPU design
Part 1:
The graphics pipeline
Vertex processing
v0
v1
v2
v3
v4
v5
Vertices are transformed into screen space
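Each vertex is transformed to screen space independently of the others. A minimal sketch of that transform, assuming NumPy and a combined model-view-projection matrix (the function and argument names here are illustrative, not from the slides):

```python
import numpy as np

def vertices_to_screen(positions, mvp, width, height):
    """Transform object-space vertices to screen space.

    positions: (N, 3) array of vertex positions
    mvp:       4x4 model-view-projection matrix
    Returns an (N, 2) array of pixel coordinates.
    """
    n = positions.shape[0]
    homo = np.hstack([positions, np.ones((n, 1))])  # homogeneous coords (N, 4)
    clip = homo @ mvp.T                             # clip space
    ndc = clip[:, :3] / clip[:, 3:4]                # perspective divide -> [-1, 1]
    x = (ndc[:, 0] * 0.5 + 0.5) * width             # viewport transform
    y = (ndc[:, 1] * 0.5 + 0.5) * height
    return np.stack([x, y], axis=1)
```

Because each output row depends only on the corresponding input row, every vertex can be processed in parallel — the property the rest of the lecture builds on.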
EACH VERTEX IS TRANSFORMED INDEPENDENTLY
Primitive processing
v0
v1
v2
v3
v4
v5
v0
v3
v5
Then organized into primitives that are clipped and culled
Rasterization
Primitives are rasterized into pixel fragments
Fragment processing
Fragments are shaded to compute a color at each pixel
Pixel operations
Fragments are blended into the frame buffer at pixel locations (the z-buffer determines visibility)
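The depth test that decides visibility can be sketched in a few lines (an illustrative model, not real hardware; here "smaller z is closer" and a passing fragment simply replaces the pixel, with no alpha blending):

```python
def depth_test_and_write(frame, zbuf, x, y, frag_z, frag_color):
    """Write a fragment into the frame buffer if it passes the depth test.

    frame, zbuf: dicts keyed by (x, y) pixel location (toy stand-ins for
    the color buffer and z-buffer). Smaller z means closer to the camera
    (one common convention; real GPUs make the comparison configurable).
    """
    if frag_z < zbuf[(x, y)]:
        zbuf[(x, y)] = frag_z          # record the new nearest depth
        frame[(x, y)] = frag_color     # simple replace, no blending
        return True
    return False
```

Run twice at the same pixel, the nearer fragment wins regardless of the order the fragments arrive in.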
Pipeline entities
v0, v1, v2, v3, v4, v5
Vertices → Primitives → Fragments → Pixels
Graphics pipeline
Vertex Generation → Vertex Processing → Primitive Generation → Primitive Processing → Fragment Generation → Fragment Processing → Pixel Operations
Memory buffers: vertex data buffers, textures, output image
(vertex streams and primitive streams connect the stages)
Part 2:
Graphics architecture
What's so important about independent computations?
Silicon Graphics RealityEngine (1993)
A "graphics supercomputer": dedicated hardware for each pipeline stage (vertex generation and processing, primitive generation and processing, fragment generation and processing)
Pre-1999 PC 3D graphics accelerators
(3dfx Voodoo, NVIDIA RIVA TNT)
Vertex generation and vertex processing ran on the CPU; the accelerator handled clip/cull/rasterize, texturing, and pixel operations.
GPU* circa 1999
[Diagram: the full pipeline — vertex generation through pixel operations — now runs on the GPU; the CPU only feeds it commands]
Direct3D 10 programmability: 2006
NVIDIA GeForce 8800
[Diagram: 16 programmable cores shared by all shader stages, plus fixed-function texture units, clip/cull/rasterize hardware, pixel-op units, and a scheduler]
Part 3:
How a shader core works
GPUs are fast
Intel Core i7 (quad core): ~100 GFLOPS peak, 730 million transistors
AMD Radeon HD 5870: ~2.7 TFLOPS peak, 2.2 billion transistors
A diffuse reflectance shader
Shader programming model:
Fragments are processed independently, but there is no explicit parallel programming.
Independent logical sequence of control per fragment. ***
Big Guy, lookin' diffuse
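The shader source on this slide appears only as an image and did not survive extraction. As an illustrative stand-in (not the slides' exact HLSL), per-fragment diffuse (Lambertian) shading amounts to scaling the texture color by the clamped dot product of the normal and light direction:

```python
def diffuse_shade(kd, normal, light_dir):
    """Lambertian (diffuse) reflectance for one fragment.

    kd:        surface albedo sampled from a texture, as (r, g, b)
    normal:    unit surface normal
    light_dir: unit vector toward the light
    """
    n_dot_l = sum(n * l for n, l in zip(normal, light_dir))
    intensity = max(0.0, min(1.0, n_dot_l))   # clamp to [0, 1]
    return tuple(c * intensity for c in kd)
```

Note there is no loop over fragments here: the shader describes what happens to one fragment, and the hardware supplies the parallelism.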
Compile shader
<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)
Execute shader
<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)
CPU-style cores
Fetch/
Decode
Execution
Context
ALU
(Execute)
Data cache
(a big one)
Out-of-order control logic
Fancy branch predictor
Memory pre-fetcher
Slimming down
Fetch/
Decode
Execution
Context
ALU
(Execute)
Idea #1:
Remove components that
help a single instruction
stream run fast
Two cores (two fragments in parallel)
Fetch/Decode
Execution
Context
ALU
(Execute)
Fetch/Decode
Execution
Context
ALU
(Execute)
<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)
Four cores (four fragments in parallel)
[Diagram: four cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context]
Sixteen cores (sixteen fragments in parallel)
16 cores = 16 simultaneous instruction streams
Instruction stream sharing
But ... many fragments should be able to share an instruction stream!
<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)
Fetch/
Decode
Execution
Context
ALU
(Execute)
Add ALUs
Fetch/
Decode
Idea #2:
Amortize the cost/complexity of managing an instruction stream across many ALUs
SIMD processing
Ctx Ctx Ctx Ctx
Ctx Ctx Ctx Ctx
Shared Ctx Data
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
Modifying the shader
Fetch/
Decode
Ctx Ctx Ctx Ctx
Ctx Ctx Ctx Ctx
Shared Ctx Data
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)
Fetch/
Decode
Ctx Ctx Ctx Ctx
Ctx Ctx Ctx Ctx
Shared Ctx Data
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
New compiled shader:
Processes eight fragments in parallel using vector ops on vector registers
<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul  vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul  vec_o0, vec_r0, vec_r3
VEC8_mul  vec_o1, vec_r1, vec_r3
VEC8_mul  vec_o2, vec_r2, vec_r3
VEC8_mov  o3, l(1.0)
Fetch/
Decode
Ctx Ctx Ctx Ctx
Ctx Ctx Ctx Ctx
Shared Ctx Data
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
1 2 3 4
5 6 7 8
<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul  vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul  vec_o0, vec_r0, vec_r3
VEC8_mul  vec_o1, vec_r1, vec_r3
VEC8_mul  vec_o2, vec_r2, vec_r3
VEC8_mov  o3, l(1.0)
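The VEC8 ops amortize one instruction across eight fragments. The same idea can be sketched in NumPy (illustrative code, not from the slides): the per-fragment shader becomes one sequence of vector operations over arrays of eight fragments.

```python
import numpy as np

def vec8_diffuse(kd, normals, light_dir):
    """Shade eight fragments with one "instruction stream" of vector ops.

    kd:        (8, 3) per-fragment albedo
    normals:   (8, 3) per-fragment unit normals
    light_dir: (3,) unit vector toward the light (uniform for the group)
    """
    n_dot_l = normals @ light_dir            # one dot product across 8 lanes
    intensity = np.clip(n_dot_l, 0.0, 1.0)   # the VEC8_clmp step
    return kd * intensity[:, None]           # the VEC8_mul step
```

Each NumPy call plays the role of one VEC8 instruction: a single operation applied to all eight lanes at once.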
16 cores = 128 ALUs, 16 simultaneous instruction streams
128 [ vertices / primitives / fragments / CUDA threads / OpenCL work items ] in parallel
But what about branches?
[Diagram: eight ALUs stepping in lockstep over time; when fragments take different branch paths, lanes not on the current path are masked off, so ALU time is wasted on divergent branches]
! Coherent execution*** (admittedly fuzzy definition): when processing of different
entities is similar, and thus can share resources for efficient execution
- Instruction stream coherence: different fragments follow the same sequence of logic
- Memory access coherence:
  Different fragments access similar data (avoid memory transactions by reusing data in cache)
  Different fragments simultaneously access contiguous data (enables efficient, bulk-granularity transactions)
! Divergence: lack of coherence
- Usually used in reference to instruction streams (divergent execution does not make full use of SIMD processing)
*** Do not confuse this use of the term coherence with cache coherence protocols
GPUs share instruction streams across many fragments
In modern GPUs: 16 to 64 fragments share an instruction stream.
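The cost of divergence can be sketched with a toy masked-execution model (illustrative Python, not how hardware is implemented): all lanes in a group step through both sides of a branch, and a per-lane mask decides which lanes commit results on each side.

```python
def simd_branch(values, cond, then_fn, else_fn):
    """Execute a branch over a SIMD group via masking.

    Every lane steps through BOTH paths; the mask selects which lanes
    commit on each path. When lanes diverge, total work is the sum of
    both paths -- that is the cost of divergence.
    """
    mask = [cond(v) for v in values]
    out = list(values)
    # "then" path: only masked-on lanes commit results.
    for i, v in enumerate(values):
        if mask[i]:
            out[i] = then_fn(v)
    # "else" path: only masked-off lanes commit results.
    for i, v in enumerate(values):
        if not mask[i]:
            out[i] = else_fn(v)
    return out
```

If all lanes agree on the branch, one of the two loops commits nothing and hardware can skip that path entirely — coherent execution at work.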
Stalls!
Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.
Recall: diffuse reflectance shader
Texture access:
Latency of 100s of cycles
Recall: CPU-style core
ALU
Fetch/Decode
Execution
Context
OOO exec logic
Branch predictor
Data cache
(a big one: several MB)
CPU-style memory hierarchy
CPU cores run efficiently when data is resident in cache
(caches reduce latency, provide high bandwidth)
ALU
Fetch/Decode
Execution
contexts
OOO exec logic
Branch predictor
L1 cache
(32 KB)
L2 cache
(256 KB)
L3 cache
(8 MB)
shared across cores
Processing Core (several cores per chip)
Stalls!
Texture access latency = 100s to 1000s of cycles
We've removed the fancy caches and logic that help avoid stalls.
Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.
But we have LOTS of independent fragments.
(Way more fragments to process than ALUs)
Idea #3:
Interleave processing of many fragments on a single core to avoid stalls caused by high-latency operations.
Hiding shader stalls
Throughput!
[Diagram: four groups of eight fragments (Frag 1–8, 9–16, 17–24, 25–32) interleaved on one core; when one group stalls on a texture fetch, the core switches to another runnable group, keeping the ALUs busy until all four groups are done]
Increase run time of one group
to increase throughput of many groups
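The throughput argument above can be made concrete with a toy timing model (illustrative, under simplifying assumptions: each group has one stall of fixed length, and switching groups is free):

```python
def time_with_interleaving(n_groups, compute_clocks, stall_clocks):
    """Estimate clocks to finish n_groups of fragments on one core that
    switches to another runnable group whenever one stalls.

    Each group needs compute_clocks of ALU work plus one stall of
    stall_clocks (e.g., a texture fetch). A stall is hidden to the extent
    that the other groups' compute work can cover it.
    """
    total_compute = n_groups * compute_clocks
    # Portion of the stall that the other groups' compute cannot cover.
    exposed_stall = max(0, stall_clocks - (n_groups - 1) * compute_clocks)
    return total_compute + exposed_stall
```

With one group, a 100-clock stall is fully exposed; with enough interleaved groups, the same stall disappears behind other groups' work — each group finishes later, but the core's total throughput goes up.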
Storing contexts
Fetch/
Decode
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
Pool of context storage
128 KB
Eighteen small contexts (maximal latency hiding)
Fetch/
Decode
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
Twelve medium contexts
Fetch/
Decode
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
Four large contexts (low latency hiding ability)
Fetch/
Decode
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
1 2
3 4
My chip!
16 cores
8 mul-add ALUs per core
(128 total)
16 simultaneousinstruction streams
64 concurrent (but interleaved)instruction streams
512 concurrent fragments
= 256 GFLOPs (@ 1GHz)
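The 256 GFLOPs figure follows from counting mul-add ALUs, since each mul-add retires 2 FLOPs per clock. A quick sanity check:

```python
def peak_gflops(cores, alus_per_core, clock_ghz, flops_per_alu=2):
    """Peak throughput in GFLOPs.

    flops_per_alu defaults to 2 because a fused mul-add counts as two
    floating-point operations per clock.
    """
    return cores * alus_per_core * flops_per_alu * clock_ghz
```

For the fictitious chip above: 16 cores × 8 ALUs × 2 FLOPs × 1 GHz = 256 GFLOPs; the "enthusiast" version on the next slide (32 cores × 16 ALUs) lands near 1 TFLOP the same way.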
My enthusiast chip!
32 cores, 16 ALUs per core (512 total) = 1 TFLOP (@ 1 GHz)
Summary: three key ideas for high-throughput execution
1. Use many slimmed-down cores, run them in parallel
2. Pack cores full of ALUs (by sharing instruction stream overhead across groups of fragments)
   Option 1: Explicit SIMD vector instructions
   Option 2: Implicit sharing managed by hardware
3. Avoid latency stalls by interleaving execution of many groups of fragments
   When one group stalls, work on another group
Putting the three ideas into practice:
A closer look at a real GPU
NVIDIA GeForce GTX 480 (Fermi)
! NVIDIA-speak:
480 stream processors (CUDA cores)
SIMT execution
! Generic speak:
15 cores
2 groups of 16 SIMD functional units per core
NVIDIA GeForce GTX 480 core
= SIMD function unit,
control shared across 16 units
(1 MUL-ADD per clock)
Shared scratchpad memory
(16+48 KB)
Execution contexts
(128 KB)
Fetch/
Decode
Groups of 32 fragments share an instruction stream
Up to 48 groups are simultaneously interleaved
Up to 1536 individual contexts can be stored
Source: Fermi Compute Architecture Whitepaper
CUDA Programming Guide 3.1, Appendix G
NVIDIA GeForce GTX 480 core
= SIMD function unit,
control shared across 16 units
(1 MUL-ADD per clock)
Shared scratchpad memory
(16+48 KB)
Execution contexts
(128 KB)
Fetch/
Decode
Fetch/
Decode
The core contains 32 functional units
Two groups are selected each clock
(decode, fetch, and execute two instruction streams in parallel)
Source: Fermi Compute Architecture Whitepaper
CUDA Programming Guide 3.1, Appendix G
NVIDIA GeForce GTX 480 SM
= CUDA core
(1 MUL-ADD per clock)
Shared scratchpad memory
(16+48 KB)
Execution contexts
(128 KB)
Fetch/
Decode
Fetch/
Decode
The SM contains 32 CUDA cores
Two warps are selected each clock
(decode, fetch, and execute two warps in parallel)
Up to 48 warps are interleaved, totaling over 1500 CUDA threads
Source: Fermi Compute Architecture Whitepaper
CUDA Programming Guide 3.1, Appendix G
NVIDIA GeForce GTX 480
There are 15 of these things on the GTX 480!
That's 23,000 fragments!
(or 23,000 CUDA threads!)
Looking Forward
Current and future GPU architectures
! Bigger and faster (more cores, more FLOPS)
2 TFLOPs today, and counting
! Addition of (select) CPU-like features
More traditional caches
! Tight integration with CPUs (CPU+GPU hybrids)
See AMD Fusion
! What fixed-function hardware should remain?
Recent trends
! Support for alternative programming interfaces
Accelerate non-graphics applications using GPU (CUDA, OpenCL)
! How does the graphics pipeline abstraction change to enable more advanced
real-time graphics?
Direct3D 11 adds three new pipeline stages
Global illumination algorithms Credit: Bratincevic
Credit: NVIDIA
Ray tracing:
for accurate reflections, shadows
Credit: Ingo Wald
Alternative shading structures (e.g., deferred shading)
For more efficient scaling to many lights (1000 lights, [Andersson 09])
Simulation
Cinematic scene complexity
Motion blur
Thanks!
Relevant CMU Courses for students interested in high performance graphics:
15-869: Graphics and Imaging Architectures (my special topics course)
15-668: Advanced Parallel Graphics (Treuille)
15-418: Parallel Architecture and Programming (spring semester)