Pervasive Massively Multithreaded GPU Processors Michael C. Shebanow Sr. Arch Mgr, GPUs.

ACM International Conference on Computing Frontiers 2009: Pervasive Massively Multithreaded GPU Processors

The “Real” Title

This talk is about SIMT Processors

The Past, Present, and a glimpse of the Future


The Past


Brief Chronology of GPUs at NVIDIA

[Figure: timeline of GPU chips, 1993 to 2006. 3dfx: Voodoo 1, Voodoo 2, Banshee, Voodoo 3. Gigapixel. NVIDIA: NV1, (NV2), NV3, NV3.5, NV4, NV5, NV10, NV11, NV15, NV17, NV20, NV2A, NV25, NV30, NV31, NV34, NV35, NV36, NV40, NV41, NV43, NV44, G70, G71, G72, G73, G80. Other labels: Monet, Merlot, Pinot, R300.]

[Figure: DirectX generations and NVIDIA products, 1998 to 2004]

DirectX 5: Riva 128

DirectX 6 (multitexturing): Riva TNT

DirectX 7 (T&L, TextureStageState): GeForce 256

DirectX 8 (SM 1.x): GeForce 3

DirectX 9 (SM 2.0): GeForce FX

DirectX 9.0c (SM 3.0): GeForce 6

Other label on the timeline: Cg

Games shown: Quake 3, Giants, Halo, Far Cry, UE3, Half-Life


Early NVIDIA GPUs (Precambrian Eon)

NV1 (1995)

Forward texturing

Traverse in texel space, generate pixels (vs. conventional “reverse” texturing, where pixel locations are sampled in texture space)

Quadratic patches

Different from the DirectX polygon rendering approach

Integrated audio


Precambrian (cont’d)

NV3 - Riva 128 (Aug 1997)

1st 128-bit memory bus: “Wider is better”

DirectX 3 support

1 pix/clk @ 100 MHz

Unified memory for frame buffer and texture

16b Z / 16b color

Integrated VGA from Weitek


Shades of Programmability (Phanerozoic Eon)

NV4 - Riva TNT (Summer 1998)

2 pix/clk @ 90 MHz

DirectX 5

Dual texturing @ 1 pix/clk

Register combiners


Rudimentary Shader Processors

Early programmable shading: search for “register combiners”, or see http://developer.nvidia.com/object/registercombiners.html

“Fixed function but programmable”


General Combiner Flow


The Birth of Modern GPUs (Cenozoic Eon)

NV20 - GeForce3 (Feb 2001)

4 pix/clk @ 240 MHz, 2 bilinear tex/pix

DirectX 8

Shaders!

Programmable vertex shaders

“Configurable” pixel shaders

[Figure: shader operation model with Input 0, Input 1, and Input 2, an OP, and Temp 0, Temp 1, and Temp 2]

ADDR R0.xyz, eyePosition.xyzx, -f[TEX0].xyzx;
DP3R R0.w, R0.xyzx, R0.xyzx;
RSQR R0.w, R0.w;
MULR R0.xyz, R0.w, R0.xyzx;
ADDR R1.xyz, lightPosition.xyzx, -f[TEX0].xyzx;
DP3R R0.w, R1.xyzx, R1.xyzx;
RSQR R0.w, R0.w;
MADR R0.xyz, R0.w, R1.xyzx, R0.xyzx;
MULR R1.xyz, R0.w, R1.xyzx;
DP3R R0.w, R1.xyzx, f[TEX1].xyzx;
MAXR R0.w, R0.w, {0}.x;


Shaders: Before and After

Halo, © Bungie, Elder Scrolls 3: Morrowind, © Bethesda


Fully Programmable Shader Engines (Cretaceous Period)

NV30, NV31 - GeForce FX (Jan 2003)

4 pix/clk @ 500 MHz (Ultra)

8 pix/clk for Z-only

128 pin DDR DRAM interface

Superset of DirectX 9

FP32 programmable pixel shader

Mainstream derivative: NV31

Not a stellar market success


GeForce FX Shader Program Examples


Programmability Improved (Paleocene Epoch)

NV40 - GeForce 6800 (April 2004)

16 pix/clk @ 500 MHz; DX9 Shader Model 3.0

256 pin DRAM interface

Transition from AGP to PCI-E

Evolved NV3x shader; focused perf/area effort

SLI Re-born

[Figure: NV40 GPU block diagram. Labels: Host Interface / Front End, Geometry (with vertex texture), Rasterize, Quad Distribute, Shader Pipelines 0 to 3 (each with Shader and Texture stages), FIFO, Quad Collect, ROP / Raster Op, L2 Cache, Frame Buffer Interface (FBI), FB memory, Display, video]


The Present


Modern Shader Processors – The “SM” (Pliocene Epoch)

G80 - GeForce 8800 (Nov 2006)

24 pix/clk @ 575 MHz

384-bit local memory interface

Virtual memory remapping for system and frame buffer

DirectX 10

Unified shader for vertex, geometry, and pixel programs

Compute!


G8x GPU

[Figure: G8x GPU block diagram. Top-level labels: Host Unit, Front End, Cmd FIFO, Data Assembler, Primitive Control, Setup / Raster / Zcull, Hub, and frame-buffer (FB) partitions each with L2, ROP, and DRAM. The chip is built from repeated TPC blocks. Each TPC is labeled with an SMC, Geometry, Raster, a Texture Unit with Texture L1, PreRop, an I&D L2 Cache, and SMs. Each SM is labeled with Instruction Fetch, Instruction L1 Cache, Instruction Decode, eight SPs (SP0 to SP7) each with a register file (R), two SFUs, Shared Memory, and a Data L1 Cache.]

SM: Streaming Multiprocessor

TPC: Texture-Processor Cluster


NVIDIA Tesla

Scalable High Density Computing

Massively Multi-threaded Parallel Computing


Unified Design

[Figure: Discrete Design versus Unified Design. Labels: Shader A, Shader B, Shader C, Shader D; Shader Core; four ibuffers and four obuffers]


Streaming Multiprocessor (SM)

[Figure: an SM within a TPC. Labels: I-Cache, MT Issue, C-Cache, eight SPs, one DP unit, two SFUs, Shared Memory]

8 Streaming Processors (SP)

8 SP FMA, 1 shared DP FMA

2 Super Function Units (SFU)

Multi-threaded instruction dispatch

1 to 768 threads active

SIMD instruction per 16/32 threads

Hot clock 1.5 GHz, tepid 750 MHz, 24 GFLOPS

32 KB local register file (RFn)

16 KB global register file (GRF), aka Shared Memory
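The 24 GFLOPS figure can be reproduced from the other numbers on this slide. The sketch below is one reading, assuming that each SP retires one FMA per hot clock and that an FMA counts as two floating-point operations; the function names are illustrative and not taken from the talk. The second helper rounds a thread count up to whole 32-thread warps.

```c
/* Peak single-precision rate of one SM in GFLOPS.
   Assumption: one FMA per SP per hot clock, counted as 2 flops. */
static double sm_peak_gflops(unsigned sp_count, double hot_clock_ghz)
{
    const double flops_per_fma = 2.0;  /* multiply + add (assumed counting) */
    return sp_count * flops_per_fma * hot_clock_ghz;
}

/* Number of warps needed to hold `threads` threads, rounded up. */
static unsigned warps_for_threads(unsigned threads, unsigned warp_size)
{
    return (threads + warp_size - 1) / warp_size;
}
```

With 8 SPs at 1.5 GHz, sm_peak_gflops(8, 1.5) gives 24, and 768 active threads at 32 threads per warp correspond to warps_for_threads(768, 32) = 24 warps.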


SM Conceptual Block Diagram

[Figure: SM conceptual block diagram. Labels: Warp 0, Warp 1, ..., Warp K, each with its own PC; Fetch Unit; Instruction Cache; Sched Unit; Register Files; ALUs; LSU; Single Instruction (SI); Multi-Threaded (MT)]


The Future


The CMOS “Canvas”

[Figure: a 20 mm x 20 mm die]


The Ideal Processor?

[Figure: the die tiled entirely with math units (M)]


The Processor We Live With?

[Figure: the die tiled with math units (M) interleaved with “glue” units (G)]

Performance = Total Area × Computational Area Efficiency × Achieved Dynamic Efficiency
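As a rough illustration of the product above, the sketch below treats the two efficiencies as fractions between 0 and 1, so the result is the die area that is doing useful math at any moment. The function name and the sample efficiencies are hypothetical; only the 20 mm x 20 mm die comes from the talk.

```c
/* Product of the three factors on the slide.
   total_area_mm2: die area; area_eff: fraction of the die that is math units;
   dynamic_eff: fraction of that math capacity kept busy (both assumed 0..1). */
static double effective_math_area(double total_area_mm2,
                                  double area_eff,
                                  double dynamic_eff)
{
    return total_area_mm2 * area_eff * dynamic_eff;
}
```

For a 400 mm^2 die with half its area in math units and half of that kept busy (made-up values), effective_math_area(400.0, 0.5, 0.5) is 100 mm^2. Raising either efficiency helps as much as adding area.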


What is SIMT?

[Figure: SIMD, MIMD, and SIMT]


SIMD versus MIMD versus SIMT?

SIMD: “Synchronous Internally Parallel”

MIMD: “Asynchronous Externally Parallel”

SIMT: “Quasi-Synchronous Externally Parallel”

SIMT = “Near” MIMD Programming Model w/ SIMD Implementation Efficiencies

[Figure: the same load and multiply-add work in two forms. SIMD: two vector instructions (VLD R1,R0,#imm and VMA R3,R1,R2,R3) plus one scalar instruction (ADD R0,R0,#imm). MIMD/SIMT: three threads, each executing its own scalar load (LD Rd, Rs, #imm), multiply-add, and add.]
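The contrast can be sketched in plain C. In the SIMT style the program is written for a single thread and the machine applies it to every thread of a warp; here an ordinary loop stands in for the hardware. The names madd_thread and run_warp are illustrative assumptions, not part of any NVIDIA API.

```c
#include <stddef.h>

enum { WARP_SIZE = 32 };  /* warp width given in the talk */

/* SIMT style: scalar code for ONE thread, indexed by its thread id. */
static void madd_thread(size_t tid, const float *a, const float *b, float *acc)
{
    acc[tid] = a[tid] * b[tid] + acc[tid];  /* one scalar multiply-add */
}

/* Stand-in for the hardware: apply the same scalar program to each
   thread of a warp. Real SIMT hardware does this in lockstep. */
static void run_warp(const float *a, const float *b, float *acc)
{
    for (size_t tid = 0; tid < WARP_SIZE; ++tid)
        madd_thread(tid, a, b, acc);
}
```

The programmer only writes madd_thread. A SIMD version of the same work would instead expose the vector width in the instructions themselves, as VLD and VMA do in the figure.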


SIMT Multithreaded Execution

SIMT: Single-Instruction Multi-Thread executes one instruction across many independent threads

Warp: a set of 32 parallel threads that execute a SIMT instruction

SIMT provides easy single-thread scalar programming with SIMD efficiency

Hardware implements zero-overhead warp and thread scheduling

SIMT threads can execute independently

SIMT warp diverges and converges when threads branch independently

Best efficiency and performance when threads of a warp execute together

[Figure: Single-Instruction Multi-Thread instruction scheduler issuing warp instructions over time, for example “warp 8 instruction 11”]



A Few Open SIMT Problems

Control Divergence

Data Divergence

Data Representation

Coherence

Diversity


Control Divergence

A_Code;
If (cond) {
    B_Code;
    While (cond) {
        C_Code;
        If (cond) {
            D_Code;
        } Else {
            E_Code;
        }
        F_Code;
    }
    G_Code;
} Else {
    H_Code;
}
I_Code;

[Figure: control flow graph of the code above, with blocks A through I and taken (T) / not-taken (NT) edges. For a control flow operation, its immediate dominator and immediate post-dominator are marked.]

Control flow divergence can happen at control flow operations


Why is Control Divergence Bad?

Loss of efficiency in SIMD execution when threads on different execution paths are executed together

Unequal path execution delays imply the “wait or stay diverged” dilemma

[Figure: a branch with taken (T) and not-taken (NT) paths; labels 41, 25, and 16]
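A back-of-the-envelope model makes the efficiency loss concrete. The sketch below assumes a simple if/else in which the hardware issues the two sides one after the other whenever at least one thread of the warp takes each side, with the non-participating lanes idle. This is a simplified model for illustration, not a description of any specific GPU, and all names are hypothetical.

```c
enum { WARP_SIZE = 32 };  /* warp width given in the talk */

/* Issue slots spent on an if/else when `taken` of the 32 threads take
   the "if" side (len_t instructions) and the rest take the "else" side
   (len_nt instructions). Requires taken <= WARP_SIZE. */
static unsigned issue_slots(unsigned taken, unsigned len_t, unsigned len_nt)
{
    unsigned slots = 0;
    if (taken > 0)         slots += len_t;   /* someone runs the if side   */
    if (taken < WARP_SIZE) slots += len_nt;  /* someone runs the else side */
    return slots;
}

/* Fraction of lane-slots doing useful work (1.0 means no divergence loss). */
static double simd_efficiency(unsigned taken, unsigned len_t, unsigned len_nt)
{
    unsigned slots = issue_slots(taken, len_t, len_nt);
    unsigned useful = taken * len_t + (WARP_SIZE - taken) * len_nt;
    return slots == 0 ? 1.0 : (double)useful / (double)(WARP_SIZE * slots);
}
```

With paths of 10 and 20 instructions, a warp whose threads all take the same side runs at full efficiency, while an even 16/16 split needs 30 issue slots and drops to 50%.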


Data Access via Pointers in Parallel Programs

Pointers represent a major problem in parallel programs

Location that a pointer references cannot be resolved until runtime

struct {
    int x;
    int y;
} *p;
int z = p->y;

LD R1,R0[4] // R0 = p

[Figure: pipeline stages FETCH, DECODE, ISSUE, ADDRESS, CACHE, WB, with the point at which the pointer is resolved marked]
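The displacement 4 in the load above is the byte offset of y within the structure, assuming a 4-byte int. A minimal sketch of what that load does, using a named struct (point is a hypothetical name, since the original struct is anonymous):

```c
#include <stddef.h>

struct point {
    int x;
    int y;
};

/* Equivalent of "LD R1,R0[4]": add the field offset to the base pointer,
   then load. The effective address is known only once p is known. */
static int load_y(const struct point *p)
{
    const char *base = (const char *)p;               /* R0 = p         */
    return *(const int *)(base + offsetof(struct point, y));  /* R0[offset] */
}
```

The offset is fixed at compile time, but the base p is a runtime value, which is why the hardware cannot tell which memory location is touched until the address stage.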


Data Divergence

SIMT magnifies the pointer problem

Non-converged memory accesses = data divergence

Classic scatter/gather problem

[Figure: one instruction passes through FETCH, DECODE, and ISSUE, then fans out into several separate ADDRESS, CACHE, WB sequences that each access Memory]
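One way to quantify data divergence is to count how many distinct memory segments a warp's addresses fall into: one segment means a fully converged access, 32 means every thread needs its own transaction. The sketch below assumes a 64-byte segment, which is an illustrative value and not a figure from the talk; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { WARP_SIZE = 32, SEGMENT_BYTES = 64 };  /* SEGMENT_BYTES is assumed */

/* Count the distinct SEGMENT_BYTES-sized segments touched by the
   32 per-thread addresses of one warp-wide load. */
static unsigned segments_touched(const uintptr_t addr[WARP_SIZE])
{
    unsigned count = 0;
    for (size_t i = 0; i < WARP_SIZE; ++i) {
        uintptr_t seg = addr[i] / SEGMENT_BYTES;
        int seen = 0;
        for (size_t j = 0; j < i; ++j) {       /* seen in an earlier lane? */
            if (addr[j] / SEGMENT_BYTES == seg) { seen = 1; break; }
        }
        if (!seen)
            ++count;
    }
    return count;
}
```

Threads reading consecutive 4-byte words touch 2 segments under this assumption, while threads striding by 64 bytes touch 32, the scatter/gather worst case.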


Data Representation: The AOS versus SOA Dilemma

AOS (array of structures)

#define NNN nnn
struct {
    type1 field1;
    type2 field2;
    ...
} data[NNN];

SOA (structure of arrays)

#define NNN nnn
struct {
    type1 field1[NNN];
    type2 field2[NNN];
    ...
} data;


AOS versus SOA in Memory

[Figure: memory layout, with rows addressed 000xxx, 001xxx, ... and columns 000 through 111. AOS: Field1[0], Field2[0], ..., FieldN[0], Field1[1], Field2[1], ..., FieldN[1], ... stored consecutively. SOA: Field1[0] through Field1[7] in one row, Field2[0] through Field2[7] in the next, and so on. Access patterns are labeled “Vector Access” and “Scalar Array Access”.]


AOS versus SOA: How to Choose?

Programmer: pick AOS

Natural way to think about data: group related fields

In some cases, better memory access efficiency

Sparse access to records

SIMT: pick SOA

Threads executing same code want to access same data element at the same time

Very convenient for HW

How to reconcile?
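The hardware preference for SOA follows from where neighboring threads' data lands in memory. The sketch below computes the byte offset of field x for element t in each layout; the particle fields and the element count N are hypothetical examples.

```c
#include <stddef.h>

enum { N = 1024 };  /* hypothetical element count */

struct particle_aos  { float x, y, z, mass; };               /* one record   */
struct particles_soa { float x[N], y[N], z[N], mass[N]; };   /* four columns */

/* Byte offset of element t's `x` within an AOS array of N records. */
static size_t aos_x_offset(size_t t)
{
    return t * sizeof(struct particle_aos) + offsetof(struct particle_aos, x);
}

/* Byte offset of element t's `x` within the single SOA structure. */
static size_t soa_x_offset(size_t t)
{
    return offsetof(struct particles_soa, x) + t * sizeof(float);
}
```

When thread t reads the x of element t, adjacent threads are sizeof(struct particle_aos) bytes apart in AOS but only sizeof(float) bytes apart in SOA, so a warp's SOA accesses fall into far fewer memory lines. A single thread walking all fields of one record sees the opposite: AOS keeps that record contiguous.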


Descriptors

AKA “capabilities”

For example, Plessey 250, Cambridge CAP, Intel 432

D3D employs a form of descriptor

“Resource descriptors” are capabilities

Major language issue for parallel programming?


Diversity: CPU-GPU Détente?

Really SISD vs. SIMT

Sequential applications on SIMT hardware?

Conversely, thread parallel applications on multi-core scalar machines?

Room for both?


Coherent Caches?

Some small planes have built-in parachutes

Really good idea?

Fact: existing GPUs don’t support cache coherency

Bad?

Should coherent caches be added?


The Future Revisited

So what is the future in high performance computing?

1. SIMT

2. Lots of cores

3. Clouds


The Demise of ILP

Uniprocessor performance improvements are crawling to a halt

Very hard to architecturally extract more ILP from single threads

[Figure: performance in ps/instruction (log scale, 1e+0 to 1e+7) versus year (1980 to 2020). Trend lines are labeled 52%/year and 19%/year. Component labels: ps/gate 19%, gates/clock 9%, clocks/inst 18%.]
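The three component rates on the chart are consistent with the 52%/year trend if they compound multiplicatively. That relationship is an assumption about how the chart was built, since the slide lists only the percentages. A small sketch, with a hypothetical helper name:

```c
#include <stddef.h>

/* Combined annual improvement from independent annual rates,
   assuming they multiply: (1 + r1)(1 + r2)...(1 + rn) - 1. */
static double compound_rate(const double *rates, size_t n)
{
    double growth = 1.0;
    for (size_t i = 0; i < n; ++i)
        growth *= 1.0 + rates[i];
    return growth - 1.0;
}
```

For ps/gate 19%, gates/clock 9%, and clocks/inst 18%, the product 1.19 x 1.09 x 1.18 is about 1.53, in line with the 52%/year line. Under the same reading, the 19%/year line corresponds to the ps/gate improvement alone.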


Parallel Processing

Conjecture: most problems worth solving can be solved via a parallel program

SIMT fundamentally a better model than either SIMD or MIMD

146X: Medical Imaging, U of Utah

36X: Molecular Dynamics, U of Illinois, Urbana

18X: Video Transcoding, Elemental Tech

50X: Matlab Computing, AccelerEyes

100X: Astrophysics, RIKEN

149X: Financial Simulation, Oxford

47X: Linear Algebra, Universidad Jaime

20X: 3D Ultrasound, Techniscan

130X: Quantum Chemistry, U of Illinois, Urbana

30X: Gene Sequencing, U of Maryland


Scaling

Can a single GPU do it all?

Systems have to scale to multiple boxes

Programming systems have to scale with them


Final Thoughts

The future is bright for parallel programming

Future supercomputers = networked SIMT-based processing systems
