Pervasive Massively Multithreaded GPU Processors Michael C. Shebanow Sr. Arch Mgr, GPUs.


ACM International Conference on Computing Frontiers 2009: Pervasive Massively Multithreaded GPU Processors

The “Real” Title

This talk is about SIMT Processors

The Past, Present, and a glimpse of the Future


The Past


Brief Chronology of GPUs at NVIDIA

[Figure: timeline of GPU families, 1993 to 2006. 3dfx: Voodoo 1, Voodoo 2, Banshee, Voodoo 3. NVIDIA: NV1, (NV2), NV3, NV3.5, NV4, NV5, NV10, NV11, NV15, NV17, NV20, NV2A, NV25, NV30, NV31, NV34, NV35, NV36, NV40, NV41, NV43, NV44, G70, G71, G72, G73, G80. Other labels: Gigapixel, Monet, Merlot, Pinot, R300.]

[Figure: API and product timeline, 1998 to 2004.]

DirectX 5: Riva 128
DirectX 6 (multitexturing): Riva TNT
DirectX 7 (T&L, TextureStageState): GeForce 256
DirectX 8 (SM 1.x): GeForce 3, Cg
DirectX 9 (SM 2.0): GeForce FX
DirectX 9.0c (SM 3.0): GeForce 6

Games shown: Half-Life, Quake 3, Giants, Halo, Far Cry, UE3


Early NVIDIA GPUs (Precambrian Eon)

NV1 (1995)

Forward texturing

Traverse in texel space, generate pixels (vs. conventional “reverse” texturing, where pixel locations are sampled in texture space)

Quadratic patches

Different than DirectX polygon rendering approach

Integrated audio
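As a rough illustration of the two traversal orders, here is a one-dimensional Python sketch. It is not NV1's actual algorithm, which the slide does not describe; the function names, the nearest-texel lookup, and the integer mapping between texel and pixel coordinates are all assumptions made for the example.

```python
def reverse_texture(texture, out_w):
    """Reverse texturing: walk the pixels, sample the texture for each one."""
    tex_w = len(texture)
    return [texture[x * tex_w // out_w] for x in range(out_w)]


def forward_texture(texture, out_w):
    """Forward texturing: walk the texels, write the pixels each one covers."""
    tex_w = len(texture)
    out = [None] * out_w
    for u, texel in enumerate(texture):
        x0 = u * out_w // tex_w          # first pixel covered by texel u
        x1 = (u + 1) * out_w // tex_w    # one past the last covered pixel
        for x in range(x0, x1):
            out[x] = texel
    return out


if __name__ == "__main__":
    tex = [10, 20, 30, 40]
    print(reverse_texture(tex, 8))
    print(forward_texture(tex, 8))
```

When the texture is stretched over more pixels than it has texels, both loops fill the same row; the difference is only which space the outer loop walks.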


Precambrian (cont’d)

NV3 - Riva 128 (Aug 1997)

1st 128-bit memory bus

“Wider is better”

DirectX 3 support

1 pix/clk @ 100 MHz

Unified memory for frame buffer and texture

16b Z / 16b color

Integrated VGA from Weitek


Shades of Programmability (Phanerozoic Eon)

NV4 - Riva TNT (Summer 1998)

2 pix/clk @ 90 MHz

DirectX 5

Dual texturing @ 1 pix/clk

Register combiners


Rudimentary Shader Processors

Early Programmable Shading

Google “register combiners”, http://developer.nvidia.com/object/registercombiners.html

“Fixed function but programmable”


General Combiner Flow


The Birth of Modern GPUs (Cenozoic Eon)

NV20 - GeForce3 (Feb 2001)

4 pix/clk @ 240 MHz, 2 bilinear tex/pix

DirectX 8

Shaders!

Programmable vertex shaders

“Configurable” pixel shaders

[Figure: an OP block with Input 0, Input 1, Input 2 and Temp 0, Temp 1, Temp 2.]

ADDR R0.xyz, eyePosition.xyzx, -f[TEX0].xyzx;
DP3R R0.w, R0.xyzx, R0.xyzx;
RSQR R0.w, R0.w;
MULR R0.xyz, R0.w, R0.xyzx;
ADDR R1.xyz, lightPosition.xyzx, -f[TEX0].xyzx;
DP3R R0.w, R1.xyzx, R1.xyzx;
RSQR R0.w, R0.w;
MADR R0.xyz, R0.w, R1.xyzx, R0.xyzx;
MULR R1.xyz, R0.w, R1.xyzx;
DP3R R0.w, R1.xyzx, f[TEX1].xyzx;
MAXR R0.w, R0.w, {0}.x;
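To make the shader listing easier to follow, the sketch below replays it instruction by instruction in Python. It assumes, without confirmation from the slide, that `f[TEX0]` carries the surface position and `f[TEX1]` the surface normal; the helper names (`sub`, `dot`, `scale`, `add`, `shade`) are invented for the example, and the `.xyzx` swizzles are ignored because only the first three components are used.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def scale(s, a):
    return tuple(s * x for x in a)


def shade(eye_position, light_position, position, normal):
    """Mirror the 11 shader instructions; returns (R0.xyz, R0.w)."""
    r0 = sub(eye_position, position)         # ADDR R0.xyz: vector to the eye
    r0w = dot(r0, r0)                        # DP3R: squared length
    r0w = 1.0 / math.sqrt(r0w)               # RSQR: reciprocal square root
    r0 = scale(r0w, r0)                      # MULR: unit view vector
    r1 = sub(light_position, position)       # ADDR R1.xyz: vector to the light
    r0w = dot(r1, r1)                        # DP3R
    r0w = 1.0 / math.sqrt(r0w)               # RSQR
    r0 = add(scale(r0w, r1), r0)             # MADR: unit light + unit view
    r1 = scale(r0w, r1)                      # MULR: unit light vector
    r0w = dot(r1, normal)                    # DP3R: N.L (assumes TEX1 = normal)
    r0w = max(r0w, 0.0)                      # MAXR: clamp to zero
    return r0, r0w


if __name__ == "__main__":
    print(shade((0, 0, 2), (0, 0, 5), (0, 0, 0), (0, 0, 1)))
```

Read this way, the program ends with an unnormalized half-angle vector in `R0.xyz` and a clamped diffuse term in `R0.w`, under the stated assumption about the texture coordinates.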


Shaders: Before and After

Halo, © Bungie, Elder Scrolls 3: Morrowind, © Bethesda


Fully Programmable Shader Engines (Cretaceous Period)

NV30, NV31 - GeForce FX (Jan 2003)

4 pix/clk @ 500 MHz (Ultra)

8 pix/clk for Z-only

128 pin DDR DRAM interface

Superset of DirectX 9

FP32 programmable pixel shader

Mainstream derivative: NV31

Not a stellar market success


GeForce FX Shader Program Examples


Programmability Improved (Paleocene Epoch)

NV40 - GeForce 6800 (April 2004)

16 pix/clk @ 500 MHz ; DX9 Shader Model 3.0

256 pin DRAM interface

Transition from AGP to PCI-E

Evolved NV3x shader; focused perf/area effort

SLI Re-born

[Figure: NV40 GPU block diagram. Labels: Host Interface / Front End; Geometry; vertex texture; Rasterize; Quad Distribute; Shader Pipeline 0 to 3; Shader; Texture; pixel texture; L2 Cache; FIFO; Quad Collect; Raster Op (ROP); Frame Buffer Interface (FBI); FB memory; Display; video.]


The Present


Modern Shader Processors – The “SM” (Pliocene Epoch)

G80 - GeForce 8800 (Nov 2006)

24 pix/clk @ 575 MHz

384-bit local memory interface

Virtual memory remapping for system and frame buffer

DirectX 10

Unified shader for vertex, geometry, and pixel programs

Compute!
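The slides so far quote a pixels-per-clock figure and a core clock for several chips. Multiplying the two gives a peak pixel rate, which is one way to see how quickly the hardware grew. The sketch below does that arithmetic; the numbers are copied from the slides, while the function name and the table layout are only illustrative, and real sustained rates would be lower than these peaks.

```python
# (chip, pixels per clock, core clock in MHz), as quoted on the slides
CHIPS = [
    ("NV3 Riva 128", 1, 100),
    ("NV4 Riva TNT", 2, 90),
    ("NV20 GeForce3", 4, 240),
    ("NV30 GeForce FX (Ultra)", 4, 500),
    ("NV40 GeForce 6800", 16, 500),
    ("G80 GeForce 8800", 24, 575),
]


def peak_fill_rate_mpix(pix_per_clk, clock_mhz):
    """Peak pixel rate in Mpix/s: pixels per clock times clocks per microsecond."""
    return pix_per_clk * clock_mhz


if __name__ == "__main__":
    for name, ppc, mhz in CHIPS:
        print(f"{name}: {peak_fill_rate_mpix(ppc, mhz)} Mpix/s")
```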


G8x GPU

[Figure: G8x GPU block diagram. Front end: Host Unit, Front End, Cmd FIFO, Data Assembler, Primitive Control, and Setup, Raster, Zcull. Six TPC blocks appear in the extracted labels; each has a Texture Unit with Texture L1, a PreROP, Geometry and Raster blocks, an I&D L2 Cache, an SMC, and two SMs. Each SM has SP0 to SP7 (each with a register file, R), two SFUs, Shared Memory, a Data L1 Cache, an Instruction L1 Cache, Instruction Fetch, and Instruction Decode. Memory side: a Hub with In and Out ports and six partitions, each with T, L2, ROP, FB, and DRAM.]

SM: Streaming Multiprocessors

TPC: Texture-Processor Clusters


NVIDIA Tesla

Scalable High Density Computing

Massively Multi-threaded Parallel Computing


Unified Design

[Figure: Discrete Design versus Unified Design. Labels: Shader A, Shader B, Shader C, Shader D; Shader Core; four ibuffers and four obuffers.]


Streaming Multiprocessor (SM)

[Figure: an SM within a TPC. Labels: I-Cache, MT Issue, C-Cache, eight SPs, DP, two SFUs, Shared Memory.]

8 Streaming Processors (SP)

8 SP FMA, 1 shared DP FMA

2 Super Function Units (SFU)

Multi-threaded instruction dispatch

1 to 768 threads active

SIMD instruction per 16/32 threads

Hot clock 1.5 GHz, tepid 750 MHz, 24 GFLOPS

32 KB local register file (RFn)

16 KB global register file (GRF), aka Shared Memory
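The SM figures above hang together arithmetically, and a short calculation shows how. The sketch counts an FMA as two floating-point operations and assumes 32-bit registers and a 32-thread warp (the warp size is stated later in the talk); the variable names are illustrative only.

```python
SP_PER_SM = 8
FLOPS_PER_FMA = 2               # a fused multiply-add counts as two operations
HOT_CLOCK_GHZ = 1.5
MAX_THREADS = 768
WARP_SIZE = 32                  # from the later "SIMT Multithreaded Execution" slide
REGISTER_FILE_BYTES = 32 * 1024
REGISTER_BYTES = 4              # assumption: 32-bit registers

# Peak single-precision rate of one SM.
peak_gflops = SP_PER_SM * FLOPS_PER_FMA * HOT_CLOCK_GHZ

# How many warps a fully occupied SM holds.
max_warps = MAX_THREADS // WARP_SIZE

# Register budget per thread when all 768 threads are resident.
total_registers = REGISTER_FILE_BYTES // REGISTER_BYTES
registers_per_thread = total_registers // MAX_THREADS

if __name__ == "__main__":
    print(peak_gflops, max_warps, total_registers, registers_per_thread)
```

Under these assumptions the 24 GFLOPS on the slide is 8 SPs times 2 operations times 1.5 GHz, and a fully occupied SM leaves each thread roughly ten registers.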


SM Conceptual Block Diagram

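[Figure: SM conceptual block diagram. Labels: Warp 0, Warp 1, through Warp K, each with its own PC; Register Files; Fetch Unit; Instruction Cache; Sched Unit; ALUs; LSU. Regions are marked Single Instruction (SI) and Multi-Threaded (MT).]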


The Future


The CMOS “Canvas”

[Figure: a 20 mm by 20 mm area.]


The Ideal Processor?

[Figure: the canvas filled entirely with math units (M).]

M: Math Unit


The Processor We Live With?

[Figure: the canvas filled with math units (M) interleaved with “glue” units (G).]

M: Math Unit

G: “Glue” Unit

Performance = Total Area × Computational Area Efficiency × Achieved Dynamic Efficiency
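A small worked example may help with the performance product. The sketch below plugs in the 20 mm by 20 mm canvas from the earlier slide; the two efficiency values for the "processor we live with" case are hypothetical numbers chosen for illustration, not figures from the talk, and the result is in units of effective math-unit area.

```python
def performance(total_area_mm2, computational_area_efficiency,
                achieved_dynamic_efficiency):
    """Effective math area: total area, scaled by the fraction that is math
    units, scaled again by the fraction of time those units do useful work."""
    return (total_area_mm2
            * computational_area_efficiency
            * achieved_dynamic_efficiency)


CANVAS_MM2 = 20 * 20  # the 20 mm by 20 mm canvas

# Ideal processor: every square millimetre is a math unit that is always busy.
ideal = performance(CANVAS_MM2, 1.0, 1.0)

# Hypothetical real design: half the die is "glue", math units busy 60% of the time.
realistic = performance(CANVAS_MM2, 0.5, 0.6)

if __name__ == "__main__":
    print(ideal, realistic, realistic / ideal)
```

Because the three terms multiply, a loss in either efficiency term costs as much as giving up the same fraction of die area.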


What is SIMT?

[Figure labels: SIMD, MIMD, SIMT]


SIMD versus MIMD versus SIMT?

SIMD: “Synchronous Internally Parallel”

MIMD: “Asynchronous Externally Parallel”

SIMT: “Quasi-Synchronous Externally Parallel”

SIMT = “Near” MIMD Programming Model w/ SIMD Implementation Efficiencies

[Figure: the same work expressed two ways.]

SIMD: two vector instructions and one scalar instruction

VLD R1,R0,#imm (SIMD vector instruction)
VMA R3,R1,R2,R3 (SIMD vector instruction)
ADD R0,R0,#imm (scalar instruction)

MIMD/SIMT: three threads, each executing its own scalar load (LD Rd from Rs, #imm), multiply-add (Rd from Rs × Rs + Rs), and add (Rd from Rs, #imm)
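The contrast between the vector listing and the per-thread listing can be sketched in Python. This is only an analogy: `simd_style` stands in for the VLD/VMA/ADD sequence with an explicit vector width, and `simt_thread` is the scalar program that each thread would run. The function names, the multiply-accumulate workload, and the width of 4 are assumptions made for the example.

```python
def simd_style(a, b, acc, width=4):
    """One instruction stream; each vector instruction covers `width` elements."""
    out = list(acc)
    base = 0                                        # R0: scalar base index
    while base < len(a):
        va = a[base:base + width]                   # VLD: vector load
        vb = b[base:base + width]
        vc = out[base:base + width]
        out[base:base + width] = [x * y + z         # VMA: vector multiply-add
                                  for x, y, z in zip(va, vb, vc)]
        base += width                               # ADD: scalar pointer step
    return out


def simt_thread(tid, a, b, acc):
    """The scalar program written once and run by every thread."""
    x = a[tid]                                      # LD: this thread's element
    return x * b[tid] + acc[tid]                    # multiply-add


def simt_style(a, b, acc):
    """Launch one logical thread per element."""
    return [simt_thread(tid, a, b, acc) for tid in range(len(a))]


if __name__ == "__main__":
    a, b, acc = [1, 2, 3, 4, 5, 6], [2] * 6, [1] * 6
    print(simd_style(a, b, acc))
    print(simt_style(a, b, acc))
```

Both forms compute the same values. In the SIMD form the programmer manages the vector width and the base pointer; in the SIMT form the code is scalar and grouping threads into lanes is left to the hardware.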


SIMT Multithreaded Execution

SIMT: Single-Instruction Multi-Thread

Executes one instruction across many independent threads

Warp: a set of 32 parallel threads that execute a SIMT instruction

SIMT provides easy single-thread scalar programming with SIMD efficiency

Hardware implements zero-overhead warp and thread scheduling

SIMT threads can execute independently

SIMT warp diverges and converges when threads branch independently

Best efficiency and performance when threads of a warp execute together

[Figure: Single-Instruction Multi-Thread instruction scheduler. Instructions issued along the time axis: warp 8 instruction 11; warp 1 instruction 42; warp 3 instruction 95; warp 8 instruction 12; ...; warp 3 instruction 96.]
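The interleaving in the scheduler figure can be imitated with a toy model. The sketch below is not NVIDIA's scheduler; it assumes a simple round-robin policy, one issue slot per cycle, and a fixed latency before a warp may issue again, all of which are simplifications introduced here.

```python
def simulate(num_warps, latency, cycles):
    """Each cycle, issue one instruction from the next ready warp, if any."""
    ready_at = [0] * num_warps   # cycle at which each warp may issue again
    pc = [0] * num_warps         # per-warp program counter
    trace = []                   # (cycle, warp, pc) for every issued instruction
    nxt = 0                      # round-robin starting point
    for cycle in range(cycles):
        for i in range(num_warps):
            w = (nxt + i) % num_warps
            if ready_at[w] <= cycle:
                trace.append((cycle, w, pc[w]))
                pc[w] += 1
                ready_at[w] = cycle + latency
                nxt = (w + 1) % num_warps
                break
    return trace


def utilization(num_warps, latency, cycles=64):
    """Fraction of cycles in which some warp issued an instruction."""
    return len(simulate(num_warps, latency, cycles)) / cycles


if __name__ == "__main__":
    for warps in (1, 2, 4, 8):
        print(warps, "warps:", utilization(warps, latency=4))
```

In this model the issue slot stays full once there are at least as many ready warps as cycles of latency, which is the intuition behind keeping many warps resident per SM.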


A Few Open SIMT Problems

Control Divergence

Data Divergence

Data Representation

Coherence

Diversity


Control Divergence

A_Code;
if (cond) {
    B_Code;
    while (cond) {
        C_Code;
        if (cond) {
            D_Code;
        } else {
            E_Code;
        }
        F_Code;
    }
    G_Code;
} else {
    H_Code;
}
I_Code;

[Figure: control flow graph of the code above, with blocks A through I and taken (T) and not-taken (NT) edges. Markers: Immediate Dominator, Immediate Post-Dominator, Control Flow Operation.]

Control flow divergence can happen at control flow operations


Why is Control Divergence Bad?

Loss of efficiency in SIMD execution when threads on different execution paths are executed together

Unequal path execution delays imply the “wait or stay diverged” dilemma

[Figure: a branch labeled 41 with a taken (T) path labeled 25 and a not-taken (NT) path labeled 16.]
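The efficiency loss can be put in numbers with a simple model. The sketch assumes that a diverged warp runs the taken path and then the not-taken path, with inactive lanes idling in each; the function name and the example path lengths are hypothetical, and path lengths are assumed to be at least one instruction.

```python
WARP_SIZE = 32


def simd_efficiency(taken_mask, then_len, else_len):
    """Useful lane-slots divided by issued lane-slots for one if/else region.

    taken_mask: one boolean per thread, True if it takes the branch.
    then_len / else_len: instruction counts of the two paths (each >= 1).
    """
    n_taken = sum(taken_mask)
    n_not = len(taken_mask) - n_taken
    # A path occupies issue slots only if at least one thread follows it.
    issued = (then_len if n_taken else 0) + (else_len if n_not else 0)
    useful = n_taken * then_len + n_not * else_len
    return useful / (issued * len(taken_mask))


if __name__ == "__main__":
    converged = [True] * WARP_SIZE
    split = [True] * 16 + [False] * 16
    one_stray = [True] + [False] * (WARP_SIZE - 1)
    print(simd_efficiency(converged, 10, 10))
    print(simd_efficiency(split, 10, 10))
    print(simd_efficiency(one_stray, 30, 10))
```

In this model a single stray thread on a long path is enough to pull the efficiency of the whole warp well below one half.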


Data Access via Pointers in Parallel Programs

Pointers represent a major problem in parallel programs

Location that a pointer references cannot be resolved until runtime

struct {
    int x;
    int y;
} *p;
int z = p->y;

LD R1,R0[4] // R0 = p

[Figure: pipeline stages FETCH, DECODE, ISSUE, ADDRESS, CACHE, WB, with a marker labeled “Resolved”.]
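The split between what is known at compile time and what is known only at run time can be shown with Python's `ctypes`. This sketch mirrors the two-field struct from the slide: the field offset is a constant (the `[4]` in the load, assuming a 4-byte `int`), while the base address is only available once the object exists. The class name `P` and the field values are illustrative.

```python
import ctypes


class P(ctypes.Structure):
    # Same layout as the slide's struct { int x; int y; }
    _fields_ = [("x", ctypes.c_int), ("y", ctypes.c_int)]


# Known statically: the offset of y within the struct.
offset_y = P.y.offset

# Known only at run time: where this particular object lives.
obj = P(x=7, y=9)
base = ctypes.addressof(obj)

# The load LD R1,R0[4] computes base + offset and reads an int from there.
z = ctypes.c_int.from_address(base + offset_y).value

if __name__ == "__main__":
    print("offset of y:", offset_y)
    print("z =", z)
```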


Data Divergence

SIMT magnifies the pointer problem

Non-converged memory accesses = data divergence

Classic scatter/gather problem

[Figure: a single FETCH, DECODE, ISSUE front end fanning out into four separate ADDRESS, CACHE, WB paths into Memory.]
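One way to quantify data divergence is to count how many distinct memory segments a warp touches with one load. The sketch below does that; the 128-byte segment size, the function name, and the example strides are assumptions for illustration rather than details from the talk.

```python
WARP_SIZE = 32


def transactions(addresses, segment_bytes=128):
    """Number of distinct aligned segments covered by a set of byte addresses."""
    return len({addr // segment_bytes for addr in addresses})


# Converged: thread t reads the 4-byte word at 4 * t.
unit_stride = [4 * t for t in range(WARP_SIZE)]

# Diverged: every thread lands in a different segment (a scatter/gather pattern).
wide_stride = [128 * t for t in range(WARP_SIZE)]

if __name__ == "__main__":
    print("unit stride:", transactions(unit_stride))
    print("128-byte stride:", transactions(wide_stride))
```

Under this model a converged access is served by one segment fetch, while the fully diverged access needs one per thread.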


Data Representation: The AOS versus SOA Dilemma

AOS (array of structures)

#define NNN nnn
struct {
    type1 field1;
    type2 field2;
    ...
} data[NNN];

SOA (structure of arrays)

#define NNN nnn
struct {
    type1 field1[NNN];
    type2 field2[NNN];
    ...
} data;


AOS versus SOA in Memory

In both layouts, each row is one memory line; the row address is on the left and the column address bits run from 000 to 111.

AOS:

000xxx: Field1[0] Field2[0] ... FieldN[0] Field1[1] Field2[1] ... FieldN[1]
001xxx: Field1[2] Field2[2] ... FieldN[2] Field1[3] Field2[3] ... FieldN[3]
010xxx: Field1[4] Field2[4] ... FieldN[4] Field1[5] Field2[5] ... FieldN[5]
011xxx: Field1[6] Field2[6] ... FieldN[6] Field1[7] Field2[7] ... FieldN[7]
100xxx: ...

SOA:

000xxx: Field1[0] Field1[1] Field1[2] Field1[3] Field1[4] Field1[5] Field1[6] Field1[7]
001xxx: Field2[0] Field2[1] Field2[2] Field2[3] Field2[4] Field2[5] Field2[6] Field2[7]
010xxx: ...

Figure labels: Vector Access, Scalar Array Access


AOS versus SOA: How to Choose?

Programmer: pick AOS

Natural way to think about data: group related fields

In some cases, better memory access efficiency

Sparse access to records

SIMT: pick SOA

Threads executing same code want to access same data element at the same time

Very convenient for HW

How to reconcile?
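Both sides of the dilemma show up if the addresses are computed for each layout. The sketch below assumes four 4-byte fields per record, 1024 records, and 128-byte memory segments; these sizes and all function names are hypothetical. It compares a warp-wide access to one field against a single thread reading one whole record.

```python
FIELD_BYTES = 4       # assumption: every field is 4 bytes
NUM_FIELDS = 4        # assumption: four fields per record
NNN = 1024            # number of records
SEGMENT_BYTES = 128   # assumption: memory is fetched in 128-byte segments
WARP_SIZE = 32


def aos_address(record, field):
    """Byte offset of data[record].field in the array-of-structures layout."""
    return (record * NUM_FIELDS + field) * FIELD_BYTES


def soa_address(record, field):
    """Byte offset of data.field[record] in the structure-of-arrays layout."""
    return (field * NNN + record) * FIELD_BYTES


def segments(addresses):
    """Number of distinct memory segments touched."""
    return len({a // SEGMENT_BYTES for a in addresses})


# SIMT pattern: each thread of a warp reads the same field of its own record.
warp_aos = segments(aos_address(t, 1) for t in range(WARP_SIZE))
warp_soa = segments(soa_address(t, 1) for t in range(WARP_SIZE))

# Sparse pattern: one thread reads every field of a single record.
record_aos = segments(aos_address(500, f) for f in range(NUM_FIELDS))
record_soa = segments(soa_address(500, f) for f in range(NUM_FIELDS))

if __name__ == "__main__":
    print("warp reads one field:   AOS", warp_aos, "SOA", warp_soa)
    print("thread reads one record: AOS", record_aos, "SOA", record_soa)
```

With these sizes the warp-wide access favors SOA and the single-record access favors AOS, which is the tension the slide describes.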


Descriptors

AKA “capabilities”

For example, Plessey 250, Cambridge CAP, Intel 432

D3D employs a form of descriptor

“Resource descriptors” are capabilities

Major language issue for parallel programming?


Diversity: CPU-GPU Détente?

Really SISD vs. SIMT

Sequential applications on SIMT hardware?

Conversely, thread parallel applications on multi-core scalar machines?

Room for both?


Coherent Caches?

Some small planes have built in parachutes

Really good idea?

Fact: existing GPUs don’t support cache coherency

Bad?

Should coherent caches be added?


The Future Revisited

So what is the future in high performance computing?

1. SIMT

2. Lots of cores

3. Clouds


The Demise of ILP

Uniprocessor performance improvements are crawling to a halt

Very hard to architecturally extract more ILP from single threads

[Figure: performance (ps/inst) on a log scale from 1e+0 to 1e+7 versus year, 1980 to 2020. Trend lines: 52%/year and 19%/year. Annotations: ps/gate 19%, gates/clock 9%, clocks/inst 18%.]
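The three annotated rates appear to be the components of the 52%/year trend. Assuming they are annual improvement rates that compound multiplicatively (an interpretation, not something the slide states), the arithmetic can be checked as follows; the variable names are illustrative.

```python
# Annual improvement rates annotated on the chart.
components = {"ps/gate": 0.19, "gates/clock": 0.09, "clocks/inst": 0.18}

# Compound the three factors to get the overall yearly improvement.
combined = 1.0
for rate in components.values():
    combined *= 1.0 + rate

# Growth over a decade at the two trend rates shown on the chart.
decade_at_52 = 1.52 ** 10
decade_at_19 = 1.19 ** 10

if __name__ == "__main__":
    print(f"combined yearly factor: {combined:.3f}")
    print(f"ten years at 52%/year: {decade_at_52:.1f}x")
    print(f"ten years at 19%/year: {decade_at_19:.1f}x")
```

The product comes to about 1.53, close to the 52%/year line. If only the 19% component remains, a decade of improvement shrinks from roughly 66x to under 6x.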


Parallel Processing

Conjecture: most problems worth solving can be solved via a parallel program

SIMT fundamentally a better model than either SIMD or MIMD

146X: Medical Imaging, U of Utah
36X: Molecular Dynamics, U of Illinois, Urbana
18X: Video Transcoding, Elemental Tech
50X: Matlab Computing, AccelerEyes
100X: Astrophysics, RIKEN
149X: Financial simulation, Oxford
47X: Linear Algebra, Universidad Jaime
20X: 3D Ultrasound, Techniscan
130X: Quantum Chemistry, U of Illinois, Urbana
30X: Gene Sequencing, U of Maryland


Scaling

Can a single GPU do it all?

Systems have to scale to multiple boxes

Programming systems have to scale with them


Final Thoughts

The future is bright for parallel programming

Future supercomputers = networked SIMT-based processing systems
