Streaming Architectures and GPUs


Transcript of Streaming Architectures and GPUs

Page 1: Streaming Architectures and GPUs


Streaming Architectures and GPUs

Ian Buck, Bill Dally & Pat Hanrahan

Stanford University, February 11, 2004

Page 2: Streaming Architectures and GPUs


To Exploit VLSI Technology We Need:

• Parallelism – To keep 100s of ALUs per chip (thousands/board, millions/system) busy

• Latency tolerance – To cover 500-cycle remote memory access time

• Locality – To match 20 Tb/s ALU bandwidth to ~100 Gb/s chip bandwidth

• Moore's Law – Growth of transistors, not performance

Arithmetic is cheap, global bandwidth is expensive:
local << global on-chip << off-chip << global system

[Figure: 90 nm chip, $200, 1 GHz, 12 mm on a side; 64-bit FPU shown to scale, 0.5 mm, 1 clock, 50 pJ/FLOP]

Courtesy of Bill Dally

Page 3: Streaming Architectures and GPUs


Arithmetic Intensity

Lots of ops per word transferred

• Compute-to-bandwidth ratio

• High arithmetic intensity desirable
 – App limited by ALU performance, not off-chip bandwidth
 – More chip real estate for ALUs, not caches

Courtesy of Pat Hanrahan

[Figure: same 90 nm chip / 64-bit FPU diagram as the previous slide]
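As a back-of-the-envelope check (an illustrative sketch, not from the slides), the bandwidth figures quoted two slides back imply how many operations must be performed per word brought on-chip before an application stops being bandwidth-limited:

    #include <stdio.h>

    /* Illustrative only: uses the ~20 Tb/s ALU bandwidth and ~100 Gb/s
     * off-chip bandwidth figures quoted on the earlier slide.            */
    int main(void) {
        double alu_bw_gbps  = 20000.0;   /* ~20 Tb/s of ALU operand bandwidth */
        double chip_bw_gbps =   100.0;   /* ~100 Gb/s of off-chip bandwidth   */

        /* Every word fetched off-chip must feed roughly this many ALU
         * accesses for compute, not bandwidth, to be the limiting factor. */
        printf("ops per off-chip word needed: ~%.0f\n",
               alu_bw_gbps / chip_bw_gbps);   /* prints ~200 */
        return 0;
    }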

Page 4: Streaming Architectures and GPUs


Brook: Stream Programming Model

– Enforce data-parallel computing
– Encourage arithmetic intensity
– Provide fundamental ops for stream computing
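As a concrete illustration of the model, here is a minimal kernel in Brook's C-like syntax (a sketch following the BrookGPU syntax; saxpy itself is just an illustration, not an example from these slides):

    // Brook kernel: "<>" marks stream arguments. The body is applied to every
    // element independently, which enforces data-parallel execution and keeps
    // intermediate values local, raising arithmetic intensity.
    kernel void saxpy(float a, float4 x<>, float4 y<>, out float4 result<>) {
        result = a * x + y;
    }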

Page 5: Streaming Architectures and GPUs


Streams & Kernels

• Streams
 – Collection of records requiring similar computation
   • Vertex positions, voxels, FEM cells, …
 – Provide data parallelism

• Kernels
 – Functions applied to each element in the stream
   • Transforms, PDEs, …
 – No dependencies between stream elements
 – Encourage high arithmetic intensity
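Roughly how a Brook host program drives such a kernel over whole streams (a sketch of the body of a host function, using the streamRead/streamWrite operators that move data between memory and streams; sizes and values here are arbitrary):

    float4 X[1000], Y[1000], Result[1000];   // ordinary arrays in memory
    float4 x<1000>, y<1000>, result<1000>;   // streams of records

    streamRead(x, X);                        // gather records into input streams
    streamRead(y, Y);
    saxpy(2.0f, x, y, result);               // kernel runs once per element,
                                             // with no inter-element dependencies
    streamWrite(result, Result);             // write the output stream to memory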

Page 6: Streaming Architectures and GPUs


Vectors vs. Streams

• Vectors:
 – v: array of floats
 – Instruction sequence:
     LD v0
     LD v1
     ADD v0, v1, v2
     ST v2
 – Large set of temps

• Streams:
 – s: stream of records
 – Instruction sequence:
     LD s0
     LD s1
     CALLS f, s0, s1, s2
     ST s2
 – Small set of temps

Higher arithmetic intensity: |f|/|s| >> |+|/|v|
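The same contrast in loop form (a sketch in plain C; the record type and kernel f are hypothetical stand-ins introduced only for this example):

    typedef struct { float v[8]; } Record;    /* hypothetical stream record */

    /* Hypothetical kernel f: many operations per record; temporaries stay local. */
    static float f(Record a, Record b) {
        float acc = 0.0f;
        for (int k = 0; k < 8; k++)
            acc += a.v[k] * b.v[k] + a.v[k];  /* several ops per word loaded */
        return acc;
    }

    /* Vector style: one add per pair of words loaded, so |+|/|v| is small. */
    void vector_add(const float *v0, const float *v1, float *v2, int n) {
        for (int i = 0; i < n; i++)
            v2[i] = v0[i] + v1[i];
    }

    /* Stream style: the whole kernel f runs per record loaded, so |f|/|s| is large. */
    void stream_apply(const Record *s0, const Record *s1, float *s2, int n) {
        for (int i = 0; i < n; i++)
            s2[i] = f(s0[i], s1[i]);
    }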

Page 7: Streaming Architectures and GPUs


Imagine

• Imagine
 – Stream processor for image and signal processing
 – 16 mm die in 0.18 µm TI process
 – 21M transistors

[Block diagram: four SDRAM channels (2 GB/s) feed a Stream Register File (32 GB/s), which feeds the ALU clusters (544 GB/s)]

Page 8: Streaming Architectures and GPUs


Merrimac Processor

• 90 nm tech (1 V)
• ASIC technology
• 1 GHz (37 FO4)
• 128 GOPS
• Inter-cluster switch between clusters
• 127.5 mm² (small, ~12 x 10 mm)
 – Stanford Imagine is 16 mm x 16 mm
 – MIT Raw is 18 mm x 18 mm
• 25 Watts (P4 = 75 W)
 – ~41 W with memories

[Floorplan, 12.5 mm x 10.2 mm: network interface, microcontroller, 16 ALU clusters, 16 RDRAM interfaces with forward ECC, address generators and reorder buffers, 8 cache banks, a memory switch, two MIPS64 20kc cores, and 8K-word SRF banks. Each 2.3 mm x 1.6 mm cluster contains four 64-bit FP/INT MADD units (0.9 mm x 0.6 mm each) with 64-word register files.]

Page 9: Streaming Architectures and GPUs


Merrimac Streaming Supercomputer

Node: Stream Processor (128 FPUs, 128 GFLOPS) with 16 x DRDRAM (2 GBytes) at 16 GBytes/s; 16 GBytes/s (32+32 pairs) to the on-board network

Board: 16 nodes = 1K FPUs, 2 TFLOPS, 32 GBytes, connected by the on-board network

Backplane: 32 boards = 512 nodes, 64K FPUs, 64 TFLOPS, 1 TByte; boards connected by the intra-cabinet network at 64 GBytes/s (128+128 pairs) over 6" Teradyne GbX

Inter-cabinet network: E/O-O/E ribbon fiber, 1 TBytes/s (2K+2K links), connecting 32 backplanes

Bisection: 32 TBytes/s

All links 5 Gb/s per pair or fiber; all bandwidths are full duplex
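The per-level figures multiply out consistently; a quick sketch (illustrative only) checking the aggregates against the node numbers above:

    #include <stdio.h>

    /* Illustrative check that the node figures aggregate to the board and
     * backplane totals in the diagram above (1K = 1024, as in "64K FPUs"). */
    int main(void) {
        int fpus_per_node = 128, nodes_per_board = 16, boards_per_backplane = 32;
        double node_gflops = 128.0, node_gbytes = 2.0;

        int board_fpus  = fpus_per_node * nodes_per_board;           /* 2048 = 2K   */
        int system_fpus = board_fpus * boards_per_backplane;         /* 65536 = 64K */

        printf("board:     %d FPUs, %g GFLOPS, %g GBytes\n", board_fpus,
               node_gflops * nodes_per_board,                        /* ~2 TFLOPS   */
               node_gbytes * nodes_per_board);                       /* 32 GBytes   */
        printf("backplane: %d FPUs, %g GFLOPS, %g GBytes\n", system_fpus,
               node_gflops * nodes_per_board * boards_per_backplane, /* ~64 TFLOPS  */
               node_gbytes * nodes_per_board * boards_per_backplane);/* ~1 TByte    */
        return 0;
    }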

Page 10: Streaming Architectures and GPUs


Streaming Applications

• Finite volume – StreamFLO (from TFLO)
• Finite element – StreamFEM
• Molecular dynamics code (ODEs) – StreamMD
• Model (elliptic, hyperbolic and parabolic) PDEs
• PCA applications: FFT, Matrix Mul, SVD, Sort

Page 11: Streaming Architectures and GPUs


StreamFLO

• StreamFLO is the Brook version of FLO82, a FORTRAN code written by Prof. Jameson for the solution of inviscid flow around an airfoil.

• The code uses a cell centered finite volume formulation with a multigrid acceleration to solve the 2D Euler equations.

• The structure of the code is similar to TFLO and the algorithm is found in many compressible flow solvers.

Page 12: Streaming Architectures and GPUs


StreamFEM

• A Brook implementation of the Discontinuous Galerkin (DG) Finite Element Method (FEM) in 2D triangulated domains.

Page 13: Streaming Architectures and GPUs


StreamMD: motivation

• Application: study the folding of human proteins.

• Molecular dynamics: computer simulation of the dynamics of macromolecules.

• Why this application?
 – Expect high arithmetic intensity.
 – Requires variable-length neighbor lists.
 – Molecular dynamics can be used in engine simulation to model spray, e.g. droplet formation and breakup, drag, deformation of droplets.

• Test case chosen for initial evaluation: box of water molecules.

[Images: DNA molecule; human immunodeficiency virus (HIV)]

Page 14: Streaming Architectures and GPUs


Summary of Application Results

Application                    | Sustained GFLOPS¹ | FP Ops / Mem Ref | LRF Refs       | SRF Refs     | Mem Refs
StreamFEM2D (Euler, quadratic) | 32.2              | 23.5             | 169.5M (93.6%) | 10.3M (5.7%) | 1.4M (0.7%)
StreamFEM2D (MHD, cubic)       | 33.5              | 50.6             | 733.3M (94.0%) | 43.8M (5.6%) | 3.2M (0.4%)
StreamMD                       | 14.2²             | 12.1             | 90.2M (97.5%)  | 1.6M (1.7%)  | 0.7M (0.8%)
StreamFLO                      | 11.4²             | 7.4              | 234.3M (95.7%) | 7.2M (2.9%)  | 3.4M (1.4%)

1. Simulated on a machine with 64 GFLOPS peak performance
2. The low numbers are a result of many divide and square-root operations

Page 15: Streaming Architectures and GPUs


Streaming on graphics hardware?

Pentium 4 SSE theoretical*: 3 GHz * 4 wide * 0.5 inst / cycle = 6 GFLOPS

GeForce FX 5900 (NV35) fragment shader observed (MULR R0, R0, R0): 20 GFLOPS, equivalent to a 10 GHz P4, and getting faster: 3x improvement over NV30 (6 months)

*from Intel P4 Optimization Manual

[Chart: observed GFLOPS (0-25) from Jun-01 to Jul-03 for the Pentium 4 and the GeForce FX line (NV30, NV35)]

Page 16: Streaming Architectures and GPUs


GPU Program Architecture

[Diagram: Input Registers → Program → Output Registers; the Program also has access to Constants, temporary Registers, and Texture]

Page 17: Streaming Architectures and GPUs


Example Program

Simple Specular and Diffuse Lighting

!!VP1.0
#
# c[0-3]  = modelview projection (composite) matrix
# c[4-7]  = modelview inverse transpose
# c[32]   = eye-space light direction
# c[33]   = constant eye-space half-angle vector (infinite viewer)
# c[35].x = pre-multiplied monochromatic diffuse light color & diffuse mat.
# c[35].y = pre-multiplied monochromatic ambient light color & diffuse mat.
# c[36]   = specular color
# c[38].x = specular power
# outputs homogenous position and color
#
DP4 o[HPOS].x, c[0], v[OPOS];     # Compute position.
DP4 o[HPOS].y, c[1], v[OPOS];
DP4 o[HPOS].z, c[2], v[OPOS];
DP4 o[HPOS].w, c[3], v[OPOS];
DP3 R0.x, c[4], v[NRML];          # Compute normal.
DP3 R0.y, c[5], v[NRML];
DP3 R0.z, c[6], v[NRML];          # R0 = N' = transformed normal
DP3 R1.x, c[32], R0;              # R1.x = Ldir DOT N'
DP3 R1.y, c[33], R0;              # R1.y = H DOT N'
MOV R1.w, c[38].x;                # R1.w = specular power
LIT R2, R1;                       # Compute lighting values
MAD R3, c[35].x, R2.y, c[35].y;   # diffuse + ambient
MAD o[COL0].xyz, c[36], R2.z, R3; # + specular
END

Page 18: Streaming Architectures and GPUs


Cg/HLSL: High-level language for GPUs

Specular Lighting

// Lookup the normal map
float4 normal = 2 * (tex2D(normalMap, I.texCoord0.xy) - 0.5);

// Multiply 3 X 2 matrix generated using lightDir and halfAngle with
// scaled normal followed by lookup in intensity map with the result.
float2 intensCoord = float2(dot(I.lightDir.xyz, normal.xyz),
                            dot(I.halfAngle.xyz, normal.xyz));
float4 intensity = tex2D(intensityMap, intensCoord);

// Lookup color
float4 color = tex2D(colorMap, I.texCoord3.xy);

// Blend/Modulate intensity with color
return color * intensity;

Page 19: Streaming Architectures and GPUs


GPU: Data Parallel

• Each fragment shaded independently
 – No dependencies between fragments
 – Temporary registers are zeroed
 – No static variables
 – No read-modify-write textures

• Multiple “pixel pipes”
 – Data parallelism
 – Support ALU-heavy architectures
 – Hide memory latency

[Torborg and Kajiya 96, Anderson et al. 97, Igehy et al. 98]

Page 20: Streaming Architectures and GPUs


GPU: Arithmetic Intensity

Lots of ops per word transferred

Graphics pipeline:
– Vertex
  • BW: 1 triangle = 32 bytes
  • OP: 100-500 f32-ops / triangle
– Rasterization
  • Create 16-32 fragments per triangle
– Fragment
  • BW: 1 fragment = 10 bytes
  • OP: 300-1000 i8-ops / fragment

Shader Programs

Courtesy of Pat Hanrahan
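Plugging in the per-primitive figures above gives a rough sense of the ops-per-byte at each stage (an illustrative sketch using only the ranges quoted on this slide):

    #include <stdio.h>

    /* Illustrative ops-per-byte for the vertex and fragment stages,
     * using the ranges quoted above.                                 */
    int main(void) {
        printf("vertex:   %.1f - %.1f ops/byte\n", 100.0 / 32, 500.0 / 32);  /* ~3-16   */
        printf("fragment: %.0f - %.0f ops/byte\n", 300.0 / 10, 1000.0 / 10); /* ~30-100 */
        return 0;
    }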

Page 21: Streaming Architectures and GPUs


Streaming Architectures

[Diagram: four SDRAM channels → Stream Register File → ALU clusters]

Page 22: Streaming Architectures and GPUs


Streaming Architectures

[Diagram: four SDRAM channels → Stream Register File → ALU clusters, annotated with a shader fragment (MAD R3, R1, R2; MAD R5, R2, R3;) as the Kernel Execution Unit]

Page 23: Streaming Architectures and GPUs


Streaming Architectures

[Same diagram, with the Kernel Execution Unit and an added label: Parallel Fragment Pipelines]

Page 24: Streaming Architectures and GPUs


Streaming Architectures

[Same diagram, with the Kernel Execution Unit and Parallel Fragment Pipelines, plus annotations on the Stream Register File:
 • Texture cache?
 • F-Buffer [Mark et al.]]

Page 25: Streaming Architectures and GPUs


Conclusions

• The problem is bandwidth – arithmetic is cheap

• Stream processing & architectures can provide VLSI-efficient scientific computing
 – Imagine
 – Merrimac

• GPUs are first-generation streaming architectures
 – Apply the same stream programming model for general-purpose computing on GPUs
