EE 273 Lecture 1 1/10/01
Copyright 1998,2000,2001 by W. J. Dally, All rights reserved 1
1
Stream Processors
William J. Dally
Computer Systems Laboratory
Stanford University
[email protected]
2
Stream Processors
• What is Stream Processor Architecture?
  – What problem is being solved
    • Performance scaling, power efficiency, bandwidth bottlenecks
  – What is a stream program?
  – What is a stream processor?
3
What is a Stream Processor?
• A stream program is a computation organized as streams of records operated on by computation kernels.
• A stream processor is optimized to exploit the locality and concurrency of stream programs.
4
Stream Processing is becoming pervasive
Insights, Peter Huber, 01.07.02:
"A new type of computing--'stream computing'--has emerged to handle the back end of sonar, radar, X-ray sources and certain broadband applications such as voice-over Internet and digital TV."
5
Motivation – Why do we need a stream processor?

• Application demand
  – Power
  – Bandwidth
  – Performance scaling
• Emerging media applications demand 10s of GOPS to TOPS with low power
  – Video compression
  – Image understanding
  – Signal processing
  – Graphics
• Scientific applications also need TOPS of performance with reasonable cost
6
Conventional Processors No Longer Scale Performance by 50% each year
[Figure: Perf (ps/Inst) vs. year, 1980–2020, log scale (1e+0 to 1e+7). The historical 52%/year improvement (ps/gate 19%, gates/clock 9%, clocks/inst 18%) has flattened to 19%/year.]
7
Future potential of novel architecture is large (1000 vs 30)
[Figure: Perf (ps/Inst) and its linear trend vs. year, 1980–2020, log scale, comparing trend lines of 52%/year, 74%/year, and 19%/year; the resulting gaps are marked 30:1, 1,000:1, and 30,000:1.]
8
Clock Scaling: Historical and Projected
[Figure: clock speed (MHz) vs. year, 1970–2014, log scale (1 to 10000 MHz), plotting Intel processors. Annotations give the clock period in fanout-of-4 delays, falling from 375 FO4 through 230, 185, 175, 115, 85, 70, 53, 46, 40, 25, 15, and 9 FO4 toward the projected 8–16 FO4 band, across process generations from 10µm down to 0.035µm.]
9
Anant Agarwal (MIT) Panel at HPCA '02
10
Pentium III vs. Pentium IV
                    Pentium III      Pentium IV
Technology          180nm            180nm
Pipeline Stages     10               20
Clock Rate          1GHz (15 FO4)    1.5GHz (10.4 FO4)
SpecInt2000         454              524
SpecInt/MHz         0.45             0.35
Die Size            106mm²           217mm²
Transistor Count    24 million       42 million
L1 D$ Capacity      16KBytes         12KBytes
# Grids             10^8             2x10^8
11
Outline
• Motivation – we need low-power, programmable TeraOPS
• The problem is bandwidth
  – Growing gap between special-purpose and general-purpose hardware
  – It's easy to make ALUs, hard to keep them fed
• A stream processor gives programmable bandwidth
  – Streams expose locality and concurrency in the application
  – A bandwidth hierarchy exploits this
• Imagine is a 20GFLOPS prototype stream processor
• Many opportunities to do better
  – Scaling up
  – Simplifying programming
12
Motivation
• Some things I'd like to do with a few TeraOps
  – Have a realistic face-to-face meeting with someone in Boston without riding an airplane
    • 4–8 cameras, extract depth, fit model, compress, render to several screens
  – High-quality rendering at video rates
    • Ray tracing a 2K x 4K image with 10^5 objects at 60 frames/s
13
The good news – FLOPS are cheap, OPS are cheaper

• 32-bit FPU – 2GFLOPS/mm² – 400GFLOPS/chip
• 16-bit add – 40GOPS/mm² – 8TOPS/chip

[Figure: layout of an integer adder with its local RF, 460 µm × 146.7 µm.]
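As a sanity check, the per-chip numbers follow from the per-area densities. The ~200 mm² of usable die area below is my assumption; the slide states only the densities and the totals.

```python
# Back-of-envelope check of the slide's per-chip figures, assuming a
# hypothetical die with ~200 mm^2 of usable compute area (not stated on the slide).
fpu_density_gflops_per_mm2 = 2      # 32-bit FPU density from the slide
adder_density_gops_per_mm2 = 40     # 16-bit adder density from the slide

die_area_mm2 = 200                  # assumed usable area

fp_per_chip = fpu_density_gflops_per_mm2 * die_area_mm2    # GFLOPS per chip
int_per_chip = adder_density_gops_per_mm2 * die_area_mm2   # GOPS per chip

print(fp_per_chip)   # 400 GFLOPS, matching the slide
print(int_per_chip)  # 8000 GOPS = 8 TOPS, matching the slide
```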
14
The bad news – general-purpose processors can't harness this

[Figure: FLOPS vs. year, 2001–2011, log scale (1e+8 to 1e+15), comparing available FLOPS with general-purpose peak (GP-Peak) and general-purpose useful (GP-Useful) performance.]
15
Why do Special-Purpose Processors Perform Well?
• Lots (100s) of ALUs
• Fed by dedicated wires/memories
16
Care and Feeding of ALUs
[Figure: a conventional processor datapath. An instruction cache, IR, and IP supply instruction bandwidth; a register file supplies data bandwidth. The 'feeding' structure dwarfs the ALU itself.]
17
Stream Programs make communication explicit
• This reduces energy and delay
• Energy is a matter of distance (interconnect)

Operation                              Energy (0.13µm)   Energy (0.05µm)
32b ALU Operation                      5pJ               0.3pJ
Read 32b from 8KB RAM                  50pJ              3pJ
32b Register Read                      10pJ              0.6pJ
Transfer 32b across chip (10mm)        100pJ             17pJ
Execute a µP instruction (SB-1)        1.1nJ             130pJ
Transfer 32b off chip (2.5G CML)       1.3nJ             400pJ
Transfer 32b off chip (200M HSTL)      1.9nJ             1.9nJ

Off-chip : global : local energy ratio is 300 : 20 : 1 in 2002, and 1300 : 56 : 1 in 2010.
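The quoted ratios can be recomputed from the table, normalizing to a 32b ALU operation. The ~1.5nJ off-chip figure used for 2002 is my assumption, sitting between the two off-chip rows.

```python
# Energy ratios (off-chip : global wire : local ALU op), from the table.
# 2002 (0.13um column): off-chip ~1.5nJ (assumed, between the two off-chip
# rows), 10mm global wire 100pJ, 32b ALU op 5pJ.
ratio_2002 = (round(1.5e-9 / 5e-12), round(100e-12 / 5e-12), 1)

# 2010 (0.05um column): off-chip (2.5G CML) 400pJ, 10mm wire 17pJ, ALU op 0.3pJ.
ratio_2010 = (round(400e-12 / 0.3e-12), round(17e-12 / 0.3e-12), 1)

print(ratio_2002)  # (300, 20, 1), matching the slide
print(ratio_2010)  # (1333, 57, 1); the slide rounds this to 1300 : 56 : 1
```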
18
Interconnect dominates delay
Operation                              Delay (0.13µm)   Delay (0.05µm)
32b ALU Operation                      650ps            250ps
Read 32b from 8KB RAM                  780ps            300ps
32b Register Read                      325ps            125ps
Transfer 32b across chip (10mm)        1400ps           2300ps
Transfer 32b across chip (20mm)        2800ps           4600ps

Global on-chip communication to operation delay is 2:1 today, 9:1 in 2010.
19
What is a Stream Program?
• A program organized as streams of records flowing through kernels
• Example: stereo depth extraction
20
Stereo Depth Extraction Stream Program
[Diagram: Image 0 and Image 1 each flow through two convolve kernels; the filtered streams feed a SAD (sum of absolute differences) kernel, which produces the Depth Map.]
Kernels exploit both instruction (ILP) and data (SIMD) level parallelism.
Streams expose producer-consumer locality.
Kernels can be partitioned across chips to exploit task parallelism.
The stream model exploits parallelism without the complexity of traditional parallel programming.
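The pipeline above can be sketched as kernels consuming and producing streams. This is illustrative Python, not Imagine's actual StreamC/KernelC, and the kernel arithmetic is a placeholder rather than real Gaussian/Laplacian filters.

```python
# Illustrative stream-program structure for the depth extractor: kernels are
# functions over streams (here, Python generators), so producer-consumer
# locality is explicit -- each record flows kernel-to-kernel without a
# round trip to "memory". The kernel math is a stand-in, not the real filters.
def convolve(stream, kernel_weights):
    for record in stream:                      # all operations for one record, then the next
        yield sum(w * record for w in kernel_weights)

def sad(stream0, stream1):
    for a, b in zip(stream0, stream1):         # consume two streams in lockstep
        yield abs(a - b)

image0, image1 = [1, 2, 3, 4], [2, 2, 4, 4]    # placeholder pixel rows
gauss = [0.25, 0.5, 0.25]

# Image 0 and Image 1 each pass through two convolutions, then SAD.
left  = convolve(convolve(image0, gauss), gauss)
right = convolve(convolve(image1, gauss), gauss)
depth_map = list(sad(left, right))
print(depth_map)  # [1.0, 0.0, 1.0, 0.0]
```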
21
Why Organize an Application This Way?
• Expose parallelism at three levels
  – ILP within kernels
  – DLP across stream elements
  – TLP across sub-streams and across kernels
  – Keeps 'easy' parallelism easy
• Expose locality in two ways
  – Within a kernel – kernel locality
  – Between kernels – producer-consumer locality
  – This locality can be exploited independent of spatial or temporal locality
  – Put another way, stream programs make communication explicit
22
Streams expose Kernel Locality missed by Vectors
• Streams
  – Traverse operations first
    • All operations for one record, then next record
    • Smaller working set of temporary values
  – Store and access whole records as a unit
    • Spatial locality of memory references
      – e.g., get contiguous record on gather/scatter
• Vectors
  – Traverse records first
    • All records for one operation, then next operation
    • Large set of temporary values
  – Group like-elements of records into vectors
    • Read one word of each record at a time
  – No locality on gather/scatter
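The traversal-order difference can be made concrete. This is a plain Python sketch; the records and the operations applied to them are invented for illustration.

```python
# Stream order vs. vector order over the same records. The stream version
# keeps only one record's temporaries live at a time; the vector version
# materializes each intermediate for every record at once (VL times more state).
records = [(1, 2), (3, 4), (5, 6)]   # whole records, stored contiguously

# Streams: all operations for one record, then the next record.
stream_out = []
for x, y in records:                 # fetch the whole record as a unit
    t = x + y                        # temporary lives only for this record
    stream_out.append(t * 2)

# Vectors: all records for one operation, then the next operation.
xs = [r[0] for r in records]         # like-elements grouped into vectors
ys = [r[1] for r in records]
ts = [x + y for x, y in zip(xs, ys)] # VL temporaries live simultaneously
vector_out = [t * 2 for t in ts]

assert stream_out == vector_out      # same result, different working sets
print(stream_out)                    # [6, 14, 22]
```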
23
Example – Vertex Transform

x' = t00·x + t01·y + t02·z + t03·w
y' = t10·x + t11·y + t12·z + t13·w
z' = t20·x + t21·y + t22·z + t23·w
w' = t30·x + t31·y + t32·z + t33·w

(x, y, z, w) is the input record; the 16 products t_ij·component are the intermediate results; (x', y', z', w') = T × (x, y, z, w) is the result record.
24
Vertex transform with Streams vs. Vectors
• Streams: small set of intermediate results
  – enables small and fast LRFs
• Vectors: large working set of intermediates
  – VL times larger (e.g., 64×)
  – must use a large, slow global VRF
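For the vertex transform the working-set counts are easy to tabulate. This sketch uses VL = 64 following the slide's example.

```python
# Intermediate working set for a 4x4 vertex transform: the stream version
# holds the 16 products t_ij * component for a single vertex at a time; the
# vector version holds them for all VL vertices in flight at once.
intermediates_per_vertex = 4 * 4     # t00*x ... t33*w

VL = 64                              # example vector length from the slide
stream_working_set = intermediates_per_vertex          # 16 values -> small, fast LRF
vector_working_set = intermediates_per_vertex * VL     # 1024 values -> large, slow VRF

print(stream_working_set, vector_working_set,
      vector_working_set // stream_working_set)        # 16 1024 64
```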
25
What class of applications can be written as stream programs?
• Media applications (signal, image, video, packet, and graphics processing) are naturally expressed in this style
• Scientific applications can be efficiently cast as stream programs
• Others?
  – This is an open question
• Hypothesis
  – Any application with a long run time (large operation count) has a great deal of parallelism and hence can be cast as a stream program.
26
What is a Stream Processor?
• A processor that is optimized to execute a stream program
• Features include
  – Exploit parallelism
    • TLP with multiple processors
    • DLP with multiple clusters within each processor
    • ILP with multiple ALUs within each cluster
  – Exploit locality with a bandwidth hierarchy
    • Kernel locality within each cluster
    • Producer-consumer locality within each processor
• Many different possible architectures
27
The Imagine Stream Processor
[Diagram: the Imagine Stream Processor. A host processor issues operations to the stream controller; a microcontroller sequences eight ALU clusters (0–7), which read and write the central Stream Register File. The SRF also connects to a network interface and to a streaming memory system backed by four SDRAMs.]
28
Arithmetic Clusters
[Diagram: one arithmetic cluster. Three adders, two multipliers, and a divide/square-root unit each draw operands from local register files through a cross-point switch; data arrives from and returns to the SRF, and a communication unit (CU) connects to the intercluster network.]
29
A Bandwidth Hierarchy exploits locality and concurrency
• VLIW clusters with shared control
• 41.2 32-bit floating-point operations per word of memory bandwidth

[Diagram: the bandwidth hierarchy. Four SDRAMs supply 2GB/s to the Stream Register File, which supplies 32GB/s to the ALU clusters, which have 544GB/s of local bandwidth.]
30
A Bandwidth Hierarchy exploits kernel and producer-consumer locality
                    Memory BW    Global RF BW   Local RF BW
Depth Extractor     0.80 GB/s    18.45 GB/s     210.85 GB/s
MPEG Encoder        0.47 GB/s    2.46 GB/s      121.05 GB/s
Polygon Rendering   0.78 GB/s    4.06 GB/s      102.46 GB/s
QR Decomposition    0.46 GB/s    3.67 GB/s      234.57 GB/s

[Diagram: the same hierarchy as before: SDRAM at 2GB/s, Stream Register File at 32GB/s, ALU clusters at 544GB/s.]
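Dividing each row of the table by its memory bandwidth shows how much of each application's traffic the hierarchy keeps on-chip (values in GB/s, copied from the table).

```python
# Memory : Global RF : Local RF bandwidth ratios per application, from the table.
apps = {
    "Depth Extractor":   (0.80, 18.45, 210.85),
    "MPEG Encoder":      (0.47,  2.46, 121.05),
    "Polygon Rendering": (0.78,  4.06, 102.46),
    "QR Decomposition":  (0.46,  3.67, 234.57),
}
ratios = {name: (1, round(srf / mem), round(lrf / mem))
          for name, (mem, srf, lrf) in apps.items()}
for name, (m, g, l) in ratios.items():
    print(f"{name}: {m} : {g} : {l}")
```

For the depth extractor this gives roughly 1 : 23 : 264 — two orders of magnitude of bandwidth served from local registers rather than memory.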
31
Stream Controller
• 32 stream instructions that can be written to the stream controller by the host processor
• Acts as a scoreboard
  – Keeps track of dependencies using a bit field for each stream instruction
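A minimal sketch of that scoreboard idea in Python. The slide specifies only 32 instruction slots and a dependency bit field per slot; the class and method names and the exact issue rule here are my own assumptions.

```python
# Toy scoreboard: 32 slots, each stream instruction carries a bitmask naming
# the slots it depends on. An instruction may issue once every slot it
# depends on has completed.
NUM_SLOTS = 32

class StreamController:
    def __init__(self):
        self.completed = 0     # bit i set => slot i has completed
        self.pending = {}      # slot -> dependency bitmask

    def write(self, slot, deps_mask):
        """Host processor writes a stream instruction into a slot."""
        assert 0 <= slot < NUM_SLOTS
        self.pending[slot] = deps_mask

    def ready(self):
        """Slots whose dependencies have all completed."""
        return [s for s, m in self.pending.items()
                if (m & ~self.completed) == 0]

    def complete(self, slot):
        """Mark a slot finished, possibly unblocking dependents."""
        self.pending.pop(slot, None)
        self.completed |= 1 << slot

sc = StreamController()
sc.write(0, 0)           # no dependencies
sc.write(1, 1 << 0)      # depends on slot 0
print(sc.ready())        # [0]
sc.complete(0)
print(sc.ready())        # [1]
```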
32
Pop Quiz
• Can you have different data types in a stream?
• How is memory access scheduling done?
  – Do streams need to be contiguous in memory?
• In the convolve example, how is shared data of the partials managed?
33
Producer-Consumer Locality in the Depth Extractor
[Diagram: data movement in the depth extractor. Rows of pixels and the final depth-map row segments live in memory/global data; rows, previous and new partial sums, and filtered row segments circulate as streams in the SRF; the Convolution (Gaussian), Convolution (Laplacian), and SAD kernels run in the clusters. Traffic ratio memory : SRF : clusters is 1 : 23 : 317.]
34
Imagine gives high performance with low power and flexible programming
• Matches capabilities of communication-limited technology to demands of signal and image processing applications
• Performance
  – compound stream operations realize >10GOPS on key applications
  – can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)
• Power
  – three-level register hierarchy gives 2–10GOPS/W
• Flexibility
  – programmed in "C"
  – streaming model
  – conditional stream operations enable applications like sort
35
Bandwidth Demand of Applications
36
Die Photos
• 21 M transistors / TI 0.15µm 1.5V CMOS / 16mm x 16mm
• 300 MHz TTTT, hope for 400 MHz in lab
• Chips arrived 4/1/02, no fooling!

[Khailany et al., ICCD '02 (submitted)]
37
Performance demonstrated on signal and image processing
[Chart: demonstrated GOPS by benchmark: depth 12.1, mpeg 19.8, qrd 11.0, dct 23.9, convolve 25.6, fft 7.0. The bars span 16-bit applications (depth, mpeg), 16-bit kernels (dct, convolve), a floating-point application (qrd), and a floating-point kernel (fft).]
38
Architecture of a Streaming Supercomputer
[Diagram: the system is built up hierarchically.
Node: a Stream Processor (64 FPUs, 64GFLOPS) with 16 DRDRAM chips (2GBytes) at 38GBytes/s, and a 20GBytes/s network port (32+32 pairs).
Board: 16 nodes (1K FPUs, 1TFLOPS, 32GBytes) on an on-board network, with 160GBytes/s (256+256 pairs) into the intra-cabinet network (passive, wires only; 10.5" Teradyne GbX backplane).
Cabinet: 64 boards (1K nodes, 64K FPUs, 64TFLOPS, 2TBytes), with 5TBytes/s (8K+8K links) over ribbon fiber (E/O and O/E conversion) into the inter-cabinet network.
System: 16 cabinets; bisection bandwidth 64TBytes/s.
All links run 5Gb/s per pair or fiber; all bandwidths are full duplex.]
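The per-level totals in the diagram are self-consistent, which is easy to verify using the slide's binary prefixes (1K = 1024):

```python
# Aggregate figures for the streaming supercomputer hierarchy, from the diagram.
fpus_per_node, gflops_per_node, gbytes_per_node = 64, 64, 2
nodes_per_board, boards_per_cabinet, cabinets = 16, 64, 16

board_fpus   = nodes_per_board * fpus_per_node            # 1024 -> "1K FPUs"
board_tflops = nodes_per_board * gflops_per_node / 1024   # 1 TFLOPS per board
cab_nodes    = boards_per_cabinet * nodes_per_board       # 1024 -> "1K nodes"
cab_tflops   = cab_nodes * gflops_per_node / 1024         # 64 TFLOPS per cabinet
cab_tbytes   = cab_nodes * gbytes_per_node / 1024         # 2 TBytes per cabinet

print(board_fpus, board_tflops, cab_nodes, cab_tflops, cab_tbytes)
```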
39
Streaming processor
[Diagram: one node's streaming processor. A scalar processor, stream execution unit, and stream register file share the chip with address generators, memory control, and on-chip memory; local DRDRAM provides 38 GB/s and the network interface provides 20 GB/s of network channels. The stream execution unit contains 16 clusters (0–15) connected to the SRF at 260 GB/s; each cluster holds two FP MUL units, two FP ADD units, an FP DSQ (divide/square-root) unit, a comm unit, and a scratchpad, each fed from local registers through a cluster switch. Aggregate LRF bandwidth is 1.5TB/s.]
40
Rough per-node budget
Item               Cost     Per Node
Processor chip     200      200
Router chip        200      50
Memory chip        20       320
Board/Backplane    3000     188
Cabinet            50000    49
Power              1        50
Per-Node Cost               976
$/GFLOPS (64 GFLOPS/node)   15
$/M-GUPS (250 M-GUPS/node)  4

Preliminary numbers, parts cost only, no I/O included.
41
Many open problems
• A small sampling
• Software
  – Program transformation
  – Program mapping
  – Bandwidth optimization
  – Conditionals
  – Irregular data structures
• Hardware
  – Alternative stream models
  – Register organization
  – Bandwidth hierarchies
  – Memory organization
  – Short stream issues
  – ISA design
  – Cluster organization
  – Processor organization
42
Example: Ingress Packet Functions

[Diagram: ingress packet pipeline. Extract header fields and perform packet checks; classify the packet; perform three route-table address calculations, each followed by a table read; finalize the route; run the policing calculation; compute statistics updates (with two fetch-and-add operations); compute the packet-queue address and write the packet to the queue.]
43
Split into Kernels
[Diagram: the same ingress pipeline partitioned into kernels, with the memory operations (route-table reads, fetch-and-adds, and the packet-queue write) separating successive computation kernels.]
44
Software Pipeline The Loop
[Diagram: software-pipelined schedule. In each iteration the arithmetic clusters run kernels on strips i, i-2, i-4, i-6 while memory channels 1 and 2 handle strips i-1, i-3, i-5, i-7. Each block represents a kernel or memory operation on a strip of packets.]
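The overlap amounts to double buffering: the memory channels load and store neighboring strips while the clusters compute on the strip loaded previously. A schematic Python sketch with placeholder strip data and a placeholder kernel:

```python
# Software-pipelined strip processing: in each steady-state iteration the
# "memory channels" handle loads/stores for other strips while the
# "arithmetic clusters" run the kernel on the strip loaded last iteration.
strips = [[1, 2], [3, 4], [5, 6]]          # strips of packets (placeholder data)
kernel = lambda strip: [x * 10 for x in strip]   # placeholder kernel

loaded, results = {}, {}
for i in range(len(strips) + 1):
    # Memory channel 1: load strip i (overlaps with compute on strip i-1).
    if i < len(strips):
        loaded[i] = strips[i]
    # Arithmetic clusters: run the kernel on the previously loaded strip.
    if i - 1 in loaded:
        results[i - 1] = kernel(loaded.pop(i - 1))
    # Memory channel 2 would store results[i-2] back to memory here.

print([results[i] for i in range(len(strips))])  # [[10, 20], [30, 40], [50, 60]]
```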