EE482S Lecture 1 Stream Processor...

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 1

EE482SLecture 1

Stream Processor Architecture

April 4, 2002

William J. DallyComputer Systems Laboratory

Stanford Universitybilld@csl.stanford.edu

Today’s Class Meeting

• What is EE482C?– Material covered– Course format– Assignments– Scribing

• What is Stream Processor Architecture?– What problem is being solved

• Performance scaling, power efficiency, bandwidth bottlenecks– What is a stream program?– What is a stream processor

What is EE482C?

• New course in EE482 sequence (advanced comp. arch.)– EE482A – superscalar architecture– EE482B – interconnection networks– EE482C – stream processor architecture

• Course format– Readings – typically one or two papers per class meeting

• Read the paper before the meeting for which it is listed• E.g., read Rixner et al. before 4/9/02

– Mix of lecture and discussion in class meetings• Be prepared to discuss each reading

– Two programming assignments• One in StreamC/KernelC, one in Brook

– Class project • Original research on stream architecture or programming

Where to find more information

• Course policy sheet• Web page http://cva.stanford.edu/ee482s• Class schedule

– Subject to change

What is a Stream Processor?

• A stream program is a computation organized as streams of records operated on by computation kernels.

• A stream processor is optimized to exploit the locality and concurrency of stream programs

• More later

Stream Processing is becoming pervasive

InsightsPeter Huber, 01.07.02A new type of computing--"stream computing"--has emerged to handle the back end of sonar, radar, X-ray sources and certain broadband applications such as voice-over Internet and digital TV.

Motivation – Why do we need a stream processor?

• Application demand• Power• Bandwidth• Performance Scaling

Application Pull

• Emerging media applications demand 10s of GOPS to TOPs with low power.– Video compression– Image understanding– Signal processing– Graphics

• Scientific applications also need TOPs of performance with reasonable cost

Conventional Processors No Longer Scale Performance by 50% each year

1980 1990 2000 2010 2020

Perf (ps/Inst)52%/year

19%/year

ps/gate 19%Gates/clock 9%

Clocks/inst 18%

Future potential of novel architecture is large (1000 vs 30)

1e-41e-31e-21e-11e+01e+11e+21e+31e+41e+51e+61e+7

1980 1990 2000 2010 2020

Perf (ps/Inst)Linear (ps/Inst)

52%/year

74%/year

19%/year30:1

1,000:1

30,000:1

Clock Scaling: Historical and Projected

8FO416FO4Intel Processors

375 FO4

70 FO4

85 FO4115 FO4

230 FO4185 FO4

175 FO4

53 FO4

40 FO446 FO4

25 FO4

15 FO4

10µm 3µm 1µm1.5µm 0.8µm 0.35µm 0.18µm0.13µm 0.07µm

0.1µm 0.05µm0.035µm

Anant Agarwal (MIT) Panel at HPCA ‘02

Pentium III vs. Pentium IV

2x108108# Grids

12KBytes16KBytesL1 D$ Capacity

42 million24 millionTransistor Count

217mm2106mm2Die Size

0.350.45SpecInt/MHz

524454SpecInt2000

1.5GHz (10.4 FO4)1GHz (15 FO4)Clock Rate

2010Pipeline Stages

180nm180nmTechnology

Pentium IVPentium III

Why do Special-Purpose Processors Perform Well?

Lots (100s) of ALUs Fed by dedicated wires/memories

Care and Feeding of ALUs

Instr.CacheIP

Instruction Bandwidth

DataBandwidth

‘Feeding’ Structure Dwarfs ALU

Stream Programs make communication explicit

• This reduces energy and delay

Energy is a matter of distance (interconnect)

1.1nJ 130pJExecute a uP instruction (SB-1)

10pJ 0.6pJ32b Register Read

1.9nJ 1.9nJTransfer 32b off chip (200M HSTL)1.3nJ 400pJTransfer 32b off chip (2.5G CML)

100pJ 17pJTransfer 32b across chip (10mm)50pJ 3pJRead 32b from 8KB RAM

5pJ 0.3pJ32b ALU Operation

Energy(0.13um) (0.05um)

Operation

300: 20:1 off-chip to global to local ratio in 20021300: 56:1 in 2010

Interconnect dominates delay

2800ps 4600psTransfer 32b across chip (20mm)

325ps 125ps32b Register Read

1400ps 2300psTransfer 32b across chip (10mm)780ps 300psRead 32b from 8KB RAM

650ps 250ps32b ALU Operation

Delay(0.13um) (0.05um)

Operation

2:1 global on-chip comm to operation delay9:1 in 2010

What is a Stream Program?

• A program organized as streams of records flowing through kernels

• Example, stereo depth extraction

Stereo Depth Extraction Stream Program

Kernels can be partitioned across chips to exploit task parallelism.

Kernels exploit both instruction (ILP) and data (SIMD) level parallelism.

convolve

Streams expose producer-consumer locality.

Image 1 convolve

Image 0 convolve convolve

Depth Map

The stream model exploits parallelism without the complexity of traditional parallel programming.

Why Organize an Application This Way?

• Expose parallelism at three levels– ILP within kernels– DLP across stream elements– TLP across sub-streams and across kernels– Keeps ‘easy’ parallelism easy

• Expose locality in two ways– Within a kernel – kernel locality– Between kernels – producer-consumer locality– This locality can be exploited independent of spatial or temporal

locality– Put another way, stream programs make communication explicit

Streams expose Kernel Locality missed by Vectors

• Streams– Traverse operations first

• All operations for one record, then next record

• Smaller working set of temporary values

– Store and access whole records as a unit

• Spatial locality of memory references

– e.g., get contiguous record on gather/scatter

• Vectors– Traverse records first

• All records for one operation, then next operation

• Large set of temporary values– Group like-elements of records

into vectors• Read one word of each record at

a time– No locality on gather/scatter

Example – Vertex Transform

input record

result recordintermediate results

Vertex transform with Streams vs. Vectors

• Small set of intermediate results– enable small and fast LRFs

• Large working set of intermediates– VL times larger (e.g. 64x)– Must use a large, slow global

What class of applications can be written as stream programs?

• Media applications (signal, image, video, packet, and graphics processing) are naturally expressed in this style

• Scientific applications can be efficiently cast as stream programs

• Others?– This is an open question

• Hypothesis– Any application with a long run time (large operation count) has a

great deal of parallelism and hence can be cast as a stream program.

Example, Ingress Packet Functions

Finalize RouteClassify Pkt

Route Table1st Address

Route Table2nd Address

Route Table3rd Address

PolicingCalculation

ComputeStatisticsUpdates

F&A F&ACompute

Packet QueueAddress

Extract HeaderFields

Packet Checks

Split into Kernels

Finalize RouteClassify Pkt

Route Table1st Address

Route Table2nd Address

Route Table3rd Address

PolicingCalculation

ComputeStatisticsUpdates

F&A F&ACompute

Packet QueueAddress

Extract HeaderFields

Packet Checks

Software Pipeline The Loop

i-6 i-4 i-2 iArithmeticClusters

MemoryChannel 1

MemoryChannel 2

i-1i-3i-5

Each block represents a kernel ormemory operation on a strip of packets

Is it hard to write stream programs?

• We will let you be the judge of that.• It does constrain how you write a program• Depends on the quality of the programming tools.

What is a Stream Processor?

• A processor that is optimized to execute a stream program

• Features include– Exploit parallelism

• TLP with multiple processors• DLP with multiple clusters within each processor• ILP with multiple ALUs within each cluster

– Exploit locality with a bandwidth hierarchy• Kernel locality within each cluster• Producer-consumer locality within each processor

• Many different possible architectures

The Imagine Stream Processor

Stream Register File NetworkInterface

StreamController

Imagine Stream Processor

HostProcessorA

SDRAMSDRAM SDRAMSDRAM

Streaming Memory SystemM

Arithmetic Clusters

From SRF

To SRF

+ + * * /

Cross Point

Local Register File

A Bandwidth Hierarchy exploits locality and concurrency

2GB/s 32GB/s

r File

ALU Cluster

544GB/s

• VLIW clusters with shared control• 41.2 32-bit floating-point operations per word of memory BW

A Bandwidth Hierarchy exploits kernel and producer-consumer locality

Memory BW Global RF BW Local RF BWDepth Extractor 0.80 GB/s 18.45 GB/s 210.85 GB/s

MPEG Encoder 0.47 GB/s 2.46 GB/s 121.05 GB/s

Polygon Rendering 0.78 GB/s 4.06 GB/s 102.46 GB/s

QR Decomposition 0.46 GB/s 3.67 GB/s 234.57 GB/s

2GB/s 32GB/s

SDRAMSt

ALU ClusterALU Cluster

ALU Cluster

544GB/s

Producer-Consumer Locality in the Depth Extractor

Memory/Global Data SRF/Streams Clusters/Kernels

new partial sums

sharpened row

Convolution(Laplacian)

row of pixels

previous partial sumsConvolution(Gaussian)

new partial sums

blurred row

previous partial sums

filtered row segment

previous partial sums

new partial sumsdepth map row segment

1 : 23 : 317

Bandwidth Demand of Applications

Overview

Die Photos

• 21 M transistors / TI 0.15µm 1.5V CMOS / 16mm x 16mm• 300 MHz TTTT, hope for 400 MHz in lab• Chips arrived 4/1/02, no fooling!

[Khailany et al., ICCD ’02 (submitted)]

Performance demonstrated on signal and image processing

23.925.6

depth mpeg qrd dct convolve fft

16-bit kernels16-bit

applications

floating-pointapplication

floating-pointkernel

Initial studies indicate that it also applies to solving PDEs and ODEs

Architecture of a Streaming Supercomputer

StreamProcessor64 FPUs

64GFLOPS

16 xDRDRAM2GBytes

38GBytes/s

20GBytes/s32+32 pairs

On-Board Network

Node16 Board 2

16 Nodes1K FPUs1TFLOPS32GBytes

Intra-Cabinet Network(passive - wires only)

Board 64

160GBytes/s256+256 pairs

10.5" Teradyne GbX

Cabinet

Inter-Cabinet Network

Cabinet 264 Boards1K Nodes64K FPUs64TFLOPS

2TBytes

E/OO/E

5TBytes/s8K+8K links

Ribbon Fiber

Cabinet 16

Bisection 64TBytes/s

All links 5Gb/s per pair or fiberAll bandwidths are full duplex

Streaming processor

Stream Execution Unit

Stream Register File

AddressGenerators

MemoryControl

LocalDRDRAM(38 GB/s)

NetworkChannels(20 GB/s)

CommFP

Stream Execution Unit

Cluster

Stream Processor

To/From SRF(260 GB/s)

LRF BW = 1.5TB/s

Rough per-node budget

200200Processor chip

4$/M-GUPS (250/node)

15$/GFLOPS (64/node)

976Per-Node Cost

501Power

4950000Cabinet

1883000Board/Backplane

32020Memory chip

50200Router chip

Per NodeCostItem

Preliminary numbers, parts cost only, no I/O included.

Many open problems

• A small sampling• Software

– Program transformation– Program mapping– Bandwidth optimization– Conditionals– Irregular data structures

• Hardware– Alternative stream models– Register organization– Bandwidth hierarchies– Memory organization– Short stream issues– ISA design– Cluster organization– Processor organization

Next Time

• Discuss Imagine paper• Discuss the stream programming model

EE482S Lecture 1 Stream Processor...

Documents

Transcript of EE482S Lecture 1 Stream Processor...

(Web Application Design track) "Liquid Stream Processing across Web Browsers and Web Servers" - Masiar Babazadeh, Andrea Gallidabino and Cesare Pautasso

LIVE STREAM · the level of their engagement. Our live-stream events bring teams from across the globe together for human networking centered around technology, delivering results

Stream Processors - University of California, San Diego¦ DLP across stream elements æ TLP across sub-streams and across kernels æ Keeps ºeasyí parallelism easy ï Expose locality

Imagine Programming System User’s Guidecva.stanford.edu/classes/ee482s/docs/ips_user.pdf · 2.3.1 Shared header file format (*_kc.hpp) ... configure Microsoft Visual C++ for use

Implementation of a Care Stream Model - GTA Rehab Network · Implementation of a Care Stream Model Supporting an Integrated Regional Rehabilitation System Across the Continuum of

Distributed Data Stream Processing and Edge Computing: A ... · Azure, Google, and others. 2.2. Data Streams and Models The deﬁnition of a data stream can vary across do-mains,

Value Stream Mapping · Seeing The Whole (mapping the extended value stream) Process Level Creating Continuous Flow Single Plant Learning to See Multiple Plants Across the Company

The Stream Biome Gradient Concept: factors … et al fresh sci 2015.pdfThe Stream Biome Gradient Concept: factors controlling lotic systems across broad ... Kansas State University,

Making Decisions in Complex Landscapes: Headwater · PDF fileMaking Decisions in Complex Landscapes: Headwater Stream Management Across ... landscape may significantly impede the ...

Hydrologic variation with land use across the contiguous ......Hydrologic variation with land use across the contiguous United States: Geomorphic and ecological consequences for stream

Enhanced mixing across the gyre boundary at the Gulf ... · EARTH, ATMOSPHERIC, AND PLANETARY SCIENCES Enhanced mixing across the gyre boundary at the Gulf Stream front Jacob O. Wenegrata,1,

Similarity of stream width distributions across headwater ...people.tamu.edu/~geoallen/publications/Allen_et_al_2018_NatComm... · in hyporheic zone transmissivity, a fundamental

EE482S Lecture 2 Discussion of 2 Papers …cva.stanford.edu/classes/ee482s/slides/lect03_slides.pdfTitle Equalized 4Gb/s Signalling Author William J. Dally Created Date 4/16/2002 5:56:52

Dokument2 - Masarykova univerzita · A small stream ran at the end of the street. There was an overgrown garden across the stream. She could hear the sound of the tractor and suddenly

Stream Order Stream Order Characteristics First Order Very small Cold, clear, clean Can jump across them Start from springs, groundwater Forested, rocky,

United States Fish and Wildlife Service...Figure 9. Dead tree leaning across stream is likely to fall in the near future and may be cut. Flgure 10. Cut tree repositioned along stream

Stream Bank Stabilization/ Stream Restoration

Building Large-Scale Stream Infrastructures Across Multiple Data Centers with Apache Kafka

EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Byte stream and character stream