EE 273 Lecture 1 1/10/01
Copyright 1998,2000,2001 by W. J. Dally, All rights reserved 1
1
Stream Processors
William J. Dally
Computer Systems Laboratory
Stanford University
[email protected]
2
Stream Processors
• What is Stream Processor Architecture?
  – What problem is being solved
    • Performance scaling, power efficiency, bandwidth bottlenecks
  – What is a stream program?
  – What is a stream processor?
3
What is a Stream Processor?
• A stream program is a computation organized as streams of records operated on by computation kernels.
• A stream processor is optimized to exploit the locality and concurrency of stream programs.
4
Stream Processing is becoming pervasive
Insights, Peter Huber, 01.07.02:
"A new type of computing--'stream computing'--has emerged to handle the back end of sonar, radar, X-ray sources and certain broadband applications such as voice-over Internet and digital TV."
5
Motivation – Why do we need a stream processor?

• Application demand
  – Power
  – Bandwidth
  – Performance scaling
• Emerging media applications demand 10s of GOPS to TOPS with low power
  – Video compression
  – Image understanding
  – Signal processing
  – Graphics
• Scientific applications also need TOPS of performance with reasonable cost
6
Conventional Processors No Longer Scale Performance by 50% each year
[Figure: Perf (ps/Inst) vs. year, 1980–2020, log scale (1e+0 to 1e+7). The historical 52%/year improvement (ps/gate 19%, gates/clock 9%, clocks/inst 18%) has flattened to 19%/year.]
7
Future potential of novel architecture is large (1000 vs 30)
[Figure: Perf (ps/Inst) and its linear trend vs. year, 1980–2020, log scale, comparing trend lines of 52%/year, 74%/year, and 19%/year; the resulting gaps are marked 30:1, 1,000:1, and 30,000:1.]
8
Clock Scaling: Historical and Projected
[Figure: clock speed (MHz) vs. year, 1970–2014, log scale (1 to 10000 MHz), plotting Intel processors. Annotations give the clock period in fanout-of-4 delays, falling from 375 FO4 through 230, 185, 175, 115, 85, 70, 53, 46, 40, 25, 15, and 9 FO4 toward the projected 8–16 FO4 band, across process generations from 10µm down to 0.035µm.]
9
Anant Agarwal (MIT) Panel at HPCA '02
10
Pentium III vs. Pentium IV
                    Pentium III      Pentium IV
Technology          180nm            180nm
Pipeline Stages     10               20
Clock Rate          1GHz (15 FO4)    1.5GHz (10.4 FO4)
SpecInt2000         454              524
SpecInt/MHz         0.45             0.35
Die Size            106mm²           217mm²
Transistor Count    24 million       42 million
L1 D$ Capacity      16KBytes         12KBytes
# Grids             10^8             2x10^8
11
Outline
• Motivation – we need low-power, programmable TeraOPS
• The problem is bandwidth
  – Growing gap between special-purpose and general-purpose hardware
  – It's easy to make ALUs, hard to keep them fed
• A stream processor gives programmable bandwidth
  – Streams expose locality and concurrency in the application
  – A bandwidth hierarchy exploits this
• Imagine is a 20GFLOPS prototype stream processor
• Many opportunities to do better
  – Scaling up
  – Simplifying programming
12
Motivation
• Some things I'd like to do with a few TeraOps
  – Have a realistic face-to-face meeting with someone in Boston without riding an airplane
    • 4–8 cameras, extract depth, fit model, compress, render to several screens
  – High-quality rendering at video rates
    • Ray tracing a 2K x 4K image with 10^5 objects at 60 frames/s
13
The good news – FLOPS are cheap, OPS are cheaper

• 32-bit FPU – 2GFLOPS/mm² – 400GFLOPS/chip
• 16-bit add – 40GOPS/mm² – 8TOPS/chip

[Figure: layout of an integer adder with its local RF, 460 µm × 146.7 µm.]
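As a sanity check, the per-chip numbers follow from the per-area densities. The ~200 mm² of usable die area below is my assumption; the slide states only the densities and the totals.

```python
# Back-of-envelope check of the slide's per-chip figures, assuming a
# hypothetical die with ~200 mm^2 of usable compute area (not stated on the slide).
fpu_density_gflops_per_mm2 = 2      # 32-bit FPU density from the slide
adder_density_gops_per_mm2 = 40     # 16-bit adder density from the slide

die_area_mm2 = 200                  # assumed usable area

fp_per_chip = fpu_density_gflops_per_mm2 * die_area_mm2    # GFLOPS per chip
int_per_chip = adder_density_gops_per_mm2 * die_area_mm2   # GOPS per chip

print(fp_per_chip)   # 400 GFLOPS, matching the slide
print(int_per_chip)  # 8000 GOPS = 8 TOPS, matching the slide
```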
14
The bad news – general-purpose processors can't harness this

[Figure: FLOPS vs. year, 2001–2011, log scale (1e+8 to 1e+15), comparing available FLOPS with general-purpose peak (GP-Peak) and general-purpose useful (GP-Useful) performance.]
15
Why do Special-Purpose Processors Perform Well?
• Lots (100s) of ALUs
• Fed by dedicated wires/memories
16
Care and Feeding of ALUs
[Figure: a conventional processor datapath. An instruction cache, IR, and IP supply instruction bandwidth; a register file supplies data bandwidth. The 'feeding' structure dwarfs the ALU itself.]
17
Stream Programs make communication explicit
• This reduces energy and delay
• Energy is a matter of distance (interconnect)

Operation                              Energy (0.13µm)   Energy (0.05µm)
32b ALU Operation                      5pJ               0.3pJ
Read 32b from 8KB RAM                  50pJ              3pJ
32b Register Read                      10pJ              0.6pJ
Transfer 32b across chip (10mm)        100pJ             17pJ
Execute a µP instruction (SB-1)        1.1nJ             130pJ
Transfer 32b off chip (2.5G CML)       1.3nJ             400pJ
Transfer 32b off chip (200M HSTL)      1.9nJ             1.9nJ

Off-chip : global : local energy ratio is 300 : 20 : 1 in 2002, and 1300 : 56 : 1 in 2010.
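The quoted ratios can be recomputed from the table, normalizing to a 32b ALU operation. The ~1.5nJ off-chip figure used for 2002 is my assumption, sitting between the two off-chip rows.

```python
# Energy ratios (off-chip : global wire : local ALU op), from the table.
# 2002 (0.13um column): off-chip ~1.5nJ (assumed, between the two off-chip
# rows), 10mm global wire 100pJ, 32b ALU op 5pJ.
ratio_2002 = (round(1.5e-9 / 5e-12), round(100e-12 / 5e-12), 1)

# 2010 (0.05um column): off-chip (2.5G CML) 400pJ, 10mm wire 17pJ, ALU op 0.3pJ.
ratio_2010 = (round(400e-12 / 0.3e-12), round(17e-12 / 0.3e-12), 1)

print(ratio_2002)  # (300, 20, 1), matching the slide
print(ratio_2010)  # (1333, 57, 1); the slide rounds this to 1300 : 56 : 1
```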
18
Interconnect dominates delay
Operation                              Delay (0.13µm)   Delay (0.05µm)
32b ALU Operation                      650ps            250ps
Read 32b from 8KB RAM                  780ps            300ps
32b Register Read                      325ps            125ps
Transfer 32b across chip (10mm)        1400ps           2300ps
Transfer 32b across chip (20mm)        2800ps           4600ps

Global on-chip communication to operation delay is 2:1 today, 9:1 in 2010.
19
What is a Stream Program?
• A program organized as streams of records flowing through kernels
• Example: stereo depth extraction
20
Stereo Depth Extraction Stream Program
[Diagram: Image 0 and Image 1 each flow through two convolve kernels; the filtered streams feed a SAD (sum of absolute differences) kernel, which produces the Depth Map.]
Kernels exploit both instruction (ILP) and data (SIMD) level parallelism.
Streams expose producer-consumer locality.
Kernels can be partitioned across chips to exploit task parallelism.
The stream model exploits parallelism without the complexity of traditional parallel programming.
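The pipeline above can be sketched as kernels consuming and producing streams. This is illustrative Python, not Imagine's actual StreamC/KernelC, and the kernel arithmetic is a placeholder rather than real Gaussian/Laplacian filters.

```python
# Illustrative stream-program structure for the depth extractor: kernels are
# functions over streams (here, Python generators), so producer-consumer
# locality is explicit -- each record flows kernel-to-kernel without a
# round trip to "memory". The kernel math is a stand-in, not the real filters.
def convolve(stream, kernel_weights):
    for record in stream:                      # all operations for one record, then the next
        yield sum(w * record for w in kernel_weights)

def sad(stream0, stream1):
    for a, b in zip(stream0, stream1):         # consume two streams in lockstep
        yield abs(a - b)

image0, image1 = [1, 2, 3, 4], [2, 2, 4, 4]    # placeholder pixel rows
gauss = [0.25, 0.5, 0.25]

# Image 0 and Image 1 each pass through two convolutions, then SAD.
left  = convolve(convolve(image0, gauss), gauss)
right = convolve(convolve(image1, gauss), gauss)
depth_map = list(sad(left, right))
print(depth_map)  # [1.0, 0.0, 1.0, 0.0]
```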
21
Why Organize an Application This Way?
• Expose parallelism at three levels
  – ILP within kernels
  – DLP across stream elements
  – TLP across sub-streams and across kernels
  – Keeps 'easy' parallelism easy
• Expose locality in two ways
  – Within a kernel – kernel locality
  – Between kernels – producer-consumer locality
  – This locality can be exploited independent of spatial or temporal locality
  – Put another way, stream programs make communication explicit
22
Streams expose Kernel Locality missed by Vectors
• Streams
  – Traverse operations first
    • All operations for one record, then next record
    • Smaller working set of temporary values
  – Store and access whole records as a unit
    • Spatial locality of memory references
      – e.g., get contiguous record on gather/scatter
• Vectors
  – Traverse records first
    • All records for one operation, then next operation
    • Large set of temporary values
  – Group like-elements of records into vectors
    • Read one word of each record at a time
  – No locality on gather/scatter
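The traversal-order difference can be made concrete. This is a plain Python sketch; the records and the operations applied to them are invented for illustration.

```python
# Stream order vs. vector order over the same records. The stream version
# keeps only one record's temporaries live at a time; the vector version
# materializes each intermediate for every record at once (VL times more state).
records = [(1, 2), (3, 4), (5, 6)]   # whole records, stored contiguously

# Streams: all operations for one record, then the next record.
stream_out = []
for x, y in records:                 # fetch the whole record as a unit
    t = x + y                        # temporary lives only for this record
    stream_out.append(t * 2)

# Vectors: all records for one operation, then the next operation.
xs = [r[0] for r in records]         # like-elements grouped into vectors
ys = [r[1] for r in records]
ts = [x + y for x, y in zip(xs, ys)] # VL temporaries live simultaneously
vector_out = [t * 2 for t in ts]

assert stream_out == vector_out      # same result, different working sets
print(stream_out)                    # [6, 14, 22]
```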
23
Example – Vertex Transform

x' = t00·x + t01·y + t02·z + t03·w
y' = t10·x + t11·y + t12·z + t13·w
z' = t20·x + t21·y + t22·z + t23·w
w' = t30·x + t31·y + t32·z + t33·w

(x, y, z, w) is the input record; the 16 products t_ij·component are the intermediate results; (x', y', z', w') = T × (x, y, z, w) is the result record.
24
Vertex transform with Streams vs. Vectors
• Streams: small set of intermediate results
  – enables small and fast LRFs
• Vectors: large working set of intermediates
  – VL times larger (e.g., 64×)
  – must use a large, slow global VRF
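For the vertex transform the working-set counts are easy to tabulate. This sketch uses VL = 64 following the slide's example.

```python
# Intermediate working set for a 4x4 vertex transform: the stream version
# holds the 16 products t_ij * component for a single vertex at a time; the
# vector version holds them for all VL vertices in flight at once.
intermediates_per_vertex = 4 * 4     # t00*x ... t33*w

VL = 64                              # example vector length from the slide
stream_working_set = intermediates_per_vertex          # 16 values -> small, fast LRF
vector_working_set = intermediates_per_vertex * VL     # 1024 values -> large, slow VRF

print(stream_working_set, vector_working_set,
      vector_working_set // stream_working_set)        # 16 1024 64
```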
25
What class of applications can be written as stream programs?
• Media applications (signal, image, video, packet, and graphics processing) are naturally expressed in this style
• Scientific applications can be efficiently cast as stream programs
• Others?
  – This is an open question
• Hypothesis
  – Any application with a long run time (large operation count) has a great deal of parallelism and hence can be cast as a stream program.
26
What is a Stream Processor?
• A processor that is optimized to execute a stream program
• Features include
  – Exploit parallelism
    • TLP with multiple processors
    • DLP with multiple clusters within each processor
    • ILP with multiple ALUs within each cluster
  – Exploit locality with a bandwidth hierarchy
    • Kernel locality within each cluster
    • Producer-consumer locality within each processor
• Many different possible architectures
27
The Imagine Stream Processor
[Diagram: the Imagine Stream Processor. A host processor issues operations to the stream controller; a microcontroller sequences eight ALU clusters (0–7), which read and write the central Stream Register File. The SRF also connects to a network interface and to a streaming memory system backed by four SDRAMs.]
28
Arithmetic Clusters
[Diagram: one arithmetic cluster. Three adders, two multipliers, and a divide/square-root unit each draw operands from local register files through a cross-point switch; data arrives from and returns to the SRF, and a communication unit (CU) connects to the intercluster network.]
29
A Bandwidth Hierarchy exploits locality and concurrency
• VLIW clusters with shared control
• 41.2 32-bit floating-point operations per word of memory bandwidth

[Diagram: the bandwidth hierarchy. Four SDRAMs supply 2GB/s to the Stream Register File, which supplies 32GB/s to the ALU clusters, which have 544GB/s of local bandwidth.]
30
A Bandwidth Hierarchy exploits kernel and producer-consumer locality
                    Memory BW    Global RF BW   Local RF BW
Depth Extractor     0.80 GB/s    18.45 GB/s     210.85 GB/s
MPEG Encoder        0.47 GB/s    2.46 GB/s      121.05 GB/s
Polygon Rendering   0.78 GB/s    4.06 GB/s      102.46 GB/s
QR Decomposition    0.46 GB/s    3.67 GB/s      234.57 GB/s

[Diagram: the same hierarchy as before: SDRAM at 2GB/s, Stream Register File at 32GB/s, ALU clusters at 544GB/s.]
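Dividing each row of the table by its memory bandwidth shows how much of each application's traffic the hierarchy keeps on-chip (values in GB/s, copied from the table).

```python
# Memory : Global RF : Local RF bandwidth ratios per application, from the table.
apps = {
    "Depth Extractor":   (0.80, 18.45, 210.85),
    "MPEG Encoder":      (0.47,  2.46, 121.05),
    "Polygon Rendering": (0.78,  4.06, 102.46),
    "QR Decomposition":  (0.46,  3.67, 234.57),
}
ratios = {name: (1, round(srf / mem), round(lrf / mem))
          for name, (mem, srf, lrf) in apps.items()}
for name, (m, g, l) in ratios.items():
    print(f"{name}: {m} : {g} : {l}")
```

For the depth extractor this gives roughly 1 : 23 : 264 — two orders of magnitude of bandwidth served from local registers rather than memory.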
31
Stream Controller
• 32 stream instructions that can be written to the stream controller by the host processor
• Acts as a scoreboard
  – Keeps track of dependencies using a bit field for each stream instruction
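A minimal sketch of that scoreboard idea in Python. The slide specifies only 32 instruction slots and a dependency bit field per slot; the class and method names and the exact issue rule here are my own assumptions.

```python
# Toy scoreboard: 32 slots, each stream instruction carries a bitmask naming
# the slots it depends on. An instruction may issue once every slot it
# depends on has completed.
NUM_SLOTS = 32

class StreamController:
    def __init__(self):
        self.completed = 0     # bit i set => slot i has completed
        self.pending = {}      # slot -> dependency bitmask

    def write(self, slot, deps_mask):
        """Host processor writes a stream instruction into a slot."""
        assert 0 <= slot < NUM_SLOTS
        self.pending[slot] = deps_mask

    def ready(self):
        """Slots whose dependencies have all completed."""
        return [s for s, m in self.pending.items()
                if (m & ~self.completed) == 0]

    def complete(self, slot):
        """Mark a slot finished, possibly unblocking dependents."""
        self.pending.pop(slot, None)
        self.completed |= 1 << slot

sc = StreamController()
sc.write(0, 0)           # no dependencies
sc.write(1, 1 << 0)      # depends on slot 0
print(sc.ready())        # [0]
sc.complete(0)
print(sc.ready())        # [1]
```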
32
Pop Quiz
• Can you have different data types in a stream?
• How is memory access scheduling done?
  – Do streams need to be contiguous in memory?
• In the convolve example, how is shared data of the partials managed?
33
Producer-Consumer Locality in the Depth Extractor
[Diagram: data movement in the depth extractor. Rows of pixels and the final depth-map row segments live in memory/global data; rows, previous and new partial sums, and filtered row segments circulate as streams in the SRF; the Convolution (Gaussian), Convolution (Laplacian), and SAD kernels run in the clusters. Traffic ratio memory : SRF : clusters is 1 : 23 : 317.]
34
Imagine gives high performance with low power and flexible programming
• Matches capabilities of communication-limited technology to demands of signal and image processing applications
• Performance
  – compound stream operations realize >10GOPS on key applications
  – can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)
• Power
  – three-level register hierarchy gives 2–10GOPS/W
• Flexibility
  – programmed in "C"
  – streaming model
  – conditional stream operations enable applications like sort
35
Bandwidth Demand of Applications
36
Die Photos
• 21 M transistors / TI 0.15µm 1.5V CMOS / 16mm x 16mm
• 300 MHz TTTT, hope for 400 MHz in lab
• Chips arrived 4/1/02, no fooling!

[Khailany et al., ICCD '02 (submitted)]
37
Performance demonstrated on signal and image processing
[Chart: demonstrated GOPS by benchmark: depth 12.1, mpeg 19.8, qrd 11.0, dct 23.9, convolve 25.6, fft 7.0. The bars span 16-bit applications (depth, mpeg), 16-bit kernels (dct, convolve), a floating-point application (qrd), and a floating-point kernel (fft).]
38
Architecture of a Streaming Supercomputer
[Diagram: the system is built up hierarchically.
Node: a Stream Processor (64 FPUs, 64GFLOPS) with 16 DRDRAM chips (2GBytes) at 38GBytes/s, and a 20GBytes/s network port (32+32 pairs).
Board: 16 nodes (1K FPUs, 1TFLOPS, 32GBytes) on an on-board network, with 160GBytes/s (256+256 pairs) into the intra-cabinet network (passive, wires only; 10.5" Teradyne GbX backplane).
Cabinet: 64 boards (1K nodes, 64K FPUs, 64TFLOPS, 2TBytes), with 5TBytes/s (8K+8K links) over ribbon fiber (E/O and O/E conversion) into the inter-cabinet network.
System: 16 cabinets; bisection bandwidth 64TBytes/s.
All links run 5Gb/s per pair or fiber; all bandwidths are full duplex.]
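The per-level totals in the diagram are self-consistent, which is easy to verify using the slide's binary prefixes (1K = 1024):

```python
# Aggregate figures for the streaming supercomputer hierarchy, from the diagram.
fpus_per_node, gflops_per_node, gbytes_per_node = 64, 64, 2
nodes_per_board, boards_per_cabinet, cabinets = 16, 64, 16

board_fpus   = nodes_per_board * fpus_per_node            # 1024 -> "1K FPUs"
board_tflops = nodes_per_board * gflops_per_node / 1024   # 1 TFLOPS per board
cab_nodes    = boards_per_cabinet * nodes_per_board       # 1024 -> "1K nodes"
cab_tflops   = cab_nodes * gflops_per_node / 1024         # 64 TFLOPS per cabinet
cab_tbytes   = cab_nodes * gbytes_per_node / 1024         # 2 TBytes per cabinet

print(board_fpus, board_tflops, cab_nodes, cab_tflops, cab_tbytes)
```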
39
Streaming processor
[Diagram: one node's streaming processor. A scalar processor, stream execution unit, and stream register file share the chip with address generators, memory control, and on-chip memory; local DRDRAM provides 38 GB/s and the network interface provides 20 GB/s of network channels. The stream execution unit contains 16 clusters (0–15) connected to the SRF at 260 GB/s; each cluster holds two FP MUL units, two FP ADD units, an FP DSQ (divide/square-root) unit, a comm unit, and a scratchpad, each fed from local registers through a cluster switch. Aggregate LRF bandwidth is 1.5TB/s.]
40
Rough per-node budget
Item               Cost     Per Node
Processor chip     200      200
Router chip        200      50
Memory chip        20       320
Board/Backplane    3000     188
Cabinet            50000    49
Power              1        50
Per-Node Cost               976
$/GFLOPS (64 GFLOPS/node)   15
$/M-GUPS (250 M-GUPS/node)  4

Preliminary numbers, parts cost only, no I/O included.
41
Many open problems
• A small sampling
• Software
  – Program transformation
  – Program mapping
  – Bandwidth optimization
  – Conditionals
  – Irregular data structures
• Hardware
  – Alternative stream models
  – Register organization
  – Bandwidth hierarchies
  – Memory organization
  – Short stream issues
  – ISA design
  – Cluster organization
  – Processor organization
42
Example: Ingress Packet Functions

[Diagram: ingress packet pipeline. Extract header fields and perform packet checks; classify the packet; perform three route-table address calculations, each followed by a table read; finalize the route; run the policing calculation; compute statistics updates (with two fetch-and-add operations); compute the packet-queue address and write the packet to the queue.]
43
Split into Kernels
[Diagram: the same ingress pipeline partitioned into kernels, with the memory operations (route-table reads, fetch-and-adds, and the packet-queue write) separating successive computation kernels.]
44
Software Pipeline The Loop
[Diagram: software-pipelined schedule. In each iteration the arithmetic clusters run kernels on strips i, i-2, i-4, i-6 while memory channels 1 and 2 handle strips i-1, i-3, i-5, i-7. Each block represents a kernel or memory operation on a strip of packets.]
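The overlap amounts to double buffering: the memory channels load and store neighboring strips while the clusters compute on the strip loaded previously. A schematic Python sketch with placeholder strip data and a placeholder kernel:

```python
# Software-pipelined strip processing: in each steady-state iteration the
# "memory channels" handle loads/stores for other strips while the
# "arithmetic clusters" run the kernel on the strip loaded last iteration.
strips = [[1, 2], [3, 4], [5, 6]]          # strips of packets (placeholder data)
kernel = lambda strip: [x * 10 for x in strip]   # placeholder kernel

loaded, results = {}, {}
for i in range(len(strips) + 1):
    # Memory channel 1: load strip i (overlaps with compute on strip i-1).
    if i < len(strips):
        loaded[i] = strips[i]
    # Arithmetic clusters: run the kernel on the previously loaded strip.
    if i - 1 in loaded:
        results[i - 1] = kernel(loaded.pop(i - 1))
    # Memory channel 2 would store results[i-2] back to memory here.

print([results[i] for i in range(len(strips))])  # [[10, 20], [30, 40], [50, 60]]
```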