Post on 16-Apr-2018
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 1
EE482SLecture 1
Stream Processor Architecture
April 4, 2002
William J. DallyComputer Systems Laboratory
Stanford Universitybilld@csl.stanford.edu
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 2
Today’s Class Meeting
• What is EE482C?– Material covered– Course format– Assignments– Scribing
• What is Stream Processor Architecture?– What problem is being solved
• Performance scaling, power efficiency, bandwidth bottlenecks– What is a stream program?– What is a stream processor
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 3
What is EE482C?
• New course in EE482 sequence (advanced comp. arch.)– EE482A – superscalar architecture– EE482B – interconnection networks– EE482C – stream processor architecture
• Course format– Readings – typically one or two papers per class meeting
• Read the paper before the meeting for which it is listed• E.g., read Rixner et al. before 4/9/02
– Mix of lecture and discussion in class meetings• Be prepared to discuss each reading
– Two programming assignments• One in StreamC/KernelC, one in Brook
– Class project • Original research on stream architecture or programming
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 4
Where to find more information
• Course policy sheet• Web page http://cva.stanford.edu/ee482s• Class schedule
– Subject to change
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 5
What is a Stream Processor?
• A stream program is a computation organized as streams of records operated on by computation kernels.
• A stream processor is optimized to exploit the locality and concurrency of stream programs
• More later
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 6
Stream Processing is becoming pervasive
InsightsPeter Huber, 01.07.02A new type of computing--"stream computing"--has emerged to handle the back end of sonar, radar, X-ray sources and certain broadband applications such as voice-over Internet and digital TV.
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 7
Motivation – Why do we need a stream processor?
• Application demand• Power• Bandwidth• Performance Scaling
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 8
Application Pull
• Emerging media applications demand 10s of GOPS to TOPs with low power.– Video compression– Image understanding– Signal processing– Graphics
• Scientific applications also need TOPs of performance with reasonable cost
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 9
Conventional Processors No Longer Scale Performance by 50% each year
1e+0
1e+1
1e+2
1e+3
1e+4
1e+5
1e+6
1e+7
1980 1990 2000 2010 2020
Perf (ps/Inst)52%/year
19%/year
ps/gate 19%Gates/clock 9%
Clocks/inst 18%
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 10
Future potential of novel architecture is large (1000 vs 30)
1e-41e-31e-21e-11e+01e+11e+21e+31e+41e+51e+61e+7
1980 1990 2000 2010 2020
Perf (ps/Inst)Linear (ps/Inst)
52%/year
74%/year
19%/year30:1
1,000:1
30,000:1
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 11
Clock Scaling: Historical and Projected
0
1
10
100
1000
10000
1970
1972
1974
1976
1978
1980
1982
1984
1986
1988
1990
1992
1994
1996
1998
2000
2002
2004
2006
2008
2010
2012
2014
Clo
ck S
peed
(MH
z)
8FO416FO4Intel Processors
375 FO4
70 FO4
85 FO4115 FO4
230 FO4185 FO4
175 FO4
53 FO4
40 FO446 FO4
25 FO4
9 FO4
15 FO4
10µm 3µm 1µm1.5µm 0.8µm 0.35µm 0.18µm0.13µm 0.07µm
0.1µm 0.05µm0.035µm
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 12
Anant Agarwal (MIT) Panel at HPCA ‘02
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 13
Pentium III vs. Pentium IV
2x108108# Grids
12KBytes16KBytesL1 D$ Capacity
42 million24 millionTransistor Count
217mm2106mm2Die Size
0.350.45SpecInt/MHz
524454SpecInt2000
1.5GHz (10.4 FO4)1GHz (15 FO4)Clock Rate
2010Pipeline Stages
180nm180nmTechnology
Pentium IVPentium III
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 14
Why do Special-Purpose Processors Perform Well?
Lots (100s) of ALUs Fed by dedicated wires/memories
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 15
Care and Feeding of ALUs
Instr.CacheIP
Instruction Bandwidth
DataBandwidth
IR
Regs
‘Feeding’ Structure Dwarfs ALU
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 16
Stream Programs make communication explicit
• This reduces energy and delay
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 17
Energy is a matter of distance (interconnect)
1.1nJ 130pJExecute a uP instruction (SB-1)
10pJ 0.6pJ32b Register Read
1.9nJ 1.9nJTransfer 32b off chip (200M HSTL)1.3nJ 400pJTransfer 32b off chip (2.5G CML)
100pJ 17pJTransfer 32b across chip (10mm)50pJ 3pJRead 32b from 8KB RAM
5pJ 0.3pJ32b ALU Operation
Energy(0.13um) (0.05um)
Operation
300: 20:1 off-chip to global to local ratio in 20021300: 56:1 in 2010
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 18
Interconnect dominates delay
2800ps 4600psTransfer 32b across chip (20mm)
325ps 125ps32b Register Read
1400ps 2300psTransfer 32b across chip (10mm)780ps 300psRead 32b from 8KB RAM
650ps 250ps32b ALU Operation
Delay(0.13um) (0.05um)
Operation
2:1 global on-chip comm to operation delay9:1 in 2010
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 19
What is a Stream Program?
• A program organized as streams of records flowing through kernels
• Example, stereo depth extraction
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 20
Stereo Depth Extraction Stream Program
Kernels can be partitioned across chips to exploit task parallelism.
Kernels exploit both instruction (ILP) and data (SIMD) level parallelism.
SAD
convolve
Streams expose producer-consumer locality.
Image 1 convolve
Image 0 convolve convolve
Depth Map
The stream model exploits parallelism without the complexity of traditional parallel programming.
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 21
Why Organize an Application This Way?
• Expose parallelism at three levels– ILP within kernels– DLP across stream elements– TLP across sub-streams and across kernels– Keeps ‘easy’ parallelism easy
• Expose locality in two ways– Within a kernel – kernel locality– Between kernels – producer-consumer locality– This locality can be exploited independent of spatial or temporal
locality– Put another way, stream programs make communication explicit
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 22
Streams expose Kernel Locality missed by Vectors
• Streams– Traverse operations first
• All operations for one record, then next record
• Smaller working set of temporary values
– Store and access whole records as a unit
• Spatial locality of memory references
– e.g., get contiguous record on gather/scatter
• Vectors– Traverse records first
• All records for one operation, then next operation
• Large set of temporary values– Group like-elements of records
into vectors• Read one word of each record at
a time– No locality on gather/scatter
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 23
Example – Vertex Transform
×=
wzyx
T
wzyx
''''
input record
result recordintermediate results
x
y
z
w
t00x
t10x
t20x
t30x
t01y
t11y
t21y
t31y
t02z
t12z
t22z
t32z
t03w
t13w
t23w
t33w
x’
y’
z’
w’
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 24
Vertex transform with Streams vs. Vectors
• Small set of intermediate results– enable small and fast LRFs
• Large working set of intermediates– VL times larger (e.g. 64x)– Must use a large, slow global
VRF
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 25
What class of applications can be written as stream programs?
• Media applications (signal, image, video, packet, and graphics processing) are naturally expressed in this style
• Scientific applications can be efficiently cast as stream programs
• Others?– This is an open question
• Hypothesis– Any application with a long run time (large operation count) has a
great deal of parallelism and hence can be cast as a stream program.
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 26
Example, Ingress Packet Functions
Finalize RouteClassify Pkt
Route Table1st Address
Calc
Rd
Route Table2nd Address
Calc
Rd
Route Table3rd Address
Calc
Rd
PolicingCalculation
ComputeStatisticsUpdates
F&A F&ACompute
Packet QueueAddress
Wr
Extract HeaderFields
Packet Checks
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 27
Split into Kernels
Finalize RouteClassify Pkt
Route Table1st Address
Calc
Rd
Route Table2nd Address
Calc
Rd
Route Table3rd Address
Calc
Rd
PolicingCalculation
ComputeStatisticsUpdates
F&A F&ACompute
Packet QueueAddress
Wr
Extract HeaderFields
Packet Checks
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 28
Software Pipeline The Loop
i-6 i-4 i-2 iArithmeticClusters
MemoryChannel 1
MemoryChannel 2
i-1i-3i-5
i-7
i-7
Each block represents a kernel ormemory operation on a strip of packets
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 29
Is it hard to write stream programs?
• We will let you be the judge of that.• It does constrain how you write a program• Depends on the quality of the programming tools.
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 30
What is a Stream Processor?
• A processor that is optimized to execute a stream program
• Features include– Exploit parallelism
• TLP with multiple processors• DLP with multiple clusters within each processor• ILP with multiple ALUs within each cluster
– Exploit locality with a bandwidth hierarchy• Kernel locality within each cluster• Producer-consumer locality within each processor
• Many different possible architectures
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 31
The Imagine Stream Processor
Stream Register File NetworkInterface
StreamController
Imagine Stream Processor
HostProcessorA
LU
Clu
ster
0
AL
U C
lust
er 1
AL
U C
lust
er 2
AL
U C
lust
er 3
AL
U C
lust
er 4
AL
U C
lust
er 5
AL
U C
lust
er 6
AL
U C
lust
er 7
SDRAMSDRAM SDRAMSDRAM
Streaming Memory SystemM
icro
cont
rolle
r
Net
wor
k
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 32
Arithmetic Clusters
CU
Inte
rclu
ster
Net
wor
k
+
From SRF
To SRF
+ + * * /
Cross Point
Local Register File
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 33
A Bandwidth Hierarchy exploits locality and concurrency
2GB/s 32GB/s
SDRAM
SDRAM
SDRAM
SDRAM
Stre
am
Reg
iste
r File
ALU Cluster
ALU Cluster
ALU Cluster
544GB/s
• VLIW clusters with shared control• 41.2 32-bit floating-point operations per word of memory BW
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 34
A Bandwidth Hierarchy exploits kernel and producer-consumer locality
Memory BW Global RF BW Local RF BWDepth Extractor 0.80 GB/s 18.45 GB/s 210.85 GB/s
MPEG Encoder 0.47 GB/s 2.46 GB/s 121.05 GB/s
Polygon Rendering 0.78 GB/s 4.06 GB/s 102.46 GB/s
QR Decomposition 0.46 GB/s 3.67 GB/s 234.57 GB/s
2GB/s 32GB/s
SDRAM
SDRAM
SDRAM
SDRAMSt
ream
R
egis
ter F
ile
ALU ClusterALU Cluster
ALU Cluster
544GB/s
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 35
Producer-Consumer Locality in the Depth Extractor
Memory/Global Data SRF/Streams Clusters/Kernels
new partial sums
sharpened row
Convolution(Laplacian)
row of pixels
previous partial sumsConvolution(Gaussian)
new partial sums
blurred row
previous partial sums
filtered row segment
filtered row segment
previous partial sums
new partial sumsdepth map row segment
SAD
1 : 23 : 317
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 36
Bandwidth Demand of Applications
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 37
Overview
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 38
Die Photos
• 21 M transistors / TI 0.15µm 1.5V CMOS / 16mm x 16mm• 300 MHz TTTT, hope for 400 MHz in lab• Chips arrived 4/1/02, no fooling!
[Khailany et al., ICCD ’02 (submitted)]
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 39
Performance demonstrated on signal and image processing
12.1
19.8
11.0
23.925.6
7.0
0
5
10
15
20
25
30
GO
PS
depth mpeg qrd dct convolve fft
16-bit kernels16-bit
applications
floating-pointapplication
floating-pointkernel
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 40
Initial studies indicate that it also applies to solving PDEs and ODEs
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 41
Architecture of a Streaming Supercomputer
StreamProcessor64 FPUs
64GFLOPS
16 xDRDRAM2GBytes
38GBytes/s
20GBytes/s32+32 pairs
Node
On-Board Network
Node2
Node16 Board 2
16 Nodes1K FPUs1TFLOPS32GBytes
Intra-Cabinet Network(passive - wires only)
Board 64
160GBytes/s256+256 pairs
10.5" Teradyne GbX
Board
Cabinet
Inter-Cabinet Network
Cabinet 264 Boards1K Nodes64K FPUs64TFLOPS
2TBytes
E/OO/E
5TBytes/s8K+8K links
Ribbon Fiber
Cabinet 16
Bisection 64TBytes/s
All links 5Gb/s per pair or fiberAll bandwidths are full duplex
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 42
Streaming processor
Stream Execution Unit
Stream Register File
Scal
arPr
oces
sor
AddressGenerators
MemoryControl
On-
Chi
pM
emor
y
LocalDRDRAM(38 GB/s)
Net
wor
kIn
terfa
ce
NetworkChannels(20 GB/s)
Clu
ster
0
Clu
ster
1
Clu
ster
15
CommFP
MUL
FPMUL
FPADD
FPADD
FPDSQ
Regs
Regs
Regs
Regs
Clu
ster
Sw
itch
Regs
Regs
Regs
Regs
Regs
Regs
Regs
Regs
Scra
tch
Pad
Regs
Stream Execution Unit
Cluster
Stream Processor
To/From SRF(260 GB/s)
LRF BW = 1.5TB/s
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 43
Rough per-node budget
200200Processor chip
4$/M-GUPS (250/node)
15$/GFLOPS (64/node)
976Per-Node Cost
501Power
4950000Cabinet
1883000Board/Backplane
32020Memory chip
50200Router chip
Per NodeCostItem
Preliminary numbers, parts cost only, no I/O included.
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 44
Many open problems
• A small sampling• Software
– Program transformation– Program mapping– Bandwidth optimization– Conditionals– Irregular data structures
• Hardware– Alternative stream models– Register organization– Bandwidth hierarchies– Memory organization– Short stream issues– ISA design– Cluster organization– Processor organization
Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 45
Next Time
• Discuss Imagine paper• Discuss the stream programming model