The Raw Tiled Processor Architecture Is the future of ...
![Page 1: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/1.jpg)
The Raw Tiled Processor Architecture
Is the future of architecture in tiles?
Anant Agarwal, MIT
http://www.cag.csail.mit.edu/raw
![Page 2: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/2.jpg)
A tiled processor architecture prototype: the Raw microprocessor
October 02
![Page 3: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/3.jpg)
Embedded system: 1020 Element Microphone Array
![Page 4: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/4.jpg)
1020 Node Beamformer Demo
o 2 people moving about and talking
o Track movement of people using camera (vision group) and display on monitor
o Beamformer focuses on speech of one person
o Select another person using mouse
o Beamformer switches focus to the speech of the other person
![Page 5: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/5.jpg)
The opportunity
20 MIPS CPU, 1987
![Page 6: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/6.jpg)
2007
The billion transistor chip
The opportunity
![Page 7: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/7.jpg)
Enables seeking out new application domains
o Redefine our notion of a “general purpose processor”
o Imagine a single-chip handheld that is a speech-driven cellphone, camera, PDA, MP3 player, video engine
o Imagine a single-chip PC that is also a 10G router, wireless access point, graphics engine
o While running the gamut of existing desktop binaries
![Page 8: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/8.jpg)
But, where is general purpose computing today?
Other…
Encryption
Sound
Ethernet
Wireless
Graphics 0.25 TFLOPS
x86 Pentium IV
ASICs
![Page 9: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/9.jpg)
How does the ASIC do it?
o Lots of ALUs, lots of registers, lots of small memories
o Hand-routed, short wires
o Lower power (everything close by)
o Stream data model for high throughput
But, not general purpose
[Figure: ASIC datapath with several small memories (mem)]
![Page 10: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/10.jpg)
Our challenge
How to exploit ASIC-like features
o Lots of resources like ALUs and memories
o Application-specific routing of short wires
While being “general purpose”
o Programmable
o And even running ILP-based sequential programs
One Approach: Tiled Processor Architecture (TPA)
![Page 11: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/11.jpg)
Tiled Processor Architecture (TPA)
[Figure: an array of tiles. Each tile holds IMEM, DMEM, PC, REG, FPU, and ALU, plus a SWITCH with its own SMEM and PC]
o Lots of ALUs and regs
o Short, prog. wires
o Lower power
o Programmable. Supports ILP and Streams
![Page 12: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/12.jpg)
A Prototype TPA: The Raw Microprocessor
The Raw Chip
[Figure: the Raw chip as an array of tiles, with connections labeled Disk stream, Video1, RDRAM, and Packet stream]
A Raw Tile
[Figure: a tile holds IMEM, DMEM, PC, REG, FPU, and ALU. The Raw Switch has its own PC and SMEM]
Software-scheduled interconnects (can use static or dynamic routing, but compiler determines instruction placement and routes)
[Billion transistor IEEE Computer Issue ’97]
![Page 13: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/13.jpg)
Tight integration of interconnect
[Figure: the Raw chip and a Raw tile as on the previous slide, with the tile's processor pipeline drawn out. Pipeline stage labels: IF, RF, D, A, TL, M1, F, P, E, U, WB, M2, TV, F4]
o Registers r24, r25, r26, r27: input FIFOs from static router
o Registers r24, r25, r26, r27: output FIFOs to static router
o 0-cycle “local bypass network”
o Point-to-point, bypass-integrated, compiler-orchestrated on-chip networks
![Page 14: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/14.jpg)
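How to “program the wires”
[Figure: two adjacent tiles, Tile 10 and Tile 11, each with a processor and a software controlled crossbar]
Sending tile, processor: fmul r24, r3, r4
Sending tile, software controlled crossbar: route P->E
Receiving tile, software controlled crossbar: route W->P
Receiving tile, processor: fadd r5, r3, r25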
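The two-tile example above can be sketched as a small simulation. This is a minimal model under stated assumptions, not the Raw hardware: the `Tile` class, its FIFO names, and the helper functions are hypothetical, and the mapping "writing r24 sends a word to the switch, reading r25 takes a word from the switch" is inferred from how the slide's `fmul` and `fadd` use those registers.

```python
from collections import deque


class Tile:
    """Hypothetical model of one tile: registers plus network FIFOs."""

    def __init__(self):
        self.regs = {}
        self.to_switch = deque()    # processor -> switch (assumed: written via r24)
        self.from_switch = deque()  # switch -> processor (assumed: read via r25)
        self.west_in = deque()      # words arriving from the west neighbour

    def write(self, reg, value):
        if reg == "r24":
            self.to_switch.append(value)  # a register write becomes a network send
        else:
            self.regs[reg] = value

    def read(self, reg):
        if reg == "r25":
            return self.from_switch.popleft()  # a register read becomes a receive
        return self.regs[reg]


def route_p_to_e(tile, east_neighbour):
    """Switch instruction 'route P->E': processor word goes to the east link."""
    east_neighbour.west_in.append(tile.to_switch.popleft())


def route_w_to_p(tile):
    """Switch instruction 'route W->P': west link word goes to the processor."""
    tile.from_switch.append(tile.west_in.popleft())


def run_example():
    sender, receiver = Tile(), Tile()
    sender.regs.update(r3=2.0, r4=4.0)   # illustrative operand values
    receiver.regs.update(r3=1.5)
    # fmul r24, r3, r4 on the sending tile
    sender.write("r24", sender.read("r3") * sender.read("r4"))
    route_p_to_e(sender, receiver)
    route_w_to_p(receiver)
    # fadd r5, r3, r25 on the receiving tile
    receiver.write("r5", receiver.read("r3") + receiver.read("r25"))
    return receiver.regs["r5"]
```

With the illustrative operands chosen here, `run_example()` multiplies 2.0 by 4.0 on one tile and adds 1.5 on the other. The point of the sketch is that no explicit send or receive instruction appears in the processor code: the operand moves because of the register names and the two switch instructions.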
![Page 15: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/15.jpg)
The result of orchestrating the wires
[Figure: one chip running several workloads side by side: a custom datapath pipeline, memories (mem), httpd, a C program (ILP computation), an MPI program, and tiles marked “Zzzz”]
![Page 16: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/16.jpg)
Perspective
We have replaced
o Bypass paths, ALU-reg bus, FPU-Int. bus, reg-cache-bus, cache-mem bus, etc.
With a general, point-to-point, routed interconnect called:
Scalar operand network (SON)
Fundamentally new kind of network optimized for both scalar and stream transport
![Page 17: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/17.jpg)
Programming models and software for tiled processor architectures
o Conventional scalar programs (C, C++, Java). Or, how to do ILP
o Stream programs
![Page 18: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/18.jpg)
Scalar (ILP) program mapping
E.g., start with a C program, and several transformations later:
v2.4 = v2
seed.0 = seed
v1.2 = v1
pval1 = seed.0 * 3.0
pval0 = pval1 + 2.0
tmp0.1 = pval0 / 2.0
pval2 = seed.0 * v1.2
tmp1.3 = pval2 + 2.0
pval3 = seed.0 * v2.4
tmp2.5 = pval3 + 2.0
pval5 = seed.0 * 6.0
pval4 = pval5 + 2.0
tmp3.6 = pval4 / 3.0
pval6 = tmp1.3 - tmp2.5
v2.7 = pval6 * 5.0
pval7 = tmp1.3 + tmp2.5
v1.8 = pval7 * 3.0
v0.9 = tmp0.1 - v1.8
v3.10 = tmp3.6 - v2.7
tmp2 = tmp2.5
v1 = v1.8
tmp1 = tmp1.3
v0 = v0.9
tmp0 = tmp0.1
v3 = v3.10
tmp3 = tmp3.6
v2 = v2.7
Existing languages will work
Lee, Amarasinghe et al., “Space-time scheduling”, ASPLOS ’98
![Page 19: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/19.jpg)
pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
v3.10=tmp3.6-v2.7
v3=v3.10
v2.4=v2
pval3=seed.0*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
v2=v2.7
seed.0=seed
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
tmp0=tmp0.1
v1.2=v1
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
tmp1=tmp1.3
pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8
v0.9=tmp0.1-v1.8
v0=v0.9
Scalar program mapping
Program code: the 27 statements from the previous slide
Graph: the program graph, one node per statement (nodes listed above)
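The step from program code to graph can be sketched in a few lines. This is an illustrative reconstruction, not the compiler from the cited ASPLOS paper: the function names are hypothetical, only true (read-after-write) dependences are tracked, and every statement is assumed to take one time step.

```python
import re

# The 27 statements from the slide, in program order.
PROGRAM = """\
v2.4 = v2
seed.0 = seed
v1.2 = v1
pval1 = seed.0 * 3.0
pval0 = pval1 + 2.0
tmp0.1 = pval0 / 2.0
pval2 = seed.0 * v1.2
tmp1.3 = pval2 + 2.0
pval3 = seed.0 * v2.4
tmp2.5 = pval3 + 2.0
pval5 = seed.0 * 6.0
pval4 = pval5 + 2.0
tmp3.6 = pval4 / 3.0
pval6 = tmp1.3 - tmp2.5
v2.7 = pval6 * 5.0
pval7 = tmp1.3 + tmp2.5
v1.8 = pval7 * 3.0
v0.9 = tmp0.1 - v1.8
v3.10 = tmp3.6 - v2.7
tmp2 = tmp2.5
v1 = v1.8
tmp1 = tmp1.3
v0 = v0.9
tmp0 = tmp0.1
v3 = v3.10
tmp3 = tmp3.6
v2 = v2.7
"""


def parse(program):
    """Return one (destination, [source names]) pair per statement."""
    stmts = []
    for line in program.strip().splitlines():
        dest, expr = (part.strip() for part in line.split("="))
        # Names start with a letter; numeric constants such as 3.0 are skipped.
        sources = re.findall(r"[A-Za-z_][\w.]*", expr)
        stmts.append((dest, sources))
    return stmts


def dependence_depths(stmts):
    """Depth of each statement in the dependence graph (unit latency assumed)."""
    last_def = {}   # name -> index of the statement that most recently defined it
    depth = []
    for i, (dest, sources) in enumerate(stmts):
        preds = [last_def[s] for s in sources if s in last_def]
        depth.append(1 + max((depth[p] for p in preds), default=0))
        last_def[dest] = i
    return depth


STMTS = parse(PROGRAM)
DEPTHS = dependence_depths(STMTS)
CRITICAL_PATH = max(DEPTHS)
```

Under these assumptions the 27 statements have a longest dependence chain of 7 (for example seed.0, pval2, tmp1.3, pval7, v1.8, v0.9, v0), which is the kind of slack a space-time scheduler can spread across tiles.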
![Page 20: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/20.jpg)
pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
v3.10=tmp3.6-v2.7
v3=v3.10
v2.4=v2
pval3=seed.0*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
v2=v2.7
seed.0=seed
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
tmp0=tmp0.1
v1.2=v1
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
tmp1=tmp1.3
pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8
v0.9=tmp0.1-v1.8
v0=v0.9
Program graph clustering
![Page 21: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/21.jpg)
Placement
Tile layout: Tile1 Tile2 (top row), Tile3 Tile4 (bottom row)
Tile1:
seed.0=seed
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
tmp0=tmp0.1
Tile2:
pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
v3.10=tmp3.6-v2.7
v3=v3.10
Tile3:
v2.4=v2
pval3=seed.0*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
v2=v2.7
Tile4:
v1.2=v1
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
tmp1=tmp1.3
pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8
v0.9=tmp0.1-v1.8
v0=v0.9
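Once statements are placed, the values that must travel over the network follow directly from the placement. The sketch below derives them; the per-tile clusters are transcribed from the Placement and Routing slides, while the function name and the tuple format are hypothetical.

```python
import re

# Statement clusters per tile, transcribed from the Placement and Routing slides.
PLACEMENT = {
    "Tile1": ["seed.0=seed", "pval1=seed.0*3.0", "pval0=pval1+2.0",
              "tmp0.1=pval0/2.0", "tmp0=tmp0.1"],
    "Tile2": ["pval5=seed.0*6.0", "pval4=pval5+2.0", "tmp3.6=pval4/3.0",
              "tmp3=tmp3.6", "v3.10=tmp3.6-v2.7", "v3=v3.10"],
    "Tile3": ["v2.4=v2", "pval3=seed.0*v2.4", "tmp2.5=pval3+2.0", "tmp2=tmp2.5",
              "pval6=tmp1.3-tmp2.5", "v2.7=pval6*5.0", "v2=v2.7"],
    "Tile4": ["v1.2=v1", "pval2=seed.0*v1.2", "tmp1.3=pval2+2.0", "tmp1=tmp1.3",
              "pval7=tmp1.3+tmp2.5", "v1.8=pval7*3.0", "v1=v1.8",
              "v0.9=tmp0.1-v1.8", "v0=v0.9"],
}


def cross_tile_values(placement):
    """Return sorted (value, producer tile, consumer tile) triples."""
    home = {}  # value name -> tile whose statement defines it
    for tile, stmts in placement.items():
        for stmt in stmts:
            home[stmt.split("=")[0]] = tile
    needed = set()
    for tile, stmts in placement.items():
        for stmt in stmts:
            rhs = stmt.split("=")[1]
            # Names start with a letter; numeric constants are skipped.
            for src in re.findall(r"[A-Za-z_][\w.]*", rhs):
                if src in home and home[src] != tile:
                    needed.add((src, home[src], tile))
    return sorted(needed)


TRANSFERS = cross_tile_values(PLACEMENT)
```

The distinct value names in `TRANSFERS` are seed.0, tmp0.1, tmp1.3, tmp2.5, and v2.7, which are exactly the values that appear in send() and recv() on the Routing slide.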
![Page 22: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/22.jpg)
Routing
Tile1 processor code:
seed.0=seed
send(seed.0)
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
send(tmp0.1)
tmp0=tmp0.1
Tile1 switch code:
route(t,E,S)
route(t,E)
Tile2 processor code:
seed.0=recv()
pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
v2.7=recv()
v3.10=tmp3.6-v2.7
v3=v3.10
Tile2 switch code:
route(W,S,t)
route(W,S)
route(S,t)
Tile3 processor code:
v2.4=v2
seed.0=recv(0)
pval3=seed.0*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
send(tmp2.5)
tmp1.3=recv()
pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
send(v2.7)
v2=v2.7
Tile3 switch code:
route(N,t)
route(t,E)
route(E,t)
route(t,E)
Tile4 processor code:
v1.2=v1
seed.0=recv()
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
send(tmp1.3)
tmp1=tmp1.3
tmp2.5=recv()
pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8
tmp0.1=recv()
v0.9=tmp0.1-v1.8
v0=v0.9
Tile4 switch code:
route(N,t)
route(t,W)
route(W,t)
route(N,t)
route(W,N)
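One way to check that the processor and switch code on this slide agree is to replay only the send, recv, and route instructions on a 2x2 mesh. The sketch below does that under simplifying assumptions that are not from the slides: FIFOs are unbounded, timing is ignored, and each word is represented by the name of the value it carries. All identifiers are hypothetical.

```python
from collections import deque

# (row, col) positions on the 2x2 mesh: Tile1 Tile2 on top, Tile3 Tile4 below.
POS = {"Tile1": (0, 0), "Tile2": (0, 1), "Tile3": (1, 0), "Tile4": (1, 1)}
STEP = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
OPPOSITE = {"N": "S", "S": "N", "W": "E", "E": "W"}

# Network operations of each processor, in program order (arithmetic omitted).
PROC = {
    "Tile1": [("send", "seed.0"), ("send", "tmp0.1")],
    "Tile2": [("recv", "seed.0"), ("recv", "v2.7")],
    "Tile3": [("recv", "seed.0"), ("send", "tmp2.5"),
              ("recv", "tmp1.3"), ("send", "v2.7")],
    "Tile4": [("recv", "seed.0"), ("send", "tmp1.3"),
              ("recv", "tmp2.5"), ("recv", "tmp0.1")],
}
# Switch code as (source port, destination ports); "t" is the local processor.
SWITCH = {
    "Tile1": [("t", "ES"), ("t", "E")],
    "Tile2": [("W", "St"), ("W", "S"), ("S", "t")],
    "Tile3": [("N", "t"), ("t", "E"), ("E", "t"), ("t", "E")],
    "Tile4": [("N", "t"), ("t", "W"), ("W", "t"), ("N", "t"), ("W", "N")],
}


def simulate():
    """Run until nothing can advance; return (all_finished, received_log)."""
    at = {pos: name for name, pos in POS.items()}
    fifo = {(t, p): deque() for t in POS for p in "NSEWt"}  # switch inputs
    inbox = {t: deque() for t in POS}                       # switch -> processor
    pc = {t: 0 for t in POS}   # next processor operation
    sc = {t: 0 for t in POS}   # next switch instruction
    received = {t: [] for t in POS}
    progress = True
    while progress:
        progress = False
        for t in POS:
            if pc[t] < len(PROC[t]):
                op, name = PROC[t][pc[t]]
                if op == "send":
                    fifo[(t, "t")].append(name)
                    pc[t] += 1
                    progress = True
                elif inbox[t]:
                    # Log (value the code expects, value that actually arrived).
                    received[t].append((name, inbox[t].popleft()))
                    pc[t] += 1
                    progress = True
            if sc[t] < len(SWITCH[t]):
                src, dsts = SWITCH[t][sc[t]]
                if fifo[(t, src)]:
                    word = fifo[(t, src)].popleft()
                    for d in dsts:
                        if d == "t":
                            inbox[t].append(word)
                        else:
                            r, c = POS[t]
                            dr, dc = STEP[d]
                            neighbour = at[(r + dr, c + dc)]
                            fifo[(neighbour, OPPOSITE[d])].append(word)
                    sc[t] += 1
                    progress = True
    done = all(pc[t] == len(PROC[t]) and sc[t] == len(SWITCH[t]) for t in POS)
    return done, received
```

In this model every tile runs to completion and every recv() is handed the value its statement names, so the slide's routes are consistent with its processor code. The model does not capture cycle timing, which is the subject of the next slide.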
![Page 23: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/23.jpg)
Instruction Scheduling
seed.0=recv()
pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
v2.7=recv()
v3.10=tmp3.6-v2.7
v3=v3.10
route(W,t)
route(W,S)
route(S,t)
send(seed.0)
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
send(tmp0.1)
tmp0=tmp0.1
route(t,E)
route(t,E)
v2.4=v2
seed.0=recv(0)
pval3=seed.0*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
send(tmp2.5)
tmp1.3=recv()
pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
send(v2.7)
v2=v2.7
route(N,t)
route(t,E)
route(E,t)
route(t,E)
v1.2=v1
seed.0=recv()
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
send(tmp1.3)
tmp1=tmp1.3
tmp2.5=recv()
pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8
v0.9=tmp0.1-v1.8
v0=v0.9
route(N,t)
route(N,t)
route(W,N)
seed.0=seed
route(W,S)
route(W,S)
tmp0.1=recv()
route(t,W)
route(W,t)
route(W,N)
route(t,E)
Tile1 Tile3 Tile4 Tile2
time
![Page 24: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/24.jpg)
StreamIt: Stream Language and Compiler
e.g., BeamFormer (Amarasinghe et al.)
class BeamFormer extends Pipeline {
  void init(numChannels, numBeams) {
    add(new SplitJoin() {
      void init() {
        setSplitter(Duplicate());
        for (int i=0; i<numChannels; i++) {
          add(new FIR1(N1));
          add(new FIR2(N2));
        }
        setJoiner(RoundRobin());
      }});
    add(new SplitJoin() {
      void init() {
        setSplitter(Duplicate());
        for (int i=0; i<numBeams; i++) {
          add(new VectorMult());
          add(new FIR3(N3));
          add(new Magnitude());
          add(new Detect());
        }
        setJoiner(Null());
      }});
  }
}
[Figure: stream graph. Splitter, then 12 parallel channels of FIRFilter followed by FIRFilter, then Joiner, then Splitter, then 4 parallel beams of Vector Mult, FirFilter, Magnitude, Detector, then Joiner]
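The first split-join of the BeamFormer (duplicate splitter, per-channel filter pipeline, round-robin joiner) can be sketched in Python to show the data movement. This is not StreamIt and not its runtime: the function names are hypothetical, filters work on whole lists rather than StreamIt's peek/pop streams, and the taps are made-up values.

```python
def fir(taps):
    """Return a filter that convolves a list of samples with `taps`."""
    def run(samples):
        out = []
        for n in range(len(samples)):
            acc = 0.0
            for k, tap in enumerate(taps):
                if n - k >= 0:          # samples before the start count as zero
                    acc += tap * samples[n - k]
            out.append(acc)
        return out
    return run


def pipeline(*stages):
    """Chain stages so each one consumes the previous stage's output."""
    def run(samples):
        for stage in stages:
            samples = stage(samples)
        return samples
    return run


def split_join(branches):
    """Duplicate splitter plus round-robin joiner.

    Every branch sees the whole input; the joiner then takes one item
    from each branch in turn.
    """
    def run(samples):
        outputs = [branch(samples) for branch in branches]
        return [item for group in zip(*outputs) for item in group]
    return run


# Two channels, each a pipeline of two FIR filters (illustrative taps).
two_channels = split_join([
    pipeline(fir([1.0]), fir([0.5, 0.5])),
    pipeline(fir([2.0]), fir([1.0])),
])
RESULT = two_channels([1.0, 3.0])
```

Each branch is independent of the others, which is what lets a compiler place the channels on separate tiles and stream data between them. The second split-join in the slide uses a Null joiner, which this sketch does not model.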
![Page 25: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/25.jpg)
Raw Beamformer Layout (by hand)
[Figure: hand-placed layout on the Raw tile array, with regions labeled Search, FRM, FIR, and Control, and tiles drawn as memories (mem) feeding adders (+)]
![Page 26: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/26.jpg)
Raw die photo
0.18 micron process, 16 tiles, 425 MHz, 18 Watts (vpenta)
Of course, a custom IC designed by an industrial design team could do much better
![Page 27: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/27.jpg)
Raw motherboard
![Page 28: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/28.jpg)
More Experimental Systems
Systems Online or in Pipeline
o Raw Workstation
o Raw-based 1020 Microphone Array
o Raw 802.11a/g wireless system (collab with Engim)
o Raw Gigabit IP router
o Raw graphics system
o Raw supercomputing fabric
![Page 29: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/29.jpg)
Empirical Evaluation
Compare to P3 implemented in similar technology

| Parameter | Raw (IBM ASIC) | P3 (Intel) |
| --- | --- | --- |
| Litho | 180 nm | 180 nm |
| Process | CMOS 7SF | P858 |
| Metal Layers | Cu 6 | Al 6 |
| FO1 Delay | 23 ps | 11 ps |
| Dielectric k | 4.1 | 3.55 |
| Design Style | Standard Cell SA27E ASIC | Full custom |
| Initial Freq | 425 MHz | 500-733 MHz (use 600) |
| Die Area | 331 mm² | 106 mm² |

Raw #s from cycle-accurate simulator validated against real chip
-- FPGA mem controller in Raw
-- Raw SW i-caching adds 0-30% ovhd
![Page 30: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/30.jpg)
Performance Results [ISCA’04]
[Figure: Performance (vertical axis) across the Architecture Space (horizontal axis)]
o ~10x parallelism
o ~4x ld/store elim
o ~4x stream mem bw
![Page 31: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/31.jpg)
Raw IP Router
[Figure: bar chart of throughput in Gb/sec (axis from 0 to 14) for DC10 Motorola (C-Port), EZChip NP-2, IBM PowerNP, Intel IXP 2800, Vitesse PRISM IQ2200, Cisco Toaster III, and Raw Router]
![Page 32: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/32.jpg)
VersaBench
Sharing a benchmark set to stress versatility of processors
Categories of programs:
o ILP: Desktop and Scientific
o Streams
o Throughput oriented servers
o Bit-level embedded
www.cag.csail.mit.edu/versabench
![Page 33: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/33.jpg)
Summary
Raw: single chip for ILP and streaming
Scalar operand network is key to ILP and streams
Tiled processor chip demonstrated in 2002
www.cag.csail.mit.edu/raw
![Page 34: The Raw Tiled Processor Architecture Is the future of ...](https://reader031.fdocuments.us/reader031/viewer/2022022621/62195dc62675756b032404cf/html5/thumbnails/34.jpg)