Dataflow: A Complement to Superscalar
Mihai Budiu – Microsoft Research
Pedro V. Artigas – Carnegie Mellon University
Seth Copen Goldstein – Carnegie Mellon University
2005

Computer Architecture: A Simplified History
[Figure: timeline marked 1967, 1990, and 2005, showing superscalar and dataflow.]

This Work
• Re-evaluate dataflow
  – Same workloads as superscalar (C programs: Mediabench, Spec)
  – Modern performance analysis tool (whole-program critical path)
• Use of superscalar mechanisms in dataflow

Why Study Dataflow
• Naturally exploit ILP
• Potentially very high ILP
• Simple, regular microarchitecture
• Very low power [1/1000 superscalar]
• Suitable for stream processing

Outline
• Motivation
• ASH: A Static Dataflow Model
• Explaining bottlenecks
• Conclusions

Application-Specific Hardware
[Figure: flow from C program through Compiler to Dataflow IR.]

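Computation Dataflow
Program:
x = a & 7;
...
y = x >> 2;
IR: a and 7 feed an & node; its result x and the constant 2 feed a >> node.
Circuits: a feeds an "& 7" stage, which feeds a ">> 2" stage.

Program      IR              Circuits
Operations   Nodes           Pipeline stages
Variables    Def-use edges   Channels (wires)

Pure dataflow: no program counter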

Basic Computation = Pipeline Stage
[Figure: a "+" operation followed by a latch, with data, valid, and ack signals.]
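The data/valid/ack handshake can be approximated in software. What follows is a minimal C model, not the ASH circuit itself: the `Stage` type and the functions `stage_fire`, `stage_take`, `and7`, and `shr2` are hypothetical names, and the model assumes that a stage accepts a new input only while its output latch is empty.

```c
#include <stdbool.h>

/* Hypothetical model of one stage: an operation followed by an output latch. */
typedef struct {
    int  data;   /* latched result */
    bool valid;  /* true while the latch holds an unconsumed result */
} Stage;

/* Producer side: try to fire the stage on input `in`.
   The return value plays the role of the ack sent back to the producer. */
bool stage_fire(Stage *s, int in, int (*op)(int)) {
    if (s->valid)
        return false;        /* latch still full: stall, no ack */
    s->data = op(in);
    s->valid = true;
    return true;
}

/* Consumer side: take the result if one is valid, which frees the latch. */
bool stage_take(Stage *s, int *out) {
    if (!s->valid)
        return false;
    *out = s->data;
    s->valid = false;
    return true;
}

/* The two operations of the running example: x = a & 7; y = x >> 2; */
int and7(int a) { return a & 7; }
int shr2(int x) { return x >> 2; }
```

Chaining two `Stage` values with `and7` and `shr2` reproduces the two-stage pipeline for `x = a & 7; y = x >> 2;`. A stage that still holds an unconsumed result refuses new input, which is the stall that the ack signal encodes in this model.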

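Control Flow => Data Flow
[Figure: three constructs. Merge (label) joins data inputs; Gateway passes data under a predicate; Split (branch) steers data using the predicate p and its negation.]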
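One way to read these constructs is as operations on tokens that may or may not be present. The sketch below is an illustration under that assumption, not the ASH implementation: the `Token` type and the functions `gateway`, `merge`, and `branch_example` are hypothetical names.

```c
#include <stdbool.h>

/* A token is a value that may or may not be present on a wire.
   This type and the function names below are illustrative assumptions. */
typedef struct {
    int  value;
    bool present;
} Token;

/* Gateway: lets the data through only when its predicate is true. */
Token gateway(int data, bool pred) {
    Token t = { data, pred };
    return t;
}

/* Merge: forwards whichever input carries a token.
   Assumes at most one input is present, as at a control-flow join. */
Token merge(Token a, Token b) {
    return a.present ? a : b;
}

/* Dataflow form of: if (p) x = a + 1; else x = a - 1; */
int branch_example(int a, bool p) {
    Token then_in  = gateway(a, p);    /* split: taken side */
    Token else_in  = gateway(a, !p);   /* split: not-taken side */
    Token then_out = { then_in.value + 1, then_in.present };
    Token else_out = { else_in.value - 1, else_in.present };
    return merge(then_out, else_out).value;
}
```

In this reading a branch becomes a pair of gateways driven by `p` and `!p`, and the label where the two paths rejoin becomes a merge.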

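Loops
int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;
[Figure: dataflow graph of the loop. i starts at 0 and passes through "+1" and "< 100" nodes; sum starts at 0 and passes through "*" and "+" nodes; a "ret" node returns sum.]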
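The loop can also be written so that each trip around the graph is one explicit step over the two loop-carried values. This is a sketch only; `LoopState`, `loop_step`, and `run_loop` are hypothetical names, and the sequential code does not model the parallelism of the circuit.

```c
#include <stdbool.h>

/* The two loop-carried values that circulate around the graph. */
typedef struct {
    int i;
    int sum;
} LoopState;

/* One trip around the loop graph. Returns the loop predicate (i < 100);
   when it is false the state is left unchanged and sum can be returned. */
bool loop_step(LoopState *s) {
    bool p = s->i < 100;        /* "< 100" node */
    if (p) {
        s->sum += s->i * s->i;  /* "*" and "+" nodes */
        s->i += 1;              /* "+1" node */
    }
    return p;
}

/* Runs the loop to completion; corresponds to the "ret" node. */
int run_loop(void) {
    LoopState s = { 0, 0 };
    while (loop_step(&s)) {
        /* keep circulating until the predicate turns false */
    }
    return s.sum;
}
```

Here `run_loop()` returns 328350, the sum of i*i for i from 0 to 99.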

Comparison: Idealized Simulation
• Compared to 4-wide OOO SimpleScalar
• Same operation latencies
• Same memory hierarchy (LSQ, L1, L2)
• not free

Obvious!
ASH runs at full dataflow speed, and has no resource limitations, so CPU cannot do any better (if compilers equally good)

SpecInt95, ASH vs 4-way OOO
[Figure: bar chart of percent slower / faster (y-axis from -50 to 30) for 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, and 147.vortex.]

Outline
• Motivation
• ASH: A Static Dataflow Model
• Dissection: explaining bottlenecks
• Conclusions

The Scalpel
[Figure: tool flow with the labels C, CASH, ASH, Simulator, ASH trace, drawings, Dynamic Critical Path, and Automatic analysis.]

The (Loop) Body
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
SpecINT95: 124.m88ksim, init_processor()

Dynamic Critical Path
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
[Figure: dynamic critical path through the loop graph, with the labels: load predicate, loop predicate, sizeof(X[j]), definition.]

MIPS gcc Code
LOOP:
L1: beq $v0,$a1,EXIT ; X[j].r == i
L2: addiu $v1,$v1,20 ; &X[j+1].r
L3: lw $v0,0($v1) ; X[j+1].r
L4: addiu $a0,$a0,1 ; j++
L5: bne $v0,$a3,LOOP ; X[j+1].r != 0xF
EXIT:
L1=>L2=>L3=>L5=>L1
4-instruction loop-carried dependence
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;

If Branch Prediction Correct
L1=>L2=>L3=>L5=>L1
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
LOOP:
L1: beq $v0,$a1,EXIT ; X[j].r == i
L2: addiu $v1,$v1,20 ; &X[j+1].r
L3: lw $v0,0($v1) ; X[j+1].r
L4: addiu $a0,$a0,1 ; j++
L5: bne $v0,$a3,LOOP ; X[j+1].r != 0xF
EXIT:

SpecInt95, perfect prediction
[Figure: bar chart of percent slower/faster (y-axis from -60 to 60) for 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, and 147.vortex, with the annotations "Speed-up", "prediction", and "no data".]

Critical Path with Prediction
Loads are not speculative
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;

Prediction + Load Speculation
~4 cycles!
Load not pipelined (self-anti-dependence)
ack edge
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;

OOO Pipe Snapshot
IF DA EX WB CT
L3 L3 L3
register renaming
LOOP:
L1: beq $v0,$a1,EXIT ; X[j].r == i
L2: addiu $v1,$v1,20 ; &X[j+1].r
L3: lw $v0,0($v1) ; X[j+1].r
L4: addiu $a0,$a0,1 ; j++
L5: bne $v0,$a3,LOOP ; X[j+1].r != 0xF
EXIT:

Conclusions: Limitations of Static Dataflow
1. dataflow state is “more” distributed
2. “control” dependences still limit ILP
3. nontrivial to squash distributed speculation
4. good prediction may need global information
5. self-antidependences can be critical
(removed by register renaming)
6. distributed computation => more remote accesses
7. more synchronization in dataflow (“join” is not free)

Unrolling Does Not Help
for (i = 0; i < 64; i++) {
    for (j = 0; X[j].r != 0xF; j += 2) {
        if (X[j].r == i)
            break;
        if (X[j+1].r == 0xF)
            break;
        if (X[j+1].r == i)
            break;
    }
    Y[i] = X[j].q;
}
when 1 iteration
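For experimenting with the transformation, the simple and the unrolled inner loops can be compared directly. The sketch below rests on assumptions: the `Entry` layout is hypothetical (only the fields `r` and `q` appear on the slides), the function names are invented, and the unrolled version advances `j` before its second and third exits so that both functions return the same index. That adjustment is not on the slide.

```c
#include <stdint.h>

/* Hypothetical record layout: only the fields r and q appear on the slides. */
typedef struct {
    int32_t r;
    int32_t q;
} Entry;

/* Original loop: index of the first entry whose r is i or the 0xF sentinel. */
int find_simple(const Entry *X, int i) {
    int j;
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
    return j;
}

/* Unrolled by two. Assumption: j is advanced before the second and third
   exits so that it names the same entry as in find_simple. */
int find_unrolled(const Entry *X, int i) {
    int j;
    for (j = 0; X[j].r != 0xF; j += 2) {
        if (X[j].r == i)
            break;
        if (X[j + 1].r == 0xF) { j++; break; }
        if (X[j + 1].r == i)   { j++; break; }
    }
    return j;
}
```

Both functions require the array to end with an entry whose `r` is `0xF`. The unrolled body still examines the entries in order, with each exit test depending on the one before it.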

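How Performance Is Evaluated
[Figure: the C source is compiled by gcc for SimpleScalar and by CASH for unlimited-ILP static dataflow. Both use the same memory hierarchy: LSQ, L1 (8K), L2 (1/4M), and Mem, annotated with the values 2, 8, and 72.]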

Last-Arrival Events
[Figure: a "+" node with data, valid, and ack signals.]
• Event enabling the generation of a result
• May be an ack
• Critical path = collection of last-arrival edges

Dynamic Critical Path
1. Start from last node
2. Trace back along last-arrival edges
3. Some edges may repeat
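The trace-back procedure can be sketched as a backward walk over a recorded event trace. This is an assumption-laden illustration rather than the authors' tool: the `Event` record, the `MAX_NODES` bound, and `trace_critical_path` are hypothetical, and each event is assumed to store the index of the event that arrived last and thereby enabled it.

```c
#define MAX_NODES 8   /* hypothetical bound on static graph nodes */

/* One dynamic event: a node producing a result at run time. */
typedef struct {
    int node;          /* static node that fired */
    int last_arrival;  /* index of the event that enabled this one, -1 at the start */
} Event;

/* Walks back from event `last` along last-arrival edges.
   edge_hits[src][dst] counts how often each static edge lies on the path,
   since the same static edge may repeat (for example once per iteration).
   Returns the number of dynamic edges on the critical path. */
int trace_critical_path(const Event *ev, int last,
                        int edge_hits[MAX_NODES][MAX_NODES]) {
    int edges = 0;
    for (int e = last; ev[e].last_arrival >= 0; e = ev[e].last_arrival) {
        int pred = ev[e].last_arrival;
        edge_hits[ev[pred].node][ev[e].node]++;
        edges++;
    }
    return edges;
}
```

Static edges with the highest counts in `edge_hits` are the ones that appear most often on the dynamic critical path, which makes them natural candidates to inspect first.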

History
Out-of-order, Branch pred, Speculation
• Tomasulo: IBM 360, 1967
• Thornton: CDC, 1964
• Karp: Graph model, 1966
• Smith: Br pred, 1981
• Fisher: VLIW
• Cocke: Superscalar, 1985
• Smith: Precise spec, 1988
• Dennis: Dataflow lang, 1974
• Burger: TRIPS, 2001
• Oskin: WaveScalar, 2003
• Arvind: Tagged-token, 1977
• Papadopoulos: Monsoon, 1988