Dataflow: A Complement to Superscalar

Mihai Budiu – Microsoft Research

Pedro V. Artigas – Carnegie Mellon University

Seth Copen Goldstein – Carnegie Mellon University

2005

2

Computer Architecture: A Simplified History

[Timeline figure: superscalar and dataflow, with time marks at 1967, 1990, and 2005]

3

This Work

• Re-evaluate dataflow
  – Same workloads as superscalar (C programs: Mediabench, Spec)
  – Modern performance analysis tool (whole-program critical path)
• Use of superscalar mechanisms in dataflow

4

Why Study Dataflow

• Naturally exploit ILP
• Potentially very high ILP
• Simple, regular microarchitecture
• Very low power [1/1000 of superscalar]
• Suitable for stream processing

5

Outline

• Motivation
• ASH: A Static Dataflow Model
• Explaining bottlenecks
• Conclusions

6

Application-Specific Hardware

C program => Compiler => Dataflow IR

7

Computation Dataflow

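Program:

x = a & 7;
...
y = x >> 2;

IR: a and 7 feed the & node; its result x and the constant 2 feed the >> node.

Circuits: a passes through an "& 7" stage, then a ">> 2" stage.

Program      IR             Circuits
Operations   Nodes          Pipeline stages
Variables    Def-use edges  Channels (wires)

Pure dataflow: no program counter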
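The mapping above (operations become nodes, variables become def-use edges) can be sketched in C. This is only an illustration, not the CASH IR: the names `struct node`, `deliver`, and `eval_example` are hypothetical, and a node here simply fires once both operands have arrived, with no program counter deciding the order.

```c
/* Hypothetical node kinds for this sketch; not the CASH IR. */
enum op { OP_AND, OP_SHR };

struct node {
    enum op op;
    int in[2];     /* operand values */
    int have[2];   /* 1 once the operand has arrived */
    int out;       /* index of the consumer node, or -1 for the final result */
    int out_port;  /* operand slot of the consumer that this node feeds */
};

static int apply(enum op op, int l, int r)
{
    return op == OP_AND ? (l & r) : (l >> r);
}

/* Deliver a value on a def-use edge; the node fires when both inputs are present. */
static void deliver(struct node *g, int n, int port, int value, int *result)
{
    g[n].in[port] = value;
    g[n].have[port] = 1;
    if (!(g[n].have[0] && g[n].have[1]))
        return;                          /* still waiting for the other operand */

    int v = apply(g[n].op, g[n].in[0], g[n].in[1]);
    if (g[n].out < 0)
        *result = v;                     /* no consumer: this is the graph output */
    else
        deliver(g, g[n].out, g[n].out_port, v, result);
}

/* Graph for: x = a & 7; y = x >> 2; the constants 7 and 2 are preloaded. */
int eval_example(int a)
{
    struct node g[2] = {
        { OP_AND, {0, 7}, {0, 1}, 1, 0 },   /* x = a & 7, feeds port 0 of node 1 */
        { OP_SHR, {0, 2}, {0, 1}, -1, 0 },  /* y = x >> 2, graph output */
    };
    int y = 0;
    deliver(g, 0, 0, a, &y);
    return y;
}
```

Calling `eval_example(29)` sends the token 29 into the & node; the result 5 then flows along the edge x into the >> node, which produces 1.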

8

Basic Computation=Pipeline Stage

[Figure: a pipeline stage built from an operation (+) followed by a latch, with data, valid, and ack signals]

9

Control Flow => Data Flow

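[Figure: nodes that turn control flow into data flow: Merge (label), Split (branch) steered by a predicate p and its negation (!), and Gateway; inputs are data and predicate]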

10

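Loops

int sum=0, i;

for (i=0; i < 100; i++)

sum += i*i;

return sum;

[Figure: dataflow graph of the loop. i starts at 0 and circulates through +1 and < 100; sum starts at 0 and accumulates i * i through * and +; ret receives sum when the loop predicate is false (!)]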
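As a rough illustration of how the loop runs without a program counter, the sketch below mimics the token flow in plain C. It is an assumption-laden model, not ASH itself: `merge` and `dataflow_sum` are hypothetical names, the merge picks the initial token on entry and the back-edge token afterwards, and the loop predicate decides whether sum goes around again or to ret.

```c
/* Merge: picks the initial token on loop entry, the back-edge token afterwards. */
static int merge(int entering, int initial, int back_edge)
{
    return entering ? initial : back_edge;
}

/* Token-flow model of: for (i = 0; i < limit; i++) sum += i*i; return sum; */
int dataflow_sum(int limit)
{
    int i_back = 0, sum_back = 0;   /* values carried on the back edges */
    int entering = 1;               /* true only on the first pass */

    for (;;) {
        int i   = merge(entering, 0, i_back);
        int sum = merge(entering, 0, sum_back);
        entering = 0;

        int p = i < limit;          /* loop predicate */
        if (!p)
            return sum;             /* predicate false: sum is steered to ret */

        sum_back = sum + i * i;     /* the "*" and "+" nodes */
        i_back   = i + 1;           /* the "+1" node */
    }
}
```

With the slide's bound, `dataflow_sum(100)` returns 328350, the sum of the squares from 0 to 99.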

11

Comparison: Idealized Simulation

• Compared to 4-wide OOO SimpleScalar
• Same operation latencies
• Same memory hierarchy (LSQ, L1, L2)
• not free

12

Obvious!

ASH runs at full dataflow speed, and has no resource limitations, so the CPU cannot do any better (if the compilers are equally good)

13

SpecInt95, ASH vs 4-way OOO

[Bar chart: ASH vs 4-way OOO; y-axis: Percent slower / faster, from -50 to 30; benchmarks: 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex]

14

Outline

• Motivation
• ASH: A Static Dataflow Model
• Dissection: explaining bottlenecks
• Conclusions

15

The Scalpel

[Figure: tool flow from the C program through CASH to ASH and the simulator; the ASH trace and drawings feed an automatic analysis that yields the dynamic critical path]

16

The (Loop) Body

for (j = 0; X[j].r != 0xF; j++)

if (X[j].r == i)

break;

SpecINT95: 124.m88ksim, init_processor()

17


for (j = 0; X[j].r != 0xF; j++)

if (X[j].r == i)

break;

[Figure: dataflow graph of the loop, with labels: load predicate, loop predicate, sizeof(X[j]), definition]

18

MIPS gcc Code

LOOP:

L1: beq $v0,$a1,EXIT ; X[j].r == i

L2: addiu $v1,$v1,20 ; &X[j+1].r

L3: lw $v0,0($v1) ; X[j+1].r

L4: addiu $a0,$a0,1 ; j++

L5: bne $v0,$a3,LOOP ; X[j+1].r != 0xF

EXIT:

L1=>L2=>L3=>L5=>L1: 4-instruction loop-carried dependence

for (j = 0; X[j].r != 0xF; j++)

if (X[j].r == i)

break;

19

If Branch Prediction Correct

L1=>L2=>L3=>L5=>L1

for (j = 0; X[j].r != 0xF; j++)

if (X[j].r == i)

break;

LOOP:


L2: addiu $v1,$v1,20 ; &X[j+1].r

L3: lw $v0,0($v1) ; X[j+1].r

L4: addiu $a0,$a0,1 ; j++


EXIT:

20

SpecInt95, perfect prediction

[Bar chart: y-axis: Percent slower/faster, from -60 to 60; benchmarks: 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex; labels: Speed-up, prediction, no data]

21

Critical Path with Prediction

Loads are not speculative

for (j = 0; X[j].r != 0xF; j++)

if (X[j].r == i)

break;

22

Prediction + Load Speculation

~4 cycles!

Load not pipelined (self-anti-dependence)

ack edge

for (j = 0; X[j].r != 0xF; j++)

if (X[j].r == i)

break;

23

OOO Pipe Snapshot

[Figure: OOO pipeline snapshot with stages IF, DA, EX, WB, CT; L3 occupies three stages at once; label: register renaming]

LOOP:


L2: addiu $v1,$v1,20 ; &X[j+1].r

L3: lw $v0,0($v1) ; X[j+1].r

L4: addiu $a0,$a0,1 ; j++


EXIT:

24

Conclusions: Limitations of Static Dataflow

1. dataflow state is “more” distributed

2. “control” dependences still limit ILP

3. nontrivial to squash distributed speculation

4. good prediction may need global information

5. self-antidependences can be critical

(removed by register renaming)

6. distributed computation => more remote accesses

7. more synchronization in dataflow (“join” is not free)

26

Unrolling Does Not Help

for(i = 0; i < 64; i++) {

for (j = 0; X[j].r != 0xF; j+=2) {

if (X[j].r == i)

break;

if (X[j+1].r == 0xF)

break;

if (X[j+1].r == i)

break;

}

Y[i] = X[j].q;

}

when 1 iteration

27

How Performance Is Evaluated

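[Figure: evaluation setup. The C program is compiled by CASH to static dataflow with unlimited ILP, and by gcc for SimpleScalar; both sit on the same memory system: LSQ, L1 (8K), L2 (1/4M), and Mem, annotated 2, 8, and 72]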

28

Last-Arrival Events

[Figure: a + node with data, valid, and ack signals]

• Event enabling the generation of a result
• May be an ack
• Critical path = collection of last-arrival edges

29


1. Start from last node
2. Trace back along last-arrival edges
3. Some edges may repeat
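The three steps above can be sketched as a short trace walk in C. This is a minimal illustration under assumptions, not the actual analysis tool: `struct event`, its fields, and `trace_critical_path` are hypothetical names, and the trace is assumed to record, for every node firing, which earlier firing supplied its last-arriving input.

```c
#include <stddef.h>

/* One dynamic event (one firing of a node) in a hypothetical trace. */
struct event {
    int node;          /* static node id that fired */
    int last_arrival;  /* index of the event whose output arrived last, -1 for a source */
};

/*
 * Walk back from event `last` along last-arrival edges.
 * count[a * nnodes + b] is incremented each time the static edge a -> b
 * appears on the path, so repeated edges show up with counts above 1.
 * Returns the number of edges on the critical path.
 */
size_t trace_critical_path(const struct event *trace, size_t last,
                           unsigned *count, size_t nnodes)
{
    size_t len = 0;
    size_t cur = last;

    while (trace[cur].last_arrival >= 0) {
        size_t prev = (size_t)trace[cur].last_arrival;
        size_t from = (size_t)trace[prev].node;
        size_t to   = (size_t)trace[cur].node;

        count[from * nnodes + to]++;   /* tally this static edge */
        cur = prev;                    /* step to the enabling event */
        len++;
    }
    return len;
}
```

The static edges with the highest counts are the ones that dominate the dynamic critical path, which is why they are the natural place to look for bottlenecks.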


30

History

[Timeline figure; categories: Out-of-order, Branch pred, Speculation]

• Thornton: CDC, 1964
• Karp: Graph model, 1966
• Tomasulo: IBM 360, 1967
• Dennis: Dataflow lang, 1974
• Arvind: Tagged-token, 1977
• Smith: Br pred, 1981
• Fisher: VLIW
• Cocke: Superscalar, 1985
• Smith: Precise spec, 1988
• Papadopoulos: Monsoon, 1988
• Burger: TRIPS, 2001
• Oskin: WaveScalar, 2003
