A Data-Driven Approach for Pipelining Sequences
of Data-Dependent LOOPs
João M. P. Cardoso
ITIV, University of Karlsruhe, July 2, 2007
Portugal
2
Motivation
Many applications have sequences of tasks
• E.g., in image and video processing algorithms
Contemporary FPGAs
• Plenty of room to accommodate highly specialized, complex architectures
• It is time to creatively “use available resources” rather than simply “save resources”
3
Motivation
Computing Stages
• Sequentially
[Figure: Task A → Task B → Task C, executed one after another along the time axis]
4
Motivation
Computing Stages
• Concurrently
[Figure: Task A, Task B, and Task C overlapping along the time axis]
5
Outline
Objective
Loop Pipelining
Producer/Consumer Computing Stages
Pipelining Sequences of Loops
Inter-Stage Communication
Experimental Setup and Results
Related Work
Conclusions
Future Work
6
Objectives
To speed up applications with multiple, data-dependent stages
• each stage seen as a set of nested loops
How?
• By pipelining those sequences of data-dependent stages using fine-grain synchronization schemes
• Taking advantage of field-custom computing structures (FPGAs)
7
Loop Pipelining
Attempt to overlap loop iterations
• Significant speedups are achieved
• But how to pipeline sequences of loops?
[Figure: iterations I1–I4 executed sequentially vs. overlapped in time]
8
Computing Stages
Sequentially
Producer: ...A[2]A[1]A[0]
Consumer: A[0]A[1]A[2]...
9
Computing Stages
Concurrently
• Ordered producer/consumer pairs
• Send/receive
Producer: ...A[2]A[1]A[0]
Consumer: A[0]A[1]A[2]...
[Figure: FIFO with N stages holding A[0], A[1], A[2], A[3], ... between producer and consumer]
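For ordered producer/consumer pairs, the FIFO above can be modeled in software as a simple ring buffer. A minimal sketch (the names and the stall-by-returning-false convention are illustrative, not the hardware implementation):

```c
#include <stdbool.h>

#define FIFO_N 4  /* number of FIFO stages (illustrative) */

typedef struct {
    int buf[FIFO_N];
    int head, tail, count;
} fifo_t;

/* Producer side: returns false when the FIFO is full (producer must stall). */
bool fifo_send(fifo_t *f, int v) {
    if (f->count == FIFO_N) return false;
    f->buf[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_N;
    f->count++;
    return true;
}

/* Consumer side: returns false when the FIFO is empty (consumer must stall). */
bool fifo_recv(fifo_t *f, int *v) {
    if (f->count == 0) return false;
    *v = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_N;
    f->count--;
    return true;
}
```

Note that a FIFO only works when the consumer reads elements in exactly the order the producer wrote them, which is the restriction the empty/full table on the next slide removes.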
10
Computing Stages
Concurrently
• Unordered producer/consumer pairs
• Empty/full table
Producer: ...A[3]A[5]A[1]
Consumer: A[3]A[1]A[5]...
[Figure: empty/full table, one full bit per data cell; e.g., the cells holding A[1] and A[5] are full (bit = 1), the remaining cells are empty (bit = 0)]
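The empty/full table pairs each data cell with a single full bit: the producer sets the bit when it stores, and a consumer read only succeeds once the bit is set. A minimal software model of this behavior (illustrative names; the hardware realizes it as a dual-port 1-bit table):

```c
#include <stdbool.h>

#define TABLE_SIZE 8  /* illustrative */

typedef struct {
    int  data[TABLE_SIZE];
    bool full[TABLE_SIZE];   /* the 1-bit empty/full table */
} ef_table_t;

/* Producer: store a value, then mark the cell full. */
void ef_write(ef_table_t *t, int idx, int v) {
    t->data[idx] = v;
    t->full[idx] = true;
}

/* Consumer: succeeds only if the cell has been produced;
   otherwise the consumer stage must retry (stall). */
bool ef_read(const ef_table_t *t, int idx, int *v) {
    if (!t->full[idx]) return false;
    *v = t->data[idx];
    return true;
}
```

Because each cell is flagged individually, the consumer may read in any order, which is what permits out-of-order producer/consumer pairs.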
11
Main Idea
FDCT
[Figure: a global FSM sequences Loops 1, 2 (consuming the data input and producing intermediate data) and Loop 3 (producing the data output); the intermediate data array is an 8×8 block with column indices 0 to 7 and row offsets 0, 8, 16, 24, 32, 40, 48, 56]
Timeline: the execution of Loops 1, 2 completes before the execution of Loop 3 starts.
12
Main Idea
FDCT
• Out-of-order producer/consumer pairs
• How to overlap computing stages?
[Figure: two views of the 8×8 intermediate data array (row offsets 0, 8, 16, 24, 32, 40, 48, 56), one as written by Loops 1, 2 and one as read by Loop 3]
13
Main Idea
Pipelined FDCT
[Figure: FSM 1 controls Loops 1, 2 (fed by the data input); FSM 2 controls Loop 3 (producing the data output); the stages communicate through the intermediate data array in a dual-port RAM plus a dual-port 1-bit empty/full table]
Timeline: the execution of Loop 3 overlaps the execution of Loops 1, 2.
14
Main Idea
[Figure: Task A and Task B connected through memories; each inter-task communication goes through a memory]
15
Possible Scenarios
Single write, single read
• Accepted without code changes
Single write, multiple reads
• Accepted without code changes (by using an N-bit table)
Multiple writes, single read
• Needs code transformations
Multiple writes, multiple reads
• Needs code transformations
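For the single-write/multiple-reads case, the 1-bit table generalizes to an N-bit table. One plausible realization, sketched below, keeps one bit per expected read and clears a bit on each read; the bit-per-reader layout is an assumption for illustration (the slides only state that an N-bit table is used):

```c
#include <stdbool.h>

#define CELLS 8   /* illustrative table size */
#define L_READS 3 /* number of reads expected per cell (assumed) */

typedef struct {
    int      data[CELLS];
    unsigned pending[CELLS];  /* L_READS-bit mask: set bits = reads still allowed */
} nbit_table_t;

/* Producer (single write): store the value and arm all read bits. */
void nwrite(nbit_table_t *t, int idx, int v) {
    t->data[idx] = v;
    t->pending[idx] = (1u << L_READS) - 1u;  /* 11...1 (L_READS bits) */
}

/* Consumer `reader` (0..L_READS-1): succeeds once the value is available
   and this reader has not consumed it yet. */
bool nread(nbit_table_t *t, int idx, int reader, int *v) {
    unsigned bit = 1u << reader;
    if (!(t->pending[idx] & bit)) return false;  /* not produced, or already read */
    t->pending[idx] &= ~bit;
    *v = t->data[idx];
    return true;
}
```

When the mask reaches zero the cell is logically empty again, so the same storage can be reused without extra bookkeeping.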
16
Inter-Stage Communication
Responsible for:
• Communicating data between pipelined stages
• Flagging data availability
Solutions:
• Perfectly associative memory: cost too high
• Memory for data plus a 1-bit table (each cell holds the full/empty information)
  • Sized to the data set to communicate
  • Size can be decreased using a hash-based solution
[Figure: empty/full table, as on the earlier slide: a 1-bit column flags the full cells (e.g., A[1], A[5]) next to the data column]
17
Consumer (Loop 3):

    i_1 = 0;
    for (i = 0; i < N*num_fdcts; i++) {   // Loop 3
    L1: f0 = tmp[i_1];   if (!tab[i_1])   goto L1;
    L2: f1 = tmp[1+i_1]; if (!tab[1+i_1]) goto L2;
        // remaining loads
        // computations ...
        // stores
        i_1 += 8;
    }

Producer (Loops 1, 2):

    ...
    boolean tab[SIZE] = {0, 0, ..., 0};
    ...
    for (i = 0; i < num_fdcts; i++) {     // Loop 1
        for (j = 0; j < N; j++) {         // Loop 2
            // loads
            // computations
            // stores
            tmp[48+i_1] = F6 >> 13; tab[48+i_1] = true;
            tmp[56+i_1] = F7 >> 13; tab[56+i_1] = true;
            i_1++;
        }
        i_1 += 56;
    }
Inter-Stage Communication
Memory plus 1-bit table
[Figure: img → Loops 1, 2 (FSM 1) → dual-port memory tmp plus dual-port 1-bit table tab → Loop 3 (FSM 2) → dct_o; data and address connections shown]
18
Consumer (Loop 3):

    i_1 = 0;
    for (i = 0; i < N*num_fdcts; i++) {   // Loop 3
    L1: f0 = tmp[H(i_1)];   if (!tab[H(i_1)])   goto L1;
    L2: f1 = tmp[H(1+i_1)]; if (!tab[H(1+i_1)]) goto L2;
        // remaining loads
        // computations ...
        // stores
        i_1 += 8;
    }

Producer (Loops 1, 2):

    ...
    boolean tab[SIZE] = {0, 0, ..., 0};
    ...
    for (i = 0; i < num_fdcts; i++) {     // Loop 1
        for (j = 0; j < N; j++) {         // Loop 2
            // loads
            // computations
            // stores
            tmp[H(48+i_1)] = F6 >> 13; tab[H(48+i_1)] = true;
            tmp[H(56+i_1)] = F7 >> 13; tab[H(56+i_1)] = true;
            i_1++;
        }
        i_1 += 56;
    }
Inter-Stage Communication
Hash-based solution:
[Figure: as before, but tmp and the empty/full table tab are addressed through the hash function H on both the producer and consumer sides]
19
Inter-Stage Communication
Hash-based solution
• We did not want to add delay to the load/store operations
• Use H(k) = k MOD m
• When m is a power of two (here, a multiple of 2*N), H(k) can be implemented by just using the least-significant log2(m) bits of k to address the buffer (translates to simple interconnections)
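This choice of m can be checked quickly in software: when m is a power of two, k MOD m equals a bitwise AND with m-1, which in hardware reduces to simply wiring up the low address bits:

```c
/* H(k) = k MOD m for m a power of two: keep the least-significant
   log2(m) bits of k. In hardware this needs no arithmetic at all,
   only a selection of the low address wires. */
unsigned hash_mod_pow2(unsigned k, unsigned m) {
    return k & (m - 1u);  /* equals k % m when m is a power of two */
}
```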
[Figure: two empty/full tables addressed through the hash function H; in each, the cells holding A[1] and A[5] are marked full]
20
Inter-Stage Communication
Hash-based solution: H(k) = k MOD m
Single read (L=1): R = 1; after the read, R = 0
[Figure: buffer of M entries, each holding a tag T, an N-bit data word, and the R bit; (a) write: data_in/address_in through H; (b) read: address_out/data_out through H, producing hit/miss; (c) empty/full update]
21
Inter-Stage Communication
Hash-based solution: H(k) = k MOD m
Multiple reads (L>1): R = 11...1 (L bits); R is shifted on each read
[Figure: same hardware organization as the single-read case, with an L-bit R field per entry; (a) write, (b) read with hit/miss, (c) empty/full update]
22
Buffer size calculation
By monitoring the behavior of the communication component
• for each read and write, determine the size of the buffer needed to avoid collisions
• done during RTL simulation
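The monitoring can be modeled offline by replaying the trace of writes and reads and tracking the peak number of produced-but-not-yet-consumed elements. A sketch (the event encoding is an assumption for illustration; in the flow this is measured by a monitor during RTL simulation):

```c
/* Replay a trace of buffer events: +1 for each producer write,
   -1 for each consumer read that frees a slot. The peak occupancy
   observed is the minimum buffer size that avoids collisions for
   this trace. */
int min_buffer_size(const int *events, int n) {
    int occupancy = 0, peak = 0;
    for (int i = 0; i < n; i++) {
        occupancy += events[i];
        if (occupancy > peak)
            peak = occupancy;
    }
    return peak;
}
```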
23
Experimental Setup
Compilation flow
• Uses our previous work on compiling algorithms in a Java subset to FPGAs
[Figure: Java code with directives → front-end (includes compilation to JVM bytecodes) → Nau → control units (XML), datapath units (XML), RTG (XML) → XSL transformers; supported by a library of FUs, FU models (Java and HDL), and estimators; logic synthesis and place-and-route (vendor-specific) → specific reconfigurable hardware (FPGA)]
24
Experimental Setup
Simulation back-end
[Figure: datapath.xml, fsm.xml, and rtg.xml transformed by XSLTs (to dotty, to hds, to java, to vhdl) into datapath.hds, fsm.java, and rtg.java; the Java files are compiled to fsm.class and rtg.class and simulated in HADES with a library of operators (Java) and I/O data (RAMs and stimulus), driven by an ANT build file]
25
Experimental Results
Benchmarks:
• fdct (2 stages {s1,s2}, 3 loops): Fast DCT (Discrete Cosine Transform)
• fwt2D (4 stages {s1,s2,s3,s4}, 8 loops): Forward Haar Wavelet
• RGB2gray + histogram (2 stages {s1,s2}, 2 loops): transforms an RGB image to a gray image with 256 levels and determines the histogram of the gray image
• Smooth + sobel, 3 versions (a)(b)(c) (2 stages {s1,s2}, 6 loops): smooth image operation based on 3×3 windows, the resulting image being the input to the Sobel edge detector. (a): original code; (b): two innermost loops of the smooth algorithm fully unrolled (scalar replacement of the array with coefficients); (c): same as (b) plus elimination of redundant array references in the original code of sobel
26
Experimental Results
FDCT (speed-up achieved by Pipelining Sequences of Loops)
[Chart: speed-up (y-axis, 1.00 to 2.00) versus number of 8×8 blocks (x-axis, 1 to 1024)]
27
Experimental Results

Algorithm          | Input size | #cc w/o PSL per stage group                              | Speed-up upper bound | #cc w/ PSL | Speed-up
fdct               | 800×600    | (s1,s2): 3,930,005; (s1): 1,950,003; (s2): 1,920,003    | 2.02                 | 1,830,215  | 2.02
fwt2D              | 512×512    | (s1,s2,s3,s4): 4,724,745; (s1,s2): 2,362,373; (s3,s4): 2,362,373 | 2.00        | 3,664,917  | 1.29
RGB2gray+histogram | 800×600    | (s1,s2): 6,720,025; (s1): 2,880,015; (s2): 3,840,015    | 1.75                 | 3,840,007  | 1.75
Smooth+sobel (a)   | 800×600    | (s1,s2): 49,634,009; (s1): 32,929,473; (s2): 16,606,951 | 1.51                 | 32,929,489 | 1.51
Smooth+sobel (b)   | 800×600    | (s1,s2): 30,068,645; (s1): 13,364,109; (s2): 16,606,951 | 1.81                 | 16,640,509 | 1.81
Smooth+sobel (c)   | 800×600    | (s1,s2): 25,773,809; (s1): 13,364,109; (s2): 11,862,791 | 1.92                 | 13,364,117 | 1.92

(#cc = number of clock cycles; PSL = Pipelining Sequences of Loops)
28
Experimental Results
What happens to buffer sizes?
[Chart, log scale 1 to 1,000,000: for smooth + sobel (a), RGB2gray + histogram (a), fwt2D, and fdct, comparing table size (no hash function), buffer size used (simple hash function), and buffer minimum size (perfect hash)]
29
Experimental Results
Adjust the latency of tasks in order to balance pipeline stages:
• slow down the faster tasks (those with lower latency)
• optimize the slower tasks in order to reduce their latency
Slowing down producer tasks usually reduces the size of the inter-stage buffers
30
Experimental Results
Buffer sizes
[Chart, log scale 1 to 1,000,000: table size (no hash function), buffer size used (simple hash function), and buffer minimum size (perfect hash) for smooth + sobel (a)(b)(c) and RGB2gray + histogram (a)(b)(c); variants compare the original code, +1 cycle per iteration of the producer, +2 cycles per iteration of the producer, optimizations in the producer, and optimizations in the consumer]
31
Experimental Results
Buffer sizes
[Chart: per benchmark (smooth + sobel (a)(b)(c), RGB2gray + histogram (a)(b)(c), fwt2D, fdct), overhead related to the optimal size and reduction related to the original table size; percentages range from 8.4% to 56.3%, buffer sizes from 4 to 240,000 elements]
32
Experimental Results
[Chart: Resources and Frequency (Spartan-3 400): #FFs, #4-LUTs, #Slices, and normalized frequency for fdct, smooth+sobel, RGB2gray+histogram, and fwt2D, each in base, -hash, and -table variants; normalized frequencies range from 0.96 to 1.14]
33
Related Work
Previous approach (Ziegler et al.)
• Coarse-grained communication and synchronization scheme
• FIFOs are used to communicate data between pipelining stages
• The width of the FIFO stages depends on the producer/consumer ordering
• Less generally applicable
[Figure: producer/consumer orderings and the FIFO widths they require, from in-order element-by-element streams to reordered streams that need wider FIFO stages]
34
Conclusions
We presented a scheme to accelerate applications by pipelining sequences of loops
• I.e., before the end of a stage (a set of nested loops), a subsequent stage (another set of nested loops) can start executing based on the data already produced
A data-driven scheme is used, based on empty/full tables
• A scheme to reduce the size of the memory buffers for inter-stage pipelining (using a simple hash function)
Depending on the consumer/producer ordering, speedups close to the theoretical ones are achieved
• as if the stages were concurrently and independently executed
35
Future Work
Research other hash functions
Study slowdown effects
Apply the technique in the context of multi-core systems
[Figure: processor core A and processor core B communicating through memories and the hash-based inter-stage buffer]
36
Acknowledgments Work partially funded by
• CHIADO - Compilation of High-Level Computationally Intensive Algorithms to Dynamically Reconfigurable COmputing Systems
• Portuguese Foundation for Science and Technology (FCT), POSI and FEDER, POSI/CHS/48018/2002
Based on the work done by Rui Rodrigues
In collaboration with Pedro C. Diniz
37
technology from seed
A Data-Driven Approach for Pipelining
Sequences of Data-Dependent Loops
38
Buffer Monitor
FDCT
[Chart: buffer occupancy (elements, 0 to 60) versus clock cycles (0 to 300); series: buffer size, store, load(hit), load(miss)]
39
Buffer Monitor
fwt2D
[Chart: buffer occupancy (elements, 0 to 1.2) versus clock cycles (0 to 100); series: buffer size, load(miss), load(hit), store]
40
Buffer Monitor
RGB2gray + histogram
[Chart: buffer occupancy (elements, 0 to 12) versus clock cycles (0 to 360); series: buffer size, store, load(miss), load(hit)]
41
Buffer Monitor
RGB2gray + histogram (modified)
[Chart: buffer occupancy (elements, 0 to 6) versus clock cycles (0 to 378); series: buffer size, store, load(miss), load(hit)]
42
Buffer Monitor
Smooth + Sobel (a)
[Chart: buffer occupancy (elements, 0 to 30) versus clock cycles (0 to about 2373); series: buffer size, store, load(miss), load(hit)]
43
Buffer Monitor
Smooth + Sobel (a)
[Chart: buffer occupancy (elements, 0 to 14) versus clock cycles (0 to about 2394); series: buffer size, store, load(miss), load(hit)]