A Data-Driven Approach for Pipelining Sequences
of Data-Dependent LOOPs
João M. P. Cardoso
ITIV, University of Karlsruhe, July 2, 2007
Portugal
2
Motivation
Many applications have sequences of tasks
• E.g., in image and video processing algorithms
Contemporary FPGAs
• Plenty of room to accommodate highly specialized, complex architectures
• It is time to creatively “use available resources” rather than simply “save resources”
3
Motivation
Computing Stages
• Sequentially
[Figure: Task A → Task B → Task C, executed one after another along the time axis]
4
Motivation
Computing Stages
• Concurrently
[Figure: Task A, Task B, and Task C overlapping along the time axis]
5
Outline
Objective
Loop Pipelining
Producer/Consumer Computing Stages
Pipelining Sequences of Loops
Inter-Stage Communication
Experimental Setup and Results
Related Work
Conclusions
Future Work
6
Objectives
To speed up applications with multiple, data-dependent stages
• each stage seen as a set of nested loops
How?
• By pipelining those sequences of data-dependent stages using fine-grain synchronization schemes
• Taking advantage of field-custom computing structures (FPGAs)
7
Loop Pipelining
Attempt to overlap loop iterations
• Significant speedups are achieved
• But how to pipeline sequences of loops?
[Figure: iterations I1–I4 executed sequentially vs. overlapped in time]
8
Computing Stages
Sequentially
Producer: ...A[2]A[1]A[0]
Consumer: A[0]A[1]A[2]...
9
Computing Stages
Concurrently
• Ordered producer/consumer pairs
• Send/receive
Producer: ...A[2]A[1]A[0]
Consumer: A[0]A[1]A[2]...
[Figure: FIFO with N stages holding A[0], A[1], A[2], A[3], ... between producer and consumer]
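For ordered producer/consumer pairs, the FIFO above can be modeled in software as a simple ring buffer. A minimal sketch (the names and the stall-by-returning-false convention are illustrative, not the hardware implementation):

```c
#include <stdbool.h>

#define FIFO_N 4  /* number of FIFO stages (illustrative) */

typedef struct {
    int buf[FIFO_N];
    int head, tail, count;
} fifo_t;

/* Producer side: returns false when the FIFO is full (producer must stall). */
bool fifo_send(fifo_t *f, int v) {
    if (f->count == FIFO_N) return false;
    f->buf[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_N;
    f->count++;
    return true;
}

/* Consumer side: returns false when the FIFO is empty (consumer must stall). */
bool fifo_recv(fifo_t *f, int *v) {
    if (f->count == 0) return false;
    *v = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_N;
    f->count--;
    return true;
}
```

Note that a FIFO only works when the consumer reads elements in exactly the order the producer wrote them, which is the restriction the empty/full table on the next slide removes.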
10
Computing Stages
Concurrently
• Unordered producer/consumer pairs
• Empty/full table
Producer: ...A[3]A[5]A[1]
Consumer: A[3]A[1]A[5]...
[Figure: empty/full table, one full bit per data cell; e.g., the cells holding A[1] and A[5] are full (bit = 1), the remaining cells are empty (bit = 0)]
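The empty/full table pairs each data cell with a single full bit: the producer sets the bit when it stores, and a consumer read only succeeds once the bit is set. A minimal software model of this behavior (illustrative names; the hardware realizes it as a dual-port 1-bit table):

```c
#include <stdbool.h>

#define TABLE_SIZE 8  /* illustrative */

typedef struct {
    int  data[TABLE_SIZE];
    bool full[TABLE_SIZE];   /* the 1-bit empty/full table */
} ef_table_t;

/* Producer: store a value, then mark the cell full. */
void ef_write(ef_table_t *t, int idx, int v) {
    t->data[idx] = v;
    t->full[idx] = true;
}

/* Consumer: succeeds only if the cell has been produced;
   otherwise the consumer stage must retry (stall). */
bool ef_read(const ef_table_t *t, int idx, int *v) {
    if (!t->full[idx]) return false;
    *v = t->data[idx];
    return true;
}
```

Because each cell is flagged individually, the consumer may read in any order, which is what permits out-of-order producer/consumer pairs.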
11
Main Idea
FDCT
[Figure: a global FSM sequences Loops 1, 2 (consuming the data input and producing intermediate data) and Loop 3 (producing the data output); the intermediate data array is an 8×8 block with column indices 0 to 7 and row offsets 0, 8, 16, 24, 32, 40, 48, 56]
Timeline: the execution of Loops 1, 2 completes before the execution of Loop 3 starts.
12
Main Idea
FDCT
• Out-of-order producer/consumer pairs
• How to overlap computing stages?
[Figure: two views of the 8×8 intermediate data array (row offsets 0, 8, 16, 24, 32, 40, 48, 56), one as written by Loops 1, 2 and one as read by Loop 3]
13
Main Idea
Pipelined FDCT
[Figure: FSM 1 controls Loops 1, 2 (fed by the data input); FSM 2 controls Loop 3 (producing the data output); the stages communicate through the intermediate data array in a dual-port RAM plus a dual-port 1-bit empty/full table]
Timeline: the execution of Loop 3 overlaps the execution of Loops 1, 2.
14
Main Idea
[Figure: Task A and Task B connected through memories; each inter-task communication goes through a memory]
15
Possible Scenarios
Single write, single read
• Accepted without code changes
Single write, multiple reads
• Accepted without code changes (by using an N-bit table)
Multiple writes, single read
• Needs code transformations
Multiple writes, multiple reads
• Needs code transformations
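For the single-write/multiple-reads case, the 1-bit table generalizes to an N-bit table. One plausible realization, sketched below, keeps one bit per expected read and clears a bit on each read; the bit-per-reader layout is an assumption for illustration (the slides only state that an N-bit table is used):

```c
#include <stdbool.h>

#define CELLS 8   /* illustrative table size */
#define L_READS 3 /* number of reads expected per cell (assumed) */

typedef struct {
    int      data[CELLS];
    unsigned pending[CELLS];  /* L_READS-bit mask: set bits = reads still allowed */
} nbit_table_t;

/* Producer (single write): store the value and arm all read bits. */
void nwrite(nbit_table_t *t, int idx, int v) {
    t->data[idx] = v;
    t->pending[idx] = (1u << L_READS) - 1u;  /* 11...1 (L_READS bits) */
}

/* Consumer `reader` (0..L_READS-1): succeeds once the value is available
   and this reader has not consumed it yet. */
bool nread(nbit_table_t *t, int idx, int reader, int *v) {
    unsigned bit = 1u << reader;
    if (!(t->pending[idx] & bit)) return false;  /* not produced, or already read */
    t->pending[idx] &= ~bit;
    *v = t->data[idx];
    return true;
}
```

When the mask reaches zero the cell is logically empty again, so the same storage can be reused without extra bookkeeping.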
16
Inter-Stage Communication
Responsible for:
• Communicating data between pipelined stages
• Flagging data availability
Solutions:
• Perfectly associative memory: cost too high
• Memory for data plus a 1-bit table (each cell holds the full/empty information)
  • Sized to the data set to communicate
  • Size can be decreased using a hash-based solution
[Figure: empty/full table, as on the earlier slide: a 1-bit column flags the full cells (e.g., A[1], A[5]) next to the data column]
17
Consumer (Loop 3):

    i_1 = 0;
    for (i = 0; i < N*num_fdcts; i++) {   // Loop 3
    L1: f0 = tmp[i_1];   if (!tab[i_1])   goto L1;
    L2: f1 = tmp[1+i_1]; if (!tab[1+i_1]) goto L2;
        // remaining loads
        // computations ...
        // stores
        i_1 += 8;
    }

Producer (Loops 1, 2):

    ...
    boolean tab[SIZE] = {0, 0, ..., 0};
    ...
    for (i = 0; i < num_fdcts; i++) {     // Loop 1
        for (j = 0; j < N; j++) {         // Loop 2
            // loads
            // computations
            // stores
            tmp[48+i_1] = F6 >> 13; tab[48+i_1] = true;
            tmp[56+i_1] = F7 >> 13; tab[56+i_1] = true;
            i_1++;
        }
        i_1 += 56;
    }
Inter-Stage Communication
Memory plus 1-bit table
[Figure: img → Loops 1, 2 (FSM 1) → dual-port memory tmp plus dual-port 1-bit table tab → Loop 3 (FSM 2) → dct_o; data and address connections shown]
18
Consumer (Loop 3):

    i_1 = 0;
    for (i = 0; i < N*num_fdcts; i++) {   // Loop 3
    L1: f0 = tmp[H(i_1)];   if (!tab[H(i_1)])   goto L1;
    L2: f1 = tmp[H(1+i_1)]; if (!tab[H(1+i_1)]) goto L2;
        // remaining loads
        // computations ...
        // stores
        i_1 += 8;
    }

Producer (Loops 1, 2):

    ...
    boolean tab[SIZE] = {0, 0, ..., 0};
    ...
    for (i = 0; i < num_fdcts; i++) {     // Loop 1
        for (j = 0; j < N; j++) {         // Loop 2
            // loads
            // computations
            // stores
            tmp[H(48+i_1)] = F6 >> 13; tab[H(48+i_1)] = true;
            tmp[H(56+i_1)] = F7 >> 13; tab[H(56+i_1)] = true;
            i_1++;
        }
        i_1 += 56;
    }
Inter-Stage Communication
Hash-based solution:
[Figure: as before, but tmp and the empty/full table tab are addressed through the hash function H on both the producer and consumer sides]
19
Inter-Stage Communication
Hash-based solution
• We did not want to add delay to the load/store operations
• Use H(k) = k MOD m
• When m is a power of two (here, a multiple of 2*N), H(k) can be implemented by just using the least-significant log2(m) bits of k to address the buffer (translates to simple interconnections)
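This choice of m can be checked quickly in software: when m is a power of two, k MOD m equals a bitwise AND with m-1, which in hardware reduces to simply wiring up the low address bits:

```c
/* H(k) = k MOD m for m a power of two: keep the least-significant
   log2(m) bits of k. In hardware this needs no arithmetic at all,
   only a selection of the low address wires. */
unsigned hash_mod_pow2(unsigned k, unsigned m) {
    return k & (m - 1u);  /* equals k % m when m is a power of two */
}
```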
[Figure: two empty/full tables addressed through the hash function H; in each, the cells holding A[1] and A[5] are marked full]
20
Inter-Stage Communication
Hash-based solution: H(k) = k MOD m
Single read (L=1): R = 1; after the read, R = 0
[Figure: buffer of M entries, each holding a tag T, an N-bit data word, and the R bit; (a) write: data_in/address_in through H; (b) read: address_out/data_out through H, producing hit/miss; (c) empty/full update]
21
Inter-Stage Communication
Hash-based solution: H(k) = k MOD m
Multiple reads (L>1): R = 11...1 (L bits); R is shifted on each read
[Figure: same hardware organization as the single-read case, with an L-bit R field per entry; (a) write, (b) read with hit/miss, (c) empty/full update]
22
Buffer size calculation
By monitoring the behavior of the communication component
• for each read and write, determine the size of the buffer needed to avoid collisions
• done during RTL simulation
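The monitoring can be modeled offline by replaying the trace of writes and reads and tracking the peak number of produced-but-not-yet-consumed elements. A sketch (the event encoding is an assumption for illustration; in the flow this is measured by a monitor during RTL simulation):

```c
/* Replay a trace of buffer events: +1 for each producer write,
   -1 for each consumer read that frees a slot. The peak occupancy
   observed is the minimum buffer size that avoids collisions for
   this trace. */
int min_buffer_size(const int *events, int n) {
    int occupancy = 0, peak = 0;
    for (int i = 0; i < n; i++) {
        occupancy += events[i];
        if (occupancy > peak)
            peak = occupancy;
    }
    return peak;
}
```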
23
Experimental Setup
Compilation flow
• Uses our previous work on compiling algorithms in a Java subset to FPGAs
[Figure: Java code with directives → front-end (includes compilation to JVM bytecodes) → Nau → control units (XML), datapath units (XML), RTG (XML) → XSL transformers; supported by a library of FUs, FU models (Java and HDL), and estimators; logic synthesis and place-and-route (vendor-specific) → specific reconfigurable hardware (FPGA)]
24
Experimental Setup
Simulation back-end
[Figure: datapath.xml, fsm.xml, and rtg.xml transformed by XSLTs (to dotty, to hds, to java, to vhdl) into datapath.hds, fsm.java, and rtg.java; the Java files are compiled to fsm.class and rtg.class and simulated in HADES with a library of operators (Java) and I/O data (RAMs and stimulus), driven by an ANT build file]
25
Experimental Results
Benchmarks:
• fdct (2 stages {s1,s2}, 3 loops): Fast DCT (Discrete Cosine Transform)
• fwt2D (4 stages {s1,s2,s3,s4}, 8 loops): Forward Haar Wavelet
• RGB2gray + histogram (2 stages {s1,s2}, 2 loops): transforms an RGB image to a gray image with 256 levels and determines the histogram of the gray image
• Smooth + sobel, 3 versions (a)(b)(c) (2 stages {s1,s2}, 6 loops): smooth image operation based on 3×3 windows, the resulting image being the input to the Sobel edge detector. (a): original code; (b): two innermost loops of the smooth algorithm fully unrolled (scalar replacement of the array with coefficients); (c): same as (b) plus elimination of redundant array references in the original code of sobel
26
Experimental Results
FDCT (speed-up achieved by Pipelining Sequences of Loops)
[Chart: speed-up (y-axis, 1.00 to 2.00) versus number of 8×8 blocks (x-axis, 1 to 1024)]
27
Experimental Results

Algorithm          | Input size | #cc w/o PSL per stage group                              | Speed-up upper bound | #cc w/ PSL | Speed-up
fdct               | 800×600    | (s1,s2): 3,930,005; (s1): 1,950,003; (s2): 1,920,003    | 2.02                 | 1,830,215  | 2.02
fwt2D              | 512×512    | (s1,s2,s3,s4): 4,724,745; (s1,s2): 2,362,373; (s3,s4): 2,362,373 | 2.00        | 3,664,917  | 1.29
RGB2gray+histogram | 800×600    | (s1,s2): 6,720,025; (s1): 2,880,015; (s2): 3,840,015    | 1.75                 | 3,840,007  | 1.75
Smooth+sobel (a)   | 800×600    | (s1,s2): 49,634,009; (s1): 32,929,473; (s2): 16,606,951 | 1.51                 | 32,929,489 | 1.51
Smooth+sobel (b)   | 800×600    | (s1,s2): 30,068,645; (s1): 13,364,109; (s2): 16,606,951 | 1.81                 | 16,640,509 | 1.81
Smooth+sobel (c)   | 800×600    | (s1,s2): 25,773,809; (s1): 13,364,109; (s2): 11,862,791 | 1.92                 | 13,364,117 | 1.92

(#cc = number of clock cycles; PSL = Pipelining Sequences of Loops)
28
Experimental Results
What happens to buffer sizes?
[Chart, log scale 1 to 1,000,000: for smooth + sobel (a), RGB2gray + histogram (a), fwt2D, and fdct, comparing table size (no hash function), buffer size used (simple hash function), and buffer minimum size (perfect hash)]
29
Experimental Results
Adjust the latency of tasks in order to balance pipeline stages:
• slow down the faster tasks (those with lower latency)
• optimize the slower tasks in order to reduce their latency
Slowing down producer tasks usually reduces the size of the inter-stage buffers
30
Experimental Results
Buffer sizes
[Chart, log scale 1 to 1,000,000: table size (no hash function), buffer size used (simple hash function), and buffer minimum size (perfect hash) for smooth + sobel (a)(b)(c) and RGB2gray + histogram (a)(b)(c); variants compare the original code, +1 cycle per iteration of the producer, +2 cycles per iteration of the producer, optimizations in the producer, and optimizations in the consumer]
31
Experimental Results
Buffer sizes
[Chart: per benchmark (smooth + sobel (a)(b)(c), RGB2gray + histogram (a)(b)(c), fwt2D, fdct), overhead related to the optimal size and reduction related to the original table size; percentages range from 8.4% to 56.3%, buffer sizes from 4 to 240,000 elements]
32
Experimental Results
[Chart: Resources and Frequency (Spartan-3 400): #FFs, #4-LUTs, #Slices, and normalized frequency for fdct, smooth+sobel, RGB2gray+histogram, and fwt2D, each in base, -hash, and -table variants; normalized frequencies range from 0.96 to 1.14]
33
Related Work
Previous approach (Ziegler et al.)
• Coarse-grained communication and synchronization scheme
• FIFOs are used to communicate data between pipelining stages
• The width of the FIFO stages depends on the producer/consumer ordering
• Less generally applicable
[Figure: producer/consumer orderings and the FIFO widths they require, from in-order element-by-element streams to reordered streams that need wider FIFO stages]
34
Conclusions
We presented a scheme to accelerate applications by pipelining sequences of loops
• I.e., before the end of a stage (a set of nested loops), a subsequent stage (another set of nested loops) can start executing based on the data already produced
A data-driven scheme is used, based on empty/full tables
• A scheme to reduce the size of the memory buffers for inter-stage pipelining (using a simple hash function)
Depending on the consumer/producer ordering, speedups close to the theoretical ones are achieved
• as if the stages were concurrently and independently executed
35
Future Work
Research other hash functions
Study slowdown effects
Apply the technique in the context of multi-core systems
[Figure: processor core A and processor core B communicating through memories and the hash-based inter-stage buffer]
36
Acknowledgments Work partially funded by
• CHIADO - Compilation of High-Level Computationally Intensive Algorithms to Dynamically Reconfigurable COmputing Systems
• Portuguese Foundation for Science and Technology (FCT), POSI and FEDER, POSI/CHS/48018/2002
Based on the work done by Rui Rodrigues
In collaboration with Pedro C. Diniz
37
technology from seed
A Data-Driven Approach for Pipelining
Sequences of Data-Dependent Loops
38
Buffer Monitor
FDCT
[Chart: buffer occupancy (elements, 0 to 60) versus clock cycles (0 to 300); series: buffer size, store, load(hit), load(miss)]
39
Buffer Monitor
fwt2D
[Chart: buffer occupancy (elements, 0 to 1.2) versus clock cycles (0 to 100); series: buffer size, load(miss), load(hit), store]
40
Buffer Monitor
RGB2gray + histogram
[Chart: buffer occupancy (elements, 0 to 12) versus clock cycles (0 to 360); series: buffer size, store, load(miss), load(hit)]
41
Buffer Monitor
RGB2gray + histogram (modified)
[Chart: buffer occupancy (elements, 0 to 6) versus clock cycles (0 to 378); series: buffer size, store, load(miss), load(hit)]
42
Buffer Monitor
Smooth + Sobel (a)
[Chart: buffer occupancy (elements, 0 to 30) versus clock cycles (0 to about 2373); series: buffer size, store, load(miss), load(hit)]
43
Buffer Monitor
Smooth + Sobel (a)
[Chart: buffer occupancy (elements, 0 to 14) versus clock cycles (0 to about 2394); series: buffer size, store, load(miss), load(hit)]