Algorithmic Transformations

ECE734 VLSI Arrays for Digital Signal Processing Algorithmic Transformations


Page 1: Algorithmic Transformations

ECE734 VLSI Arrays for Digital Signal Processing

Algorithmic Transformations

Page 2: Algorithmic Transformations

(C)2002-2004 Yu Hen Hu 2 ECE734 VLSI Arrays for Digital Signal Processing

Goals

• The goal: Get the DSP algorithm in an amenable form before heading off to synthesize the design on the selected platform (FPGA or PDSP)

• No changes to the actual algorithms, just changes to the way the algorithms are prepared for implementation.

• This will require understanding aspects of
  – timing,
  – pipelining,
  – parallelism.

Page 3: Algorithmic Transformations


Overview

• Algorithm Representations and Iteration Bound
• Parallelism and Pipelining
• Retiming
• Unfolding
• Folding


Page 7: Algorithmic Transformations


Data Flow Graph

• Node:
  – Computation
  – Associated with a computing time.

• Directed edge:
  – Data path and delay

• Delay: iteration count

• Example

y(n) = a*y(n-1) + b*u(n)

• The delay of 1 t.u. indicates that computing y(n+1) in the next iteration depends on the result y(n) of the present iteration.

• Delays are labeled on edges with D or a positive integer.
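
The role of the delay edge in the example can be illustrated with a minimal Python sketch. The function name `run_first_order`, the zero initial register value and the sample coefficients are assumptions made for illustration only; they do not come from the slides.

```python
def run_first_order(a, b, u, y_init=0.0):
    """Iterate y(n) = a*y(n-1) + b*u(n).

    y_init is the assumed initial content of the delay element D.
    """
    delay = y_init              # register on the feedback edge, holds y(n-1)
    out = []
    for sample in u:            # one pass of the loop body = one iteration of the DFG
        y = a * delay + b * sample   # multiplier and adder nodes, same iteration
        out.append(y)
        delay = y               # the result crosses the delay edge to the next iteration
    return out


y = run_first_order(a=0.5, b=1.0, u=[1.0, 0.0, 0.0, 0.0])
print(y)
```

Each value written into `delay` is consumed only in the following loop pass, which is the inter-iteration dependency that the delay D represents.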

Page 8: Algorithmic Transformations


DFG

• Intra-iteration dependency
  – A directed edge without any delay

• Inter-iteration dependency
  – A directed edge with 1 or more delays

• Node computing delay is shown in parentheses.

• Critical path: the longest path between registers, i.e. the longest path that contains no delay element.

• Example: critical path delay = 4+2+2 = 8 t.u.

• Recursive DFG: contains loops. There must be at least one delay element along every loop; otherwise the algorithm is NON-computable!

[Figure: DFG with input x(n), output y(n), two delays D, multipliers M0, M1, M2 (4 t.u. each) and adders A0, A1 (2 t.u. each).]
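
The critical-path figure of 8 t.u. can be reproduced with a short sketch. The edge list below is an assumption: the figure did not survive extraction, so the wiring is taken to be a direct-form filter in which M0 and M1 feed A0, and A0 and M2 feed A1. Only edges without delays are listed, because the critical path may not cross a register.

```python
from functools import lru_cache

# Node computing times from the slide (t.u.).
node_time = {"M0": 4, "M1": 4, "M2": 4, "A0": 2, "A1": 2}

# Assumed zero-delay edges (hypothetical wiring, see lead-in).
zero_delay_edges = {
    "M0": ["A0"],
    "M1": ["A0"],
    "M2": ["A1"],
    "A0": ["A1"],
    "A1": [],
}


@lru_cache(maxsize=None)
def longest_from(node):
    """Longest computing time of any delay-free path that starts at node."""
    tails = [longest_from(nxt) for nxt in zero_delay_edges[node]]
    return node_time[node] + max(tails, default=0)


critical_path = max(longest_from(n) for n in node_time)
print(critical_path)
```

Under this assumed wiring the longest delay-free path is M0, A0, A1, which gives 4+2+2.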

Page 9: Algorithmic Transformations


Loop bound and Iteration bound

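• Loop bound of a loop: $T_{\text{loop}} = t_{\text{loop}} / d_{\text{loop}}$, where $t_{\text{loop}}$ is the total computing time of the nodes in the loop and $d_{\text{loop}}$ is the number of delays in the loop.

• Iteration bound: $T_\infty = \max_{i \in \text{all loops}} T_{\text{loop}_i}$

• Single loop A-B-A: T{A-B-A} = (2+4)/2 = 3 t.u.

• Two loops: $T_\infty$ = max{(2+4)/2, (2+4+5)/1} = max{3, 11} = 11 t.u.

[Figure: two DFGs. The first has nodes A (2) and B (4) in a loop with 2D. The second has nodes A (2), B (4) and C (5) forming two loops, one with 2D and one with D.]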
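
A small sketch of the iteration-bound calculation follows. The function name `iteration_bound` and the way loops are passed in (as explicit node lists with a delay count) are assumptions for illustration; a full tool would enumerate the loops of the DFG itself.

```python
from fractions import Fraction


def iteration_bound(node_time, loops):
    """Return the maximum loop bound.

    loops is a list of (nodes_in_loop, number_of_delays) pairs.
    """
    bounds = []
    for nodes, delays in loops:
        if delays == 0:
            raise ValueError("delay-free loop: the algorithm is not computable")
        loop_time = sum(node_time[n] for n in nodes)
        bounds.append(Fraction(loop_time, delays))
    return max(bounds)


t = {"A": 2, "B": 4, "C": 5}
single_loop = iteration_bound(t, [(("A", "B"), 2)])
two_loops = iteration_bound(t, [(("A", "B"), 2), (("A", "B", "C"), 1)])
print(single_loop, two_loops)
```

`Fraction` keeps bounds such as 4/3 exact, which matters when loop bounds are compared.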


Page 12: Algorithmic Transformations


Solution

• To achieve high-speed, the length of the critical path can be reduced by pipelining and parallel processing

Page 13: Algorithmic Transformations


Overview

• Algorithm Representations and Iteration Bound
• Parallelism and Pipelining
• Retiming
• Unfolding
• Folding

Page 14: Algorithmic Transformations


Basic Ideas

• Parallel processing (rows are processors, columns are time steps)
    P1: a1 a2 a3 a4
    P2: b1 b2 b3 b4
    P3: c1 c2 c3 c4
    P4: d1 d2 d3 d4
  – Less inter-processor communication
  – Complicated processor hardware

• Pipelined processing (rows are processors, columns are time steps)
    P1: a1 b1 c1 d1
    P2: a2 b2 c2 d2
    P3: a3 b3 c3 d3
    P4: a4 b4 c4 d4
  – More inter-processor communication
  – Simpler processor hardware

• Colors (in the original figure): different types of operations performed

• a, b, c, d: different data streams processed

Page 15: Algorithmic Transformations


Data Dependence

• Parallel processing requires NO data dependence between processors

• Pipelined processing will involve inter-processor communication

[Figure: processor (P1 to P4) versus time diagrams for parallel processing and for pipelined processing.]

Page 16: Algorithmic Transformations


Usage of Pipelined Processing

• By inserting latches or registers between combinational logic circuits, the critical path can be shortened.

• Consequence:
  – reduced clock cycle time,
  – increased clock frequency.

• Suitable for DSP applications that have (infinitely) long data streams.

• Method to incorporate pipelining: cut-set retiming

• Cut set:
  – A cut set is a set of edges of a graph. If these edges are removed from the original graph, the remaining graph becomes two separate graphs.

• Retiming:
  – The timing of an algorithm is re-adjusted while keeping the partial ordering of execution unchanged, so that the results remain correct.

Page 17: Algorithmic Transformations


Pipelining

Page 18: Algorithmic Transformations


Pipelining of FIR filters

Page 19: Algorithmic Transformations


Pipelining

Page 20: Algorithmic Transformations


Fine-grain pipelining

To reduce the critical path further, the multiplier delay TM is split into two pipeline stages with delays TM1 and TM2.

Critical Path = Max {TM1, TM2, TA}

Page 21: Algorithmic Transformations


Graphic Transpose Theorem

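• The transfer function of a signal flow graph remains unchanged if
  – the direction of each arc is reversed, and
  – the input and output labels are switched.

[Figure: an FIR signal flow graph with input x[n], output y[n], two $z^{-1}$ delays and coefficients h[0], h[1], h[2], shown next to its transposed form with x[n] and y[n] exchanged and an internal signal u[n]; the slide asks whether the two are equal.]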

Page 22: Algorithmic Transformations


Data broadcast structure

• Algorithm transform may lead to pipelined structure without adding additional delays.

• Given a FIR filter SFG

• Critical path TM+2TA

• Use graph transposition theorem:
  – Reverse all arcs
  – Reverse input/output

• We obtain

• Critical path Max(TM, TA)

• No additional delay added!
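
The claim that transposition leaves the filter unchanged can be checked numerically. The sketch below is an illustration only: the function names `fir_direct` and `fir_broadcast`, the zero initial register contents and the test coefficients are assumptions, not material from the slides.

```python
def fir_direct(h, x):
    """Direct form: a delay line on the input, then multipliers and an adder chain."""
    taps = [0.0] * len(h)              # holds x(n), x(n-1), ...
    out = []
    for sample in x:
        taps = [sample] + taps[:-1]
        out.append(sum(hk * xk for hk, xk in zip(h, taps)))
    return out


def fir_broadcast(h, x):
    """Transposed (data-broadcast) form: x(n) feeds every multiplier at once,
    and the registers sit between the adders."""
    state = [0.0] * (len(h) - 1)       # registers between adders
    out = []
    for sample in x:
        products = [hk * sample for hk in h]
        out.append(products[0] + (state[0] if state else 0.0))
        # Each register stores its own product plus the old value of the next register.
        state = [
            products[k + 1] + (state[k + 1] if k + 1 < len(state) else 0.0)
            for k in range(len(state))
        ]
    return out


h = [1, 2, 3]
x = [1, 0, 0, 2, 5]
print(fir_direct(h, x), fir_broadcast(h, x))
```

In `fir_broadcast` every register update involves one multiplication and one addition, which is why a register-to-register path no longer contains the whole adder chain.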

Page 23: Algorithmic Transformations


Block Processing

• One form of vectorized parallel processing of DSP algorithms. (Not the parallel processing in most general sense)

• Block vector: [x(3k) x(3k+1) x(3k+2)]

• Clock cycle: can be 3 times longer

• Original (FIR filter):

$y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2)$

• Rewrite 3 equations at a time:

$y(3k) = a\,x(3k) + b\,x(3k-1) + c\,x(3k-2)$

$y(3k+1) = a\,x(3k+1) + b\,x(3k) + c\,x(3k-1)$

$y(3k+2) = a\,x(3k+2) + b\,x(3k+1) + c\,x(3k)$

• Define block vector:

$\mathbf{x}(k) = \begin{bmatrix} x(3k) \\ x(3k+1) \\ x(3k+2) \end{bmatrix}$

• Block formulation:

$\mathbf{y}(k) = \begin{bmatrix} a & 0 & 0 \\ b & a & 0 \\ c & b & a \end{bmatrix} \mathbf{x}(k) + \begin{bmatrix} 0 & c & b \\ 0 & 0 & c \\ 0 & 0 & 0 \end{bmatrix} \mathbf{x}(k-1)$
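
The block formulation can be checked against the sample-by-sample filter with a short sketch. It assumes zero initial conditions and an input length that is a multiple of 3; the function names and test values are hypothetical.

```python
def fir(a, b, c, x):
    """Sample-by-sample filter y(n) = a*x(n) + b*x(n-1) + c*x(n-2), zero initial state."""
    def get(n):
        return x[n] if n >= 0 else 0
    return [a * get(n) + b * get(n - 1) + c * get(n - 2) for n in range(len(x))]


def fir_block3(a, b, c, x):
    """Block form y(k) = A x(k) + B x(k-1) with blocks of three samples."""
    if len(x) % 3:
        raise ValueError("input length must be a multiple of 3")
    A = [[a, 0, 0], [b, a, 0], [c, b, a]]   # acts on the current block
    B = [[0, c, b], [0, 0, c], [0, 0, 0]]   # acts on the previous block
    prev = [0, 0, 0]
    y = []
    for k in range(len(x) // 3):
        cur = x[3 * k: 3 * k + 3]
        for row in range(3):
            y.append(sum(A[row][j] * cur[j] + B[row][j] * prev[j] for j in range(3)))
        prev = cur
    return y


x = [1, 2, 3, 4, 5, 6]
print(fir(2, 3, 5, x), fir_block3(2, 3, 5, x))
```

The three rows of one block have no dependence on each other, so they can be computed concurrently, which is what allows the clock cycle to be three times longer.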

Page 24: Algorithmic Transformations


Block Processing

Page 25: Algorithmic Transformations


General approach for block processing


Page 27: Algorithmic Transformations


Timing Comparison

• Pipelining

• Block processing

[Figure: timing charts over clock cycles 1 to 8 for a MAC unit, for pipelined Mul and Add units that take x(1) to x(8) and produce y(1) to y(8), and for block processing in which the odd-indexed samples x(1), x(3), x(5), x(7) and the even-indexed samples x(2), x(4), x(6), x(8) are handled in parallel.]

Page 28: Algorithmic Transformations


Overview

• Algorithm Representations and Iteration Bound
• Parallelism and Pipelining
• Retiming
• Unfolding
• Folding

Page 29: Algorithmic Transformations


Definitions

• Retiming: a mapping from a given DFG, G, to a retimed DFG, Gr, such that the corresponding transfer functions of G and Gr differ by a pure delay $z^{-L}$.

• Purposes
  – To facilitate pipelining to reduce clock cycle time
  – To reduce the number of registers needed.

Page 30: Algorithmic Transformations


Cut Set Retiming

Page 31: Algorithmic Transformations


Cut set delay transfer

Page 32: Algorithmic Transformations


Cut-set delay transfer failure

Page 33: Algorithmic Transformations


Cut-set Retiming

• Feed-forward cut-set: all edges cross the cut in the same direction.

• Feed-back cut-set: edges cross the cut in both directions.

• Delay transfer theorem
  – Adding an arbitrary non-negative number of delays (the same number on every edge) to the edges of a feed-forward cut-set of a DFG will not alter its output, except that the output timing will be delayed.
  – Transferring the same number of delays from the edges in one direction across a feed-back cut-set of a DFG to all edges in the opposite direction across the same cut-set will not alter the output, only its timing.

Page 34: Algorithmic Transformations


Feed-forward Cut-Set Retiming

• Consider the FIR digital filter and its DFG:

y(n) = b0 x(n) + b1 x(n-1)

• Critical path length = TM+TA

• Select a cut set

• Insert one delay on each edge in the cut set.

• Retiming:

ynew(n) = b0 x(n-1) + b1 x(n-2)

ynew(n) = y(n-1)

• Critical path = Max(TM, TA)

[Figure: DFG of the filter before and after retiming, with two multipliers (b0, b1), one adder and the delay D between x(n) and x(n-1); the retimed DFG has one added delay D on each edge of the cut set.]
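
A minimal sketch of the feed-forward cut-set retiming is given below. It assumes that the cut set consists of the two multiplier-to-adder edges, which is the choice that yields a critical path of Max(TM, TA); the function names, zero initial register values and test data are hypothetical.

```python
def fir2(b0, b1, x):
    """Original filter: y(n) = b0*x(n) + b1*x(n-1)."""
    x_prev = 0
    out = []
    for s in x:
        out.append(b0 * s + b1 * x_prev)   # multiply and add in the same clock: TM + TA
        x_prev = s
    return out


def fir2_pipelined(b0, b1, x):
    """Same filter with one register on each multiplier-to-adder edge (assumed cut set)."""
    x_prev = 0
    reg0 = reg1 = 0                        # pipeline registers, assumed to start at 0
    out = []
    for s in x:
        out.append(reg0 + reg1)            # the adder only sees last cycle's products
        reg0, reg1 = b0 * s, b1 * x_prev   # the multipliers refill the registers
        x_prev = s
    return out


x = [1, 4, 2, 7]
y = fir2(2, 3, x)
y_new = fir2_pipelined(2, 3, x)
print(y, y_new)
```

The pipelined output is the original output delayed by one sample, matching ynew(n) = y(n-1).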

Page 35: Algorithmic Transformations


Feed-back Cut Set Retiming

• Consider an IIR digital filter

y(n) = a·y(n-2) + x(n)

loop bound = (TM+TA)/2

clock cycle = TM+TA

• Shift 1 delay to the other edge across a feed-back cut set

• Filter remains unchanged.

loop bound = (TM+TA)/2

clock cycle = Max(TM ,TA)

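[Figure: the IIR filter DFG before retiming, with 2D in the feedback loop through the multiplier a, and after retiming, with one D on each of the two loop edges.]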
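
The statement that the filter remains unchanged can be checked with a register-level sketch. Placement of the registers in the code (one after the adder, one after the multiplier), zero initial values and the test input are assumptions for illustration.

```python
def iir_original(a, x):
    """y(n) = a*y(n-2) + x(n) with both delays (2D) on one edge of the loop."""
    d1 = d2 = 0                 # d1 holds y(n-1), d2 holds y(n-2)
    out = []
    for s in x:
        y = a * d2 + s          # multiply then add within one clock: TM + TA
        out.append(y)
        d1, d2 = y, d1
    return out


def iir_retimed(a, x):
    """Same loop after moving one delay across the feed-back cut set."""
    r_y = 0                     # D after the adder: holds y(n-1)
    r_p = 0                     # D after the multiplier: holds a*y(n-2)
    out = []
    for s in x:
        y = r_p + s             # adder only
        out.append(y)
        r_p, r_y = a * r_y, y   # multiplier only, working on y(n-1)
    return out


x = [1, 2, 0, 0, 1]
print(iir_original(3, x), iir_retimed(3, x))
```

Both versions keep two delays in the loop, so the loop bound is unchanged; only the longest register-to-register path becomes shorter.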

Page 36: Algorithmic Transformations


Feed-back Cut Set Retiming

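• Consider an IIR digital filter

y(n) = a·y(n-1) + x(n)

loop bound = TM+TA

throughput = 1/(TM+TA)

• Slow down by a factor of 2: the delay D becomes 2D and zeros are interleaved with the input. The slowed-down input x(m) equals the original x(k) at m = 2k-1 and equals 0 at m = 2k.

Clock period = TM+TA

Throughput = 1/[2(TM+TA)]

[Figure: the original loop with delay D, input x(n) and output y(n), and the slowed-down loop with delay 2D, input x(m) and output y(m).]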
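
The slow-down step can be illustrated with a short sketch. Indexing is an assumption: the slide counts samples from 1, while the code counts from 0, so the original samples land on even positions of the slowed-down sequence. Function names and test data are hypothetical.

```python
def iir1(a, x):
    """Original filter y(n) = a*y(n-1) + x(n), zero initial state."""
    prev = 0
    out = []
    for s in x:
        prev = a * prev + s
        out.append(prev)
    return out


def iir_2slow(a, x):
    """2-slow version: D replaced by 2D, a zero inserted after every input sample."""
    slow_in = []
    for s in x:
        slow_in += [s, 0]
    d1 = d2 = 0                 # the two registers that make up 2D
    slow_out = []
    for s in slow_in:
        y = a * d2 + s
        slow_out.append(y)
        d1, d2 = y, d1
    return slow_out


x = [1, 1, 1]
slow = iir_2slow(2, x)
print(iir1(2, x), slow)
```

Every second output of the slowed-down filter reproduces the original output and the samples in between are zero, which is why the useful throughput is halved.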

Page 37: Algorithmic Transformations


Time scaling

Page 38: Algorithmic Transformations


Slowing down the input rate

Page 39: Algorithmic Transformations


Loss of Efficiency

Page 40: Algorithmic Transformations


Slowdown + Retiming

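• Start with y(n) = a y(n-1) + x(n), slow down by a factor of 2, then retime:

clock cycle = Max(TM, TA)

throughput = 1/[2 Max(TM, TA)]

• Start with y(n) = a y(n-2) + x(n), then retime:

loop bound = (TM+TA)/2

clock cycle = Max(TM, TA)

throughput = 1/Max(TM, TA)

[Figure: the two retimed loops, each with one D on each of the two loop edges around the multiplier a; the first uses the slowed-down signals x(m), y(m), the second uses x(n), y(n).]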

Page 41: Algorithmic Transformations


Slow Down for Cut-Set Retiming

Page 42: Algorithmic Transformations


Example of retiming

• Node delay = 1 t.u.

• Before retiming:
  – Critical path: a3 → a4 → a5 → a6
  – Clock cycle time = 4
  – 2 delay units

• After cut-set retiming
  – Critical path: a3 → a5, a4 → a6
  – Clock cycle time = 2
  – 6 delay units

• After additional retiming
  – Critical path: none
  – Clock cycle time = 1
  – 11 delay units

[Figure: three DFGs on nodes a1 to a6, shown before retiming, after cut-set retiming and after additional retiming, with delays D and 2D marked on the edges.]

Page 43: Algorithmic Transformations


Node Retiming

• Transfer delay through a node in DFG:

• r(v) = # of delays transferred from out-going edges to incoming edges of node v

• w(e) = # of delays on edge e

• wr(e) = # of delays on edge e after retiming

• Retiming equation: for an edge e from node u to node v,

$w_r(e) = w(e) + r(v) - r(u)$, subject to $w_r(e) \ge 0$.

• Let p be a path from $v_0$ to $v_k$, $v_0 \xrightarrow{e_0} v_1 \xrightarrow{e_1} \cdots \xrightarrow{e_{k-1}} v_k$; then

$w_r(p) = \sum_{i=0}^{k-1} w_r(e_i) = \sum_{i=0}^{k-1} \left[ w(e_i) + r(v_{i+1}) - r(v_i) \right] = w(p) + r(v_k) - r(v_0)$

[Figure: a node v with edge delays 3D, D and 2D, shown before and after retiming with r(v) = 2.]

Page 44: Algorithmic Transformations


Invariant Properties

1. Retiming does NOT change the total number of delays for each cycle.

2. Retiming does not change loop bound or iteration bound of the DFG

3. If a constant integer j is added to the retiming value of every node v in a DFG G, the retimed graph Gr is not affected. That is, the weights (# of delays) of the retimed graph remain the same.
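
Properties 1 and 3 can be checked with a small sketch of the retiming equation wr(e) = w(e) + r(v) – r(u). The edge weights are those of the four-node example on the "Retiming Example Revisited" slide; the function names and the dictionary representation are assumptions for illustration.

```python
def retime(edges, r):
    """Apply w_r(e) = w(e) + r(v) - r(u) to every edge (u, v)."""
    retimed = {(u, v): w + r[v] - r[u] for (u, v), w in edges.items()}
    if any(w < 0 for w in retimed.values()):
        raise ValueError("infeasible retiming: an edge would get a negative delay count")
    return retimed


def loop_delays(edges, loop):
    """Total number of delays around a loop given as a node sequence, e.g. (1, 3, 2)."""
    return sum(edges[(loop[i], loop[(i + 1) % len(loop)])] for i in range(len(loop)))


# Edge (u, v) -> number of delays, from the four-node example.
edges = {(2, 1): 1, (1, 3): 1, (1, 4): 2, (3, 2): 0, (4, 2): 0}
r = {1: 0, 2: 1, 3: 0, 4: 0}

retimed = retime(edges, r)
shifted = retime(edges, {v: rv + 5 for v, rv in r.items()})   # add j = 5 to every r(v)
print(retimed)
```

Around a loop the r(v) terms cancel in pairs, so the per-loop delay count cannot change, and a common offset j cancels on every single edge.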

Page 45: Algorithmic Transformations


Node Retiming Examples

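r(2) = 1

First set of equations:

$y(n) = x(n) + w_1(n-1) + w_2(n-1)$

$w_1(n) = a\,y(n-1)$

$w_2(n) = b\,y(n-2)$

Second set of equations:

$y(n) = x(n) + w(n-1)$

$w(n) = a\,y(n-1) + b\,y(n-2)$

Substituting the w terms shows that both sets reduce to $y(n) = x(n) + a\,y(n-2) + b\,y(n-3)$.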

Page 46: Algorithmic Transformations


DFG Illustration of the Example

Before retiming: $T_\infty$ = max{(1+2+1)/2, (1+2+1)/3} = 2 t.u.; critical path delay = 2+1 = 3 t.u.

After retiming: $T_\infty$ = max{(1+2+1)/2, (1+2+1)/3} = 2 t.u.; critical path delay = max{2, 2, 1+1} = 2 t.u.

Page 47: Algorithmic Transformations


Retiming for Minimizing Clock Period

• Note that retiming will NOT alter the iteration bound $T_\infty$.

• The iteration bound is the theoretical minimum clock period to execute the algorithm.

• Let edge e connect node u to node v. If the node computing time $t(u) + t(v) > T_\infty$ and the edge carries no delay, then the clock period $T > T_\infty$. For such an edge, we require that

$w_r(e) \ge 1$

• To generalize, for any path p from $v_0$ to $v_k$, we have

$w_r(p) = w(p) + r(v_k) - r(v_0)$

If $t(p) = \sum_{i=0}^{k} t(v_i) > T_\infty$, then we require $w_r(p) \ge 1$.

• In other words, for any possible critical path p in the DFG whose computing time is larger than $T_\infty$, we require $w_r(p) \ge 1$.

Page 48: Algorithmic Transformations


Retiming Example Revisited

$T_\infty = 2$

$w_r(e_{21}) \ge 0$, since $t(2) + t(1) = 2 = T_\infty$.

$w_r(e_{13}) \ge 1$, since $t(1) + t(3) = 3 > T_\infty$.

$w_r(e_{14}) \ge 1$, since $t(1) + t(4) = 3 > T_\infty$.

$w_r(e_{32}) \ge 1$, since $t(3) + t(2) = 3 > T_\infty$.

$w_r(e_{42}) \ge 1$, since $t(4) + t(2) = 3 > T_\infty$.

Use the equation $w_r(e_{uv}) = w(e_{uv}) + r(v) - r(u)$:

$w(e_{21}) + r(1) - r(2) = 1 + r(1) - r(2) \ge 0$

$w(e_{13}) + r(3) - r(1) = 1 + r(3) - r(1) \ge 1$

$w(e_{14}) + r(4) - r(1) = 2 + r(4) - r(1) \ge 1$

$w(e_{32}) + r(2) - r(3) = 0 + r(2) - r(3) \ge 1$

$w(e_{42}) + r(2) - r(4) = 0 + r(2) - r(4) \ge 1$

Page 49: Algorithmic Transformations


Solution continues

• Since the retimed graph Gr remains the same if the same constant is added to all node retiming values, we can set r(1) = 0.

• The inequalities become

$1 - r(2) \ge 0$, or $r(2) \le 1$

$1 + r(3) \ge 1$, or $r(3) \ge 0$

$2 + r(4) \ge 1$, or $r(4) \ge -1$

$r(2) - r(3) \ge 1$, or $r(3) \le r(2) - 1$

$r(2) - r(4) \ge 1$, or $r(2) \ge r(4) + 1$

• Since

$1 \ge r(2) \ge r(3) + 1 \ge 0 + 1 = 1$

one must have r(2) = 1.

• This implies $r(3) \le 0$. But we also have $r(3) \ge 0$. Hence r(3) = 0.

• These leave $-1 \le r(4) \le 0$.

• Hence the two sets of solutions are: r(3) = 0, r(2) = 1, and r(4) = 0 or –1.
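
The solution set can be cross-checked by brute force. The sketch below fixes r(1) = 0 and tries every integer combination in a small range; the search range of -3 to 3 and the variable names are assumptions for illustration, and the edge weights and minimum delay counts are those of the example.

```python
from itertools import product

# Edge (u, v) -> w(e), and the required minimum of w_r(e) from the example.
w = {(2, 1): 1, (1, 3): 1, (1, 4): 2, (3, 2): 0, (4, 2): 0}
min_delays = {(2, 1): 0, (1, 3): 1, (1, 4): 1, (3, 2): 1, (4, 2): 1}

solutions = []
for r2, r3, r4 in product(range(-3, 4), repeat=3):
    r = {1: 0, 2: r2, 3: r3, 4: r4}          # r(1) is fixed to 0
    if all(w[(u, v)] + r[v] - r[u] >= min_delays[(u, v)] for (u, v) in w):
        solutions.append((r2, r3, r4))

print(solutions)   # tuples are (r(2), r(3), r(4))
```

Exhaustive search is only practical for a graph this small; larger constraint systems call for a shortest-path formulation.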

Page 50: Algorithmic Transformations


Systematic Solutions

Given a system of inequalities:

$r(i) - r(j) \le k$; $1 \le i, j \le N$

Construct a constraint graph:

1. Map each r(i) to node i. Add a node N+1.

2. For each inequality $r(i) - r(j) \le k$, draw an edge $e_{ji}$ from node j to node i such that $w(e_{ji}) = k$.

3. Draw N edges $e_{N+1,i}$, $1 \le i \le N$, each with weight 0.

Then:

a) The system of inequalities has a solution if and only if the constraint graph contains no negative cycles.

b) If a solution exists, one solution is where $r_i$ is the minimum-length path from node N+1 to node i.

Shortest path algorithms:

• Bellman-Ford algorithm

• Floyd-Warshall algorithm
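
The construction can be sketched with a plain Bellman-Ford pass. The function name `solve_inequalities` and the tuple format are assumptions; the sample constraints are the retiming example's inequalities rewritten in the form r(i) – r(j) ≤ k.

```python
def solve_inequalities(n, constraints):
    """Solve r(i) - r(j) <= k for 1 <= i, j <= n.

    constraints is a list of (i, j, k). Returns {node: r} or None if infeasible.
    """
    source = n + 1
    # Edge j -> i with weight k for each inequality, plus source -> i with weight 0.
    edges = [(j, i, k) for i, j, k in constraints]
    edges += [(source, i, 0) for i in range(1, n + 1)]

    inf = float("inf")
    dist = {v: inf for v in range(1, n + 2)}
    dist[source] = 0
    for _ in range(n):                        # |V| - 1 relaxation passes
        for u, v, k in edges:
            if dist[u] + k < dist[v]:
                dist[v] = dist[u] + k
    for u, v, k in edges:                     # one more pass detects a negative cycle
        if dist[u] + k < dist[v]:
            return None
    return {i: dist[i] for i in range(1, n + 1)}


constraints = [
    (2, 1, 1),    # r(2) - r(1) <= 1
    (1, 3, 0),    # r(1) - r(3) <= 0
    (1, 4, 1),    # r(1) - r(4) <= 1
    (3, 2, -1),   # r(3) - r(2) <= -1
    (4, 2, -1),   # r(4) - r(2) <= -1
]
r = solve_inequalities(4, constraints)
print(r)
```

Adding 1 to every value of the result gives r(1) = 0, r(2) = 1, r(3) = 0, r(4) = 0, one of the two solutions found by hand.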

Page 51: Algorithmic Transformations


Overview

• Algorithm Representations and Iteration Bound
• Parallelism and Pipelining
• Retiming
• Unfolding
• Folding

Page 52: Algorithmic Transformations


Definitions

• Unfolding is the process of unrolling a loop so that several iterations of the original loop are executed in one iteration of the unfolded loop.

• Also known as
  – Loop unrolling (in compilers for parallel programs)
  – Block processing

• Applications
  – Reducing the sampling period to achieve the iteration bound (desired throughput rate) $T_\infty$.
  – Parallel (block) processing to execute several iterations concurrently.
  – Digit-serial or bit-serial processing

Page 53: Algorithmic Transformations


An example

• Before unfolding:
    For n = 0 to N-1,
      y(n) = a*y(n-9) + x(n)
    end

• Unfolding once (J = 2):
    For k = 0 to N/2-1,
      y(2k) = a*y(2k-9) + x(2k)
      y(2k+1) = a*y(2k-8) + x(2k+1)
    end

• Unfolding twice (J = 3):
    For k = 0 to N/3-1,
      y(3k) = a*y(3k-9) + x(3k)
      y(3k+1) = a*y(3k-8) + x(3k+1)
      y(3k+2) = a*y(3k-7) + x(3k+2)
    end

• Block processing formulation

• J = 3, 9/J = 3 (an integer)
  – X(k) = [x(3k) x(3k+1) x(3k+2)]^T
  – Y(k) = [y(3k) y(3k+1) y(3k+2)]^T
  – Y(k) = a*Y(k-3) + X(k)

• J = 2, 9/J = ? (not an integer)
  – X(k) = [x(2k) x(2k+1)]^T
  – Y(k) = [y(2k) y(2k+1)]^T
  – Y(k) = a*Y(k-?) + X(k)
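
The unfolded loops can be checked against the original loop with a short sketch. It assumes zero initial conditions (y(n) = 0 for n < 0) and an input length that is a multiple of J; the function names and the test input are hypothetical.

```python
def run_original(a, x, lag=9):
    """y(n) = a*y(n-lag) + x(n), one sample per loop iteration."""
    y = []
    for n in range(len(x)):
        y.append(a * (y[n - lag] if n >= lag else 0) + x[n])
    return y


def run_unfolded(a, x, J, lag=9):
    """Same recursion with J consecutive samples computed per loop iteration."""
    if len(x) % J:
        raise ValueError("len(x) must be a multiple of J")
    y = [0] * len(x)
    for k in range(len(x) // J):
        # The J statements of one unfolded iteration: y(Jk), y(Jk+1), ..., y(Jk+J-1).
        for j in range(J):
            n = J * k + j
            y[n] = a * (y[n - lag] if n >= lag else 0) + x[n]
    return y


x = list(range(1, 25))
reference = run_original(2, x)
print(run_unfolded(2, x, 2) == reference, run_unfolded(2, x, 3) == reference)
```

Because the recursion reaches back 9 samples, the J statements inside one unfolded iteration do not depend on each other for J = 2 or J = 3, so hardware can evaluate them concurrently.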

Page 54: Algorithmic Transformations


Unfolding the DFG

• Rewrite the algorithm formulation:

y(2k) = a*y(2k-9) + x(2k)

y(2k+1) = a*y(2k-8) + x(2k+1)

as

y(2k) = a*y(2(k-5)+1) + x(2k)

y(2k+1) = a*y(2(k-4)) + x(2k+1)

• After J-fold unfolding, the clock period T = J Ts, where Ts is the data sampling period.

[Figure: the original DFG with T = Ts and the J-unfolded DFG with T = J Ts.]

Page 55: Algorithmic Transformations


General DFG Unfolding Method

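• Define
  – $\lfloor x \rfloor$: the largest integer that is $\le x$
  – $\lceil x \rceil$: the smallest integer that is $\ge x$
  – $a \,\%\, b = a - b \lfloor a/b \rfloor$, where a, b are integers
  – Example: $\lfloor 37/4 \rfloor = 9$, $\lceil 37/4 \rceil = 10$

• Step 1. For each node U in the original DFG, draw J nodes {$U_i$; $0 \le i \le J-1$} in the unfolded DFG.

• Step 2. For each edge from U to V with w delays, draw J edges from $U_i$ to $V_{(i+w)\%J}$ with $\lfloor (i+w)/J \rfloor$ delays.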
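
The two steps translate almost directly into code. The sketch below is illustrative: the function name `unfold` and the edge-list representation are assumptions, and the sample edge is the 9-delay feedback edge of the y(n) = a*y(n-9) + x(n) example with hypothetical node names.

```python
def unfold(edges, J):
    """J-unfold a DFG given as a list of (U, V, w) edges.

    Returns a list of ((U, i), (V, j), delays) edges of the unfolded DFG.
    """
    unfolded = []
    for u, v, w in edges:
        for i in range(J):
            target_copy = (i + w) % J     # which copy of V receives the edge
            delays = (i + w) // J         # floor((i + w) / J) delays on the new edge
            unfolded.append(((u, i), (v, target_copy), delays))
    return unfolded


# Hypothetical node names: "add" produces y(n), "mul" multiplies the delayed value by a.
result = unfold([("add", "mul", 9)], J=2)
print(result)
```

The edge from copy 0 carries 4 delays and the edge from copy 1 carries 5 delays, which agrees with rewriting y(2k+1) through y(2(k-4)) and y(2k) through y(2(k-5)+1).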

Page 56: Algorithmic Transformations


Another DFG Unfolding Example

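[Figure: the original DFG with nodes Q, S, T, R and edge delays 3D and 2D (label T = 3), and the unfolded DFG for J = 2 with nodes Q0, S0, T0, R0, Q1, S1, T1, R1.]

i   w   (i+w)%J   ⌊(i+w)/J⌋
0   0   0         0
0   2   0         1
0   3   1         1
1   0   1         0
1   2   1         1
1   3   0         2

Step 1. Duplicate J copies of each node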

Page 57: Algorithmic Transformations


Another DFG Unfolding Example

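[Figure: the same original DFG (nodes Q, S, T, R, delays 3D and 2D, J = 2, T = 3) and the unfolded nodes Q0, S0, T0, R0, Q1, S1, T1, R1, now with the zero-delay edges drawn; the slide repeats the table of i, w, (i+w)%J and ⌊(i+w)/J⌋.]

Step 2. Add all edges with 0 delay on them.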

Page 58: Algorithmic Transformations


Another DFG Unfolding Example

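[Figure: the original DFG with nodes Q, S, T, R and edge delays 3D and 2D (label T = 3), and the completed unfolded DFG for J = 2 with nodes Q0, S0, T0, R0, Q1, S1, T1, R1 and edge delays D, D, D and 2D (label T = 6).]

i   w   (i+w)%J   ⌊(i+w)/J⌋
0   0   0         0
0   2   0         1
0   3   1         1
1   0   1         0
1   2   1         1
1   3   0         2

Step 3. Use the table to figure out the edges with delays.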

Page 59: Algorithmic Transformations


Properties of Unfolding

• Unfolding preserves the number of registers (delays) in a DFG.

• A loop with w delays in a DFG that has been unfolded J times leads to
  – g.c.d.(w, J) loops in the unfolded DFG, with each of these loops containing
  – w/(g.c.d.(w, J)) delays and
  – J/(g.c.d.(w, J)) copies of each node that appears in the original loop.

• Unfolding a DFG with iteration bound $T_\infty$ results in a J-unfolded DFG with iteration bound $J T_\infty$.

• A path with w (< J) delays in a DFG leads to J-w paths with no delays and w paths with 1 delay each in the J-unfolded DFG.

• Any path in the original DFG containing J or more delays leads to J paths with 1 or more delays in each path. Therefore, it cannot create a critical path in the J-unfolded DFG.

• Any clock period that can be achieved by retiming a J-unfolded DFG can be achieved by retiming the original DFG and then J-unfolding it.
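
The loop-splitting property can be checked with a small sketch. It models the simplest case, a single node with a self-loop of w delays, and traces where the unfolded edges lead; the function name `unfolded_loops` and the test values are assumptions for illustration.

```python
from math import gcd


def unfolded_loops(w, J):
    """J-unfold a self-loop with w delays.

    Returns one (node_copies, delays) pair per loop of the unfolded DFG.
    """
    seen = set()
    loops = []
    for start in range(J):
        if start in seen:
            continue
        i, copies, delays = start, 0, 0
        while True:
            seen.add(i)
            copies += 1
            delays += (i + w) // J        # delays on the edge leaving copy i
            i = (i + w) % J               # copy reached by that edge
            if i == start:
                break
        loops.append((copies, delays))
    return loops


for w, J in [(9, 2), (6, 4), (9, 3)]:
    loops = unfolded_loops(w, J)
    g = gcd(w, J)
    print(w, J, loops, len(loops) == g)
```

For w = 6 and J = 4 the unfolded graph has gcd(6, 4) = 2 loops, each with 4/2 = 2 node copies and 6/2 = 3 delays, and the delays over all loops still add up to 6.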