ADSP Lecture2 - Unfolding ([email protected]) 2-1
VLSI Signal Processing
Lecture 2: Unfolding Transformation

Multiple-Data Processing
• Create a program that covers more than one iteration, e.g. by unrolling J loop iterations
• Example: loop unrolling + software pipelining
[Figure: schedules of operations 1, 2, 3 over clock cycles 1-8, illustrating loop unrolling with software pipelining]

Basic Ideas
• Parallel processing
• Pipelined processing
[Figure: two processor-versus-time schedules for P1-P4.
 First schedule, rows: a1 a2 a3 a4 / b1 b2 b3 b4 / c1 c2 c3 c4 / d1 d2 d3 d4.
 Second schedule, rows: a1 b1 c1 d1 / a2 b2 c2 d2 / a3 b3 c3 d3 / a4 b4 c4 d4.]

Data Dependence
• Parallel processing requires NO data dependence between processors
• Pipelined processing involves inter-processor communication
[Figure: processor-versus-time schedules for P1-P4 under parallel and pipelined processing]

Parallel Processing
• In a J-unfolded system, each delay is J-slow: if the input to a delay element is x(kJ+m), then its output is x((k-1)J+m) = x(kJ+m-J)

Parallel Processing
• Block processing
  – The number of inputs processed in a clock cycle is referred to as the block size
  – With block size 3, at the k-th clock cycle the three inputs x(3k), x(3k+1), and x(3k+2) are processed simultaneously to generate y(3k), y(3k+1), and y(3k+2)
[Figure: a SISO system x(n) -> y(n) and its block-processing equivalent: a serial-to-parallel converter feeds x(3k), x(3k+1), x(3k+2) to a MIMO system, whose outputs y(3k), y(3k+1), y(3k+2) pass through a parallel-to-serial converter to give y(n)]

I/O Conversion
• Serial-to-parallel converter
• Parallel-to-serial converter
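[Figure: serial-to-parallel converter: x(n) enters a chain of delays D at sampling period T/3, and x(3k), x(3k+1), x(3k+2) are taken off at time 3k. Parallel-to-serial converter: y(3k), y(3k+1), y(3k+2) are loaded into a chain of delays D at time 3k and leave as y(n) at sampling period T/3.]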
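In software, the two converters amount to regrouping a sample stream into blocks and flattening it again. The following is a minimal sketch, assuming a block size of 3 and a stream length that is a multiple of the block size; the function names are illustrative and do not come from the lecture.

```python
def serial_to_parallel(x, block_size=3):
    """Group a serial stream into blocks [x(Jk), x(Jk+1), ..., x(Jk+J-1)]."""
    if len(x) % block_size != 0:
        raise ValueError("stream length must be a multiple of block_size")
    return [x[k:k + block_size] for k in range(0, len(x), block_size)]


def parallel_to_serial(blocks):
    """Flatten the blocks back into a serial stream y(n)."""
    return [sample for block in blocks for sample in block]


x = list(range(9))                # stands for x(0), x(1), ..., x(8)
blocks = serial_to_parallel(x)    # one block per clock cycle k
restored = parallel_to_serial(blocks)
```

Each inner list corresponds to one clock cycle of the MIMO system, which is why the clock period can be three times the sampling period.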

Mathematical Formulation
• e.g. y(n) = ay(n-9) + x(n)
• 2-parallel:
  y(2k) = ay(2k-9) + x(2k)
  y(2k+1) = ay(2k-8) + x(2k+1)
• In a 2-parallel SDFG, one active clock edge processes two samples:
  y(2k) = ay(2(k-5)+1) + x(2k)
  y(2k+1) = ay(2(k-4)+0) + x(2k+1)
• A dependency that spans fewer sample delays than the degree of parallelism can be implemented with internal routing
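The 2-parallel equations can be checked numerically against the serial recurrence. This is a sketch under the assumption of zero initial conditions (y(n) = 0 for n < 0) and an even number of input samples; the function names are hypothetical.

```python
def serial_iir(x, a, delay=9):
    """Reference: y(n) = a*y(n-delay) + x(n), with y(n) = 0 for n < 0."""
    y = []
    for n, xn in enumerate(x):
        past = y[n - delay] if n >= delay else 0.0
        y.append(a * past + xn)
    return y


def two_parallel_iir(x, a):
    """2-parallel form: clock cycle k produces y(2k) and y(2k+1).

    y(2k)   = a*y(2(k-5)+1) + x(2k)
    y(2k+1) = a*y(2(k-4)+0) + x(2k+1)
    """
    even, odd = [], []            # even[k] = y(2k), odd[k] = y(2k+1)
    for k in range(len(x) // 2):
        past_odd = odd[k - 5] if k >= 5 else 0.0
        past_even = even[k - 4] if k >= 4 else 0.0
        even.append(a * past_odd + x[2 * k])
        odd.append(a * past_even + x[2 * k + 1])
    # Interleave the two lanes back into serial order.
    return [v for pair in zip(even, odd) for v in pair]


x = list(range(1, 21))
assert two_parallel_iir(x, 0.5) == serial_iir(x, 0.5)
```

The even lane reads the odd lane 5 clock cycles back and the odd lane reads the even lane 4 clock cycles back, which is where the 9 sample delays end up.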

Unfolding the DFG
[Figure: a DFG with clock period T = Ts and its J-unfolded version with T = J Ts]
• Not trivial, even for a simple graph

Block Processing for FIR Filter
• One form of vectorized parallel processing of DSP algorithms (not parallel processing in the most general sense)
• Block vector: [x(3k) x(3k+1) x(3k+2)]
• Clock cycle: can be 3 times longer
• Original (FIR filter):
  $$y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2)$$
• Rewrite 3 equations at a time:
  $$\begin{aligned}
  y(3k)   &= a\,x(3k)   + b\,x(3k-1) + c\,x(3k-2) \\
  y(3k+1) &= a\,x(3k+1) + b\,x(3k)   + c\,x(3k-1) \\
  y(3k+2) &= a\,x(3k+2) + b\,x(3k+1) + c\,x(3k)
  \end{aligned}$$
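The three block equations can be compared with the sample-by-sample filter. This sketch assumes x(n) = 0 for n < 0 and an input length that is a multiple of 3; the helper names are illustrative only.

```python
def fir_serial(x, a, b, c):
    """y(n) = a*x(n) + b*x(n-1) + c*x(n-2), with x(n) = 0 for n < 0."""
    def at(n):
        return x[n] if n >= 0 else 0
    return [a * at(n) + b * at(n - 1) + c * at(n - 2) for n in range(len(x))]


def fir_block3(x, a, b, c):
    """Compute y(3k), y(3k+1), y(3k+2) together in each clock cycle k."""
    def at(n):
        return x[n] if n >= 0 else 0
    y = []
    for k in range(len(x) // 3):
        n = 3 * k
        y0 = a * at(n) + b * at(n - 1) + c * at(n - 2)
        y1 = a * at(n + 1) + b * at(n) + c * at(n - 1)
        y2 = a * at(n + 2) + b * at(n + 1) + c * at(n)
        y.extend([y0, y1, y2])
    return y


x = [1, 2, 3, 4, 5, 6]
assert fir_block3(x, 1, 2, 3) == fir_serial(x, 1, 2, 3)
```

Because the FIR filter has no feedback, the three outputs of a block do not depend on each other and can be computed concurrently.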

Block Processing for IIR Digital Filter
• Original formulation (n: sample period):
  $$y(n) = a\,y(n-2) + x(n)$$
• Rewrite (k: processor period, T_sample ≠ T_clk):
  $$\begin{aligned}
  y(2k)   &= a\,y(2k-2) + x(2k) \\
  y(2k+1) &= a\,y(2k-1) + x(2k+1)
  \end{aligned}$$
• Vector formulation:
  $$\mathbf{y}(k) = a\,\mathbf{y}(k-1) + \mathbf{x}(k), \qquad
  \mathbf{x}(k) = \begin{bmatrix} x(2k) \\ x(2k+1) \end{bmatrix}, \quad
  \mathbf{y}(k) = \begin{bmatrix} y(2k) \\ y(2k+1) \end{bmatrix}$$
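The vector recursion can be exercised with a short script. This is a sketch assuming zero initial state and an even number of samples; `block_iir` and `serial_iir2` are hypothetical names introduced here for illustration.

```python
def serial_iir2(x, a):
    """Reference: y(n) = a*y(n-2) + x(n), with y(n) = 0 for n < 0."""
    y = []
    for n, xn in enumerate(x):
        past = y[n - 2] if n >= 2 else 0.0
        y.append(a * past + xn)
    return y


def block_iir(x, a):
    """Vector recursion y(k) = a*y(k-1) + x(k) on blocks of 2 samples."""
    prev = (0.0, 0.0)             # y(k-1) = [y(2k-2), y(2k-1)], zero initial state
    out = []
    for k in range(len(x) // 2):
        cur = (a * prev[0] + x[2 * k], a * prev[1] + x[2 * k + 1])
        out.extend(cur)
        prev = cur                # becomes y(k-1) for the next clock cycle
    return out


x = [1, 2, 3, 4, 5, 6]
assert block_iir(x, 2) == serial_iir2(x, 2)
```

Each component of the block only needs the same component of the previous block, so the two lanes run independently with one delay each at the slower clock rate.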
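
Block IIR Filter
[Figure: block IIR filter. x(n) passes through an S/P converter to give x(2k) and x(2k+1); each branch has an adder and a delay D that feeds back y(2(k-1)) and y(2(k-1)+1) to produce y(2k) and y(2k+1); a P/S converter outputs y(n).]
• Clock period is not equal to the sampling period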

Timing Comparison
[Figure: reference timing: a MAC takes x(1)-x(4) and produces y(1)-y(4) in cycles 1-4]
• Pipelining
[Figure: separate Mul and Add stages over clock cycles 1-8, with inputs x(1)-x(8) and outputs y(1)-y(8)]
• Block processing
[Figure: two lanes over the same time span, one handling x(1), x(3), x(5), x(7) and the other x(2), x(4), x(6), x(8)]

Definitions
• Unfolding is the process of transforming a loop so that several iterations are unrolled into the same iteration.
• Also known as (a.k.a.)
  – Loop unrolling (in compilers for parallel programs)
  – Block processing
• Applications
  – Reducing the sampling period to achieve the iteration bound (desired throughput rate) T∞
  – Parallel (block) processing to execute several iterations concurrently
  – Digit-serial or bit-serial processing

Unfolding the DFG
• y(n) = ay(n-9) + x(n)
• Rewrite the algorithm formulation:
  y(2k) = ay(2k-9) + x(2k)
  y(2k+1) = ay(2k-8) + x(2k+1)
  that is,
  y(2k) = ay(2(k-5)+1) + x(2k)
  y(2k+1) = ay(2(k-4)) + x(2k+1)
• After J-fold unfolding, the clock period is T = J Ts, where Ts is the data sampling period.

Timing Diagram
[Figure: timing diagram. With T = Ts, outputs y(0), y(1), ..., y(13) are produced one per clock cycle, and each output is needed 9 Ts later. With T = 2Ts, each clock cycle produces one pair (y(0), y(1)), (y(2), y(3)), ..., (y(12), y(13)), and the 9 Ts dependence appears as 4T on one edge and 5T on the other.]
• The timing diagram above is obtained assuming that the sampling period Ts remains unchanged. Thus, the clock period T is increased J-fold.
• Since 9/2 is not an integer, the output pair (y(0), y(1)) is needed by two different future iterations: y(0) is needed 4T later (by y(9)) and y(1) is needed 5T later (by y(10)).

Another DFG Unfolding Example
[Figure: original DFG with nodes Q, S, T, R and edge delays 3D and 2D (T = 3), next to the node copies Q0, S0, T0, R0 and Q1, S1, T1, R1 for J = 2]

  i   w   (i+w)%J   floor[(i+w)/J]
  0   0      0            0
  0   2      0            1
  0   3      1            1
  1   0      1            0
  1   2      1            1
  1   3      0            2

Step 1. Create J copies of each node.

Another DFG Unfolding Example
[Figure: the same DFG (nodes Q, S, T, R, edge delays 3D and 2D, J = 2, T = 3) with the zero-delay edges drawn between the node copies Q0, S0, T0, R0 and Q1, S1, T1, R1]
Step 2. Add all edges with 0 delay on them.

Another DFG Unfolding Example
[Figure: the same DFG (nodes Q, S, T, R, edge delays 3D and 2D, J = 2, T = 3) and the completed 2-unfolded DFG (T = 6), whose delayed edges carry D, D, D and 2D]

  i   w   (i+w)%J   floor[(i+w)/J]
  0   0      0            0
  0   2      0            1
  0   3      1            1
  1   0      1            0
  1   2      1            1
  1   3      0            2

Step 3. Use the table to figure out the edges with delays.
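The table for this example can be regenerated mechanically. A small sketch, assuming the edge delays in the original DFG are 0, 2 and 3 as listed in the table; the function name is illustrative.

```python
def unfolding_table(J, delays):
    """Rows (i, w, (i+w) % J, (i+w) // J) for every copy i and edge delay w."""
    return [(i, w, (i + w) % J, (i + w) // J)
            for i in range(J) for w in delays]


# Print the rows for the J = 2 example.
for row in unfolding_table(J=2, delays=[0, 2, 3]):
    print(*row)
```

The third column names the destination copy and the fourth column is the number of delays placed on that unfolded edge.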

Unfolding Transformation
• For each node U in the original DFG, draw J nodes U_0, U_1, …, U_{J-1}
• For each edge U→V with w delays in the original DFG, draw the J edges U_i→V_{(i+w)%J} with floor[(i+w)/J] delays, for i = 0, 1, …, J-1
[Figure: example of the unfolding transformation]
• Unfolding an edge with w delays in the original DFG produces J-w edges with no delays and w edges with 1 delay in the J-unfolded DFG, for w < J
• Unfolding preserves the precedence constraints of a DSP algorithm
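The two drawing rules translate directly into a few lines of code. This is a minimal sketch, assuming the DFG is given as a list of (U, V, w) edges; the single-node graph used below is a hypothetical model of y(n) = ay(n-9) + x(n) with the multiply and add lumped into one node A.

```python
def unfold(edges, J):
    """Unfold a DFG given as (U, V, w) edges into its J-unfolded edge list.

    Each edge U -> V with w delays becomes the J edges
    U_i -> V_{(i+w) % J} carrying (i+w) // J delays, for i = 0..J-1.
    """
    unfolded = []
    for u, v, w in edges:
        for i in range(J):
            unfolded.append((f"{u}{i}", f"{v}{(i + w) % J}", (i + w) // J))
    return unfolded


# Hypothetical one-node DFG: node A with a 9-delay self-loop.
edges = [("A", "A", 9)]
result = unfold(edges, J=2)
print(result)
```

For J = 2 the self-loop turns into one edge A0 -> A1 with 4 delays and one edge A1 -> A0 with 5 delays, matching the 4T and 5T dependences in the timing diagram.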
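
Delay Preservation
• Unfolding preserves the number of delays in a DFG:
  $$\left\lfloor \frac{w}{J} \right\rfloor + \left\lfloor \frac{w+1}{J} \right\rfloor + \cdots + \left\lfloor \frac{w+J-1}{J} \right\rfloor = w$$
• Let $w = mJ + n$, where $m, n \in \mathbb{N}_0$ and $0 \le n \le J-1$. Then
  $$\left\lfloor \frac{w+J-n-1}{J} \right\rfloor = \left\lfloor \frac{mJ+n+J-n-1}{J} \right\rfloor = m + \left\lfloor \frac{J-1}{J} \right\rfloor = m$$
  $$\left\lfloor \frac{w+J-n}{J} \right\rfloor = \left\lfloor \frac{mJ+n+J-n}{J} \right\rfloor = m + \left\lfloor \frac{J}{J} \right\rfloor = m + 1$$
  so the first $J-n$ terms equal $m$ and the last $n$ terms equal $m+1$:
  $$\underbrace{\left\lfloor \frac{w}{J} \right\rfloor + \cdots + \left\lfloor \frac{w+J-n-1}{J} \right\rfloor}_{J-n \text{ terms}} + \underbrace{\left\lfloor \frac{w+J-n}{J} \right\rfloor + \cdots + \left\lfloor \frac{w+J-1}{J} \right\rfloor}_{n \text{ terms}} = (J-n)\,m + n\,(m+1) = mJ + n = w$$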
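The identity can also be checked by brute force over a small range of w and J. This sketch is a numerical sanity check, not a proof; the helper name and the ranges are arbitrary choices.

```python
def unfolded_delays(w, J):
    """Delays on the J edges produced by unfolding one edge with w delays."""
    return [(i + w) // J for i in range(J)]


# Check sum_{i=0}^{J-1} floor((i+w)/J) == w for small J and w.
for J in range(1, 9):
    for w in range(0, 40):
        delays = unfolded_delays(w, J)
        assert sum(delays) == w
        if w < J:
            # J-w edges keep no delay and w edges get exactly one delay.
            assert delays.count(0) == J - w
            assert delays.count(1) == w
```

The inner check for w < J corresponds to the statement that such an edge unfolds into J-w delay-free edges and w single-delay edges.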

Example
• Unfold the following DFG using unfolding factors 2 and 5
[Figure: original DFG with nodes A, B, C, D, E and edge delays 7D, D, 2D, 3D; its 2-unfolded DFG with nodes A0-E0 and A1-E1; and its 5-unfolded DFG with nodes A0-E0 through A4-E4]

Properties of Unfolding
• Unfolding preserves the number of registers (delays) in a DFG
• A loop with w delays in a DFG that is unfolded J times leads to gcd(w, J) loops in the unfolded DFG, each containing w/gcd(w, J) delays and J/gcd(w, J) copies of each node that appears in the original loop.
• Unfolding a DFG with iteration bound T∞ results in a J-unfolded DFG with iteration bound J T∞.
• A path with w (< J) delays in a DFG leads to J-w paths with no delays and w paths with 1 delay each in the J-unfolded DFG.
• Any clock period that can be achieved by retiming a J-unfolded DFG can be achieved by retiming the original DFG and then J-unfolding it.

When a Loop is Unfolded
• Consider a loop ℓ with w delays in a DFG, passing through a node A.
• Traversing the loop A~>A p times is also a loop, with pw delays.
• In the J-unfolded DFG, consider the path A_i ~> A_{(i+pw)%J}. It is a loop if i = (i+pw)%J, which implies that J | pw.
• The smallest such p is J/gcd(J, w). That is, in the J-unfolded DFG, each loop traverses the original loop A~>A J/gcd(J, w) times.
• Recall that there are J copies of node A in total. Hence, there are J/(J/gcd(J, w)) = gcd(J, w) loops, and each loop contains w/gcd(J, w) delays.
• The iteration bound of the J-unfolded DFG is then
  $$T'_\infty = \max_{l} \left\{ \frac{\dfrac{J}{\gcd(w_l, J)}\, t_l}{\dfrac{w_l}{\gcd(w_l, J)}} \right\} = J \max_{l} \left\{ \frac{t_l}{w_l} \right\} = J\,T_\infty$$
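The loop count can be confirmed by tracing the copies of a node through the unfolded graph. This sketch assumes the loop is modeled as one node with a w-delay self-loop, so copy i connects to copy (i+w)%J through floor((i+w)/J) delays; the function name and the test values w = 6, J = 4 are illustrative.

```python
from math import gcd


def unfolded_loops(w, J):
    """Trace the loops created by J-unfolding a single loop with w delays.

    The loops are the cycles of the map i -> (i + w) % J on {0, ..., J-1}.
    Returns a list of (copies_in_loop, delays_in_loop) pairs.
    """
    seen, loops = set(), []
    for start in range(J):
        if start in seen:
            continue
        i, copies, delays = start, 0, 0
        while True:
            seen.add(i)
            copies += 1
            delays += (i + w) // J
            i = (i + w) % J
            if i == start:
                break
        loops.append((copies, delays))
    return loops


w, J = 6, 4
g = gcd(w, J)
loops = unfolded_loops(w, J)
assert len(loops) == g                                  # gcd(w, J) loops
assert all(loop == (J // g, w // g) for loop in loops)  # copies and delays per loop
```

With J/gcd(w, J) node copies and w/gcd(w, J) delays in each unfolded loop, the ratio of computation time to delays is J t/w, which gives the iteration bound J T∞.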

When a Path is Unfolded
• If w < J, then a path containing w delays in a DFG leads to J-w paths with no delays and w paths with 1 delay in the J-unfolded DFG.
• If w ≥ J, then the path leads to J paths with one or more delays in the J-unfolded DFG. This implies that these paths are not critical.
• Assume that the critical path of the J-unfolded DFG is c. If D(U,V) ≥ c, then W_r(U,V) = W(U,V) + r(V) - r(U) ≥ J.
• Any feasible clock cycle period that can be obtained by retiming the J-unfolded DFG can be achieved by retiming the original DFG directly and then J-unfolding it.

When a Path is Unfolded
• Suppose r' is a legal retiming for the J-unfolded DFG, G_J, which leads to critical path c.
• Let $r(U) = \sum_{i=0}^{J-1} r'(U_i)$.
  – r is a feasible retiming for the original DFG, G.
  – The retiming leads to a critical path c.
• Consider an edge U→V with w delays in G. Since r' is a legal retiming for G_J and leads to critical path c, then for 0 ≤ i ≤ J-1:
  $$(1)\quad r'(U_i) - r'(V_{(i+w)\%J}) \le \left\lfloor \frac{i+w}{J} \right\rfloor \qquad \text{(feasible constraint)}$$
  $$(2)\quad r'(U_i) - r'(V_{(i+w)\%J}) \le \left\lfloor \frac{i+w}{J} \right\rfloor - 1, \ \text{if } D(U,V) \ge c \qquad \text{(critical path constraint)}$$
• Summing over 0 ≤ i ≤ J-1:
  $$(1)\quad r(U) - r(V) \le w$$
  $$(2)\quad r(U) - r(V) \le W(U,V) - J$$

Sample Period Reduction
• Case 1: A node in the DFG has a computation time greater than T∞
• Case 2: The iteration bound is not an integer
• Case 3: The longest node computation time is larger than the iteration bound T∞, and T∞ is not an integer

Case 1
• The critical path dominates, since a node computation time is more than the iteration bound
• Retiming cannot be used to reduce the sample period

Sample Period Reduction
• Rule of thumb: $\lceil t_U / T_\infty \rceil$-unfolding should be used
[Figure: unfolded DFG with T∞ = 6, T_critical = 6]

Case 2
• The iteration period cannot achieve the iteration bound

Parallel Processing
• Parallel processing can be performed by unfolding