Exploiting Fast Carry Chains of FPGAs for Designing Compressor Trees
Hadi P. Afshar, Philip Brisk, Paolo Ienne
Multi-input Additions are Fundamental
DSP and multimedia applications
– FIR filters, Motion Estimation,…
Parallel multipliers
Flow graph transformation
[Figure: FIR filter, with delay elements (D) feeding a multi-input sum (Σ)]
Flow Graph Transformation
[Figure: ADPCM flow graph, BEFORE and AFTER transformation. The chain of additions that accumulates vpdiff over steps 0 to 3 (shifts, masks and comparisons on delta) is replaced by a single compressor tree (Σ).]
Compressor vs. Adder Tree
[Figure: a compressor tree (layers of CSAs followed by one CPA) next to an adder tree (CPAs only)]
Compressors are better than adder trees in VLSI
But adder trees are better than compressors on FPGAs!
- Slow intra LUT routing
- Poor LUT utilization
- Low logic density
But compressor trees can be faster and smaller if properly designed
Better Compressors on FPGA
Generalized Parallel Counter (GPC) is the basic block
– More logic density
– Fewer logic levels
– Less pressure on the routing
[Figure: compressor tree built from layers of GPCs followed by a final CPA]
Overview
Arithmetic Concepts
Hybrid Design Approach
– Bottom-up
– Top-down
Experiments
Conclusion
Parallel Counters
Parallel Counter
– Counts the number of input bits set to 1
– Output is a binary value
– 3:2 counter = Full Adder
– 2:2 counter = Half Adder
– An m:n counter has m input bits and n = ⌈log2(m+1)⌉ output bits
Generalized Parallel Counter (GPC)
– Input bits can have different bit positions
– E.g. (3, 3; 4) GPC
[Figure: m:n counter (Σ) with m inputs and n outputs]
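The counter definitions above can be illustrated with a small behavioural model. This is a sketch only, not the authors' code: the function names are hypothetical, and it assumes the GPC tuple lists input counts from the most significant column down to rank 0, so that (3, 3; 4) means three bits of rank 1, three bits of rank 0 and four output bits.

```python
def counter_width(m):
    """Output bits n of an m:n parallel counter, i.e. ceil(log2(m + 1))."""
    # For a non-negative integer m, m.bit_length() equals ceil(log2(m + 1)).
    return m.bit_length()


def parallel_counter(bits):
    """Return how many of the input bits are set to 1."""
    return sum(bits)


def gpc(columns):
    """Weighted count computed by a GPC.

    columns[0] holds the bits of the most significant input column and
    columns[-1] the bits of rank 0 (assumed ordering).
    """
    top = len(columns) - 1
    return sum(sum(col) << (top - i) for i, col in enumerate(columns))


def gpc_output_width(shape):
    """Output bits needed by a GPC whose input counts are given MSB-first."""
    top = len(shape) - 1
    max_value = sum(k << (top - i) for i, k in enumerate(shape))
    return max_value.bit_length()
```

For example, `gpc_output_width((3, 3))` gives 4, matching the (3, 3; 4) GPC, and `counter_width(3)` gives 2, matching the 3:2 full adder.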
Compressor Trees on FPGAs
We propose GPCs as the basic blocks for compressor trees
– Why?
1. GPCs map well onto FPGA logic cells
2. GPCs are flexible
GPC Mapping Example
[Figure: the same group of dots covered by 5 counters or by 3 GPCs: (3,5;4), (3,4;4) and (0,5;3)]
Hybrid Design Approach
[Figure: design flow. Boxes: FPGA Architectural Characteristics, Compressor Tree Specification, Atom Level GPC HDL Library, GPC Mapped HDL Netlist, Place and Route, Result. The flow is annotated with Bottom-Up and Top-Down phases.]
FPGA Logic Cell
Altera Stratix-II/III/V
[Figure: Logic Array Block (LAB) and Adaptive Logic Module (ALM); the ALM contains combinational logic, two adders (+) and two registers (Reg)]
FPGA Logic Cell
ALM Configuration Modes
– Normal
– Extended
– Arithmetic
– Shared Arithmetic
[Figure: ALM structure with four 4-LUTs feeding two adders (+)]
Bottom-up Design
[Figure: 6:3 GPC with outputs F0, F1 and F2, placed on LAB0 and LAB1]
What if we have bigger GPCs like 7:3 GPC?
Can we exploit the carry chain and dedicated adders for building GPCs?
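GPC Design Example
(0, 6; 3) GPC
[Figure: inputs a0 to a5 and outputs z0, z1, z2. A full adder (FA) on a1, a2, a3 gives s0 = S(a1,a2,a3) and c0 = C(a1,a2,a3); a half adder (HA) on a4, a5 gives s1 = S(a4,a5) and c1 = C(a4,a5). The LUTs of ALM0 and ALM1 compute these functions, and the ALM adders (+) combine them with a0 into z0, z1, z2.]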
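The decomposition named on this slide can be checked with a bit-level model. The sketch below is one plausible reading of the figure, not the authors' netlist: the S and C functions play the role of the LUT logic, and two further full adders stand in for the carry-chain adders. The exact operand assignment on the ALM is not recoverable from the slide, and all function names are hypothetical.

```python
from itertools import product


def full_adder(x, y, z):
    """Return (sum, carry) of three bits."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def half_adder(x, y):
    """Return (sum, carry) of two bits."""
    return x ^ y, x & y


def gpc_0_6_3(a0, a1, a2, a3, a4, a5):
    """Count six rank-0 bits and return the 3-bit result (z0, z1, z2)."""
    s0, c0 = full_adder(a1, a2, a3)   # S(a1,a2,a3), C(a1,a2,a3)
    s1, c1 = half_adder(a4, a5)       # S(a4,a5), C(a4,a5)
    z0, k = full_adder(a0, s0, s1)    # rank 0: a0 + s0 + s1
    z1, z2 = full_adder(c0, c1, k)    # rank 1: c0 + c1 + carry from rank 0
    return z0, z1, z2


def check_gpc_0_6_3():
    """Exhaustively compare the model against a plain population count."""
    for bits in product((0, 1), repeat=6):
        z0, z1, z2 = gpc_0_6_3(*bits)
        if z0 + 2 * z1 + 4 * z2 != sum(bits):
            return False
    return True
```

`check_gpc_0_6_3()` walks all 64 input combinations, which is a quick way to confirm that three output bits suffice for six inputs.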
GPC Placement
[Figure: consecutive GPCs (GPCi, GPCi+1) packed onto one chain of LUT + adder cells, with the GPC boundaries marked]
– Zero value on the carry
– Logic separation between carry and sum
{cout, s} = cin + a + a = cin + 2a, so cout = a and s = cin
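The identity used for separating carry and sum can be confirmed by enumeration. This is a minimal sketch with hypothetical names: `adder_cell` models a single full-adder cell on the carry chain, and feeding the same bit `a` to both operand inputs makes the carry-out depend only on `a` and the sum only on `cin`.

```python
def adder_cell(a, b, cin):
    """One carry-chain adder cell: return (cout, s) for a + b + cin."""
    total = a + b + cin
    return total >> 1, total & 1


def check_separation():
    """Verify cout == a and s == cin when both operands are the same bit."""
    for a in (0, 1):
        for cin in (0, 1):
            cout, s = adder_cell(a, a, cin)
            if cout != a or s != cin:
                return False
    return True
```

Because cout no longer depends on cin, a cell driven this way does not propagate the incoming carry into the next GPC on the chain.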
Top-down Heuristic
Mapping_algorithm(Integer : M, Integer : W, Array of Integers : columns)
{
  Build_GPC_library();
  repeat {
    while (col_indx < max_col_indx) {
      if (columns[col_indx] > H)
        Map_by_GPC();
      else
        col_indx++;
    }
    lsb_to_msb_covering();
    Connect_GPCs_IOs();
    Propagate_comb_delay();
    Generate_next_stage_dots();
  } until three rows of dots remain;
}
[Slide callouts: Step1, Step2, Step3 and (0, H; log2H) annotate the code above]
Major Step of Heuristic
Process columns from LSB to MSB
[Figure: dot diagram; columns mapped to (0, H; log2H) GPCs are marked, as are the remaining columns with height < H]
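The stage-by-stage reduction can be sketched at the bit level. This is a simplified illustration, not the heuristic from the slides: it uses only single-column counters of up to H inputs, replaces the library-based `lsb_to_msb_covering()` step with smaller counters for leftover groups of at least three bits, and ignores delays and I/O connection. All names are hypothetical.

```python
def compress_stage(columns, H=6):
    """One stage: cover each column, LSB to MSB, with counters of up to H bits.

    columns[i] is the list of bits (dots) of weight 2**i.
    """
    nxt = [[] for _ in range(len(columns) + H.bit_length())]
    for i, col in enumerate(columns):
        for start in range(0, len(col), H):
            chunk = col[start:start + H]
            if len(chunk) < 3:
                nxt[i].extend(chunk)          # too few bits to compress: pass through
                continue
            count = sum(chunk)                # value produced by a (0, k; n) counter
            width = len(chunk).bit_length()   # n = ceil(log2(k + 1))
            for j in range(width):
                nxt[i + j].append((count >> j) & 1)
    while nxt and not nxt[-1]:
        nxt.pop()                             # drop empty high-order columns
    return nxt


def value(columns):
    """Integer represented by the dot diagram."""
    return sum(sum(col) << i for i, col in enumerate(columns))


def compress(columns, H=6):
    """Repeat stages until at most three rows of dots remain."""
    if H < 3:
        raise ValueError("H must be at least 3 for the reduction to terminate")
    stages = 0
    while max((len(c) for c in columns), default=0) > 3:
        columns = compress_stage(columns, H)
        stages += 1
    return columns, stages
```

Each counter with three or more inputs emits fewer bits than it consumes, so the total number of dots shrinks in every stage and the loop terminates. The three remaining rows would then go to a final adder, which this sketch does not model.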
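Delay Balancing
[Figure: outputs z0 to z8 of the previous-stage GPCs connected to inputs a0, a2, a5 of a next-stage GPC]
xd denotes the delay associated with signal x
z1d > z4d > z6d
a0d > a2d > a5d
Slowest output on slowest input: CP1 = z1d + a0d
Balanced connection: CP2 = max(z1d + a5d, z4d + a2d, z6d + a0d)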
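The pairing behind CP2 can be sketched as a sort-and-zip: the latest-arriving output is connected to the fastest input. The function name and the delay values in the usage note are illustrative assumptions, not measurements from the slides.

```python
def balance(arrivals, input_delays):
    """Pair late-arriving signals with fast inputs.

    arrivals maps each driving signal to its arrival time; input_delays
    maps each GPC input to its input-to-output delay. Returns the
    pairing and the resulting critical path.
    """
    late_first = sorted(arrivals, key=arrivals.get, reverse=True)
    fast_first = sorted(input_delays, key=input_delays.get)
    pairing = dict(zip(late_first, fast_first))
    critical_path = max(arrivals[z] + input_delays[a] for z, a in pairing.items())
    return pairing, critical_path
```

With made-up delays `{"z1": 3, "z4": 2, "z6": 1}` and `{"a0": 3, "a2": 2, "a5": 1}`, the pairing is z1 to a5, z4 to a2 and z6 to a0 with a critical path of 4, whereas connecting z1 to a0 would give 3 + 3 = 6.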
Experiments
Bottom-up design
– Atom-level design in Verilog Quartus Module (VQM) format
Top-down
– Heuristic: C++
– Output: Structural VHDL
Altera Quartus-II tool
Benchmarks
– DCT, FIR, ME, G721
– Multiplier
– Horner Polynomial
– Video Mixer
Experiments
Mapping methods
– Ternary
– LUT Only
– Arith1: Arithmetic mode, without delay balancing
– Arith2: Arithmetic mode, with delay balancing
Delay (ns)
[Chart; annotated differences: -27%, +2%]
Area (ALM)
[Chart; annotated differences: +47%, +18%]
Area (LAB)
[Chart; annotated difference: -4.5%]
Conclusion
Conventional wisdom has held that adder trees outperform compressor trees on FPGAs
– Ternary adder trees were a major selling point
Conventional wisdom is wrong!
– GPCs map nicely onto FPGA logic cells and their carry chains
– Compressor trees on FPGAs are faster than adder trees when built from GPCs