Exploiting Fast Carry Chains of FPGAs for Designing Compressor Trees

Hadi P. Afshar, Philip Brisk, Paolo Ienne

Multi-input Additions are Fundamental to DSP and Multimedia Applications

– FIR filters, Motion Estimation, …
– Parallel Multipliers
– Flow Graph Transformation

[Figure: FIR filter built from delay elements (D) and a multi-input summation (Σ)]

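Flow Graph Transformation

[Figure: ADPCM flow graph, BEFORE and AFTER the transformation. BEFORE: vpdiff is built up through steps 1 to 3 by a chain of shifts (>>), bit tests on delta (& 4, & 2, & 1), and additions (+). AFTER: the shifted step values are gated (&) by the bits of delta and summed (∑) in a single compressor tree to produce vpdiff.]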

Compressor vs. Adder Tree

[Figure: a compressor tree (layers of CSAs followed by a single CPA) next to an adder tree (CPAs only)]

Compressors are better than Adder Trees in VLSI

But Adder Trees are better than Compressors in FPGA!
– Slow intra LUT routing
– Poor LUT utilization
– Low logic density

But Compressor Trees can be faster and smaller if Properly Designed
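The structural difference between the two trees can be sketched in a few lines of Python. This is an illustrative model only, not code from the talk: the function names `csa`, `compressor_tree_sum` and `adder_tree_sum` are hypothetical, operands are assumed to be non-negative integers, and Python's `+` stands in for a carry-propagate adder (CPA).

```python
def csa(a, b, c):
    """3:2 carry-save adder on whole words: three operands in, two out."""
    sum_word = a ^ b ^ c                              # per-bit sum, no carry propagation
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, shifted one rank up
    return sum_word, carry_word


def compressor_tree_sum(operands):
    """Reduce with CSA layers until two words remain, then use one CPA."""
    ops = list(operands)
    while len(ops) > 2:
        nxt = []
        for i in range(0, len(ops) - 2, 3):
            nxt.extend(csa(ops[i], ops[i + 1], ops[i + 2]))
        nxt.extend(ops[len(ops) - len(ops) % 3:])     # operands that did not fill a triple
        ops = nxt
    return sum(ops)                                   # the single final CPA


def adder_tree_sum(operands):
    """Reduce pairwise; every node of the tree is a full CPA."""
    ops = list(operands)
    while len(ops) > 1:
        ops = [sum(ops[i:i + 2]) for i in range(0, len(ops), 2)]
    return ops[0] if ops else 0
```

Both functions return the same total; the difference is that the compressor tree propagates carries only once, in the last step.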

Better Compressors on FPGA

– Generalized Parallel Counter (GPC) is the basic block
– More logic density
– Fewer logic levels
– Less pressure on the routing

[Figure: compressor tree built from layers of GPCs followed by a final CPA]

Overview

– Arithmetic Concepts
– Hybrid Design Approach
  – Bottom-up
  – Top-down
– Experiments
– Conclusion

Parallel Counters

Parallel Counter
– Counts the number of input bits set to 1
– Output is a binary value
– 3:2 − Full Adder
– 2:2 − Half Adder

Generalized Parallel Counter (GPC)
– Input bits can have different bit positions
– E.g. (3, 3; 4) GPC

[Figure: m:n counter (∑) with m inputs and n outputs, where n = ⌈log2(m+1)⌉]
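The GPC notation can be made concrete with a small behavioral model in Python. This is a sketch, not part of the original slides: the function names `gpc_output_bits` and `gpc` are hypothetical, and the tuple is assumed to list the number of inputs per column with the most significant column first, so that (3, 3; 4) has three inputs of weight 2, three of weight 1, and four outputs.

```python
def gpc_output_bits(counts):
    """Number of output bits of a GPC.

    counts lists the inputs per column, most significant column first,
    e.g. (3, 3) for the (3, 3; 4) GPC.
    """
    max_value = sum(k << rank for rank, k in enumerate(reversed(counts)))
    return max_value.bit_length()   # equals ceil(log2(max_value + 1))


def gpc(counts, columns):
    """Behavioral GPC: weighted count of the input bits.

    columns holds one list of bits per column, in the same order as counts.
    Returns the output bits, least significant bit first.
    """
    assert [len(col) for col in columns] == list(counts)
    total = sum(sum(col) << rank for rank, col in enumerate(reversed(columns)))
    return [(total >> i) & 1 for i in range(gpc_output_bits(counts))]
```

With a single column the model reduces to an ordinary m:n counter: `gpc_output_bits((3,))` gives 2 for the full adder and `gpc_output_bits((2,))` gives 2 for the half adder.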

Compressor Trees on FPGAs

We propose GPCs as the basic blocks for compressor trees

– Why?
  1. GPCs map well onto FPGA logic cells
  2. GPCs are flexible

GPC Mapping Example

[Figure: dot diagram covered by the GPCs (3,5;4), (3,4;4) and (0,5;3)]

– 5 Counters
– 3 GPCs




Hybrid Design Approach

[Figure: design flow with the boxes FPGA Architectural Characteristics, Compressor Tree Specification, Atom Level GPC HDL Library, GPC Mapped HDL Netlist, Place and Route, and Result; the two halves of the flow are labeled Bottom-Up and Top-Down]

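FPGA Logic Cell

Altera Stratix-II/III/V

[Figure: a Logic Array Block (LAB) containing Adaptive Logic Modules (ALMs); each ALM combines combinational logic with eight numbered inputs (1 to 8), two adders (+), and two registers (Reg)]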

FPGA Logic Cell

ALM Configuration Modes
– Normal
– Extended
– Arithmetic
– Shared Arithmetic

[Figure: two ALM configurations, each drawn as four 4-LUTs feeding two adders (+)]

Bottom-up Design

[Figure: 6:3 GPC producing outputs F0, F1, F2; the drawing is labeled LAB0 and LAB1]

What if we have bigger GPCs like 7:3 GPC?

Can we exploit the carry chain and dedicated adders for building GPCs?

GPC Design Example

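[Figure: (0, 6; 3) GPC with inputs a0 to a5 and outputs z0, z1, z2. It is drawn first as a network of full adders (FA) and half adders (HA) with intermediate signals s0, c0, s1, c1, and then mapped onto two ALMs (ALM0, ALM1): the LUTs compute S(a1,a2,a3), C(a1,a2,a3), S(a4,a5) and C(a4,a5), and the carry-chain adders (+) combine them with a0 to produce z0, z1, z2.]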
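The adder network of the (0, 6; 3) GPC can be checked with a short Python model. The wiring below is a reconstruction from the signal labels on the slide (s0, c0 from a1, a2, a3; s1, c1 from a4, a5; a0 added in the second level), so treat it as an assumption rather than the authors' exact netlist; the function names are hypothetical.

```python
from itertools import product


def half_adder(a, b):
    """Return (sum, carry) of two bits."""
    return a ^ b, a & b


def full_adder(a, b, c):
    """Return (sum, carry) of three bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def gpc_0_6_3(a0, a1, a2, a3, a4, a5):
    """(0, 6; 3) GPC: count six bits of equal weight into z2 z1 z0."""
    s0, c0 = full_adder(a1, a2, a3)      # S(a1,a2,a3), C(a1,a2,a3)
    s1, c1 = half_adder(a4, a5)          # S(a4,a5),    C(a4,a5)
    z0, carry = full_adder(a0, s0, s1)   # weight-1 bits
    z1, z2 = full_adder(c0, c1, carry)   # weight-2 bits
    return z0, z1, z2


def check_all():
    """Exhaustively confirm that z0 + 2*z1 + 4*z2 equals the number of ones."""
    return all(
        z0 + 2 * z1 + 4 * z2 == sum(bits)
        for bits in product((0, 1), repeat=6)
        for z0, z1, z2 in [gpc_0_6_3(*bits)]
    )
```

In the ALM mapping, the first level (s0, c0, s1, c1) corresponds to what the LUTs compute and the second level to the dedicated adders on the carry chain.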

GPC Placement

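[Figure: two GPCs placed back to back on one carry chain of adders (+), with the GPC boundaries marked]

– Zero value on the carry (0) at the GPC boundary
– Logic separation between carry and sum

With the same bit a on both adder inputs:

{cout, s} = cin + a + a = cin + 2a, so cout = a and s = cin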
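The identity above is easy to confirm with a bit-level model. The following Python sketch is illustrative only (the names `adder_cell` and `boundary_cell` are hypothetical): it ties both operand inputs of a carry-chain adder cell to the same bit a and shows that the carry-out then depends only on a, while the sum output simply exposes the incoming carry.

```python
def adder_cell(a, b, cin):
    """One carry-chain adder cell: returns (cout, s)."""
    total = a + b + cin
    return total >> 1, total & 1


def boundary_cell(a, cin):
    """Adder cell with both operand inputs tied to the same bit a."""
    return adder_cell(a, a, cin)


def separates_carry_and_sum():
    """True if cout == a and s == cin for every input combination."""
    return all(
        boundary_cell(a, cin) == (a, cin)
        for a in (0, 1)
        for cin in (0, 1)
    )
```

Choosing a = 0 therefore forces a zero carry into the next GPC on the chain, whatever carry the previous GPC produced.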

[Figure: consecutive GPCs (GPCi, GPCi+1) packed along one chain of LUT + adder cells]

Top-down Heuristic

Mapping_algorithm(Integer : M, Integer : W, Array of Integers : columns)
{
  Build_GPC_library();
  repeat {
    while (col_indx < max_col_indx) {
      if (columns[col_indx] > H)
        Map_by_GPC();   // uses (0, H; ⌈log2(H+1)⌉) GPCs
      else
        col_indx++;
    }
    lsb_to_msb_covering();
    Connect_GPCs_IOs();
    Propagate_comb_delay();
    Generate_next_stage_dots();
  } until three rows of dots remain;
}

[Slide annotations: Step 1, Step 2, Step 3]
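The stage-by-stage reduction can be illustrated with a much simplified Python sketch. It is not the authors' heuristic: it uses only single-column counters of up to H inputs, and it omits the GPC library, the LSB-to-MSB covering with multi-column GPCs, and delay propagation. The names `compress_stage`, `compress` and `value` are hypothetical, and `columns[i]` is assumed to hold the bits of weight 2**i.

```python
def compress_stage(columns, H=6, target=3):
    """One stage: cover every column taller than target with counters of up to H bits."""
    assert H >= 3, "a counter must have fewer outputs than inputs to make progress"
    nxt = [[] for _ in range(len(columns) + H.bit_length())]
    for i, col in enumerate(columns):                # process columns from LSB to MSB
        rest = list(col)
        while len(rest) > target:
            group, rest = rest[:H], rest[H:]
            count = sum(group)                       # behavior of a single-column counter
            for j in range(len(group).bit_length()):
                nxt[i + j].append((count >> j) & 1)  # output bit j lands in column i + j
        nxt[i].extend(rest)                          # short columns pass through unchanged
    while nxt and not nxt[-1]:
        nxt.pop()                                    # drop empty high-order columns
    return nxt


def compress(columns, H=6, target=3):
    """Repeat stages until at most target rows of dots remain."""
    stages = 0
    while max((len(col) for col in columns), default=0) > target:
        columns = compress_stage(columns, H, target)
        stages += 1
    return columns, stages


def value(columns):
    """Integer represented by a dot diagram."""
    return sum(sum(col) << i for i, col in enumerate(columns))
```

Stopping at three rows matches the loop condition in the pseudocode above. Each stage strictly reduces the total number of dots while preserving `value`, which is why the loop terminates.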

Major Step of Heuristic

Process columns from LSB to MSB

[Figure: dot diagram with the labels "Mapped to (0, H; ⌈log2(H+1)⌉) GPCs" and "Height < H"]

Delay Balancing

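[Figure: three GPCs with outputs z0 to z2, z3 to z5 and z6 to z8; outputs z1, z4 and z6 drive inputs a0, a2 and a5 of a next-stage GPC]

Output delays: z1d > z4d > z6d
Input delays: a0d > a2d > a5d

Without delay balancing: CP1 = z1d + a0d
With delay balancing: CP2 = max(z1d + a5d, z4d + a2d, z6d + a0d)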
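The effect of the two connection orders can be reproduced with a small Python sketch. It assumes, as the CP1 and CP2 expressions suggest, that balancing pairs the slowest-arriving signal with the fastest GPC input; the function names and the delay values (in picoseconds) are made up for illustration.

```python
def unbalanced_critical_path(output_delays, input_delays):
    """Worst case: the slowest signal lands on the slowest input (CP1)."""
    return max(output_delays) + max(input_delays)


def balanced_critical_path(output_delays, input_delays):
    """Pair slowest arrival with fastest input, second slowest with second fastest, ... (CP2)."""
    arrivals = sorted(output_delays, reverse=True)   # slowest first
    inputs = sorted(input_delays)                    # fastest first
    return max(z + a for z, a in zip(arrivals, inputs))


# Hypothetical delays in picoseconds, ordered as on the slide:
z_delays = [900, 700, 500]   # z1d > z4d > z6d
a_delays = [400, 300, 200]   # a0d > a2d > a5d

cp1 = unbalanced_critical_path(z_delays, a_delays)   # 900 + 400
cp2 = balanced_critical_path(z_delays, a_delays)     # max(900+200, 700+300, 500+400)
```

With these numbers cp1 is 1300 ps and cp2 is 1100 ps. For equal-length lists the sorted pairing never gives a longer path than the unbalanced worst case.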


Experiments

Bottom-up design
– Atom-level design in Verilog Quartus Mapping (VQM) format

Top-down
– Heuristic: C++
– Output: Structural VHDL

Altera Quartus-II tool

Benchmarks
– DCT, FIR, ME, G721
– Multiplier
– Horner Polynomial
– Video Mixer

Experiments

Mapping methods
– Ternary
– LUT Only
– Arith1: Arithmetic mode, without delay balancing
– Arith2: Arithmetic mode, with delay balancing

[Chart: Delay (ns); annotations: -27%, +2%]

[Chart: Area (ALM); annotations: +47%, +18%]

[Chart: Area (LAB); annotation: -4.5%]




Conclusion

Conventional wisdom has held that adder trees outperform compressor trees on FPGAs
– Ternary adder trees were a major selling point

Conventional wisdom is wrong!
– GPCs map nicely onto FPGA logic cells
  – Carry-chain
– Compressor trees on FPGAs are faster than adder trees when built from GPCs