A Flexible DSP Block to Enhance FPGA Arithmetic Performance
Hadi Parandeh-Afshar (LAP, EPFL), Alessandro Cevrero (LSM, LAP, EPFL), Panagiotis Athanasopoulos (LSM, LAP, EPFL), Philip Brisk (UCR), Yusuf Leblebici (LSM, EPFL), Paolo Ienne (LAP, EPFL)
Ecole Polytechnique Fédérale de Lausanne (EPFL), University of California, Riverside (UCR)
Motivation and contribution
New DSP block for high-performance FPGAs
Increased flexibility
Enhance FPGA arithmetic performance
[Figure: block diagram; labels: programmable compressor tree, PPG, bypassable PPG]
Data flow transformations automatically expose compressor trees
[Figure: arithmetic transformations, panels (a) and (b); signals E1, E2, M1, M2, S1, S2, sign, out]
Fused multiply-add operations cannot use current DSP blocks in a single cycle
DSP blocks cannot accelerate multi-operand addition
[Verma et al., TCAD 08]
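To make the multi-operand addition point concrete, here is a minimal sketch of carry-save addition, the idea behind a compressor tree: 3:2 compressors reduce the operand count without propagating carries, and a single carry-propagate addition finishes the sum. The function names are illustrative and not taken from the slides, and the sketch assumes non-negative integers.

```python
def compress_3_2(x, y, z):
    """3:2 compressor: turn three operands into a sum word and a carry word."""
    s = x ^ y ^ z                               # bitwise sum without carries
    c = ((x & y) | (x & z) | (y & z)) << 1      # majority bits, shifted one column left
    return s, c


def multi_operand_add(operands):
    """Reduce operands with 3:2 compressors, then do one carry-propagate add."""
    ops = list(operands)
    while len(ops) > 2:
        nxt = []
        while len(ops) >= 3:
            x, y, z = ops.pop(), ops.pop(), ops.pop()
            nxt.extend(compress_3_2(x, y, z))
        nxt.extend(ops)          # carry over the 0, 1 or 2 leftover operands
        ops = nxt
    return sum(ops)              # the only carry-propagate addition


print(multi_operand_add([3, 5, 7, 9, 11]))
```

Only the final step needs a carry chain. That is why a block built around a fixed multiplier structure gives little help when many operands must be summed.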
Outline
Related work
Limitations
DSP Block Architecture
Experimental methodology
Results
Conclusions
FPGA commentary
Logic cells with dedicated addition circuitry and fast carry chains
Compressor tree synthesis on 6-LUT FPGAs [Parandeh-Afshar et al., ASPDAC 08, DATE 08, FPL 09]
IP cores [Xilinx, Altera]
FP cores [Beauchamp et al., TVLSI 08]
DSP blocks [Altera Stratix III-IV]
[Figure: Σ block with four pairs of 9-bit inputs]
Field Programmable Compressor Tree (FPCT) [Cevrero et al., FPGA 08, TRETS 09]
User-configurable multi-operand adder
Compressor tree + bypassable CPA
128 = 8×16 input bits
48 = 8×6 output bits
[Figure: CSlice with 16 inputs and 6 outputs; carry-in and carry-out each labeled 15]
FPCT limitations
Partial product generator (PPG) in soft logic
Soft-logic 9x9-bit PPG (81 LUTs)
[Figure: 9x9-bit signed multiplier [Baugh-Wooley]; soft-logic PPG feeding the FPCT over 82 wires; 18-bit output]
Low input utilization for multipliers
64% input utilization
[Figure: multiplier bit mapping across C0 to C6]
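As a rough illustration of what a 9x9-bit signed Baugh-Wooley PPG computes, the sketch below generates the partial product bits and checks that they sum to the signed product. It is a software model of the standard Baugh-Wooley scheme, not the slides' circuit, and the function names are hypothetical.

```python
def baugh_wooley_terms(a, b, n=9):
    """Return ((weight, bit) partial products, constant terms) for n x n signed operands."""
    mask = (1 << n) - 1
    a &= mask                      # two's complement encoding of a
    b &= mask                      # two's complement encoding of b
    terms = []
    for i in range(n):
        for j in range(n):
            bit = ((a >> i) & 1) & ((b >> j) & 1)
            # Complement the bit when exactly one of the two operand bits is a sign bit.
            if (i == n - 1) != (j == n - 1):
                bit ^= 1
            terms.append((i + j, bit))
    constants = [(n, 1), (2 * n - 1, 1)]   # correction ones at weights n and 2n-1
    return terms, constants


def baugh_wooley_multiply(a, b, n=9):
    """Sum all partial product bits and read the result as a signed 2n-bit value."""
    terms, constants = baugh_wooley_terms(a, b, n)
    total = sum(bit << weight for weight, bit in terms + constants)
    total &= (1 << (2 * n)) - 1
    if total >> (2 * n - 1):       # sign bit of the 2n-bit result
        total -= 1 << (2 * n)
    return total


print(baugh_wooley_multiply(-7, 13))
```

For n = 9 the model produces 81 partial product bits, which matches the 81 LUTs quoted for the soft-logic PPG. Every bit is a positive-weight term, so a plain compressor tree can sum them without any sign-extension logic.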
DSP block architecture
Two 9x9 signed PPGs
One modified to support larger multipliers
Hard compression circuits ‘A’ and ‘B’
Efficient synthesis of large multipliers
[Figure: DSP block diagram; labels: FPCT (8 CSlices, 128 inputs, 48 outputs), ½-FPCT (4 CSlices), PPG, PPG*, compression circuits A and B]
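One way to see how large multipliers can be assembled from 9x9 sub-products is the arithmetic identity below: an 18x18 signed product split into four 9-bit-by-9-bit products. This is only a sketch of the arithmetic; how the proposed block maps these sub-products onto its PPGs and compression circuits is not specified in the slides, and the function names are hypothetical.

```python
def split18(x):
    """Split a signed 18-bit value into a signed high half and an unsigned low half."""
    lo = x & 0x1FF      # low 9 bits, always in 0..511
    hi = x >> 9         # arithmetic shift, keeps the sign (-256..255)
    return hi, lo


def mul18_from_9x9(a, b):
    """18x18 signed product built from four 9x9 sub-products."""
    ah, al = split18(a)
    bh, bl = split18(b)
    hh = ah * bh        # signed x signed
    hl = ah * bl        # signed x unsigned
    lh = al * bh        # unsigned x signed
    ll = al * bl        # unsigned x unsigned
    return (hh << 18) + ((hl + lh) << 9) + ll


print(mul18_from_9x9(-100000, 77777) == -100000 * 77777)
```

Only the high halves carry a sign, so the sub-products mix signed and unsigned operands. A plain signed 9x9 PPG does not cover all four cases, which is one plausible reading of why one PPG is modified.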
[Figure: detail showing fixed logic (A) and fixed logic (B) with CSlices C1 to C4]
Only 8% larger than a traditional FPCT in 90nm CMOS (ARTISAN cell library with TSMC process)
Experimental methodology
Virtual Embedded Blocks (VEB) [Ho et al., FCCM 06]
Define a preplaced soft IP core: F*
Same area and I/O as our DSP block
[Figure: preplaced IP cores with input and output pins]
Replace our DSP block with F*
Map benchmark on Stratix II
Extract F* delay
Estimate proposed DSP block delay with an ASIC design flow (90nm CMOS)
[Figure: F* blocks with input and output pins]
For each proposed DSP block in the circuit
Subtract delay of F*
Add proposed DSP block delay
[Figure: new DSP blocks with input and output pins]
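The delay substitution step can be sketched as a small calculation: take the path delay measured with the F* placeholder, subtract the F* delay for each block on the path, and add the delay estimated from the ASIC flow. The function name and all numbers below are hypothetical and are not results from the slides.

```python
def adjusted_path_delay(measured_ns, blocks_on_path, fstar_ns, dsp_ns):
    """Swap the F* placeholder delay for the estimated DSP block delay.

    measured_ns    -- path delay reported by the FPGA tools with F* in place
    blocks_on_path -- number of F* blocks the path goes through
    fstar_ns       -- delay extracted for one F* block
    dsp_ns         -- delay of the proposed DSP block from the ASIC flow
    """
    return measured_ns - blocks_on_path * fstar_ns + blocks_on_path * dsp_ns


# Hypothetical numbers, for illustration only.
print(adjusted_path_delay(measured_ns=10.0, blocks_on_path=2, fstar_ns=3.0, dsp_ns=1.5))
```

Routing and soft-logic delay around the block stays as measured on the Stratix II mapping; only the block's internal delay is swapped.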
Results
[Figure: critical path delay (ns) for m9x9, m10x10, m12x12, m18x18, m20x20; series: Ternary, GPC [Parandeh-Afshar et al., ASPDAC 08], Stratix II DSP block, FPCT w/ soft PPG, proposed DSP block]
[Figure: area normalized to the Stratix II DSP block area for m9x9, m10x10, m12x12, m18x18, m20x20; series: Stratix II DSP block, FPCT w/ soft PPG, proposed DSP block]
Conclusion
New DSP block proposed
Accelerates multiplication and multi-operand addition
More flexibility
Competitive with Stratix II DSP block
Intended to replace the compressor tree in existing DSP blocks
Only 8% area overhead with respect to the original FPCT