AMIN FARMAHININ-FARAHANI

CHARLES TSEN

KATHERINE COMPTON

FPGA Implementation of a 64-bit BID-Based Decimal Floating Point Adder/Subtractor

OUTLINE

• Introduction and Overview
• Baseline Implementation
• FPGA-based Optimizations
    • Multiplier
    • Constant Tables
    • Multiplexers
• Results
• Conclusion

Introduction

The value 0.1 cannot be represented exactly in binary floating point (BFP); the closest single-precision value is 0.100000001490116119384765625.
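The exact value that single precision stores for 0.1 can be checked with a short script. This is a minimal Python sketch and not part of the original presentation; the helper name `nearest_float32` is made up for illustration.

```python
import struct
from decimal import Decimal


def nearest_float32(x: float) -> Decimal:
    """Round x to IEEE 754 single precision and return the exact stored value."""
    # Packing as a 4-byte float rounds x to the nearest single-precision value.
    packed = struct.pack("<f", x)
    (as_f32,) = struct.unpack("<f", packed)
    # Decimal(float) converts the binary value exactly, with no further rounding.
    return Decimal(as_f32)


if __name__ == "__main__":
    print(nearest_float32(0.1))
```

A decimal format stores 0.1 as the significand 1 with exponent -1, so no such error occurs.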

FPGAs are a potential way to add hardware-based decimal floating-point (DFP) engines to existing compute clusters, accelerating DFP calculations without replacing the computing infrastructure.

This was the first presentation of a BID-based DFP adder for FPGAs.

The basic idea in this paper was to take an adder implemented in HDL for standard cells and improve it for the Xilinx Virtex 5.

Intro: 3 Rounding Scenarios

The scenario matters because it changes the number of clock cycles required.

Case 1: The A exponent does not equal the B exponent, and the intermediate significand is no larger than the chosen rounder size.

Case 2: The A exponent equals the B exponent (Aexp = Bexp).

Case 3: The intermediate significand is too large for the rounder.
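The three cases can be sketched as a small classifier. This is only an illustration under stated assumptions, not the authors' logic: it assumes each BID operand is significand × 10^exponent, that alignment scales the significand of the larger-exponent operand by the exponent difference, and that the rounder capacity can be modeled by a hypothetical integer bound `rounder_limit`. The slides do not define any of these details.

```python
def classify_case(a_sig: int, a_exp: int, b_sig: int, b_exp: int,
                  rounder_limit: int) -> int:
    """Return 1, 2, or 3 for the rounding scenario (illustrative model only)."""
    if a_exp == b_exp:
        # Case 2: equal exponents, no alignment needed.
        return 2
    # Assumed alignment step: scale the operand with the larger exponent
    # so both operands share the smaller exponent.
    diff = abs(a_exp - b_exp)
    larger_exp_sig = a_sig if a_exp > b_exp else b_sig
    intermediate = larger_exp_sig * 10 ** diff
    # Case 1 if the intermediate significand fits the assumed rounder bound,
    # otherwise Case 3.
    return 1 if intermediate <= rounder_limit else 3


if __name__ == "__main__":
    limit = 10 ** 20  # hypothetical rounder capacity
    print(classify_case(123, 2, 45, 2, limit))
    print(classify_case(123, 3, 45, 1, limit))
    print(classify_case(9_999_999_999_999_999, 10, 1, 0, limit))
```

In this model, raising `rounder_limit` moves inputs from Case 3 to Case 1, which is the effect a larger multiplier is meant to have.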

Baseline Implementation

The original HDL was synthesized for a Xilinx Virtex 5.

The rounder block is the largest component. The multiplier used for alignment and rounding takes 12 DSP48E blocks.

Several 64-bit 2:1 muxes use LUT resources inefficiently.

There are several constant tables that could be optimized.

Rounder

Rounder Block

Three tables inside the rounder block can be optimized.

The four multiplexers mentioned on the previous slide.

CoreGen multipliers are slower and use more DSP48E blocks than the improved multipliers. This is because they use the DSP blocks instead of LUTs to add partial products.

Another option is to adjust the size of the multiplier (i.e., increase the size so that Case 3 inputs become Case 1).

Decimal Digit Counter Synthesis Results

Design        LUTs   FFs   BRAMs   Period
Baseline      137    69    2       4.31 ns
Merged BRAM   132    64    1       4.31 ns
LUT Based     187    126   0       2.86 ns

• Two of the lookup tables that originally occupied two BRAMs can be merged into a single BRAM.
• The other option is to implement the whole counter using LUTs.
• The Merged BRAM design was chosen: the time savings of the LUT-based version do not affect the overall timing of the adder, so area is more important.
• The other tables were implemented in LUTs because implementing them in BRAM was not an efficient use of resources.
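To illustrate what merging two constant tables into one can look like, here is a hedged Python sketch of a table-driven decimal digit counter. The table contents are an assumption (the slides do not say what the tables hold): one entry per bit length packs a candidate digit count together with the power-of-ten threshold that decides whether one more digit is needed. The names `MERGED_TABLE` and `count_digits` are hypothetical.

```python
# Hypothetical merged table, indexed by the bit length n of the input (0..64).
# Each entry packs two values that could otherwise sit in two separate tables:
#   (candidate digit count, threshold at which the count grows by one)
MERGED_TABLE = [(1, 10)]  # n = 0 means the input is zero, which has one digit
for n in range(1, 65):
    digits = len(str(1 << (n - 1)))  # digits of the smallest n-bit value
    MERGED_TABLE.append((digits, 10 ** digits))


def count_digits(x: int) -> int:
    """Count the decimal digits of an unsigned 64-bit integer."""
    if not 0 <= x < (1 << 64):
        raise ValueError("x must fit in 64 unsigned bits")
    # One lookup returns both fields; one comparison settles the final count.
    digits, threshold = MERGED_TABLE[x.bit_length()]
    return digits + (x >= threshold)


if __name__ == "__main__":
    for value in (0, 9, 10, 999, 1000, 2 ** 64 - 1):
        print(value, count_digits(value))
```

A single lookup works because an n-bit range spans less than a factor of two, so it can cross at most one power of ten.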

Multiplexers

64-bit 2-to-1 MUX

Design         LUTs   DSP48Es   Delay (ns)
LUT-Based      64     0         1.20
Combined LUT   32     0         1.78
DSP-Based      1      2         1.10
DSP-and-LUT    17     1         1.10

• The LUT-Based design is the default LUT implementation, without LUT combining.
• If LUTs are combined, routing congestion lowers the achievable frequency.
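The LUT counts in the first two rows are consistent with a simple model, sketched below in Python. It is an inference and not something the slides state: the model assumes one LUT per output bit in the default mapping and two output bits per LUT when LUTs are combined. The helpers `mux64` and `luts_needed` are hypothetical names.

```python
MASK64 = (1 << 64) - 1


def mux64(sel: int, a: int, b: int) -> int:
    """Behavioral 64-bit 2:1 mux: returns a when sel is 0, b when sel is 1."""
    select_mask = -sel & MASK64  # all ones when sel == 1, all zeros when sel == 0
    return (a & ~select_mask & MASK64) | (b & select_mask)


def luts_needed(width: int, bits_per_lut: int) -> int:
    """Ceiling of width / bits_per_lut, assuming every LUT is fully used."""
    return -(-width // bits_per_lut)


if __name__ == "__main__":
    print(hex(mux64(0, 0x1234, 0xABCD)), hex(mux64(1, 0x1234, 0xABCD)))
    print(luts_needed(64, 1), luts_needed(64, 2))
```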

Control Signals

The baseline implementation used mostly active-low control signals and an asynchronous reset.

The optimized design uses active-high control signals and a single synchronous reset.

This change also reduces the resources used.

Overall Results

The larger multiplier has a slight frequency penalty compared to the smaller multipliers, but moves more input combinations from Case 3 to Case 1. The best multiplier size therefore depends on the characteristics of the applications that use it.

If multiple BID adders are implemented on a single FPGA, the DSP48E blocks are the limiting resource; a Virtex 5 can fit at most five of the BID adders with a pipelined small multiplier, but up to sixteen of the BID adders that use the multi-cycle multiplier.

• This is because the multi-cycle multipliers use far fewer DSP48E blocks than the pipelined multipliers, and are thus a good choice for many parallel DFP units.
• This only degrades BID adder frequency by approximately 2-3 MHz, but reduces the number of input combinations that would incur the worst-case latency.
