A Flexible Implementation of High-Performance FIR


  • 7/31/2019 A Flexible Implementation of High-Performance FIR


    A Flexible Implementation of High-Performance FIR Filter on Xilinx FPGAs

    Tien-Toan Do, Holger Kropp, Carsten Reuter, Peter Pirsch

    Laboratorium für Informationstechnologie, University of Hannover,

    Schneiderberg 32, 30167 Hannover, Germany

    {toan, kropp, reuter}@mst.uni-hannover.de
    http://www.mst.uni-hannover.de/

    Abstract. Finite impulse-response filters (FIR filters) are very commonly used in digital signal processing applications and are traditionally implemented using ASICs or DSP processors. For FPGA implementation they are a challenging subject, due to the high throughput rate and large computational power required under real-time constraints. Indeed, the limited resources of an FPGA, i.e., logic blocks and flip-flops, and furthermore the high routing delays require compact implementations of the circuits. Hence, in lookup-table-based FPGAs, e.g., Xilinx FPGAs, FIR filters have usually been implemented using distributed arithmetic. However, such filters can only be used where the filter coefficients are constant. In this paper, we present approaches for a more flexible FPGA implementation of FIR filters. Using pipelined multipliers which are carefully adapted to the underlying FPGA structure, our FIR filters do not require a predefinition of the filter coefficients. Combining pipelined multipliers and parallel distributed arithmetic results in different trade-offs between hardware cost and flexibility of the filters. We show that clock frequencies of up to 50 MHz are achievable using Xilinx XC 40xx-5 FPGAs.

    1 Introduction

    Belonging to the so-called low-level DSP algorithms, finite impulse-response filtering represents a substantial part of digital signal processing. Low-level DSP algorithms are characterized by their high regularity. On the other hand, they require high computational performance and, if the processing has to be performed under real-time conditions, they have to deal with high throughput rates.

    An N-tap FIR filtering algorithm can be expressed, like many other DSP algorithms, by an arithmetic sum of products:

    $$y(i) = \sum_{k=0}^{N-1} h(k)\, x(i-k) \qquad (1)$$


    where y(i) and x(i) are the response and the input at time i, respectively, and h(k), for k = 0, 1, ..., N-1, are the filter coefficients.
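As an illustration of equation (1), the sum of products can be written out directly in software. The following is a minimal sketch, not the hardware structure discussed in this paper; the function name `fir_direct` and the assumption that samples before time 0 are zero are ours.

```python
def fir_direct(h, x):
    """Evaluate y(i) = sum_{k=0}^{N-1} h(k) * x(i - k) for every input sample.

    Assumption of this sketch: x(j) = 0 for j < 0.
    """
    n_taps = len(h)
    y = []
    for i in range(len(x)):
        acc = 0
        for k in range(n_taps):
            if i - k >= 0:
                acc += h[k] * x[i - k]  # one multiplication per tap
        y.append(acc)
    return y


# A unit impulse at the input reproduces the coefficients at the output.
impulse_response = fir_direct([1, 2, 3, 4, 4, 3, 2, 1],
                              [1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
print(impulse_response)  # [1, 2, 3, 4, 4, 3, 2, 1, 0, 0]
```

Each output sample costs N multiplications in this form, which is the cost the following sections try to reduce.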

    Hence, the implementation of an N-tap FIR filter as expressed in equation (1) requires N multiplications, which are very costly in terms of hardware and computation time. However, in the many cases of digital signal processing where symmetric FIR filters are required, the number of multiplications can be reduced. For the coefficients of such filters, the following relation holds [1]:

    $$h(k) = h(N-k-1), \quad k = 0, 1, 2, \ldots, N-1 \qquad (2)$$

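    Utilizing relation (2) can almost halve the number of required multiplications, since the two input samples weighted by the same coefficient can be added first and multiplied only once. Thus, only symmetric FIR filters are considered here.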
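The saving promised by relation (2) can be sketched as follows for an even number of taps: samples that share a coefficient are added before the multiplication, so only N/2 products are formed per output. The function name `fir_symmetric` and the zero initial state are assumptions of this sketch. Note that pre-adding two 8-bit samples yields a 9-bit operand, which is consistent with the 8 by 9 multipliers mentioned in section 3.

```python
def fir_symmetric(h_half, x):
    """Symmetric FIR with an even number of taps, h(k) = h(N-1-k).

    h_half holds h(0) .. h(N/2 - 1); samples before time 0 are taken as zero
    (an assumption of this sketch).
    """
    half = len(h_half)
    n_taps = 2 * half

    def sample(j):
        return x[j] if j >= 0 else 0

    y = []
    for i in range(len(x)):
        acc = 0
        for k in range(half):
            # Pre-add the two samples that share coefficient h(k) ...
            pre_sum = sample(i - k) + sample(i - (n_taps - 1 - k))
            # ... so that a single multiplication serves both taps.
            acc += h_half[k] * pre_sum
        y.append(acc)
    return y


print(fir_symmetric([1, 2, 3, 4], [1, 0, 0, 0, 0, 0, 0, 0]))
# [1, 2, 3, 4, 4, 3, 2, 1]
```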

    Further, filters whose coefficients are constant can be implemented at a low hardware cost using bit-plane structures, distributed arithmetic (DA) [2] or lookup-table multipliers (LUTMULT) instead of conventional hardware multipliers. Especially for FPGAs where lookup tables (LUTs) are the underlying logic blocks, e.g., Xilinx FPGAs [3], DA techniques [4] and LUTMULT are a convenient way to realize FIR filters with constant coefficients at low cost. Nevertheless, such filters are unsuitable if the filter coefficients have to be varied frequently. This is the case, e.g., when emulating algorithms, where the influence of such variations of the algorithm parameters on the quality of the processed signals must be investigated.

    Hence, in this paper, we present approaches leading to an efficient, flexible and modular realization of symmetric FIR filters on Xilinx XC 40xx FPGAs. FPGA implementations of pipelined filters using parallel distributed arithmetic, together with implementation results, are discussed in section 2. In section 3, the alternative approach of an implementation using conventional hardware multipliers which are carefully adapted to the underlying FPGA structure is considered. Concluding remarks are provided in section 4.

    2 Distributed-Arithmetic FIR Filters

    In essence, distributed arithmetic (DA) is a computation technique that performs multiplication using lookup-table-based schemes [5]. DA techniques permit computations in the form of a sum of products, as expressed in equation (1), to be decomposed into repetitive lookup-table procedures, the results of which are then accumulated to produce the final result.

    Since Xilinx XC 4000 FPGAs are based on lookup tables, distributed arithmetic is a convenient way to implement multiply-intensive algorithms like FIR filters, provided that one of the multiplication operands is constant. The bits of the other operand are then used as address lines for looking up a table which is, in fact, a storage, e.g., ROM or RAM, where the potential products from the multiplication of the first operand by the potential values of the second operand are stored (Fig. 1). FPGA implementation of FIR filters using serial distributed arithmetic has been proposed in [4] and [6], where implementation results are also described.
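The lookup-table principle described above can be sketched in software. The example below assumes unsigned input samples and a single table addressed by four bits, chosen only for illustration; the names `build_da_lut` and `da_dot` are hypothetical. In a fully parallel hardware realization the loop over bit positions would correspond to separate tables working side by side, whereas a serial realization would reuse one table over several clock cycles.

```python
def build_da_lut(coeffs):
    """Precompute every possible partial sum of the constant coefficients.

    Entry `addr` holds the sum of coeffs[k] over all k whose bit is set in addr.
    """
    lut = []
    for addr in range(1 << len(coeffs)):
        lut.append(sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1))
    return lut


def da_dot(lut, samples, n_bits):
    """Compute sum_k coeffs[k] * samples[k] using only lookups, shifts and adds.

    Assumption of this sketch: samples are unsigned and n_bits wide.
    """
    acc = 0
    for b in range(n_bits):
        # Bit b of every sample forms the table address.
        addr = 0
        for k, s in enumerate(samples):
            addr |= ((s >> b) & 1) << k
        # Weight the looked-up partial sum by 2^b and accumulate.
        acc += lut[addr] << b
    return acc


lut = build_da_lut([3, -1, 4, 2])
print(da_dot(lut, [200, 17, 255, 0], 8))  # 1603 = 3*200 - 17 + 4*255 + 0
```

The table depends only on the coefficients, which is why changing a coefficient means regenerating the table contents.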

    We realize fully parallel DA FIR filters on Xilinx XC 4000 as depicted in figure 1, where an 8-tap, 8-bit symmetric filter is sketched. To assure a compact realization of the circuit, the LUT sizes are tailored to the required precision of the output data. So, for a given precision, the LUT sizes are not uniform (cf. [6]), but depend on the positions of the individual bits, i.e., LUTs for the less significant bits are smaller. Furthermore, in order to obtain high performance, the filters are pipelined after every 4-bit adder, whose delay amounts to about 18 ns to 20 ns on an XC 4000-5. The 8-tap, 8-bit symmetric FIR filter depicted in figure 1 requires 140 CLBs and can run at frequencies of up to 50 MHz on an XC 4000-5. The latency of this filter is 14 clock cycles.
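The relation between the quoted adder delay and the clock frequency can be checked with a short calculation. It assumes that the pipelined 4-bit adder stage is the critical path and that the quoted delay already covers register and routing overhead, which the text does not state explicitly:

$$f_{\max} \approx \frac{1}{t_{\text{stage}}}, \qquad \frac{1}{20\ \text{ns}} = 50\ \text{MHz}, \qquad \frac{1}{18\ \text{ns}} \approx 55.6\ \text{MHz}$$

At a clock of 50 MHz, a latency of 14 clock cycles corresponds to

$$14 \times 20\ \text{ns} = 280\ \text{ns}.$$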

    [Figure 1: block diagram showing the input registers (REG), the lookup tables (LUT) and the adders (+) between the input "in" and the output "out"]

    Fig. 1. Distributed-Arithmetic FIR Filter on Xilinx XC 4000

    While fully precise filtering requires the data stored in every LUT to be 10 bits wide and produces 19-bit outputs, the maximal absolute error (= 1024) caused by our 8-bit, 8-tap FIR filter depicted in Fig. 1 is about the same as that caused by the corresponding Xilinx DA filter (= 1022), whose LUTs are uniformly wide and require 36 CLBs. The LUTs in our design require 27 CLBs in total.

    Hence, high-performance digital filters can be implemented at a low hardware cost on LUT-based FPGAs using the DA technique. The main drawback of this approach is that the DA technique requires the predefinition of the filter coefficients. In many applications of FPGAs for DSP, e.g., hardware emulation of DSP algorithms, filters are needed which allow a frequent and flexible modification of the filter coefficients.

    3 FIR Filters with Conventional Hardware Multipliers

    Though multipliers are costly, using them is inevitable for filters whose coefficients have to be varied frequently. Hence, we have investigated an efficient FPGA implementation of FIR filters using pipelined array multipliers.

    For processing at a sample rate comparable to that of the above DA filter, the 8 by 9 multipliers of the filter are two-row pipelined [7], as illustrated in figure 2, and their structure is adapted and carefully mapped onto the target architecture, i.e., the Xilinx FPGA. Further, for the same precision as for the above DA filter, the eight rightmost product bits from the multiplication are cut off (max. absolute error = 1023). The filter has a latency of 13 clock cycles and requires 390 CLBs. The achievable frequency for this filter on an XC 4000-5 is about 45 MHz to 50 MHz. In comparison with a parallel distributed-arithmetic FIR filter (Fig. 1), the hardware cost for the FIR filter with conventional hardware multipliers (Fig. 2) is increased, while the performance is about the same.

    [Figure 2: block diagram showing the input registers (REG), the coefficient inputs (coeff.), the multipliers (MULT) and the adders (+, ADD3, ADD4) between the input "in" and the output "out"]

    Fig. 2. FIR Filter with conventional multipliers on Xilinx XC 4000
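The behaviour of such a multiplier-based filter can be modelled in a few lines. This is a behavioural sketch only, not the pipelined array-multiplier structure of [7]; the function name `fir_mult_truncated`, the zero initial state and the choice to truncate each product separately are assumptions made for illustration.

```python
def fir_mult_truncated(coeffs_half, x, drop_bits=8):
    """Symmetric FIR with run-time coefficients and truncated products.

    coeffs_half holds h(0) .. h(N/2 - 1). The drop_bits rightmost bits of
    every product are cut off before accumulation. Samples before time 0
    are taken as zero (an assumption of this sketch).
    """
    half = len(coeffs_half)
    n_taps = 2 * half

    def sample(j):
        return x[j] if j >= 0 else 0

    y = []
    for i in range(len(x)):
        acc = 0
        for k in range(half):
            pre_sum = sample(i - k) + sample(i - (n_taps - 1 - k))
            product = coeffs_half[k] * pre_sum  # a real multiplication
            acc += product >> drop_bits         # discard the low-order bits
        y.append(acc)
    return y


# The coefficients are ordinary arguments and can change between calls.
print(fir_mult_truncated([10, 20, 30, 40], [255] * 8)[7])  # 196
```

Because the coefficients enter as data rather than as table contents, changing them requires no regeneration of lookup tables, which is the flexibility argued for in this section.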


    Because the achievable frequencies for the above DA filters and the filters with conventional hardware multipliers are about the same, they can be combined in a hybrid approach leading to different trade-offs between hardware cost and flexibility.

    4 Conclusions

    Using Xilinx XC 40xx-5 FPGAs, clock frequencies of up to about 45 MHz to 50 MHz are achievable for FIR filters. While the DA approach leads to low-cost implementations of FIR filters on lookup-table-based FPGAs, FIR filters with conventional hardware multipliers are more flexible. In spite of their high cost, such filters are desirable in the many cases where the filter coefficients have to be varied frequently. An example is the hardware emulation of algorithms, where the influence of variations of the algorithm parameters, e.g., filter coefficients, on the processing has to be investigated. Combining the above approaches leads to different trade-offs regarding hardware cost and flexibility.

    Acknowledgment

    One of the authors, T.-T. Do, receives a scholarship from the German Academic Exchange Service (Deutscher Akademischer Austauschdienst, DAAD). He is grateful to this organization for supporting his research.

    References

    1. A. V. Oppenheim, R. W. Schafer: Digital Signal Processing, Prentice Hall (1975)

    2. P. Pirsch: Architectures for Digital Signal Processing, John Wiley & Sons (1997)

    3. Xilinx Inc.: The Programmable Logic Data Book (1996)

    4. L. Mintzer: FIR Filters with Field-Programmable Gate Arrays, IEEE Journal of VLSI Signal Processing (August 1993) 119-128

    5. C. S. Burrus: Digital Filter Structures Described by Distributed Arithmetic, IEEE Trans. on Circuits and Systems (1977) 674-680

    6. Xilinx Inc.: Core Solutions (May 1997)

    7. T.-T. Do, H. Kropp, M. Schwiegershausen, P. Pirsch: Implementation of Pipelined Multipliers on Xilinx FPGAs - A Case Study, 7th International Workshop on Field-Programmable Logic and Applications, Proceedings (1997)