
Procedia Computer Science 54 (2015) 605–611

Available online at www.sciencedirect.com

1877-0509 © 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of organizing committee of the Eleventh International Multi-Conference on Information Processing-2015 (IMCIP-2015). doi: 10.1016/j.procs.2015.06.070

ScienceDirect

Eleventh International Multi-Conference on Information Processing-2015 (IMCIP-2015)

An Efficient 256-Tap Parallel FIR Digital Filter Implementation Using Distributed Arithmetic Architecture

Amita Nandal a,∗, T. Vigneswarn b, Ashwani K. Rana a and Arvind Dhaka c

a Department of ECE, NIT Hamirpur

b Department of ECE, VIT University, Chennai

c Department of CSE, NIT Hamirpur

Abstract

This paper discusses the FPGA implementation of Finite Impulse Response (FIR) filters using Distributed Arithmetic (DA), which substitutes multiply-and-accumulate operations with a series of Look-Up-Table (LUT) accesses. Parallel FIR digital filters can be used for either high-speed or low-power applications. Distributed arithmetic provides a multiplication-free method for calculating inner products of fixed-point data, based on table lookups of precalculated partial products. Implementation results are provided to demonstrate that the proposed architecture achieves high speed and low power. The proposed filter is implemented in Very High Speed Integrated Circuit Hardware Description Language (VHDL) and verified via simulation. The proposed method offers average reductions of 60% in the number of LUTs, 40% in occupied slices and 50% in the number of gates for parallel FIR filter implementation.

Keywords: Distributed arithmetic; DSP; Finite impulse response; LUT; MAC; Parallel filters.

1. Introduction

Due to the intensive use of Finite Impulse Response (FIR) filters in video and communication systems, high performance in speed, area and power consumption is demanded. Digital filters are used to modify the characteristics of signals in the time and frequency domains and have been recognized as a primary digital signal processing element. In Digital Signal Processing (DSP), design methods have mainly focused on multiplier-based architectures to implement the Multiply-And-Accumulate (MAC) blocks that constitute the central piece of FIR filters and several other functions. Fast parallel filter structures have been discussed in detail in [1–9]. FIR filters are important building blocks for various DSP applications. Recently, because of the increasing demand for video-signal processing and transmission, high-speed and high-order programmable FIR filters have frequently been used to perform adaptive pulse shaping and signal equalization on the received data in real time, such as ghost cancellation [10, 11] and channel equalization [12]. Hence, an efficient VLSI architecture for a high-speed programmable FIR filter is needed.

∗Corresponding author. Tel.: 09736494921. E-mail address: amita−[email protected]


Fig. 1. Block diagram of conventional parallel FIR filtering.

In this work, we address optimizations in the multiplier block of FIR filters through architectural-level improvements, using the distributed arithmetic concept of digital filtering to improve device utilization.

Distributed Arithmetic (DA) is a high-speed multiplication technique used for the implementation of digital filters and signal up-conversion [21, 22]. DA is a bit-serial, word-parallel approach in which the throughput rate does not depend on the filter length or data size.

This paper is organized as follows. Section 2 briefly describes the background, covering parallel FIR filters and the distributed arithmetic based filtering scheme. Section 3 presents the proposed FIR filter design. Section 4 shows the synthesis results obtained with the Xilinx tools. Section 5 summarizes the conclusions.

2. Background Study

2.1 Parallel FIR filters

An FIR filter can be expressed mathematically by equation (1) [13, 26]:

$$y[n] = \sum_{i=0}^{N-1} h[i]\, x[n-i] \qquad (1)$$

where x represents the input signal, h the filter coefficients, y the output signal, y[n] is the current output sample, and N is the number of taps of the filter. In the sequential implementation, a set of Multiply-And-Accumulate (MAC) operations is performed for each sample of the input signal, multiplying the N delayed input samples by the coefficients and summing the results to generate the output signal. In parallel implementations, there are two main architectures. The first consists of unrolling the MAC loop, where several delayed versions of the input signal enter a fully parallel multiplier block, followed by a summation block. The other consists of a multiplier block that takes the same input signal and delivers each output to an input of a delayed summation block. Fig. 1 shows the basic block diagram of conventional parallel FIR filtering.
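As an illustration of the sequential MAC loop behind equation (1), the following minimal Python sketch computes the filter output sample by sample. The function name `fir_direct` and the assumption that the input is zero before the first sample are choices made for this example only; they are not taken from the paper.

```python
def fir_direct(h, x):
    """Causal FIR output y[n] = sum_i h[i] * x[n - i], same length as x.

    Samples before x[0] are assumed to be zero (illustrative assumption).
    """
    y = []
    for n in range(len(x)):
        acc = 0  # accumulator of the MAC unit
        for i in range(len(h)):
            if n - i >= 0:
                acc += h[i] * x[n - i]  # one multiply-and-accumulate
        y.append(acc)
    return y


# A unit impulse at the input reproduces the coefficients at the output.
print(fir_direct([1, 2, 3], [1, 0, 0, 0]))  # [1, 2, 3, 0]
```

Each output sample costs N multiplications here, which is the cost that the multiplier-block optimizations discussed next try to reduce.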

Several techniques for optimizing the multiplier block of parallel FIR filters have been proposed in the literature. Most of them use a fixed-point representation and a transposed-form implementation, because in this form it is easier to obtain common subexpressions that can be shared among two or more multipliers [14, 15]. Many use some kind of Signed Digit (SD) representation, mainly the Canonical Signed Digit (CSD) representation [16, 17], which results in fewer non-zero digits in each coefficient and usually in a smaller multiplier block. Previous research has shown reductions of more than 50% [16] in the number of adders by using these techniques. The great advantage of these techniques is that the optimized filter has the same behavior as the original non-optimized one (i.e., the same impulse response or transfer function). Another approach consists of representing each coefficient as a sum of power-of-two terms and limiting the number of power-of-two terms in each coefficient [17–20].
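To make the CSD idea concrete, the sketch below converts a non-negative integer coefficient to signed digits in {-1, 0, 1} using the standard non-adjacent-form recoding. The function name `to_csd` and the restriction to non-negative integers are assumptions made for this example; the paper does not give a conversion procedure.

```python
def to_csd(value):
    """Return the CSD digits of a non-negative integer, least significant first.

    Each digit is -1, 0 or 1, and no two adjacent digits are both non-zero.
    """
    digits = []
    while value != 0:
        if value & 1:
            # value mod 4 == 1 gives digit +1, value mod 4 == 3 gives digit -1
            d = 2 - (value & 3)
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits


coeff = 23                      # binary 10111 has four non-zero bits
csd = to_csd(coeff)             # 32 - 8 - 1 has three non-zero digits
print(csd, sum(1 for d in csd if d))
```

Every non-zero digit corresponds to one shifted add or subtract in a multiplierless constant multiplier, so fewer non-zero digits generally means fewer adders.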

An FIR digital filter of degree N is described by its impulse response h_n, n = 0, 1, 2, ..., N, and its transfer function is represented as follows:

$$H(z) = \sum_{n=0}^{N} h_n z^{-n} \qquad (2)$$


Fig. 2. Concept of existing distributed arithmetic based filtering.

The traditional L-parallel FIR filter can be derived using polyphase decomposition [3] as

$$\sum_{p=0}^{L-1} Y_p(z^L)\, z^{-p} = \sum_{q=0}^{L-1} X_q(z^L)\, z^{-q} \sum_{r=0}^{L-1} H_r(z^L)\, z^{-r} \qquad (3)$$

A two-parallel FIR filter can be expressed as

$$Y_0 + z^{-1} Y_1 = (H_0 + z^{-1} H_1)(X_0 + z^{-1} X_1) \qquad (4)$$

$$= H_0 X_0 + z^{-1}(H_0 X_1 + H_1 X_0) + z^{-2} H_1 X_1 \qquad (5)$$

Equations (4) and (5) imply that

$$Y_0 = H_0 X_0 + z^{-2} H_1 X_1 \qquad (6)$$

$$Y_1 = H_0 X_1 + H_1 X_0 \qquad (7)$$
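The two-parallel decomposition can be checked numerically. The sketch below splits the input and the coefficients into even and odd polyphase components, forms the four sub-filter outputs, and interleaves the two output phases; the z^{-2} term appears as a one-block delay on the odd-odd product. The function names and the zero initial state are assumptions made for this example, not part of the paper.

```python
def fir_direct(h, x):
    """Causal FIR output y[n] = sum_i h[i] * x[n - i], zero initial state."""
    return [
        sum(h[i] * x[n - i] for i in range(len(h)) if n - i >= 0)
        for n in range(len(x))
    ]


def fir_two_parallel(h, x):
    """Two-parallel FIR: produces two output samples per block of two inputs."""
    if len(x) % 2:
        raise ValueError("input length must be even for 2-sample blocks")
    x0, x1 = x[0::2], x[1::2]  # even and odd input phases
    h0, h1 = h[0::2], h[1::2]  # even and odd coefficient phases
    h0x0 = fir_direct(h0, x0)
    h1x1 = fir_direct(h1, x1)
    h0x1 = fir_direct(h0, x1)
    h1x0 = fir_direct(h1, x0)
    y = []
    for k in range(len(x0)):
        delayed = h1x1[k - 1] if k >= 1 else 0  # the z^-2 term, one block late
        y.append(h0x0[k] + delayed)             # even output phase
        y.append(h0x1[k] + h1x0[k])             # odd output phase
    return y


h = [1, 2, 3, 4]
x = [1, 0, -1, 2, 3, 5]
print(fir_two_parallel(h, x) == fir_direct(h, x))  # True
```

The block form uses four half-length sub-filters running at half the sample rate, which is the hardware cost that the fast parallel structures cited in the introduction aim to lower.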

2.2 Distributed arithmetic based filtering scheme

Distributed arithmetic was first introduced by Croisier et al. [23] and was later extended to cover signed data. With the development of FPGA technology, it was then adopted in FPGA design to save MAC blocks. High-performance FIR filters based on DA using an LMS architecture are implemented in [24, 25].

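If h[n] are the filter coefficients and x[n] is the input sequence to be processed, the output of the N-length FIR filter can be written in the final distributed arithmetic form as

$$y = -2^{B} \sum_{n=0}^{N-1} h[n]\, x_B[n] + \sum_{b=0}^{B-1} 2^{b} \sum_{n=0}^{N-1} h[n]\, x_b[n] \qquad (8)$$

where x_b[n] denotes bit b of the two's-complement sample x[n] and x_B[n] is its sign bit.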

Figure 2 shows the existing DA based filtering scheme [27].
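A minimal software model of the DA inner product in equation (8) is sketched below, assuming integer coefficients and two's-complement inputs. The LUT stores every possible sum of coefficients, one bit from each input sample forms the LUT address, and the looked-up partial sums are shifted and accumulated, with the sign-bit plane subtracted. The names `build_lut` and `da_inner_product` and the parameter `bits` (equal to B + 1) are hypothetical and chosen for this example; the sketch models the arithmetic only, not the hardware of Fig. 2.

```python
def build_lut(h):
    """LUT entry at address a holds the sum of h[k] for every set bit k of a."""
    n = len(h)
    return [
        sum(h[k] for k in range(n) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_inner_product(h, x, bits):
    """Compute sum_n h[n] * x[n] without multiplications.

    Each x[n] must fit in `bits`-bit two's complement (bits = B + 1).
    """
    lut = build_lut(h)
    mask = (1 << bits) - 1
    words = [v & mask for v in x]  # two's-complement bit patterns
    acc = 0
    for b in range(bits):
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> b) & 1) << k  # bit b of every sample forms the address
        partial = lut[addr]
        if b == bits - 1:
            acc -= partial << b  # sign-bit plane carries weight -2^B
        else:
            acc += partial << b  # other planes carry weight +2^b
    return acc


h = [3, -1, 4, 2]
x = [5, -7, 0, -8]
print(da_inner_product(h, x, bits=4))  # 6, equal to 3*5 + (-1)*(-7) + 4*0 + 2*(-8)
```

The LUT has 2^N entries, so for long filters the coefficient set is normally partitioned into small groups, each with its own table.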

3. Proposed Work

The basic LUT-DA scheme on an FPGA consists of three main components: the input registers, the 4-input LUT unit and the shifter/accumulator unit. Additionally, it requires a control unit to manage the filter operation and an adder tree unit to add the partial filter results. In the proposed approach, the 4-input LUT unit is not accessed directly; instead, a 2-input LUT is used, chosen by a multiplexer select. The concept of the multiplexer based DA filtering scheme is shown in Fig. 3.
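The paper does not spell out the internal organization of the 2-input LUTs, so the following sketch only illustrates the general LUT-partitioning idea commonly used in DA, not the authors' multiplexer circuit. It assumes a 4-coefficient group whose 16-entry table is replaced by two 4-entry tables and one addition; all names are hypothetical.

```python
def full_lut(h):
    """Table of all coefficient sums: entry a is the sum of h[k] over set bits k of a."""
    return [
        sum(c for k, c in enumerate(h) if (addr >> k) & 1)
        for addr in range(1 << len(h))
    ]


def split_lookup(h, addr):
    """Look up a 4-bit address using two 2-input tables instead of one 4-input table."""
    lut_lo = full_lut(h[:2])  # 4 entries, addressed by the two low address bits
    lut_hi = full_lut(h[2:])  # 4 entries, addressed by the two high address bits
    return lut_lo[addr & 0b11] + lut_hi[addr >> 2]


h = [3, -1, 4, 2]
lut4 = full_lut(h)
print(all(split_lookup(h, a) == lut4[a] for a in range(16)))  # True
print(len(lut4), len(full_lut(h[:2])) + len(full_lut(h[2:])))  # 16 8
```

Under these assumptions the stored table contents shrink from 16 to 8 entries per 4-tap group at the cost of one extra adder, which is the kind of trade-off a split-LUT design exploits.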

The proposed DA based filter architecture for a 256-tap parallel FIR filter is shown in Fig. 4. This architecture uses the multiplexer based DA filtering scheme shown in Fig. 3. A particular 2-input LUT, holding the possible sum combinations of the filter coefficients, is selected. This implies about a 50% reduction in the number of LUTs used, with increased speed. There are two main aspects to be considered when designing a parallel filter, namely the number of bits required for the signal and the required transfer function of the filter. The former determines the word length of the entire data path. The latter is determined by two parameters, namely the number of taps and the number of bits in each coefficient. The multipliers are the most expensive blocks in terms of area, delay, and power in an FIR filter when considering a custom implementation. In fact, even for a dedicated implementation of constant-coefficient multipliers, the amount of hardware needed is very high because the filter contains several multipliers.

To evaluate the performance of the proposed scheme, 4-tap, 8-tap, 16-tap, 32-tap, 64-tap, 128-tap and 256-tap parallel FIR filters are implemented in VHDL and synthesized in Xilinx ISE 8.1i. The results are compared with the conventional parallel FIR filtering scheme and the existing DA based implementation.

Fig. 3. Multiplexer based DA filtering scheme.

Fig. 4. Proposed parallel digital FIR filter architecture.

4. Results and Discussion

The simulation was done using ModelSim 6.4, and the Xilinx Integrated Software Environment (ISE) was used for synthesis and implementation of the designs on a Spartan-3 device. The power analysis was done using the Xilinx XPower tool. The device utilization of the proposed DA architecture can be understood easily with the help of the results in the graphs shown below.

Fig. 5. Comparison of number of LUTs used.

Fig. 6. Comparison of number of occupied slices.

Fig. 7. Comparison of number of gates used.

Fig. 8. Comparison of delay.

Figure 5 reports the comparison of the number of LUTs used by the various filter architectures designed using the existing and proposed methods. It shows that the proposed multiplexer based DA filter outperforms the existing DA based parallel FIR filter. The number of LUTs is reduced by 66% using the proposed method as compared to the DA based method for parallel FIR filter implementation. This reduction is 19% when compared with the conventional filter implementation.

Figure 6 plots the comparison of the number of occupied slices among the various parallel FIR filter architectures. It shows that the proposed multiplexer based DA filter uses 42% fewer slices than the existing DA based filter architecture. This reduction is 20% when compared to the conventional architecture.

Figure 7 reports the comparison of the number of gates used among the various filter architectures designed. It shows that the proposed multiplexer based DA filter uses 55% fewer gates than the existing DA based filter architecture. This reduction is 14% when compared to the conventional architecture.

Figure 8 reports the delay comparison among the various filter architectures designed. It shows that the proposed multiplexer based DA filter is 10% faster than the existing DA based filter architecture. This improvement is 5% when compared to the conventional architecture.


Fig. 9. Comparison of power consumption.

Fig. 10. Comparison of Power Delay Product (PDP).

Figure 9 reports the power consumption comparison among the various filter architectures designed. It shows that the proposed multiplexer based DA filter consumes 55% less power than the existing DA based filter architecture. This reduction is 27% when compared to the conventional architecture.

The overall performance metric (i.e., the power-delay product) of the proposed design is also improved, as shown in Fig. 10.

5. Conclusion

We have presented an efficient multiplexer based DA scheme for implementing parallel FIR filters. The device utilization of the proposed architecture is comparatively low because it uses a split-LUT technique. The proposed method is implemented for 4-tap to 256-tap parallel FIR filters and can be extended to more taps. A high-speed, low-area implementation is achieved. The test results indicate that the filter designed using the proposed distributed arithmetic scheme operates stably at high speed and can save almost 50 percent of hardware resources. For various digital signal processing applications, the parameters and the order of the filter can be changed accordingly.

References

[1] D. A. Parker and K. K. Parhi, Low-Area/Power Parallel FIR Digital Filter Implementations, Journal of VLSI Signal Process. Syst., vol. 17(1), pp. 75–92, (1997).

[2] J. G. Chung and K. K. Parhi, Frequency-Spectrum-Based Low-Area Low-Power Parallel FIR Filter Design, EURASIP J. Appl. Signal Process., vol. 9, pp. 444–453, (2002).

[3] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, (New York, Wiley), (1999).

[4] Z. J. Mou and P. Duhamel, Short-Length FIR Filters and their use in Fast Nonrecursive Filtering, IEEE Trans. Signal Process., vol. 39(6), pp. 1322–1332, (1991).

[5] C. Cheng and K. K. Parhi, Hardware Efficient Fast Parallel FIR Filter Structures based on Iterated Short Convolution, IEEE Trans. Circuits Syst., 51(8), pp. 1492–1500, (2004).

[6] J. I. Acha, Computational Structures for Fast Implementation of L-Path and L-Block Digital Filters, IEEE Trans. Circuits Syst., 36(6), pp. 805–812, (1989).

[7] I. S. Lin and S. K. Mitra, Overlapped Block Digital Filtering, IEEE Trans. Circuits Syst. II: Analog Digit. Signal Process., vol. 43, pp. 586–596, (1996).

[8] C. Cheng and K. K. Parhi, Further Complexity Reduction of Parallel FIR Filters, Proc. IEEE Int. Symp. Circuits Syst., (Kobe, Japan), pp. 1835–1838, (2005).

[9] R. C. Agarwal and J. W. Cooley, New Algorithms for Digital Convolution, IEEE Trans. Acoust. Speech, Signal Process., vol. 25(5), pp. 392–410, (1977).

[10] B. Edwards, A. Corry, N. Weste and C. Greenberg, A Single Chip Ghost Canceller, IEEE 1992 Custom Integrated Circuits Conference, pp. 26.5.1–26.5.14, (1992).


[11] J. R. Choi, L. H. Jang, S. W. Jung and J. H. Choi, Structured Design of a 288-tap FIR Filter by Optimized Partial Product Tree Compression, IEEE J. Solid-State Circuits, 32(3), pp. 468–476, (1997).

[12] D. J. Pearson, S. K. Reynolds, A. C. Megdanis, S. Gowda, K. R. Wrenner, M. Immediato, R. L. Galbraith and H. J. Shin, Digital FIR Filters for High Speed PRML Disk Read Channels, IEEE J. Solid-State Circuits, vol. 30(12), pp. 1517–1523, (1995).

[13] R. W. Hamming, Digital Filters, 3rd edition, (Prentice Hall), (1989).

[14] M. Potkonjak, M. B. Srivastava and A. Chandrakasan, Efficient Substitution of Multiple Constant Multiplication by Shifts and Addition using Iterative Pairwise Matching, Proc. 31st ACM/IEEE Design Automation Conf., pp. 189–194, (1994).

[15] M. Mehendale, S. D. Sherlekar and G. Venkatesh, Synthesis of Multiplier-less FIR Filters with Minimum Number of Additions, Proc. IEEE/ACM Int. Conf. Computer-Aided Design, pp. 668–671, (1995).

[16] R. Pasko, P. Schaumont, V. Derudder, S. Vernalde and D. Durackova, A New Algorithm for Elimination of Common Subexpressions, IEEE Trans. Computer-Aided Design, vol. 18, pp. 58–68, (1999).

[17] H. Samueli, An Improved Search Algorithm for the Design of Multiplier-Less FIR Filters with Powers-of-Two Coefficients, IEEE Trans. Circuits Syst., vol. 36, pp. 1044–1047, (1989).

[18] K. H. Chen and T. D. Chiueh, Design and Implementation of a Reconfigurable FIR Filter, Proc. of 2003 Int. Symp. Circuits Systems, pp. 25–28, (2003).

[19] C. Lim, J. B. Evans and B. Liu, Decomposition of Binary Integers into Signed Power-of-Two Terms, IEEE Trans. Circuits Syst., vol. 38, pp. 667–672, (1991).

[20] J. Portela, E. Costa and J. Monteiro, Optimal Combination of Number of Taps and Coefficient Bit-Width for Low Power FIR Filter Realization, IEEE European Conference on Circuit Theory and Design, pp. 145–148, (2003).

[21] B. New, A Distributed Arithmetic Approach to Designing Scalable DSP Chips, EDN, (1995).

[22] W. P. Burleson and L. L. Scharf, A VLSI Design Method for Distributed Arithmetic, VLSI Sig. Proc., vol. 2, (1991).

[23] Croisier, D. J. Esteban, M. E. Levilion and V. Rizo, Digital Filter for PCM Encoded Signals, U.S. Patent, no. 3,777,130, (1973).

[24] B. K. Mohanty and P. K. Meher, A High-Performance Energy-Efficient Architecture for FIR Adaptive Filter based on New Distributed Arithmetic Formulation of Block LMS Algorithm, IEEE Transactions on Signal Processing, vol. 61(4), pp. 921–930, (2013).

[25] E. Trakultritrung and E. Thanangchusin, Distributed Arithmetic LMS Adaptive Filter Implementation without Look-up Table, 9th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), pp. 1–4, (2012).

[26] Yu-Chi Tsao and Ken Choi, Area-Efficient Parallel FIR Digital Filter Structures for Symmetric Convolutions based on Fast FIR Algorithm, IEEE Transactions on VLSI Systems, vol. 20(2), pp. 366–371, (2012).

[27] T. Vigneswaran and P. Subbaramani Reddy, Design of Digital FIR Filter based on Dynamic Distributed Arithmetic Algorithm, Indian Journal of Applied Sciences, vol. 7(9), pp. 2908–2910, (2007).