Post on 17-Sep-2020
CHAPTER 4
IMPLEMENTATION OF DIGITAL UPCONVERTER AND
DIGITAL DOWNCONVERTER FOR WiMAX SYSTEM
4.1 Introduction
FPGAs provide an ideal implementation platform for developing broadband wireless
systems such as WCDMA and WiMAX. To accelerate the performance of these
broadband systems, state-of-the-art high performance FPGAs are used.
FPGAs have gained rapid acceptance over the past decade because they can be
applied to a very wide range of applications. Using logic blocks and programmable
routing resources, FPGAs can be configured to implement custom hardware functionality.
Because FPGAs are completely reconfigurable, they can be reprogrammed for new
applications. The development of high level design tools such as System Generator and DSP
Builder has shortened the design cycle.
As FPGAs are truly parallel in nature, different processing operations do not have
to compete for the same resources. Each independent processing task is assigned to a
dedicated section of the chip, and can function autonomously without any influence from
other logic blocks. FPGAs with resources dedicated to DSP applications are also
available. Thus the same filtering operations currently implemented in custom VLSI
devices can now be implemented in an FPGA device (Sun, M.T. et al., 1989).
Distributed Arithmetic (DA) can be explored to save resources in FPGA
implementation of DSP functions. DA trades combinational logic elements for memory,
resulting in low cost look up table (LUT) based FPGA implementations. The designer
can also select a serial or parallel DA implementation to trade off speed against
resource utilization (Stanley A. White, 1989).
In this chapter, FPGA implementations of the DUC and DDC for a WiMAX system are
proposed using DA. Different configurations for serial and parallel implementations are presented
and compared. The resulting implementations are compared in terms of resource utilization for a
Stratix II GX device. DSP Builder is used to implement pipelining and scaling of parameters.
The basics of the DA architecture and methods to reduce the ROM requirement are presented in section
4.2. An overview and the architecture of the Stratix II GX device are presented in section 4.3. Serial and
parallel implementations of the FIR filter with DA architecture are explored in section 4.4.
Implementation of the DUC and DDC is presented in sections 4.5 and 4.6 respectively.
4.2 Distributed Arithmetic Architecture
DA is a very efficient mechanism to trade combinational logic for memory in high
performance computation. DA can significantly help to save area in DSP hardware design.
When the number of elements in a vector is nearly the same as the word size, DA is quite
fast because it replaces the explicit multiplications by ROM look ups, which is an efficient
technique to implement on Field Programmable Gate Arrays (FPGAs) (Sun, M.T. 1989).
Figure 4.1: Basic Architecture of Distributed Arithmetic
In DA, multiplications are reordered and mixed in such a way that the arithmetic becomes
distributed through the structure rather than being lumped. With the advent of FPGA technology,
DA plays a significant role in improving system performance. The basic architecture for DA
implementation is shown in figure 4.1. No multipliers are required for the DA implementation;
only accumulators, registers and read only memories (ROMs) are used. The N
bit registers are used to store the input vectors. This is illustrated with an example in
which a general sum of products (SOP) equation (4.1), which defines the response of linear, time
invariant networks, is implemented with the DA architecture shown in figure 4.2.
\[ y_n = \sum_{k=0}^{M-1} a_k \, b_k(n) \tag{4.1} \]
where $y_n$ is the response of the network at time n, $b_k(n)$ is the kth input variable at
time n and $a_k$ is the weighting factor of the kth input variable, which is constant for all n
and so remains time invariant (Xilinx application note).
Because the coefficients are constants, the partial sums can be precomputed. For each
bit position, the sum over the M inputs has only $2^M$ possible values, which can be stored
in a ROM of $2^M$ words. The bit serial
input data can be used to directly address the ROM contents, which are added into an
accumulator to obtain the inner sum. Additional control circuitry is required to handle
subtraction when the sign bit addresses the ROM (Chung, J. C., et al., 1998). The
accumulator output converges to the final result after N cycles. To show this process, a FIR
filter implemented using the DA architecture is shown in figure 4.2. The input vector X
holds four elements of four bits each. The ROM contains all 16 possible sums of the
constant vector elements $A_i$. Each of the $X_i$ elements is delivered one bit at a time, with
the MSB first. Every clock cycle, the register contains the sum of the left shifted version
of the previous register value and the current ROM contents. $T_s$ is the sign bit signal used to control
Figure 4.2: FIR Filter using Distributed Arithmetic
the addition/subtraction operation. When $T_s$ is high, the accumulator subtracts the current
ROM contents from the left shifted version of the previous result, and when it is low, the
accumulator adds the current ROM contents to the left shifted previous result. After four
cycles, the register holds the final dot product. The only drawback is the size of
the required ROM, which grows exponentially with each added input address line.
Each element in a vector requires one address line, so a vector of K elements needs K address
lines and a ROM of $2^K$ words.
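The MSB-first shift-accumulate procedure described above can be sketched in Python as follows. This is only a behavioural illustration, assuming integer coefficients and two's complement integer inputs; the function names `build_rom` and `da_dot_product` are hypothetical and not taken from the cited references.

```python
def build_rom(coeffs):
    """ROM word at address `addr` holds the sum of the coefficients
    whose address bit is 1 (2**K words for K coefficients)."""
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << len(coeffs))]


def da_dot_product(coeffs, xs, width):
    """Bit-serial DA dot product of `coeffs` with `width`-bit
    two's complement integers `xs`, processed MSB first."""
    rom = build_rom(coeffs)
    mask = (1 << width) - 1
    words = [x & mask for x in xs]           # two's complement bit patterns
    acc = 0
    for bit in range(width - 1, -1, -1):     # MSB (sign bit) first
        addr = 0
        for i, w in enumerate(words):
            addr |= ((w >> bit) & 1) << i    # one address line per element
        if bit == width - 1:
            acc = (acc << 1) - rom[addr]     # T_s high: subtract on the sign bit
        else:
            acc = (acc << 1) + rom[addr]     # T_s low: add
    return acc


if __name__ == "__main__":
    a = [3, -1, 4, 2]
    x = [5, -8, 7, -3]                       # four 4-bit inputs, as in figure 4.2
    print(da_dot_product(a, x, 4), sum(c * v for c, v in zip(a, x)))
```

After `width` iterations the accumulator equals the ordinary dot product, without any multiplication being performed.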
This increased ROM size problem can be reduced by two methods (Ansari, Z.A.
2003). The first method is based on ROM decomposition, which is shown in figure
4.3. Here the memory is partitioned into smaller parts, and the outputs of all ROMs are
added using an additional adder. The amount of memory is reduced from $2^N$ words to
$2 \cdot 2^{N/2}$
Figure 4.3: Reducing the memory using decomposition.
words, if the original memory is partitioned into two parts. For N = 8, the number of words
to be stored is reduced from $2^8 = 256$ to $2 \cdot 2^4 = 32$. Hence, this approach reduces the
memory significantly at the cost of an additional adder.
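The word counts quoted for the decomposition, and the fact that two half-size ROMs plus one adder give the same result as the single large ROM, can be checked with a short sketch. The helper names below are illustrative assumptions, and the address bits are assumed to be split into a lower and an upper half.

```python
def rom_words(n_inputs, parts=1):
    """Total ROM words when n_inputs address lines are split evenly
    across `parts` smaller ROMs."""
    assert n_inputs % parts == 0
    return parts * 2 ** (n_inputs // parts)


def build_rom(coeffs):
    """Word at `addr` is the sum of coefficients whose address bit is 1."""
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << len(coeffs))]


def partitioned_lookup(coeffs, addr):
    """Same partial sum as the full ROM, using two half-size ROMs
    and one additional adder."""
    half = len(coeffs) // 2
    rom_lo = build_rom(coeffs[:half])        # addressed by the lower address bits
    rom_hi = build_rom(coeffs[half:])        # addressed by the upper address bits
    lo_addr = addr & ((1 << half) - 1)
    hi_addr = addr >> half
    return rom_lo[lo_addr] + rom_hi[hi_addr]


if __name__ == "__main__":
    print(rom_words(8), rom_words(8, parts=2))   # 256 32
```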
The second approach is based on a special coding of the ROM content. Memory
size can be halved by using the inventive scheme based on the identity
\[ x = \frac{1}{2}\left[ x - (-x) \right] \tag{4.2} \]
In two's complement representation, a negative number is obtained by inverting all bits
and then adding a 1 to the least significant position of the original number.
The identity 4.2 can be rewritten as (Stanley A. White, 1989)
\[ x = \frac{1}{2}\left[ -x_0 + \sum_{k=1}^{W_d-1} x_k 2^{-k} - \left( -\bar{x}_0 + \sum_{k=1}^{W_d-1} \bar{x}_k 2^{-k} + 2^{-(W_d-1)} \right) \right] \tag{4.3} \]
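\[ x = -(x_0 - \bar{x}_0)\, 2^{-1} + \sum_{k=1}^{W_d-1} (x_k - \bar{x}_k)\, 2^{-k-1} - 2^{-W_d} \tag{4.4} \]
Here $W_d$ is the data word length and $\bar{x}_k$ denotes the complement of bit $x_k$.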
Notice that $(x_k - \bar{x}_k)$ can only take on the values -1 or +1. Using this expression in
the FIR filter equation yields
\[ y = \sum_{k=1}^{W_d-1} F_k(x_{1k}, x_{2k}, \ldots, x_{Nk})\, 2^{-k-1} - F_0(x_{10}, x_{20}, \ldots, x_{N0})\, 2^{-1} + F(0, 0, \ldots, 0)\, 2^{-W_d} \tag{4.5} \]
where
\[ F_k(x_{1k}, x_{2k}, \ldots, x_{Nk}) = \sum_{i=1}^{N} a_i (x_{ik} - \bar{x}_{ik}) \]
The function $F_k$ is shown in Table 4.1 for N = 3.
Table 4.1: Address and Contents of ROM

x_1  x_2  x_3    F_k                   y_1  y_2    A/S
 0    0    0     -a_1 - a_2 - a_3       0    0      A
 0    0    1     -a_1 - a_2 + a_3       0    1      A
 0    1    0     -a_1 + a_2 - a_3       1    0      A
 0    1    1     -a_1 + a_2 + a_3       1    1      A
 1    0    0     +a_1 - a_2 - a_3       1    1      S
 1    0    1     +a_1 - a_2 + a_3       1    0      S
 1    1    0     +a_1 + a_2 - a_3       0    1      S
 1    1    1     +a_1 + a_2 + a_3       0    0      S
Notice that only half the values are needed, since the other half can be obtained by
changing the signs. To exploit this redundancy, the address is modified as shown
on the right of table 4.1, using equations 4.6 and 4.7.
\[ y_1 = x_1 \oplus x_2 \tag{4.6} \]
\[ y_2 = x_1 \oplus x_3 \tag{4.7} \]
Here, variable $x_1$ has been selected as the control signal. The add/sub control (i.e., $x_1$)
must also provide the correct addition/subtraction function when the sign bits are
accumulated. Therefore, the following add/subtract control signal is used:
\[ A/S = x_1 \oplus x_{signbit} \tag{4.8} \]
where the control signal $x_{signbit}$ is zero at all times except when the sign bit arrives.
Figure 4.4 shows the resulting principle for distributed arithmetic with halved ROM. Only
N - 1 variables are used to address the memory. The XOR gates used for halving the
memory can be merged with the XOR gates used for inverting the function $F_k$.
Figure 4.4: Distributed arithmetic with smaller ROM
This technique for reducing the memory size can easily be implemented using a small
modification of the shift accumulator.
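The address modification for N = 3 can be verified numerically. The sketch below assumes the definition of $F_k$ given above (each address bit contributes $+a_i$ when it is 1 and $-a_i$ when it is 0); the function names are hypothetical, and sign-bit handling from equation 4.8 is left out.

```python
def f_full(a, x):
    """F_k for address bits x: each bit adds +a_i when 1 and -a_i when 0."""
    return sum(ai if xi else -ai for ai, xi in zip(a, x))


def build_half_rom(a):
    """Store only the four entries with x1 = 0, indexed by (y1, y2)."""
    return {(y1, y2): f_full(a, (0, y1, y2))
            for y1 in (0, 1) for y2 in (0, 1)}


def f_from_half_rom(half_rom, x):
    """Recover F_k for any 3-bit address from the halved ROM."""
    x1, x2, x3 = x
    y1, y2 = x1 ^ x2, x1 ^ x3          # equations 4.6 and 4.7
    value = half_rom[(y1, y2)]
    return -value if x1 else value     # x1 selects add (A) or subtract (S)


if __name__ == "__main__":
    a = (3, 5, 7)                      # arbitrary example coefficients
    half = build_half_rom(a)
    for x1 in (0, 1):
        for x2 in (0, 1):
            for x3 in (0, 1):
                x = (x1, x2, x3)
                print(x, f_full(a, x), f_from_half_rom(half, x))
```

All eight values of $F_k$ are reproduced from four stored words, which is the halving shown in table 4.1.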
4.3 General FPGA Architecture
Major FPGA specifications include the number of configurable logic blocks (CLBs), the
number of fixed function logic blocks, such as multipliers, and the size of memory resources.
Although an FPGA chip has many other parts, these are typically the most
Figure 4.5: Different Parts of an FPGA
important when selecting and comparing FPGAs. The configurable blocks of logic, such
as slices or logic cells, are made up of two basic components: flip-flops and LUTs. Figure 4.5
shows the different parts of an FPGA.
Figure 4.6: Structure of an FPGA
The structure of FPGA is array based, meaning that each chip comprises a two
dimensional array of logic blocks that can be interconnected via horizontal and vertical
routing channels. An illustration of this type of architecture is shown in figure 4.6. The
CLB is based on LUTs. A LUT is a small one bit wide memory array, where the address
lines for the memory are inputs of the logic block and the one bit output from the memory
is the LUT output. A LUT with K inputs therefore corresponds to a $2^K$ × 1 bit memory
and can realize any logic function of its K inputs by programming the logic function's
truth table directly into the memory.
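The view of a LUT as a small one bit wide memory can be illustrated with a minimal sketch. The functions `program_lut` and `lut_read` are illustrative names only, and input 0 is assumed to be the least significant address line.

```python
def program_lut(func, k):
    """Return the 2**k one-bit memory contents that realise `func`,
    a Boolean function of k inputs (its truth table)."""
    table = []
    for addr in range(1 << k):
        inputs = [(addr >> i) & 1 for i in range(k)]
        table.append(func(*inputs) & 1)
    return table


def lut_read(table, *inputs):
    """Evaluate the LUT: the inputs form the memory address."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return table[addr]


if __name__ == "__main__":
    # A four-input LUT programmed as a 4-input XOR (parity) function.
    xor4 = program_lut(lambda a, b, c, d: a ^ b ^ c ^ d, 4)
    print(len(xor4), lut_read(xor4, 1, 0, 1, 1))
```

Reprogramming the table contents changes the logic function without changing the hardware structure.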
4.3.1 Stratix II FPGAs
The Stratix II family of FPGAs is based on a 1.5 V, 0.13 μm, all layer copper SRAM
process, with densities of up to 79,040 logic elements (LEs) and up to 7.5 Mbits of RAM
(Altera publication, 2002). Stratix devices offer up to 22 digital signal processing (DSP)
blocks with up to 176 (9-bit × 9-bit) embedded multipliers, optimized for DSP
applications that enable efficient implementation of high performance filters. Stratix
devices support various I/O standards and also offer a complete clock management
solution with a hierarchical clock structure with up to 420 MHz performance.
Stratix devices contain a two dimensional row and column based architecture to
implement custom logic. A series of column and row interconnects of varying length and
speed provide signal interconnects between logic array blocks (LABs), memory block
structures, and DSP blocks. The logic array consists of LABs, with 10 logic elements
(LEs) in each LAB. An LE is a small unit of logic providing efficient implementation of
user logic functions. LABs are grouped into rows and columns across the device. M512
RAM blocks are simple dual port memory blocks with 512 bits. These blocks provide
dedicated simple dual port or single port memory up to 18 bits wide. M512 blocks are
grouped into columns across the device in between certain LABs. M4K RAM blocks are
dual port memory blocks with 4K bits plus parity (4,608 bits). These blocks provide
dedicated dual port, simple dual port, or single port memory up to 36 bits wide. These
blocks are grouped into columns across the device in between certain LABs. M-RAM
blocks are dual port memory blocks with 512K bits. These blocks provide dedicated dual
port, simple dual port, or single port memory up to 144-bits wide. Several M-RAM blocks
are located individually or in pairs within the device's logic array. DSP blocks can
implement up to either eight full precision 9 × 9-bit multipliers, four full-precision 18 ×
18-bit multipliers, or one full-precision 36 × 36-bit multiplier with add or subtract
features. These blocks also contain 18-bit input shift registers for digital signal processing
applications, including FIR and infinite impulse response (IIR) filters. DSP blocks are
grouped into two columns in each device (Altera publication, 2002).
Figure 4.7: Block Diagram of Stratix II FPGA
Each Stratix device I/O pin is fed by an I/O element (IOE) located at the end of LAB rows
and columns around the periphery of the device. I/O pins support numerous single ended
and differential I/O standards. Each IOE contains a bidirectional I/O buffer and six
registers for registering input, output, and output enable signals. The number of M512
RAM, M4K RAM, and DSP blocks varies by device along with row and column numbers
and M-RAM blocks.
4.3.1.1 Logic Array Blocks (LABs)
The LAB local interconnect can drive LEs within the same LAB. The LAB local
interconnect is driven by column and row interconnects and LE outputs within the same
LAB (Altera publication, 2002).
Figure 4.8: Stratix LAB Structure
Neighbouring LABs, M512 RAM blocks, M4K RAM blocks, or DSP blocks from the left
and right can also drive an LAB's local interconnect through the direct link connection.
The direct link connection feature minimizes the use of row and column interconnects,
providing higher performance and flexibility. Each LE can drive 30 other LEs through fast
local and direct link interconnects.
Each LAB contains dedicated logic for driving control signals to its LEs. The
control signals include two clocks, two clock enables, two asynchronous clears,
synchronous clear, asynchronous preset/load, synchronous load, and add/subtract control
signals. This gives a maximum of 10 control signals at a time. Although synchronous load
and clear signals are generally used when implementing counters, they can also be used
with other functions. Each LAB's clock and clock enable signals are linked. If the LAB
uses both the rising and falling edges of a clock, it also uses both LAB clock signals. De-
asserting the clock enable signal will turn off the LAB clock. Each LAB can use two
asynchronous clear signals and an asynchronous load/preset signal. The asynchronous
load acts as a preset when the asynchronous load data input is tied high. With the LAB
"addnsub" (see figure 4.9) control signal, a single LE can implement a one bit adder and
subtractor. This saves LE resources and improves performance for logic functions such as
DSP correlators and signed multipliers that alternate between addition and subtraction
depending on data.
4.3.1.2 Logic Elements (LEs)
The smallest unit of logic in the Stratix architecture, the LE, is compact and provides
advanced features with efficient logic utilization. Each LE contains a four-input LUT,
which is a function generator that can implement any function of four variables (Altera
publication, 2002). In addition, each LE contains a programmable register and carry chain
with carry select capability. A single LE also supports dynamic single bit addition or
subtraction mode selectable by an LAB-wide control signal. Each LE drives all types of
interconnects: local, row, column, LUT chain, register chain, and direct link interconnects.
Each LE's programmable register can be configured for D, T, JK or SR operation.
Figure 4.9: Block Diagram of Stratix LE
Each register has data, true asynchronous load data, clock, clock enable, clear, and
asynchronous load/preset inputs. Global signals, general-purpose I/O pins, or any internal
logic can drive the register's clock and clear control signals. Either general purpose I/O
pins or internal logic can drive the clock enable, preset, asynchronous load, and
asynchronous data. The asynchronous load data input comes from the data3 input of the
LE. Each LE has three outputs that drive the local, row, and column routing resources.
The LUT or register output can drive these three outputs independently. Two LE outputs
drive column or row and direct link routing connections and one drives local interconnect
resources. This allows the LUT to drive one output while the register drives another output.
This improves device utilization because the device can use the register and the LUT
for unrelated functions.
4.3.1.3 TriMatrix Memory
TriMatrix memory consists of three types of RAM blocks: M512, M4K, and M-RAM
blocks (Altera publication, 2002). Although these memory blocks differ, they
all can implement various types of memory with or without parity, including true dual
port, simple dual port, and single port RAM, ROM, and FIFO buffers. The largest
TriMatrix memory block, the M-RAM block, is useful for applications where a large
volume of data must be stored on-chip. The M-RAM block can be configured in true dual
port RAM, simple dual port RAM, single port RAM and FIFO RAM mode. Only
synchronous operation is supported in the M-RAM block. The memory address and output
width can be configured as 64K × 8 bits, 32K × 16 bits, 16K × 32 bits, 8K × 64 bits, and
4K × 128 bits. Mixed width configurations are also possible, allowing different read and
write widths.
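As a quick arithmetic check, all of the M-RAM depth × width configurations listed above address the same 512K data bits. The sketch below simply multiplies them out; the constant and function names are illustrative, and parity bits are not counted.

```python
# (depth in words, width in bits) for each configuration quoted in the text
MRAM_CONFIGS = [
    (64 * 1024, 8),
    (32 * 1024, 16),
    (16 * 1024, 32),
    (8 * 1024, 64),
    (4 * 1024, 128),
]


def capacity_bits(depth, width):
    """Data capacity of one configuration in bits."""
    return depth * width


def all_same_capacity(configs):
    """True when every configuration has the same total capacity."""
    return len({capacity_bits(d, w) for d, w in configs}) == 1


if __name__ == "__main__":
    for depth, width in MRAM_CONFIGS:
        print(depth, "x", width, "=", capacity_bits(depth, width), "bits")
```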
4.3.1.4 Digital Signal Processing Block
The most commonly used DSP functions are finite impulse response (FIR) filters,
complex FIR filters, infinite impulse response (IIR) filters, fast Fourier transform (FFT)
functions and discrete cosine transform (DCT) functions. Additionally, some applications
need specialized operations such as multiply-add and multiply accumulate operations.
Stratix devices provide DSP blocks to meet the arithmetic requirements of these functions.
Each Stratix device has two columns of DSP blocks to efficiently implement DSP
functions faster than LE-based implementations. Each DSP block can be configured to
support up to eight 9 × 9-bit multipliers, four 18 × 18-bit multipliers or one 36 × 36-bit
multiplier (Altera publication, 2002).
As indicated, the Stratix DSP block can support one 36 × 36-bit multiplier in a single
DSP block. This is true for any matched sign multiplications, but the capabilities for
dynamic and mixed sign multiplications are handled differently. The largest functions
that can fit into a single DSP block are 36 × 36-bit unsigned by unsigned
multiplication, 36 × 36-bit signed by signed multiplication, 35 × 36-bit unsigned by signed
multiplication, 36 × 35-bit signed by unsigned multiplication, 36 × 35-bit signed by
dynamic sign multiplication, 35 × 36-bit dynamic sign by signed multiplication, 35 × 36-
bit unsigned by dynamic sign multiplication, 36 × 35-bit dynamic sign by unsigned
multiplication, 35 × 35-bit dynamic sign multiplication when the sign controls for each
operand are different or 36 × 36-bit dynamic sign multiplication when the same sign
control is used for both operands. DSP block multipliers can optionally feed an
adder/subtractor or accumulator within the block depending on the configuration. This
makes routing to LEs easier, saves LE routing resources, and increases performance,
because all connections and blocks are within the DSP block. So the DSP block registers
can be efficiently used to implement shift registers for FIR filter applications.
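The relationship between the four 18 × 18-bit multipliers and the single 36 × 36-bit multiplier of a DSP block can be illustrated with the usual partial product decomposition. The sketch below covers the unsigned case only and is an arithmetic illustration, not a description of the exact internal circuit; `mul36_from_18` is a hypothetical name.

```python
MASK18 = (1 << 18) - 1


def mul36_from_18(a, b):
    """Unsigned 36 x 36-bit product built from four 18 x 18-bit
    partial products, combined with shifts and additions."""
    assert 0 <= a < (1 << 36) and 0 <= b < (1 << 36)
    a_hi, a_lo = a >> 18, a & MASK18
    b_hi, b_lo = b >> 18, b & MASK18
    p_hh = a_hi * b_hi                 # 18 x 18 multiplier 1
    p_hl = a_hi * b_lo                 # 18 x 18 multiplier 2
    p_lh = a_lo * b_hi                 # 18 x 18 multiplier 3
    p_ll = a_lo * b_lo                 # 18 x 18 multiplier 4
    return (p_hh << 36) + ((p_hl + p_lh) << 18) + p_ll


if __name__ == "__main__":
    a, b = 12345678901, 9876543210
    print(mul36_from_18(a, b) == a * b)
```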
4.3.1.5 Modes of Operation
Together with its adder, subtractor, and accumulate functions, a DSP block supports simple
multiplier, multiply accumulator and multipliers adder modes of operation. In simple multiplier
mode, shown in figure 4.10, the DSP block drives the multiplier sub block result directly
to the output with or without an output register. Up to four 18 × 18-bit multipliers or eight
9 × 9-bit multipliers can drive their results directly out of one DSP block. DSP blocks can
also implement one 36 × 36-bit multiplier in multiplier mode. DSP blocks use four 18 ×
18-bit multipliers combined with dedicated adder and internal shift circuitry to achieve 36-
bit multiplication. In MAC mode, the DSP block drives multiplied results to the
adder/subtractor/accumulator block configured as an accumulator as shown in figure 4.11.
Two multiply-accumulators up to 18 × 18 bits can be implemented in one DSP block.
The first and third multiplier subblocks are unused in this mode, because only
one multiplier can feed one of two accumulators. The multiply accumulator output can be
up to 52 bits. The “addnsub” signal can set the accumulator for decimation and the
overflow signal indicates underflow condition (Altera publication, 2002). For FIR filters,
the DSP block combines the four multipliers adder mode with the shift register inputs.
Figure 4.10: Block Diagram of DSP block in Simple Multiplier Mode
Figure 4.11: Block Diagram of DSP block in Multiply Accumulate Mode
One set of shift inputs contains the filter data, while the other holds the coefficients loaded
in serial or parallel. The input shift register eliminates the need for shift registers external
to the DSP block. This architecture simplifies filter design since the DSP block
implements all of the filter circuitry. One DSP block can implement an entire 18-bit FIR
filter with up to four taps.
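A behavioural sketch of a four tap FIR filter, as a sum of four products fed from an input shift register, is given below. It is only a software model written for illustration, assuming integer data and coefficients; the class name `FourTapFir` is hypothetical and no DSP block timing or word length limits are modelled.

```python
from collections import deque


class FourTapFir:
    """Four tap FIR: an input shift register feeding four multipliers
    whose products are summed by an adder."""

    def __init__(self, coeffs):
        assert len(coeffs) == 4
        self.coeffs = list(coeffs)
        self.taps = deque([0, 0, 0, 0], maxlen=4)   # input shift register

    def step(self, sample):
        """Shift in one sample and return the filter output."""
        self.taps.appendleft(sample)                # oldest sample drops out
        return sum(c * x for c, x in zip(self.coeffs, self.taps))


if __name__ == "__main__":
    fir = FourTapFir([1, 2, 3, 4])
    # An impulse input returns the coefficients in order.
    print([fir.step(x) for x in [1, 0, 0, 0, 0]])
```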
Figure 4.12: Block Diagram of DSP block in Four Multiplier Adder Mode
For filters with more taps, DSP blocks can be cascaded accordingly
(Altera publication, 2002).
4.3.1.6 I/O Structure
The IOE in Stratix devices contains a bidirectional I/O buffer, six registers and a latch for
a complete embedded bidirectional single data rate or DDR transfer. As shown in figure
4.13, the IOE contains two input registers with latch, two output registers and two output
enable registers. The design can use both input registers and the latch to capture DDR
input and both output registers to drive DDR outputs.
Figure 4.13: Stratix IOE structure
Additionally, the design can use the output enable register for fast clock to output enable
timing. The negative edge-clocked OE register is used for DDR SDRAM interfacing. The
Quartus II software automatically duplicates a single OE register that controls multiple
output or bidirectional pins. The IOEs are located in I/O blocks around the periphery of
the Stratix device. There are up to four IOEs per row I/O block and six IOEs per column
I/O block. The row I/O blocks drive row, column, or direct link interconnects. The column
I/O blocks drive column interconnects (Altera publication, 2002).
Although resources can be reduced by using the FPGA architecture efficiently,
further improvement in the FPGA design can be obtained by using DA with a
suitable structural implementation.
4.4 Distributed Arithmetic FIR Filter
As discussed in chapter 3, FIR filters have the advantage of linear phase, high stability,
fewer finite precision errors and efficient implementation. However, they suffer from the
requirement of higher order, i.e. more coefficients are required as compared to an IIR filter.
This high order imposes more hardware requirements, arithmetic operations, area
usage and power consumption when designing and fabricating the filter. Therefore
reducing these parameters is a major objective, which can be attained with the help of
efficient use of DA in FPGA implementation. Mathematically, the FIR filter can be written as
\[ y[n] = \sum_{k=0}^{N} a_k \, x[n-k] \tag{4.9} \]
In Equation 4.9, x[n] represents the input, y[n] represents the filter output and $a_k$
represents the filter coefficients. This filter is of Nth order and contains N+1 taps.
Equation 4.9 can be implemented conventionally by using multipliers, adders and delay
elements as shown in figure 4.14. The delay elements can be implemented using memory
elements, and at any time only the N most recent inputs need to be stored (Chang, T. S. and
Jen, C. W., 1999). However, implementing the FIR filter in this manner is expensive, as it
consumes N+1 MAC units, which is a large number for a high filter order N.
Figure 4.14: Conventional method for FIR Filter Implementation
To overcome this problem of high MAC unit requirements, DA architecture can be used,
which is very efficient in implementing the Sum Of Products (SOP) (Stanley A. White,
1989). DA implements MAC operations using LUTs/ROMs instead of dedicated
multipliers. DA is bit serial in nature, and parallel implementations can be developed by
using serial DA FIRs in parallel.
Let the input variable x[n − k], which is in 2's complement fixed point fractional
format, contain M bits and let |x[n − k]| < 1. It can then be expressed as
\[ x[n-k] = -x_{k,0} + \sum_{m=1}^{M-1} x_{k,m} \, 2^{-m} \tag{4.10} \]
In Equation 4.10, $x_{k,0}$ is the Most Significant Bit (MSB) or sign bit and $x_{k,M-1}$ is the
Least Significant Bit (LSB) of the M bit variable x[n−k]. It must be noted that the $x_{k,m}$
are binary variables and can only assume values 0 or 1. Substituting Equation 4.10 in
Equation 4.9, we get
\[ y[n] = -\sum_{k=0}^{N} x_{k,0} \, a_k + \sum_{k=0}^{N} \sum_{m=1}^{M-1} x_{k,m} \, a_k \, 2^{-m} \tag{4.11} \]
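Equation 4.11 can be expanded and rearranged as
\[
\begin{aligned}
y[n] = {} & -\left[ x_{0,0} a_0 + x_{1,0} a_1 + x_{2,0} a_2 + \cdots + x_{N,0} a_N \right] \\
& + \left[ x_{0,1} a_0 + x_{1,1} a_1 + x_{2,1} a_2 + \cdots + x_{N,1} a_N \right] 2^{-1} \\
& + \left[ x_{0,2} a_0 + x_{1,2} a_1 + x_{2,2} a_2 + \cdots + x_{N,2} a_N \right] 2^{-2} \\
& + \cdots \\
& + \left[ x_{0,M-1} a_0 + x_{1,M-1} a_1 + x_{2,M-1} a_2 + \cdots + x_{N,M-1} a_N \right] 2^{-(M-1)}
\end{aligned}
\tag{4.12}
\]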
In Equation 4.12, each product inside the square brackets is a logical AND of an input
bit with a coefficient, and the plus signs denote arithmetic addition. The negative powers of 2,
which appear outside the brackets, can be implemented simply by shifting the results of the
computation to the right. So the MAC operations in Equation 4.9 are now converted to
addition, subtraction, shifting and logical AND operations (Stanley A. White, 1989). Bits
of the input variable can be used to address the LUT.
A serial DA FIR filter can be constructed using a single LUT and time sharing it to
process all the bits. Input shift registers (ISR) are required to supply bits serially to the
LUT in the serial DA FIR filter shown in figure 4.15. Bits are output from the ISR MSB first.
To construct the parallel DA FIR filter shown in figure 4.16, M LUTs are required. The 1st
bits of all the inputs are connected to the 1st LUT, the 2nd bits of all the inputs are connected
to the 2nd LUT and so on (Tyler J. Moeller and David R. Martinez, 1999). The parallel filter
produces one output every clock cycle, whereas the serial filter produces one output every
M clock cycles. The addresses and LUT contents are calculated from equation 4.13
and shown in table 4.2.
\[ F = x_{0,0} a_0 + x_{1,0} a_1 + x_{2,0} a_2 \tag{4.13} \]
Table 4.2: Address and Contents of an LUT

x_{0,0}  x_{1,0}  x_{2,0}    Contents
   0        0        0       0
   0        0        1       a_2
   0        1        0       a_1
   0        1        1       a_2 + a_1
   1        0        0       a_0
   1        0        1       a_0 + a_2
   1        1        0       a_0 + a_1
   1        1        1       a_0 + a_1 + a_2
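The LUT contents of table 4.2 and the bit plane evaluation of Equation 4.12 can be sketched in Python as follows. This is an illustrative software model, assuming each input is given as an integer two's complement code that represents the fraction code / 2^(M−1); the function names are hypothetical.

```python
def build_lut(coeffs):
    """LUT contents in the address order of table 4.2: the bit of
    tap 0 is the most significant address bit."""
    n = len(coeffs)
    lut = []
    for addr in range(1 << n):
        bits = [(addr >> (n - 1 - k)) & 1 for k in range(n)]   # bits[k] = x_k
        lut.append(sum(a for a, b in zip(coeffs, bits) if b))
    return lut


def da_fir_output(coeffs, codes, m_bits):
    """Evaluate Equation 4.12. codes[k] is the integer two's complement
    code of x[n-k], i.e. x[n-k] = codes[k] / 2**(m_bits - 1)."""
    lut = build_lut(coeffs)
    n = len(coeffs)
    mask = (1 << m_bits) - 1
    words = [c & mask for c in codes]
    y = 0.0
    for m in range(m_bits):                       # m = 0 is the sign bit
        addr = 0
        for k, w in enumerate(words):
            bit = (w >> (m_bits - 1 - m)) & 1     # x_{k,m}
            addr |= bit << (n - 1 - k)
        term = lut[addr] * 2.0 ** (-m)            # right shift by m places
        y += -term if m == 0 else term            # sign bit plane is subtracted
    return y


if __name__ == "__main__":
    a = [0.5, 0.25, -0.125]
    codes = [3, -4, 1]                            # 0.375, -0.5, 0.125 with M = 4
    direct = sum(ak * c / 8 for ak, c in zip(a, codes))
    print(da_fir_output(a, codes, 4), direct)
```

A fully parallel filter corresponds to evaluating all M bit planes at once with M LUTs, while a serial filter walks through the same loop one bit plane per clock cycle.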
Figure 4.15: Serial Distributed Arithmetic FIR Filter
Since all channels have the same filtering requirements, a multi channel DA
FIR filter can be constructed by time sharing LUTs across data from multiple channels.
A multi channel DA FIR filter requires more memory to
store input variables, since it has to store the input variables of multiple streams,
but the logic resources required to compute results are the same as for a single channel
filter. Because a serial filter processes input data one bit at a time per clock cycle,
Figure 4.16: Parallel Distributed Arithmetic FIR Filter
serial structures will require clock cycles equal to the input data width to calculate an
output. In contrast, a parallel structure calculates the filter output in a single clock cycle,
so parallel structures provide the highest speed performance at the expense of large area.
Another option is a multibit serial structure, which combines several small serial FIR filters in
parallel to generate the FIR output. This structure provides greater throughput than a
standard serial structure while using less area than a fully parallel structure. Thus different
architectures can be used depending upon the specific requirement in terms of area or
speed.
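The speed side of this trade-off can be summarised with a small calculation. The sketch below assumes that each serial unit handles one bit per clock cycle and that the bits are shared evenly between the units; the 16-bit data width in the example is an assumption chosen for illustration, and the function name is hypothetical.

```python
import math


def cycles_per_output(data_width, serial_units=None):
    """Clock cycles needed to produce one filter output.

    serial_units=None models a fully parallel structure; otherwise each
    serial unit is assumed to process one bit per clock cycle."""
    if serial_units is None:
        return 1
    return math.ceil(data_width / serial_units)


if __name__ == "__main__":
    width = 16                                   # assumed input data width
    for units in (1, 2, 4):
        print(units, "serial unit(s):", cycles_per_output(width, units))
    print("fully parallel:", cycles_per_output(width))
```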
4.5 Design and Implementation of Proposed Digital Up Converter for WiMAX System
In this section design and implementation of the proposed DUC for WiMAX system using
DA is presented. For its implementation, different architectures like fully serial, multibit
serial and fully parallel architectures are used to choose the best architecture. The
interpolation filters are implemented using Nyquist FIR design with direct form polyphase
structure. The input sample frequency, passband ripple and stopband attenuation are
taken as 11.2 MHz, 0.015 dB and 60 dB respectively. The interpolation factor is taken as
8. The proposed DUC is implemented by cascading a pulse shaping single rate FIR filter, an
interpolation by 2 filter and an interpolation by 4 filter. The design and implementation of
the pulse shaping single rate FIR filter, the interpolation by 2 filter and the interpolation by 4
filter are presented in the following sub sections.
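The sample rate at the output of each stage of the cascade follows directly from the stage factors (1 × 2 × 4 = 8). A minimal sketch of this arithmetic is shown below; the function name `stage_rates` is illustrative.

```python
def stage_rates(input_rate_msps, factors):
    """Sample rate (Msps) at the input and after each stage of a cascade
    of interpolating filters with the given factors."""
    rates = [input_rate_msps]
    for factor in factors:
        rates.append(rates[-1] * factor)
    return rates


if __name__ == "__main__":
    # Pulse shaping (single rate), interpolation by 2, interpolation by 4.
    rates = stage_rates(11.2, [1, 2, 4])
    print(rates)            # rate after each stage, in Msps
    print(rates[-1] / rates[0])   # overall interpolation factor
```

With an 11.2 Msps input, the interpolation by 2 stage delivers 22.4 Msps and the interpolation by 4 stage delivers 89.6 Msps.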
4.5.1 Design and Implementation of Pulse Shaping Single Rate FIR Filter
In the DUC, the pulse shaping filter is used to attenuate out of band power in order to meet
the spectral mask requirement. RRC is a favorable filter for pulse shaping as its transition
band response meets the Nyquist criterion. The pulse shaping single rate FIR filter is
designed with a roll off factor of 0.25 and stopband attenuation of 60 dB. The passband and
stopband frequencies are taken as 4.65 MHz and 5.35 MHz respectively. The pulse shaping
single rate FIR filter is designed and implemented for fully serial, multibit serial and fully
parallel architectures. The resources utilized by the different architectures and their
performance in terms of speed are shown in tables 4.3 and 4.4. From table 4.3, it is
concluded that for the DA fully serial architecture of the interpolation single rate
filter, as the number of serial units is increased from 1 to 4, the number of logic cells
increases from 3916 to 4051, i.e. an increase of about 3.4%, whereas the number of clock
cycles required to process input and output data decreases from 16 to 4, i.e. the speed
increases fourfold.
The results for the fully parallel architecture implementation are shown in table 4.4.
From table 4.4, it is concluded that the DA fully parallel architecture with pipeline level 1
uses the fewest logic cells among all parallel architectures. On analyzing the results of
tables 4.3 and 4.4, it is concluded that the DA fully serial architecture with 4
Table 4.3: Comparison of FPGA Resource Utilization by Distributed Arithmetic
Fully Serial Interpolator Single Rate Filter with different Number of Serial Units

FPGA Resources                                   No. of Serial Units
                                                 = 1       = 2       = 4
Logic Cells                                      3916      3941      4051
M512                                             1         1         1
M4K                                              0         0         0
Clock Cycles Required to Process Input Data      16        8         4
Clock Cycles Required to Generate Output Data    16        8         4

Table 4.4: Comparison of FPGA Resource Utilization by Distributed Arithmetic
Fully Parallel Interpolator Single Rate Filter with different levels of Pipelining

Resources                                        Pipeline Level
                                                 1         2         3
Logic Cells                                      5137      5749      6505
M512                                             1         1         1
M4K                                              0         0         0
Clock Cycles Required to Process Input Data      1         1         1
Clock Cycles Required to Generate Output Data    1         1         1
serial units requires 4051 logic cells, whereas the DA fully parallel architecture with pipeline
level 1 requires 5137 logic cells. The DA fully parallel architecture with pipeline level
1 requires 1 clock cycle to process input data and 1 clock cycle to generate output data,
whereas the DA fully serial architecture with 4 serial units requires 4 clock
cycles to process input data and 4 clock cycles to generate output data. Thus, compared
with the DA fully serial architecture with 4 serial units, the speed of the DA fully
parallel architecture with pipeline level 1 increases fourfold at the expense of only
about 26.8% more logic cells. As the best results in terms of speed are obtained with the fully
parallel architecture with pipeline level 1, for this filter design the fully parallel
architecture with pipeline level 1 is used.
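The figures quoted in this comparison follow from the logic cell and clock cycle counts in tables 4.3 and 4.4. A short sketch of the arithmetic, with illustrative function names, is:

```python
def percent_increase(old, new):
    """Relative increase from `old` to `new`, in percent."""
    return 100.0 * (new - old) / old


def speedup(old_cycles, new_cycles):
    """Ratio of clock cycles per output, old over new."""
    return old_cycles / new_cycles


if __name__ == "__main__":
    serial_4_units_cells = 4051        # table 4.3, 4 serial units
    parallel_level1_cells = 5137       # table 4.4, pipeline level 1
    print(round(percent_increase(serial_4_units_cells, parallel_level1_cells), 1))
    print(speedup(4, 1))               # 4 cycles per output versus 1
```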
4.5.2 Design and Implementation of Interpolation by 2 FIR Filter
In the interpolation by 2 filter, the input sample rate is 11.2 Msps and the output
sample rate is 22.4 Msps. So the interpolation by 2 filter is designed with an input sample rate of 11.2
Msps, passband ripple of 0.015 dB, stopband attenuation of 60 dB and interpolation factor of
2. This interpolation by 2 filter is implemented for fully serial, multibit serial and fully
parallel architectures. The resources utilized by the different architectures and their
performance in terms of speed are shown in tables 4.5 and 4.6. From table 4.5, it is
concluded that for the DA fully serial architecture of the interpolation by 2 filter, as the
number of serial units is increased from 1 to 4, the number of logic cells increases from
523 to 1021, i.e. an increase of approximately 95%, whereas the number of clock
cycles required to process input data decreases from 32 to 8 and the number of clock cycles
required to generate output data decreases from 16 to 4, i.e. the speed increases
fourfold. Table 4.6 shows the results for the fully parallel architecture with pipeline levels 1, 2
and 3. Pipeline level 1 gives the best results in terms of speed and resource usage among the fully
92
Table 4.5: Comparison of FPGA Resource Utilization by Distributed Arithmetic
Fully Serial Interpolation by 2 Filter with different Number of Serial Units
Table 4.6: Comparison of FPGA Resource Utilization by Distributed Arithmetic
Fully Parallel Interpolation by 2 Filter with different levels of Pipelining
FPGA
Resources
No. of Serial Units
No. of Serial
Units =1
No. of Serial
Units =2
No. of Serial
Units =4
Logic Cells 523 697 1021
M512 2 2 2
M4K 2 4 8
Clock Cycles
Required to
Process Input
Data
32 16 8
Clock Cycles
Required to
Generate Output
Data
16 8 4
Resources
Pipeline Level
Pipeline Level 1 Pipeline Level 2 Pipeline
Level 3
Logic Cells 1890 2000 3716
M512 2 2 2
M4K 18 18 18
Clock Cycles
Required to
Process Input
Data
2 2 2
Clock Cycles
Required to
Generate
Output Data
1 1 1
93
parallel architectures. On comparing the results of tables 4.5 and 4.6, it is concluded that
DA fully serial architecture having 4 numbers of serial units requires 1021 logic cells,
whereas DA fully parallel architecture with pipeline level of 1 requires 1890 logic cells.
Also DA fully parallel architecture with pipeline level of 1 requires 2 clock cycle to
process input data and 1 clock cycle to generate output data whereas DA fully serial
architecture having 4 numbers of serial units requires 8 clock cycles to process input data
and 4 clock cycles to generate the output data. Thus as compared to DA fully serial
architecture having 4 numbers of serial units, the speed of DA fully parallel architecture
with pipeline level of 1 increases by four folds at an expense of about 85% of logic cells.
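An interpolation by 2 FIR filter is commonly realised in polyphase form, so that no multiplications are spent on the zeros inserted by upsampling. The sketch below is a plain (non-DA) behavioural model in Python, given only for illustration. The function names and the toy coefficients are assumptions and are not the filter designed in this section.

```python
def interpolate_direct(x, h, L):
    """Reference model: insert L-1 zeros after each sample, then FIR filter."""
    up = []
    for sample in x:
        up.append(sample)
        up.extend([0] * (L - 1))
    return [
        sum(h[k] * up[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(up))
    ]


def interpolate_polyphase(x, h, L):
    """Polyphase model: sub-filter p holds taps h[p], h[p+L], ... and
    produces output sample n*L + p while running at the low input rate."""
    phases = [h[p::L] for p in range(L)]
    y = []
    for n in range(len(x)):
        for taps in phases:
            y.append(sum(c * x[n - m] for m, c in enumerate(taps) if n - m >= 0))
    return y


h = [1, 2, 3, 4, 3, 2, 1]      # toy integer taps, not the thesis filter
x = [5, -1, 2, 7]
assert interpolate_polyphase(x, h, 2) == interpolate_direct(x, h, 2)
```

Both models give identical outputs, but the polyphase form evaluates each sub-filter once per input sample rather than once per output sample.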
4.5.3 Design and Implementation of Interpolation by 4 FIR Filter
In the DUC, after the signal has been interpolated by 2, it is interpolated by 4 to obtain the
required overall interpolation factor of 8. The input sample rate of the interpolation by 4
filter is 22.4 Msps, the passband ripple is 0.015 dB and the stopband attenuation is 60 dB. This
interpolation by 4 filter is designed and implemented for fully serial, multibit serial and
fully parallel architectures. The resources utilized by the different architectures and their
performance in terms of speed are shown in tables 4.7 and 4.8.

Table 4.7: Comparison of FPGA Resource Utilization by Distributed Arithmetic
Fully Serial Interpolation by 4 Filter with different Number of Serial Units

| FPGA Resources | No. of Serial Units = 1 | No. of Serial Units = 2 | No. of Serial Units = 4 |
|---|---|---|---|
| Logic Cells | 584 | 654 | 818 |
| M512 | 1 | 1 | 1 |
| M4K | 1 | 1 | 1 |
| Clock Cycles Required to Process Input Data | 64 | 32 | 16 |
| Clock Cycles Required to Generate Output Data | 16 | 8 | 4 |

Table 4.8: Comparison of FPGA Resource Utilization by Distributed Arithmetic
Fully Parallel Interpolation by 4 Filter with different levels of Pipelining

| Resources | Pipeline Level 1 | Pipeline Level 2 | Pipeline Level 3 |
|---|---|---|---|
| Logic Cells | 1038 | 1232 | 2172 |
| M512 | 1 | 1 | 1 |
| M4K | 6 | 6 | 6 |
| Clock Cycles Required to Process Input Data | 4 | 4 | 4 |
| Clock Cycles Required to Generate Output Data | 1 | 1 | 1 |

From table 4.7, it is concluded that for the DA fully serial interpolation by 4 filter, as the
number of serial units is increased from 1 to 4, the number of logic cells increases from 584 to
818, i.e. by approximately 40%. At the same time, the number of clock cycles required to process
input data decreases from 64 to 16 and the number required to generate output data decreases
from 16 to 4, i.e. the speed increases fourfold. From table 4.8, it is concluded that for the DA
fully parallel interpolation by 4 filter, pipeline level 1 gives the best result in terms of
speed with the fewest resources among all pipeline levels. On comparing the results of tables 4.7
and 4.8, it is concluded that the DA fully serial architecture with 4 serial units requires 818
logic cells, whereas the DA fully parallel architecture with pipeline level 1 requires 1038 logic
cells. The DA fully parallel architecture with pipeline level 1 needs 4 clock cycles to process
input data and 1 clock cycle to generate output data, whereas the DA fully serial architecture
with 4 serial units needs 16 clock cycles to process input data and 4 clock cycles to generate
output data. Thus, compared with the DA fully serial architecture with 4 serial units, the DA
fully parallel architecture with pipeline level 1 is four times faster at the cost of about 27%
more logic cells.
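The cascade of the two interpolators can be checked with a short rate calculation. This Python sketch uses only the figures stated in the text (11.2 Msps at the input of the interpolation by 2 filter, followed by factors 2 and 4). The function name `rate_plan` is hypothetical.

```python
from fractions import Fraction


def rate_plan(input_rate_msps, factors):
    """Return the sample rate (in Msps) after each interpolation stage."""
    rates = []
    # Exact rational arithmetic avoids floating-point rounding in the products.
    rate = Fraction(str(input_rate_msps))
    for factor in factors:
        rate *= factor
        rates.append(float(rate))
    return rates


rates = rate_plan(11.2, [2, 4])
print(rates)  # [22.4, 89.6]
```

The final rate of 89.6 Msps corresponds to the overall interpolation factor of 8.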
Figure 4.17: Logic cells used by different stages of DUC with different number of
serial units for fully serial DA architecture
The variation in the number of logic cells used by the pulse shaping, interpolation by 2 and
interpolation by 4 filters is shown in figure 4.17 for the fully serial DA architecture with
different numbers of serial units, and in figure 4.18 for the fully parallel DA architecture with
different pipeline levels. From the above discussion, it is concluded that, for implementing the
different stages, the fully parallel DA architecture with pipeline level 1 provides high speed
with a moderate area requirement. So, in the proposed design, the fully parallel DA architecture
with pipeline level 1 is used to implement all the interpolator stages of the DUC for the WiMAX
system.
Figure 4.18: Logic cells used by different stages of DUC with different levels of
pipelining for fully parallel DA architecture
4.6 Design and Implementation of Proposed Digital Down Converter for WiMAX System
In this section, the design and implementation of the proposed DDC for the WiMAX system using DA
are presented. Fully serial, multibit serial and fully parallel architectures are implemented in
order to choose the best one. The decimation filters are implemented using a Nyquist FIR design
with a direct form polyphase structure. The input sample rate, passband ripple and stopband
attenuation are taken as 89.6 Msps, 0.015 dB and 60 dB respectively. The overall decimation
factor is 8. The proposed DDC is implemented by cascading a decimation by 4 filter, a decimation
by 2 filter and a decimation channel filter. The design and implementation of these three filters
are presented in the following subsections.
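The direct form polyphase structure mentioned above computes only the output samples that survive decimation. The following Python sketch is a plain (non-DA) behavioural model, given only for illustration. The function names and the toy coefficients are assumptions, not the filters designed in this chapter.

```python
def decimate_direct(x, h, M):
    """Reference model: FIR filter at the input rate, then keep every M-th sample."""
    y = [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]
    return y[::M]


def decimate_polyphase(x, h, M):
    """Polyphase model: branch p applies taps h[p], h[p+M], ... to the input
    samples x[n*M - p], so every multiplication runs at the output rate."""
    phases = [h[p::M] for p in range(M)]
    n_out = (len(x) + M - 1) // M
    y = []
    for n in range(n_out):
        acc = 0
        for p, taps in enumerate(phases):
            for m, c in enumerate(taps):
                idx = (n - m) * M - p
                if 0 <= idx < len(x):
                    acc += c * x[idx]
        y.append(acc)
    return y


h = [1, 2, 3, 4, 5]            # toy integer taps, not the thesis filter
x = list(range(1, 10))
assert decimate_polyphase(x, h, 4) == decimate_direct(x, h, 4)
```

The direct model computes M filter outputs and discards all but one, whereas the polyphase model performs only the multiplications that contribute to a retained output.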
4.6.1 Design and Implementation of Decimation by 4 FIR Filter
The decimation by 4 filter reduces the sample rate from 89.6 Msps to 22.4 Msps. Its design
specifications are a stopband attenuation of 60 dB, a passband ripple of 0.015 dB and a
decimation factor of 4. The filter is designed and implemented for fully serial, multibit serial
and fully parallel architectures. The resources utilized by the different architectures and
their performance in terms of speed are shown in tables 4.9 and 4.10.

Table 4.9: Comparison of FPGA Resource Utilization by Distributed Arithmetic
Fully Serial Decimation by 4 Filter with different Number of Serial Units

| FPGA Resources | No. of Serial Units = 1 | No. of Serial Units = 2 | No. of Serial Units = 4 |
|---|---|---|---|
| Logic Cells | 590 | 660 | 824 |
| M512 | 0 | 0 | 0 |
| M4K | 1 | 1 | 1 |
| Clock Cycles Required to Process Input Data | 16 | 8 | 4 |
| Clock Cycles Required to Generate Output Data | 64 | 32 | 16 |

Table 4.10: Comparison of FPGA Resource Utilization by Distributed Arithmetic
Fully Parallel Decimation by 4 Filter with different levels of Pipelining

| Resources | Pipeline Level 1 | Pipeline Level 2 | Pipeline Level 3 |
|---|---|---|---|
| Logic Cells | 1044 | 1238 | 2180 |
| M512 | 0 | 0 | 0 |
| M4K | 6 | 6 | 6 |
| Clock Cycles Required to Process Input Data | 1 | 1 | 1 |
| Clock Cycles Required to Generate Output Data | 4 | 4 | 4 |

From table 4.9, it is concluded that for the DA fully serial decimation by 4 filter, as the number
of serial units is increased from 1 to 4, the number of logic cells increases from 590 to 824,
i.e. by approximately 40%. But the number of clock cycles required to process input data decreases
from 16 to 4 and the number required to generate output data decreases from 64 to 16, i.e. the
speed increases fourfold. From table 4.10, it is concluded that the DA fully parallel architecture
with pipeline level 1 outperforms the other pipeline levels. On comparing the results of tables
4.9 and 4.10, it is concluded that the DA fully serial architecture with 4 serial units requires
824 logic cells, whereas the DA fully parallel architecture with pipeline level 1 requires 1044
logic cells. The DA fully parallel architecture with pipeline level 1 needs 1 clock cycle to
process input data and 4 clock cycles to generate output data, whereas the DA fully serial
architecture with 4 serial units needs 4 clock cycles to process input data and 16 clock cycles
to generate output data. Thus, compared with the DA fully serial architecture with 4 serial units,
the DA fully parallel architecture with pipeline level 1 is four times faster at the cost of
about 27% more logic cells. So this filter is implemented with the DA fully parallel architecture
with pipeline level 1.
4.6.2 Design and Implementation of Decimation by 2 FIR Filter
In the DDC, the decimation by 4 filter is followed by a decimation by 2 filter. Its function is
to reduce the sample rate by a further factor of 2. So the input sample rate for
this filter will be 22.4 Msps and the output sample rate will be 11.2 Msps. The other design
specifications, the passband ripple and the stopband attenuation, are taken as 0.015 dB and 60 dB
respectively. This decimation by 2 filter is designed and implemented for fully serial, multibit
serial and fully parallel architectures. The resources utilized by the different architectures
and their performance in terms of speed are shown in tables 4.11 and 4.12.

Table 4.11: Comparison of FPGA Resource Utilization by Distributed Arithmetic
Fully Serial Decimation by 2 Filter with different Number of Serial Units

| FPGA Resources | No. of Serial Units = 1 | No. of Serial Units = 2 | No. of Serial Units = 4 |
|---|---|---|---|
| Logic Cells | 526 | 700 | 1024 |
| M512 | 1 | 1 | 1 |
| M4K | 2 | 4 | 8 |
| Clock Cycles Required to Process Input Data | 16 | 8 | 4 |
| Clock Cycles Required to Generate Output Data | 32 | 16 | 8 |

Table 4.12: Comparison of FPGA Resource Utilization by Distributed Arithmetic
Fully Parallel Decimation by 2 Filter with different levels of Pipelining

| Resources | Pipeline Level 1 | Pipeline Level 2 | Pipeline Level 3 |
|---|---|---|---|
| Logic Cells | 1893 | 2003 | 3719 |
| M512 | 1 | 1 | 1 |
| M4K | 18 | 18 | 18 |
| Clock Cycles Required to Process Input Data | 1 | 1 | 1 |
| Clock Cycles Required to Generate Output Data | 2 | 2 | 2 |

From table 4.11, it is concluded that for the DA fully serial decimation by 2 filter, as the
number of serial units is increased from 1 to 4, the number of logic cells increases from 526 to
1024, i.e. by approximately 95%. At the same time, the number of clock cycles required to process
input data decreases from 16 to 4 and the number required to generate output data decreases from
32 to 8, i.e. the speed increases fourfold. From table 4.12, it can be seen that the DA fully
parallel architecture with pipeline level 1 provides the best performance in terms of speed with
fewer resources than the other parallel structures. On comparing the results of tables 4.11 and
4.12, it is concluded that the DA fully serial architecture with 4 serial units requires 1024
logic cells, whereas the DA fully parallel architecture with pipeline level 1 requires 1893 logic
cells. The DA fully parallel architecture with pipeline level 1 needs 1 clock cycle to process
input data and 2 clock cycles to generate output data, whereas the DA fully serial architecture
with 4 serial units needs 4 clock cycles to process input data and 8 clock cycles to generate
output data. Thus, compared with the DA fully serial architecture with 4 serial units, the DA
fully parallel architecture with pipeline level 1 is four times faster at the cost of about 85%
more logic cells. So the decimation by 2 filter is designed with the fully parallel architecture
with pipeline level 1.
4.6.3 Design and Implementation of Decimation Channel Filter
In the DDC, the channel filter is used after the decimation by 2 filter. The main function of
this filter is to provide stopband attenuation to remove adjacent channel interference. In
addition, it also has to keep the passband ripple within range. For this filter, an RRC filter
with Nyquist design is used, with a roll-off factor of 0.25 and a stopband attenuation of 60 dB.
This decimation channel filter is designed and implemented for fully serial, multibit serial and
fully parallel architectures. The resources utilized by the different architectures and their
performance in terms of speed are shown in tables 4.13 and 4.14.

Table 4.13: Comparison of FPGA Resource Utilization by Distributed Arithmetic
Fully Serial Decimator Channel Filter with different Number of Serial Units

| FPGA Resources | No. of Serial Units = 1 | No. of Serial Units = 2 | No. of Serial Units = 4 |
|---|---|---|---|
| Logic Cells | 2093 | 2147 | 2255 |
| M512 | 1 | 1 | 1 |
| M4K | 0 | 0 | 0 |
| Clock Cycles Required to Process Input Data | 16 | 8 | 4 |
| Clock Cycles Required to Generate Output Data | 16 | 8 | 4 |

Table 4.14: Comparison of FPGA Resource Utilization by Distributed Arithmetic
Fully Parallel Decimator Channel Filter with different levels of Pipelining

| Resources | Pipeline Level 1 | Pipeline Level 2 | Pipeline Level 3 |
|---|---|---|---|
| Logic Cells | 3148 | 3613 | 4319 |
| M512 | 1 | 1 | 1 |
| M4K | 0 | 0 | 0 |
| Clock Cycles Required to Process Input Data | 1 | 1 | 1 |
| Clock Cycles Required to Generate Output Data | 1 | 1 | 1 |

From table 4.13, it is concluded that for the DA fully serial single rate channel filter of the
DDC, as the number of serial units is increased from 1 to 4, the number of logic cells increases
from 2093 to 2255, i.e. by approximately 8%. At the same time, the number of clock cycles
required to process input data and to generate output data decreases from 16 to 4, i.e. the speed
increases fourfold. From table 4.14, it is concluded that for the DA fully parallel single rate
channel filter, the pipeline level 1 structure provides the best performance in terms of speed
with the smallest area among the parallel structures. On comparing the results of tables 4.13 and
4.14, it is concluded that the DA fully serial architecture with 4 serial units requires 2255
logic cells, whereas the DA fully parallel architecture with pipeline level 1 requires 3148 logic
cells. The DA fully parallel architecture with pipeline level 1 needs 1 clock cycle to process
input data and 1 clock cycle to generate output data, whereas the DA fully serial architecture
with 4 serial units needs 4 clock cycles to process input data and 4 clock cycles to generate
output data. Thus, compared with the DA fully serial architecture with 4 serial units, the DA
fully parallel architecture with pipeline level 1 is four times faster at the cost of about 40%
more logic cells. So this filter is designed with the DA fully parallel architecture with
pipeline level 1.
The variation in the number of logic cells used by the decimation by 4, decimation by 2 and
decimation channel filters is shown in figure 4.19 for the fully serial DA architecture with
different numbers of serial units, and in figure 4.20 for the fully parallel DA architecture with
different pipeline levels. From these discussions, it is concluded that the fully parallel DA
architecture with pipeline level 1 offers high speed with a moderate area requirement. So, in the
proposed design, the fully parallel DA architecture with pipeline level 1 is used to implement
all decimator stages of the DDC for the WiMAX system. Overall, the fully parallel DA architecture
with pipeline level 1 is used to implement all interpolator and decimator stages of the DUC and
DDC for the WiMAX system.
Figure 4.19: Logic cells used by different stages of DDC with different number of serial units
for fully serial DA architecture
Figure 4.20: Logic cells used by different stages of DDC with different levels of pipelining
for fully parallel DA architecture
4.7 Conclusions
Owing to their high performance and their ability to implement DSP functions efficiently, FPGAs
can be considered a better choice for increasing the performance of broadband communication
systems such as WiMAX. The availability of high level design tools also helps in reducing the
design cycle for FPGA implementation.
DA can be used to implement low cost LUT based DSP functions in either serial or parallel form.
When the number of elements in a vector is the same as the word size, DA results in fast
operational speed. This speed is achieved by replacing multiplications with ROM based LUTs.
Decomposition and coding techniques are used to reduce the ROM size.
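To make the LUT idea concrete, the sketch below models a bit-serial DA inner product in Python. It is a behavioural illustration only, assuming two's-complement input samples of `bits` bits and integer coefficients. The function names are hypothetical, and the code does not model the decomposition or coding techniques mentioned above.

```python
def build_lut(coeffs):
    """LUT entry 'addr' holds the sum of the coefficients whose address bit is 1."""
    n = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << n)
    ]


def da_inner_product(coeffs, samples, bits):
    """Compute sum(coeffs[i] * samples[i]) without multiplications.

    One bit of every sample is processed per iteration (one clock cycle in a
    fully serial design), so 'bits' iterations are needed per output."""
    lut = build_lut(coeffs)
    acc = 0
    for b in range(bits):
        # Gather bit b of every sample into a LUT address.
        addr = 0
        for i, s in enumerate(samples):
            addr |= ((s >> b) & 1) << i
        partial = lut[addr] << b
        # The most significant bit of a two's-complement word has negative weight.
        acc += -partial if b == bits - 1 else partial
    return acc


coeffs = [3, -1, 4, 2]
samples = [5, -7, 0, -8]          # must fit in 'bits' two's-complement bits
assert da_inner_product(coeffs, samples, bits=4) == sum(
    c * s for c, s in zip(coeffs, samples)
)
```

The LUT has 2^N entries for N coefficients, which is why large filters are usually split into several smaller LUTs whose outputs are added.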
FIR filters can be implemented using a serial or a parallel DA architecture. A parallel DA FIR
filter produces one output every clock cycle, whereas a serial DA FIR filter requires M clock
cycles to produce an output. Thus the parallel architecture provides higher speed. The multibit
serial architecture is another option, which combines several small serial FIR units in parallel.
It provides greater throughput than the standard serial architecture, but less than the parallel
architecture. So, to improve performance in terms of speed, the DA parallel architecture with
pipeline level 1 is used for the proposed designs of the interpolation and decimation filters of
the DUC and DDC for the WiMAX system.
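The throughput ordering described above follows from how many bits of each input word an architecture consumes per clock cycle. As a rough model, a DA filter that consumes `bits_per_cycle` bits per clock needs the number of cycles computed below. The 16-bit word length is an assumption inferred from the clock cycle counts tabulated for the single rate filters, and the function name is hypothetical.

```python
import math


def cycles_per_output(word_bits, bits_per_cycle):
    """Clock cycles a single rate DA filter needs per output sample when it
    consumes 'bits_per_cycle' bits of every input word in each clock cycle."""
    return math.ceil(word_bits / bits_per_cycle)


WORD_BITS = 16  # assumed input word length
for name, k in [("1 serial unit", 1), ("2 serial units", 2),
                ("4 serial units", 4), ("fully parallel", WORD_BITS)]:
    print(f"{name}: {cycles_per_output(WORD_BITS, k)} cycle(s)")
```

Under this assumption the model gives 16, 8, 4 and 1 cycles, which matches the pattern reported for the single rate filters.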