
18-525 Spring 2005

Digital Signal Processing “Swiss Army Knife”

Final Report and Documentation

TABLE OF CONTENTS

I. Abstract
II. Specification Sheet
III. Introduction
IV. System Level
    Comb
    Biquad
V. Gate Level
    Floating Point Multiplier
    Floating Point Adder
    Comb Delay Network
VI. Layout
    Primitive Gates
    Floating Point Multiplier
        Wallace Tree
        Other Major Multiplier Components
    Floating Point Adder
        Components of the Floating Point Adder
        Correction
        Register
    Improved Multiplier and Revised Top Level
VII. Verification
    MATLAB Verification of Lyons and Bell paper
    C and Behavioral Verilog verification of Floating Point functions
    Structural Verilog Verification
    Schematic Verification
    Layout Verification
    Analog Artist Verification
    Soft-IP
VIII. Results
IX. Conclusion
A. Appendix: “The Swiss Army Knife of Digital Networks” by Lyons and Bell
B. Appendix: Test Suite Example
C. Appendix: Verilog and C Example
D. Appendix: Soft-IP Code
    Top Level
    Floating Point Adder
    Floating Point Multiplier


I. Abstract

The Swiss_525 DSP (Digital Signal Processing) chip is an IC (Integrated Circuit) that acts as a toolbox for the DSP system designer. The strength of this chip is that it can be set to implement 16 different filters and basic functions based on the selection of input coefficients. This report will discuss the usefulness of the Swiss_525 circuit, explain the design decisions, and conclude with a software extension of the circuit. An analysis of the design from conception through simulation will be given with verification of each step. This will highlight the following stages of the design process: Verilog code, schematic representation, floorplan, layout, and simulation. While not implemented in hardware, the “Soft-IP” extension of the project implements an additional six functions and operates on complex numbers.


II. Specification Sheet

Size of Design: 177,749.5 µm² (443.52 µm wide by 400.77 µm tall)
Transistor count: 34,564 transistors (17,282 nmos and 17,282 pmos)
Transistor Density: 0.194 transistors / µm²
Chip Speed: 1 MHz
Data Input: 12-bit floating point (S EEEEEE FFFFF)
Data Output: 12-bit floating point (S EEEEEE FFFFF)
Technology: 0.18 SCMOS
Input Pins: 76
    12-bit: 6 (X[n], a1, a2, b0, b1, b2)
    1-bit: 4 (c1, N, rst, clk)
Output Pins: 12
    12-bit: Y[n]
InOut Pins: 2 (gnd!, vdd!)
Number of Pins: 90

Pinout:

Pin #      Name      Type      Bit-width
0          vdd!      InOut     1
1          <none>    -         -
2          gnd!      InOut     1
3 – 14     X[n]      Input     12
15         <none>    -         -
16 – 27    a2        Input     12
28         <none>    -         -
29 – 40    b2        Input     12
41         <none>    -         -
42 – 53    Y[n]      Output    12
54         <none>    -         -
55 – 66    b1        Input     12
67         <none>    -         -
68 – 79    a1        Input     12
80         <none>    -         -
81 – 92    b0        Input     12
93         <none>    -         -
94         c1        Input     1
95         <none>    -         -
96         N         Input     1
97         <none>    -         -
98         clk       Input     1
99         rst       Input     1


Fig. 1 - PinOut Diagram

Inputs:

c1: This input coefficient selects whether or not the comb filter is active. A “1” means that the comb filter’s delayed register value (the 8th or the 16th, based on N) is used. A “0” means that the comb filter is bypassed and the value 12’b000000000000 (12-bit floating point “zero”) is sent into the biquad.

N: This input coefficient selects which register is read from the comb filter. The value “0” maps to 8 while a “1” maps to 16.

rst: This is the reset line for the registers in the circuit. It is asserted low.

clk: This is the clock signal for the registers within the circuit.

an/bn: These input coefficients control the functionality of the circuit. Depending on their values (as given in Appendix A) they cause the network to act as one of many different filters or functions. In hardware, each coefficient is an input to a 12-bit floating point multiplier.

X[n]: This is the 12-bit floating point input to the circuit.

Outputs:

Y[n]: This is the 12-bit floating point output of the circuit.


III. Introduction

The goal for this semester was to design and implement the ‘Swiss_525’ chip. The design is based on the paper “The Swiss Army Knife of Digital Networks” by Lyons and Bell. That paper describes a network that should be in the toolbox of every DSP (Digital Signal Processing) chip designer, because this simple network can implement a total of 22 different functions by varying the input coefficients. Examples include a first-order or second-order IIR filter, used in feedback systems; a moving average, with applications in noise reduction; and a comb delay, used for audio effects such as echoing.

When this project was first considered as a proposal, redundant features were noticed in the prior year’s projects. Encryption, Fourier transforms, and filtering had previously been implemented, and we found that this ‘Swiss Army Knife’ could be abstracted to fit into any of these applications. Because of this versatility, it also became clear that this chip would have wide marketability. Since DSP is everywhere, in audio filtering, video capturing and processing, and almost all forms of communication, the applicable markets for this chip are innumerable.

The design process of the Swiss_525 began in MATLAB, where the transfer function defined in the Lyons and Bell paper was implemented. This was then translated into behavioral Verilog, with the help of some custom conversion tools written in C for verification. With this behavioral description, structural Verilog was written. The schematic followed and was exhaustively tested against our structural code. The schematic was used as the basis for the layout and is what the chip was eventually LVSed against. Analog simulations were a time-consuming process, so only a small number of simulations were successfully run. However, they were compared against the output of our schematic and did prove to be correct. Additionally, the behavioral Verilog was parameterized and extended to handle complex inputs. This behavioral code was the soft-IP deliverable. Implementing the chip described by the extended behavioral code would have gone beyond the scope of this course.

The culmination of this process was a chip that was able to implement 16 of the 22 functions enumerated in the Lyons and Bell paper and a parameterized behavioral description that was able to perform all 22. The overall area of this chip was 177,749.5 µm² and it was confirmed to run at 1 MHz, a speed higher than the audio applications we aimed to cater to.


IV. System Level

At the highest level, the Swiss_525 circuit is a ‘comb’ filter and a second order recursive network, henceforth referred to as ‘biquad,’ tied in series. This is shown below in Fig. 2, which is a modified version of that shown in “The Swiss Army Knife of Digital Networks” by Lyons and Bell (Appendix A). This paper served as the guidelines by which the chip was designed.

Fig. 2 - System Diagram

The comb is mainly a delay network, while the biquad consists mainly of adders and multipliers. In order to support as many functions as possible within the scope of this course, the decision was made to represent X(n), a1, a2, b0, b1, b2, and Y(n) as 12-bit floating point numbers with the breakdown of (S EEEEEE FFFFF). This meant that all adders and multipliers had to be floating point and all registers had to have a capacity of 12 bits.

By having 12-bit floating point representation for the necessary values, the Swiss_525 chip is capable of fully implementing 16 of the 22 functions listed in the Lyons and Bell paper. The remaining six functions required that some of the coefficients have imaginary components. This would have required that all adders and multipliers be capable of handling both the real and imaginary parts of a number. This was deemed to be outside the scope of this course and thus was not implemented.
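The series connection of comb and biquad can be sketched behaviorally as a pair of difference equations. The Python below is a minimal sketch, not the project's MATLAB or Verilog: it assumes the transfer function quoted in the Verification section with a0 fixed at one, uses ordinary double-precision floats rather than the chip's 12-bit format, and the function name `swiss_army_knife` is hypothetical.

```python
def swiss_army_knife(x, b0, b1, b2, a1, a2, c1=0, N=8):
    """Run the comb + biquad chain over a list of samples (a0 assumed to be 1)."""
    if N not in (8, 16):
        raise ValueError("only delays of 8 or 16 are modelled")
    w1 = w2 = 0.0                          # the two biquad unit-delay registers
    y = []
    for n, sample in enumerate(x):
        # Comb: subtract the N-sample-old input when c1 is set, else subtract zero.
        delayed = x[n - N] if (c1 and n >= N) else 0.0
        v = sample - delayed
        # Biquad feedback half: denominator 1 - a1*z^-1 - a2*z^-2.
        w = v + a1 * w1 + a2 * w2
        # Biquad feedforward half: numerator b0 + b1*z^-1 + b2*z^-2.
        y.append(b0 * w + b1 * w1 + b2 * w2)
        w2, w1 = w1, w
    return y
```

As a usage example, setting c1 = 1, N = 8, a1 = 1 and b0 = 1/8 makes the transfer function (1 - z^-8) / (8 (1 - z^-1)), so an impulse produces eight outputs of 0.125 followed by zeros, which is an 8-point moving average.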

A breakdown of the comb and biquad design decisions follows.

Comb:

The main block within the comb filter is the z^-N delay network. Using the data tables in Lyons and Bell, it was found that N only takes two values, 8 and 16. Thus, N is represented merely as a 1-bit number that selects either the 8th or the 16th delay.

Fig. 3 – Comb

The value of c1 was found to reduce in a similar manner. In the original diagram, c1 was an input to a multiplier, but the various functions listed only required that c1 take on either the value one or zero. This effectively reduced the multiplier to a mux, with c1 as the select line. Hence c1 was also represented as a 1-bit number as opposed to a 12-bit number.
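The two reductions described above (N as a tap select, c1 as a mux select) can be illustrated with a small register-level model. This is a sketch under assumptions: the class name `CombDelay` and its methods are hypothetical, samples are plain Python floats, and the subtraction models the floating point adder that sits in the comb.

```python
class CombDelay:
    """16-stage delay line with a 1-bit tap select (N) and a 1-bit mux select (c1)."""

    def __init__(self):
        self.regs = [0.0] * 16

    def reset(self):
        """Clear every register, as the rst line would."""
        self.regs = [0.0] * 16

    def clock(self, x, c1, N):
        """Process one sample on a clock edge and return the comb output."""
        tap = self.regs[15] if N else self.regs[7]   # N=1 -> 16th delay, N=0 -> 8th
        delayed = tap if c1 else 0.0                 # c1=0 substitutes zero
        self.regs = [x] + self.regs[:-1]             # shift the delay line by one
        return x - delayed
```

With c1 = 0 the output equals the input, which is why a full 12-bit multiplier is unnecessary at this point in the network.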

Biquad:

The biquad is by far the largest and most complex part of the Swiss_525. As seen in Fig. 4, it consists of five multipliers, four adders, and two unit-delay blocks (registers). In the Lyons and Bell paper, there was an additional multiplier but as with c1 and N, the input to this multiplier exhibited a pattern (it was always one) that allowed it to be removed entirely from the design without affecting functionality. The original diagram also only had two adders, each taking in three inputs. Rather than design a three-input floating point adder, it was decided that in order to simplify the logic, two adders in series would be implemented. While this may have resulted in an overall larger design, it allowed far more flexibility in the data flow and floorplan of the design.

Fig. 4 – biquad

Another design decision that greatly impacted the size and complexity of the biquad was the implementation of the floating point multiplier with Booth recoding and a Wallace tree structure. While only offering marginal improvements due to the relatively small bit-width, it was a learning experience that greatly increased the overall complexity of the Swiss_525.

V. Gate Level

The following section discusses the gate level design of the Swiss_525 chip by looking at each of the major modules of the design and breaking them down systematically. All of the components comprising the system are synchronized to one clock.

Floating Point Multiplier

A Booth-recoded Wallace-tree floating point multiplier was used throughout the design. The major components within the multiplier can be seen in the schematic in Fig. 5; the parts are the bigMan, addExp, shifter, and subBias, in addition to many smaller components.

Fig. 5 - Floating Point Multiplier

The largest structure within the multiplier is the bigMan, the Booth-recoded Wallace tree that handles the bulk of the work of multiplying the mantissas of the inputs. Inside the Wallace tree there are a number of components, the major ones being PPGenerate7, comp3_2, full adder, and half adder. Three PPGenerate7s determine the partial products that are used in the Wallace tree (PPGenerate7 is partially shown in Fig. 6).


Fig. 6 - PPGenerate7bit

The Wallace structure is made up of a number of comp3_2 modules that take the outputs generated by the PPGenerate7s and then feed a series of full and half adders (Fig. 7). The comp3_2 modules are functionally equivalent to three-bit-at-a-time full adder cells, and allow for a more efficient addition of the three partial products generated by the PPGenerates. However, since they produce two components, carry and sum, a full adder is still required to calculate the final product. For this, a simple ripple-carry adder was used since speed was not the major goal and this design lent itself to easier layout.
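The role of a 3:2 compressor can be shown with a few lines of bitwise arithmetic. This is a generic carry-save sketch rather than the project's comp3_2 netlist: it assumes non-negative integer operands (Booth partial products can be negative, which is not modelled), and the final `+` stands in for the ripple-carry adder.

```python
def comp3_2(a, b, c):
    """Compress three operands into (sum, carry) with a + b + c == sum + carry."""
    s = a ^ b ^ c                               # per-bit sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, moved up one place
    return s, carry
```

Because no carry propagates inside `comp3_2`, every bit position can be evaluated in parallel; only the final two-operand addition has a carry chain.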

Fig. 7 - comp3_2


Fig. 8 - FullAdder

The outputs of the Booth-recoded Wallace tree are either dropped (the low-order bits that are being truncated) or fed into a shifter to be adjusted based upon the exponents (Fig. 9). The shifter is built mostly out of muxes and an adder. In parallel with the mantissas passing through the Wallace tree, the exponents pass through an adder built primarily from full adders and full differencers (Fig. 10), and then enter the shifter alongside the mantissa. The outputs of the shifter enter a 12-bit mux (a 2-bit mux is shown in Fig. 11), where they are selected against ground based upon edge cases flagged by an overflow/zero detection network, which checks whether the inputs will create an overflow or are zero. The output bits of the mux are then sent through buffers before leaving the floating point multiplier for the next section of the circuit.
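The data flow just described (mantissa product, exponent add and bias subtraction, one-place normalizing shift, truncation, zero/overflow mux) can be sketched behaviorally on raw 12-bit patterns. Several details are assumptions rather than facts from the schematics: the exponent bias of 31 and the hidden leading 1 are inferred from the worked multiplier results listed in the Verification section, out-of-range exponents are assumed to select the all-zero output, and the name `fp12_mul` is hypothetical.

```python
BIAS = 31  # assumption: inferred from the worked examples in the Verification section

def fp12_mul(a, b):
    """Multiply two 12-bit patterns laid out as S EEEEEE FFFFF, truncating the result."""
    if (a & 0x7FF) == 0 or (b & 0x7FF) == 0:
        return 0                                    # zero detection selects ground
    sign = ((a >> 11) ^ (b >> 11)) & 1
    exp = ((a >> 5) & 0x3F) + ((b >> 5) & 0x3F) - BIAS   # add exponents, subtract bias
    ma = 0x20 | (a & 0x1F)                          # restore the hidden leading 1
    mb = 0x20 | (b & 0x1F)
    prod = ma * mb                                  # 12-bit product, 1.0 == 1 << 10
    if prod & (1 << 11):                            # product in [2, 4): shift one place
        prod >>= 1
        exp += 1
    if exp < 0 or exp > 63:
        return 0                                    # assumption: edge cases output zero
    frac = (prod >> 5) & 0x1F                       # low-order product bits are dropped
    return (sign << 11) | (exp << 5) | frac
```

For example, `fp12_mul(0b0_011110_00000, 0b1_011101_11000)` multiplies 0.5 by -0.4375 and returns `0b1_011100_11000`, the pattern for -0.21875.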

Fig. 9 - Multiplier Shifter


Fig. 10 - Full Differencer

Fig. 11 - 2bitMux


Floating Point Adder

The floating point adder has a large number of components, some similar to those in the multiplier. The top level schematic can be seen in Fig. 12 and Fig. 13.

Fig. 12 - Floating Point Adder (left)

Fig. 13 - Floating Point Adder (right)

The inputs are fed simultaneously through the swap module (Fig. 14) and the equiv_test module. The swap module switches the input bits based upon whichever input number is larger, and the equiv_test checks to see if they are equal using the difference determined in the swap module. Equiv_test is used because the addition of two numbers of equal magnitude is treated as a special case.


Fig. 14 - Swap

From the swap module the exponents enter a shifter, while the mantissas enter an adder/differencer. The schematic of an example shifter used in the floating point adder can be seen in Fig. 15.

Fig. 15 - Adder Shifter

The outputs of the swap are fed into the shift_adj module. Within this module, the difference between the two exponents is determined and used as the amount by which the mantissa of the smaller number needs to be shifted. The exponents must be equal when adding, so the mantissa of the smaller number must be shifted right by the difference of the exponents. After shifting, the mantissa of the smaller number is fed into the add_sub along with the mantissa of the larger number from swap. The add_sub does the fixed point addition or subtraction, based on the xor of the signs of the two inputs. (Fig. 16)
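The swap, align, add/subtract and correct sequence can be sketched on raw 12-bit patterns as follows. This is a behavioral sketch, not the project's Verilog: the bias-free exponent handling works for any bias, the hidden leading 1 and the all-zero encoding of zero are assumptions inferred from the examples in the Verification section, no guard bits are kept (so shifted-out bits are simply lost), out-of-range exponents are assumed to produce zero, and `fp12_add` is a hypothetical name.

```python
def fp12_add(a, b):
    """Add two 12-bit patterns laid out as S EEEEEE FFFFF."""
    if (a & 0x7FF) == 0:
        return b
    if (b & 0x7FF) == 0:
        return a
    if (a & 0x7FF) < (b & 0x7FF):        # swap: keep the larger magnitude in a
        a, b = b, a
    sign = (a >> 11) & 1                 # result takes the sign of the larger operand
    subtract = ((a ^ b) >> 11) & 1       # xor of the two sign bits
    if subtract and (a & 0x7FF) == (b & 0x7FF):
        return 0                         # equal magnitudes cancel (special case)
    exp = (a >> 5) & 0x3F
    shift = exp - ((b >> 5) & 0x3F)      # exponent difference
    ma = 0x20 | (a & 0x1F)
    mb = (0x20 | (b & 0x1F)) >> shift    # align the smaller operand to the larger
    m = ma - mb if subtract else ma + mb
    if m >= 0x40:                        # carry out of the top bit: shift down
        m >>= 1
        exp += 1
    while m < 0x20:                      # leading bits cancelled: shift back up
        m <<= 1
        exp -= 1
    if exp < 0 or exp > 63:
        return 0                         # assumption: out-of-range results output zero
    return (sign << 11) | (exp << 5) | (m & 0x1F)
```

With a bias of 31, `fp12_add(0b0_100000_01000, 0b0_011111_10000)` adds 2.5 and 1.5 and returns `0b0_100001_00000`, the pattern for 4.0.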

Fig. 16 - Correction

The outputs of the correction module are then sent through a 12-bit mux, with the other half of the inputs connected to ground, before being sent through output buffers. This mirrors the output stage of the floating point multiplier.

Comb Delay Network

The comb delay network is a series of 16 registers and muxes. A schematic of half of the delay network can be seen in Fig. 17. The registers are the same as those in other parts of the circuit (Fig. 18). They include active-low reset signals and are built using a number of nand gates.

Fig. 17 - 8 Delay Registers in Comb


Fig. 18 - Register

VI. Layout

The following describes the methodology and process for the layout of the Swiss_525 circuit. Ideally, before actually implementing the layout, there would be a firm plan on which to base design constraints. However, the design was rapidly changing, making it difficult to follow rigid guidelines. Waiting for design finalization was not an option, so to keep the layout on pace, good design sense was used to make instances that could be used generally. Flexibility of the design in each of the blocks was critical because it was obvious from the flow of data that there would be a lot of routing feeding back and forth, sometimes over adjacent blocks. After the components of the design were finalized, the floorplan was developed (Fig. 19).

Fig. 19 - Final Floorplan


The circuit had two major components, the biquad and the comb. The components of the comb included the delay network, comprised of 16 registers, a floating point adder, and a mux. As described before, the coefficients were used to set the functionality of our circuit to any of the 16 possible functions. The entire comb network is located in the top left corner.

The biquad was the second major component. The significant components within the biquad were registers, a floating point multiplier, and a floating point adder. The layout of these will be discussed in later sections.

Primitive Gates

When designing the primitive gates, the main limitation was using only metal one for interconnect. Larger vdd! and gnd! rails were used in an attempt to plan ahead for voltage drop issues. In addition to this, several versions of each gate were created in order to be used in disparate places in the design. Below is an example of the minimum sized OR and a similar AND gate.

Fig. 20 - OR gate

Fig. 21 - AND gate


Floating Point Multiplier

The final layout for the floating point multiplier is shown on the right. In the top-level layout there are five total instances of this floating point multiplier. Each of these had a total of 2,464 transistors. The total area was 13,543.50 µm². Initially, the multiplier design had a density of over 0.20; however, additional components meant to handle edge case scenarios dropped the design to 0.18. Looking at the layout, it is possible to distinguish the older, denser sections and the components to the left that were added to fix computation problems. The multiplier was created in a way that would allow it to be very porous to metal four interconnect. Therefore, the use of metal four was restricted. Had more metal four been used at the lower levels, the design would have been denser.


Wallace Tree

Pictured below is the core of the multiplier. As stated earlier, it is a Booth-recoded Wallace tree. This layout contains the components that perform the Wallace tree operations. Despite the complexity of the Booth-recoded Wallace tree, the layout is relatively dense. It also uses metal four so sparingly that it is almost entirely porous to metal four.

Fig. 23 - Booth Recoder and Wallace Tree


Other Major Multiplier Components

Fig. 24 - PPgen7bit

Fig. 25 - comp3_2


Floating Point Adder

Pictured here is the implementation of the floating point adder. The main goals were ease of data flow, abutment, and a neat, easy to interpret layout. Like the multiplier, it is highly porous to metal 4. This made for easier global routing; however, had metal 4 been used more generously, some areas of the design could have been better optimized.

Fig. 26 - Floating Point Adder


Components of the Floating Point Adder

Correction

Fig. 27 - Correction

Register

Registers make up the primary components of the comb and are used in the recursive network as well. Shown below are basic registers as well as segments of the comb.

Fig. 28 - 12-bit Register

Fig. 29 - Register


Fig. 30 - Comb Delay Network

Improved Multiplier and Revised Top Level

Once most of the major milestones had been achieved, it was obvious that the floating point multiplier was the limiting factor both in overall size and density of the design. Hence, as an exercise in demonstrating what could be achieved with a second iteration, the multiplier was redone from scratch using what had been learned during the original process. As can be seen in Fig. 31, the overall improvement was significant. While this was partially due to the higher quality of layout, the vast majority of it came from the logic simplifications that were made once the full function of the various components was understood. This improved multiplier, while not completed, was also put into the original floorplan to get an estimate of the overall effect on the chip’s dimensions and density (Fig. 32). The area of the chip decreased by 34,700 µm² and the density increased to 0.214.


Fig. 31 - Improved Floating Point Multiplier

Fig. 32 - Modified Floorplan


VII. Verification

Throughout the design process, all high-level testing was done against the graphs provided in the Lyons and Bell publication. A cropped version of the first table from Lyons and Bell is shown in Fig. 33.

Verification on the Swiss_525 chip was done in a hierarchical fashion, though it switched from top down to bottom up after the preliminary testing was done. Initially MATLAB was used to verify the transfer function that was highlighted in Lyons and Bell [1]. After this “proof of concept” stage, the floating point adder and multiplier were verified using C and behavioral Verilog. Next the structural Verilog was written and tested from the bottom up. Almost every individual component of the floating point adder and multiplier was exhaustively tested. Following this, the schematics for the floating point adder and floating point multiplier were tested against the Verilog for the sample cases that had been previously used to verify each section in structural Verilog. From here, the layout was tested against the schematic using the LVS function in Cadence. After the entire chip LVSed, many of the individual components were simulated using Analog Artist and finally a simulation of the entire chip was completed. For the final chip verification a test was run through the critical path.

[1] $$H(z) = \left(1 - c_1 z^{-N}\right)\,\frac{b_0 + b_1 z^{-1} + b_2 z^{-2}}{1/a_0 - a_1 z^{-1} - a_2 z^{-2}}$$

Fig. 33 - Lyons and Bell table


MATLAB Verification of Lyons and Bell paper:

Initially MATLAB was used to verify the transfer function, H(z), that was highlighted in the Lyons and Bell paper. This was done using the transfer equation stated, a few basic mathematical functions, and Fourier shift functions from MATLAB’s toolbox. The functions operated on an impulse as input and produced matching graphs for those provided in the paper. A few of these graphs from the paper are highlighted in Fig. 34.
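The same kind of check can be sketched in a few lines by evaluating the transfer function directly on the unit circle. This is not the project's MATLAB script: it is a minimal Python sketch that assumes the H(z) given in the footnote of this section, and the function name `H` and the coefficient choices in the usage note are illustrative assumptions.

```python
import cmath

def H(omega, b0, b1, b2, a1, a2, c1=0, N=8, a0=1.0):
    """Evaluate the network's transfer function at z = exp(j*omega)."""
    zi = cmath.exp(-1j * omega)                  # z^-1 on the unit circle
    comb = 1 - c1 * zi ** N                      # (1 - c1 z^-N)
    numerator = b0 + b1 * zi + b2 * zi ** 2
    denominator = 1 / a0 - a1 * zi - a2 * zi ** 2
    return comb * numerator / denominator
```

For instance, a first-order recursive configuration with b0 = 0.25 and a1 = 0.75 has unity gain at omega = 0, and the moving-average configuration (c1 = 1, N = 8, a1 = 1, b0 = 1/8) has a null at omega = 2π/8. Plotting `abs(H(w, ...))` over a sweep of `w` gives magnitude curves that can be compared against the published graphs.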

Fig. 35 and Fig. 36 show the transfer function of a Leaky Integrator and a Moving Averager respectively. In each of these figures, the original graph from the paper is shown on the left and the MATLAB produced graph is shown on the right.

Fig. 34 - Transfer Function

Fig. 35 - Leaky Integrator (original on left, MATLAB verification on right)

Fig. 36 - Moving Averager (original on left, MATLAB verification on right)


C and Behavioral Verilog verification of Floating Point functions:

It was decided that a 12-bit floating point adder and multiplier were needed and these two major blocks were written in behavioral Verilog. For verification, a few sample-cases were hand-picked to check that the code was working. “Easy” test cases were first picked and as code would pass a case, a more “difficult” test case would be constructed. What defined “Easy” through “Difficult” was the number of lower level blocks that had to work correctly to show a correct final result, and the existence or lack of an edge case condition. For example, the addition of two positive numbers, with the same exponent, same sign, and no overflow in the addition of the mantissas would be considered a base or “Easy” test case. A “Difficult” test case might be the addition of a positive and a negative number with different exponents and mantissas set to “11111” each (edge case and overflow). A sample of one of these “progressive functionality” test suites is shown in Appendix B. Fig. 37 below shows the waveforms produced from an earlier version of that test.

The three waveforms shown in Fig. 37 are input1, input2 and output for a 12-bit floating point multiplier. They show the results:

1.) 0.00 * 0.0000 = 0 (0 000000 00000)
2.) 0.50 * -0.4375 = -0.21875 (1 011100 11000)
3.) 4.50 * 2.5000 = 11.25 (0 100010 01101)
4.) 5.75 * 2.0625 = 11.75 (0 100010 01111)
5.) 7.50 * 3.8750 = 29.0 (0 100011 11010)
6.) 62.00 * 14.0000 = 864 (0 101000 10110)
7.) -62.00 * -14.0000 = 864 (0 101000 10110)
8.) 4.25 * 7.7500 = 32 (0 100100 00000)

As can be seen, the output, while correct within the granularity of the 12-bit floating point numbers, is not always the exact decimal equivalent of a*b. As an example: 5.75*2.0625 = 11.859375, but the closest 12-bit floating point representation to 11.859375 is 11.75. Hence no accuracy was lost during the calculation, only in the final storage of the number. It was found using the program explained on the next page, that this was almost always the case.
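The bit patterns in the list can be decoded by hand, or with a short helper like the one below. The report does not state the exponent bias or whether a leading 1 is implied; the values used here (bias 31, hidden leading 1, all-zero pattern meaning zero) are assumptions chosen because they reproduce every result in the list. The function name `fp12_to_float` is hypothetical.

```python
BIAS = 31  # assumption: the bias that makes the listed patterns match their decimals

def fp12_to_float(bits):
    """Decode an 'S EEEEEE FFFFF' string (spaces optional) into a Python float."""
    raw = int(bits.replace(" ", ""), 2)
    if (raw & 0x7FF) == 0:
        return 0.0                              # all-zero magnitude encodes zero
    sign = -1.0 if (raw >> 11) & 1 else 1.0
    exp = (raw >> 5) & 0x3F                     # 6 exponent bits
    frac = raw & 0x1F                           # 5 fraction bits
    return sign * (1 + frac / 32) * 2.0 ** (exp - BIAS)
```

Decoding `"0 100010 01111"` gives 11.75 and `"0 100010 01101"` gives 11.25; the next representable value above 11.75 is 12.0, which shows the granularity that absorbs the exact product 11.859375.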

It became clear after doing a few tests by hand that a more efficient way of calculating the value of the non-standard 12-bit floating point numbers was needed. A conversion program was written in the C language to quickly move between base 10 decimal and 12-bit floating point. This program was later upgraded to have calculator functionality and it assisted with testing for the remainder of the project. Fig. 38 shows a screenshot of this program.

Fig. 37 – Behavioral Waveforms


In the example shown in Fig. 38 a user has:

1.) Started the program.
    • A help menu is displayed showing the functionality of the program.
2.) Entered a 12-bit binary number.
    • The decimal equivalent is displayed with enough accuracy to display all numerical data provided by the 12-bit floating point.
3.) Entered a decimal number.
    • The 12-bit floating point value is displayed in 3 sections: S EEEEEE FFFFF, where S = sign bit, E = exponent bit, and F = fraction bit.
4.) Entered “* 5.4 2.3”.
    • The result of 5.4*2.3 is displayed in decimal. Functionality for addition and subtraction is also included.
5.) Entered the result “12.42”.
    • The binary of “5.4*2.3” is displayed (5.4*2.3 = 12.42).
    • Steps 2-5 are often used to multiply two floating point numbers and check their accuracy. The internal representation (12-bit FP) is responsible for some error, so the user cannot assume that the “correct” answer of “a * b” equals a*b in decimal. Rather, it equals (a’s closest FP representation) * (b’s closest FP representation) = c, where c is again stored in its closest FP representation.
6.) Quit the program.
    • The program exits.
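The decimal-to-binary direction of such a converter can be sketched as follows. This is not the project's C program, whose source and rounding behavior are not shown here: the sketch is in Python, assumes a bias of 31 and a hidden leading 1 (inferred from the multiplier results listed earlier), and truncates the fraction rather than rounding to nearest. The name `float_to_fp12` is hypothetical.

```python
import math

BIAS = 31  # assumption: inferred from the multiplier examples

def float_to_fp12(value):
    """Encode a float as an 'S EEEEEE FFFFF' string, truncating to 5 fraction bits."""
    if value == 0:
        return "0 000000 00000"
    sign = "1" if value < 0 else "0"
    mant, exp2 = math.frexp(abs(value))   # abs(value) == mant * 2**exp2, 0.5 <= mant < 1
    exp = exp2 - 1 + BIAS                 # rescale so the significand lies in [1, 2)
    frac = int(mant * 64) - 32            # keep 6 significant bits, drop the hidden 1
    if not 0 <= exp <= 63:
        raise ValueError("value is outside the range of the 12-bit format")
    return f"{sign} {exp:06b} {frac:05b}"
```

Under these assumptions `float_to_fp12(11.859375)` returns `"0 100010 01111"`, the pattern for 11.75, which matches the stored result of 5.75 * 2.0625 discussed above.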

Fig. 38 - Conversion Code Screenshot


Structural Verilog Verification:

The structural Verilog was written and tested from the bottom up. Individual components were exhaustively tested as they were written. Certain modules were too complex to test exhaustively with a simple Verilog testbench and inspection. For these complex cases, C code was written that exhaustively exercised a behavioral model and wrote its results to a text file. The Verilog was then run exhaustively, writing its results to one or more text files, and the outputs were compared using the unix program “diff”. An example of one of these sets is found in Appendix C. An abridged list of structural components that were exhaustively tested is shown below.

Or_6_Bit Or_9_Bit Nor_12_Bit Nor_11_Bit

FxDiff_11bit FxDiff_6bit FxDiff_5bit Mux_7bit Mux_5bit Mux_11bit Mux_1bit FxAdder_5bit

FA FxMult_6bit Mux_6bit FxAdder_6bit
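The exhaustive comparison of a structural model against a behavioral reference can be illustrated on a small block. The sketch below is in Python rather than the project's C and Verilog, and the functions `full_adder`, `fx_adder_5bit` and `exhaustive_mismatches` are hypothetical stand-ins named after the FA and FxAdder_5bit entries in the list; the actual modules may differ in ports and carry handling.

```python
from itertools import product

def full_adder(a, b, cin):
    """Gate-level full adder: returns (sum, carry_out) for three input bits."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def fx_adder_5bit(x, y):
    """Structural model: five full adders chained into a ripple-carry adder."""
    carry, total = 0, 0
    for i in range(5):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total | (carry << 5)           # carry out becomes the sixth result bit

def exhaustive_mismatches():
    """List every input pair where the structural model disagrees with x + y."""
    return [(x, y) for x, y in product(range(32), repeat=2)
            if fx_adder_5bit(x, y) != x + y]
```

An empty list from `exhaustive_mismatches()` plays the same role as an empty “diff” between the two text files: all 1,024 input pairs agree.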

Schematic Verification:

The schematics for floating point adder and floating point multiplier were tested against the Verilog for the sample cases that had been previously used to verify each section in structural Verilog. They quickly passed, and a top level of the circuit was constructed and passed tests after a few bugs. This stage was rather uneventful.

Layout Verification:

The layout was tested against the schematic using the LVS function in Cadence. Even though a top-level schematic had already been created, testing for the layout was done in smaller steps. Basic blocks were LVSed against their counterparts in schematic. However, after the floating point adder and multiplier were LVSed, the design was approached differently. The top level was then built in small “chunks” and new schematics were made solely for the purpose of testing these chunks. Fig. 39 and Fig. 40 below show “wings2”, a layout and schematic made up of two of the floating point multipliers and three 12-bit registers. It was in 11 sections such as these that the top level was constructed piece by piece.


Fig. 39 - wings2 layout

Fig. 40 (right) shows the schematic that was built solely for verification purposes. This became one of the basic building blocks of the top-level design.

These 11 pieces were constructed as follows:

1.) One half of the top-level design is called “Top Six”. It was built and tested in the following stages:

a. “wings2” (below), “wings2_bottom” (below) and two floating point adders make up “Top Six”.

i. Two floating point multipliers, their input registers and register “D2” make up “wings2”.

ii. Two floating point multipliers, their input registers and registers “D1” make up “wings2_bottom”.

2.) The other half is called “CombBottom3” and was built as follows:

a. “Comb” is built as such:

i. Sixteen 12-bit registers are used to construct “Comb_Delay_8_16”.

ii. A row of two 12-bit muxes was added to “Comb_Delay_8_16” to make “Comb_delayMuxes”.

iii. A floating point adder was connected to “Comb_delayMuxes” to make “Comb”.

b. “Bottom3” is built as such:

i. A floating point adder and multiplier are connected to make “Bottom2”.

ii. “Bottom2” and a floating point adder are connected to make “Bottom3”.

c. “Comb” and “Bottom3” are combined to make “CombBottom3”.

Fig. 40 – wings2 schematic


The top level schematic was then built anew from these two blocks and input and output registers. It was tested against the old top-level schematic for accuracy. The visual is not impressive, but is shown in Fig. 41 along with a picture of spaceship one for spice.

Fig. 41 - Top Level Schematic

Analog Artist Verification:

After the entire chip passed LVS, many of the individual components were simulated using Analog Artist, and finally the entire chip was simulated. The final chip verification exercised our critical path. The chip successfully ran at 1 MHz, which was above the speed goal for audio applications. This was considered acceptable, so there was no attempt to specify the maximum speed of the circuit any more narrowly.

Fig. 42 - Analog Simulation


Soft-IP:

It was determined that, to perform all 22 functions, the circuit needed the ability to operate on complex coefficients. Laying out a complex version of the design in hardware would have required the addition of approximately 60,000 transistors. In a naïve implementation, every floating point adder would become two adders and every floating point multiplier would become four multipliers and two adders. The resulting design of approximately 100,000 transistors would be well outside the scope of the class. Instead, the complex version of the circuit was constructed in software. The naïve implementation of the complex mathematics was written for the top level of the design. This Verilog version of the circuit was verified against the Lyons/Bell paper as shown in Fig. 43.

Fig. 43 - SoftIP verification
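The operation counts quoted for the naïve complex implementation follow directly from the arithmetic. A minimal sketch, assuming complex values are held as (real, imaginary) pairs; the function names are illustrative and are not taken from the project's Verilog:

```python
def complex_add(a, b):
    """One complex add costs two real adds."""
    return (a[0] + b[0], a[1] + b[1])


def complex_mul(a, b):
    """One complex multiply costs four real multiplies and two real adds
    (one of the two "adds" is a subtraction)."""
    ar, ai = a
    br, bi = b
    real = ar * br - ai * bi
    imag = ar * bi + ai * br
    return (real, imag)
```

For example, `complex_mul((1.0, 2.0), (3.0, 4.0))` gives `(-5.0, 10.0)`, matching (1 + 2j)(3 + 4j) = −5 + 10j.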


VIII. Results

The final result of this design process was a design specification ready for fabrication using the 0.18 CMOS technology (hardware end) and parameterized behavioral Verilog code for complete functionality. The Swiss_525 chip (hardware) is able to implement 16 of the 22 functions enumerated in the Lyons and Bell paper. The chip has 90 pins, an overall area of 177,749.5 µm², and functions fully at 1 MHz. The density of the chip varies from section to section, with the comb being as dense as 0.4 transistors/µm² and the floating point multiplier standing at around 0.18 transistors/µm². The overall density is 0.194 transistors/µm². The chip operates on real inputs and coefficients and represents them internally as 12-bit floating point numbers.

IX. Conclusion

Overall, the design process was completed as described above. However, there were a number of issues that increased the difficulty of the design. First, there was a general misunderstanding of the DSP concepts presented in the Lyons and Bell paper. This error came up very early in the design process and was so fundamental that it caused the very definition of the initial proposal to change on a weekly basis. This resulted in an emphasis on meeting the weekly milestones rather than correctly and thoroughly addressing all design decisions. While this was analogous to a "time-to-market" scenario, it was very clear that a vast number of improvements to the design could have been made had there been a chance to stop, step back, and thoroughly fix the initial stages. However, if any task had not been finished in the allotted time, the product would not have been completed during the period in which it was still marketable.

As a demonstration of this, it was decided that the floating point multiplier would be redesigned with the goal of reducing area and improving the internal logic. This process was found to reduce the overall number of transistors needed in the multiplier by over 20 percent and could have reduced the overall area of the chip by roughly the same amount. During this process, the errors caused by pushing solely for the milestones became very apparent. Such a large improvement on this single component could have translated into massive reductions in both transistor count and area had all aspects of the chip gone through a similar process.

As mentioned earlier, there was a "Soft-IP" produced in addition to the hardware design. This was completed, but with errors under certain bit-widths. Changes are being made after the production of this paper to fix these cases. Since the behavioral Verilog meets no constraints other than functionality or lack thereof, the only statement that can be made is that it can be done, but not within the constraints of this course.


A. Appendix: “The Swiss Army Knife of Digital Networks” by Lyons and Bell

IEEE SIGNAL PROCESSING MAGAZINE, MAY 2004, p. 90

This article describes a general discrete-signal network that appears, in various forms, inside many digital signal processing (DSP) applications. So the “DSP Tip” for this column is for every DSP engineer to become acquainted with this network. Figure 1 shows how the network’s structure has the distinct look of a digital filter, a comb filter followed by a second-order recursive network. However, we do not call this unique general network a filter because its capabilities extend far beyond simple filtering. Through a series of examples, we illustrate the fundamental strength of the network: its ability to be reconfigured to perform a surprisingly large number of useful functions based on the values of its seven control parameters.

The general network has a transfer function of

$$H(z) = \left(1 - c_1 z^{-N}\right) \times \frac{b_0 + b_1 z^{-1} + b_2 z^{-2}}{1/a_0 - a_1 z^{-1} - a_2 z^{-2}}. \qquad (1)$$

From this point on, we’ll use DSP filter lingo and call the second-order recursive network a “biquad” because its transfer function is the ratio of two quadratic polynomials. The tables in this article list various signal-processing functions performed by the network based on the $a_n$, $b_n$, and $c_1$ coefficients. The variable N is the order of the comb filter. Included in the tables are depictions of the network’s impulse response, z-plane pole/zero locations, as well as frequency-domain magnitude and phase responses. The frequency axis in those tables is normalized such that a value of 0.5 represents a frequency of $f_s/2$, where $f_s$ is the sample rate in hertz.
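To make the transfer function in (1) concrete, the sketch below runs samples through a comb stage followed by a biquad. It is one direct form II arrangement that is consistent with (1), written under assumptions: the function name `swiss_army_network`, the tuple layout of the coefficients, and the zero initial state are illustrative choices, and N is assumed to be at least 1.

```python
def swiss_army_network(x, N, c1, a, b):
    """Comb (1 - c1 z^-N) followed by a biquad, per equation (1).

    a = (a0, a1, a2) and b = (b0, b1, b2); values may be real or complex.
    """
    a0, a1, a2 = a
    b0, b1, b2 = b
    v1 = v2 = 0.0  # biquad state: v(n-1) and v(n-2)
    y = []
    for n, sample in enumerate(x):
        delayed = x[n - N] if n >= N else 0.0
        w = sample - c1 * delayed              # comb output
        v = a0 * (w + a1 * v1 + a2 * v2)       # recursive part: 1/a0 - a1 z^-1 - a2 z^-2
        y.append(b0 * v + b1 * v1 + b2 * v2)   # feed-forward part: b0 + b1 z^-1 + b2 z^-2
        v2, v1 = v1, v
    return y
```

With the moving-averager coefficients (a0 = 1, a1 = 1, a2 = 0, b0 = 1/N, c1 = 1) and N = 4, a unit impulse produces four outputs of 0.25 followed by zeros, which is the impulse response of a 4-point average.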

Moving Averager

Referring to the first entry in Table 1, this network configuration is a computationally efficient method for computing the N-point moving average of x(n). Also called a recursive running sum or boxcar averager, this structure is equivalent to an N-tap direct convolution finite impulse response (FIR) filter with all the coefficients having a value of 1/N. However, this moving averager is efficient because it performs only one add and one subtract per output sample regardless of the value of N (whereas an N-tap direct convolution FIR filter must perform N − 1 additions per output sample). The moving averager’s transfer function is

$$H_{ma}(z) = \frac{(1/N)\left(1 - z^{-N}\right)}{1 - z^{-1}}.$$

dsp tips & tricks

Richard Lyons and Amy Bell

The Swiss Army Knife of Digital Networks

“DSP Tips and Tricks” introduces practical tips and tricks of design and implementation of signal processing algorithms so that you may be able to incorporate them into your designs. We welcome readers who enjoy reading this column to submit their contributions. Contact Associate Editors Rick Lyons ([email protected]) or Amy Bell ([email protected]).

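Figure 1. General discrete-signal processing network: a comb section (delay $z^{-N}$, coefficient $c_1$) followed by a second-order recursive network (biquad) with coefficients $a_0$, $a_1$, $a_2$, $b_0$, $b_1$, $b_2$ and two $z^{-1}$ delays; input x(n), output y(n).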


$H_{ma}(z)$’s numerator results in N zeros equally spaced around the z-plane’s unit circle located at $z(k) = e^{j2\pi k/N}$, where integer k is 0 ≤ k < N. $H_{ma}(z)$’s denominator places a single pole at z = 1 on the unit circle, canceling the zero at that location.
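The efficiency claim for the moving averager (one add and one subtract per output, independent of N) can be checked against the direct N-tap average. A minimal sketch; the function names are illustrative and samples before the start of the input are assumed to be zero:

```python
def moving_average_direct(x, N):
    """N-tap FIR form: sum the last N samples for every output."""
    out = []
    for n in range(len(x)):
        window = [x[n - k] for k in range(N) if n - k >= 0]
        out.append(sum(window) / N)
    return out


def moving_average_recursive(x, N):
    """Running-sum form: add the newest sample, subtract the one leaving the window."""
    out = []
    acc = 0.0
    for n, sample in enumerate(x):
        oldest = x[n - N] if n >= N else 0.0
        acc += sample - oldest
        out.append(acc / N)
    return out
```

Both functions return the same sequence for inputs that are exactly representable; with arbitrary floating-point data the running sum can accumulate rounding error over long runs, which is a known trade-off of the recursive form.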

Differencer

This is a discrete version of a first-order differentiator. An ideal differentiator has a frequency magnitude response that is a linear function of frequency, and this network only approaches that ideal at low frequencies relative to $f_s$.

Integrator

This structure performs the running summation of the x(n) input samples, making it the discrete-time equivalent of a continuous-time integrator.

Leaky Integrator

This network configuration, also called an exponential averager, is a venerable structure used in low-pass filter implementations for random noise reduction. It is a first-order infinite impulse response (IIR) filter where, for stable low-pass operation, the constant α lies in the range 0 < α < 1.

This nonlinear-phase filter has a single pole at z = 1 − α on the z-plane and a transfer function of $H_{li}(z) = \alpha/[1 - (1 - \alpha)z^{-1}]$. Small values for α yield narrow passbands at the expense of increased filter response time. Table 1 shows the filter’s behavior for α = 0.1 as solid curves. For comparison, the frequency domain performance for α = 0.5 is indicated by the dashed curves.
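The leaky integrator’s transfer function corresponds to the difference equation y(n) = αx(n) + (1 − α)y(n − 1). A short sketch, with an illustrative function name and an assumed zero initial state, shows the trade-off between α and response time:

```python
def leaky_integrator(x, alpha):
    """Exponential averager: y(n) = alpha*x(n) + (1 - alpha)*y(n-1)."""
    y = []
    prev = 0.0
    for sample in x:
        prev = alpha * sample + (1.0 - alpha) * prev
        y.append(prev)
    return y
```

For a unit step input, α = 0.5 reaches 0.875 after three samples, while α = 0.1 needs roughly twenty samples to get that close to the final value of 1.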

First-Order Delay Network

A subclass of a first-order IIR filter, the coefficients in Table 1 yield an all-pass network having a relatively constant group delay at low frequencies. The network’s delay is $D_{total} = 1 + \Delta_{delay}$ samples, where $\Delta_{delay}$, typically in the range of −0.5 to 0.5, is a fraction of the $1/f_s$ sample period. For example, when $\Delta_{delay}$ is 0.2, the network delay (at low frequencies) is 1.2 samples. The real-valued R coefficient is

$$R = \frac{-\Delta_{delay}}{\Delta_{delay} + 2}, \qquad (2)$$

producing a z-plane transfer function of $H_{1,del}(z) = (R + z^{-1})/(1 + Rz^{-1})$ with a pole at z = −R and a zero at z = −1/R.

Performance for $\Delta_{delay} = 0.2$ (R = −0.091) is shown in Table 1, where we see the magnitude response being constant. The band, centered at dc, over which the group delay varies no more than $|\Delta_{delay}|/10$ from the specified $D_{total}$ value, the bar in the group delay plot, ranges roughly from $0.1 f_s$ to $0.2 f_s$ for first-order networks. If your signal is oversampled, making it low in frequency relative to $f_s$, this first-order all-pass delay network may be of some use. If you propose its use in a new design, you can impress your colleagues by saying this network is based on the Thiran approximation [1].

Second-Order Delay Network

A subclass of a second-order IIR filter, the coefficients in Table 1 yield an all-pass network having a relatively constant group delay at low frequencies (over a wider frequency range, by the way, than the first-order delay network). This network’s delay is $D_{total} = 2 + \Delta_{delay}$ samples, where $\Delta_{delay}$ is typically in the range of −0.5 to 0.5. For example, when $\Delta_{delay}$ is 0.3, the network delay (at low frequencies) is 2.3 samples. The real-valued coefficients are

$$R_1 = \frac{-2\Delta_{delay}}{\Delta_{delay} + 3} \quad \text{and} \quad R_2 = \frac{(\Delta_{delay})(\Delta_{delay} + 1)}{(\Delta_{delay} + 3)(\Delta_{delay} + 4)}. \qquad (3)$$

The band, centered at dc, over which the group delay varies no more than $|\Delta_{delay}|/10$ from the specified $D_{total}$ value, the bar in the group delay plot, ranges roughly from $0.26 f_s$ to $0.38 f_s$ for this second-order network. Performance for $\Delta_{delay} = 0.3$ ($R_1 = -0.182$ and $R_2 = 0.028$) is shown in Table 1, where we see the magnitude response being constant.

The flat group delay band is wider for negative $\Delta_{delay}$ than when $\Delta_{delay}$ is positive. For example, this means if you desire a group delay of $D_{total} = 2.5$ samples, it is better to use an external unit delay and set $\Delta_{delay}$ to −0.5 rather than letting $\Delta_{delay}$ be 0.5. To ensure stability, $\Delta_{delay}$ must be greater than −1. Reference [1] provides methods for designing higher-order allpass delay networks.
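The delay-network coefficients in (2) and (3) are simple to evaluate, and doing so is a quick check on the example values quoted in the text. The function names below are illustrative only:

```python
def first_order_delay_coeff(delta):
    """R from equation (2) for a fractional delay `delta` (in samples)."""
    return -delta / (delta + 2.0)


def second_order_delay_coeffs(delta):
    """(R1, R2) from equation (3) for a fractional delay `delta`."""
    r1 = -2.0 * delta / (delta + 3.0)
    r2 = delta * (delta + 1.0) / ((delta + 3.0) * (delta + 4.0))
    return r1, r2
```

For a delay of 0.2 samples, (2) gives R of about −0.0909, so the zero at −1/R sits at z = 11. For a delay of 0.3 samples, (3) gives R1 of about −0.1818 and R2 of about 0.0275.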

Goertzel Network

Referring to the first entry in Table 2, this traditional Goertzel network is used for single-tone detection because it computes a single-bin N-point discrete Fourier transform (DFT) centered at an angle of θ = 2πk/N rad on the unit circle, corresponding to a cyclic frequency of $kf_s/N$ Hz. Frequency variable k, in the range 0 ≤ k < N, need not be an integer. The behavior of the network is shown by the solid curves in Table 2. However, the frequency magnitude response of the Goertzel algorithm, for N = 8 and k = 1, is shown as the dashed curve.

After N + 1 input samples are applied, y(n) is a single-bin DFT result. The DFT computational workload is N + 2 real multiplies and 2N + 1 real adds. The network is typically stable because N is kept fairly low (in the hundreds) in practice before the network is reinitialized [2], [3].
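The Goertzel recursion can be checked against a directly computed DFT bin. The sketch below uses the Table 2 coefficients (a1 = 2cos θ, a2 = −1, b1 = −e^{−jθ}) and assumes an integer k, a zero initial state, and a zero-valued extra sample as the (N + 1)th input; the function names are illustrative.

```python
import cmath
import math


def goertzel_bin(x, k):
    """Single DFT bin via the second-order Goertzel recursion."""
    N = len(x)
    theta = 2.0 * math.pi * k / N
    coeff = 2.0 * math.cos(theta)
    v1 = v2 = 0.0  # v(n-1) and v(n-2)
    for sample in list(x) + [0.0]:  # N samples plus one trailing zero
        v0 = sample + coeff * v1 - v2
        v2, v1 = v1, v0
    # After the loop v1 holds v(N) and v2 holds v(N-1).
    return v1 - cmath.exp(-1j * theta) * v2


def dft_bin(x, k):
    """Reference: direct evaluation of one DFT bin."""
    N = len(x)
    return sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
```

The feedback loop uses only real arithmetic for real input; the single complex multiply happens once, after the last sample.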



Sliding DFT Network

This structure computes a single-bin N-point DFT centered at an angle of θ = 2πk/N rad on the unit circle, corresponding to a cyclic frequency of $kf_s/N$ Hz. N is the DFT size and integer k is 0 ≤ k < N. The real damping factor r is kept as close to, but less than, unity as possible to maintain network stability. After N input samples have been applied, this network will compute a new follow-on DFT result based on each new x(n) sample (thus the term sliding) at a computational workload of only four real multiplies and four real adds per input sample [2], [3]. Setting coefficient

Table 1. General functions. For each function the original table lists the coefficients and plots the impulse response (versus time), the z-plane pole/zero locations (imaginary part versus real part), and the magnitude (dB) and phase (rad.) responses versus frequency; the delay networks show group delay in place of phase. The plots are not reproduced here.

Moving Averager: a0 = 1, a1 = 1, a2 = 0, b0 = 1/N, b1 = 0, b2 = 0, c1 = 1, N = 8

Differencer: a0 = 1, a1 = 0, a2 = 0, b0 = 1, b1 = −1, b2 = 0, c1 = 0

Integrator: a0 = 1, a1 = 1, a2 = 0, b0 = 1, b1 = 0, b2 = 0, c1 = 0

Leaky Integrator: a0 = 1, a1 = 1 − α, a2 = 0, b0 = α, b1 = 0, b2 = 0, c1 = 0, α = 0.1 (the plots also show α = 0.5)

First-Order Delay Network: a0 = 1, a1 = −R, a2 = 0, b0 = R, b1 = 1, b2 = 0, c1 = 0 (plot annotations: delay of 0.2, zero at z = 11)

Second-Order Delay Network: a0 = 1, a1 = −R1, a2 = −R2, b0 = R2, b1 = R1, b2 = 1, c1 = 0 (plot annotations: delay of 0.3, R1 = −0.182, R2 = 0.028, zeros not shown)


$c_1 = -r^N$ allows the analysis band to be centered at an angle of θ = 2π(k + 1/2)/N rad, corresponding to a cyclic frequency of $(k + 1/2) f_s/N$ Hz.

Real Oscillator

There are many possible digital oscillator structures, but this network generates a real-valued sinusoidal y(n) sequence whose amplitude is not a function of the output frequency. The argument for coefficient a1 in Table 2 is $\theta = 2\pi f_t/f_s$ rad, where $f_t$ is the oscillator’s tuned frequency in hertz. To start the oscillator we set the y(n − 1) sample driving the a1 multiplier equal to 1 and compute new output samples as the time index n advances. For fixed-point implementations, filter coefficients may need to be scaled so that all intermediate results are in the proper numerical range [4].

Quadrature Oscillator

Called the coupled quadrature oscillator, this structure provides y(n) = cos(nθ) + j sin(nθ) outputs for a complex exponential sequence whose tuned frequency is $f_t$ Hz. The exponent for a1 in Table 2 is

Table 2. Analysis and synthesis functions. For each function the original table lists the coefficients and plots the impulse response (versus time), the z-plane pole/zero locations, and the magnitude (dB) and phase (rad.) responses versus frequency; the plots are not reproduced here. Plot annotations that survive extraction: N = 8, k = 1; N = 22, k = 2, θ = 4π/22, r = 0.999; θ = π/4, G = 1; θ = π/4.

Goertzel Network: a0 = 1, a1 = 2cos(θ), a2 = −1, b0 = 1, b1 = $-e^{-j\theta}$, b2 = 0, c1 = 0, θ = 2πk/N

Sliding DFT Network: a0 = $re^{j\theta}$, a1 = 1, a2 = 0, b0 = 1, b1 = 0, b2 = 0, c1 = $r^N$, θ = 2πk/N

Real Oscillator: a0 = 1, a1 = 2cos(θ), a2 = −1, b0 = 1, b1 = 0, b2 = −1

Quadrature Oscillator: a0 = G(n), a1 = $e^{j\theta}$, a2 = 0, b0 = 1, b1 = 0, b2 = 0

Audio Comb: a0 = 1, a1 = 0, a2 = α, b0 = 1, b1 = 0, b2 = 0, c1 = 0, α = 0.2


$\theta = 2\pi f_t/f_s$ rad. To start the oscillator, we set the complex y(n − 1) sample, driving the a1 multiplier, equal to 1 + j0 and begin computing output samples as the time index n advances. To ensure oscillator output stability in fixed-point arithmetic implementations, instantaneous gain correction G(n) must be computed for each output sample. The G(n) sample values will be very close to unity [5], [6].

Audio Comb

This structure is a second-order (the simplest) version of an IIR comb filter used by audio folks to synthesize the sound of a plucked-string instrument. The input to the filter is random noise samples. The filter has frequency response peaks at dc and $\pm f_s/2$, with dips in the response located at $\pm f_s/4$. The filter’s transfer function is $H_{ac}(z) = 1/(1 - \alpha z^{-2})$, resulting in two poles located at $z = \pm\sqrt{\alpha}$ on the z-plane. To maintain stability the real-valued α must be less than unity, and the closer α is to unity, the more narrow the frequency response peaks.

For a more realistic-sounding synthesis, we can set a1 = α and the top delay element of the biquad in Figure 1 may have its delay increased to, say, eight instead of one, yielding more frequency response peaks between 0 and $f_s/2$ Hz. In this music application, the filter’s input is Gaussian white noise samples. Other plucked-string instrument synthesis networks have been used with success [7], [8].

Comb Filter

Referring to the first entry in Table 3, this standard comb filter is a key component in many filtering applications, as we shall see. Its transfer function, $H_{comb}(z) = 1 - z^{-N}$, results in N zeros equally spaced around the z-plane’s unit circle located at $z(k) = e^{j2\pi k/N}$, where integer k is 0 ≤ k < N. Those z(k) values are the N roots of unity when we set $H_{comb}(z)$ equal to zero, yielding $z(k)^N = (e^{j2\pi k/N})^N = 1$. The N zeros on the unit circle result in frequency response nulls (infinite attenuation) located at cyclic frequencies of $m f_s/N$, where integer m is 0 ≤ m ≤ N/2. The peak gain of this linear-phase filter is two.

If we set coefficient $c_1$ to −1 in the comb filter, making its transfer function $H_{alt,comb}(z) = 1 + z^{-N}$, we obtain an alternate linear-phase comb filter having zeros rotated counterclockwise around the unit circle by an angle of π/N rad, positioning the zeros at angles of 2π(k + 1/2)/N rad on the z-plane’s unit circle. The rotated zeros result in frequency response nulls located at cyclic frequencies of $(m + 1/2) f_s/N$. With this filter a frequency magnitude peak is located at 0 Hz (dc).
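The null and peak locations of both comb variants can be confirmed by evaluating the transfer function on the unit circle. A small sketch, with an illustrative function name and frequency expressed in cycles per sample (so 0.5 corresponds to half the sample rate):

```python
import cmath
import math


def comb_response(f, N, c1=1.0):
    """Frequency response of 1 - c1*z^-N at normalized frequency f."""
    return 1.0 - c1 * cmath.exp(-2j * math.pi * f * N)
```

With N = 8 and c1 = 1 the magnitude is zero at f = m/8 and reaches the peak gain of two midway between nulls; with c1 = −1 the pattern shifts by half a null spacing, putting a peak of two at dc.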

Bandpass Filter at $f_s/4$

This network is a bandpass filter centered at $f_s/4$ having a sin(x)/x-like frequency response and linear phase over the passband. It has poles at z = ±j, so for pole/zero cancellation the comb filter’s delay (N) must be an integer multiple of four. This guaranteed-stable, multiplierless, bandpass filter’s transfer function is $H_{bp}(z) = (1 - z^{-N})/(1 + z^{-2})$.

First-Order IIR Filter

This is the direct form II version of a simple first-order IIR filter having a single pole located at a radius of $R_p$ from the z-plane’s origin at an angle of $\theta_p$ rad and a single zero at a radius of $R_z$ at an angle of $\pi + \theta_z$. For real-valued coefficients ($\theta_p = \theta_z = 0$), the filter can only exhibit either a low-pass or a high-pass frequency response; no bandpass or bandstop filters are possible. The filter’s transfer function is $H_{1,iir}(z) = (1 + R_z e^{j\theta_z} z^{-1})/(1 - R_p e^{j\theta_p} z^{-1})$.

The shapes of the filter’s frequency magnitude responses are nothing to write home about; their transition regions are so wide that they don’t actually have distinct passbands and stopbands. Of course, to ensure stability, $R_p$ must be between zero and one to keep the pole inside the z-plane’s unit circle, and the closer $R_p$ is to unity, the more narrowband is the filter.

First-Order Equalizer

This structure has a frequency magnitude response that is constant across the entire frequency band (an allpass filter). It has a pole at z = R on the z-plane and a zero located at 1/R*, where * means conjugate. The value of R, which can be real or complex but whose magnitude must be less than unity to ensure stability, controls the nonlinear-phase response. The equalizer has a transfer function of $H_{1,eq}(z) = (-R^{*} + z^{-1})/(1 - Rz^{-1})$.

These networks can be used as phase equalizers by cascading them after a filter or network whose nonlinear phase response requires crude linearization. The goal is to make the cascaded filters’ combined phase as linear as possible. Table 3 shows the filter’s behavior for R = 0.7 as solid curves. For comparison, the phase response for R = −0.3 is indicated by the dashed curve. These first-order allpass filters can also be used for interpolation and audio reverberation for low-frequency signals.

Second-Order IIR Filter

This is the direct form II version of a second-order IIR filter, the workhorse of IIR filter implementations. Conjugate pole and zero pairs may be positioned anywhere on the z-plane to control the filter’s frequency response. [There’s a terrific piece of MATLAB code (PEZ, created by the talented Dr. Craig Ulmer) that allows us to see the frequency-domain effect of moving multiple poles and zeros, manually using a mouse, around on the z-plane. The code is available at http://www.cspl.umd.edu/spm/tips-n-tricks/.] Because high-order IIR filters are so susceptible to coefficient quantization and potential data overflow problems, practitioners typically implement their IIR filters by cascading multiple copies of this second-order IIR structure to ensure filter stability and avoid limit cycles. The filters have a transfer function of (1) with $c_1 = 0$. Low-pass, high-pass, bandpass, and bandstop

Table 3. Filter functions. For each function the original table lists the coefficients and plots the impulse response (versus time), the z-plane pole/zero locations, and the magnitude (dB) and phase (rad.) responses versus frequency; the plots are not reproduced here. Plot annotations that survive extraction: $\theta_z = 0.5\pi$, $\theta_p = 0.4\pi$, $R_z = R_p = 0.8$; R = 0.7 and R = −0.3; R = 0.6, θ = π/3 and R = −0.7.

Comb Filter: a0 = 1, a1 = 0, a2 = 0, b0 = 1, b1 = 0, b2 = 0, c1 = 1, N = 8

Bandpass Filter at $f_s/4$: a0 = 1, a1 = 0, a2 = −1, b0 = 1, b1 = 0, b2 = 0, c1 = 1, N = 16

First-Order IIR Filter: a0 = 1, a1 = $R_p e^{j\theta_p}$, a2 = 0, b0 = 1, b1 = $R_z e^{j\theta_z}$, b2 = 0, c1 = 0

First-Order Equalizer: a0 = 1, a1 = R, a2 = 0, b0 = −R*, b1 = 1, b2 = 0, c1 = 0

Second-Order IIR Filter: a0 = 1, a1 = 1.194, a2 = −0.436, b0 = b2 = 0.0605, b1 = 0.121, c1 = 0

Second-Order Equalizer: a0 = 1, a1 = 2Rcos(θ), a2 = $-R^2$, b0 = 1, b1 = −(2/R)cos(θ), b2 = $1/R^2$, c1 = 0


filters are possible. No single example shows all the possibilities of this structure, so Table 3 merely gives a simple low-pass filter example.

If an IIR filter design requires high performance, high Q, it turns out the direct form I version of a second-order IIR filter is less susceptible to coefficient quantization and overflow errors than the direct form II structure given here.

Second-Order Equalizer

This structure has a frequency magnitude response that’s constant across the entire frequency band, making it also an allpass filter. It has two conjugate poles located at a radius of R from the z-plane’s origin at angles of ±θ rad and two conjugate zeros at a reciprocal radius of 1/R at angles of ±θ. The positioning of the poles and zeros, using real-valued R, controls the nonlinear-phase response.

Table 3 shows the equalizer’s behavior for R = 0.6 and θ = π/3 as solid curves. For comparison, the phase response for R = −0.7 and θ = π/3 is indicated by the dashed curve.

These networks are primarily used for phase equalization by cascading them after a filter or network whose nonlinear phase response requires linearization. It may take multiple cascaded biquad

Table 4. Additional filter functions. For each function the original table lists the coefficients and plots the impulse response (versus time), the z-plane pole/zero locations, and the magnitude (dB) and phase (rad.) responses versus frequency; the plots are not reproduced here. Plot annotations that survive extraction: N = 16; N = 22 (twice); α = 0.8.

CIC Interpolation Filter: a0 = 1, a1 = 1, a2 = 0, b0 = 1, b1 = 0, b2 = 0, c1 = 1, N = 8

Complex FSF: a0 = 1, a1 = $e^{j\theta_k}$, a2 = 0, b0 = $(-1)^k$, b1 = 0, b2 = 0, c1 = 1, $\theta_k = 2\pi k/N$

Real FSF, Type I: a0 = 1, a1 = $2\cos(\theta_k)$, a2 = −1, b0 = $|H_k|\cos(\phi_k)$, b1 = $-|H_k|\cos(\phi_k - \theta_k)$, b2 = 0, c1 = 1, $\theta_k = 2\pi k/N$

Real FSF, Type IV: a0 = 1, a1 = $2\cos(\theta_k)$, a2 = −1, b0 = $(-1)^k M_k$, b1 = 0, b2 = $(-1)^k(-M_k)$, c1 = 1, $\theta_k = 2\pi k/N$

dc Bias Removal: a0 = 1, a1 = α, a2 = 0, b0 = 1, b1 = −1, b2 = 0, c1 = 0



networks to achieve acceptable equalization, however.

CIC Interpolation Filter

Referring to the first entry in Table 4, this network is a single-stage cascaded integrator-comb (CIC) interpolation filter used for time-domain interpolation. If a time-domain signal sequence is upsampled by N (by inserting N − 1 zero-valued samples in between each original sample) and applied to this low-pass filter, the filter’s output is an interpolated-by-N version of the original signal. $H_{cic}(z) = (1 - z^{-N})/(1 - z^{-1})$ is this low-pass filter’s transfer function. To improve the attenuation of spectral images, we can cascade M copies of the comb filter followed by M cascaded biquad sections. Such cascaded filters will also have narrower passband widths at 0 Hz.

In practice, the upsampling operation (zero stuffing) is performed after the comb filter and before the biquad network. This has the sweet advantage that the comb filter’s delay length becomes N = 1, reducing the necessary comb delay storage requirement to one. CIC filters are typically used as the first stage of multistage low-pass filtering in hardware sample rate increase (interpolation) by N applications because no multipliers are required [6].

Complex Frequency Sampling Filter

This structure is a single section of a complex frequency sampling filter (FSF) having a sin(x)/x-like frequency magnitude response centered at an angle of $\theta_k = 2\pi k/N$ rad on the unit circle, corresponding to a cyclic frequency of $kf_s/N$ Hz. N and k are integers, where k is 0 ≤ k < N. The larger N, the more narrow the filter’s mainlobe width [6].

If multiple biquads are implemented in parallel (all driven by the single comb filter) with adjacent center frequencies, complex almost-linear phase-bandpass filters can be built. Table 4 shows the behavior of an N = 16, three-biquad, complex bandpass filter, each centered at k = 2, 3, and 4, respectively.

Real Frequency Sampling Filter, Type I

This structure is a single section of a real-coefficient frequency sampling filter having a sin(x)/x-like frequency magnitude response centered at both $\pm\theta_k = \pm 2\pi k/N$ rad, where N is an integer. The larger N, the more narrow the filter’s mainlobe width. Integer k is 0 ≤ k < N.

If multiple biquads are implemented in parallel (all driven by the single comb filter) with adjacent center frequencies, almost-linear phase, low-pass filters can be built. In this case, complex gain factors $H_k$ are the desired peak frequency response of the kth biquad. Parameter $\phi_k$ is the desired relative phase shift, in radians, of $H_k$. Table 4 shows the behavior of an N = 22, three-biquad, low-pass filter, each centered at k = 0, 1, and 2, respectively. In this example, $|H_0| = 1$, $|H_1| = 2$, and $|H_2| = 0.74$. These bandpass filters can have group delay fluctuations as large as $2/f_s$ in the passband. This recursive FIR filter is the most common frequency sampling filter discussed in the DSP textbooks [6], [9], [10].

Real Frequency Sampling Filter, Type IV

This structure is similar in behavior to the type I frequency sampling filter, with important exceptions. First, in multibiquad low-pass filter implementations this filter yields an exactly linear phase response. Also, this filter provides deeper stopband attenuation than the Type I filter.

The real-valued gain factors $M_k$ are the desired peak frequency magnitude response of the kth biquad. Table 4 shows the behavior of an N = 22, three-biquad low-pass filter with the biquads centered at k = 0, 1, and 2, respectively. In this example, $M_0 = 1$, $M_1 = 2$, and $M_2 = 0.74$. Here’s why you need to know about these filters: with judicious choice of $M_k$ gain factors, narrowband low-pass linear-phase FIR filters can be built, in some cases, whose computational workload is less than Parks-McClellan-designed FIR filters [6].

DC Bias Removal

This network, used to remove any dc bias from the x(n) input, has a transfer function with a pole located at z = α and a zero at z = 1. It has a frequency response notch (null) at 0 Hz (dc, hence the name), and the sharpness of the notch is determined by α, where for stable operation α lies in the range 0 < α < 1. The closer α is to unity, the more narrow the notch at dc. This nonlinear-phase filter has a transfer function of $H_{dc}(z) = (1 - z^{-1})/(1 - \alpha z^{-1})$. Table 4 shows the filter’s behavior for α = 0.8.

In those fixed-point implementations where the output y(n) sequence must be truncated to avoid data overflow [that is, y(n) must have fewer bits than input x(n)], feedback noise shaping can be used to reduce the quantization noise induced by truncation [6], [11].

To avoid overflow, an alternative to truncation is to limit the gain of the filter. For example, we could precede the network with a positive gain element whose gain is less than unity. On the other hand, we could use b0 = G and b1 = −G, where G = (1 + α)/2, in our implementation for this purpose, yielding a reduced-gain transfer function of $H_{alt,dc}(z) = (G - Gz^{-1})/(1 - \alpha z^{-1})$.
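Both forms of the dc bias removal network reduce to the difference equation y(n) = G[x(n) − x(n − 1)] + αy(n − 1), with G = 1 for the basic form. A minimal sketch with an illustrative function name and an assumed zero initial state:

```python
def dc_removal(x, alpha, gain=1.0):
    """y(n) = gain*(x(n) - x(n-1)) + alpha*y(n-1); gain=1 is the basic form."""
    y = []
    prev_x = 0.0
    prev_y = 0.0
    for sample in x:
        out = gain * (sample - prev_x) + alpha * prev_y
        y.append(out)
        prev_x, prev_y = sample, out
    return y
```

A constant input decays toward zero at a rate set by α. With G = (1 + α)/2, an input alternating between +1 and −1 (a tone at half the sample rate) settles to unit amplitude, which is the sense in which the reduced-gain form limits the filter’s peak gain to one.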


If the reader has any comments regarding this article, please e-mail one of the authors. Feedback from our readers, either positive or negative, is most welcome.

Richard Lyons is a consulting systems engineer and lecturer with Besser Associates in Mt. View, California, and the author of Understanding Digital Signal Processing, second edition.

Amy Bell is an assistant professor in the department of Electrical and Computer Engineering at Virginia Tech.

References
[1] T. Laakso et al., “Splitting the unit delay,” IEEE Signal Processing Mag., vol. 13, pp. 30–60, Jan. 1996.

[2] E. Jacobsen and R. Lyons, “The sliding DFT,” IEEE Signal Processing Mag., vol. 20, pp. 74–80, Mar. 2003.

[3] E. Jacobsen and R. Lyons, “The sliding DFT, An update,” IEEE Signal Processing Mag., vol. 21, pp. 110–111, Jan. 2004.

[4] D. Grover and J. Deller, Digital Signal Processing and the Microcontroller. Upper Saddle River, NJ: Prentice-Hall, 1999.

[5] C. Turner, “Recursive discrete-time sinusoidal oscillators,” IEEE Signal Processing Mag., vol. 20, pp. 103–111, May 2003.

[6] R. Lyons, Understanding Digital Signal Processing, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, 2004.

[7] Comb Filters. Available: http://ccrma-www.stanford.edu/~jos/waveguide/Comb_Filters.html

[8] Texas Instruments, “How can comb filters be used to synthesize musical instruments on a TMS320 DSP?,” TMS320 DSP Designers Notebook, no. 56, 1995.

[9] V. Ingle and J. Proakis, Digital Signal Processing Using MATLAB. Pacific Grove, CA: Brooks/Cole, 2000, pp. 202–208.

[10] J. Proakis and D. Manolakis, Digital Signal Processing: Principles, Algorithms, and Applications, 3rd ed. Upper Saddle River, NJ: Prentice-Hall, 1996, pp. 630–637.

[11] C. Dick and F. Harris, “FPGA signal processing using sigma-delta modulation,” IEEE Signal Processing Mag., vol. 17, pp. 20–35, Jan. 2000.

lecture notes continued from page 89

Concluding Remarks
In view of (2) and (7), the waterbed effect result, also called an uncertainty conservation result in [7], appears to be a fundamental property of both nonparametric and parametric spectral estimation methods. Consequently, an even “more intuitive” or “higher-level” derivation of this property than the one presented herein might exist, but it remains to be discovered.

Acknowledgments
This work was supported in part by the Swedish Science Council (VR) and the National Science Foundation Grant CCR-0104887.

References
[1] M.B. Priestley, Spectral Analysis and Time Series (Univariate Series, vol. 1). New York, NY: Academic, 1981.

[2] P. Stoica and R.L. Moses, Introduction to Spectral Analysis. Englewood Cliffs, NJ: Prentice-Hall, 1997.

[3] B. Porat, Digital Processing of Random Signals: Theory and Methods. Englewood Cliffs, NJ: Prentice-Hall, 1994.

[4] P. Stoica and T. Sundin, “On nonparametric spectral estimation,” Circ. Syst. Sign. Process., vol. 18, pp. 169–181, 1999.

[5] K. Berk, “Consistent autoregressive spectral estimates,” Ann. Statist., vol. 2, pp. 489–502, 1974.

[6] L. Ljung, “Asymptotic variance expressions for identified black-box transfer function models,” IEEE Trans. Automat. Contr., vol. AC-30, no. 9, pp. 834–844, 1985.

[7] B. Ninness, “The asymptotic CRLB for the spectrum of ARMA processes,” IEEE Trans. Signal Processing, vol. 51, no. 6, pp. 1520–1531, Nov. 2003.

[8] P. Whittle, “The analysis of multiple stationary time series,” J. Royal Statist. Soc., ser. B, vol. 15, pp. 125–139, 1953.


B. Appendix: Test Suite Example

Example of a “progressive functionality” test suite for the 12-bit floating point multiplier.

----------------------------------------------
module tester();
   reg  [11:0] FP_1, FP_2;
   wire [11:0] Product;
   wire        Overflow;

   fp_mult mult1(Product, Overflow, FP_1, FP_2);

   initial begin
      $monitor("%b %b %b * %b %b %b = %b %b %b V?->%b",
               FP_1[11], FP_1[10:5], FP_1[4:0],
               FP_2[11], FP_2[10:5], FP_2[4:0],
               Product[11], Product[10:5], Product[4:0],
               Overflow);
      #1;
      FP_1[11:0] = 12'b000000000000;
      FP_2[11:0] = 12'b000000000000;
      #1;
      FP_1[11:0] = 12'b001111000000; /* 0.5 */
      FP_2[11:0] = 12'b101110111000; /* -0.4375 */
      #1;
      FP_1[11:0] = 12'b010000000000; /* 2 */
      FP_2[11:0] = 12'b000000000000; /* 0 */
      #1;
      FP_1[11:0] = 12'b010000100100; /* 4.5 */
      FP_2[11:0] = 12'b010000001000; /* 2.5 */
      #1;
      FP_1[11:0] = 12'b010000101110; /* 5.75 */
      FP_2[11:0] = 12'b010000000001; /* 2.0625 */
      #1;
      FP_1[11:0] = 12'b010000111100; /* 7.5 */
      FP_2[11:0] = 12'b010000011110; /* 3.875 */
      #1;
      FP_1[11:0] = 12'b010010011110; /* 62 */
      FP_2[11:0] = 12'b010001011000; /* 14 */
      #1;
      FP_1[11:0] = 12'b110010011110; /* -62 */
      FP_2[11:0] = 12'b110001011000; /* -14 */
      #1;
      FP_1[11:0] = 12'b010000100010; /* 4.25 */
      FP_2[11:0] = 12'b110000111110; /* -7.75 */
      #1;
      FP_1[11:0] = 12'b011111100000; /* 4,294,967,296 */
      FP_2[11:0] = 12'b110000111110; /* -7.75 Trying for EXP overflow here */
      #1;
      $finish;
   end
endmodule // tester
----------------------------------------------
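The decimal comments in the test vectors can be checked with a small decoder for the 12-bit word. The sketch below assumes the layout implied by the `$monitor` fields and the Soft-IP `define` values (1 sign bit, 6 exponent bits with a bias of 31, 5 fraction bits with an implied leading one) and assumes that a word whose lower 11 bits are all zero encodes zero. The function name `fp12_to_double` is hypothetical.

```c
#include <assert.h>

/* Decode a 12-bit floating point word into a double.
   Assumed layout: bit 11 = sign, bits 10..5 = exponent (bias 31),
   bits 4..0 = fraction with an implied leading one. */
double fp12_to_double(unsigned w)
{
    unsigned sign, exp, frac;
    double value;
    int e;

    assert(w < 4096u);                 /* only 12 bits are meaningful */
    sign = (w >> 11) & 1u;
    exp  = (w >> 5) & 0x3Fu;
    frac = w & 0x1Fu;

    /* Assumption: all-zero exponent and fraction encode zero. */
    if ((w & 0x7FFu) == 0u)
        return 0.0;

    value = 1.0 + frac / 32.0;         /* 1.fffff in binary */
    e = (int)exp - 31;                 /* remove the bias */
    while (e > 0) { value *= 2.0; e--; }
    while (e < 0) { value *= 0.5; e++; }
    return sign ? -value : value;
}
```

For example, `fp12_to_double(0x424)` decodes 12'b010000100100 and gives 4.5, matching the comment in the test suite.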


C. Appendix: Verilog and C Example

Example of a pair of Verilog and C code files to exhaustively test a 6-bit fixed point multiplier.

`timescale 10 fs / 1 fs
module tester();
   /*--------------FxMult (mult) Exhaustive test--------------*/
   reg  [5:0]  A, B;
   wire [11:0] Product;

   mult FX_multer(Product, A, B);

   integer i, j, file;

   initial begin
      file = $fopen("exhaustive.out");
      for (i = 0; i < 64; i = i+1)
         for (j = 0; j < 64; j = j+1) begin
            A = i;
            B = j;
            #1;
            $fdisplay(file, "%2d*%2d=%3d", A, B, Product);
         end
      $finish;
   end
endmodule

/* The C code that was used to generate another pair to diff against: */
#include <stdio.h>
#include <stdlib.h>

int main(){
   int i, j;
   FILE *outFile;

   outFile = fopen("/tmp/exhaustiveC.out", "w");
   /* mult FX_multer(A,B,Product);
      Functionality: A*B, or OVERFLOW -> NEVER */
   for (i = 0; i < 64; i++)
      for (j = 0; j < 64; j++) {
         if (1)
            fprintf(outFile, "%2d*%2d=%3d\n", i, j, i*j);
         else
            fprintf(outFile, "OVERFLOW isn't possible. So, ha!\n");
      }
   fclose(outFile);   /* a FILE* stream is closed with fclose(), not close() */
   return 0;
}


D. Appendix: Soft-IP Code

Top Level

module top(Yn, Overflow, Xn,
           a1_Real, a1_Imag, a2_Real, a2_Imag,
           b0_Real, b0_Imag, b1_Real, b1_Imag, b2_Real, b2_Imag,
           c1, N, clk, rst);
   output [11:0] Yn;
   output        Overflow;
   input  [11:0] Xn;
   input  [11:0] a1_Real, a2_Real, b0_Real, b1_Real, b2_Real;
   input  [11:0] a1_Imag, a2_Imag, b0_Imag, b1_Imag, b2_Imag;
   input         N, c1, clk, rst;

   wire [11:0] Out_Comb_Real, Out_Comb_Imag, Xn_OutReg, Yn_Temp;
   wire [11:0] a1_OutReg_Real, a2_OutReg_Real, b0_OutReg_Real, b1_OutReg_Real, b2_OutReg_Real;
   wire [11:0] a1_OutReg_Imag, a2_OutReg_Imag, b0_OutReg_Imag, b1_OutReg_Imag, b2_OutReg_Imag;
   wire c1_OutReg, N_OutReg;
   wire Overflow_temp1, Overflow_temp2, Overflow_Temp;

   InOutReg_12bit XnReg(Xn_OutReg, Xn, clk, rst);
   InOutReg_1bit c1Reg(c1_OutReg, c1, clk, rst);
   InOutReg_1bit NReg(N_OutReg, N, clk, rst);
   complex_Comb moustashe(Out_Comb_Real, Out_Comb_Imag, Overflow_temp1,
                          Xn_OutReg, N_OutReg, c1_OutReg, clk, rst);
   InOutReg_12bit a1RegR(a1_OutReg_Real, a1_Real, clk, rst);
   InOutReg_12bit a2RegR(a2_OutReg_Real, a2_Real, clk, rst);
   InOutReg_12bit b0RegR(b0_OutReg_Real, b0_Real, clk, rst);
   InOutReg_12bit b1RegR(b1_OutReg_Real, b1_Real, clk, rst);
   InOutReg_12bit b2RegR(b2_OutReg_Real, b2_Real, clk, rst);
   InOutReg_12bit a1RegI(a1_OutReg_Imag, a1_Imag, clk, rst);
   InOutReg_12bit a2RegI(a2_OutReg_Imag, a2_Imag, clk, rst);
   InOutReg_12bit b0RegI(b0_OutReg_Imag, b0_Imag, clk, rst);
   InOutReg_12bit b1RegI(b1_OutReg_Imag, b1_Imag, clk, rst);
   InOutReg_12bit b2RegI(b2_OutReg_Imag, b2_Imag, clk, rst);
   complex_biquad onefourthanoctopus(Yn_Temp, Overflow_temp2,
                                     Out_Comb_Real, Out_Comb_Imag,
                                     a1_OutReg_Real, a1_OutReg_Imag,
                                     a2_OutReg_Real, a2_OutReg_Imag,
                                     b0_OutReg_Real, b0_OutReg_Imag,
                                     b1_OutReg_Real, b1_OutReg_Imag,
                                     b2_OutReg_Real, b2_OutReg_Imag,
                                     clk, rst);
   InOutReg_12bit YnReg(Yn, Yn_Temp, clk, rst);
   or Vflower(Overflow_Temp, Overflow_temp1, Overflow_temp2);
   InOutReg_1bit OverflowReg(Overflow, Overflow_Temp, clk, rst);
endmodule // top

module complex_biquad(Yn, Overflow, Out_Comb_Real, Out_Comb_Imag,
                      a1_Real, a1_Imag, a2_Real, a2_Imag,
                      b0_Real, b0_Imag, b1_Real, b1_Imag, b2_Real, b2_Imag,
                      clk, rst);
   output [11:0] Yn;
   output        Overflow;
   input  [11:0] Out_Comb_Real, a1_Real, a2_Real, b0_Real, b1_Real, b2_Real;
   input  [11:0] Out_Comb_Imag, a1_Imag, a2_Imag, b0_Imag, b1_Imag, b2_Imag;
   input         clk, rst;

   wire [11:0] Out_Add_11_Real, Out_Add_12_Real, Out_Add_21_Real,
               Out_B0_Real, Out_B1_Real, Out_B2_Real, Out_D1_Real,
               Out_A1_Real, Out_A2_Real, Out_D2_Real;
   wire [11:0] Out_Add_11_Imag, Out_Add_12_Imag, Out_Add_21_Imag,
               Out_B0_Imag, Out_B1_Imag, Out_B2_Imag, Out_D1_Imag,
               Out_A1_Imag, Out_A2_Imag, Out_D2_Imag;
   wire [8:0] Vflow_In;
   supply0 [11:0] gnd;

   complex_add inst_add12 (Out_Add_12_Real, Out_Add_12_Imag, Vflow_In[0],
                           Out_Comb_Real, Out_Comb_Imag, Out_Add_11_Real, Out_Add_11_Imag);
   complex_mult inst_b0 (Out_B0_Real, Out_B0_Imag, Vflow_In[1],
                         b0_Real, b0_Imag, Out_Add_12_Real, Out_Add_12_Imag);
   DELAY inst_d1R (Out_D1_Real, Out_Add_12_Real, clk, rst);
   DELAY inst_d1I (Out_D1_Imag, Out_Add_12_Imag, clk, rst);
   complex_mult inst_a1 (Out_A1_Real, Out_A1_Imag, Vflow_In[2],
                         Out_D1_Real, Out_D1_Imag, a1_Real, a1_Imag);
   complex_mult inst_b1 (Out_B1_Real, Out_B1_Imag, Vflow_In[3],
                         Out_D1_Real, Out_D1_Imag, b1_Real, b1_Imag);
   complex_add inst_add22 (Yn, gnd, Vflow_In[4],
                           Out_Add_21_Real, Out_Add_21_Imag, Out_B0_Real, Out_B0_Imag);
   DELAY inst_d2R (Out_D2_Real, Out_D1_Real, clk, rst);
   DELAY inst_d2I (Out_D2_Imag, Out_D1_Imag, clk, rst);
   complex_mult inst_a2 (Out_A2_Real, Out_A2_Imag, Vflow_In[5],
                         Out_D2_Real, Out_D2_Imag, a2_Real, a2_Imag);
   complex_mult inst_b2 (Out_B2_Real, Out_B2_Imag, Vflow_In[6],
                         Out_D2_Real, Out_D2_Imag, b2_Real, b2_Imag);
   complex_add inst_add11 (Out_Add_11_Real, Out_Add_11_Imag, Vflow_In[7],
                           Out_A1_Real, Out_A1_Imag, Out_A2_Real, Out_A2_Imag);
   complex_add inst_add21 (Out_Add_21_Real, Out_Add_21_Imag, Vflow_In[8],
                           Out_B1_Real, Out_B1_Imag, Out_B2_Real, Out_B2_Imag);
   Or_9_Bit Vflow(Overflow, Vflow_In[8:0]);
endmodule // biquad

module complex_Comb(Out_Comb_Real, Out_Comb_Imag, Overflow, Xn, N, c1, clk, rst);
   /* Right now the comb's adder doesn't really need to be complex.
      Make the decision later */
   output [11:0] Out_Comb_Real, Out_Comb_Imag;
   output        Overflow;
   input  [11:0] Xn;
   input         N, c1, clk, rst;

   wire [11:0] Delay, Out_D8, Out_D16, Out_Mux_1;
   supply0 [11:0] gnd;

   Delay_8_16 ztothe(Out_D8[11:0], Out_D16[11:0], Xn[11:0], clk, rst);
   complex_add a1(Out_Comb_Real, Out_Comb_Imag, Overflow,
                  Xn[11:0], gnd, {~Delay[11], Delay[10:0]}, gnd );
   MUX_Nx1xN #(12) m12er( Out_Mux_1[11:0], Out_D8[11:0], Out_D16[11:0], N);
   MUX_Nx1xN #(12) m12est( Delay[11:0], gnd, Out_Mux_1, c1);
endmodule // complex_Comb

module complex_add(Out_Real, Out_Imag, Overflow, A_Real, A_Imag, B_Real, B_Imag);
   output [11:0] Out_Real, Out_Imag;
   output        Overflow;
   input  [11:0] A_Real, A_Imag, B_Real, B_Imag;
   wire Overflow_Temp1, Overflow_Temp2;

   fp_add add1(Out_Real[11:0], Overflow_Temp1, A_Real[11:0], B_Real[11:0] );
   fp_add add2(Out_Imag[11:0], Overflow_Temp2, A_Imag[11:0], B_Imag[11:0] );
   or oring(Overflow, Overflow_Temp1, Overflow_Temp2);
endmodule

module complex_mult(Out_Real, Out_Imag, Overflow, A, B, C, D);
   /* x = a+ib, y = c + id, x*y = (ac - bd) + i(ad + bc) */
   output [11:0] Out_Real, Out_Imag;
   output        Overflow;
   input  [11:0] A, B, C, D;
   wire [11:0] Out_AC, Out_BD, Out_AD, Out_BC;
   wire [5:0]  Overflow_Temp;

   fp_mult mult1(Out_AC, Overflow_Temp[0], A, C);
   fp_mult mult2(Out_BD, Overflow_Temp[1], B, D);
   fp_add add1(Out_Real, Overflow_Temp[2], Out_AC, { ~Out_BD[11], Out_BD[10:0] } );
   fp_mult mult3(Out_AD, Overflow_Temp[3], A, D);
   fp_mult mult4(Out_BC, Overflow_Temp[4], B, C);
   fp_add add2(Out_Imag, Overflow_Temp[5], Out_AD, Out_BC);
   Or_6_Bit or6(Overflow, Overflow_Temp[5:0]);
endmodule

module Delay_8(Out, In, clk, rst);
   output [11:0] Out;
   input  [11:0] In;
   input         clk, rst;
   wire [11:0] rOut_0, rOut_1, rOut_2, rOut_3, rOut_4, rOut_5, rOut_6;

   DELAY r0(rOut_0, In, clk, rst);
   DELAY r1(rOut_1, rOut_0, clk, rst);
   DELAY r2(rOut_2, rOut_1, clk, rst);
   DELAY r3(rOut_3, rOut_2, clk, rst);
   DELAY r4(rOut_4, rOut_3, clk, rst);
   DELAY r5(rOut_5, rOut_4, clk, rst);
   DELAY r6(rOut_6, rOut_5, clk, rst);
   DELAY r7(Out, rOut_6, clk, rst);
endmodule // Delay_8

module Delay_8_16(Out_8, Out_16, In, clk, rst);
   output [11:0] Out_8, Out_16;
   input  [11:0] In;
   input         clk, rst;

   Delay_8 first(Out_8, In, clk, rst);
   Delay_8 second(Out_16, Out_8, clk, rst);
endmodule

module DELAY(Out, In, clk, rst);
   output [11:0] Out;
   input  [11:0] In;
   input         clk, rst;
   reg    [11:0] Out;

   always @( posedge clk )
      if (rst == 1'b1) begin
         Out <= 0;
      end else begin
         Out <= In;
      end
endmodule // DELAY

module InOutReg_12bit(Out, In, clk, rst);
   output [11:0] Out;
   input  [11:0] In;
   input         clk, rst;
   reg    [11:0] Out;

   always @( posedge clk )
      if (rst == 1'b1) begin
         Out <= 0;
      end else begin
         Out <= In;
      end
endmodule // InOutReg_12bit

module InOutReg_1bit(Out, In, clk, rst);
   output Out;
   input  In;
   input  clk, rst;
   reg    Out;

   always @( posedge clk )
      if (rst == 1'b1) begin
         Out <= 0;
      end else begin
         Out <= In;
      end
endmodule // InOutReg_1bit

module Or_9_Bit(Out, In);
   output Out;
   input [8:0] In;

   nor nor11(Out_Nor_11, In[8], In[7]);
   nor nor12(Out_Nor_12, In[6], In[5]);
   nor nor13(Out_Nor_13, In[4], In[3]);
   nor nor14(Out_Nor_14, In[2], In[1]);
   not inv15(Out_Inv_15, In[0]);
   nand nand21(Out_Nand_21, Out_Nor_11, Out_Nor_12);
   nand nand22(Out_Nand_22, Out_Nor_13, Out_Nor_14);
   nor nor31(Out_Nor_31, Out_Nand_21, Out_Nand_22);
   nand nand41(Out, Out_Nor_31, Out_Inv_15);
endmodule // Or_9_Bit (Exhaustively tested)

module Or_6_Bit(Out, In);
   output Out;
   input [5:0] In;

   nor nor11(Out_Nor_11, In[5], In[4]);
   nor nor12(Out_Nor_12, In[3], In[2]);
   nor nor13(Out_Nor_13, In[1], In[0]);
   and and21(Out_And_21, Out_Nor_11, Out_Nor_12);
   nand nand31(Out, Out_And_21, Out_Nor_13);
endmodule // Or_6_Bit (Exhaustively tested)

module Comb(Out_Comb, Overflow, Xn, N, c1, clk, rst);
   output [11:0] Out_Comb;
   output        Overflow;
   input  [11:0] Xn;
   input         N, c1, clk, rst;

   wire [11:0] Delay, Out_D8, Out_D16, Out_Mux_1;
   supply0 [11:0] gnd;

   Delay_8_16 ztothe(Out_D8[11:0], Out_D16[11:0], Xn[11:0], clk, rst);
   fp_add a1(Out_Comb, Overflow, Xn[11:0], {~Delay[11], Delay[10:0]} );
   MUX_Nx1xN #(12) m12er( Out_Mux_1[11:0], Out_D8[11:0], Out_D16[11:0], N);
   MUX_Nx1xN #(12) m12est( Delay[11:0], gnd, Out_Mux_1, c1);
endmodule // Comb

module biquad(Yn, Overflow, Out_Comb, a1, a2, b0, b1, b2, clk, rst);
   output [11:0] Yn;
   output        Overflow;
   input  [11:0] Out_Comb, a1, a2, b0, b1, b2;
   input         clk, rst;

   wire [11:0] Out_Add_11, Out_Add_12, Out_Add_21, Out_B0, Out_B1, Out_B2,
               Out_D1, Out_A1, Out_A2, Out_D2;
   wire [8:0] Vflow_In;

   fp_add inst_add12 (Out_Add_12, Vflow_In[0], Out_Comb, Out_Add_11);
   fp_mult inst_b0 (Out_B0, Vflow_In[1], b0, Out_Add_12);
   DELAY inst_d1 (Out_D1, Out_Add_12, clk, rst);
   fp_mult inst_a1 (Out_A1, Vflow_In[2], Out_D1, a1);
   fp_mult inst_b1 (Out_B1, Vflow_In[3], Out_D1, b1);
   fp_add inst_add22 (Yn, Vflow_In[4], Out_Add_21, Out_B0);
   DELAY inst_d2 (Out_D2, Out_D1, clk, rst);
   fp_mult inst_a2 (Out_A2, Vflow_In[5], Out_D2, a2);
   fp_mult inst_b2 (Out_B2, Vflow_In[6], Out_D2, b2);
   fp_add inst_add11 (Out_Add_11, Vflow_In[7], Out_A1, Out_A2);
   fp_add inst_add21 (Out_Add_21, Vflow_In[8], Out_B1, Out_B2);
   Or_9_Bit Vflow(Overflow, Vflow_In[8:0]);
endmodule // biquad
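The real-valued `Comb` and `biquad` modules can be read as the following sample-by-sample recurrences: the comb subtracts the input delayed by 8 or 16 samples (selected by `N`, enabled by `c1`), and the biquad forms w(n) = x(n) + a1·w(n-1) + a2·w(n-2), then y(n) = b0·w(n) + b1·w(n-1) + b2·w(n-2), with the feedback terms added as wired in the netlist. The behavioral sketch below uses double precision instead of the 12-bit format, so it ignores quantization and overflow; the type and function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define COMB_MAX_DELAY 16

/* Behavioral model of the comb: y(n) = x(n) - c1 * x(n - D), D = 8 or 16. */
typedef struct {
    double line[COMB_MAX_DELAY]; /* circular buffer of past inputs */
    int pos;                     /* slot holding x(n - D) */
    int delay;                   /* D: 8 when N = 0, 16 when N = 1 */
    int enabled;                 /* c1: 0 passes x(n) through unchanged */
} comb_model;

void comb_init(comb_model *c, int n_sel, int c1)
{
    memset(c->line, 0, sizeof c->line);
    c->pos = 0;
    c->delay = n_sel ? 16 : 8;
    c->enabled = c1;
}

double comb_step(comb_model *c, double x)
{
    double delayed = c->line[c->pos];   /* x(n - D) */
    c->line[c->pos] = x;                /* overwrite with the newest sample */
    c->pos = (c->pos + 1) % c->delay;
    return c->enabled ? x - delayed : x;
}

/* Behavioral model of the biquad with two shared delay elements. */
typedef struct {
    double a1, a2, b0, b1, b2;   /* coefficients, feedback terms are added */
    double w1, w2;               /* w(n-1) and w(n-2) */
} biquad_model;

double biquad_step(biquad_model *q, double x)
{
    double w = x + q->a1 * q->w1 + q->a2 * q->w2;
    double y = q->b0 * w + q->b1 * q->w1 + q->b2 * q->w2;
    q->w2 = q->w1;
    q->w1 = w;
    return y;
}
```

Cascading `comb_step` into `biquad_step` mirrors the order in which `top` connects the comb output to the biquad input.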


Floating Point Adder

`define exp_bits 6
`define frac_bits 5
`define bias 31
`define bit_width 12

module fp_add(Sum, Overflow, A, B);
   input  [`bit_width-1:0] A, B;
   output [`bit_width-1:0] Sum;
   output Overflow;

   wire [`bit_width-1:0] Larger, SmallerMan, Diff;
   wire [`bit_width-2:0] Temp_Sum, Smaller;
   wire [`frac_bits+1:0] Out_Add_Sub;
   wire [4:0] OutSub;
   wire Implied, Implied2, Select, CoutAdd_Sub, Clear;
   supply0 gnd;
   supply0 [`bit_width-1:0] gnd_bit_width;

   equiv_test comp(Clear, A, B, Diff);
   swap swaper(Diff, Larger[`bit_width-1:0], Smaller[`bit_width-2:0], A, B);
   implied_Bit #(`bit_width) connotation(Implied, Smaller[`bit_width-2:0]);
   shift_adjust userAdjustmentDevice(SmallerMan[`frac_bits:0],
                                     Larger[`bit_width-2:`frac_bits],
                                     Smaller[`bit_width-2:0], Implied);
   xor x_1(Select, A[`bit_width-1], B[`bit_width-1]);
   implied_Bit #(`bit_width) allusion(Implied2, Larger[`bit_width-2:0]);
   add_sub add_suber(Out_Add_Sub[`exp_bits:0], Larger[`frac_bits-1:0],
                     SmallerMan[`frac_bits:0], Select, Implied2);
   correction corrector(Temp_Sum[`bit_width-2:0], Overflow,
                        Larger[`bit_width-2:`frac_bits], Out_Add_Sub[`frac_bits+1:0]);
   MUX_Nx1xN #(`bit_width) m12er( Sum[`bit_width-1:0],
                                  {Larger[`bit_width-1], Temp_Sum[`bit_width-2:0]},
                                  gnd_bit_width, Clear);
endmodule // fp_add

module equiv_test(clear, A, B, Diff);
   output clear;
   input [`bit_width-1:0] A, B, Diff;

   Nor_N_bit #(`bit_width-1) isAbitwiseZero(Azero, A[`bit_width-2:0]);
   Nor_N_bit #(`bit_width-1) isBbitwiseZero(Bzero, B[`bit_width-2:0]);
   assign Out_And1 = Azero && Bzero;   /* are both A and B Zero? */
   Nor_N_bit #(`bit_width) areAandBsameMag(Out_Diff, Diff[`bit_width-1:0]);
   assign Out_XOR = A[`bit_width-1] ^ B[`bit_width-1];   /* are A and B different signs? */
   assign Out_And2 = Out_Diff && Out_XOR;   /* Are Both Magnitudes the same and Sign Different? */
   assign clear = Out_And1 || Out_And2;
endmodule // equiv_test

module swap(Diff, Larger, Smaller_No_Sign, A, B);
   /* ensure that the operand with larger magnitude is on input 1 */
   input  [`bit_width-1:0] A, B;
   output [`bit_width-1:0] Larger;
   output [`bit_width-2:0] Smaller_No_Sign;
   output [`bit_width-1:0] Diff;
   supply0 gnd;

   comparator comparer(Diff[`bit_width-1], Diff[`bit_width-2:0],
                       A[`bit_width-2:0], B[`bit_width-2:0]);
   MUX_Nx1xN #(`bit_width) muxing(Larger, A, B, Diff[`bit_width-1]);
   MUX_Nx1xN #(`bit_width) muxed({gnd, Smaller_No_Sign[`bit_width-2:0]}, B, A, Diff[`bit_width-1]);
endmodule // swap

module shift_adjust(Out, LargerExp, Smaller_no_sign, Implied);
   /* align the mantissas */
   output [`frac_bits:0] Out;
   input  [`bit_width-2:0] Smaller_no_sign;
   input  [`exp_bits-1:0] LargerExp;
   input  Implied;
   wire   [`exp_bits:0] numshift;

   assign {numshift[`exp_bits], numshift[`exp_bits-1:0]} =
          LargerExp - Smaller_no_sign[`bit_width-2:`bit_width-(`exp_bits+1)];
   assign Out[`frac_bits:0] =
          {Implied, Smaller_no_sign[`frac_bits-1:0]} >> numshift[`exp_bits:0];
endmodule // shift_adjust

module add_sub(Out_Mantissa, LargerFrac, SmallerMan, Select, Implied2);
   /* add or subtract the mantissas */
   output [`frac_bits+1:0] Out_Mantissa;
   input  [`frac_bits:0] SmallerMan;
   input  [`frac_bits-1:0] LargerFrac;
   input  Select, Implied2;
   wire   [`frac_bits+1:0] Sum, Diff;
   wire   Bout;

   assign Sum[`frac_bits+1:0] = {Implied2, LargerFrac} + SmallerMan;
   assign {Bout, Diff[`frac_bits:0]} = {Implied2, LargerFrac} - SmallerMan;
   MUX_Nx1xN #(7) outMux(Out_Mantissa[`frac_bits+1:0], Sum[`frac_bits+1:0],
                         {Bout, Diff[`frac_bits:0]}, Select);
endmodule // add_sub

module correction(Out, Overflow, Exp, Man);
   /* use for statement to look for "first "1" in the number and then
      shift by that amount */
   output [`bit_width-2:0] Out;
   output Overflow;
   input  [`frac_bits+1:0] Man;
   input  [`exp_bits-1:0] Exp;
   wire   [`exp_bits:0] Out_Temp;
   wire   [`exp_bits-1:0] Out_Temp2;
   wire   [`exp_bits-1:0] Out_Mux_2;
   wire   [`frac_bits-1:0] Out_Mux_1, Out_ShifterL;
   wire   [2:0] Select;
   supply1 vdd;

   /* The overflow of the implied bit and the "correct" implied bit */
   assign Out_And = ~Man[`frac_bits+1] && Man[`frac_bits];
   assign Out_Temp[`exp_bits:0] = Exp[`exp_bits-1:0] - Select[2:0];
   assign {Overflow, Out_Temp2[`exp_bits-1:0]} = Exp[`exp_bits-1:0] + 1;
   shifterL goldfish(Out_ShifterL[`frac_bits-1:0], Select[2:0], Man[`frac_bits:0]);
   MUX_Nx1xN #(5) m5er( Out_Mux_1[`frac_bits-1:0], Out_ShifterL[`frac_bits-1:0],
                        Man[`frac_bits-1:0], Out_And );
   MUX_Nx1xN #(5) m5est( Out[`frac_bits-1:0], Out_Mux_1[`frac_bits-1:0],
                         Man[`frac_bits:1], Man[`frac_bits+1] );
   MUX_Nx1xN #(6) m6er( Out_Mux_2[`frac_bits:0], Out_Temp[`frac_bits:0],
                        Exp[`exp_bits-1:0], Out_And );
   MUX_Nx1xN #(6) m6est( Out[`bit_width-2:`frac_bits], Out_Mux_2[`exp_bits-1:0],
                         Out_Temp2[`exp_bits-1:0], Man[`frac_bits+1] );
endmodule // correction

module comparator(Bout, Diff, A, B);
   output [`bit_width-2:0] Diff;
   output Bout;
   input  [`bit_width-2:0] A, B;

   assign {Bout, Diff} = A[`bit_width-2:0] - B[`bit_width-2:0];
endmodule // comparator

module shifterL(Out, Select, In);
   input  [`frac_bits-1:0] In;
   output [`frac_bits-1:0] Out;
   output [2:0] Select;
   reg    [`frac_bits-1:0] In_Temp;
   reg    [2:0] Select;
   reg    [2:0] count;

   always@(In)
      begin
         Select = 0;
         count = 0;
         In_Temp = In;
         while (In_Temp[`frac_bits-1] != 1'b1 && count < 5)
            begin
               In_Temp = In << count;
               Select = Select + 1;
               count = count + 1;
            end
      end
   assign Out[`frac_bits-1:0] = In[`frac_bits-1:0] << Select[2:0];
endmodule // shifterL

module MUX_Nx1xN(Out, A, B, S);
   /* N-bit Mux, 1-bit Sel line, easy to drop in */
   parameter N = 1;
   output [N-1:0] Out;
   input  [N-1:0] A, B;
   input  S;
   reg    [N-1:0] Out;

   always@(A or B or S)
      begin
         if(~S) begin
            Out[N-1:0] = A[N-1:0];
         end else begin
            Out[N-1:0] = B[N-1:0];
         end
      end
endmodule // MUX_Nx1xN

module Nor_N_bit(Out, In);
   /* N-bit to 1 Nor gate */
   parameter N = 1;
   output Out;
   input [N-1:0] In;
   reg Out;
   supply0 gnd;

   always@(In)
      begin
         if(N % 2 == 0) begin
            Out = ~(In[N/2-1:0] || In[N-1:N/2]);
         end else begin
            Out = ~(In[(N+1)/2-1:0] || {In[N-1:(N+1)/2], gnd});
         end
      end
endmodule // Nor_N_bit

module implied_Bit(Out, In);
   /* Also known as OR_11_BIT */
   /* aka OR_bit_width_minus_1_bit */
   parameter N = 1;
   parameter M = N-1;
   output Out;
   input [M-1:0] In;
   reg Out;
   supply0 gnd;

   always@(In)
      begin
         if(M % 2 == 0) begin
            Out = (In[M/2-1:0] || In[M-1:M/2]);
         end else begin
            Out = (In[(M+1)/2-1:0] || {In[M-1:(M+1)/2], gnd});
         end
      end
endmodule // implied_Bit
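The first stages of the adder (`swap`, `implied_Bit`, `shift_adjust`) pick the operand with the larger magnitude and shift the other operand's 6-bit mantissa right by the exponent difference. The sketch below restates just that alignment step in C as a reading aid. The function name `fp12_align` is hypothetical, the word layout (1 sign, 6 exponent, 5 fraction bits) is taken from the `define` values, and the sketch does not model the later add/subtract and correction stages.

```c
#include <assert.h>

/* Return the aligned 6-bit mantissa (implied bit plus 5 fraction bits) of the
   smaller-magnitude operand, and report the larger operand through *larger. */
unsigned fp12_align(unsigned a, unsigned b, unsigned *larger)
{
    unsigned mag_a, mag_b, big, small, shift, implied, man;

    assert(a < 4096u && b < 4096u);
    mag_a = a & 0x7FFu;                 /* exponent and fraction, no sign */
    mag_b = b & 0x7FFu;

    /* swap: operand A is kept as "larger" when the magnitudes are equal */
    big   = (mag_a >= mag_b) ? a : b;
    small = (mag_a >= mag_b) ? b : a;

    /* implied_Bit: the leading one exists unless the magnitude is all zero */
    implied = ((small & 0x7FFu) != 0u) ? 1u : 0u;
    man = (implied << 5) | (small & 0x1Fu);

    /* shift_adjust: shift right by the (non-negative) exponent difference */
    shift = ((big >> 5) & 0x3Fu) - ((small >> 5) & 0x3Fu);

    *larger = big;
    return (shift < 6u) ? (man >> shift) : 0u;  /* 6 or more shifts clear it */
}
```

As a worked case, adding 4.5 (0x424) and 2.5 (0x408) aligns the smaller mantissa 101000 (40) to 010100 (20); adding the larger mantissa 100100 (36) gives 56, and 56/32 × 2^2 = 7.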


Floating Point Multiplier

`define exp_bits 6
`define frac_bits 5
`define bias 31
`define bit_width 12

module signBlock(Out, FP_1_sign, FP_2_sign);
   /* 1-bit XOR */
   output Out;
   input  FP_1_sign, FP_2_sign;

   assign Out = (FP_1_sign ^ FP_2_sign);
endmodule // sign_block

module addExp(Out, FP_1_exp, FP_2_exp);
   /* 6-bit FX_Adder */
   output [`exp_bits:0] Out;
   input  [`exp_bits-1:0] FP_1_exp, FP_2_exp;

   assign Out = FP_1_exp + FP_2_exp;
endmodule // add_exp

module subBias(Out, MSB, In);
   /* 6-bit FX_Subtractor */
   output [`exp_bits-1:0] Out;
   output MSB;
   input  [`exp_bits:0] In;
   wire   [`exp_bits:0] tempOut;

   assign tempOut = In - `bias;
   assign MSB = tempOut[`exp_bits];
   assign Out[`exp_bits-1:0] = tempOut[`exp_bits-1:0];
endmodule // subBias

module bigMan(Out, FP_1_man, FP_2_man);
   /* 6-bit FX_Mult */
   output [`bit_width-1:0] Out;
   input  [`frac_bits:0] FP_1_man, FP_2_man;

   assign Out[`bit_width-1:0] = ( FP_1_man * FP_2_man );
endmodule // bigMan

module shifter(Out, Man, tempProduct_1);
   output [`bit_width-1:0] Out;
   input  [`bit_width-1:0] Man, tempProduct_1;
   wire   [5:0] inputB;

   MUX_Nx1xN #(`frac_bits) m1( Out[`frac_bits-1:0],
                               Man[`bit_width-3:`frac_bits],
                               Man[`bit_width-2:`frac_bits+1],
                               Man[`bit_width-1] );
   assign inputB[`exp_bits-1:0] = tempProduct_1[`bit_width-2:`frac_bits] + 1;
   MUX_Nx1xN #(`exp_bits) m2( Out[`bit_width-2:`frac_bits],
                              tempProduct_1[`bit_width-2:`frac_bits],
                              inputB, Man[`bit_width-1] );
   assign Out[`bit_width-1] = tempProduct_1[`bit_width-1];
endmodule // shifter

module MUX_Nx1xN(Out, A, B, S);
   /* N-bit Mux, 1-bit Sel line, easy to drop in */
   parameter N = 1;
   output [N-1:0] Out;
   input  [N-1:0] A, B;
   input  S;
   reg    [N-1:0] Out;

   always@(A or B or S)
      begin
         if(~S) begin
            Out[N-1:0] = A[N-1:0];
         end else begin
            Out[N-1:0] = B[N-1:0];
         end
      end
endmodule // MUX_Nx1xN

module isInputZero(Out, In1, In2);
   output Out;
   input [`bit_width-2:0] In1, In2;

   /* || is a logical OR of the two vectors, so Out is 1 only when
      both operands are all-zero */
   assign Out = ~(In1[`bit_width-2:0] || In2[`bit_width-2:0]);
endmodule // isInputZero

module overflowBlock(Out, In1, MSBSub);
   output Out;
   input  In1;
   input  MSBSub;

   assign Out = (~In1 && MSBSub);
endmodule // overflowBlock

module fp_mult(Product, Overflow, FP_1, FP_2);
   output [`bit_width-1:0] Product;
   output Overflow;
   input  [`bit_width-1:0] FP_1, FP_2;

   wire [`bit_width-1:0] Man, tempProduct_1, tempProduct_2;
   wire [`exp_bits:0] OutAddExp;
   wire MSBSub, OutOr, OutisInputZero;
   supply1 vdd;
   supply0 [`bit_width-1:0] gnd_bitWidth;

   signBlock s1(tempProduct_1[`bit_width-1], FP_1[`bit_width-1], FP_2[`bit_width-1]);
   addExp a1(OutAddExp[`exp_bits:0], FP_1[`bit_width-2:`frac_bits],
             FP_2[`bit_width-2:`frac_bits]);
   subBias sb1(tempProduct_1[`bit_width-2:`frac_bits], MSBSub, OutAddExp[`exp_bits:0]);
   isInputZero i0(OutisInputZero, FP_1[`bit_width-2:0], FP_2[`bit_width-2:0]);
   overflowBlock v1(Overflow, OutisInputZero, MSBSub);
   bigMan bm(Man[`bit_width-1:0], {vdd, FP_1[`frac_bits-1:0]}, {vdd, FP_2[`frac_bits-1:0]});
   shifter sh1(tempProduct_2[`bit_width-1:0], Man[`bit_width-1:0],
               tempProduct_1[`bit_width-1:0]);
   or o1(OutOr, Overflow, OutisInputZero);
   MUX_Nx1xN #(`bit_width) om1(Product[`bit_width-1:0], tempProduct_2, gnd_bitWidth, OutOr);
endmodule // fp_mult
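A bit-level C model is a convenient way to generate expected values to diff against the Verilog, in the same spirit as the exhaustive fixed-point test in Appendix C. The sketch below follows the `fp_mult` datapath block by block as listed above (sign XOR, exponent add and bias subtract, 6x6 mantissa multiply, one-position normalization, zero/overflow masking). The function name `fp12_mult` is hypothetical, and the model has not been compared against the project's simulation output.

```c
#include <assert.h>

/* Bit-level sketch of fp_mult for 12-bit words (1 sign, 6 exponent, 5 fraction). */
unsigned fp12_mult(unsigned a, unsigned b, int *overflow)
{
    unsigned sign, exp_sum, exp7, msb, both_zero, man, norm, frac, exp;

    assert(a < 4096u && b < 4096u);

    sign = ((a >> 11) ^ (b >> 11)) & 1u;                      /* signBlock */
    exp_sum = ((a >> 5) & 0x3Fu) + ((b >> 5) & 0x3Fu);        /* addExp, 7 bits */
    exp7 = (exp_sum - 31u) & 0x7Fu;                           /* subBias, 7-bit wrap */
    msb = (exp7 >> 6) & 1u;                                   /* MSBSub */

    /* isInputZero as written: asserted only when both magnitudes are zero */
    both_zero = ((a & 0x7FFu) == 0u && (b & 0x7FFu) == 0u) ? 1u : 0u;

    /* bigMan: the implied one is always prepended to each fraction */
    man = (32u | (a & 0x1Fu)) * (32u | (b & 0x1Fu));          /* 12-bit product */

    /* shifter: if the product's top bit is set, take the upper window
       and increment the exponent (truncated to 6 bits) */
    norm = (man >> 11) & 1u;
    frac = norm ? ((man >> 6) & 0x1Fu) : ((man >> 5) & 0x1Fu);
    exp = ((exp7 & 0x3Fu) + norm) & 0x3Fu;

    *overflow = (!both_zero && msb) ? 1 : 0;                  /* overflowBlock */
    if (*overflow || both_zero)
        return 0u;                                            /* output mux to gnd */
    return (sign << 11) | (exp << 5) | frac;
}
```

With the Appendix B vectors, 0.5 × -0.4375 (0x3C0, 0xBB8) yields 0xB98, which is -0.21875, and 7.5 × 3.875 yields 0x47A, which is 29 (the exact product 29.0625 truncated to five fraction bits).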
