mac repairs in London | Mac book pro repair | Mac book air repair
Mac
-
Upload
nishanthm48 -
Category
Documents
-
view
407 -
download
1
Transcript of Mac
A New VLSI Architecture of ParallelMultiplier–Accumulator Based on Radix-2
Modified Booth Algorithm
Abstract
With the recent rapid advances in multimedia and communication systems,
real-time signal processing like audio signal processing, video/image processing, or
large-capacity data processing are increasingly being demanded. The multiplier and
multiplier-and-accumulator (MAC) are the essential elements of the digital signal
processing such as filtering, convolution, transformations and Inner products. There
are different entities that one would like to optimize when designing a VLSI circuit.
These entities can often not be optimized simultaneously, only improve one entity at
the expense of one or more others The design of an efficient integrated circuit in
terms of power, area, and speed simultaneously, has become a very challenging
problem. Power dissipation is recognized as a critical parameter in modern the
objective of a good multiplier is to provide a physically compact, good speed and low
power consuming chip.
This paper proposes a new architecture of multiplier-and-accumulator (MAC)
for high speed and low-power by adopting the new SPST implementing approach.
This multiplier is designed by equipping the Spurious Power Suppression Technique
(SPST) on a modified Booth encoder which is controlled by a detection unit using an
AND gate. The modified booth encoder will reduce the number of partial products
generated by a factor of 2. The SPST adder will avoid the unwanted addition and thus
minimize the switching power dissipation. By combining multiplication with
accumulation and devising a low power equipped carry save adder (CSA), the
performance was improved.
In this project we used Modelsim for logical verification, and further
synthesizing it on Xilinx-ISE tool using target technology and performing placing &
routing operation for system verification on targeted FPGA.
CHAPTER 1 CHAPTER 1
INTRODUCTION
1.1 Introduction:
Power dissipation is recognized as a critical parameter in modern VLSI design
field. To satisfy MOORE’S law and to produce consumer electronics goods with
more backup and less weight, low power VLSI design is necessary.
Fast multipliers are essential parts of digital signal processing systems. The
speed of multiply operation is of great importance in digital signal processing as well
as in the general purpose processors today, especially since the media processing took
off. In the past multiplication was generally implemented via a sequence of addition,
subtraction, and shift operations. Multiplication can be considered as a series of
repeated additions. The number to be added is the multiplicand, the number of times
that it is added is the multiplier, and the result is the product. Each step of addition
generates a partial product. In most computers, the operand usually contains the same
number of bits. When the operands are interpreted as integers, the product is generally
twice the length of operands in order to preserve the information content. This
repeated addition method that is suggested by the arithmetic definition is slow that it
is almost always replaced by an algorithm that makes use of positional representation.
It is possible to decompose multipliers into two parts. The first part is dedicated to the
generation of partial products, and the second one collects and adds them.
The basic multiplication principle is two fold i.e. evaluation of partial products
and accumulation of the shifted partial products. It is performed by the successive
additions of the columns of the shifted partial product matrix. The ‘multiplier’ is
successfully shifted and gates the appropriate bit of the ‘multiplicand’. The delayed,
gated instance of the multiplicand must all be in the same column of the shifted partial
product matrix. They are then added to form the product bit for the particular form.
Multiplication is therefore a multi operand operation. To extend the multiplication to
both signed and unsigned numbers, a convenient number system would be the
representation of numbers in two’s complement format.
The MAC (Multiplier and Accumulator Unit) is used for image processing and
digital signal processing (DSP) in a DSP processor. Algorithm of MAC is Booth's
radix-2 algorithm, Modified Booth Multiplier; 17-bit SPST adder improves speed and
reduces the power.
1.2 Speed and Size
In this, when performance of circuits is compared, it is always done in terms
of circuit speed, size and power. A good estimation of the circuit’s size is to count the
total number of gates used. The actual chip size of a circuit also depends on how the
gates are placed on the chip – the circuit’s layout. Since we do not deal with layout in
this report, the only thing we can say about this is that regular circuits are usually
smaller than non-regular ones (for the same number of gates), because regularity
allows more compact layout. The physical delay of circuits originates from the small
delays in single gates, and from the wiring between them. The delay of a wire
depends on how long it is. Therefore, it is difficult to model the wiring delay; it
requires knowledge about the circuit’s layout on the chip. The gate delay, however,
can easily be modeled by saying that the output is delayed a constant amount of time
from the latest input. What we can say about the wiring delay is that larger circuits
have longer wires, and hence more wiring delay. It follows that a circuit with a
regular layout usually has shorter wires and hence less wiring delay than a non-
regular circuit.
Therefore, if circuit delay is estimated as the total gate delay, one should also
have in minded the circuit’s size and amount of regularity, when comparing it to other
circuits. “Delay” usually refers to the “worst-case delay”. That is, if the delay of the
output is dependent on the inputs given, it is always the largest possible output delay
that sets the speed. Furthermore, if different bits in the output have different worst-
case delays, it is always the slowest bit that sets the delay for the whole output. The
slowest path between any input bit and any output bit is called the “critical path”. If a
circuit is to be speed up, it is always the critical path that should be attacked in the
first place.
1.3 Objective
The main objective of this thesis is to design and implementation of a
Multiplier and Accumulator r. A multiplier which is a combination of Modified Booth
and SPST (Spurious Power Suppression Technique) adder are designed taking into
account the less area consumption of booth algorithm because of less number of
partial products and more speedy accumulation of partial products and less power
consumption of partial products addition using SPST adder approach.
1.4 Basics of Multiplier:
Multiplication is a mathematical operation that at its simplest is an abbreviated
process of adding an integer to itself a specified number of times. A number
(multiplicand) is added to itself a number of times as specified by another number
(multiplier) to form a result (product). In elementary school, students learn to multiply
by placing the multiplicand on top of the multiplier. The multiplicand is then
multiplied by each digit of the multiplier beginning with the rightmost, Least
Significant Digit (LSD). Intermediate results (partial products) are placed one atop the
other, offset by one digit to align digits of the same weight. The final product is
determined by summation of all the partial-products. Although most people think of
multiplication only in base 10, this technique applies equally to any base, including
binary. Figure 1.1 shows the data flow for the basic multiplication technique just
described. Each black dot represents a single digit.
Figure 1.1: basic Multiplication
Here, we assume that MSB represent the sign of the digit. The operation of
multiplication is rather simple in digital electronics. It has its origin from the classical
algorithm for the product of two binary numbers. This algorithm uses addition and
shift left operations to calculate the product of two numbers. Based upon the above
procedure, we can deduce an algorithm for any kind of multiplication which is shown
in figure 1.2. We can check at the initial stage also that whether the product will be
positive or negative or after getting the whole result, MSB of the results tells the sign
of the product.
Figure 1.2 Signed Multiplication Algorithm
1.5 Binary Multiplication
In the binary number system the digits, called bits, are limited to the set [0, 1].
The result of multiplying any binary number by a single binary bit is either 0, or the
original number. This makes forming the intermediate partial-products simple and
efficient. Summing these partial-products is the time consuming task for binary
multipliers. One logical approach is to form the partial-products one at a time and sum
them as they are generated. Often implemented by software on processors that do not
have a hardware multiplier, this technique works fine, but is slow because at least one
machine cycle is required to sum each additional partial-product. For applications
where this approach does not provide enough performance, multipliers can be
implemented directly in hardware. The two main categories of binary multiplication
include signed and unsigned numbers. Digit multiplication is a series of bit shifts and
series of bit additions, where the two numbers, the multiplicand and the multiplier are
combined into the result. Considering the bit representation of the multiplicand x =
xn-1…..x1 x0 and the multiplier y = yn-1…..y1y0 in order to form the product up to n
shifted copies of the multiplicand are to be added for unsigned multiplication. The
entire process consists of three steps, partial product generation, partial product
reduction and final addition.
1.6 Multiplication Process
The simplest multiplication operation is to directly calculate the product of
two numbers by hand. This procedure can be divided into three steps: partial product
generation, partial product reduction and the final addition. To further specify the
operation process, let us calculate the product of 2 two’s complement numbers, for
example, 11012 (−310) and 01012 (510), when computing the product by hand, which
can be described according to figure 1.3.
Fig 1.3 Multiplication calculations by hand
The bold italic digits are the sign extension bits of the partial products. The
first operand is called the multiplicand and the second the multiplier. The intermediate
products are called partial products and the final result is called the product. However,
the multiplication process, when this method is directly mapped to hardware, is shown
in figure 1.2. As can been seen in the figures, the multiplication operation in hardware
consists of PP generation, PP reduction and final addition steps. The two rows before
the product are called sum and carry bits. The operation of this method is to take one
of the multiplier bits at a time from right to left, multiplying the multiplicand by the
single bit of the multiplier and shifting the intermediate product one position to the
left of the earlier intermediate products.
All the bits of the partial products in each column are added to obtain two bits:
sum and carry. Finally, the sum and carry bits in each column have to be summed.
Similarly, for the multiplication of an n-bit multiplicand and an m-bit multiplier, a
product with n + m bits long and m partial products can be generated. The method
shown in figure 1.3 is also called a non-Booth encoding scheme.
Fig 1.4: Multiplication Operation in hardware
CHAPTER 2 CHAPTER 2
LITERATURE REVIEW
2.1 Backgrounds
2.1.1 MAC
In the majority of digital signal processing (DSP) applications
the critical operations are the multiplication and accumulation. Real-
time signal processing requires high speed and high throughput
Multiplier-Accumulator (MAC) unit that consumes low power, which
is always a key to achieve a high performance digital signal
processing system. The purpose of this work is to design and
implementation of a low power MAC unit with block enabling
technique to save power. Firstly, a 1-bit MAC unit is designed, with
appropriate geometries that give optimized power, area and delay.
The delay in the pipeline stages in the MAC unit is estimated based
on which a control unit is designed to control the data flow between
the MAC blocks for low power. Similarly, the N-bit MAC unit is
designed and controlled for low power using a control logic that
enables the pipelined stages at appropriate time. The adder cell
designed has advantage of high operational speed, small Gate count
and low power.
In general, a multiplier uses Booth’s algorithm and array of full adders (FAs),
or Wallace tree instead of the array of FA’s., i.e., this multiplier mainly consists of the
three parts: Booth encoder, a tree to compress the partial products such as Wallace
tree, and final adder. Because Wallace tree is to add the partial products from encoder
as parallel as possible, its operation time is proportional to, where is the number of
inputs. It uses the fact that counting the number of 1’s among the inputs reduces the
number of outputs into. In real implementation, many (3:2) or (7:3) counters are used
to reduce the number of outputs in each pipeline step. The most effective way to
increase the speed of a multiplier is to reduce the number of the partial products
because multiplication precedes a series of additions for the partial products. To
reduce the number of calculation steps for the partial products, MBA algorithm has
been applied mostly where Wallace tree has taken the role of increasing the speed to
add the partial products. To increase the speed of the MBA algorithm, many parallel
multiplication architectures have been researched .Among them, the architectures
based on the Baugh–Wooley algorithm (BWA) have been developed and they have
been applied to various digital filtering calculations.
One of the most advanced types of MAC for general-purpose digital signal
processing has been proposed by Elguibaly . It is an architecture in which
accumulation has been combined with the carry save adder (CSA) tree that
compresses partial products. In the architecture proposed in , the critical path
was reduced by eliminating the adder for accumulation and decreasing the number of
input bits in the final adder. While it has a better performance because of the reduced
critical path compared to the previous MAC architectures, there is a need to improve
the output rate due to the use of the final adder results for accumulation. An
architecture to merge the adder block to the accumulator register in the MAC operator
was proposed in to provide the possibility of using two separate /2-bit adders instead
of one -bit adder to accumulate the –bitMAC results. Recently, Zicari proposed an
architecture that took a merging technique to fully utilize the 4–2 compressor . It also
took this compressor as the basic building blocks for the multiplication circuit.
A new architecture for a high-speed MAC is proposed. In this MAC, the
computations of multiplication and accumulation are combined and a hybrid-type
CSA structure is proposed to reduce the critical path and improve the output rate. It
uses MBA algorithm based on 1’s complement number system. A modified array
structure for the sign bits is used to increase the density of the operands. A carry look-
ahead adder (CLA) is inserted in the CSA tree to reduce the number of bits in the final
adder. In addition, in order to increase the output rate by optimizing the pipeline
efficiency, intermediate calculation results are accumulated in the form of sum and
carry instead of the final adder outputs.
2.1.2. Overview of MAC:
A multiplier can be divided into three operational steps. The first is radix-2
Booth encoding in which a partial product is generated from the multiplicand X and
the multiplier Y. The second is adder array or partial product compression to add all
partial products. The last is the final addition in which the process to accumulate the
multiplied results is included.
The general hardware architecture of this MAC is shown in Fig.1.2. It
executes the multiplication operation by multiplying the input multiplier X and the
multiplicand Y. This is added to the previous multiplication result Z as the
accumulation step.
The N-bit 2’s complement binary number can be expressed as
……..(1)
If (1) is expressed in base-4 type redundant sign digit form in order to apply the radix-
2 Booth’s algorithm.
……………………………………..(2)
……………………(3)
If (2) is used, multiplication can be expressed as
………………………………(4)
If these equations are used, the afore-mentioned multiplication–accumulation results
can be expressed as
…….(5)
Each of the two terms on the right-hand side of (5) is calculated independently
and the final result is produced by adding the two results. The MAC architecture
implemented by (5) is called the standard design [6].
If -bit data are multiplied, the number of the generated partial products is
proportional to N. In order to add them serially, the execution time is also
proportional to N. The architecture of a multiplier, which is the fastest, uses radix-2
Booth encoding that generates partial products. If radix-2 Booth encoding is used, the
number of partial products, is reduced to half, resulting in the decrease in Addition of
Partial Products step. In addition, the signed multiplication based on 2’s complement
numbers is also possible. Due to these reasons, most current used multipliers adopt
the Booth encoding.
2.2 Multiplier and Accumulator Unit
MAC is composed of an adder, multiplier and an accumulator.
Usually adders implemented are Carry- Select or Carry-Save adders,
as speed is of utmost importance in DSP (Chandrakasan, Sheng, &
Brodersen, 1992 and Weste & Harris, 3rd Ed). One implementation of
the multiplier could be as a parallel array multiplier. The inputs for
the MAC are to be fetched from memory location and fed to the
multiplier block of the MAC, which will perform multiplication and
give the result to adder which will accumulate the result and then
will store the result into a memory location. This entire process is to
be achieved in a single clock cycle (Weste & Harris, 3rd Ed). The
architecture of the MAC unit which had been designed in this work
consists of one 16 bit register, one 16-bit Modified Booth Multiplier,
32-bit accumulator. To multiply the values of A and B, Modified
Booth multiplier is used instead of conventional multiplier because
Modified Booth multiplier can increase the MAC unit design speed
and reduce multiplication complexity. SPST Adder is used for the
addition of partial products and a register is used for accumulation..
The operation of the designed MAC unit is as in Equation 2.1. The
product of Ai X Bi is always fed back into the 32-bit accumulator and
then added again with the next product Ai x Bi. This MAC unit is
capable of multiplying and adding with previous product
consecutively up to as many as times.
Fig 2.5: Simple Multiplier and Accumulator Architecture
CHAPTER 3 CHAPTER 3
Design of MAC
In the majority of digital signal processing (DSP) applications
the critical operations usually involve many multiplications and/or
accumulations. For real-time signal processing, a high speed and
high throughput Multiplier-Accumulator (MAC) is always a key to
achieve a high performance digital signal processing system. In the
last few years, the main consideration of MAC design is to enhance
its speed. This is because; speed and throughput rate is always the
concern of digital signal processing system. But for the epoch of
personal communication, low power design also becomes another
main design consideration. This is because; battery energy available
for these portable products limits the power consumption of the
system. Therefore, the main motivation of this work is to investigate
various Pipelined multiplier/accumulator architectures and circuit
design techniques which are suitable for implementing high
throughput signal processing algorithms and at the same time
achieve low power consumption. A conventional MAC unit consists
of (fast multiplier) multiplier and an accumulator that contains the
sum of the previous consecutive products. The function of the MAC
unit is given by the following equation:
F = Σ A i Bi…………………………………………………………(2.1)
The main goal of a DSP processor design is to enhance the speed of
the MAC unit, and at the same time limit the power consumption. In
a pipelined MAC circuit, the delay of pipeline stage is the delay of a
1-bit full adder. Estimating this delay will assist in identifying the
overall delay of the pipelined MAC. In this work, 1-bit full adder is
designed. Area, power and delay are calculated for the full adder,
based on which the pipelined MAC unit is designed for low power.
3.1 High-Speed Booth Encoded Parallel Multiplier Design:
Fast multipliers are essential parts of digital signal processing
systems. The speed of multiply operation is of great importance in
digital signal processing as well as in the general purpose
processors today, especially since the media processing took off. In
the past multiplication was generally implemented via a sequence
of addition, subtraction, and shift operations. Multiplication can be
considered as a series of repeated additions. The number to be
added is the multiplicand, the number of times that it is added is the
multiplier, and the result is the product. Each step of addition
generates a partial product. In most computers, the operand usually
contains the same number of bits. When the operands are
interpreted as integers, the product is generally twice the length of
operands in order to preserve the information content. This
repeated addition method that is suggested by the arithmetic
definition is slow that it is almost always replaced by an algorithm
that makes use of positional representation. It is possible to
decompose multipliers into two parts. The first part is dedicated to
the generation of partial products, and the second one collects and
adds them.
Fig 3.1 Hardware architecture of the proposed MAC.
The basic multiplication principle is two fold i.e. evaluation of
partial products and accumulation of the shifted partial products. It
is performed by the successive additions of the columns of the
shifted partial product matrix. The ‘multiplier’ is successfully shifted
and gates the appropriate bit of the ‘multiplicand’. The delayed,
gated instance of the multiplicand must all be in the same column of
the shifted partial product matrix. They are then added to form the
product bit for the particular form. Multiplication is therefore a multi
operand operation. To extend the multiplication to both signed and
unsigned.
Fig 3.2 Basic arithmetic steps of multiplication and accumulation.
3.2 Derivation of MAC Arithmetic
Basic Concept: If an operation to multiply two –bit numbers and accumulates
into a 2-bit number is considered, the critical path is determined by the 2-bit
accumulation operation. If a pipeline scheme is applied for each step in the standard
design of Fig 1, the delay of the last accumulator must be reduced in order to improve
the performance of the MAC. The overall performance of the proposed MAC is
improved by eliminating the accumulator itself by combining it with the CSA
function. If the accumulator has been eliminated, the critical path is then determined
by the final adder in the multiplier. The basic method to improve the performance of
the final adder is to decrease the number of input bits. In order to reduce this number
of input bits, the multiple partial products are compressed into a sum and a carry by
CSA. The number of bits of sums and carries to be transferred to the final adder is
reduced by adding the lower bits of sums and carries in advance within the range in
which the overall performance will not be degraded. A 2-bit CLA is used to add the
lower bits in the CSA. In addition, to increase the output rate when pipelining is
applied, the sums and carrys from the CSA are accumulated instead of the outputs
from the final adder in the manner that the sum and carry from the CSA in the
previous cycle are inputted to CSA. Due to this feedback of both sum and carry, the
number of inputs to CSA increases, compared to the standard design and . In order to
efficiently solve the increase in the amount of data, a CSA architecture is
modified to treat the sign bit.
Equation Derivation: The aforementioned concept is applied to to express the
proposed MAC arithmetic. Then, the multiplication would be transferred to a
hardware architecture that complies with the proposed concept, in which the feedback
value for accumulation will be modified and expanded for the new MAC.
First, if the multiplication in (4) is decomposed and rearranged, it becomes
If this is divided into the first partial product, sum of the middle partial
products, and the final partial product, it can be reexpressed as . The reason for
separating the partial product addition as is that three types of data are fed back for
accumulation, which are the sum, the carry, and the preadded results of the sum and
carry from lower bits.
Now, the proposed concept is applied to in (5). If is first divided into upper
and lower bits and rearranged, (8) will be derived. The first term of the right-hand
side in (8) corresponds to the upper bits. It is the value that is fed back as the sum and
the carry. The second term corresponds to the lower bits and is the value that is fed
back as the addition result for the sum and carry
The second term can be separated further into the carry term and sum term as
Thus,
the MAC arithmetic is
Fig 3.3 Proposed arithmetic operation of multiplication and accumulation.
Fig 3.4 Hardware architecture of general MAC.
3.3 Modified Booth Encoder:
In order to achieve high-speed multiplication, multiplication
algorithms using parallel counters, such as the modified Booth
algorithm has been proposed, and some multipliers based on the
algorithms have been implemented for practical use. This type of
multiplier operates much faster than an array multiplier for longer
operands because its computation time is proportional to the
logarithm of the word length of operands.
Booth multiplication is a technique that allows for smaller,
faster multiplication circuits, by recoding the numbers that are
multiplied. It is possible to reduce the number of partial products by
half, by using the technique of radix-4 Booth recoding. The basic
idea is that, instead of shifting and adding for every column of the
multiplier term and multiplying by 1 or 0, we only take every second
column, and multiply by ±1, ±2, or 0, to obtain the same results.
The advantage of this method is the halving of the number of partial
products. To Booth recode the multiplier term, we consider the bits
in blocks of three, such that each block overlaps the previous block
by one bit. Grouping starts from the LSB, and the first block only
uses two bits of the multiplier. Figure 3 shows the grouping of bits
from the multiplier term for use in modified booth encoding.
Fig.2.2 Grouping of bits from the multiplier term
Each block is decoded to generate the correct partial product. The
encoding of the multiplier Y, using the modified booth algorithm,
generates the following five signed digits, -2, -1, 0, +1, +2. Each
encoded digit in the multiplier performs a certain operation on the
multiplicand, X, as illustrated in Table 1
For the partial product generation, we adopt Radix-4 Modified
Booth algorithm to reduce the number of partial products for
roughly one half. For multiplication of 2’s complement numbers, the
two-bit encoding using this algorithm scans a triplet of bits. When
the multiplier B is divided into groups of two bits, the algorithm is
applied to this group of divided bits.
Figure 4, shows a computing example of Booth multiplying
two numbers ”2AC9” and “006A”. The shadow denotes that the
numbers in this part of Booth multiplication are all zero so that this
part of the computations can be neglected. Saving those
computations can significantly reduce the power consumption
caused by the transient signals.
Fig.2.3 Illustration of multiplication using modified Booth
encoding
The PP generator generates five candidates of the partial
products, i.e., {-2A,-A, 0, A, 2A}. These are then selected according
to the Booth encoding results of the operand B. When the operand
besides the Booth encoded one has a small absolute value, there
are opportunities to reduce the spurious power dissipated in the
compression tree.
3.4 Partial product generator:
Fig.2.5 Booth partial product selector logic
The multiplication first step generates from A and X a set
of bits whose weights sum is the product P. For unsigned
multiplication, P most significant bit weight is positive, while in 2's
complement it is negative.
The partial product is generated by doing AND between ‘a’
and ‘b’ which are a 4 bit vectors as shown in fig. If we take, four bit
multiplier and 4-bit multiplicand we get sixteen partial products in
which the first partial product is stored in ‘q’. Similarly, the second,
third and fourth partial products are stored in 4-bit vector n, x, y.
Fig.2.6 Booth partial products Generation
The multiplication second step reduces the partial products
from the preceding step into two numbers while preserving the
weighted sum. The sough after product P is the sum of those two
numbers. The two numbers will be added during the third step The
"Wallace trees" synthesis follows the Dadda's algorithm, which
assures of the minimum counter number. If on top of that we
impose to reduce as late as (or as soon as) possible then the
solution is unique. The two binary number to be added during the
third step may also be seen a one number in CSA notation (2 bits
per digit).
Fig.2.7 Booth single partial product selector logic
Multiplication consists of three steps: 1) the first step to
generate the partial products; 2) the second step to add the
generated partial products until the last two rows are remained; 3)
the third step to compute the final multiplication results by adding
the last two rows. The modified Booth algorithm reduces the
number of partial products by half in the first step. We used the
modified Booth encoding (MBE) scheme proposed in. It is known as
the most efficient Booth encoding and decoding scheme. To multiply
X by Y using the modified Booth algorithm starts from grouping Y by
three bits and encoding into one of {-2, -1, 0, 1, 2}. Table I shows
the rules to generate the encoded signals by MBE scheme and Fig. 1
(a) shows the corresponding logic diagram. The Booth decoder
generates the partial products using the encoded signals as shown
in Fig. 1
Fig.2.8 Booth Encoder
Fig.2.9 Booth Decoder
Fig. shows the generated partial products and sign extension
scheme of the 8-bit modified Booth multiplier. The partial products
generated by the modified Booth algorithm are added in parallel
using the Wallace tree until the last two rows are remained. The
final multiplication results are generated by adding the last two
rows. The carry propagation adder is usually used in this step.
Fig 2.10 Truth table for MBE Scheme
3.5 Proposed CSA Architecture
The architecture of the hybrid-type CSA that complies with the operation of
the proposed MAC is shown in Fig. 5, which performs 8-bit operation. In Fig. 2.11
Si is to simplify the sign expansion and Ni is to compensate 1’s complement number
into 2’s complement number. S[i] and C[i] correspond to the ith bit of the feedback
sum and carry. Z[i] is the ith bit of the sum of the lower bits for each partial product
that were added in advance and Z’[i] is the previous result. In addition, Pj[i]
corresponds to the ith bit of the jth partial product. Since the multiplier is for 8 bits,
totally four partial products are generated from the Booth encoder. This CSA requires
at least four rows of FAs for the four partial products. Thus, totally five FA rows are
necessary since one more level of rows are needed for accumulation. For an -bit MAC
operation, the level of CSA is (n/2+1) . The white square in Fig. 2.11 represents an
FA and the gray square is a half adder (HA). The rectangular symbol with five inputs
is a 2-bit CLA with a carry input.
Fig 2.11 Architecture of the proposed CSA tree.
The critical path in this CSA is determined by the 2-bit CLA. It is also
possible to use FAs to implement the CSA without CLA. However, if the lower bits
of the previously generated partial product are not processed in advance by the CLAs,
the number of bits for the final adder will increase. When the entire multiplier or
MAC is considered, it degrades the performance. In Table I, the characteristics of the
proposed CSA architecture have been summarized and briefly compared with other
architectures. For the number system, the proposed CSA uses 1’scomplement, but
ours uses a modified CSA array without sign extension. The biggest difference
between ours and the others is the type of values that is fed back for accumulation.
Ours has the smallest number of inputs to the final adder.
CHAPTER 4 CHAPTER 4
Results and Discussion
3.1 Simulation Results of MAC:
3.2. Introduction to FPGA
FPGA stands for Field Programmable Gate Array which has the array of logic
module, I /O module and routing tracks (programmable interconnect). FPGA can be
configured by end user to implement specific circuitry. Speed is up to 100 MHz but at
present speed is in GHz.
Main applications are DSP, FPGA based computers, logic emulation, ASIC
and ASSP. FPGA can be programmed mainly on SRAM (Static Random Access
Memory). It is Volatile and main advantage of using SRAM programming technology
is re-configurability. Issues in FPGA technology are complexity of logic element,
clock support, IO support and interconnections (Routing).
3.2.1. FPGA Design Flow:
FPGA contains a two dimensional arrays of logic blocks and interconnections
between logic blocks. Both the logic blocks and interconnects are programmable.
Logic blocks are programmed to implement a desired function and the interconnects
are programmed using the switch boxes to connect the logic blocks.
To be more clear, if we want to implement a complex design (CPU for
instance), then the design is divided into small sub functions and each sub function is
implemented using one logic block. Now, to get our desired design (CPU), all the sub
functions implemented in logic blocks must be connected and this is done by
programming the interconnects.
Internal structure of an FPGA is depicted in the following figure.
FPGAs, alternative to the custom ICs, can be used to implement an entire
System On one Chip (SOC). The main advantage of FPGA is ability to reprogram.
User can reprogram an FPGA to implement a design and this is done after the FPGA
is manufactured. This brings the name “FieldProgrammable.”
Custom ICs are expensive and takes long time to design so they are useful
when produced in bulk amounts. But FPGAs are easy to implement with in a short
time with the help of Computer Aided Designing (CAD) tools (because there is no
physical layout process, no mask making, and no IC manufacturing).
Some disadvantages of FPGAs are, they are slow compared to custom ICs as
they can’t handle vary complex designs and also they draw more power.
Xilinx logic block consists of one Look Up Table (LUT) and one FlipFlop. An
LUT is used to implement number of different functionality. The input lines to the
logic block go into the LUT and enable it. The output of the LUT gives the result of
the logic function that it implements and the output of logic block is registered or
unregistered out put from the LUT.
SRAM is used to implement a LUT.A k-input logic function is implemented
using 2^k * 1 size SRAM. Number of different possible functions for k input LUT is
2^2^k. Advantage of such an architecture is that it supports implementation of so
many logic functions, however the disadvantage is unusually large number of memory
cells required to implement such a logic block in case number of inputs is large.
Figure below shows a 4-input LUT based implementation of logic block
LUT based design provides for better logic block utilization. A k-input LUT
based logic block can be implemented in number of different ways with trade off
between performance and logic density.
An n-LUT can be shown as a direct implementation of a function truth-table. Each of
the latch holds the value of the function corresponding to one input combination. For
Example: 2-LUT can be used to implement 16 types of functions like AND , OR,
A+not B .... etc.
A B AND OR ..... ...... ......
0 0 0 0
0 1 0 1
1 0 0 1
1 1 1 1
Interconnects
A wire segment can be described as two end points of an interconnect with no
programmable switch between them. A sequence of one or more wire segments in an
FPGA can be termed as a track.
Typically an FPGA has logic blocks, interconnects and switch blocks
(Input/Output blocks). Switch blocks lie in the periphery of logic blocks and
interconnect. Wire segments are connected to logic blocks through switch blocks.
Depending on the required design, one logic block is connected to another and so on.
FPGA DESIGN FLOW
In this part of tutorial we are going to have a short intro on FPGA design flow. A
simplified version of design flow is given in the flowing diagram.
FPGA Design Flow
Design Entry
There are different techniques for design entry. Schematic based, Hardware
Description Language and combination of both etc. . Selection of a method depends
on the design and designer. If the designer wants to deal more with Hardware, then
Schematic entry is the better choice. When the design is complex or the designer
thinks the design in an algorithmic way then HDL is the better choice. Language
based entry is faster but lag in performance and density.
HDLs represent a level of abstraction that can isolate the designers from the
details of the hardware implementation. Schematic based entry gives designers much
more visibility into the hardware. It is the better choice for those who are hardware
oriented. Another method but rarely used is state-machines. It is the better choice for
the designers who think the design as a series of states. But the tools for state machine
entry are limited. In this documentation we are going to deal with the HDL based
design entry.
Synthesis
The process which translates VHDL or Verilog code into a device netlist
formate. i.e a complete circuit with logical elements( gates, flip flops, etc…) for the
design.If the design contains more than one sub designs, ex. to implement a
processor, we need a CPU as one design element and RAM as another and so on, then
the synthesis process generates netlist for each design element Synthesis process will
check code syntax and analyze the hierarchy of the design which ensures that the
design is optimized for the design architecture, the designer has selected. The
resulting netlist(s) is saved to an NGC( Native Generic Circuit) file (for Xilinx®
Synthesis Technology (XST)).
FPGA Synthesis
Implementation
This process consists a sequence of three steps
1. Translate
2. Map
3. Place and Route
Translate:
Process combines all the input netlists and constraints to a logic design file.
This information is saved as a NGD (Native Generic Database) file. This can be done
using NGD Build program. Here, defining constraints is nothing but, assigning the
ports in the design to the physical elements (ex. pins, switches, buttons etc) of the
targeted device and specifying time requirements of the design. This information is
stored in a file named UCF (User Constraints File). Tools used to create or modify the
UCF are PACE, Constraint Editor etc.
FPGA Translate
Map
Process divides the whole circuit with logical elements into sub blocks such
that they can be fit into the FPGA logic blocks. That means map process fits the logic
defined by the NGD file into the targeted FPGA elements (Combinational Logic
Blocks (CLB), Input Output Blocks (IOB)) and generates an NCD (Native Circuit
Description) file which physically represents the design mapped to the components of
FPGA. MAP program is used for this purpose.
FPGA map
Place and Route :
PAR program is used for this process. The place and route process places the
sub blocks from the map process into logic blocks according to the constraints and
connects the logic blocks. Ex. if a sub block is placed in a logic block which is very
near to IO pin, then it may save the time but it may effect some other constraint. So
trade off between all the constraints is taken account by the place and route process
The PAR tool takes the mapped NCD file as input and produces a completely
routed NCD file as output. Output NCD file consists the routing information.
FPGA Place and route
Device Programming:
Now the design must be loaded on the FPGA. But the design must be
converted to a format so that the FPGA can accept it. BITGEN program deals with the
conversion. The routed NCD file is then given to the BITGEN program to generate a
bit stream (a .BIT file) which can be used to configure the target FPGA device. This
can be done using a cable. Selection of cable depends on the design.
Design Verification
Verification can be done at different stages of the process steps.
Behavioral Simulation (RTL Simulation):
This is first of all simulation steps; those are encountered throughout the
hierarchy of the design flow. This simulation is performed before synthesis process to
verify RTL (behavioral) code and to confirm that the design is functioning as
intended. Behavioral simulation can be performed on either VHDL or Verilog
designs. In this process, signals and variables are observed, procedures and functions
are traced and breakpoints are set. This is a very fast simulation and so allows the
designer to change the HDL code if the required functionality is not met with in a
short time period. Since the design is not yet synthesized to gate level, timing and
resource usage properties are still unknown.
Functional simulation (Post Translate Simulation):
Functional simulation gives information about the logic operation of the
circuit. Designer can verify the functionality of the design using this process after the
Translate process. If the functionality is not as expected, then the designer has to
made changes in the code and again follow the design flow steps.
Static Timing Analysis :
This can be done after MAP or PAR processes Post MAP timing report lists signal
path delays of the design derived from the design logic. Post Place and Route timing
report incorporates timing delay information to provide a comprehensive timing
summary of the design.
3.4 Synthesis Result
The developed MAC design is simulated and verified their functionality. Once
the functional verification is done, the RTL model is taken to the synthesis process
using the Xilinx ISE tool. In synthesis process, the RTL model will be converted to
the gate level netlist mapped to a specific technology library. This MAC design can
be synthesized on the family of Spartan 3E.
Here in this Spartan 3E family, many different devices were available in the
Xilinx ISE tool. In order to synthesis this design the device named as “XC3S500E”
has been chosen and the package as “FG320” with the device speed such as “-4”. The
design of MAC is synthesized and its results were analyzed as follows.
Device utilization summary:
This device utilization includes the following.
Logic Utilization
Logic Distribution
Total Gate count for the Design
The device utilization summery is shown above in which its gives the details of
number of devices used from the available devices and also represented in %. Hence
as the result of the synthesis process, the device utilization in the used device and
package is shown above.
Timing Summary:
Speed Grade: -4
Minimum period: 11.496ns (Maximum Frequency: 86.987MHz)
Minimum input arrival time before clock: 6.892ns
Maximum output required time after clock: 4.394ns
Maximum combinational path delay: No path found
In timing summery, details regarding time period and frequency is shown are
approximate while synthesize. After place and routing is over, we get the exact timing
summery. Hence the maximum operating frequency of this synthesized design is
given as 86.987 MHz and the minimum period as 11.496 ns. Here, OFFSET IN is the
minimum input arrival time before clock and OFFSET OUT is maximum output
required time after clock.
RTL Schematic
The RTL (Register Transfer Logic) can be viewed as black box after
synthesize of design is made. It shows the inputs and outputs of the system. By
double-clicking on the diagram we can see gates, flip-flops and MUX.
Figure 3.6 Schematic with Basic Inputs and Output
3.4 Summary
The developed MAC design is modelled and is simulated using the Modelsim
tool.
The simulation results are discussed by considering different cases.
The RTL model is synthesized using the Xilinx tool in Spartan 3E and their
synthesis results were discussed with the help of generated reports.
INPUT
OUTPUTS
CHAPTER 5 CHAPTER 5
CONCLUSION
A 8x8 multiplier-accumulator (MAC) is presented in this work.
A Radix 4
Modified Booth multiplier circuit is used for MAC architecture.
Compared to other circuits, the Booth multiplier has the highest
operational speed and less hardware count. The basic building
blocks for the MAC unit are identified and each of the blocks is
analyzed for its performance. Power and delay is calculated for the
blocks. 1-bit MAC unit is designed with enable to reduce the total
power consumption based on block enable technique. Using this
block, the N-bit MAC unit is constructed and the total power
consumption is calculated for the MAC unit. The power reduction
techniques adopted in this work. The MAC unit designed in this work
can be used in filter realizations for High speed DSP applications.
Table 12 summarizes the results obtained.