Interconnect Dominant Design Methodology
for DSP Architectures
- A Mixed Number System Based Approach
Subramanian Rama
Vasanth Ramesan
Praveen Sathyanarayanan
A Thesis
Submitted to
Waran Research Foundation(WARF)
In Partial Fulfillment of the
Requirements for the
Research Training Program at WARF
March 2003
Interconnect Dominant Design Methodology
for DSP Architectures
- A Mixed Number System Based Approach
Work done by
Subramanian Rama
Vasanth Ramesan
Praveen Sathyanarayanan
Approved by:
Prof.N.Venkateswaran
Director WARF
Date Approved
Acknowledgement
We are eternally grateful to our guru and mentor Prof. N. Venkateswaran,
whom we affectionately call Waran, for initiating us into research. If it
were not for the countless hours of discussions that we had with him, this
thesis would have remained a dream. He is a constant source of inspiration
for numerous students like us. We will always treasure his philosophy and
friendship.
We are thankful to our institution Sri Sivasubramaniya Nadar College of
Engineering for their support and encouragement. We appreciate the help
rendered by Mr.Mahalingam Venkataraman of the VLSI Test group at WARF
in carrying out some of our simulations.
We would also like to thank Prof. Earl Swartzlander Jr. of the University
of Texas at Austin for clarifying our queries and Mr. Patrick Lysaght,
Senior Director, Xilinx Research Labs, for his encouragement of our research
proposals.
We are indebted to our parents for putting up with our odd working
hours. Their moral support has helped us stay focussed.
We thank the Almighty for giving us the strength and confidence to pursue
our goals.
Subramanian Rama
Vasanth Ramesan
Praveen Sathyanarayanan
To our parents
and
Guru Prof N. Venkateswaran
Abstract
With deep submicron (DSM), the gates have become smaller and faster,
whereas the amount of interconnect on a chip used to connect these small
and fast gates has grown exponentially. The ratio of interconnect delay to
gate delay continues to increase in favor of interconnect delay as DSM designs
continue to get smaller. The result is a shift in the design paradigm based
on interconnect delay dominance.
Buffer insertion techniques have been successful in reducing interconnect
delay. However, buffer insertion consumes power and occupies a large amount
of the chip area. The power consumed by these delay optimal devices and wires
will increase as we go into the DSM era.
This thesis investigates the DSM issues in the design of DSP algorithms
and architectures. The DSM issues have been analyzed in great depth with
respect to interconnect dominance in FFT algorithms and architectures, as
well as in DFT. One of the main findings of the thesis is that the FFT
architectures suffer from a high degree of interconnect dominance, making them
unsuitable for DSM technology when compared with DFT.
High performance, accuracy and low power are the most important design
parameters of DSP architectures. In DSM based technology, while high performance
can be achieved, power becomes a critical factor, which needs either
a new architecture or even a new number representation. The computational
complexity of DSP algorithms leads to high power consumption, particularly
in high performance applications. An architecture for an arithmetic processor
based on a mixed number representation is presented. Here, the sign/log
number system is embedded into the residue number system. It is shown
that this mixed number representation called Logarithmic Residue Number
System (LRNS) achieves low power and high performance over the Binary,
Residue and sign/log number systems. It is further shown that unlike the
sign/log number system, LRNS maintains an accuracy of within 1 percent of
the binary number system. A special purpose power efficient instruction set
for the processor is proposed.
The work presented in this thesis is expected to help in developing high
performance low power DSP systems. As a case study, LRNS is shown to
reduce the computational complexity in time frequency transforms like the
Gabor.
Contents
Abstract iv
1 Introduction 1
1.1 Discrete Fourier Transform . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Fast Fourier Transform . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Multidimensional DFT . . . . . . . . . . . . . . . . . . . . . . 11
1.3.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 DSM Technological Issues . . . . . . . . . . . . . . . . . . . . 14
1.4.1 Interconnect Dominance . . . . . . . . . . . . . . . . . 14
1.4.2 Effect of Interconnects on Delay and Power . . . . . . . 16
1.5 Influence of Number System on Power . . . . . . . . . . . . . 19
1.6 Contribution of the Thesis . . . . . . . . . . . . . . . . . . . . 22
1.6.1 FFT Vs DFT Interconnect Analysis . . . . . . . . . . . 22
1.6.2 Proposed Mixed Number Representation . . . . . . . . 23
2 Interconnect Complexity Power and Delay of Arithmetic units 24
2.1 Interconnect Complexity of Basic Functional Blocks . . . . . . 25
2.1.1 Full Adder . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.1.2 Brent-Kung Dot Operator . . . . . . . . . . . . . . . . 36
2.2 Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.2.1 Serial Adder . . . . . . . . . . . . . . . . . . . . . . . . 38
2.2.2 Ripple Carry Adder . . . . . . . . . . . . . . . . . . . . 40
2.2.3 Brent-Kung Carry Lookahead Adder . . . . . . . . . . 42
2.2.4 Carry Save Adder . . . . . . . . . . . . . . . . . . . . . 45
2.3 Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.3.1 Parallel Array Multiplier . . . . . . . . . . . . . . . . . 50
2.3.2 Wallace Tree Multiplier . . . . . . . . . . . . . . . . . 51
3 DFT:Power-Delay Analysis 55
3.1 Interconnect and Hardware Complexity . . . . . . . . . . . . . 55
3.2 Hardware Power-Delay Analysis . . . . . . . . . . . . . . . . . 57
3.3 Interconnect Power-Delay Analysis . . . . . . . . . . . . . . . 59
4 FFT:Power-Delay Analysis 61
4.1 Interconnect and Hardware Complexity . . . . . . . . . . . . . 61
4.1.1 FFT Algorithm . . . . . . . . . . . . . . . . . . . . . . 62
4.1.2 FFT Architecture . . . . . . . . . . . . . . . . . . . . . 64
4.2 Hardware Power-Delay Analysis . . . . . . . . . . . . . . . . . 68
4.3 Interconnect Power-Delay Analysis . . . . . . . . . . . . . . . 69
4.4 DFT and FFT Architectures-A DSM Perspective . . . . . . . 69
5 Number Systems and DSP 73
5.1 Characteristics of Number Systems . . . . . . . . . . . . . . . 73
5.2 Binary Number Systems . . . . . . . . . . . . . . . . . . . . . 75
5.2.1 Algorithms for multiplication and division . . . . . . . 77
5.2.2 Multiplication . . . . . . . . . . . . . . . . . . . . . . . 77
5.2.3 Division . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3 Residue Number Systems . . . . . . . . . . . . . . . . . . . . . 81
5.3.1 RNS representation of numbers . . . . . . . . . . . . . 81
5.3.2 Negative Number Representation . . . . . . . . . . . . 83
5.3.3 Arithmetic Identities . . . . . . . . . . . . . . . . . . . 85
5.3.4 Code Conversions . . . . . . . . . . . . . . . . . . . . . 85
5.3.5 Conversion from RNS to BNS- The Chinese Remainder
Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3.6 Arithmetic operations in RNS . . . . . . . . . . . . . . 92
5.4 Logarithmic Number Systems . . . . . . . . . . . . . . . . . . 94
5.4.1 LNS Representation . . . . . . . . . . . . . . . . . . . 95
5.4.2 Generation of logarithms for binary numbers . . . . . . 96
5.4.3 Arithmetic Operations . . . . . . . . . . . . . . . . . . 99
6 Logarithmic Residue Number System 103
6.1 Arithmetic operations . . . . . . . . . . . . . . . . . . . . . . 104
6.1.1 Addition and Subtraction . . . . . . . . . . . . . . . . 104
6.1.2 Multiplication . . . . . . . . . . . . . . . . . . . . . . . 106
6.2 LRNS: Area, Power, Performance . . . . . . . . . . . . . . . . 109
6.2.1 LRNS vs Binary . . . . . . . . . . . . . . . . . . . . . 110
6.2.1.1 Addition/Subtraction . . . . . . . . . . . . . 110
6.2.1.2 Multiplication . . . . . . . . . . . . . . . . . . 111
6.2.2 LRNS vs RNS . . . . . . . . . . . . . . . . . . . . . . . 112
6.2.2.1 Multiplication . . . . . . . . . . . . . . . . . . 112
6.2.3 LRNS vs Sign/Log . . . . . . . . . . . . . . . . . . . . 113
6.2.3.1 Addition/Subtraction . . . . . . . . . . . . . 114
6.2.3.2 Multiplication . . . . . . . . . . . . . . . . . . 115
6.3 Accuracy Analysis . . . . . . . . . . . . . . . . . . . . . . . . 115
6.4 LRNS Architecture for DFT and FFT . . . . . . . . . . . . . 117
7 LRNS in Time-Frequency Transforms 120
7.1 The 1-D Discrete Gabor Transform . . . . . . . . . . . . . . . 121
7.2 LRNS in Gabor Transform . . . . . . . . . . . . . . . . . . . . 125
8 Mixed number system Arithmetic Processor-MAP 128
8.1 MAP Architecture . . . . . . . . . . . . . . . . . . . . . . . . 129
8.1.1 Special Purpose Functional Units . . . . . . . . . . . . 131
8.1.2 General Purpose Functional Units . . . . . . . . . . . 131
8.2 Instruction Set . . . . . . . . . . . . . . . . . . . . . . . . . . 132
8.3 Execution Flow for MAP Instructions . . . . . . . . . . . . . . 133
8.4 Verilog Simulation of MAP Instruction Set . . . . . . . . . . . 136
9 Future Work 140
9.1 Reconfigurable FFT Architecture for different Radices . . . . . 140
9.2 Low Power High Performance LRNS based Convolver Design . 141
9.2.1 LRNS Based Convolver Architecture . . . . . . . . . . 143
10 Conclusion 145
A The Generalized Gabor Transform 147
A.1 1-D Discrete Gabor Transformation . . . . . . . . . . . . . . . 148
B CAM-Content Addressable memories 154
List of Figures
1.1 DFT Architecture-Pipelined Inner Product Processor(PIPP) . 5
1.2 Sequential Processor . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Pipeline Processor . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Parallel Processor . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Array Processor . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Scaling effects on interconnect time delay limits . . . . . . . . 16
1.7 Delay for Local and Global Wiring versus Feature Size . . . . 18
1.8 Power for all repeaters and global interconnect where 50 per-
cent of all devices are logic . . . . . . . . . . . . . . . . . . . . 19
2.1 Low-level Characterization of Adders . . . . . . . . . . . . . . 26
2.2 High-level Characterization of Adders-Area . . . . . . . . . . . 27
2.3 High-level Characterization of Adders-Power-Delay Product . 28
2.4 Characterization of Multipliers . . . . . . . . . . . . . . . . . . 29
2.5 Characterization of Multipliers-High level . . . . . . . . . . . . 30
2.6 Interconnects at Gate Level . . . . . . . . . . . . . . . . . . . 33
2.7 A full adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.8 Gate implementation of a full adder . . . . . . . . . . . . . . . 34
2.9 Gate implementation of a dot operator . . . . . . . . . . . . . 37
2.10 A Serial Adder . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.11 A ripple carry adder . . . . . . . . . . . . . . . . . . . . . . . 40
2.12 Carry Block of a Brent Kung CLA(n=8) . . . . . . . . . . . . 42
2.13 Carry Save Adder . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.14 Carry Save Adder Tree for eight operands . . . . . . . . . . . 47
2.15 Parallel Array Multiplier . . . . . . . . . . . . . . . . . . . . . 52
2.16 Multiplier Based on Wallace Tree . . . . . . . . . . . . . . . . 54
4.1 Algorithm level interconnect complexity of FFT . . . . . . . . 63
4.2 Radix 4 FFT architecture for 256 sample points . . . . . . . . 64
4.3 Radix 4 Computational Element . . . . . . . . . . . . . . . . . 65
4.4 Delay Commutator Circuit . . . . . . . . . . . . . . . . . . . . 65
4.5 Communication Complexity of Clock Distribution in the DC
Flip-Flops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1 Multiplier Hardware . . . . . . . . . . . . . . . . . . . . . . . 78
5.2 Ancient verse of the Chinese Remainder Theorem . . . . . . . 87
5.3 Generalized Block Arithmetic for Addition Subtraction and
Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4 Straight Line Approximation to Logarithmic Curve . . . . . . 97
5.5 Machine Organization to generate and use binary Logs . . . . 98
5.6 LNS Multiplier Divider Hardware . . . . . . . . . . . . . . . . 100
5.7 Hardware for Logarithmic Addition and Subtraction . . . . . . 102
6.1 Execution flow for addition/subtraction operations in LRNS . 105
6.2 Execution Flow for Multiplication in LRNS . . . . . . . . . . . 107
6.3 Flow chart depicting Architecture for Multiplication . . . . . . 108
6.4 PE of a 1-D DFT array . . . . . . . . . . . . . . . . . . . . . . 118
6.5 PE of a radix-4 FFT architecture . . . . . . . . . . . . . . . . 119
7.1 Computational flow for finding the Gabor Coefficients using
Binary and LRNS . . . . . . . . . . . . . . . . . . . . . . . . . 127
8.1 Mixed number system Arithmetic Processor(MAP) . . . . . . 130
8.2 Special purpose MAP Instruction Format . . . . . . . . . . . . 132
8.3 Execution Flow for RAD/RSU Instruction . . . . . . . . . . . 134
8.4 Execution Flow for SLM/SLD Instruction . . . . . . . . . . . 134
8.5 Execution Flow for LRM Instruction . . . . . . . . . . . . . . 135
8.6 Timing diagram for RAD Instruction . . . . . . . . . . . . . . 137
8.7 Timing diagram for RSU Instruction . . . . . . . . . . . . . . 138
8.8 Timing diagram for LRM Instruction . . . . . . . . . . . . . . 138
8.9 Timing diagram for SLD Instruction . . . . . . . . . . . . . . 139
9.1 Reconfigurable FFT Architecture . . . . . . . . . . . . . . . . 142
9.2 LRNS Based 1-D Convolver Architecture . . . . . . . . . . . . 144
B.1 Typical CMOS SRAM Memory Cell . . . . . . . . . . . . . . . 155
B.2 Typical CAM memory cell . . . . . . . . . . . . . . . . . . . . 158
List of Tables
1.1 Interconnect and Transistor Scaling Properties . . . . . . . . . 15
1.2 Average Logic Transitions in Multiplication and Division using
LNS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 Hardware and Interconnect Count for 1024 point DFT . . . . 56
3.2 Hardware and Interconnect Count for 4096 point DFT . . . . 56
3.3 Power-Delay Product for 1024 point DFT . . . . . . . . . . . 58
3.4 Power-Delay Product for 4096 point DFT . . . . . . . . . . . 58
4.1 Hardware and Interconnect Complexity for Radix 4 FFT . . . 67
4.2 Hardware and Interconnect Complexity for Radix 8 FFT . . . 67
4.3 Power-Delay Product for Radix- 4 FFT . . . . . . . . . . . . . 68
4.4 Power-Delay Product for Radix- 8 FFT . . . . . . . . . . . . . 69
5.1 RNS Representation Example . . . . . . . . . . . . . . . . . . 82
6.1 Performance variation for addition operation . . . . . . . . . . 110
6.2 Performance variation for multiplication operation . . . . . . . 112
6.3 LRNS vs RNS: Performance variation for multiplication oper-
ation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.4 LRNS vs LNS :Performance variation for add/sub operation . 114
6.5 Performance variation for multiplication operation . . . . . . . 115
6.6 Accuracy Analysis for 8 Sample point FFT-1 . . . . . . . . . . 116
6.7 Accuracy Analysis for 8 Sample point FFT-2 . . . . . . . . . . 116
Chapter 1
Introduction
Signal processing incorporates the acquisition, preparation and analysis
of signals. Advancements in digital computing have shifted the focus from
analog to digital signal processing (DSP) techniques. DSP refers to the
operation on discrete time, discrete amplitude signals.
One of the most fundamental operations in signal processing is transformation
of signals from one domain to another. The Fourier transform (FT)
is a mathematical tool that is used in the analysis and design of linear time
invariant systems. The FT is based on the discovery that it is possible to
take any periodic function of time x(t) and resolve it into an equivalent infinite
summation of sine and cosine waves with frequencies that start at 0 and
increase in multiples of a base frequency f0 = 1/T, where T is the period of
x(t).
1.1 Discrete Fourier Transform
The Fourier Transform X(ω) of a discrete signal x(n) is a continuous
function of frequency and therefore is not computationally convenient. Representation
of a sequence x(n) by samples of its spectrum X(ω) leads to the
discrete Fourier transform (DFT). DFT forms the basis of many application
fields such as spectral analysis, digital filtering, image processing and video
transmission. At present, 1-D and 2-D FT are of prime importance in speech
processing, spectrum analysis, tomography and image processing. The 3-D
FT is needed in nuclear magnetic resonance imaging algorithms. Hence, it
is often desirable in modern signal processing applications to perform two-dimensional
or higher dimensional Fourier Transforms. The increasing demand
on the speed of performing such transforms necessitates efficient very
large scale integration (VLSI) implementations.
1.1.1 Algorithm
A finite-duration sequence x(n) of length L has a FT [33]
$$X(\omega) = \sum_{n=0}^{L-1} x(n)\, e^{-j\omega n}, \qquad 0 \le \omega \le 2\pi \tag{1.1}$$
where the upper and lower indices in the summation reflect the fact that
x(n) = 0 outside the range 0 ≤ n ≤ L − 1. When we sample X(ω) at equally
spaced frequencies ωk = 2πk/N, k = 0, 1, 2, ..., N − 1, where N ≥ L, the
resultant samples are
$$X(k) \equiv X\!\left(\frac{2\pi k}{N}\right) = \sum_{n=0}^{L-1} x(n)\, e^{-j 2\pi k n / N} \tag{1.2}$$
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, 2, \ldots, N-1 \tag{1.3}$$
where for convenience, the upper index in the sum has been increased from
L − 1 to N − 1 since x(n) = 0 for n ≥ L.
The relation in (1.2) is a formula for transforming a sequence x(n) of
length L ≤ N into a sequence of frequency samples X(k) of length N. Since
the frequency samples are obtained by evaluating the Fourier transform X(ω)
at a set of N (equally spaced) discrete frequencies, the relation in (1.2) is
called the DFT of x(n).
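For a complex-valued sequence x(n) of N points, the DFT is expressed as
$$X_R(k) = \sum_{n=0}^{N-1} \left[ x_R(n) \cos\frac{2\pi k n}{N} + x_I(n) \sin\frac{2\pi k n}{N} \right] \tag{1.4}$$
$$X_I(k) = -\sum_{n=0}^{N-1} \left[ x_R(n) \sin\frac{2\pi k n}{N} - x_I(n) \cos\frac{2\pi k n}{N} \right] \tag{1.5}$$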
The direct computation of (1.4) and (1.5) requires:
1. 2N² evaluations of trigonometric functions.
2. 4N² real multiplications.
3. 4N(N − 1) real additions.
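The direct computation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `dft_real_arithmetic` and `reference_dft` and the counter dictionary are hypothetical and do not come from the thesis. The sketch evaluates (1.4) and (1.5) term by term in real arithmetic and tallies the trigonometric evaluations and real multiplications, which come out as 2N² and 4N² respectively.

```python
import cmath
import math


def dft_real_arithmetic(x_re, x_im):
    """Evaluate (1.4) and (1.5) directly; returns (X_re, X_im, counts)."""
    N = len(x_re)
    counts = {"trig": 0, "mul": 0}  # illustrative tallies, not from the thesis
    X_re, X_im = [], []
    for k in range(N):
        acc_re = 0.0
        acc_im = 0.0
        for n in range(N):
            theta = 2.0 * math.pi * k * n / N
            c, s = math.cos(theta), math.sin(theta)
            counts["trig"] += 2
            # (1.4): x_R cos + x_I sin
            acc_re += x_re[n] * c + x_im[n] * s
            # (1.5): -(x_R sin - x_I cos)
            acc_im -= x_re[n] * s - x_im[n] * c
            counts["mul"] += 4
        X_re.append(acc_re)
        X_im.append(acc_im)
    return X_re, X_im, counts


def reference_dft(x):
    """Complex-arithmetic form (1.3), used only as a cross-check."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]
```

For a 4-point input the tallies are 32 trigonometric evaluations and 64 real multiplications, in line with items 1 and 2 above.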
1.1.2 Architecture
The high computational complexity of DFT has led to the evolution of
efficient algorithms for VLSI implementation. Though many architectures
have been proposed for DFT, the most important one is the matrix-column
vector multiplication architecture.
The conventional 1-D DFT architecture executes matrix-column vector
multiplication (see Figure 1.1). The basic processing unit of this non-systolic
architecture is the inner product processor. The column vector (sample
points) is preloaded in the inner product processor and the rows of the
twiddle factor matrix are pipelined. For an N sample point 1-D DFT (N = γj),
the inner product processor has γ input inner product terms. A parallel array
multiplier using a carry save adder (CSA) tree is employed in the inner product
processor to achieve a pipelining delay of tadd, equal to the delay of one full
adder stage of the CSA tree. This pipeline rate is achievable provided that the
delay of the final stage carry propagate adder (CPA) is less than tadd. In order
to achieve high-speed addition, a Brent-Kung carry lookahead adder (CLA) is
employed. The number of pipelining cycles for the pipelined inner product
processor (PIPP) is j × γj.
Figure 1.1: DFT Architecture-Pipelined Inner Product Processor(PIPP)
For example, to compute a 1024-point 1-D DFT, we can partition the
twiddle factor matrix into 64 blocks of 16 × 16 sub-matrices. The column
vector is also partitioned into matrices of size 16 × 1. The multiplication
of a 16 × 16 sub-matrix and a 16 × 1 column matrix is performed in the inner
product processor, in which the matrix addition is carried out in the
Brent-Kung Accumulator.
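The partitioned matrix-column vector product can be modelled in software as follows. This is a behavioural sketch only, not the PIPP hardware: the names `twiddle`, `inner_product` and `dft_blocked` are hypothetical, and the sketch assumes that each row of the twiddle factor matrix is consumed in 16-wide slices whose partial inner products are accumulated, which is one plausible reading of the partitioning described above.

```python
import cmath


def twiddle(N, k, n):
    """Twiddle factor W_N^{kn} = exp(-j 2 pi k n / N)."""
    return cmath.exp(-2j * cmath.pi * k * n / N)


def inner_product(row, col):
    """One gamma-input inner product (gamma = len(row))."""
    return sum(a * b for a, b in zip(row, col))


def dft_blocked(x, block=16):
    """DFT as a matrix-vector product, processed in slices of `block` terms."""
    N = len(x)
    assert N % block == 0, "N must be a multiple of the block size"
    X = [0j] * N
    for k in range(N):                      # one row of the twiddle matrix
        acc = 0j                            # models the accumulator
        for start in range(0, N, block):    # N / block slices per row
            row = [twiddle(N, k, n) for n in range(start, start + block)]
            acc += inner_product(row, x[start:start + block])
        X[k] = acc
    return X
```

With N = 1024 and a block size of 16, this model issues 64 inner products per output sample, or 1024 × 64 in total.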
Another way to reduce the hardware cost of implementing DFT is to
use the coordinate rotation digital computer (CORDIC) technique [22] [24]. The
idea behind CORDIC computation is that the desired rotation angle is decomposed
into the weighted sum of a set of predefined elementary rotation
angles, so that rotation through each of them can be accomplished using
simple shift-and-add operations. The drawback of CORDIC-based design is
the slow computing speed.
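The angle decomposition can be illustrated with a small rotation-mode CORDIC sketch. It is written under stated assumptions: the function name `cordic_rotate` and the iteration count are hypothetical, the elementary angles are taken as atan(2^-i), and floating-point multiplications by 2^-i stand in for the hardware shifts. The one-iteration-per-elementary-angle loop also shows where the slow computing speed comes from.

```python
import math


def cordic_rotate(x, y, angle, iterations=32):
    """Rotate (x, y) by `angle` radians using CORDIC micro-rotations.

    Converges for |angle| up to about 1.74 rad (the sum of all
    elementary angles atan(2**-i)).
    """
    # Predefined elementary rotation angles.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector; the total gain is constant.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # residual angle still to be rotated through
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        # Multiplying by 2**-i models a right shift by i bits in hardware.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x / gain, y / gain
```

Rotating the unit vector (1, 0) by an angle θ within the convergence range returns approximately (cos θ, sin θ).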
Read Only Memory (ROM) based designs [46] [18] [19] are also efficient
choices for implementing 1-D DFT in certain applications. Among the ROM-
based designs, the distributed arithmetic-based designs [46] [18] and the
memory-based design [19] are two different approaches to realize multipli-
cations using ROMs. Recently, adder-based designs [6] have become popular
for DFT.
1.2 Fast Fourier Transform
Studies on the properties of DFT resulted in an algebraic structure that
could speed up the computation of DFT by orders of magnitude. Such algorithms,
which make the computation of DFT faster and easier, are known
as fast Fourier transform (FFT) algorithms [21]. The FFT algorithms simplify
the computation of (1) by rearranging the input and output data and
repeatedly partitioning them into smaller sets of sequences.
Among the two basic classes of FFT algorithms, viz., Decimation in Time
(DIT) and Decimation in Frequency (DIF) algorithms, the DIF algorithm is
found to have better signal-to-noise characteristics compared to the DIT [38].
1.2.1 Algorithm
The FFT is based on the divide-and-conquer approach. Consider the
computation of an N-point DFT, where N can be factored as a product of two
integers [33], that is,
$$N = LM \tag{1.6}$$
The input sequence x(n) can be stored in a 2-D array indexed by l for row
and m for column, where 0 ≤ l ≤ L − 1 and 0 ≤ m ≤ M − 1. The sequence
x(n) can be stored in a rectangular array in a variety of ways, each of which
depends on the mapping of index n to the indices (l, m). The mapping
$$n = l + mL \tag{1.7}$$
stores the first L elements of x(n) in the first column, the next L elements in
the second column and so on. A similar arrangement can be used to store
the computed DFT values. The mapping
$$k = Mp + q \tag{1.8}$$
where 0 ≤ p ≤ L − 1 and 0 ≤ q ≤ M − 1, stores the DFT on a row wise
basis, where the first row contains the first M elements of the DFT X(k), the
second row contains the next set of M elements and so on.
Consider that x(n) is mapped into a rectangular array x(l,m) and X(k)
is mapped into a corresponding rectangular array X(p, q). Then the DFT
can be expressed as a double sum over the elements of the rectangular array
multiplied by the corresponding phase factors. If we consider column wise
mapping for x(n) given by (1.7) and a row wise mapping for the DFT given
by (1.8) then
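$$X(p, q) = \sum_{m=0}^{M-1} \sum_{l=0}^{L-1} x(l, m)\, W_N^{(Mp+q)(mL+l)} \tag{1.9}$$
But
$$W_N^{(Mp+q)(mL+l)} = W_N^{MLmp}\, W_N^{mLq}\, W_N^{Mpl}\, W_N^{lq} \tag{1.10}$$
However, $W_N^{Nmp} = 1$, $W_N^{mqL} = W_{N/L}^{mq} = W_M^{mq}$, and $W_N^{Mpl} = W_{N/M}^{pl} = W_L^{pl}$.
With these simplifications, (1.9) can be expressed as
$$X(p, q) = \sum_{l=0}^{L-1} \left\{ W_N^{lq} \left[ \sum_{m=0}^{M-1} x(l, m)\, W_M^{mq} \right] \right\} W_L^{lp} \tag{1.11}$$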
The expression in (1.11) involves the computation of DFTs of length M
and length L. The computational complexity is:
Complex multiplications: N(M + L + 1)
Complex additions: N(M + L − 2)
where N = ML. Thus the number of multiplications has been reduced
from N² to N(M + L + 1) and the number of additions from N(N − 1) to
N(M + L − 2).
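The three steps implied by the divide-and-conquer expression (M-point DFTs, multiplication by the phase factors, then L-point DFTs) can be checked numerically with the sketch below. The function names `dft_direct`, `dft_two_factor` and `op_counts` are hypothetical; the index mappings n = l + mL and k = Mp + q are the ones given in (1.7) and (1.8).

```python
import cmath


def w(base, exponent):
    """Phase factor W_base^exponent = exp(-j 2 pi exponent / base)."""
    return cmath.exp(-2j * cmath.pi * exponent / base)


def dft_direct(x):
    """Direct N-point DFT, used as the reference."""
    N = len(x)
    return [sum(x[n] * w(N, k * n) for n in range(N)) for k in range(N)]


def dft_two_factor(x, L, M):
    """N-point DFT with N = L*M, using n = l + m*L and k = M*p + q."""
    N = L * M
    assert len(x) == N
    # Step 1: M-point DFT of each row l (inner sum over m).
    F = [[sum(x[l + m * L] * w(M, m * q) for m in range(M))
          for q in range(M)] for l in range(L)]
    # Step 2: scale by the phase factors W_N^{lq}.
    G = [[w(N, l * q) * F[l][q] for q in range(M)] for l in range(L)]
    # Step 3: L-point DFT of each column q (outer sum over l).
    X = [0j] * N
    for p in range(L):
        for q in range(M):
            X[M * p + q] = sum(G[l][q] * w(L, l * p) for l in range(L))
    return X


def op_counts(L, M):
    """Complex (multiplications, additions) from the formulas in the text."""
    N = L * M
    return N * (M + L + 1), N * (M + L - 2)
```

For N = 1024 with L = M = 32, `op_counts` gives 66560 multiplications, compared with N² = 1048576 for the direct method.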
1.2.2 Architecture
Dedicated FFT processors can be classified into four categories as shown
in Figures 1.2, 1.3, 1.4 and 1.5. They differ mainly in the degree of parallelism
in the computation. The sequential processor, the slowest of all, has a
single arithmetic unit (AU) and performs elementary computations sequentially.
The pipeline processor has more parallelism. It has log_r N arithmetic
units, where r is the radix of processing and N is the number of input sample
points. Hence, at any instant of time log_r N elementary computations can
be done simultaneously. It requires N sequential operations to complete the
N-point FFT computation. The parallel processor consists of N AUs and
hence has to perform only log_r N operations in sequence to compute the FFT.
The fastest of all is the array processor, which has as many AUs as
there are computations, viz., N log_r N, and all the computations are carried
out in parallel.
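The trade-off between the four categories can be tabulated with a short sketch. The function names are hypothetical. The AU counts and step counts for the pipeline and parallel processors are taken from the text; the step counts for the sequential processor (N log_r N) and the array processor (1) are inferred here from the stated total of N log_r N computations and are an assumption of this sketch.

```python
def int_log(n, r):
    """Return j with r**j == n; raise if n is not an exact power of r."""
    j, value = 0, 1
    while value < n:
        value *= r
        j += 1
    if value != n:
        raise ValueError("N must be a power of the radix r")
    return j


def fft_processor_classes(N, r):
    """Map each processor class to (number of AUs, sequential steps)."""
    stages = int_log(N, r)   # log_r N
    total = N * stages       # N log_r N computations, as counted in the text
    return {
        "sequential": (1, total),   # steps inferred, see lead-in
        "pipeline": (stages, N),
        "parallel": (N, stages),
        "array": (total, 1),        # steps inferred, see lead-in
    }
```

For N = 1024 and r = 4 the pipeline processor needs 5 AUs and 1024 sequential operations, while the array processor needs 5120 AUs.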
Figure 1.2: Sequential Processor
Figure 1.3: Pipeline Processor
Figure 1.4: Parallel Processor
Figure 1.5: Array Processor
The pipeline organization is best suited for high-speed real-time FFT
processing. The pipeline processor has the distinct advantage that blocks
of data can be fed in succession into the processor, since the introduction
of delay elements in each stage takes care of the necessary independence
of computation carried out in the different stages. It is capable of high
throughput rates and the speed limitation is set by the slowest computing
element in the pipeline. The pipelining rate is independent of N . Only the
initial delay required in filling up the pipe is dependent on N or number of
stages in the pipe.
1.3 Multidimensional DFT
The computation of the multidimensional DFT is of great importance in
many DSP areas such as image processing, speech processing and spectrum
analysis. Although the multidimensional DFT is a very powerful tool for
analyzing multidimensional signals, it requires a huge amount of computation
and there is a need for efficient VLSI implementation of the algorithm.
1.3.1 Algorithm
In its most general form, the DFT of an M-dimensional sequence, x(n1, n2, ..., nM),
is expressed as
$$X(k_1, k_2, \ldots, k_M) = \sum_{n_1, n_2, \ldots, n_M} x(n_1, n_2, \ldots, n_M)\, W_{N_1}^{n_1 k_1} W_{N_2}^{n_2 k_2} \cdots W_{N_M}^{n_M k_M} \tag{1.12}$$
where the indices $n_i$, $k_i$ run over the domain $[0, N_i - 1]$, $1 \le i \le M$, and
$$W_{N_i}^{n_i k_i} = e^{-j 2\pi n_i k_i / N_i}$$
Several fast algorithms are known for efficiently computing this weighted
sum. Among them, the row and column decomposition algorithm is highly
modular, that is, one can compute the DFT of an M-dimensional signal by
carrying out a 1-D DFT M times, over the indices $n_M, n_{M-1}, \ldots, n_2, n_1$,
in an iterative manner. For example, if M = 2, then
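$$X(k_1, k_2) = \sum_{0 \le n_1 \le N_1 - 1}\ \sum_{0 \le n_2 \le N_2 - 1} x(n_1, n_2)\, W_{N_2}^{n_2 k_2}\, W_{N_1}^{n_1 k_1} \tag{1.13}$$
or if we let
$$G(n_1, k_2) = \sum_{0 \le n_2 \le N_2 - 1} x(n_1, n_2)\, W_{N_2}^{n_2 k_2} \tag{1.14}$$
then
$$X(k_1, k_2) = \sum_{0 \le n_1 \le N_1 - 1} G(n_1, k_2)\, W_{N_1}^{n_1 k_1} \tag{1.15}$$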
Thus we can compute X(k1, k2) in two steps by first computing (1.14)
and then computing (1.15).
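The two-step evaluation can be sketched as below. The function names `dft_1d`, `dft_2d_row_column` and `dft_2d_direct` are hypothetical; the input is assumed to be a list of N1 rows, each holding N2 samples, so that `x[n1][n2]` plays the role of x(n1, n2).

```python
import cmath


def dft_1d(seq):
    """Direct 1-D DFT of a sequence."""
    N = len(seq)
    return [sum(seq[n] * cmath.exp(-2j * cmath.pi * n * k / N)
                for n in range(N)) for k in range(N)]


def dft_2d_row_column(x):
    """2-D DFT by row-column decomposition, following (1.14) then (1.15)."""
    N1, N2 = len(x), len(x[0])
    # (1.14): 1-D DFT over n2 for every fixed n1, giving G[n1][k2].
    G = [dft_1d(row) for row in x]
    # (1.15): 1-D DFT over n1 for every fixed k2.
    X = [[0j] * N2 for _ in range(N1)]
    for k2 in range(N2):
        column = dft_1d([G[n1][k2] for n1 in range(N1)])
        for k1 in range(N1):
            X[k1][k2] = column[k1]
    return X


def dft_2d_direct(x):
    """Direct evaluation of the double sum (1.13), as a cross-check."""
    N1, N2 = len(x), len(x[0])
    return [[sum(x[n1][n2]
                 * cmath.exp(-2j * cmath.pi * n2 * k2 / N2)
                 * cmath.exp(-2j * cmath.pi * n1 * k1 / N1)
                 for n1 in range(N1) for n2 in range(N2))
             for k2 in range(N2)]
            for k1 in range(N1)]
```

Both functions return the same N1 × N2 array up to rounding error; the row-column version replaces one 2-D sum with N1 + N2 one-dimensional transforms.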
1.3.2 Architecture
One of the most popular methods for implementing 2-D DFT is the row
column decomposition method, which requires a transposition memory be-
tween two 1-D transforms. The design in [16] is made up of two hybrid ar-
chitectures, each using a butterfly array of processors to perform a 1-D DFT
and an interconnection network to route the intermediate results back to the
array between different phases of the algorithm. One of these architectures
uses a perfect shuffle network [37] while the other uses a rotation network
to align the outputs of the processor array between different phases of the
computation. These networks are difficult to lay out in VLSI [13] because of
the interconnection problems.
Another approach for implementing multidimensional DFTs is the systolic
architecture [25] [36] [35]. Although these systolic implementations have
the advantages of simple design due to the regularity of the PEs and the local
interconnections between them, they require a large number of PEs.
Recently, an architecture for multidimensional DFT which makes use of
the conventional pipelined architecture for a 1-D FFT by using a new two
level index mapping scheme has been proposed [50]. This architecture has
fewer PEs than systolic architectures and also needs fewer multipliers.
It has a flexible area/throughput tradeoff
and regular structure with nearest neighbor interconnections only.
1.4 DSM Technological Issues
Deep submicron (DSM) effects have been proposed as a potential impediment
to the continuing advancements in integrated circuit performance. Examples
of DSM effects include the rising RC delay of on-chip wiring, noise
issues such as crosstalk and delay deterioration, and increasing power dissipation.
These issues have been addressed in a number of recent works with
the general conclusion that interconnect effects will dominate performance
in DSM designs.
The International Technology Roadmap for Semiconductors (ITRS) projects
that by 2011 over one billion transistors will be integrated into a single monolithic
die [2]. The wiring system of this billion-transistor die will distribute
clock and other signals and provide power/ground to and among the various
circuit and system functions on a chip.
1.4.1 Interconnect Dominance
Miniaturization of transistors enhances their performance but the same
cannot be said about interconnect miniaturization. Scaling interconnects
into the nanometer regime is plagued with many challenges, such as resistivity
degradation, material integration issues, high-aspect ratio via and wire
coverage, planarity control, and reliability problems due to electrical, thermal,
and mechanical stresses in a multilevel wire stack [2], and even when
these challenges are overcome, minimum interconnect scaling will still degrade
interconnect delay. For example, Table 1.1 shows that the intrinsic
interconnect delay of a 1-mm length interconnect at the 35-nm technology
node overwhelms the transistor delay by two orders of magnitude [39].

Technology            MOSFET Switching     Intrinsic Delay of     Intrinsic Delay of     W/F
                      Delay (td = CV/I)    Minimum Scaled         Reverse Scaled
                                           1 mm Interconnect      1 mm Interconnect
1 µm (Al, SiO2)       ∼ 20 ps              ∼ 5 ps                 ∼ 5 ps                 1
0.1 µm (Al, SiO2)     ∼ 5 ps               ∼ 30 ps                ∼ 5 ps                 ∼ 1.5
35 nm (Cu, Low K)     ∼ 2.5 ps             ∼ 250 ps               ∼ 5 ps                 ∼ 4.5
Table 1.1: Interconnect and Transistor Scaling Properties
Scaling effects on interconnect latency have been investigated [28], [3] and
are illustrated in the reciprocal length squared versus time delay plane seen
in Figure 1.6 after [3].
Figure 1.6: Scaling effects on interconnect time delay limits
1.4.2 Effect of Interconnects on Delay and Power
When designers target a process at 0.5 micron and above, the majority of
chip delay resides in gates. But with DSM, the gates have become smaller
and faster, whereas the amount of interconnect on a chip used to connect
these small and fast gates has grown exponentially.
The result is a shift in the design paradigm based on interconnect delay
dominance. For example, in a 0.5-micron design, the ratio of gate delay
to interconnect delay is 4 to 1, or 80 percent to 20 percent. By the time
designs have reached 0.25 micron, the ratio has reversed, with gate delay
accounting for only 20 percent of the total delay. The ratio continues to
increase in favor of interconnect delay as DSM designs continue to get smaller.
A distributed RC network can be used to model a single global on-chip
interconnect. The latency of this interconnect is given by the distributed RC
time delay (assuming rcL² ≥ L/v) as
$$\tau = r c L^2 \tag{1.16}$$
where
r   distributed resistance per unit length
c   distributed ground capacitance per unit length
L   interconnect length
v   speed of electromagnetic wave propagation
Figure 1.7 shows the delay of local and global wiring in future generations
[2].
Interconnect-driven timing optimization techniques, such as wire sizing,
buffer insertion and gate sizing have gained widespread acceptance in DSM
design. In particular, buffer insertion techniques have been successful in
reducing interconnect delay. To the first order, interconnect delay is proportional
to the square of the length of the wire. Inserting buffers effectively
divides the wire into smaller segments, which makes the interconnect delay
almost linear in terms of length (plus the buffer delays).
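The quadratic growth of the unbuffered delay and the near-linear growth after buffer insertion can be illustrated with a minimal sketch. It assumes equal-length segments and a fixed per-buffer delay `t_buf`; the function names and the normalized values of `r`, `c` and `t_buf` are illustrative and not taken from the thesis.

```python
def wire_delay(r, c, length):
    """Distributed RC delay of an unbuffered wire: tau = r * c * L^2 (Eq. 1.16)."""
    return r * c * length ** 2


def buffered_delay(r, c, length, segments, t_buf):
    """Delay of the same wire cut into equal segments, each followed by one buffer.

    Assumption: every buffer adds a fixed delay t_buf and fully isolates its segment.
    """
    seg_len = length / segments
    return segments * (wire_delay(r, c, seg_len) + t_buf)


if __name__ == "__main__":
    r, c, t_buf = 1.0, 1.0, 0.5  # normalized, hypothetical values
    for length in (2.0, 4.0, 8.0):
        segments = int(length / 2)  # keep each segment 2 units long
        print(length, wire_delay(r, c, length),
              buffered_delay(r, c, length, segments, t_buf))
```

With the segment length held constant, the buffered delay grows in proportion to the number of segments, while the unbuffered delay grows with the square of the length.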
Figure 1.7: Delay for Local and Global Wiring versus Feature Size
Buffer insertion, however, also consumes power and occupies a large amount of
chip area. As shown in Figure 1.8, the power consumed by these delay-optimal
devices and wires will increase as we go further into the DSM era [39].
With the increase in signal frequencies and the corresponding decrease
in signal transition times, the interconnect impedance can behave induc-
tively [48], increasing the on-chip noise. Furthermore, considering inductance
within the design process increases the computational complexity of IC syn-
thesis and analysis tools. However, inductive behavior can also be useful. As
shown in [47], a properly designed inductive line can reduce the total power
dissipated by high-speed clock distribution networks. Clock networks can
dissipate a large portion of the total power dissipated within a synchronous
IC, ranging from 25 percent to 75 percent [11] [10].
Figure 1.8: Power for all repeaters and global interconnect where 50 percent
of all devices are logic
1.5 Influence of Number System on Power
The need for low power design is widely known and needs no elucidation.
In the DSM regime it is not just the device power that needs to be accounted
for but also the interconnect power. In designing VLSI signal processing ar-
chitectures especially for battery powered devices, the designer needs to take
the issue of low power very seriously. Power reduction can be looked at from
different angles. The most obvious ones are by decreasing the supply voltage,
by reducing the parasitic capacitance and by reducing the switching activity.
The choice of algorithm is the most highly leveraged decision in meeting the
power constraints. The ability for an algorithm to be parallelized is critical
and the basic complexity of the computation must be highly optimized. Shut-
ting down unused arithmetic blocks is also attractive in some applications.
But the most logical way of reducing power is by minimizing the number of
operations. This is normally done by neglecting multiplications or divisions
by 1s or js (in case of complex numbers), additions or subtractions with 0s
etc. If we can have a number representation that reduces the number of
operations our job will be much easier.
Many number systems are used in application specific integrated cir-
cuit (ASIC) design but the most important among them are binary number
system (BNS), residue number system (RNS) and sign/log number system
(LNS). The use of the RNS allows the decomposition of a given dynamic
range in slices of smaller range on which the computation can be efficiently
implemented in parallel. The power dissipation is reduced by taking advan-
tage of the speed-up due to the parallelism of the RNS structure. Arithmetic
operations like addition, subtraction and multiplication can be done much
faster and with fewer operations compared to binary. The typical drawback
presented by the RNS is related to the input-output conversion from binary
to RNS and vice versa. Many new efficient conversion techniques are being
proposed to tackle this problem. The logarithmic number system (LNS), on the
other hand, requires relatively less conversion overhead. With multiplications
becoming additions and divisions becoming subtractions, an LNS implementation
becomes efficient after a moderate number of multiply-add operations.
Table 1.2 gives a good measure of the reduced transitions using LNS for
multiplication and division.

| Bits | Adder | Multiplier (Wallace/Dadda) | Divider (SRT Radix 4) | Multiplier Times Down | Divider Times Down |
|---|---|---|---|---|---|
| 16 | 90 | 573 | 3757 | 6.37 | 41.74 |
| 32 | 182 | 3874 | 7293 | 21.29 | 40.07 |
| 64 | 366 | 9548 | 14365 | 53.41 | 32.24 |

Table 1.2: Average Logic Transitions in Multiplication and Division using LNS
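The two ideas described above, channel-wise arithmetic in RNS and multiplication by adding logarithms in LNS, can be sketched as follows. This is only an illustration: the moduli set `MODULI`, the sign/magnitude tuple used for LNS values, and all function names are assumptions made here, not the representations used later in the thesis.

```python
import math

MODULI = (7, 11, 13)  # hypothetical pairwise-coprime moduli, dynamic range 7*11*13 = 1001


def to_rns(x):
    """Binary-to-RNS conversion: one small residue per modulus."""
    return tuple(x % m for m in MODULI)


def rns_add(a, b):
    """Each residue channel is added independently (no carries between channels)."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))


def rns_mul(a, b):
    """Each residue channel is multiplied independently."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


def from_rns(residues):
    """RNS-to-binary conversion with the Chinese Remainder Theorem."""
    big_m = math.prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        m_i = big_m // m
        total += r * m_i * pow(m_i, -1, m)  # modular inverse, Python 3.8+
    return total % big_m


def to_lns(x):
    """Sign/log representation of a nonzero real number (zero is not handled here)."""
    return (1 if x > 0 else -1, math.log2(abs(x)))


def lns_mul(a, b):
    """Multiplication becomes an addition of logarithms."""
    return (a[0] * b[0], a[1] + b[1])


def lns_div(a, b):
    """Division becomes a subtraction of logarithms."""
    return (a[0] * b[0], a[1] - b[1])


def from_lns(v):
    return v[0] * 2.0 ** v[1]
```

The conversion functions (`to_rns`, `from_rns`, `to_lns`, `from_lns`) are where the overhead mentioned in the text sits; the arithmetic functions themselves are short and independent per channel.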
RNS and LNS reduce the number of transitions only for certain operations.
For example, RNS is ill-suited for division, among other operations, and LNS is
not suitable for addition and subtraction. These disadvantages mean that
RNS is used only in applications having few divisions and LNS
in applications having few additions and subtractions. A single
number system which reduces the number of transitions for all the basic
arithmetic operations needs to be evolved. Such a number system can be of
great use in a variety of applications irrespective of the operations involved.
The concept of mixed number system put forth in this thesis is an effort in
this direction.
1.6 Contribution of the Thesis
The impact of DSM technology on FFT and DFT architectures is dealt
with. In this connection a detailed analysis of interconnect complexity for
various arithmetic blocks commonly used in DSP kernels is carried out.
The DSM issues can also be tackled by resorting to a different number
representation which can reduce the interconnect complexity and power and achieve
increased speed. In this direction a new mixed number representation is put
forth and is extensively analyzed by applying it to the DFT, FFT and Gabor
transform.
1.6.1 FFT Vs DFT Interconnect Analysis
The importance of interconnects cannot be ignored under DSM. There is a
need to evolve architectures in which the interconnect is less dominant. The
analysis presented in the thesis explicitly shows the interconnect dominance
in FFT. It is noted that the power-delay product (hardware and
interconnect) increases with increase in radix. Our analysis shows that DFT rather
than FFT will be more suitable for multi-GHz DSM architecture
implementation.
1.6.2 Proposed Mixed Number Representation
A novel number representation called the Logarithmic Residue Number
System (LRNS) has been designed. This achieves higher speed and lower power
consumption than binary and RNS and improved accuracy over LNS for
arithmetic operations. The LRNS scheme can be embedded in a number of
DSP architectures performing both frequency transforms and time-frequency
transforms like Gabor transform and also in DSP filters. Employing LRNS
scheme in Gabor transform drastically reduces the computational complexity.
Further, based on this mixed number system an arithmetic processor
meant for low power, high performance DSP applications is designed. A
Verilog simulation of the instruction set of this arithmetic processor is in-
cluded.
Chapter 2
Interconnect Complexity Power
and Delay of Arithmetic units
With rapid developments in VLSI technology and Computer Aided Design
(CAD) techniques, the ever increasing quest for high performance is placing
demands on interconnect performance and highlights the previously negligi-
ble effects of interconnects [2]. Hence, it is necessary that the interconnect
complexity of any circuit be analyzed, as now under DSM technology, the
interconnects contribute to most of the chip power.
In this chapter, the interconnect complexity of basic AUs like adders and
multipliers is modelled from a broader perspective. The hardware complexity,
delay and power analysis of these arithmetic units is also
performed. Graphs depicting the low level (transistor power) and high level
characterization (active area, interconnect area, total area and power-delay
product) of the arithmetic units, viz. adders and multipliers, are also presented
here.
The interconnect and hardware complexity can be modelled at different
levels namely algorithm level, architecture level, functional level, gate level,
circuit level and layout level. Interconnect complexity modelled at the gate
level is denoted as the $\ell_2$ level interconnect complexity. The functional level
or the $\ell_1$ level can be further classified as the full adder level or the register level.
This means considering the adder or register count as far as the hardware
complexity is concerned and using the interconnect count between them as
a measure of interconnect complexity. Interconnects across successive levels
are taken to be of unit length for the analysis. Unit delays are assumed for
such interconnects.
2.1 Interconnect Complexity of Basic Functional Blocks
Figure 2.1: Low-level Characterization of Adders
Figure 2.2: High-level Characterization of Adders-Area
Figure 2.3: High-level Characterization of Adders-Power-Delay Product
Figure 2.4: Characterization of Multipliers
Figure 2.5: Characterization of Multipliers-High level
In this section the interconnect complexity of some of the basic functional
blocks, viz. the full adder and the dot operator (Brent-Kung) [34], is dealt with in
detail. Interconnect complexity is calculated based on the interconnect count
$\lambda(m_{i,f}, n_{i,f})$ and the maximum interconnect length (MIL) as shown in
Figure 2.6. $\lambda(m, n)$ is taken as $m$ horizontal interconnects and $n$ vertical
interconnects, or a total of $m + n$ interconnects each of unit length. The general
equation for the number of interconnects is given by
\[ \text{No. of interconnects } (IC) = \sum_{i=1}^{\text{No. of Gates}} \; \sum_{f=1}^{\text{No. of Fanouts}} \lambda\left(\Sigma m_{i,f}, \Sigma n_{i,f}\right) \tag{2.1} \]
where $m$ and $n$ are measures of horizontal and vertical interconnects respectively,
expressed as multiples of unit length.
The interconnect count is found to be the total sum of the $m_{i,f}$ and $n_{i,f}$ values.
In most cases it can be directly interpreted from the gate level logic diagram.
Further,
f > 1 Multiple outputs per gate
f = 1 Single output per gate
E.g., for Figure 2.6 the interconnect count is given as
\[ IC = \lambda(2, 0) + \lambda(1, 1) + \lambda(1, 1) \tag{2.2} \]
The interconnect count for Figure 2.6 is thus calculated as 6.
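The counting rule of equations (2.1) and (2.2) can be sketched in a few lines. The netlist encoding used here (a list of gates, each holding a list of `(m, n)` fan-out branches) is an assumption made for illustration, and grouping the three terms of Figure 2.6 into three single-fan-out gates is likewise assumed, since the figure itself is not reproduced.

```python
def lam(m, n):
    """lambda(m, n): m horizontal plus n vertical interconnects of unit length."""
    return m + n


def interconnect_count(gates):
    """Eq. (2.1): sum lambda over every fan-out branch of every gate.

    gates is a list of gates; each gate is a list of (m, n) fan-out branches.
    """
    return sum(lam(m, n) for fanouts in gates for (m, n) in fanouts)


# The three terms of Eq. (2.2); the assignment of terms to gates is assumed.
FIGURE_2_6 = [[(2, 0)], [(1, 1)], [(1, 1)]]

if __name__ == "__main__":
    print(interconnect_count(FIGURE_2_6))
```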
The total interconnect delay is the sum of the delays caused by the
interconnects of maximum length in each stage.
\[ T_{D,I} \text{ or } MIL = \sum_{s=1}^{\text{Total No. of Stages}} MIL_s \tag{2.3} \]
where $MIL_s$ is the maximum interconnect length due to horizontal and
vertical interconnects causing the maximum delay in stage $s$, expressed
as multiples of unit length. These measures are calculated as a sum of the
horizontal and vertical interconnects in each stage that lead to a maximum
delay.
The interconnect delay product (IDP) is the product of the total delay
due to the interconnects ($T_{D,I}$, also written $T_{ID}$, or MIL) and the interconnect count (IC). This
reflects the Area-Time (AT) product due to the interconnects.
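The power due to the interconnects is given as
\[ \text{Interconnect power } (IP) = \sum_{i=1}^{\text{No. of Gates}} \tau_i \sum_{f=1}^{\text{No. of Fanouts}} \lambda\left(\Sigma m_{i,f}, \Sigma n_{i,f}\right) \tag{2.4} \]
where $\tau_i$ is the average transition count due to the $i$th gate and $f$ is the fan-out
of the $i$th gate.
The interconnect power-delay product (IPD) is given as
\[ IPD = \left( \sum_{i=1}^{\text{No. of Gates}} \tau_i \sum_{f=1}^{\text{No. of Fanouts}} \lambda\left(\Sigma m_{i,f}, \Sigma n_{i,f}\right) \right) \times T_{D,I} \tag{2.5} \]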
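Equations (2.3) to (2.5) can be evaluated mechanically once each gate's transition count and fan-out branches are known. The sketch below follows the equations literally; the data layout and the numbers in the example netlist are hypothetical and do not correspond to any figure in the thesis.

```python
def interconnect_power(gates):
    """Eq. (2.4): sum over gates of tau_i times the gate's interconnect count.

    gates is a list of (tau_i, [(m, n), ...]) pairs.
    """
    return sum(tau * sum(m + n for m, n in fanouts) for tau, fanouts in gates)


def total_interconnect_delay(stage_mils):
    """Eq. (2.3): sum of the maximum interconnect length of every stage."""
    return sum(stage_mils)


def interconnect_power_delay(gates, stage_mils):
    """Eq. (2.5): interconnect power multiplied by the total interconnect delay."""
    return interconnect_power(gates) * total_interconnect_delay(stage_mils)


if __name__ == "__main__":
    # Hypothetical two-gate, two-stage example.
    gates = [(0.5, [(2, 0)]), (0.25, [(1, 1), (1, 1)])]
    stage_mils = [2, 2]
    print(interconnect_power(gates), interconnect_power_delay(gates, stage_mils))
```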
2.1.1 Full Adder
The Full adder (Figure 2.7) is a circuit that calculates the sum and carry
of three bits. Figure 2.8 shows the NAND and NOT implementation of a full
adder. From the figure the interconnect delay, interconnect complexity and
the power due to the interconnects are calculated as follows.
\[ \text{Interconnect Count } (IC_{FA}) = 116 \tag{2.6} \]
Figure 2.6: Interconnects at Gate Level
Figure 2.7: A full adder
Figure 2.8: Gate implementation of a full adder
The total delay through the interconnects in a full adder is
\[ T_{ID,FA} \text{ or } MIL_{FA} = \lambda(1,1) + \lambda(6,6) + \lambda(2,1) = 17 \text{ unit delays} \tag{2.7} \]
The interconnect delay product (IDP) is the product of $MIL_{FA}$ and
$IC_{FA}$.
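The interconnect power calculation is as follows
\[
\begin{aligned}
IP_{FA} ={}& \tau_1(\lambda(1,2)+\lambda(1,3)+\lambda(1,1)) \\
&+ \tau_2(\lambda(3,1)+\lambda(3,3)+\lambda(3,4)+\lambda(3,6)+\lambda(1,1)) \\
&+ \tau_3(\lambda(5,1)+\lambda(5,2)+\lambda(5,5)+\lambda(5,6)+\lambda(1,1)) \\
&+ \tau_4(\lambda(2,3)+\lambda(2,4)+\lambda(2,5)+\lambda(2,6)) + \tau_5(\lambda(4,4)+\lambda(4,6)) + \tau_6(\lambda(6,3)+\lambda(6,6)) \\
&+ \tau_7\lambda(2,1)+\tau_8\lambda(1,0)+\tau_9\lambda(1,0)+\tau_{10}\lambda(2,1)+\tau_{11}\lambda(2,1)+\tau_{12}\lambda(1,0)+\tau_{13}\lambda(2,1)
\end{aligned}
\]
The above equation for the interconnect power can be further reduced to
\[
\begin{aligned}
IP_{FA} ={}& \tau_1(\lambda(1,2)+\lambda(1,3)+\lambda(1,1)) \\
&+ \tau_2(\lambda(3,1)+\lambda(3,3)+\lambda(3,4)+\lambda(3,6)+\lambda(1,1)) \\
&+ \tau_3(\lambda(5,1)+\lambda(5,2)+\lambda(5,5)+\lambda(5,6)+\lambda(1,1)) \\
&+ \tau_{NOT}(\lambda(2,3)+\lambda(2,4)+\lambda(2,5)+\lambda(2,6)+\lambda(4,4)+\lambda(4,6)+\lambda(6,3)+\lambda(6,6)) \\
&+ \tau_{NAND4}(\lambda(2,1)+\lambda(1,0)+\lambda(1,0)+\lambda(2,1)) + \tau_{NAND3}(\lambda(2,1)+\lambda(1,0)+\lambda(2,1))
\end{aligned}
\]
The interconnect power-delay product (IPD) is calculated as
\[
\begin{aligned}
IPD_{FA} ={}& 4\tau_1(\lambda(1,2)+\lambda(1,3)+\lambda(1,1)) \\
&+ 9\tau_2(\lambda(3,1)+\lambda(3,3)+\lambda(3,4)+\lambda(3,6)+\lambda(1,1)) \\
&+ 11\tau_3(\lambda(5,1)+\lambda(5,2)+\lambda(5,5)+\lambda(5,6)+\lambda(1,1)) \\
&+ \tau_{NOT}\bigl(8(\lambda(2,3)+\lambda(2,4)+\lambda(2,5)+\lambda(2,6)) + 10(\lambda(4,4)+\lambda(4,6)) + 12(\lambda(6,3)+\lambda(6,6))\bigr) \\
&+ \tau_{NAND4}(3\lambda(2,1)+\lambda(1,0)+\lambda(1,0)+3\lambda(2,1)) + \tau_{NAND3}(3\lambda(2,1)+\lambda(1,0)+3\lambda(2,1))
\end{aligned}
\]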
2.1.2 Brent-Kung Dot Operator
Figure 2.9 shows the gate implementation of a Brent-Kung dot operator.
From the figure the interconnect delay, interconnect complexity, interconnect
delay product, power due to the interconnects and the interconnect power-
delay product are calculated as follows.
\[ IC_{dot} = 20 \tag{2.8} \]
The total delay through the interconnects in a dot operator is
\[ T_{ID,dot} \text{ or } MIL_{dot} = \lambda(5,2) = 7 \text{ unit delays} \tag{2.9} \]
The IDP for the dot operator is the product of $IC_{dot}$ and $MIL_{dot}$.
The interconnect power calculation is as follows
\[ IP_{dot} = \tau_1\lambda(5,2) + \tau_2(\lambda(3,1)+\lambda(3,0)) + \tau_3\lambda(2,1) + \tau_4\lambda(1,0) + \tau_5\lambda(2,1) \]
The IPD for a dot operator is found to be
\[ IPD_{dot} = 7\tau_1\lambda(5,2) + 4\tau_2(\lambda(3,1)+\lambda(3,0)) + 3\tau_3\lambda(2,1) + \tau_4\lambda(1,0) + 3\tau_5\lambda(2,1) \]
Figure 2.9: Gate implementation of a dot operator
2.2 Adders
Adders are the most common arithmetic units used in general purpose as
well as in DSP systems. Adders are also used to perform subtraction and
they are the basic components in multiplier and divider units. Adders are
chosen depending upon their speed, area, configuration, interconnect complexity
and power. The power-delay product associated with the hardware
and interconnects of select adders, together with the IDP of some of
the adders, is discussed in this section. In analyzing the interconnect and
hardware complexity of the adders, the basic element considered is the full
adder and the complexities are represented as a function of this full adder
complexity.
Figure 2.10: A Serial Adder
2.2.1 Serial Adder
The serial adder calculates the sum and carry at each bit position. Figure
2.10 [7] shows a Serial adder. The basic element in a serial adder is the full
adder shown in Figure 2.7 and Figure 2.8. The interconnect and hardware
complexity of the serial adder includes the full adder and the D-flip flop. A
serial subtractor can be constructed with minor modification of the serial
adder. The interconnect and hardware complexity of the serial subtractor is
about the same as that of a serial adder. The delay for an n-bit serial adder
is
\[ T_{D,ADD} = (n-1)(T_{D,FA,ic} + T_{D,FF}) + T_{D,FA,is} \tag{2.10} \]
where,
$T_{D,FA,is}$ is the delay between the input and sum output
$T_{D,FA,ic}$ is the delay between the input and carry output
$T_{D,FF}$ is the delay due to the D-flip flop
Assuming that $MIL_{FA}$ is the maximum interconnect length complexity of
a full adder at the gate level, $T_{ID,FA}$ is the delay due to the interconnects
within the full adder at the gate level, $MIL_{FF}$ is the maximum flip-flop
interconnect length complexity and $T_{ID,FF}$ is the flip-flop interconnect delay,
we have the interconnect delay product (IDP) of the serial adder at the gate
level ($\ell_2$ level) as
\[ IDP_{\ell_2} = IC_{FA} \times n\,T_{ID,FA} + IC_{FF} \times (n-1)\,T_{ID,FF} \tag{2.11} \]
Now the IDP for the serial adder at the full adder level ($\ell_1$) is given below
\[ IDP_{\ell_1} = 2n - 1 \tag{2.12} \]
The hardware power-delay product (HPD) for the serial adder for n-bit
addition is given by
\[ HPD = (n \times P_{FA} + (n-1) \times P_{FF}) \times T_{D,ADD} \tag{2.13} \]
where $P_{FA}$ is the power within the full adder and
$P_{FF}$ is the power within the flip-flop.
The interconnect power-delay product (IPD) of the serial adder for n-bit
addition is calculated from a broader perspective, namely at the full adder and
flip-flop output level ($\ell_1$ level), and is given by the following equation
\[ IPD_{\ell_1} = P_{O,FA} \times n + P_{O,FF} \times (n-1) \tag{2.14} \]
where $P_{O,FA}$ is the FA output line power and
$P_{O,FF}$ is the FF output line power.
The IPD is calculated only at the $\ell_1$ level throughout this chapter.
Figure 2.11: A ripple carry adder
2.2.2 Ripple Carry Adder
An n-bit ripple carry adder is the simplest parallel adder, constructed by
cascading n full adders as shown in Figure 2.11. In a ripple carry adder the
carry output of a full adder is connected to the input line of the next full
adder. The hardware complexity of the ripple carry adder is proportional to
n. The worst case delay is also proportional to n.
The delay of an n-bit ripple carry adder is
\[ T_{D,ADD} = n\,T_{D,ic} \tag{2.15} \]
where $T_{D,ic}$ is the input-to-carry delay of a full adder (written $T_{D,FA,ic}$ in
the previous section); the other notations are the same as those in the previous section.
The interconnect complexity of the ripple adder at the $\ell_2$ level (i.e., the gate
level) is $n \times IC_{FA}$.
The IDP at the gate level is
\[ IDP_{\ell_2} = IC_{FA} \times T_{ID,FA} \times n \tag{2.16} \]
The $\ell_1$ level interconnect complexity excluding the I/O interconnects
($IC_{\ell_1}$) is
\[ IC_{\ell_1} = n - 1 \tag{2.17} \]
Now the IDP at the full adder level ($\ell_1$ level) is given below
\[ IDP_{\ell_1} = (n-1)^{2} \tag{2.18} \]
The HPD is calculated for an n-bit ripple adder and found to be
\[ HPD = n \times P_{FA} \times T_{D,ADD} \tag{2.19} \]
The IPD calculated at the $\ell_1$ level is
\[ IPD_{\ell_1} = P_{O,FA} \times (n-1) \tag{2.20} \]
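The closed-form interconnect measures for the serial adder (equations 2.11 and 2.12) and the ripple carry adder (equations 2.16 to 2.18) can be tabulated with a short sketch. The full adder values 116 and 17 come from equations (2.6) and (2.7); the flip-flop values passed in the example are hypothetical placeholders, since the text gives no numbers for them, and equation (2.18) is read here as $(n-1)^2$.

```python
IC_FA = 116    # interconnect count of a full adder, Eq. (2.6)
T_ID_FA = 17   # interconnect delay of a full adder in unit delays, Eq. (2.7)


def serial_adder_idp(n, ic_ff, t_id_ff, ic_fa=IC_FA, t_id_fa=T_ID_FA):
    """Interconnect delay products of an n-bit serial adder."""
    return {
        "IDP_l2": ic_fa * n * t_id_fa + ic_ff * (n - 1) * t_id_ff,  # Eq. (2.11)
        "IDP_l1": 2 * n - 1,                                        # Eq. (2.12)
    }


def ripple_carry_adder(n, ic_fa=IC_FA, t_id_fa=T_ID_FA):
    """Interconnect measures of an n-bit ripple carry adder."""
    return {
        "IC_l2": n * ic_fa,             # gate level interconnect complexity
        "IDP_l2": ic_fa * t_id_fa * n,  # Eq. (2.16)
        "IC_l1": n - 1,                 # Eq. (2.17)
        "IDP_l1": (n - 1) ** 2,         # Eq. (2.18), read as a square
    }


if __name__ == "__main__":
    # ic_ff=10 and t_id_ff=3 are made-up flip-flop values for illustration only.
    print(serial_adder_idp(8, ic_ff=10, t_id_ff=3))
    print(ripple_carry_adder(8))
```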
Figure 2.12: Carry Block of a Brent Kung CLA(n=8)
2.2.3 Brent-Kung Carry Lookahead Adder
Through a structured CLA (Brent-Kung CLA) [34], shown in Figure 2.12,
a large reduction of the delay in obtaining the carry bits can be achieved. The
Brent-Kung CLA is actually a logarithmic adder which has a binary tree
and an inverse binary tree for its carry generation block. The binary tree
structure produces $k = 0, 1, \ldots, \log_2 n$ carry block functions for all powers of
two, $2^{k} - 1 < n$. The block carry functions for bit levels that lie between the
powers of two are evaluated with an inverse tree.
The hardware complexity of the Brent-Kung structure in terms of the
dot operator function is $(n - 1) + (n - 1 - \log_2 n)$. The value within the
first pair of parentheses is for the binary tree and that within the second is for the inverse
tree. Note that the inverse binary tree has fewer operators than the binary
tree. This clearly reflects the hardware complexity of the Brent-Kung carry
generation block at the dot operator function level.
The total hardware complexity is equal to the hardware complexity of
the Brent-Kung structure plus that of the sum generation part.
From here the analysis is done only for the Brent-Kung structure, i.e., only
for the carry generation part; the analysis for the whole adder can be easily
extended from it. The time-critical path runs from $(g_0, p_0)$ via $(G_{n/2-1}, P_{n/2-1})$ and
$(G_{n-2}, P_{n-2})$ to the output $C_{n-1}$. This path contains the most operators
connected in series. A total of $2\log(n/2)$ operators are encountered along
this path. When computing the circuit delay, it should be noted that the
capacitive loads are not identical, but rather increase towards the middle.
The delay induced along the time-critical path can be expressed as
\[ T_{D,ADD} = T_{D,\,g_0 - G_{n-2}} = \sum_{j=1}^{\log(n/2)} \left[ T_{D,OP}(j+2) + T_{D,OP}(j) \right] \tag{2.21} \]
where TD,OP is the operator delay. The two summands represent the two bi-
nary trees and the arguments of TD,OP specify the variable load capacitances.
Next, the interconnect complexity of the Brent-Kung structure is analyzed.
The interconnect complexity is analyzed at the dot function level,
which is the $\ell_1$ level in this case.
The interconnect count in the Brent-Kung structure is given below
\[ IC_{\ell_1} = n \times (2 \times \log_2 n - 1) + (n-1) + (n-1-\log_2 n) + 2 \times n + n \tag{2.22} \]
The interconnect complexity of the Brent-Kung structure at the gate level
is given by
\[ IC_{\ell_2} = ((n-1) + (n-1-\log_2 n)) \times IC_{dot} \tag{2.23} \]
As already noted, $IC_{dot}$ is the interconnect complexity and $T_{ID,dot}$ is the
interconnect delay within a dot function. The interconnect
delay product at the gate level is
\[ IDP_{\ell_2} = (2\log_2 n - 1) \times IC_{dot} \times T_{ID,dot} \tag{2.24} \]
Now the interconnect delay product at the $\ell_1$ level, i.e., at the dot function
level, assuming that each interconnect offers unit delay, is given by
\[ IDP_{\ell_1} = (2\log_2 n - 2) \tag{2.25} \]
The hardware power-delay product for the Brent-Kung structure is given
by
\[ HPD = ((n-1) + (n-1-\log_2 n)) \times P_{dot} \times T_{D,ADD} \tag{2.26} \]
where $P_{dot}$ is the power due to a dot function.
The interconnect power-delay product calculated at the $\ell_1$ level is
\[ IPD_{\ell_1} = P_{O,dot} \times IC_{\ell_1} \tag{2.27} \]
where $P_{O,dot}$ is the power at the output of the dot function.
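As a quick numerical check of equations (2.22) to (2.25), the sketch below evaluates the Brent-Kung carry block measures for a power-of-two wordlength. The defaults 20 and 7 are the dot operator values of equations (2.8) and (2.9); the function name and the returned dictionary keys are illustrative choices, and the restriction to powers of two is an assumption made so that $\log_2 n$ is an integer.

```python
def brent_kung_carry_block(n, ic_dot=20, t_id_dot=7):
    """Hardware and interconnect measures of an n-bit Brent-Kung carry block."""
    k = n.bit_length() - 1  # log2(n) for a power of two
    if n < 2 or 2 ** k != n:
        raise ValueError("n must be a power of two")
    dot_operators = (n - 1) + (n - 1 - k)  # binary tree + inverse tree
    return {
        "dot_operators": dot_operators,
        "IC_l1": n * (2 * k - 1) + (n - 1) + (n - 1 - k) + 2 * n + n,  # Eq. (2.22)
        "IC_l2": dot_operators * ic_dot,                               # Eq. (2.23)
        "IDP_l2": (2 * k - 1) * ic_dot * t_id_dot,                     # Eq. (2.24)
        "IDP_l1": 2 * k - 2,                                           # Eq. (2.25)
    }


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        print(n, brent_kung_carry_block(n))
```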
2.2.4 Carry Save Adder
The adders dealt with up to this point are grouped under the term carry
propagate adders (CPA). In these adders the carry signal is first evaluated to
determine the sum. For two operand addition these adders are efficient. But
for multiple operand addition, as encountered in some DSP applications like
the inner product calculation in case of a DFT, these adders are not very
efficient. For such addition the most viable alternative would be to use carry
save adders (CSA) [32]. In a CSA the carry signal is not used for the current
addition, but rather for its successor.
Figure 2.13 shows an n-bit CSA for adding three operands. Here n full adders
without interconnects among them are used to merge the three operands X, Y
and Z into n sum bits $s_i$, which constitute the sum vector S, and n carry bits
$c_{i+1}$, which constitute the carry vector C. The carry vector is actually shifted
by one position. If we look into the representation of the sum and carry of
the CSA we find that their representation is redundant, as only n + 2 bits
are needed for representing the sum of three operands whereas 2n bits are
employed here. The result is complete only after the sum and the carry bits
are added in a CPA. Such an adder is popular because of its highly
structured and faster operation.
Figure 2.13: Carry Save Adder
A CSA which adds three operands is called a 3 × 2 CSA. In general we
can use any m× 2 CSA.
In case of multiple operand addition, CSAs are used until only a sum and
a carry vector remain. These two are then merged in a CPA to get the final
sum. Figure 2.14 shows a CSA tree for eight operands. The horizontal arrows
across every carry bit indicate the shifting of the carry bits to the left by one
position at every stage. The final adder at the last stage is a CPA.
The delay of an n-bit CSA is just the delay of a full adder. The delay of
a CSA tree for adding N operands depends on the number of stages in that
tree plus the delay of the final CPA stage. The delay of the CPA stage has
already been dealt with in detail in the previous sections.
The interconnect complexity, hardware complexity and the delay of an
m × 2 CSA tree for adding N operands are presented below.
Figure 2.14: Carry Save Adder Tree for eight operands
Notations followed throughout this section
N is the number of operands
m is the order of the CSA
n is the wordlength of the input operands
S is the number of CSAs in each stage
TCSA is the total number of CSAs in the tree, found by adding the number of CSAs
in each stage
d is the number of stages
$T_{D,ADD}$ is the delay of the adder tree
$T_{D,CPA}$ is the delay of the CPA
Algorithm to find the total number of interconnects and hardware
complexity in a CSA structure
Start with TCSA = 0 and d = 0
Until (N − N mod m)/m = 0, repeat
    S = (N − N mod m)/m
    TCSA = TCSA + S
    N = 2[(N − N mod m)/m] + N mod m
    d = d + 1
The update of N above is the way N decreases at every stage (N is the number of
inputs at each stage).
Total number of CSAs = TCSA
Total number of interconnects at the $\ell_1$ level, $IC_{\ell_1} = (m \times TCSA \times n) + 2 \times n$
Interconnect complexity at the $\ell_2$ level, $IC_{\ell_2} = TCSA \times n \times IC_{FA}$
Delay due to the CSA tree part alone, $T_{D,CSA} = T_{D,FA} \times d$
Total delay of the adder tree, $T_{D,ADD} = T_{D,CSA} + T_{D,CPA}$
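The stage-by-stage algorithm above translates directly into a small loop. The sketch below is one possible rendering, assuming the counters start at zero and that m is at least 3 (for m = 2 the operand count would never shrink); the function names are illustrative, and the default `ic_fa=116` is the full adder interconnect count of equation (2.6).

```python
def csa_tree(num_operands, m=3):
    """Return (TCSA, d): total number of m x 2 CSAs and number of stages."""
    if m < 3:
        raise ValueError("an m x 2 CSA needs m >= 3 to reduce the operand count")
    remaining = num_operands
    total_csas = 0
    stages = 0
    while remaining // m > 0:
        csas_in_stage = remaining // m                 # S = (N - N mod m) / m
        total_csas += csas_in_stage                    # TCSA = TCSA + S
        remaining = 2 * csas_in_stage + remaining % m  # inputs of the next stage
        stages += 1                                    # d = d + 1
    return total_csas, stages


def csa_tree_interconnects(num_operands, wordlength, m=3, ic_fa=116):
    """Interconnect counts of the CSA tree at the l1 and l2 levels."""
    total_csas, stages = csa_tree(num_operands, m)
    return {
        "TCSA": total_csas,
        "d": stages,
        "IC_l1": m * total_csas * wordlength + 2 * wordlength,
        "IC_l2": total_csas * wordlength * ic_fa,
    }


if __name__ == "__main__":
    print(csa_tree(8))                    # eight operands, 3 x 2 CSAs
    print(csa_tree_interconnects(8, 16))
```

For eight operands and 3 × 2 CSAs the loop visits operand counts 8, 6, 4, 3 and stops at 2, which is the point where the final CPA takes over.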
The interconnect delay product, power delay product and interconnect
power delay product are calculated for the adder tree part excluding the CPA
stage. The delay products for the CPA have already been dealt with thoroughly in
the previous sections.
The interconnect delay product at the $\ell_1$ level is
\[ IDP_{\ell_1} = d \tag{2.28} \]
The interconnect delay product at the $\ell_2$ level is
\[ IDP_{\ell_2} = T_{ID,FA} \times d \tag{2.29} \]
The hardware power delay product of the CSA tree is found to be
\[ HPD = TCSA \times n \times P_{FA} \times T_{D,CSA} \tag{2.30} \]
The interconnect power delay product of the CSA tree at the $\ell_1$ level is
\[ IPD_{\ell_1} = P_{O,FA} \times d \tag{2.31} \]
2.3 Multipliers
Besides addition, multiplication is a heavily used core operation in signal
processing. In many DSP applications where high throughput is the
prime concern, fast multipliers are required. Such fast multiplier configurations
have high interconnect complexity apart from higher hardware
complexity, and hence higher hardware cost and power. In the following
sections, the interconnect complexity, hardware complexity, delay and power
aspects of various multiplier configurations are presented.
2.3.1 Parallel Array Multiplier
This is the simplest parallel multiplier [4] in which the multiplier-multiplicand
bits are summed up one by one by means of a series of CSAs. The multiplier
has a two dimensional array structure of full adders as shown in Figure 2.15.
Each row except the final row forms a CSA and the final row forms a CPA
or a ripple carry adder.
The hardware complexity (number of logic gates) of the array multiplier
is proportional to $n^{2}$, where n is assumed to be the wordlength of the inputs
to the multiplier.
The interconnect complexity of the array multiplier at the $\ell_1$ level (full
adder level) is found to be equal to $3(n-1)^{2} + 3n$.
The interconnect complexity of the array multiplier at the gate level,
namely the $\ell_2$ level, is
\[ IC_{\ell_2} = n^{2} \times IC_{FA} \tag{2.32} \]
The delay of the array multiplier is proportional to n.
\[ T_{D,MUL} = n \times T_{D,FA} \tag{2.33} \]
The interconnect delay product at the full adder level is
\[ IDP_{\ell_1} = n \tag{2.34} \]
The interconnect delay product at the gate level is
\[ IDP_{\ell_2} = IC_{FA} \times T_{ID,FA} \times n \tag{2.35} \]
The hardware power delay product of the array multiplier is found to be
\[ HPD = n^{2} \times P_{FA} \times T_{D,MUL} \tag{2.36} \]
The interconnect power delay product of the array multiplier at the $\ell_1$
level is
\[ IPD_{\ell_1} = P_{O,FA} \times n \tag{2.37} \]
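The array multiplier expressions can be evaluated for a given wordlength as in the sketch below. It reads the flattened exponents as $n^2$ and $3(n-1)^2 + 3n$, and uses the full adder values 116 and 17 from equations (2.6) and (2.7); the function name and dictionary keys are illustrative.

```python
def array_multiplier(n, ic_fa=116, t_id_fa=17):
    """Interconnect measures of an n x n parallel array multiplier."""
    return {
        "IC_l1": 3 * (n - 1) ** 2 + 3 * n,  # full adder level interconnect count
        "IC_l2": n ** 2 * ic_fa,            # Eq. (2.32)
        "IDP_l1": n,                        # Eq. (2.34)
        "IDP_l2": ic_fa * t_id_fa * n,      # Eq. (2.35)
    }


if __name__ == "__main__":
    for n in (8, 16, 32):
        print(n, array_multiplier(n))
```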
2.3.2 Wallace Tree Multiplier
A further alternative to the multiplier implementation is the Wallace Tree
Multiplier [43] in which the partial products are evaluated first and these are
added in a CSA tree. A CSA sums up three binary numbers and produces
two binary numbers (i.e., a partial sum and a partial carry). Therefore, using
n/3 CSAs in parallel, we can reduce the number of multiplicand-multiples
from n to about 2n/3. Then, using about 2n/9 CSAs, we can further reduce
it to 4n/9. Applying this principle repeatedly, the number of multiplicand-multiples
can be reduced to only two. As seen in Section 2.2.4 the final
multiplicand-multiples (the final carry and sum vectors) can be added in a
CPA. Figure 2.16 shows the block diagram of a multiplier based on the Wallace
Tree. The delay is small and is proportional to log n when a fast CPA with
O(log n) delay is used for the final addition. The number of logic gates is
about the same as that of an array multiplier and is proportional to $n^{2}$. The
hardware complexity, interconnect complexity, delay and power analysis are
similar to that carried out for the CSA tree presented in Subsection 2.2.4.
Figure 2.15: Parallel Array Multiplier
The CSAs used in the Wallace Tree need not necessarily be of the
3 × 2 type but can in general be any CSA of the order m × 2 depending upon
the suitability for VLSI realization. Note that as m increases the delay of
the m × 2 adder tree increases. The interconnect, hardware, delay and power
analysis provided for the CSA tree in Subsection 2.2.4 are for the general
m × 2 case.
Figure 2.16: Multiplier Based on Wallace Tree
Chapter 3
DFT: Power-Delay Analysis
DFT is a fundamental operation in signal processing. The design of VLSI
architectures for DFT undergoes major changes when DSM technological
implementations are considered. Based on the DSM issues presented in the
earlier chapters there is a strong need to analyze the interconnect, hardware,
power and delay complexities of the DFT architectures.
3.1 Interconnect and Hardware Complexity
In this section, the interconnect and hardware complexity of the DFT
algorithm and the corresponding architecture are presented. This DFT architecture
is realized using PIPP and is discussed in Chapter 1. The interconnect
complexity is modelled at the functional level ($\ell_1$ level). The model is
developed based on the interconnect complexities of the various arithmetic
functional units presented in Chapter 2.
| Input vector partition size | Hardware Count | Interconnect Count |
|---|---|---|
| 4 | 8280 | 2392 |
| 16 | 32856 | 9840 |
| 64 | 131160 | 36304 |
| 256 | 524376 | 134480 |
| 1024 | 2097240 | 530256 |

Table 3.1: Hardware and Interconnect Count for 1024 point DFT

| Input vector partition size | Hardware Count | Interconnect Count |
|---|---|---|
| 4 | 8280 | 2392 |
| 16 | 32856 | 9840 |
| 64 | 131160 | 36304 |
| 256 | 524376 | 134480 |
| 1024 | 2097240 | 530256 |
| 4096 | 8388696 | 2109264 |

Table 3.2: Hardware and Interconnect Count for 4096 point DFT
The interconnect and hardware complexity depend on the partitioning
of the input vectors (refer to Figure 1.1). It is obvious that the number of
interconnects increases with the input partitioning. On the other hand it
will be seen in the next section that the interconnect power-delay product
variation and the hardware power-delay product variation do not follow the
same trend. The interconnect and hardware count of the DFT architecture
for different input partitioning are shown in Tables 3.1 and 3.2 for 1024 and
4096 sample points.
3.2 Hardware Power-Delay Analysis
Power-delay product is an important parameter for an architecture. The
power consumed by a digital system is governed by the following equation.
\[ P = f C V^{2} \tag{3.1} \]
where,
$f$ is the average number of $0 \to 1$ transitions
$C$ is the capacitive load
$V$ is the supply voltage
As seen from the power equation (3.1), for a specific supply voltage and
a given capacitive load, the power consumption is decided by the average
number of transitions. However, the capacitive load is heavily dependent on
the interconnect complexity. The power consumed by an individual transition
could be in the sub micro-Watt range, depending on the technology.
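Equation (3.1) can be exercised with a very small sketch that shows how the three factors scale the power. The function name and the sample values are illustrative assumptions (normalized units), not figures from the thesis.

```python
def dynamic_power(transitions, capacitance, voltage):
    """Eq. (3.1): P = f * C * V^2, with f the average number of 0 -> 1 transitions."""
    return transitions * capacitance * voltage ** 2


if __name__ == "__main__":
    base = dynamic_power(2.0, 3.0, 2.0)              # hypothetical operating point
    half_voltage = dynamic_power(2.0, 3.0, 1.0)      # supply voltage halved
    half_activity = dynamic_power(1.0, 3.0, 2.0)     # transition count halved
    print(base, half_voltage / base, half_activity / base)
```

Halving the supply voltage reduces the result by a factor of four, whereas halving the transition count or the capacitive load reduces it by a factor of two.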
The power delay product analysis presented for DFT architectures in this
section is based on the average transition count at a functional level, like a
full adder or a flip-flop. This power-delay product is classified into intercon-
nect power delay product (IPD), hardware power-delay product (HPD) and
hardware interconnect power-delay product (HIPD) [30].
An analysis of HPD for the DFT architecture is presented below. The
power consumed by the respective hardware units like full adders, flip-flops,
etc. is calculated using the average transition count. We employ a unit delay
model for the gates. Using this, the functional level delay is calculated.

| Input vector partitioning size | IPD | HPD | HIPD |
|---|---|---|---|
| 4 | 1.75E+20 | 9.38E+14 | 6.13E+20 |
| 16 | 4.47E+18 | 3.38E+14 | 2.01E+19 |
| 64 | 1.78E+17 | 1.00E+14 | 9.79E+17 |
| 256 | 9.59E+15 | 2.71E+13 | 5.97E+16 |
| 1024 | 5.86E+14 | 7.62E+12 | 4.02E+15 |

Table 3.3: Power-Delay Product for 1024 point DFT

| Input vector partitioning size | IPD | HPD | HIPD |
|---|---|---|---|
| 4 | 2.40E+17 | 1.13E+23 | 3.97E+23 |
| 16 | 8.65E+16 | 2.06E+22 | 9.26E+22 |
| 64 | 2.56E+16 | 7.29E+20 | 4.01E+21 |
| 256 | 6.93E+15 | 3.91E+19 | 2.44E+20 |
| 1024 | 1.95E+15 | 2.34E+18 | 1.64E+19 |
| 4096 | 5.66E+14 | 1.46E+17 | 1.16E+18 |

Table 3.4: Power-Delay Product for 4096 point DFT
The analysis of hardware power and hardware delay for the PIPP architecture
for DFT with different input partitioning was performed. The results
are presented in Tables 3.3 and 3.4. From the tables it is clear that the HPD
decreases with the input partitioning for a specific number of sample points
and increases with the number of sample points.
3.3 Interconnect Power-Delay Analysis
In DSM technology, with the deep scaling down of the devices, more than
the hardware count it is the interconnects that will decide the overall perfor-
mance. Hence, it is necessary to calculate the IPD for any architecture. Like
the hardware power-delay product, the interconnect power delay-product is
calculated from the interconnect power and the interconnect delay. The delay
imposed by the global interconnects (without buffers) is more than an
order of magnitude over the device delay and this is predicted to increase by
several orders of magnitude beyond the 90 nm technology. This is due to the
interconnect capacitance becoming larger than the gate capacitance.
The vast increase in the number of interconnects has made the intercon-
nect capacitance the cause of most of the on chip power dissipation. The
power consumed by the interconnects is calculated using the average number
of transitions of the gates driving the interconnects. With reference to (3.1)
the capacitive load is due to the interconnect complexity, length and fan out.
The delay due to the interconnects is calculated on the basis of unit delay
and unit length as defined in Chapter 2. From these the IPD is determined.
Tables 3.3 and 3.4 show the IPD values for different sample points.
Interpretation of Results
A study of Table 3.3 for 1024 point DFT is presented below. For this case
the IPD is found to be more dominant than the HPD and hence the value
of HIPD is primarily influenced by the IPD. The IPD and hence the HIPD
decreases with the input partitioning for a specific number of sample points.
In Table 3.4, for 4096 point DFT the HPD is found to be more dominant
than the IPD and hence the overall HIPD is influenced by the HPD. The
HPD, IPD and hence the HIPD decrease with the input partitioning for a
specific number of sample points.
It can be interpreted from the tables that up to the 1024 point DFT the IPD
seems to dominate over the HPD, whereas beyond 1024 sample points the HPD
is found to have higher values.
Chapter 4
FFT:Power-Delay Analysis
The FFT has greatly reduced computational complexity over DFT. Under
the micron technology this had great impact in reducing the chip area and
power with increased performance. However, the merits of FFT are lost
under DSM technology. This is due to the dominance of interconnect in
FFT architectures. This chapter focusses on the power and delay aspects
due to interconnect and hardware complexity.
4.1 Interconnect and Hardware Complexity
In this section, the interconnect complexity is presented considering the
characteristics of FFT algorithm and the wordlength variation of the partial
results flowing across the butterfly computing stages. At the architecture
level, a detailed analysis of interconnect complexity within and across the
computing element (butterfly) and delay commutator stages is presented.
Further, the hardware complexity analysis for the computing element and
delay commutator is presented.
4.1.1 FFT Algorithm
The interconnect complexity (count) for an $r^j$ point 1-D FFT across the
butterfly computing stages is given by the following recursive formula.
\[
\sum_{n_j=0}^{r-1} W_{r^j}^{n_j(k_1+\dots+r^{j-2}k_{j-1})}
\Bigg[\sum_{n_{j-1}=0}^{r-1} W_{r^{j-1}}^{n_{j-1}(k_1+\dots+r^{j-2}k_{j-1})} \cdots
\Bigg[\sum_{n_2=0}^{r-1} W_{r^2}^{n_2 k_1}
\Bigg[\sum_{n_1=0}^{r-1} X(r^{j-1}n_1+\dots+n_j)\, W_r^{n_1 k_1}\Bigg]
W_r^{n_2 k_2}\Bigg] \cdots W_r^{n_{j-1}k_{j-1}}\Bigg] W_r^{n_j k_j}
\tag{4.1}
\]
where,
$X(r^{j-1}n_1+\dots+rn_{j-1}+n_j)$ is the word length of the input sample point,
the $W$s are the word lengths of the respective twiddle factors,
$r$ is the radix,
$n_1, n_2, n_3, \dots, n_j \in \{0, 1, \dots, r-1\}$,
$k_1, k_2, k_3, \dots, k_j \in \{0, 1, \dots, r-1\}$ and
$j$ is the number of stages.
Figure 4.1 shows the total number of interconnects across the stages for
different sample points. An important inference from the graph is that higher
radix proves to be more efficient than lower radix from the interconnect point
of view. This is in spite of the fact that the hardware complexity within the
CE increases with the radix.
Figure 4.1: Algorithm level interconnect complexity of FFT
With increase in number of sample points, the interconnect complexity
varies sharply for lower radix values. This is evident from the slope of the
curve for radix-2. On the other hand, as the radix value increases the slope
decreases.
Figure 4.2: Radix 4 FFT architecture for 256 sample points
4.1.2 FFT Architecture
The basic FFT can be calculated using different radices viz. radix 2,4,8,16
etc. Radix-4 FFT architecture is shown in Figure 4.2. This architecture
has interleaved computational elements (CE) (refer Figure 4.3) and delay
commutators (DC) (refer Figure 4.4). In radix- r pipelined architecture, the
CE performs r-point butterfly computation. Reordering of the input data
stream to the next CE is performed in the DC.
Computational Element
The architecture of the CE is shown in Figure 4.3. This CE performs
radix-4 butterfly operation. In order to achieve a higher pipelining rate,
the CE employs a pipelined Wallace tree multiplier, using CSA blocks. The
Brent-Kung CLA is used in the final stage of this multiplier. The Wallace
tree and Brent-Kung CLA architectures are discussed in Chapter 2.
Figure 4.3: Radix 4 Computational Element
Figure 4.4: Delay Commutator Circuit
Delay Commutator
The number of shift registers in each DC for radix-$r$ is $2(r-1)$ and the
length of the registers is given by $k \times r^{\log_r M-(s+1)}$, where $s$ is
the stage number and $k = 1, 2, 3, \dots, r-1$. The shift register complexity
(flip-flop count) of the DC is given by Nm − r (≈ Nm).
For a 4096 point radix-4 FFT, the lengths of the different shift registers
needed on the input port and output port of the DCs are 768 word, 512
word, 256 word, 192 word, 128 word, 64 word, 48 word, 32 word, 16 word, 12
word, 8 word, 4 word, 3 word, 2 word and 1 word. With the increase in the
register length, the latency increases in a non-linear fashion as a function of
the radix and the number of sample points.
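As a quick check of the register-length expression, the following Python sketch enumerates $k \times r^{\log_r M-(s+1)}$ stage by stage. It assumes that M is the number of sample points and that only stages with a non-negative exponent carry a DC; the function name is illustrative and not part of the original design.

```python
def dc_register_lengths(num_points, radix):
    """Shift-register lengths (in words): k * radix**(log_r(M) - (s + 1))."""
    # Find the number of stages j such that radix**j == num_points.
    stages, size = 0, 1
    while size < num_points:
        size *= radix
        stages += 1
    if size != num_points:
        raise ValueError("num_points must be a power of the radix")

    lengths = []
    # Assumption: the last stage needs no DC (its exponent would be negative).
    for s in range(1, stages):
        base = radix ** (stages - (s + 1))
        # k = r-1, ..., 1 lists the lengths in decreasing order.
        lengths.extend(k * base for k in range(radix - 1, 0, -1))
    return lengths


print(dc_register_lengths(4096, 4))
```

Under these assumptions the sketch reproduces the fifteen lengths listed above for the 4096 point radix-4 case, from 768 words down to 1 word.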
As the shift registers are very long, the power consumed by them (during
transitions) while pipelining is also significant. Moreover, the latency
is determined by the sum of the delays of the CEs and the delay through the
serial shift registers of the DCs. The pipelining rate within the CEs is given
by $t_{add}$. For $r^j$ sample points, where $j$ is any positive integer, the
number of pipelining cycles is $j$. The total execution time is given by
(Latency + $t_{add}(j-1)$). The latency is the sum of the delay through the
CEs and DCs.
Sample Points    Hardware Count    Interconnect Count
16               45321             11715
64               1724492           51976
256              1.4247E+8         214501
1024             1.15823E+10       865606
4096             9.30206E+11       3474199
Table 4.1: Hardware and Interconnect Complexity for Radix 4 FFT
Sample Points    Hardware Count    Interconnect Count
64               251369            247989
512              2.806936E+07      3734777
4096             9.314962E+09      58103185
Table 4.2: Hardware and Interconnect Complexity for Radix 8 FFT
The total delay through the DCs is given by
\[
\sum_{s=1}^{\log_r N} \sum_{k=1}^{r-1} k \times r^{\log_r N-(s+1)}.
\]
Tables 4.1 and 4.2 show the interconnect and hardware complexity for the
radix 4 and radix 8 FFT architectures. The interconnect complexity of radix 8
is more than that of radix 4. This variation is different from that at the
algorithmic level. It is due to the fact that at the algorithm level the DCs
were not taken into account, whereas at the architectural level they play an
important role in deciding the overall interconnect complexity. The analysis
of interconnect complexities reveals that FFT architectures suffer from
interconnect dominance unlike DFT architectures.
Sample Points    IPD         HPD         HIPD
16               36675500    36675500    1.28E+08
64               2.93E+10    7.91E+08    7.87E+10
256              1.38E+13    2.85E+10    3.56E+13
1024             5.34E+15    1.06E+12    1.33E+16
4096             8.83E+21    4.69E+15    4.38E+22
Table 4.3: Power-Delay Product for Radix-4 FFT
4.2 Hardware Power-Delay Analysis
In this section the hardware power-delay product is analyzed. The power
consumed by the respective hardware units like full adders, flip-flops, mul-
tiplexers etc is calculated using average transition count as in Chapter 2.
We employ a unit delay model for the gates to calculate the functional level
delay.
The results of the HPD analysis are given in Tables 4.3 and 4.4. From
the tables it is clear that as the radix increases the HPD increases initially
and then decreases. Across different sample points the HPD increases within
the same radix. Comparing the HPD of DFT and FFT we find that FFT
has lower HPD than DFT.
Sample Points    IPD         HPD         HIPD
64               63980200    1.63E+09    2.86E+09
512              2.23E+10    6.26E+11    1.08E+12
4096             1.19E+13    2.63E+14    4.58E+14
Table 4.4: Power-Delay Product for Radix-8 FFT
4.3 Interconnect Power-Delay Analysis
The results of the IPD analysis are given in Tables 4.3 and 4.4. For a given
radix, the IPD increases with increase in the number of sample points. On
the other hand, as the radix increases the IPD decreases for a given number
of sample points.
Up to radix 4 FFT, IPD dominates HPD. Hence, the overall HIPD is decided
by IPD. But for radix 8 and above, HPD dominates IPD and the overall
HIPD is influenced by the HPD. Therefore we can determine the optimum
radix value for a given number of sample points to achieve the best
power-delay product.
4.4 DFT and FFT Architectures - A DSM Perspective
In DSM technology, with the deep scaling down of the devices, more
than the hardware count it is the interconnects that will decide the overall
performance. Comparing the DFT and FFT architectures from the Tables
3.1, 3.2, 3.3, 3.4,4.1, 4.2, 4.3 and 4.4 we have the following observations. The
total power delay product is more pronounced in lower radix FFT compared
to DFT. On the other hand, for a higher radix FFT, the total power delay
product is dominant in DFT. Interconnect count is higher in FFT compared
to DFT irrespective of the radix values. Moreover, as the radix increases,
the interconnect dominance in FFT increases rapidly.
Besides the dominance of local interconnects in FFT, the major techno-
logical drawback is the clock broadcasting to large number of flip-flops of the
DCs present in different stages. The number of flip-flops to be driven by the
clock against the number of sample points for radix-4 is shown by the graph
in Figure 4.5. However, for a given number of sample points the flip-flop
count grows as a non-linear function of the radix.
The DFT PIPP architecture benefits from the absence of switching and
delay elements leading to lower power consumption and no clock broadcast-
ing. The above analysis can be extended to multidimensional DFT wherein
reduction in the interconnect complexity, latency and power will be substan-
tial.
Figure 4.5: Communication Complexity of Clock Distribution in the DC Flip-Flops
The important factor critical to FFT architecture with regard to
interconnect complexity is the presence of delay commutators. In general, lower
radix FFT architecture is preferred in order to avoid hardware overhead in the
computing elements. However, for higher radix FFT, IPD, HPD and HIPD
are of relatively lower values, with the lowest being the IPD. The analysis
further shows that for lower radix FFT the values of the above mentioned
parameters are several orders of magnitude lower in DFT. As we move to
higher radix FFT the parameter values, except the interconnect count,
become higher in DFT. For the analysis presented in Tables 3.1, 3.2, 3.3, 3.4,
4.1, 4.2, 4.3 and 4.4 no word length truncation is performed.
As a whole, the interconnect complexity in FFT increases considerably
over DFT with increase in radix across different sample values. To worsen the
situation further, the global interconnect complexity in the clock distribution
in FFT is very high, grows as a linear function of number of sample points
and a non-linear function of radix.
Based on the ITRS projection with the shift from device to interconnect
dominated design paradigm, DFT will be more suitable for multi GHz DSM
architecture implementation [30].
Chapter 5
Number Systems and DSP
The performance, speed and accuracy are important parameters of any
arithmetic unit (AU). The number system that an AU employs has a significant
effect on these parameters. The efficiency in the execution of computationally
complex DSP algorithms greatly depends on the type of number system
employed. The focus of this chapter is to discuss the various number systems
implemented in arithmetic units.
5.1 Characteristics of Number Systems
Range: The range of a number system is defined as the interval over which
every defined digit can be uniquely represented by the system, i.e without
having two numbers with the same representation. A number system is said
to have infinite range if each defined digit can be uniquely represented. The
decimal number system is an example of a number system with infinite range.
Uniqueness: A number system is said to be unique if each number in the
system has only one representation.
Redundancy: A number system is defined to be redundant if there are fewer
defined numbers than there are combinations of digits. Therefore, there could
exist some combinations of the digits, for which a defined number may not
exist. Redundancy could also refer to the situation where there are more than
one representation for a number. Hence, the absence of uniqueness implies
that the system is redundant.
Weighted Number System: A number system is said to be weighted if
there exists a set of weights $r_i$ such that the value of any number $A$ is
expressed as
\[
V(A) = \sum_{i=0}^{N-1} a_i r_i \tag{5.1}
\]
where $a_i$ are the set of defined digits. If the weights are successive powers
of a constant $r$, then the number system has a fixed base or a fixed radix,
e.g. the decimal system with base ten ($r = 10$). Number systems in which the
weights are not powers of the same number are called mixed-radix systems.
Weighted number systems are advantageous because magnitude comparison, sign
detection and overflow detection are easy to perform.
5.2 Binary Number Systems
Most arithmetic processors conventionally employ weighted number sys-
tems. A number is represented by a series of digits, with a weight attached
to each digit. The value of the number is computed by multiplying each bit
with its associated weight and then adding the same.
For example, in the binary number system the numbers are represented in the
form of 1’s and 0’s. The digit set I is {0, 1}. The base or radix for
the binary number system is 2. Consider the number $A$, represented as
$a_{N-1}a_{N-2} \dots a_2 a_1 a_0$ where $a_i \in \{0, 1\}$, $i = 0 \dots N-1$.
The value of the number is then given by (5.1). The same format of representing
data is used to represent fractional numbers of the form
$a_{N-1}a_{N-2} \dots a_2 a_1 a_0 . a_{-1} \dots a_{-m}$ where
$a_i \in \{0, 1\}$, $i = -m \dots N-1$. The value of the fractional number is
given by
\[
V(A) = \sum_{i=-m}^{N-1} a_i r_i \tag{5.2}
\]
All arithmetic processors should be able to work on both positive and
negative numbers. The binary number system accomplishes this by appending
a sign bit to the beginning of the binary word. The sign bit $S_A$ is zero for
positive numbers and one for negative numbers. Mathematically,
\[
S_A = \begin{cases} 0 & \text{for } A \geq 0 \\ 1 & \text{for } A < 0 \end{cases}
\]
Arithmetic units employ complement or sign magnitude representation
while operating on negative numbers. With complements, the subtraction of two
numbers is converted into the addition of the positive number and the
complement of the number being subtracted. The two main types of complements
employed for a radix r system are the radix complement, or the r’s complement,
and the diminished radix complement, or the (r-1)’s complement.
For a binary system, i.e. with radix-2, the two methods to find the
complement are the 1’s complement and the 2’s complement system. The 1’s
complement system or the (r-1)’s complement of a negative number is found
out by changing all 1s to 0s and vice versa. The 2’s complement system
or complement of any binary number is found out by first computing the
1’s complement of the number and subsequently adding 1 to it. The use of
complements in arithmetic operations (namely subtraction) helps to convert
all subtractions into addition and hence simplifies the type and number of
hardware used.
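The complement rules above can be sketched in a few lines of Python. This is only an illustration on fixed-width unsigned words; the function names and the 8-bit word length used in the example are assumptions made here.

```python
def ones_complement(x, n):
    """Flip every bit of the n-bit word x (the (r-1)'s complement for r = 2)."""
    return ~x & ((1 << n) - 1)


def twos_complement(x, n):
    """1's complement plus one, kept to n bits (the r's complement for r = 2)."""
    return (ones_complement(x, n) + 1) & ((1 << n) - 1)


def subtract(a, b, n):
    """Compute a - b as a + (2's complement of b), discarding the carry out."""
    return (a + twos_complement(b, n)) & ((1 << n) - 1)


print(subtract(13, 5, 8))            # positive result
print(subtract(5, 13, 8))            # negative result, as an 8-bit pattern
print(twos_complement(8, 8))         # the pattern that encodes -8
```

When the result is negative, the returned bit pattern is the 2's complement encoding of its magnitude, so a single adder serves for both addition and subtraction.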
Representation of floating point numbers
Many DSP applications require operations on floating point numbers.
The standard format of representing these numbers makes use of the IEEE
standard floating-point representation. Single precision numbers are encoded
in 32-bit format. The most significant bit is reserved for the sign, 8 bits are
allocated for representing the exponent and the remaining 23 bits are used
to represent the fraction.
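The field layout can be inspected with Python's standard struct module, as in the sketch below. The helper name is illustrative; the exponent bias of 127 mentioned in the comments comes from the IEEE 754 standard and is not stated in the text above.

```python
import struct


def decode_single(value):
    """Split an IEEE single-precision number into (sign, exponent, fraction)."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, fraction


print(decode_single(1.0))
print(decode_single(-6.5))
```

For -6.5 = -1.625 × 2^2 the stored exponent is 2 + 127 = 129 and the fraction field holds 0.625 × 2^23.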
5.2.1 Algorithms for multiplication and division
The main drawback in multiplication and division algorithms is that they
suffer from a high degree of repetitive computations. However, these algo-
rithms achieve high performance and accuracy. Two of the classical algo-
rithms for multiplication and division are presented below [17].
5.2.2 Multiplication
Consider the multiplication of two binary numbers
$x_{n-1}x_{n-2} \dots x_2 x_1 x_0$ and $y_{n-1}y_{n-2} \dots y_2 y_1 y_0$.
These numbers are stored in registers C and D respectively. Register P
is used to store the product and is initially set to zero. The following two
steps are repeated n times.
Step 1: If the LSB of the value stored in register C is 1, then the contents of
register D are added to P, else zero is added to P. The sum is stored back
into P.
Step 2: The contents of registers P and C are shifted right and the carry
out of the previous addition operation is moved into the high-order bit of P.
The low-order bit of P is moved into the register C and the rightmost bit
of C is shifted out and is not used further in the execution of the algorithm.
After n steps the value stored in P and C is the product, with register C
holding the lower order bits. The hardware to perform the above operation is
shown in Figure 5.1.
Figure 5.1: Multiplier Hardware
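The two steps can be simulated in software as follows. This is a minimal sketch for unsigned n-bit operands; the Python variables p, c and d stand in for registers P, C and D, and the function name is an assumption made for illustration.

```python
def shift_add_multiply(x, y, n):
    """Multiply two unsigned n-bit numbers with the add-and-shift-right scheme."""
    mask = (1 << n) - 1
    c, d, p = x & mask, y & mask, 0
    for _ in range(n):
        # Step 1: add D to P when the LSB of C is 1 (otherwise add zero).
        if c & 1:
            p += d  # p may now be n+1 bits wide; bit n is the carry out
        # Step 2: shift (carry, P, C) right by one bit.
        c = (c >> 1) | ((p & 1) << (n - 1))  # low bit of P enters the top of C
        p >>= 1                              # carry out enters the top of P
    # P holds the high-order half of the product, C the low-order half.
    return (p << n) | c


print(shift_add_multiply(13, 11, 4))
```

Each iteration consumes one multiplier bit from C and frees that position for a product bit, which is why no separate 2n-bit product register is needed.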
5.2.3 Division
The SRT division algorithm described below is an example of a non-restoring
algorithm. This class of algorithms achieves high performance. We consider
the division of two n bit numbers x and y (x/y). These numbers are loaded
into registers X and Y. Register P stores the final remainder and is initially
set to zero.
Step 1: If the value stored in Y has k leading zeros, then the registers are
shifted left by k bits.
Step 2: Repeat (2a), (2b), (2c) for i = 0 to n-1.
(2a) If the top three bits of P are equal, then assign $q_i = 0$ and shift
registers P and X left by one bit.
(2b) If the most significant three bits of P are not all equal and P is
negative, set $q_i = -1$, shift registers P and X left by one bit and add Y.
(2c) Else set $q_i = 1$, shift registers P and X left by one bit and then
subtract Y.
Step 3: If the final remainder is negative, it is corrected by adding Y and the
quotient is corrected by subtracting one from $q_0$. The remainder is finally
shifted k bits to the right, where k is the initial shift.
The binary number system has been popularly used because of its high
accuracy and speed. However, a closer inspection of the algorithms employed
for multiplication and division reveals that the binary number system is
inefficient for these operations. Moreover, the large word length required to
represent numbers of high magnitude and the problems associated with carry
propagation have led to the use of alternative number systems for simplifying
the arithmetic operations.
5.3 Residue Number Systems
The Residue number system (RNS) [15] [40] is based on the principle
of breaking down a given number into a set of smaller word length residues.
Smaller wordlengths lead to the absence of carry propagation and lesser mem-
ory requirement needed for storage.
5.3.1 RNS representation of numbers
A number in RNS is represented by associating it with a set of radices/bases.
However, unlike a fixed-radix number system, the base for residue numbers
is not a single radix, but an N-tuple of integers $m_1, m_2, \dots, m_N$, where
each element of the set is called a ’modulus’.
A number with k residue digits is represented as $(x_{k-1}|\dots|x_2|x_1|x_0)$.
Let $M$ represent the tuple of position weights (moduli) that are mutually
prime,
\[
M = \{m_{k-1}, m_{k-2}, \dots, m_1, m_0\}
\]
with the condition that
\[
m_{k-1} > m_{k-2} > \dots > m_1 > m_0
\]
The residue code for the number $x$ is given by
\[
r_i = x \bmod m_i, \qquad r_i \in [0, m_i - 1] \tag{5.3}
\]
Number    3  5  7        Number    3  5  7
 -15      0  0  6           0      0  0  0
 -14      1  1  0           1      1  1  1
 -13      2  2  1           2      2  2  2
 -12      0  3  2           3      0  3  3
 -11      1  4  3           4      1  4  4
 -10      2  0  4           5      2  0  5
  -9      0  1  5           6      0  1  6
  -8      1  2  6           7      1  2  0
  -7      2  3  0           8      2  3  1
  -6      0  4  1           9      0  4  2
  -5      1  0  2          10      1  0  3
  -4      2  1  3          11      2  1  4
  -3      0  2  4          12      0  2  5
  -2      1  3  5          13      1  3  6
  -1      2  4  6          14      2  4  0
Table 5.1: RNS Representation Example
The Residue Number System is inherently redundant as it is periodic.
Any number can be uniquely represented in RNS only if the number lies
within the dynamic range of the system. The dynamic range is equal to
the product of the moduli and is also called the interval of definition.
Table 5.1 shows the representation of the numbers from -15 to +14
with the moduli set (3,5,7).
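The residue codes of Table 5.1 follow directly from (5.3). The short Python sketch below regenerates a few rows and illustrates the periodicity; the names MODULI and to_rns are chosen here for illustration. It relies on the fact that Python's % operator returns a non-negative remainder for a positive modulus, which matches the convention of the table.

```python
MODULI = (3, 5, 7)


def to_rns(x, moduli=MODULI):
    """Residue code of x: one remainder per modulus, each in [0, m - 1]."""
    return tuple(x % m for m in moduli)


for x in (-15, -8, 0, 8, 14):
    print(x, to_rns(x))

# Periodicity: numbers that differ by the dynamic range (3 * 5 * 7 = 105)
# share the same residue code.
print(to_rns(8) == to_rns(8 + 105))
```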
5.3.2 Negative Number Representation
Variables often take negative values in arithmetic calculations. Two ways
of representing negative numbers are listed, with the latter being the most
commonly used. The first technique is to represent the absolute magni-
tude of a number in residue code and use an external sign bit to represent
the sign. In the second method, the sign of the number is included within
the residue code, similar to the complement representation in binary. In
the dynamic range M, residue numbers in the range [0, (M/2)-1] are taken
as positive and residue numbers in the range [M/2, M-1] are considered
negative. Therefore, if X is represented as $(r_1, r_2, \dots, r_N)$, -X is
represented by $(m_1 - r_1, m_2 - r_2, \dots, m_N - r_N)$.
Given the RNS representation of a number X, the representation of -X is
calculated by complementing each of the digits $r_i$ with respect to its
modulus $m_i$ (0 digits will remain unchanged).
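A minimal Python sketch of this digit-wise complement is given below, using the moduli (3, 5, 7) of Table 5.1 as an assumed example. The extra modulo operation in the code is what keeps zero digits unchanged.

```python
def rns_negate(residues, moduli):
    """Representation of -X: complement each digit with respect to its modulus."""
    # (m - r) % m maps r = 0 to 0, so zero digits are left untouched.
    return tuple((m - r) % m for r, m in zip(residues, moduli))


moduli = (3, 5, 7)
print(rns_negate((2, 3, 1), moduli))   # (2, 3, 1) encodes +8
print(rns_negate((0, 0, 1), moduli))   # (0, 0, 1) encodes +15
```

The first result agrees with the row for -8 in Table 5.1.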
Representational Efficiency
The representational efficiency of the residue number system is defined as
the ratio of dynamic range of the system to the total number of states that
the bits in the residue number system can represent.
E.g. the dynamic range of the RNS (7|5|3) is 7 × 5 × 3 = 105.
Each number encoded with the residue set above requires 3+3+2 = 8 bits.
8 bits can represent 256 unique states.
Hence, the representational efficiency = 105/256 ≈ 41.02 percent.
Selection of Moduli
The selection of appropriate moduli is critical since it affects the represen-
tational efficiency and the complexity of the arithmetic processor involved.
The magnitude of the largest modulus directly affects the speed of the arith-
metic operations. To improve the efficiency, the moduli must be made com-
parable such that each modulus is nearly as large as the modulus of largest
magnitude. This does not affect the speed of the arithmetic operation. The
above conditions are bound by the constraint that the moduli so selected
must be relatively prime to each other.
Consider the RNS with the moduli (17|13|11|7|3|2). The maximum range that
is represented is given by the dynamic range M = 102102.
The number of bits required is 5+4+4+3+2+1 = 19.
The speed of arithmetic operations is dictated by the largest wordlength, i.e.
5 bits. The moduli 2 and 13, and 3 and 7, can be combined with no apparent
penalty, resulting in an RNS with the moduli set (26|21|17|11).
The latter representation still needs 5+5+5+4 = 19 bits. However, there are
two fewer moduli, leading to better hardware and computational efficiency.
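The figures quoted in the two examples above can be recomputed with the following Python sketch. The helper name rns_stats is hypothetical; the bit count per modulus is taken as the number of bits needed to hold the largest residue m - 1, which is the convention the examples appear to use.

```python
from itertools import combinations
from math import gcd, prod


def rns_stats(moduli):
    """Return (dynamic range, bits per modulus, representational efficiency)."""
    # The moduli must be pairwise relatively prime.
    if any(gcd(a, b) != 1 for a, b in combinations(moduli, 2)):
        raise ValueError("moduli are not pairwise relatively prime")
    dynamic_range = prod(moduli)
    bits = [(m - 1).bit_length() for m in moduli]  # bits to hold 0 .. m-1
    efficiency = dynamic_range / 2 ** sum(bits)
    return dynamic_range, bits, efficiency


for moduli in [(7, 5, 3), (17, 13, 11, 7, 3, 2), (26, 21, 17, 11)]:
    dynamic_range, bits, efficiency = rns_stats(moduli)
    print(moduli, dynamic_range, bits, sum(bits), f"{100 * efficiency:.2f}%")
```

Merging moduli leaves the dynamic range and the total bit count unchanged here, so the gain is in the number of parallel channels rather than in storage.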
5.3.3 Arithmetic Identities
Additive inverse
The additive inverse of a number, X, is a number that when added to
X yields a result of zero. Since the modulus is congruent to 0, the additive
inverse is defined as X + (-X) = m or (-X) = m - X.
Eg. The additive inverse of 2 over a modulus of 5 is (−X) = 5− 2 = 3.
This property is used to define negative numbers, with the top half of the
range being the negative numbers of the bottom half of the range.
Multiplicative inverse
The multiplicative inverse of a number, X, is a number that when
multiplied by X yields the result of 1. Unlike the additive inverse, the
multiplicative inverse of a number does not always exist. The multiplicative
inverse of a number, X, exists only if X is relatively prime to the modulus M,
i.e. GCD(X, M) = 1.
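Both identities are easy to check numerically. The sketch below assumes Python 3.8 or later, where the built-in pow accepts an exponent of -1 together with a modulus; the function names are illustrative.

```python
from math import gcd


def additive_inverse(x, m):
    """The number that, added to x, gives 0 modulo m."""
    return (m - x) % m


def multiplicative_inverse(x, m):
    """The number that, multiplied by x, gives 1 modulo m; None if it does not exist."""
    if gcd(x, m) != 1:
        return None  # x and m share a factor, so no inverse exists
    return pow(x, -1, m)


print(additive_inverse(2, 5))
print(multiplicative_inverse(3, 7))
print(multiplicative_inverse(2, 4))
```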
5.3.4 Code Conversions
Conversion to Residue Code
The residue of a number is defined by (5.3). In conventional computers,
this calculation is performed by dividing X by $m_i$ and determining the
remainder. In a residue computer, which is capable of residue addition,
multiplication, etc., a more efficient method is used to determine the residue
representation. A number is represented in the binary system as
\[
X = 2^n b_n + \dots + 2^2 b_2 + 2^1 b_1 + b_0 \tag{5.4}
\]
where $b_i$ are the binary digits of the integer X. Taking the modulo yields
\[
|X|_{m_i} = \left| (2^n \bmod m_i)\, b_n + \dots + (2^2 \bmod m_i)\, b_2
+ (2^1 \bmod m_i)\, b_1 + b_0 \right|_{m_i} \tag{5.5}
\]
If the powers of 2 modulo $m_i$ are directly available, either through look up
tables or through special purpose hardware, then the value of $|X|_{m_i}$ is
calculated by adding, modulo $m_i$, those modulo values of the powers of 2 for
which the bits $b_i$ are one.
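The table look-up method of (5.5) might be sketched as follows. The word length of 8 bits and the function name are assumptions made for the example; the accumulation is kept modulo $m_i$ so that only residue additions are needed.

```python
def binary_to_residue(x, m, nbits=8):
    """Residue of the nbits-wide unsigned integer x modulo m via a table of 2**i mod m."""
    table = [pow(2, i, m) for i in range(nbits)]  # precomputed look-up table
    acc = 0
    for i in range(nbits):
        if (x >> i) & 1:                 # bit b_i is one
            acc = (acc + table[i]) % m   # modular addition only
    return acc


print([binary_to_residue(23, m) for m in (3, 5, 7)])
```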
5.3.5 Conversion from RNS to BNS - The Chinese Remainder Theorem
The Chinese scholar Sun-Tsu, in the first century A.D., described a rule
called t’ai-yen (great generalization) to determine a number having the
remainders 2, 3, and 2 when divided by the numbers 3, 5, and 7. When
the secret of this general technique to determine numbers based on residues
was discovered, it became known as the Chinese Remainder Theorem (CRT).
This theorem has been extensively used to convert numbers represented in
the Residue Number System to a weighted number system.
The mathematical form of the CRT is given as follows
\[
n = \left[ \sum_{i=1}^{L} \hat{m}_i \cdot
\left( (x_i \cdot \hat{m}_i^{-1}) \bmod m_i \right) \right] \bmod M \tag{5.6}
\]
where $n$ is the resulting number, $L$ is the number of sub-rings, $x_i$ is an
element of the $i$th sub-ring, $m_i$ is the modulus of the $i$th sub-ring and
$M$ is the modulus of the overall system ring. $\hat{m}_i = M/m_i$ and
$\hat{m}_i^{-1}$ is the multiplicative inverse of $\hat{m}_i$ over the ring
modulo $m_i$. The CRT is explained with the help of a suitable example.
Figure 5.2: Ancient verse of the Chinese Remainder Theorem
Example: The number 23 is represented using the moduli 3, 5, and 7 as (2,
3, 2). These are the residues (remainders) after integer division by the moduli
(e.g. 23/3 = 7 remainder 2). To convert (2, 3, 2) back to a decimal
representation we use the CRT.
\[
M = 3 \times 5 \times 7 = 105, \qquad
\hat{m}_i = \frac{105}{m_i} = (35, 21, 15)
\]
35 mod 3 = 2 and 2 × 2 mod 3 = 1, therefore $\hat{m}_1^{-1} = 2$
21 mod 5 = 1, therefore $\hat{m}_2^{-1} = 1$
15 mod 7 = 1, therefore $\hat{m}_3^{-1} = 1$
\begin{align*}
n &= \left[ 35\,((2 \times 2) \bmod 3) + 21\,((3 \times 1) \bmod 5)
      + 15\,((2 \times 1) \bmod 7) \right] \bmod 105 \\
  &= (35 + 63 + 30) \bmod 105 \\
  &= 23
\end{align*}
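The worked example can be reproduced with a direct transcription of (5.6) into Python. This is a sketch rather than an efficient converter; it assumes Python 3.8 or later for math.prod and for pow with a negative exponent, and the function name is chosen here.

```python
from math import prod


def crt(residues, moduli):
    """Reconstruct n from its residues using the Chinese Remainder Theorem (5.6)."""
    M = prod(moduli)
    total = 0
    for x_i, m_i in zip(residues, moduli):
        m_hat = M // m_i                # M / m_i
        m_hat_inv = pow(m_hat, -1, m_i)  # inverse of m_hat modulo m_i
        total += m_hat * ((x_i * m_hat_inv) % m_i)
    return total % M


print(crt((2, 3, 2), (3, 5, 7)))
```

The final reduction modulo M = 105 is the step that a residue machine, built only for modulo $m_i$ operations, cannot perform natively.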
Mixed Radix Conversion
The Chinese Remainder Theorem is one method of converting residue
numbers to binary. The disadvantage of this method is the mod M operation,
which makes it unsuitable for residue machines that are designed to
perform only modulo $m_i$ operations. The mixed radix conversion presented
here, on the other hand, is easier to implement in a residue machine, since it
involves only mod $m_i$ operations.
The mixed radix representation is of great importance in residue compu-
tation due to two reasons. 1) The mixed radix system is a weighted system
and hence is used in magnitude comparison. 2) Conversion from residue to
certain mixed-radix systems is relatively fast in residue computers.
Mixed Radix System
A number may be expressed in mixed radix form as
\[
x = a_N \prod_{i=1}^{N-1} R_i + \dots + a_3 R_1 R_2 + a_2 R_1 + a_1 \tag{5.7}
\]
where $R_i$ are the radices and the $a_i$ represent the mixed radix digits. For
a given set of radices, the mixed radix representation of $x$ is denoted by
$\langle a_N, a_{N-1}, \dots, a_1 \rangle$. The multipliers of the mixed radix
digits are the weights.
Conversion to the Mixed-Radix System
Consider a system where, for a set of moduli $m_1, m_2, m_3, \dots, m_N$, a set
of radices is chosen such that $m_i = R_i$; the mixed radix system is then said
to be associated with the residue number system. The equation for mixed
radix systems is transformed into
\[
x = a_N \prod_{i=1}^{N-1} m_i + \dots + a_3 m_1 m_2 + a_2 m_1 + a_1 \tag{5.8}
\]
where the $a_i$ are the mixed radix coefficients. The coefficients are
determined sequentially, starting with $a_1$. Taking mod $m_1$ of the above
equation yields $a_1$, since the remaining terms are multiples of $m_1$. Thus
$a_1$ is the first residue digit. To obtain $a_2$, first the residue code of
$x - a_1$ is formed. This quantity is divisible by $m_1$. It is seen that
\[
a_2 = \left| \frac{x - a_1}{m_1} \right|_{m_2} \tag{5.9}
\]
In the same manner all the other mixed radix digits are obtained. In
general the mixed radix digits are found for $i > 1$ by
\[
a_i = \left| \left\lfloor \frac{x}{m_1 m_2 \cdots m_{i-1}} \right\rfloor
\right|_{m_i} \tag{5.10}
\]
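The sequential procedure can be sketched in Python so that it works on the residues alone, using only mod $m_i$ operations as a residue machine would. Division by a modulus is carried out as multiplication by its multiplicative inverse in each remaining channel; that detail, like the function names, is an assumption of this sketch (it requires Python 3.8 or later for the modular inverse via pow).

```python
def mixed_radix_digits(residues, moduli):
    """Mixed radix digits [a_1, ..., a_N] of the number with the given residues."""
    r = list(residues)
    digits = []
    for i, m_i in enumerate(moduli):
        a = r[i] % m_i
        digits.append(a)
        # Subtract a and divide by m_i in every remaining channel (mod m_j only).
        for j in range(i + 1, len(moduli)):
            m_j = moduli[j]
            r[j] = ((r[j] - a) * pow(m_i, -1, m_j)) % m_j
    return digits


def from_mixed_radix(digits, moduli):
    """Evaluate (5.8): a_1 + a_2 m_1 + a_3 m_1 m_2 + ..."""
    value, weight = 0, 1
    for a, m in zip(digits, moduli):
        value += a * weight
        weight *= m
    return value


digits = mixed_radix_digits((2, 3, 2), (3, 5, 7))
print(digits, from_mixed_radix(digits, (3, 5, 7)))
```

For the residues (2, 3, 2) of the number 23 this gives $a_1 = 2$, $a_2 = 2$, $a_3 = 1$, and 1 × 15 + 2 × 3 + 2 = 23.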
Mixed Radix Systems for Overflow detection, Comparison and
Base Extension.
Overflow detection, scaling and base extension are RNS operations that
are easier than general division, but are considerably more difficult to im-
plement than addition, subtraction and multiplication. In all three cases the
mixed radix converter forms the basis of the operation, since a mixed radix
representation is required as an intermediate step in the procedure.
Overflow detection
In order to determine if overflow has occurred, it is necessary to provide
additional dynamic range in the RNS. The result of a computation is then
checked for overflow into the "extra" range. Overflow needs to be checked
only when the residues are converted back to binary. This is because overflow
has no meaning within residue arithmetic.
Adding a redundant modulus provides the extra range needed for overflow
detection. A necessary and sufficient condition to check for overflow with one
redundant modulus is that it should be the largest modulus. The occurrence of
overflow is then detected if $a_{L+1} \neq 0$, where $a_{L+1}$ is the highest
order mixed radix digit of the redundant RNS representation of X. This assumes
that the quantity being tested, which has possibly overflowed the original RNS
range, is not so large as to overflow the augmented range of the redundant
system. This illustrates that overflow detection requires a mixed radix
converter, designed to accommodate the augmented residue representation needed
for redundancy.
Overflow detection and mixed radix conversion are similar in complexity
and are both considerably more complicated than RNS addition and mul-
tiplication. It is fortunate that overflow detection is a relatively infrequent
operation in many signal processing problems, in contrast to much more fre-
quently required addition and multiplication.
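A small Python sketch of overflow detection with one redundant modulus follows. The base moduli (3, 5, 7) and the redundant modulus 11 are assumptions chosen for the example, as are all function names; the mixed radix converter is the usual sequential one and needs Python 3.8 or later for the modular inverse.

```python
BASE = (3, 5, 7)            # original range: 3 * 5 * 7 = 105
AUGMENTED = BASE + (11,)    # redundant modulus, larger than all the others


def to_rns(x, moduli=AUGMENTED):
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=AUGMENTED):
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))


def mixed_radix_digits(residues, moduli=AUGMENTED):
    """Sequential mixed radix conversion using only per-channel modular operations."""
    r = list(residues)
    digits = []
    for i, m_i in enumerate(moduli):
        a = r[i] % m_i
        digits.append(a)
        for j in range(i + 1, len(moduli)):
            r[j] = ((r[j] - a) * pow(m_i, -1, moduli[j])) % moduli[j]
    return digits


def has_overflowed(residues):
    """Overflow of the original range shows up as a nonzero highest-order digit."""
    return mixed_radix_digits(residues)[-1] != 0


print(has_overflowed(rns_add(to_rns(40), to_rns(50))))  # 90 fits in [0, 105)
print(has_overflowed(rns_add(to_rns(60), to_rns(50))))  # 110 does not
```

The check costs a full mixed radix conversion, which is consistent with the remark above that overflow detection is far more expensive than RNS addition or multiplication.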
Scaling
In conventional fixed-radix arithmetic, two commonly used operations
are multiplication and division by a power of the base. These operations are
implemented easily in a digital computer by shifting the operand. Since
shifting is fast, multiplication and division by a power of the radix offer
obvious advantages over multiplying or dividing by an arbitrary number. In
residue arithmetic, the analogous operation is division by a predetermined
number that is a product of some of the moduli which comprise the dynamic
range M. This division operation is referred to as scaling.
Extension of Base
Frequently it is necessary to find the residue representation of a number
in one base from its representation in another base. In most cases,
the new base will be an extension of the original base, with one or more
extra moduli. The procedure, termed extension of base, is a mixed radix
conversion with an additional step. The new mixed radix representation of
the numbers with higher dynamic range is given by
\[
x = a_{N+1} \prod_{i=1}^{N} m_i + a_N \prod_{i=1}^{N-1} m_i + \dots
+ a_3 m_1 m_2 + a_2 m_1 + a_1 \tag{5.11}
\]
For any number in the original range, the value of $a_{N+1}$ is zero.
5.3.6 Arithmetic operations in RNS
Addition and Subtraction
RNS is superior to weighted number systems for addition and subtraction,
since the absence of carries inherently results in higher speeds. In weighted
number systems, extensive hardware is needed to implement carry look ahead
logic in order to eliminate the carry propagation. Two points must be noted,
however. First, the hardware required in RNS for conversion replaces the
additional hardware in a weighted number system. Second, the sum is obtained
modulo M; hence if the result exceeds M an ambiguity arises, since numbers of
the form $a$ and $a + kM$ have the same residue representation. Hence M must
be chosen large enough to guarantee results within the dynamic range and to
avoid overflow.
Multiplication
Multiplication, like addition and subtraction is done by performing the
modulo multiplication of the corresponding residues.
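The channel-wise nature of these operations is illustrated by the Python sketch below, which assumes the moduli (3, 5, 7) and small operands whose results stay inside the dynamic range of 105. The helper names are illustrative only.

```python
MODULI = (3, 5, 7)


def to_rns(x, moduli=MODULI):
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    # Each channel is independent: no carry passes between residues.
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))


def rns_sub(a, b, moduli=MODULI):
    return tuple((x - y) % m for x, y, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))


a, b = to_rns(8), to_rns(9)
print(rns_add(a, b), to_rns(8 + 9))
print(rns_sub(a, b), to_rns(8 - 9))
print(rns_mul(a, b), to_rns(8 * 9))
```

In hardware the three channels would run in parallel, each on a word of at most three bits for this moduli set.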
Figure 5.3: Generalized Block Arithmetic for Addition Subtraction and Multiplication
The representation of data in RNS is attractive as the absence of carry
propagation leads to a large speed increase for multiplication operations.
RNS provides a way of partitioning large dynamic range operations into
completely independent and parallel smaller dynamic range operations. Ad-
dition and multiplication operations are performed faster. RNS offers many
benefits, such as being able to skew clocks to reduce overall switching cur-
rent/power and system noise. RNS has been used to implement a number of
DSP related architectures [27] [5] [12].
RNS is not a weighted number system. Hence, operations that are simple in
weighted number systems, such as magnitude comparison, sign detection and
overflow detection, are difficult in RNS. Systems with many channels also
tend to be unbalanced, i.e., the channels with larger moduli carry larger
loads than the channels with smaller moduli.
5.4 Logarithmic Number Systems
A majority of the early arithmetic units utilized variations of the weighted
binary number system for fractional, integer and floating-point
representation. The conventional architectures for multiplication and
division are of high circuit complexity and suffer from a speed-complexity
compromise. Various number systems such as RNS and LNS have been proposed to
improve the computational efficiency.
LNS, proposed in [9], greatly speeds up the multiplication and division
processes. The early techniques utilized ROMs to perform multiplication
[42]. However, these techniques were inefficient because of the large number
of memory bits required for storage.
[9] is an early concise description of a general logarithmic number sys-
tem, capable of representing a wide range of both positive and negative real
numbers. Algorithms for the four basic operations (addition, subtraction,
multiplication and division) are discussed.
5.4.1 LNS Representation
Any number $A$ is represented by its sign $S_A$ and the binary logarithm of
its magnitude.
$$S_A = \begin{cases} 1 & \text{if } A < 0 \\ 0 & \text{if } A > 0 \\ 0 \text{ or } 1 & \text{if } A = 0 \end{cases}$$
The representation of numbers in logarithmic format has the disadvantage
that magnitudes smaller than one yield negative logarithms, which this
format cannot represent. Scaling solves this problem, i.e., the number is
multiplied by a scaling factor $\tau$ such that the scaled number yields a
non-negative logarithm.
$$L_A = \begin{cases} \log_2(|\tau A|) & \text{if } |A| \ge \frac{1}{\tau} \\ 0 & \text{if } |A| \le \frac{1}{\tau} \end{cases} \tag{5.12}$$
Let $\Sigma_A$ represent the number $A$. Then the value of
$$\Sigma_A = (1 - 2S_A)\,L_A$$
is the value of the number represented in LNS. The original number is
recovered by the following formula.
$$A = (1 - 2S_A)\left(\frac{1}{\tau}\right)2^{L_A} \tag{5.13}$$
To represent the number $L_A$ in a discrete fashion, it is quantized to form
the number $K_A$.
$$K_A = \frac{\left[\frac{1}{2} + 2^{b-1}\log_2(|\tau A|)\right]}{2^{b-1}} \tag{5.14}$$
Here $[X]$ represents the floor function, which returns the largest integer
not greater than $X$. The $\frac{1}{2}$ term rounds the value to the nearest
quantization level instead of truncating it, which reduces the error (the
rounding error is substantially smaller than the truncation error).
To represent $K_A$ as a finite precision number, it is represented in $n$
bits.
$$K_A = (K_n K_{n-1} \dots K_b \dots K_1) = \sum_{i=1}^{n} K_i\,2^{i-b}$$
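The encoding of (5.12) to (5.14) and the decoding of (5.13) can be sketched
as follows. This is an illustrative floating-point model, not the hardware
of the source; the function names and the default values of `tau` and `b`
are assumptions.

```python
import math

def lns_encode(a, tau=1.0, b=8):
    """Return (S_A, K_A): sign bit and rounded logarithm with b-1 fraction bits."""
    sign = 1 if a < 0 else 0
    if abs(a) <= 1.0 / tau:
        return sign, 0.0                      # lower branch of (5.12)
    scale = 2 ** (b - 1)
    k = math.floor(0.5 + scale * math.log2(abs(tau * a))) / scale   # (5.14)
    return sign, k

def lns_decode(sign, k, tau=1.0):
    """Recover the number from its sign and logarithm, as in (5.13)."""
    return (1 - 2 * sign) * (1.0 / tau) * 2.0 ** k

print(lns_encode(-8.0))                       # exact power of two
print(lns_decode(*lns_encode(789.0)))         # close to 789, not exact
```

Note that under (5.12) every magnitude at or below $1/\tau$, including
zero, decodes to $1/\tau$, so $\tau$ has to be chosen to suit the smallest
magnitude of interest.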
5.4.2 Generation of logarithms for binary numbers
Many methods have been used to obtain the base-two logarithms of binary
numbers. Both memory based methods and hardware circuits have been explored
to compute them. While memory based methods are inconvenient because of the
large memory sizes required, hardware implementations suffer from the
disadvantage of increased machine time for calculating logarithms. One
method for the generation of binary logarithms is discussed below.
Figure 5.4: Straight Line Approximation to Logarithmic Curve
The principal idea of the method in [29] is to replace the log curve
between integer values by a straight line approximation. This introduces an
error, but the simplicity of the proposed method makes it attractive in some
applications. The algorithm and its hardware implementation are discussed
below.
Figure 5.5: Machine Organization to generate and use binary Logs
Algorithm for the generation of approximate logarithm using shift and
count principle
Registers A and B contain two numbers, each 8 bits wide. Hence the largest
characteristic is 7. The counters (x3, x2, x1) and (y3, y2, y1) each
initially contain 111.
Step 1: Shift A and B left until their most significant "ONE" bits are in
the leftmost position, and count down (x3, x2, x1) and (y3, y2, y1) during
shifting.
Step 2: Bits 0-6 of A and B are shifted to C and D respectively. Now C and
D contain the values of the approximate logarithms. These approximate
logarithms are added for multiplication and subtracted for division
operations. The logarithm of the result is stored in a new register, say E.
Step 3: Decode z4, z3, z2, z1 and insert a "ONE" in the appropriate position
of F; the remaining bits of E are placed immediately to the right of this
"ONE". F now contains the result.
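The straight-line approximation can be sketched in software as follows:
between consecutive powers of two the logarithm is replaced by a straight
line, so that $\log_2(2^k(1+m)) \approx k + m$. The function names are
illustrative assumptions, and floating-point values stand in for the
registers described above.

```python
import math

def approx_log2(n):
    """Characteristic from the position of the leading one, mantissa from the bits below it."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    k = n.bit_length() - 1            # characteristic (shift count)
    mantissa = n / (1 << k) - 1.0     # bits to the right of the leading one, as a fraction
    return k + mantissa

def approx_antilog2(log_value):
    """Inverse of the straight-line approximation."""
    k = math.floor(log_value)
    m = log_value - k
    return (2.0 ** k) * (1.0 + m)

def approx_mul(a, b):
    """Multiply by adding approximate logarithms and decoding the sum."""
    return approx_antilog2(approx_log2(a) + approx_log2(b))

print(approx_log2(12))       # 3.5, whereas the exact value is about 3.585
print(approx_mul(12, 10))    # 112.0, whereas the exact product is 120
```

The result is exact when both operands are powers of two and is otherwise
an underestimate, which is the error the text refers to.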
5.4.3 Arithmetic Operations
Multiplication and Division
The biggest advantage of the sign/log number system is that it transforms
computationally complex multiplication and division into simple addition
and subtraction operations. For example, to multiply 789 and 234, the
logarithms of the numbers are added to obtain the logarithm of the result,
which is then suitably decoded to obtain the output. The sign of the result
is obtained by taking the XOR of the individual sign bits of the operands. A
similar technique is adopted for dividing two numbers. Here, the logarithm
of the divisor is subtracted from the logarithm of the dividend to yield the
logarithm of the output. The scaling factor that was initially applied to
the operands is removed from the result.
Figure 5.6: LNS Multiplier Divider Hardware
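The 789 by 234 example can be worked through with a small sketch. It
assumes no scaling ($\tau = 1$), non-zero operands and 7 fraction bits for
the stored logarithm; the function names are illustrative, not from the
source.

```python
import math

def to_sign_log(a, frac_bits=7):
    """Sign bit and rounded binary logarithm of a non-zero number."""
    scale = 2 ** frac_bits
    sign = 1 if a < 0 else 0
    k = math.floor(0.5 + scale * math.log2(abs(a))) / scale
    return sign, k

def from_sign_log(sign, k):
    """Decode a sign/log pair back to an ordinary number."""
    return (1 - 2 * sign) * 2.0 ** k

def lns_mul(x, y):
    """Multiplication: XOR the signs, add the logarithms."""
    return x[0] ^ y[0], x[1] + y[1]

def lns_div(x, y):
    """Division: XOR the signs, subtract the logarithms."""
    return x[0] ^ y[0], x[1] - y[1]

product = from_sign_log(*lns_mul(to_sign_log(789), to_sign_log(234)))
print(product, 789 * 234)    # approximate result next to the exact product
print(from_sign_log(*lns_div(to_sign_log(-8), to_sign_log(-4))))
```

The product differs slightly from the exact value because both logarithms
were quantized before the addition.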
Addition and Subtraction
Addition and subtraction in LNS are cumbersome and involve significant
hardware complexity. The process of addition or subtraction involves
converting the add or sub expression into equivalent multiplication or
division operations.
Addition/subtraction Algorithm
Consider the addition of two numbers $A$ and $B$.
$$S = A + B$$
$$S = A\left(1 + \frac{B}{A}\right) \tag{5.15}$$
The addition expression is now converted into a multiplication of the first
number by a function of the ratio of the two numbers,
$S = A\,\psi\!\left(\frac{B}{A}\right)$, where $\psi(X) = 1 + X$.
First, the ratio $\frac{B}{A}$ is calculated. The value of $\psi$ is then
determined. The value returned by the function $\psi$ is then multiplied by
$A$.
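The sign of the output is the sign of the larger number. $K_A$ and $K_B$
denote the logarithms of the numbers $A$ and $B$, and the subscripts $S$ and
$D$ denote the sum and the difference.
If $K_A \ge K_B$
$$\begin{aligned} S_S &= S_D = S_A \\ K_S &= K_A + \beta(K_B - K_A) \\ K_D &= K_A + \gamma(K_B - K_A) \end{aligned} \tag{5.16}$$
If $K_A \le K_B$
$$\begin{aligned} S_S &= S_D = S_B \\ K_S &= K_B + \beta(K_A - K_B) \\ K_D &= K_B + \gamma(K_A - K_B) \end{aligned} \tag{5.17}$$
where $\beta(X) = \log_2(1 + 2^X)$ and $\gamma(X) = \log_2(1 - 2^X)$.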
The above equations are realized through a comparator, an adder, a
subtractor and a ROM. The size of the ROM is a limiting factor in the
implementation of circuits for large word length operations.
Figure 5.7: Hardware for Logarithmic Addition and Subtraction
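A minimal sketch of logarithmic addition and subtraction of two magnitudes,
in which the functions $\beta$ and $\gamma$ are evaluated directly instead
of being read from a ROM. The larger logarithm is taken as the base, so
their argument is never positive. The function names are illustrative
assumptions.

```python
import math

def beta(x):
    """beta(X) = log2(1 + 2^X)."""
    return math.log2(1.0 + 2.0 ** x)

def gamma(x):
    """gamma(X) = log2(1 - 2^X), defined only for X < 0."""
    return math.log2(1.0 - 2.0 ** x)

def lns_add_sub(ka, kb):
    """Return (K_S, K_D) for the sum and difference of two magnitudes.

    K_D is None when the magnitudes are equal, since the difference is
    zero and its logarithm is undefined.
    """
    hi, lo = max(ka, kb), min(ka, kb)
    ks = hi + beta(lo - hi)
    kd = hi + gamma(lo - hi) if hi != lo else None
    return ks, kd

ks, kd = lns_add_sub(3.0, 2.0)     # magnitudes 8 and 4
print(2.0 ** ks, 2.0 ** kd)        # sum 12 and difference 4, up to rounding
```

In hardware the two functions would be tabulated, which is why the ROM size
grows with the word length.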
Chapter 6
Logarithmic Residue Number System
This chapter explores the emergence of a new number system to overcome
the computational complexity of conventional binary number systems. RNS
has the advantage of carry free arithmetic operations. Thus, it is greatly
useful in speeding up addition, subtraction and multiplication when com-
pared to BNS. However, RNS suffers from inefficient division and magnitude
comparison operations.
Unlike RNS, the sign/log number system achieves higher speed and much
reduced hardware complexity, particularly for multiplication and division
operations, at the cost of accuracy. Since RNS involves a hardware intensive
multiplication and an inefficient division operation, it is attractive to
embed the sign/log number system in RNS. Such a mixed number system, namely
LRNS, gives better performance, lower power consumption and improved
accuracy. The improvement in accuracy arises because the sign/log number
system is applied to the residue code, which has a far smaller bit length
than its binary counterpart.
The following sections detail the architectures used in arithmetic
operations and list the advantages of LRNS over binary, RNS and LNS with
respect to power, performance and area.
6.1 Arithmetic operations
This section describes the execution of the various arithmetic operations
in LRNS. Addition and subtraction under this new scheme are performed in
RNS, while multiplication is performed in LNS.
6.1.1 Addition and Subtraction
The operands for the addition/subtraction operation are converted to
residue codes. This representation of the operands as many segments of
smaller word length is advantageous to the addition operation, as the
individual residues can be operated on in parallel. Moreover, the smaller
word length avoids carry propagation between segments and requires less
memory for storage. The execution flow chart for addition/subtraction
operations under LRNS is shown in Figure 6.1.
Figure 6.1: Execution flow for addition/subtraction operations in LRNS
6.1.2 Multiplication
Most DSP algorithms have a high degree of multiplicative complexity.
LRNS plays a vital role in the efficient execution of such algorithms. The
advantage of using LRNS is due to two main reasons: 1) the conversion of the
operands into residues allows them to be processed in parallel, leading to
higher speed; 2) the use of logarithms to compute the product greatly
reduces the processing time, as the multiplication is replaced by a single
addition. The inherent inaccuracy in using logarithms is also reduced, as
the bit length of the operands is reduced.
The circuits for the addition of the logarithms are discussed in Chapter 5.
A factor that has to be taken into account when designing the architectures
for LRNS is the computation of log(0). Quite often, a residue of an operand
takes the value 0. As the logarithm of zero is not defined, the following
rule is used.
$$\log(0) = X, \qquad \operatorname{antilog}(X) = 0 \tag{6.1}$$
A detailed flowchart depicting a possible architecture for the
multiplication operation is shown in Figure 6.3. Figure 6.3 includes a
pre-checker circuit. This circuit checks whether any of the operands is zero
in the case of multiplication. If such a condition arises, the pre-checker
circuit bypasses the computational units and drives the output to zero. This
step saves valuable computation time. In the case of division, this circuit
can be used to check whether the divisor is zero and correspondingly drive
an error bit high.
Figure 6.2: Execution Flow for Multiplication in LRNS
Figure 6.3: Flow chart depicting Architecture for Multiplication
A checker circuit is employed to check whether any of the generated residue
codes is zero. If any of the residues is zero, the computation of logs and
antilogs follows the rule stated in (6.1). The checker and pre-checker
circuits are essential for the correct and efficient working of the LRNS
multiplication algorithm.
The memory modules for log and anti-log are of the Content Addressable
Memory (CAM) type, thereby greatly reducing the memory access time. The
residue codes of the results from multiplication or addition/subtraction
operations flow down the architecture and are converted back to binary code
in the final stage using the Chinese Remainder Theorem (CRT). This theorem
is explained with an example in Chapter 5.
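The multiplication flow described above can be sketched as follows. Several
details are assumptions made for illustration only: `math.log2` stands in
for the log memory, the anti-log is rounded to the nearest integer and then
reduced modulo the channel modulus, the reserved code X of (6.1) is modelled
by `None`, and the operands are non-negative. None of the names come from
the source.

```python
import math

MODULI = (3, 5, 7)
ZERO_CODE = None                      # stands for the code X in (6.1)

def log_lookup(residue):
    """Log of a residue; zero maps to the reserved code, as in (6.1)."""
    return ZERO_CODE if residue == 0 else math.log2(residue)

def antilog_lookup(log_value):
    """Anti-log rounded to an integer; the reserved code maps back to zero."""
    return 0 if log_value is ZERO_CODE else round(2.0 ** log_value)

def crt(residues, moduli=MODULI):
    """Chinese Remainder Theorem reconstruction."""
    M = math.prod(moduli)
    total = sum(r * (M // m) * pow(M // m, -1, m) for r, m in zip(residues, moduli))
    return total % M

def lrns_multiply(a, b, moduli=MODULI):
    if a == 0 or b == 0:              # pre-checker: bypass the computation
        return 0
    result = []
    for m in moduli:
        la, lb = log_lookup(a % m), log_lookup(b % m)
        if la is ZERO_CODE or lb is ZERO_CODE:   # checker: zero residue
            result.append(0)
        else:
            result.append(antilog_lookup(la + lb) % m)
    return crt(result, moduli)

print(lrns_multiply(6, 3))            # product must stay below M = 105
```

With exact logarithms and small moduli the rounding step recovers the exact
residue product; with quantized log tables this is where the loss of
accuracy discussed in Section 6.2 would enter.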
6.2 LRNS: Area, Power, Performance
The new mixed number representation LRNS has been projected as superior to
BNS and RNS for certain computationally intensive DSP algorithms. To
quantify the performance improvement of LRNS, it was compared with other
number systems with respect to hardware complexity, interconnect
complexity, delay, accuracy and power. The results of the analysis are
presented below.

Number   Full adder   Gate         Interconnect   Delay    Accuracy   Power
System   Complexity   Complexity   Complexity
Binary   64           1056         576            Tadd     Highest    High
RNS      65           1104         613            ~Tadd    Highest    High
Table 6.1: Performance variation for addition operation
6.2.1 LRNS vs Binary
LRNS was compared with binary for different operations such as addition,
subtraction and multiplication.
6.2.1.1 Addition/Subtraction
In LRNS, addition and subtraction operations are carried out in RNS,
thus giving the system the capability of performing dynamic parallel compu-
tation on the individual residues, with the additional advantage of carry-free
operations. The hardware complexity is calculated at the functional level
(full adder level) and also at the logic level (gate level). The interconnect
count (IC) has been taken across the functional units. While estimating the
power, BNS was taken as a benchmark and labelled ’high’, and the power
involved in arithmetic computations in other number systems was compared
against this benchmark.
The results of the analysis are listed in Table 6.1. It shows the variation
in the different parameters for 64-bit operations under the moduli set
$(2^k, 2^k - 1, 2^{k-1} - 1)$. The hardware complexity at the full adder
level is almost the same, whereas there is a slight increase in gate count
for RNS. The increased gate count leads to a larger interconnect complexity
across the gates. We have employed a CSA tree based adder architecture.
Hence, the delay for this architecture is the pipelining delay, equal to the
delay of a single full adder stage. RNS maintains perfect accuracy as long
as the operations are carried out within its dynamic range. This dynamic
range can be extended by increasing the number and magnitude of the pairwise
relatively prime moduli used. As the hardware complexity in RNS for add/sub
operations is nearly comparable to that of BNS, the power consumed can also
be termed 'high'.
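The properties of the moduli set $(2^k, 2^k - 1, 2^{k-1} - 1)$ can be
checked with a short sketch. The source does not state the value of $k$ it
used, so the search for the smallest $k$ covering a 64-bit range is an
assumption, as are the function names.

```python
from math import gcd
from itertools import combinations

def moduli_set(k):
    """The three-moduli set (2^k, 2^k - 1, 2^(k-1) - 1)."""
    return (2 ** k, 2 ** k - 1, 2 ** (k - 1) - 1)

def pairwise_coprime(moduli):
    """RNS requires every pair of moduli to share no common factor."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))

def dynamic_range(moduli):
    """Product M of the moduli."""
    M = 1
    for m in moduli:
        M *= m
    return M

def smallest_k(bits):
    """Smallest k whose dynamic range covers all unsigned values of the given width."""
    k = 2
    while dynamic_range(moduli_set(k)) < 2 ** bits:
        k += 1
    return k

k = smallest_k(64)
widths = [(m - 1).bit_length() for m in moduli_set(k)]   # bits needed per residue
print(k, widths, sum(widths), pairwise_coprime(moduli_set(k)))
```

Under this assumption the three channels are 22, 22 and 21 bits wide, 65
bits in total, which happens to coincide with the full adder count listed
for RNS in Table 6.1.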
6.2.1.2 Multiplication
Multiplication is a cumbersome operation in BNS. The basic operation
involved is the repeated addition of the multiplicand. This repeated
addition is carried out using CSA tree based structures. In LRNS, the
repetitive additions are replaced by a single addition of the corresponding
logarithms of the operands.
The hardware used for multiplication in LRNS is the logarithmic multiplier
shown in Chapter 5. LRNS therefore greatly reduces the hardware required to
perform multiplication. An order of magnitude reduction in hardware can be
achieved, as Table 6.2 shows. The reduced hardware count in turn reduces the
interconnect count. Moreover, the delay of the circuit is much smaller on
account of the smaller number of computations to be performed. The low
hardware count also reduces the power consumed.

Number   Full adder   Gate         Interconnect   Delay     Accuracy   Power
System   Complexity   Complexity   Complexity
Binary   3968         47616        12032          7Tadd     Highest    High
LRNS     65           1104         613            ~2Tadd    Higher     Low
Table 6.2: Performance variation for multiplication operation

The reduction in hardware, interconnects, delay and power makes LRNS very
attractive when compared with BNS for the execution of DSP algorithms, which
are dominated primarily by multiplication.
6.2.2 LRNS vs RNS
LRNS differs from RNS only in its use of logarithms for performing
multiplication operations. Hence, there is no performance variation between
RNS and LRNS for addition and subtraction operations.
6.2.2.1 Multiplication
The performance variation for the multiplication operation when performed in
RNS and LRNS is shown in Table 6.3. By using the property of logarithms,
LRNS achieves a very high degree of reduction in hardware complexity,
interconnect complexity and power. The power reduction is by virtue of the
reduced hardware count and computational complexity. The vast decrease in
interconnect count is of great importance under DSM technology. However, the
accuracy is not as high as in RNS due to truncation effects. The minor loss
in accuracy is a small price to pay for the drastic improvement in the other
performance parameters in LRNS.

Number   Full adder   Gate         Interconnect   Delay      Accuracy   Power
System   Complexity   Complexity   Complexity
RNS      1279         15348        3967           < 7Tadd    Highest    High
LRNS     65           1104         613            ~2Tadd     Higher     Low
Table 6.3: LRNS vs RNS: Performance variation for multiplication operation
6.2.3 LRNS vs Sign/Log
Though the sign/log number system provides a computationally efficient
method for performing multiplication and division operations, it cannot
perform addition/subtraction operations efficiently. Another major drawback
of the sign/log number system is its inaccuracy. This is mainly due to
truncation effects, which are more pronounced in operations on large word
operands. LRNS reduces this truncation error, as the initial conversion to
residue codes reduces the word length of the operands.
Number     Full adder   Gate         Interconnect   Delay    Accuracy   Power
System     Complexity   Complexity   Complexity
Sign/log   128          2517         1285           4Tadd    Low        Higher
RNS        65           1104         613            ~Tadd    Highest    High
Table 6.4: LRNS vs LNS: Performance variation for add/sub operation
6.2.3.1 Addition/Subtraction
The hardware to perform addition in LNS is quite complex and is discussed in
Chapter 5. The performance difference between LRNS and the sign/log number
system for addition/subtraction operations is listed in Table 6.4.
As is evident from the results in Table 6.4, LNS is not suited to addition
or subtraction operations. LNS suffers from high power consumption due to
its increased hardware count. The delay associated with an add/sub operation
is 4Tadd, which is four times the delay of a similar operation in BNS.
While the high delay, low accuracy and increased power make LNS
unattractive, LRNS maintains the same performance as binary, as the add/sub
operations are performed in RNS.
Number     Full adder   Gate         Interconnect   Delay     Accuracy   Power
System     Complexity   Complexity   Complexity
Sign/Log   64           1056         576            2Tadd     Low        Low
LRNS       65           1104         613            ~2Tadd    Higher     Low
Table 6.5: Performance variation for multiplication operation
6.2.3.2 Multiplication
LNS is highly efficient for multiplication and division operations. A
comparison between LNS and LRNS for multiplication is shown in Table 6.5.
The results shown in Table 6.5 reveal that the conversion to residue codes
does not affect the hardware or interconnect count to a large extent. It
does, however, help increase the accuracy of the system, as the truncation
errors on small word lengths are smaller.
6.3 Accuracy Analysis
It is well known that the sign/log number system suffers from a high degree
of inaccuracy. LRNS provides better accuracy than LNS. This is because the
error in calculating the logarithm is smaller for operands of small word
length. Tables 6.6 and 6.7 show two example cases of an 8-point radix-2 FFT
operation. The accuracy of LRNS is within approximately 1 percent of the
corresponding values in the binary number system.
Sample Point   Binary FFT   LRNS FFT
(16,16)        (128,128)    (128,128)
(16,16)        (0,0)        (0,0)
(16,16)        (0,0)        (0,0)
(16,16)        (0,0)        (0,0)
(16,16)        (0,0)        (0,0)
(16,16)        (0,0)        (0,0)
(16,16)        (0,0)        (0,0)
(16,16)        (0,0)        (0,0)
Table 6.6: Accuracy Analysis for 8 Sample point FFT-1

Sample Point   Binary FFT   LRNS FFT
(2,0)          (12,0)       (12,0)
(1,0)          (1,-2.414)   (1,-2.42)
(2,0)          (0,0)        (0,0)
(1,0)          (1,-0.414)   (1,-0.42)
(2,0)          (0,0)        (0,0)
(1,0)          (1,-0.414)   (1,-0.42)
(2,0)          (0,0)        (0,0)
(1,0)          (1,2.414)    (1,2.42)
Table 6.7: Accuracy Analysis for 8 Sample point FFT-2
6.4 LRNS Architecture for DFT and FFT
Figure 6.4 shows a PE for a 1-D DFT computed using a matrix-column vector
multiplication architecture. It consists of an LRNS multiplier and an RNS
adder. The weight $W$ acts as a time-dependent weighting function, as the
index $m$ changes cyclically with each clock pulse. Since $W$ is complex,
complex valued results must be calculated in the PE. One complex
multiplication can be broken down into four real multiplications and two
real additions. By employing LRNS, one complex multiplication is therefore
converted into six additions. The partial product addition is carried out
using RNS addition.
Figure 6.5 shows the LRNS architecture for a radix-4 butterfly element.
The inputs are converted from binary to RNS in the first stage and then to
LNS using the log ROM. Multiplication with twiddle factors is done with the
help of LRNS multipliers. A ROM is used to store the LRNS values of the
twiddle factors. An anti-log ROM is used to convert the results back to RNS.
The binary additions are replaced by RNS additions. The results are finally
converted back to binary.
Figure 6.4: PE of a 1-D DFT array
Figure 6.5: PE of a radix-4 FFT architecture
Chapter 7
LRNS in Time-Frequency Transforms
A traditional approach to the frequency domain analysis of time and space
domain signals is the Fourier transform (FT). This transform is reversible,
but applicable only to stationary signals. With this transform it is not
possible to follow the variation of a frequency component with time, as is
needed for signals such as speech and EEG. In the frequency domain no time
information is available: it is impossible to tell where in time a given
frequency component occurs; we only know that it is present in the signal.
Hence what we need is a transform giving a time-frequency representation.
The traditional solution is to use a Short Term (Windowed) Fourier
Transform (STFT), which divides the signal into small segments (windows)
within which the signal can be assumed to be stationary; the FT is computed
within each window.
Such time-frequency transforms are applied in areas like speech and image
processing. The Gabor transform (Appendix A) is one such transform, and it
is much better than most of its counterparts for image representation. Its
major advantage is that it achieves the lower bound on the joint entropy. It
is found that the majority of mammalian visual profiles match this type of
representation quite well. The main problem that prevents its widespread
usage is the computational complexity involved in finding the Gabor
coefficients. The following section presents a fast and efficient method for
the computation of the Gabor transformation by changing the number
representation.
7.1 The 1-D Discrete Gabor Transform
The discrete Gabor transformation is expressed in matrix notation. The
Gabor coefficients are found by multiplying the inverse of the Gabor matrix
by the signal vector. The Gabor matrix is decomposed into the product of a
sparse constant complex matrix (which has a known inverse) and another
sparse matrix which depends only on the window function.
For a finite 1-D signal $f(x)$, $x = 0, 1, \dots, X-1$, $X = KM$, the
complete Gabor transformation [49] is expressed as
$$f(x) = \sum_{m=0}^{K-1}\sum_{r=0}^{M-1} a_{mr}\,g_{mr}(x) \tag{7.1}$$
which means that a discrete $f(x)$ with $KM$ sample points has $KM$
coefficients $a_{mr}$, $m = 0, 1, \dots, K-1$, $r = 0, 1, \dots, M-1$.
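The Gabor transformation (7.1) written in matrix form is
$$f = Ga \tag{7.2}$$
where
$$f = \begin{bmatrix} f(0) \\ f(1) \\ \vdots \\ f(KM-1) \end{bmatrix} \quad\text{and}\quad a = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_{KM-1} \end{bmatrix} \tag{7.3}$$
and $G$ is a $KM \times KM$ matrix. $G$ can be expressed as
$$G = \begin{bmatrix} G_{00} & G_{01} & \dots & G_{0,K-1} \\ G_{10} & G_{11} & \dots & G_{1,K-1} \\ \vdots & \vdots & \ddots & \vdots \\ G_{K-1,0} & G_{K-1,1} & \dots & G_{K-1,K-1} \end{bmatrix} \tag{7.4}$$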
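The elements of the $G$ matrix are $M \times M$ matrices. Expanding further,
(7.2) becomes
$$f = CD \begin{bmatrix} E^* & 0 & \dots & 0 \\ 0 & E^* & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & E^* \end{bmatrix} a \tag{7.5}$$
where
$$C = \begin{bmatrix} I & 0 & 0 & \dots & 0 \\ 0 & (-1)^{M-1}I & 0 & \dots & 0 \\ 0 & 0 & I & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & (-1)^{M-1}I \end{bmatrix} \tag{7.6}$$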
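$$D = \begin{bmatrix} D_0 & D_{-1} & D_{-2} & \dots & D_{-(K-1)} \\ D_1 & D_0 & D_{-1} & \dots & D_{-(K-2)} \\ D_2 & D_1 & D_0 & \dots & D_{-(K-3)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ D_{K-1} & D_{K-2} & D_{K-3} & \dots & D_0 \end{bmatrix} \tag{7.7}$$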
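Note that the matrix $D$ is a block-Toeplitz matrix. It is derived from
(7.5) that
$$a = \frac{1}{M} \begin{bmatrix} E & 0 & \dots & 0 \\ 0 & E & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & E \end{bmatrix} D^{-1} C f \tag{7.8}$$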
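where $D^{-1}$ is the inverse matrix of $D$ and is written as
$$D^{-1} = \begin{bmatrix} D^{(0)}_{0} & D^{(0)}_{-1} & D^{(0)}_{-2} & \dots & D^{(0)}_{-(K-1)} \\ D^{(1)}_{1} & D^{(1)}_{0} & D^{(1)}_{-1} & \dots & D^{(1)}_{-(K-2)} \\ D^{(2)}_{2} & D^{(2)}_{1} & D^{(2)}_{0} & \dots & D^{(2)}_{-(K-3)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ D^{(K-1)}_{K-1} & D^{(K-1)}_{K-2} & D^{(K-1)}_{K-3} & \dots & D^{(K-1)}_{0} \end{bmatrix} \tag{7.9}$$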
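The step from (7.5) to (7.8) can be sketched as follows. The sketch relies
on two properties: $C^2 = I$, which holds because $C$ is block diagonal with
blocks $\pm I$, and $E E^* = M I$, which is an assumption here (it holds,
for example, for an unnormalized $M$-point DFT matrix) since the source does
not state it explicitly.

```latex
\begin{aligned}
f &= C\,D\,\operatorname{diag}(E^*, \dots, E^*)\,a \\
C f &= D\,\operatorname{diag}(E^*, \dots, E^*)\,a
    && \text{since } C^2 = I \\
D^{-1} C f &= \operatorname{diag}(E^*, \dots, E^*)\,a \\
\frac{1}{M}\operatorname{diag}(E, \dots, E)\,D^{-1} C f &= a
    && \text{since } E E^* = M I \text{ (assumed)}
\end{aligned}
```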
In general, the computation time of an algorithm for the inversion of an
$n$th order Toeplitz matrix is bounded by $O(n^2)$. However, algorithms for
fast inversion of banded Toeplitz matrices by circular decompositions are
proposed in [1]. According to [1], the computation time for the inverse of
an $n$th order banded Toeplitz matrix can be reduced to $O(n \log n)$.
Rewriting equations (7.5) and (7.8), the Gabor coefficients are computed
from the following equations (7.10), (7.11) and (7.12), which involve only
matrix multiplication.
$$x = Cf \tag{7.10}$$
$$y = D^{-1}x \tag{7.11}$$
$$a = \frac{1}{M} \begin{bmatrix} E & 0 & \dots & 0 \\ 0 & E & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & E \end{bmatrix} y \tag{7.12}$$
The computational complexity of the Gabor transformation in the binary
number system is analyzed at the algorithm level as follows. The computation
of $x$ in (7.10) is a matrix-column vector multiplication in which the
matrix $C$ consists only of 0s and entries of magnitude 1. Hence no
multiplication is involved in this case, and this equation can be computed
in $O(MK)$ time.
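Because $C$ in (7.6) is block diagonal with blocks $\pm I$, the product
$x = Cf$ reduces to sign changes on whole blocks of $f$. The sketch below
assumes that the sign $(-1)^{M-1}$ applies to every second block, which is
how the pattern shown in (7.6) is read here; the function name is
illustrative.

```python
def apply_c(f, M, K):
    """Compute x = C f in O(MK) time using only sign changes.

    Assumption: blocks with odd index are scaled by (-1)**(M - 1) and
    blocks with even index are left unchanged.
    """
    if len(f) != M * K:
        raise ValueError("f must contain K blocks of M samples")
    odd_sign = (-1) ** (M - 1)
    x = []
    for block in range(K):
        sign = odd_sign if block % 2 else 1
        x.extend(sign * v for v in f[block * M:(block + 1) * M])
    return x

print(apply_c([1, 2, 3, 4, 5, 6, 7, 8], M=2, K=4))   # M even: odd blocks are negated
print(apply_c([1, 2, 3, 4, 5, 6], M=3, K=2))         # M odd: nothing changes
```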
In the computation of $y$ in (7.11), $O(M^2K^2)$ multiplications and
$O((M-1)K^2)$ additions are involved, and the time complexity is
$O(K \log K)$.
In the computation of $a$ in (7.12), the order of multiplications is
$O(M^2K)$ and the order of additions is $O((M-1)K)$, and the computation of
$a$ can be achieved in $O(MK)$ time.
The total computational complexity is found by adding the individual
computational complexities of the matrix multiplications. For the
calculation of all $a_{mr}$, the overall computational complexity gets
multiplied by $M/2$ (it does not get multiplied by $M$, as might be
expected, because half of the coefficients are complex conjugates of the
other half). The overall time complexity of the Gabor transformation
algorithm is found to be $O(MK \log MK)$ (in terms of multiplications),
which is comparable to that of an FFT. It is clearly seen that the
computational complexity of the Gabor transformation is very high, which
restricts its widespread use.
7.2 LRNS in Gabor Transform
Using LRNS in the Gabor transform replaces all the multiplications involved
by additions. The computational complexity of $y$ then consists only of an
addition complexity of $O(M^2K^2)$. The addition complexity of the final
stage becomes $O(M^2K)$. The overall computational complexity becomes
approximately $O(M^2K^2)$ additions. Hence the time complexity of the Gabor
transformation is greatly reduced, from the original multiplicative
complexity of $O(MK \log MK)$ to the corresponding addition complexity.
The computational flow for finding the Gabor coefficients of a given
function $f(t)$ is shown at the functional level, both in BNS and in LRNS,
in Figure 7.1.
The functional level schematic depicted in Figure 7.1 brings out the
superiority of LRNS in reducing the time complexity involved in executing
the different arithmetic stages of finding the Gabor coefficients. The
application of LRNS can also be extended to the multidimensional discrete
Gabor transformation, wherein the reduction in computational complexity and
time complexity is even more pronounced.
Figure 7.1: Computational flow for finding the Gabor Coefficients using
Binary and LRNS
Chapter 8
Mixed number system Arithmetic Processor-MAP
High performance, accuracy and low power are the most important design
parameters of DSP architectures. In DSM based technology, while high
performance can be achieved, power becomes a critical factor, which calls
for either a new architecture or even a new number system. The computational
complexity of DSP algorithms leads to high power consumption, particularly
in high performance applications. We present an architecture for an
arithmetic processor based on a mixed number system [31] to achieve low
power and improved accuracy without sacrificing performance in the context
of DSM technology. The uniqueness of this architecture is that, unlike
conventional architectures, it supports four different types of number
systems. This architecture is expected to fill the gap between the
conventional architectures and the DSM technology, achieving reduced
interconnect complexity and low power.
8.1 MAP Architecture
The architecture for the Mixed number system based Arithmetic Processor
(MAP) is shown in Figure 8.1. This architecture has quite a few special
functional units corresponding to the RNS, sign/log and Binary systems.
These functional units help achieve very high performance with reduced power
consumption for executing computationally intensive DSP algorithms.
On-MAP memory modules for storing RNS and BNS data are included. The RNS
memory could be of an SRAM cache type and the BNS memory is of DRAM type.
The reason for including an on-chip cache is that the volume of RNS data is
bound to be much larger than that of BNS data. This is because the
computational complexity in DSP algorithms is dominated by multiplication
operations, which are performed using LRNS in the MAP architecture.
The processor is supported by a uni-bus architecture consisting of data,
control and address buses. The data bus is provided with the capability of
handling both RNS and Binary data. The RNS data is carried as a set of
residue codes requiring a bus width larger than that of the Binary data. The
control unit carries out loading of either the RNS residue code or the
Binary data by appropriately driving the respective bus lines to tri-state.
Figure 8.1: Mixed number system Arithmetic Processor (MAP)
8.1.1 Special Purpose Functional Units
In the majority of signal processing applications, the input to the
processor comes from an analog-to-digital converter (ADC). The output of the
ADC is converted to residue codes in the BRC unit. This unit performs the
modulo division operation. The divider unit performs multiplicative division
using a pipelined architecture. Another major functional unit of this
processor is the LMU. It receives inputs from the LAU, which is a CAM
system. The basic operations performed in the LMU are addition and
subtraction. The PMA performs RNS addition and subtraction.
8.1.2 General Purpose Functional Units
Besides these special purpose functional units, a general purpose Binary ALU
is provided to perform any pre-processing needed in executing particular
DSP algorithms, and also modulo division. The Binary ALU does not contain
any high speed multiplier or divider unit. After execution of an algorithm,
the results that are in RNS form are converted back to BNS in the RBC [14]
[44] unit.
Figure 8.2: Special purpose MAP Instruction Format
8.2 Instruction Set
This arithmetic processor includes a powerful instruction set having both
special purpose and general-purpose characteristics. The general-purpose
instructions include the conventional arithmetic, load, store and control in-
structions. The special purpose instructions are used for performing opera-
tions under sign/log, RNS and LRNS.
The list of special purpose instructions is given below and their instruction
format is given in Figure 8.2. The instruction format includes individual
fields corresponding to different number systems. This is bound to make the
decoding logic simple.
RNS Instructions
RAD- RNS Addition
RSU- RNS Subtraction
MOD- Modulo operation
RBC- RNS to BNS Conversion
BRC- Binary to RNS Conversion
Sign/Log Instructions
SMU- sign/log Multiplication
SDIV- sign/log Division
LOG- Calculates logarithm
ALO- Calculates anti-logarithm
LRNS Instructions
LRM- LRNS Multiplication
The instruction set includes quite a few special purpose instructions for
faster execution. Hence it is better to avoid complex addressing modes. The
above special purpose instruction set supports Immediate and Direct ad-
dressing modes only. A MAP compiler can be designed to map the execution
of an algorithm into a balanced set of instructions for executing the different
arithmetic operations using relevant number systems.
8.3 Execution Flow for MAP Instructions
The data and control flow for the special purpose instructions are given
in Figures 8.3, 8.4 and 8.5. The bold lines represent the control flow and the
ordinary lines represent the data flow.
RAD/RSU
Figure 8.3 shows the execution flow for the instruction RAD/RSU. The
operands to be added are converted into residues using an appropriate
moduli set. The residues are added (in the case of subtraction,
complementary addition is performed) and the resultant residue set is
converted to binary using CRT.
Figure 8.3: Execution Flow for RAD/RSU Instruction
Figure 8.4: Execution Flow for SLM/SLD Instruction
Figure 8.5: Execution Flow for LRM Instruction
SLM/SLD
Figure 8.4 shows the execution flow for the instruction SLM/SLD. The
logarithms of the operands, read from the Log/Anti-log ROM, are
added/subtracted. The anti-log of the resultant logarithm is read from the
Log/Anti-log ROM to obtain the result in binary.
LRM
Figure 8.5 shows the execution flow for the instruction LRM. The operands
to be multiplied are converted into residues using an appropriate moduli
set. The logarithms of the corresponding residues, read from the
Log/Anti-log ROM, are added. The anti-log of the resultant residue set is
read from the Log/Anti-log ROM and is finally converted to binary using CRT.
8.4 Verilog Simulation of MAP Instruction Set
To verify the architectural behavior of the MAP Processor, the execution
of selected arithmetic operations in RNS, LNS and LRNS is simulated in
Verilog and the data flow timing diagrams are provided.
RAD
The timing simulation for the RAD instruction is shown in Figure 8.6. The
signals a and b are the operands, which take the values 10 and 5
respectively. The signals m1, m2 and m3 represent the moduli set used,
(3,5,7). The output signal ad1 has the value 15, and the sign flag f of the
output goes high to indicate a positive sign. The output appears after
2 ns.
Figure 8.6: Timing diagram for RAD Instruction
RSU
The timing simulation for the RSU instruction is shown in Figure 8.7. The
signals a and b are the operands, which take the values 3 and -14
respectively. The signals m1, m2 and m3 represent the moduli set used,
(3,5,7). The output signal d1 has the value 11, and the sign flag f of the
output goes low to indicate a negative sign. The output appears after
2 ns.
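The RAD and RSU cases can be reproduced with a short Python sketch. It is only an illustration under stated assumptions, not the Verilog model used for the simulation: the function names (`to_rns`, `from_rns`, `rad`, `rsu`) are hypothetical, negative results are assumed to occupy the upper half of the dynamic range M = 105, and the RSU case is evaluated as 3 − 14, which gives the magnitude 11 with a negative sign.

```python
from math import prod

MODULI = (3, 5, 7)      # moduli set quoted in the text
M = prod(MODULI)        # dynamic range, 105


def to_rns(x):
    """Binary-to-RNS conversion: one residue per modulus."""
    return tuple(x % m for m in MODULI)


def from_rns(residues):
    """RNS-to-binary conversion using the Chinese Remainder Theorem."""
    total = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)   # pow(Mi, -1, m): inverse of Mi mod m
    value = total % M
    # Assumed signed convention: the upper half of the range encodes negatives.
    return value - M if value > M // 2 else value


def rad(a, b):
    """Residue addition: add channel by channel, then convert back."""
    ra, rb = to_rns(a), to_rns(b)
    return from_rns(tuple((x + y) % m for x, y, m in zip(ra, rb, MODULI)))


def rsu(a, b):
    """Residue subtraction as complementary addition (add m - residue)."""
    ra, rb = to_rns(a), to_rns(b)
    return from_rns(tuple((x + (m - y)) % m for x, y, m in zip(ra, rb, MODULI)))


print(rad(10, 5))   # 15
print(rsu(3, 14))   # -11
```

Each residue channel works independently, so no carry crosses from one modulus to another; only the final CRT step combines the channels.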
LRM
The timing simulation for the LRM instruction is shown in Figure 8.8. The
signals a and b are the operands, which take the values 6 and -3
respectively. The signals m1, m2 and m3 represent the moduli set used,
(3,5,7). The output signal d1 has the value -18, and the sign flag f of the
output goes low to indicate a negative sign. The output appears after
4 ns.
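The LRM case (6 × −3 = −18) can be sketched in Python as follows. This is a hedged illustration only. It assumes that the "logarithm" of a nonzero residue is its discrete logarithm (index) with respect to a primitive root of each prime modulus, that a zero residue bypasses the tables, and that the sign is handled separately from the magnitude. The primitive roots, the table layout and the function names are hypothetical; the contents of the thesis's Log/Anti-log ROM are not reproduced here.

```python
from math import prod

MODULI = (3, 5, 7)                      # moduli set quoted in the text
PRIMITIVE_ROOT = {3: 2, 5: 2, 7: 3}     # assumed generators, one per modulus
M = prod(MODULI)


def build_tables(m):
    """Return (log, antilog) lookup tables for the nonzero residues mod m."""
    g = PRIMITIVE_ROOT[m]
    antilog = [pow(g, e, m) for e in range(m - 1)]
    log = {value: e for e, value in enumerate(antilog)}
    return log, antilog


TABLES = {m: build_tables(m) for m in MODULI}


def crt(residues):
    """RNS-to-binary conversion using the Chinese Remainder Theorem."""
    total = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)
    return total % M


def lrm(a, b):
    """Multiply by adding residue logarithms; valid while |a*b| < M."""
    negative = (a < 0) != (b < 0)       # sign handled outside the residues
    out = []
    for m in MODULI:
        x, y = abs(a) % m, abs(b) % m
        if x == 0 or y == 0:
            out.append(0)               # zero has no logarithm: bypass the tables
            continue
        log, antilog = TABLES[m]
        out.append(antilog[(log[x] + log[y]) % (m - 1)])
    magnitude = crt(out)
    return -magnitude if negative else magnitude


print(lrm(6, -3))   # -18
```

The only arithmetic inside each channel is a small modular addition of table indices, which is the sense in which the multiplier is replaced by an adder and lookups.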
Figure 8.7: Timing diagram for RSU Instruction
Figure 8.8: Timing diagram for LRM Instruction
SLD
The timing simulation for the SLD instruction is shown in Figure 8.9. The
signals a1 and b1 are the operands, which take the values -8 and -4
respectively. The signals y1 and y2 indicate the signs of the operands a1
and b1 respectively; both take the value 0 because both a1 and b1 are
negative. The signals m1, m2 and m3 represent the moduli set used,
(3,5,7). The output signal c1 has the value 2, and the sign flag f of the
output goes high to indicate a positive sign. The output appears after
2 ns.
Figure 8.9: Timing diagram for SLD Instruction
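Sign/log division and multiplication of the kind simulated for SLD can be sketched in Python. This is a minimal sketch under assumptions: base-2 logarithms computed in floating point stand in for the quantized Log/Anti-log ROM, a sign bit of 0 marks a negative operand (the convention the text uses for y1 and y2), and the function names are hypothetical.

```python
import math


def to_sign_log(x):
    """Return (sign_bit, log2|x|); sign_bit is 0 for negative, 1 for positive."""
    if x == 0:
        raise ValueError("zero has no logarithm and needs separate handling")
    return (0 if x < 0 else 1), math.log2(abs(x))


def from_sign_log(sign_bit, exponent):
    """Anti-logarithm followed by re-applying the sign."""
    magnitude = 2.0 ** exponent
    return magnitude if sign_bit else -magnitude


def slm(a, b):
    """Sign/log multiplication: add the logarithms."""
    sa, la = to_sign_log(a)
    sb, lb = to_sign_log(b)
    return from_sign_log(1 if sa == sb else 0, la + lb)


def sld(a, b):
    """Sign/log division: subtract the logarithms."""
    sa, la = to_sign_log(a)
    sb, lb = to_sign_log(b)
    return from_sign_log(1 if sa == sb else 0, la - lb)


print(sld(-8, -4))   # 2.0
```

The result sign is positive exactly when the two sign bits agree, so the sign path is a single comparison that runs alongside the logarithm addition or subtraction.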
Chapter 9
Future Work
The LRNS mixed number representation presented in this thesis can find
wide applications in image and digital signal processing.
9.1 Reconfigurable FFT Architecture for different Radices
Depending on the application and the number of sample points, FFT
architectures based on different radices may be needed. Designing an FFT
architecture in which the different radices are brought in by reconfiguring
the hardware is a difficult task. Such a design involves enormous hardware
resources and hence consumes more power. Applying LRNS in such complex
reconfigurable FFT architectures will therefore be of great importance: the
overhead involved in reconfiguration can be largely compensated by the
introduction of LRNS, which eliminates the multiplier units.
The reconfiguration across different radices can be achieved by employing
a suitable multi-stage interconnection network (MIN) such as Delta, Omega,
Banyan or Clos. The efficiency of interconnection for FFT reconfiguration
across different radices will have to be investigated for these MIN
structures. A possible functional level architecture for this radix
reconfigurable FFT is shown in Figure 9.1. It has several alternating
stages of butterflies, corresponding to different radices, and MIN
structures. The butterfly structures from the input stages to the output
stages can be connected according to the reconfiguration strategy and the
capability of the MIN structures. A detailed investigation of the design
and simulation of this architecture will be taken up at a later stage.
9.2 Low Power High Performance LRNS based Convolver Design
The convolution of two sequences x(n) and w(n) in LRNS is defined as
$$y(n) = \sum_{k=-\infty}^{\infty} \mathrm{ALOG}\big(x_{LRNS}(k) + w_{LRNS}(n-k)\big) \tag{9.1}$$
Suppose x(n) and w(n) are causal sequences, each of finite length N, i.e.,
n = 0, 1, 2, ..., N − 1. The (linear) convolution of these two sequences
is a causal sequence, computed in LRNS as
$$y(n) = \sum_{k=0}^{N-1} \mathrm{ALOG}\big(x_{LRNS}(k) + w_{LRNS}(n-k)\big) \tag{9.2}$$
where n = 0, 1, 2, ..., 2N − 2.
Figure 9.1: Reconfigurable FFT Architecture
With LRNS, the computational complexity is reduced to 2N − 1 additions.
In binary it is N multiplications and N − 1 additions [23].
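The structure of (9.2) can be illustrated with a short Python sketch in which every product is replaced by an addition of logarithms followed by an anti-logarithm. It is a sketch under assumptions, not the proposed convolver: floating-point sign/log pairs stand in for the LRNS representation, zero samples are skipped because they have no logarithm, and the function names are hypothetical.

```python
import math


def to_log(v):
    """Return (is_negative, log2|v|), or None for a zero sample."""
    if v == 0:
        return None
    return (v < 0, math.log2(abs(v)))


def alog(is_negative, exponent):
    """Anti-logarithm with the sign re-applied."""
    magnitude = 2.0 ** exponent
    return -magnitude if is_negative else magnitude


def log_convolve(x, w):
    """Linear convolution with each product formed as ALOG(log x + log w)."""
    xl = [to_log(v) for v in x]
    wl = [to_log(v) for v in w]
    y = [0.0] * (len(x) + len(w) - 1)
    for n in range(len(y)):
        for k in range(len(x)):
            j = n - k
            if 0 <= j < len(w) and xl[k] is not None and wl[j] is not None:
                (sx, lx), (sw, lw) = xl[k], wl[j]
                y[n] += alog(sx != sw, lx + lw)   # addition replaces the multiply
    return y


print([round(v) for v in log_convolve([1, 2, 3], [4, -5, 6])])
# [4, 3, 8, -3, 18]
```

The inner loop contains one addition for the product term and one for the accumulation, with the anti-logarithm acting as a table lookup in a hardware realization.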
The 2-D convolution formula in LRNS is given as follows
$$y(n_1, n_2) = \sum_{k_1=0}^{n_1} \sum_{k_2=0}^{n_2} \mathrm{ALOG}\big(x_{LRNS}(k_1, k_2) + w_{LRNS}(n_1-k_1, n_2-k_2)\big) \tag{9.3}$$
where $n_1, n_2 \in \{0, 1, ..., 2N-2\}$.
The number of computations in 2-D convolution is usually very large in
binary. By employing LRNS, the number of computations is reduced to
$(2N-1)^2$ additions alone.
9.2.1 LRNS Based Convolver Architecture
The systolic architecture for 1-D convolution employing LRNS is shown in
Figure 9.2. The hardware delay and power analysis of the LRNS based
convolver architecture, and a comparative study against a binary based
architecture, will be taken up.
Figure 9.2: LRNS Based 1-D Convolver Architecture
Chapter 10
Conclusion
The thesis deals with interconnect dominance in FFT architectures and
shows the need for a shift in the design paradigm from device dominated
design to interconnect dominated design methodologies.
A mixed number representation called LRNS is developed by embedding the
sign/log number system into the RNS. The major advantages of this mixed
number representation over the binary, residue and sign/log number systems
are demonstrated. Based on this mixed number system, an arithmetic
processor (MAP) has been designed, and its instruction set, comprising
general purpose and special purpose instructions, has been presented.
MAP is expected to fill the gap between conventional architectures and DSM
technology, achieving reduced interconnect complexity and low power without
sacrificing performance. LRNS can be widely applied in embedded DSP systems
to reduce the computational complexity of multiplication and the
corresponding power. A major application could be the development of a high
performance, low power multidimensional convolver, which has several
important applications in DSP and image processing.
Appendix A
The Generalized Gabor Transform
For a given function f(t), $t \in \mathbb{R}$, the generalized Gabor
transform [20] finds a set of coefficients $a_{mr}$ such that
$$f(t) = \sum_{m=-\infty}^{\infty} \sum_{r=-\infty}^{\infty} a_{mr}\, g(t - mT) \exp\left(\frac{i 2\pi r t}{T'}\right) \tag{A.1}$$
where $T, T' > 0$. In some of the literature,
$$\Omega = \frac{2\pi}{T'} \tag{A.2}$$
is also used in the exponential term in (A.1). When $T\Omega = 2\pi$, or
$T = T'$, (A.1) becomes the original Gabor transform proposed by D. Gabor
[8], which is also called the critical sampling Gabor transform. When
$T\Omega < 2\pi$, or $T < T'$, (A.1) is called the over sampling Gabor
transform.
The 1-D generalized Gabor elementary functions (GEF) are defined as
$$g_{mr}(t) = g(t - mT) \exp\left(\frac{i 2\pi r t}{T'}\right) \tag{A.3}$$
where $T, T' > 0$ as seen above, and $g(t) \in L^2(\mathbb{R})$. For a
given function f(t), $t \in \mathbb{R}$, the 1-D generalized Gabor
transform finds a linear expansion against the above set of Gabor
elementary functions
$$f(t) = \sum_{m=-\infty}^{\infty} \sum_{r=-\infty}^{\infty} a_{mr}\, g_{mr}(t) \tag{A.4}$$
The coefficients $a_{mr}$ are called the Gabor coefficients of the function
f(t). One well known method for computing the Gabor coefficients of an
arbitrary function f(t) is to introduce an auxiliary biorthogonal function
$\gamma(t)$, depending on g(t), such that the Gabor coefficients can be
expressed as [26]
$$a_{mr} = \int_{-\infty}^{\infty} f(t)\, \gamma^*(t - mT) \exp\left(-\frac{i 2\pi r t}{T'}\right) dt \tag{A.5}$$
A.1 1-D Discrete Gabor Transformation
For a finite 1-D signal f(x), x = 0, 1, ..., X−1, X = KM, the complete
Gabor transformation [49] is expressed as
$$f(x) = \sum_{m=0}^{K-1} \sum_{r=0}^{M-1} a_{mr}\, g_{mr}(x) \tag{A.6}$$
That is, a discrete f(x) with KM sample points has KM coefficients
$a_{mr}$, m = 0, 1, ..., K−1, r = 0, 1, ..., M−1.
For any integer M > 0, if g is a real function,
$$g_{mr}(x) = g^*_{m,M-r-1}(x) \tag{A.7}$$
and if both f and g are real functions,
$$a_{mr} = a^*_{m,M-r-1} \tag{A.8}$$
We first introduce two M × M matrices E and E*. For u, v = 0, 1, ..., M−1,
element (u,v) of E is defined as
$$e_{uv} = \exp\left(-\frac{i 2\pi h(u) h(v)}{M}\right) \tag{A.9}$$
E* is the conjugate of E. That is, element (u,v) of E* is defined as
$$e^*_{uv} = \exp\left(\frac{i 2\pi h(u) h(v)}{M}\right) \tag{A.10}$$
It is easy to verify that $EE^* = E^*E = MI$, where I is a unit matrix.
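The identity $EE^* = E^*E = MI$ can be checked numerically with a small Python sketch. The index mapping h is not defined in this appendix, so the half-sample shift h(u) = u + 1/2 used below is purely an assumption made for illustration; the same check also passes for the plain mapping h(u) = u. The helper names are hypothetical.

```python
import cmath


def h(u):
    # Assumed index mapping (hypothetical): a half-sample shift.
    return u + 0.5


def build_E(M):
    """Element (u, v) of E, following the form of (A.9)."""
    return [[cmath.exp(-2j * cmath.pi * h(u) * h(v) / M) for v in range(M)]
            for u in range(M)]


def conjugate(A):
    """Element-wise complex conjugate, following the form of (A.10)."""
    return [[z.conjugate() for z in row] for row in A]


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def check_identity(M, tol=1e-9):
    """Return True if E E* and E* E both equal M times the unit matrix."""
    E = build_E(M)
    E_star = conjugate(E)
    for P in (matmul(E, E_star), matmul(E_star, E)):
        for u in range(M):
            for v in range(M):
                expected = M if u == v else 0
                if abs(P[u][v] - expected) > tol:
                    return False
    return True


print(check_identity(4))   # True
```

Because the product is M times the unit matrix, inverting the block of E* matrices only requires the conjugate matrix E and a scale factor of 1/M.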
We can rewrite the Gabor transformation (A.6) as
$$f(x) = \sum_{l=0}^{KM-1} a_l\, g_l(x) \tag{A.11}$$
where $a_l$ corresponds to the original $a_{mr}$ with
$m = \lfloor l/M \rfloor$ and $r = l \bmod M$, and $g_l(x)$ corresponds in
the same way to $g_{mr}(x)$. The Gabor transformation (A.11) written in
matrix form is
$$f = Ga \tag{A.12}$$
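where
$$f = \begin{bmatrix} f(0) \\ f(1) \\ \vdots \\ f(KM-1) \end{bmatrix} \quad \text{and} \quad a = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_{KM-1} \end{bmatrix} \tag{A.13}$$
and G is a KM × KM matrix. G can be expressed as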
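$$G = \begin{bmatrix} G_{00} & G_{01} & \cdots & G_{0,K-1} \\ G_{10} & G_{11} & \cdots & G_{1,K-1} \\ \vdots & \vdots & & \vdots \\ G_{K-1,0} & G_{K-1,1} & \cdots & G_{K-1,K-1} \end{bmatrix} \tag{A.14}$$
where each $G_{pq}$, p, q = 0, 1, ..., K−1, is an M × M matrix.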
Let $g^{(pq)}_{uv}$ be the element (u,v) of matrix $G_{pq}$.
$$g^{(pq)}_{uv} = g_{qv}(u + pM) = (-1)^{p(M-1)}\, g\big(h(u) + (p-q)M\big) \exp\left(\frac{i 2\pi h(v) h(u)}{M}\right) \tag{A.15}$$
where u, v = 0, 1, ..., M−1. We can see that
$$G_{pq} = (-1)^{p(M-1)} D_{p-q} E^* \tag{A.16}$$
where $D_{p-q}$ is an M × M diagonal matrix with the u-th diagonal element
being
$$d^{(p-q)}_{uu} = g\big(h(u) + (p-q)M\big) \tag{A.17}$$
where u = 0, 1, ..., M−1. Thus
$$f = C D \begin{bmatrix} E^* & 0 & \cdots & 0 \\ 0 & E^* & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & E^* \end{bmatrix} a \tag{A.18}$$
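where
$$C = \begin{bmatrix} I & 0 & 0 & \cdots & 0 \\ 0 & (-1)^{M-1} I & 0 & \cdots & 0 \\ 0 & 0 & I & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & (-1)^{M-1} I \end{bmatrix} \tag{A.19}$$
$$D = \begin{bmatrix} D_0 & D_{-1} & D_{-2} & \cdots & D_{-(K-1)} \\ D_1 & D_0 & D_{-1} & \cdots & D_{-(K-2)} \\ D_2 & D_1 & D_0 & \cdots & D_{-(K-3)} \\ \vdots & \vdots & \vdots & & \vdots \\ D_{K-1} & D_{K-2} & D_{K-3} & \cdots & D_0 \end{bmatrix} \tag{A.20}$$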
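The following is derived from (A.12) (note that $C = C^{-1}$)
$$a = \frac{1}{M} \begin{bmatrix} E & 0 & \cdots & 0 \\ 0 & E & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & E \end{bmatrix} D^{-1} C f \tag{A.21}$$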
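$D^{-1}$ is the inverse of the matrix D. Since most of the elements of D
are zeros, it is relatively easy to find the inverse $D^{-1}$. If g is a
real function, D is a real matrix, which makes the computation even easier.
$D^{-1}$ can be written in a similar block format
$$D^{-1} = \begin{bmatrix} D^{(0)}_{0} & D^{(0)}_{-1} & D^{(0)}_{-2} & \cdots & D^{(0)}_{-(K-1)} \\ D^{(1)}_{1} & D^{(1)}_{0} & D^{(1)}_{-1} & \cdots & D^{(1)}_{-(K-2)} \\ D^{(2)}_{2} & D^{(2)}_{1} & D^{(2)}_{0} & \cdots & D^{(2)}_{-(K-3)} \\ \vdots & \vdots & \vdots & & \vdots \\ D^{(K-1)}_{K-1} & D^{(K-1)}_{K-2} & D^{(K-1)}_{K-3} & \cdots & D^{(K-1)}_{0} \end{bmatrix} \tag{A.22}$$
where each $D^{(p)}_{p-q}$, p, q = 0, 1, ..., K−1, is also an M × M
diagonal matrix.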
Since D is a block Toeplitz matrix, it can be inverted with any Toeplitz
matrix inversion algorithm. Fast algorithms for finite Toeplitz matrix
inversion are suggested in [45], [41] and [1].
The fast Gabor transformation can be viewed in two ways. First, since the
matrices D, E and C are independent of the input signal f, these matrices
and their products can be precalculated for any fixed window function g(t)
and fixed values of K and M, so the Gabor transformation becomes the
product of a constant matrix and the input signal f. Second, we can rewrite
(A.18) and (A.21) as
$$x = Cf \tag{A.23}$$
$$y = D^{-1} x \tag{A.24}$$
$$a = \frac{1}{M} \begin{bmatrix} E & 0 & \cdots & 0 \\ 0 & E & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & E \end{bmatrix} y \tag{A.25}$$
The computation and time complexity of the above equations are discussed
in Chapter 7 for both binary and LRNS. (A.21) can be rewritten as
$$a_m = \frac{1}{M} E \sum_{k=0}^{K-1} (-1)^{k(M-1)} D^{(m)}_{m-k} f_k \tag{A.26}$$
where
$$f_m = \begin{bmatrix} f(0 + mM) \\ f(1 + mM) \\ \vdots \\ f(M-1 + mM) \end{bmatrix} \quad \text{and} \quad a_m = \begin{bmatrix} a_{m0} \\ a_{m1} \\ \vdots \\ a_{m,M-1} \end{bmatrix} \tag{A.27}$$
If both f and g are real functions, E is the only complex matrix in (A.26),
so the real part and the imaginary part of $a_{mr}$ can be calculated
separately. In this case we can easily verify, for m = 0, 1, ..., K−1 and
r = 0, 1, ..., M−1,
$$a_{m,M-r-1} = a^*_{mr} \tag{A.28}$$
This means that half of the coefficients are complex conjugates of the
other half, so only half of them need to be computed.
Appendix B
CAM - Content Addressable Memories
A CAM (Content-Addressable Memory) is an advanced memory device that has
many applications. It is highly advantageous in applications that require
fast searches of a database, list or pattern, such as image and voice
recognition or computer and communication networks. CAMs obtain an
order-of-magnitude reduction in search time over other memory search
algorithms, such as binary or tree-based searches, by simultaneously
comparing the desired information against the entire list of stored values.
The working of a CAM is better understood by comparing it with a RAM. RAM
is an acronym for Random Access Memory, which emphasizes the ability to
examine each stored item of data independently of any other. Data is stored
at a particular location, called an address. In a RAM, the address is
supplied and the data at that location is retrieved. The depth of the
memory, or number of locations, is limited by the ability to address the
memory.
Figure B.1: Typical CMOS SRAM Memory Cell
For example, if the address bus is eight bits wide, only 256 memory
locations can be addressed, since in binary arithmetic 2^8 = 256. Binary
logic is used because signal lines normally have only two states, HIGH and
LOW.
RAM chips are composed of arrays of cells of transistors. Each cell repre-
sents one bit and contains one or more transistors depending on the type of
RAM. CMOS Static RAMs commonly use six transistors per cell, as shown
in figure B.1; four are cross-coupled to store the state of the bit, and two are
used to alter or read out the state of the bit.
This configuration is called Static because the state of the bit remains
at one level or the other until deliberately changed, or power is removed.
Dynamic RAMs, on the other hand, get their name from the transient nature
of their storage mechanism, which commonly consists of a single transistor
along with a capacitor to store the bit information.
During a read, the charge on the capacitor is drained to the bit line, re-
quiring a rewrite of the bit, called a restore operation. Additionally, because
the DRAM capacitor is not perfect, it loses charge over time, and needs to
have its charge refreshed at regular intervals. Thus, dynamic memories are
accompanied by controller circuits to rewrite the bit and refresh the stored
charge on a regular basis.
Neither SRAMs nor DRAMs retain information when power is removed,
but SRAMs are often used to store important configuration information, with
battery back-up as SRAM does not require refreshing.
CAMs are organized differently. In a CAM, data is stored in locations in a
somewhat random fashion. The locations can be selected by an address bus,
or the data can be written directly into the first empty location, because
every location has a pair of special status bits. These bits keep track of
whether the location holds valid information or is empty and available for
overwriting. The data stored in memory is located by comparing every bit in
memory with data placed in a special Comparand register. If there is a
perfect match for every bit that is compared, a Match line is asserted to
indicate that the data in the Comparand register is found in memory.
A priority encoder is used to retrieve the address of the matching location
that has highest priority. Thus, with a CAM, the data is supplied and the
address is retrieved. As the CAM doesn’t need address lines to find data, the
depth of a memory system using CAMs can be extended as far as desired,
but the width (wordsize) is limited by the size of the chip. The depth can be
easily extended as the addressing is all self-contained. Extending the width
takes additional routines due to the difficulty in extending 1024 match lines
from chip to chip.
CAMs are based on memory cells that have been modified by the addition
of extra transistors that compare the state of the bit stored with the state
stored in a Comparand register. Logically, CAMs perform an exclusive-NOR
function, so that a match is only indicated if both the stored bit and the
corresponding Comparand bit are in the same state.
The static CAM cell shown in Figure B.2 is composed of a six-transistor
SRAM memory cell plus four transistors that implement the exclusive-NOR
function and drive the Match line.
For writing and reading, each static CAM cell acts like a normal SRAM
cell, with differential bit lines to latch the value into the cell when
writing, and sense amplifiers to detect the stored value when reading. When
writing, the word line is energized, turning on the pass transistors, which
then force the cross-coupled transistors to the levels on the bit lines.
When the word line is de-energized, the cross-coupled transistors remain in
the same states. For reading, the bit lines are precharged to the same
intermediate voltage level, the word line is energized, and the bit lines
are forced to the levels stored by the cross-coupled transistors. The sense
amplifiers respond to the difference in the bit lines and report the stored
state.
Figure B.2: Typical CAM memory cell
For comparing, the match line is precharged to a high level, the bit lines
are driven by the levels of the bit stored in the Comparand register, but the
word line is not energized, so the state of the cross-coupled transistors is not
affected. The exclusive-NOR transistors compare the internally stored state
of the cross-coupled transistors with the levels of the Comparand bit, and if
they don’t agree, the Match line is pulled down, indicating a non-matching
bit. All the bits in a stored entry are connected to the same Match line, so
that if any bit in a word doesn’t match with its corresponding Comparand
bit, that Match line is pulled down. Only the entries where the Match line
stays HIGH are considered matches. All the Match lines are fed to a Priority
encoder that determines whether any match exists, whether more than one
match exists, and which matching location is considered the highest priority.
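The compare operation described above can be modelled behaviourally in Python. This is a simplified sketch, not a circuit-level description: the class name `CAM`, its methods, the single valid flag per location (standing in for the pair of status bits) and the rule that the lowest address has the highest priority are all assumptions made for illustration.

```python
class CAM:
    """Behavioural model of a content-addressable memory."""

    def __init__(self, depth, width):
        self.mask = (1 << width) - 1
        self.words = [0] * depth
        self.valid = [False] * depth      # simplified status bit per location

    def write(self, data):
        """Store data in the first empty location and return its address."""
        for addr, used in enumerate(self.valid):
            if not used:
                self.words[addr] = data & self.mask
                self.valid[addr] = True
                return addr
        raise MemoryError("CAM is full")

    def match_lines(self, comparand):
        """One Match line per location: HIGH only if every bit agrees."""
        lines = []
        for word, used in zip(self.words, self.valid):
            xnor = ~(word ^ comparand) & self.mask    # bitwise exclusive-NOR
            lines.append(used and xnor == self.mask)
        return lines

    def search(self, comparand):
        """Priority encoder: address of the first match, or None."""
        for addr, hit in enumerate(self.match_lines(comparand)):
            if hit:
                return addr
        return None


cam = CAM(depth=4, width=8)
for value in (0x3C, 0xA5, 0x3C):
    cam.write(value)
print(cam.match_lines(0x3C))   # [True, False, True, False]
print(cam.search(0xA5))        # 1
print(cam.search(0xFF))        # None
```

In hardware all Match lines are evaluated simultaneously; the loop in `match_lines` only emulates that parallel comparison sequentially.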
Although a CAM is expensive and consumes high power, it is employed
wherever high performance is an important criterion.
Index
Additive inverse, 85
Anti-log ROM, 117
Area-Time(AT) product, 32
Base extension, 91
Binary number system, 75
multiplication, 77
representation, 86
SRT division, 78
subtraction, 76
Binary number system(BNS), 103
Biorthogonal function, 148
Block Toeplitz matrix, 123, 152
Brent-Kung CLA, 42
hardware complexity, 42
Brent-Kung dot operator, 36
Buffer insertion, 17
Carry save adder(CSA), 45
delay, 46
Chinese Remainder Theorem(CRT),
86
Chinese remainder theorem(CRT),
109
Chip power, 24
Code conversion
RNS to BNS, 86
RNS to MRNS, 89
Code conversions, 85
BNS to RNS, 85
Comparand Register, 156
Complement, 76
Complete Gabor transform, 148
Complete Gabor transformation, 121
Content Addressable memories(CAM),
154
Static CAM, 157
Content Addressable Memory(CAM),
109
CORDIC, 6
CSA tree
delay, 49
Decimal number system, 73
Deep submicron
effects, 14
Deep submicron(DSM)
impact, 22
DFT Vs FFT, 72
Diminished radix complement, 76
Discrete Fourier Transform(DFT),
2
algorithm, 3
architecture, 4
hardware count, 56
HIPD, 60
HPD, 57
interconnect count, 56
IPD, 59
multidimensional, 11
algorithm, 12
pipeline architecture, 13
systolic architecture, 13
twodimensional, 13
Discrete Gabor transform, 121
Discrete Gabor transform
multidimensional, 126
Dynamic range, 82, 83, 111
Fast Fourier Transform(FFT), 7
algorithm, 7
interconnect complexity, 62
array processor, 9
clock broadcasting, 70
DIF, 7
DIT, 7
HPD, 68
IPD, 69
parallel processor, 9
pipeline processor, 9, 11
sequential processor, 9
Floating point, 76
Fourier Transform(FT), 1, 2
Full adder, 32
Gabor coefficients, 148
Gabor elementary functions, 148
Gabor transform, 121
binary, 124
critical sampling, 147
LRNS, 125
matrix, 122, 149
over sampling, 147
Generalized Gabor transform, 147
Hardware power delay product(HPD)
Brent-Kung structure, 44
CSA tree, 49
parallel array multiplier, 51
Wallace tree, 53
Hardware power-delay product(HPD)
ripple adder, 41
serial adder, 39
High level characterization
adders, 24, 25
multipliers, 25
Interconnect
delay, 15
scaling effects, 15
Interconnect complexity, 24
functional level, 25
gate level, 25
Interconnect complexity(IC)
ripple adder, 41
Interconnect count(IC), 31
Brent-Kung structure, 44
Brent-Kung dot operator, 36
full adder, 32
parallel array multiplier, 50
wallace tree, 53
interconnect count(IC)
CSA tree, 48
Interconnect delay, 31
Brent-Kung dot operator, 36
full adder, 35
Interconnect delay product(IDP), 32
Brent-Kung dot operator, 36
Brent-Kung structure, 44
CSA tree, 49
full adder, 35
parallel array multiplier, 51
ripple adder, 41
serial adder, 39
Wallace tree, 53
Interconnect power, 32
Brent-Kung dot operator, 36
full adder, 35
Interconnect power delay product(IPD)
Brent-Kung dot operator, 36
Brent-Kung structure, 44
CSA tree, 49
full adder, 35
parallel array multiplier, 51
serial adder, 40
Wallace tree, 53
Interconnect power-delay product(IPD),
32
ripple adder, 41
ITRS, 14, 17
Log curve, 97
Log ROM, 117
Logarithmic number system(LNS),
20, 95, 103
addition, 100
disadvantages, 21, 113
division, 99
multiplication, 99
representation, 95
subtraction, 100
truncation error, 113
Logarithmic Residue Number System(LRNS), 23
Logarithmic residue number system(LRNS),
104
accuracy, 113–115
addition, 104, 110
delay, 111
hardware complexity, 110
interconnect count(IC), 110
power, 110
advantages, 106
delay, 114
DFT architecture, 117
multiplication, 111, 112
delay, 112
hardware complexity, 112
hardware count, 112
interconnect complexity, 113
power, 113
power, 114
radix-4 butterfly, 117
subtraction, 104, 110
Logarithms
generation, 97
algorithm, 97
hardware, 98
Low level characterization
adders, 24
multipliers, 24
MAP, 128
architecture, 129
data memory, 129
execution flow
LRM, 135
RAD/RSU, 133
SLM/SLD, 135
general purpose functional units,
131
instruction set, 132
LNS instructions, 133
LRNS instructions, 133
RNS instructions, 132
special purpose functional units,
131
timing
LRM, 137
RAD, 136
RSU, 136
SLD, 137
Maximum interconnect length(MIL),
31
Mixed number system, 21
arithmetic processor, 23
Mixed radix system, 88
Mixed-radix systems, 74
Modulus, 81, 84
Multiple operand addition, 45, 46
Multiplication, 49
Multiplicative inverse, 85, 87
Number Systems
range, 73
redundancy, 74
uniqueness, 74
Overflow detection, 90
Parallel array multiplier, 50
delay, 51
hardware complexity, 50
Power equation, 57
Power reduction, 19
Power-delay product, 57
Pre-checker circuit, 106
Priority Encoder, 157
Radix 4 FFT architecture, 64
computational element, 64
delay commutator, 64
Radix complement, 76
Random Access Memory (RAM)
Dynamic RAM(DRAM), 156
Static RAM(SRAM), 155
Random Access Memory(RAM), 154
Read Only Memory(ROM), 95, 102
Refresh, 156
Residue number system(RNS), 20,
81
addition, 92
advantages, 92, 103
drawback, 103
drawbacks, 21, 94
multiplication, 92
negative numbers, 83
representation, 81
representational efficiency, 83
subtraction, 92
Restore, 156
Ripple carry adder, 40
delay, 41
Rounding off error, 96
Scaling, 91
Scaling factor, 95
Serial adder, 38
delay, 38
Short Term Fourier Transform (STFT),
120
Sign bit, 75, 77, 83, 100
Sign magnitude, 76
Signal processing, 1
Single precision, 77
Straight line approximation, 97
Truncation error, 96
Wallace tree multiplier, 51
Weighted Number System, 74
Bibliography
[1] A.K.Jain, Fast inversion of banded toeplitz matrices by circular decom-
position, IEEE Trans. Acoust., Speech, Signal Processing ASSP-26.
[2] Semiconductor Industry Association, International technology roadmap
for semiconductors: 1999, Austin, TX: SEMATECH (1999).
[3] H. Bakoglu, Circuits, interconnections, and packaging for vlsi, Reading,
MA: Addison-Wesley, 1990.
[4] C.R. Baugh and B.A. Wooley, A two’s complement parallel array
multiplier, IEEE Trans. Comput. C-22.
[5] G.A. Jullien Ben-Dau Tseng and William C.Miller, Implementation of
fft structures using the residual number system, IEEE Trans. Comput.
C-28.
[6] T. S. Chang and C.W. Jen, Hardware efficient transform designs with
cyclic convolution and subexpression sharing, Proc. ISCAS (1998), 398–
401.
[7] Wai-Kai Chen, The vlsi handbook, CRC Press LLC, Florida, 2000.
[8] D.Gabor, Theory of communications, J.Inst.Elec.Engr. 93.
[9] E.E.Swartzlander and A.G.Alexopolous, The sign/logarithm number
system, IEEE Trans. Comput. C-24.
[10] C.J. Anderson et al., Physical design of a fourth-generation power ghz
microprocessor, Proceedings of the IEEE International Solid-State Cir-
cuits Conference (2001), 232–233.
[11] P. E. Gronowski et al., High-performance microprocessor design, IEEE
Journal of Solid-State Circuits 33.
[12] A. Skavantzos F.J. Taylor, G. Papadourakis and A. Stouratis, A radix-4
fft using complex rns arithmetic, IEEE Trans. Comput. C-34.
[13] M.A. Franklin, Vlsi performance of banyan and crossbar communication
networks, IEEE Trans. Comput. C-30.
[14] M. Re G. C. Cardarilli and R. Lojacono, Rns-to-binary conversion for
efficient vlsi implementation, IEEE Trans. Circuit Syst 47.
[15] Harvey L. Garner, The residue number system, IRE Transactions on
Electronic Computers (1959), 140–147.
[16] I. Gertner and M. Shamash, Vlsi architectures for multidimensional
fourier transform processing, IEEE Trans. Comput. C-36.
[17] John L Hennessy and David Patterson, Computer architecture-a quan-
titative approach, Morgan Kaufmann Publishers, Inc., California, 2000.
[18] T. Aboulnasr J. A. Beraldin and W. Steenart, Efficient one-dimensional
systolic array realization of discrete Fourier transform, IEEE Trans. on
Circuits and Systems 36(1) (1989), 95–100.
[19] C.M. Liu J. I. Guo and C.W.Jen, The efficient memory-based vlsi array
designs for DFT and DCT, IEEE Trans. on Circuits and Systems Part
II, 39(10) (1992), 723–733.
[20] Patrick Krolak Jie Yao and Charlie Steele, The generalized gabor trans-
form, IEEE Trans. Image Processing 4.
[21] J.W.Cooley and J.W.Tukey, An algorithm for the machine calculation
of complex Fourier series, Math. Comput. 19.
[22] K. Kocsis, A fully pipelined high speed DFT architecture, Proc. ICASSP
(1991), 1569–1572.
[23] S.Y Kung, VLSI array processors, Prentice Hall, New Jersey, 1988.
[24] L. Hartimo L. Wang and T.Laakso, A novel double decomposition method
for systolic implementation of DFT, Proc. ISCAS (1992), 1085–1088.
[25] H.S. Lim and Jr. E.E. Swartzlander, A systolic array for 2-D DFT and
2-DDCT, Proc. Int. Conf. Application-Specific Array Processors (1994),
123–131.
[26] M.Bastiaans, Gabor’s expansion of a signal into gaussian elementary
signals, Opt.Eng. 20.
[27] Mahesh Mehendale and Sunil D.Shrelekar, Vlsi synthesis of dsp kernels
algorithmic and architectural transformations, Kluwer Academic Pub-
lishers, Boston, 2001.
[28] J. Meindl, Low power microelectronics: Retrospective and prospect, Proc.
IEEE 83.
[29] J.N. Mitchell, Computer multiplication and division using binary loga-
rithms, IRE Transactions on Electronic Computers EC-11.
[30] N. Venkateswaran, S. Praveen, R. Subramanian and Vasanth Ramesan,
Emerging impact of DSM technology of DFT and FFT architectures,
selected for publication at International Signal Processing Conference,
Dallas, USA (2003).
[31] N. Venkateswaran, Vasanth Ramesan, R. Subramanian and S. Praveen,
A mixed number system based low power, high performance arithmetic
processor for DSP applications, selected for publication at International
Signal Processing Conference, Dallas, USA (2003).
[32] Peter Pirsch, Architectures for digital signal processing, John Wiley and
Sons, New York, 1998.
[33] John G. Proakis and Dimitris G. Manolakis, Digital signal processing:
Principles, Algorithms and Applications, Prentice Hall of India, New
Delhi, 1997.
[34] H. T. Kung R. P. Brent, A regular layout for parallel adders, IEEE
Trans. Comput. C-31.
[35] T.D. Roziner and M.G. Karpovsky, Multidimensional fourier transform
by systolic architectures, J. VLSI Signal Process. 4.
[36] I. Sedukhin S. Peng and S. Sedukhin, Design of array processors for 2-d
discrete Fourier transform, IEICE Trans. Inform. Syst. E80-D.
[37] H.S. Stone, Parallel processing with perfect shuffle, IEEE Trans. Com-
put. C-20.
[38] M. Sundaramurthy and V. Umapathy Reddy, Some results in fixed point
fast Fourier transform error analysis, IEEE Trans. Comput. C-26.
[39] Dennis Sylvester and Kurt Keutzer, A global wiring paradigm for deep
submicron design, IEEE Transactions on Computer Aided Design of
Integrated Circuits and Systems 19.
[40] N.S. Szabo and R.I Tanaka, Residue arithmetic and its applications to
computer arithmetic, Mc Graw-Hill, New York, 1967.
[41] S.Zohar, Toeplitz matrix inversion: The algorithm of w.f.trench,
J.Assoc.Comput.Mach. 16.
[42] John C Becker Thomas A. Brubaker, Multiplication using logarithms
implemented with read- only memory, IEEE Trans. Comput. C-24.
[43] C. S. Wallace, A suggestion for a fast multiplier, IEEE Trans. Elec.
Comput. EC-13.
[44] M. O. Ahmad Wei Wang, M. N. S. Swamy and Yuke Wang, A high-speed
residue-to-binary converter for three-moduli (2^k, 2^k − 1, 2^(k−1) − 1)
rns and a scheme for its vlsi implementation, IEEE Trans. Circuit Syst 77.
[45] W.F.Trench, An algorithm for the inversion of finite toeplitz matrices,
J.SIAM 12.
[46] S. A. White, Applications of distributed arithmetic to digital signal pro-
cessing: a tutorial review, IEEE ASSP Magazine 6 (1989), 4–19.
[47] E. G. Friedman Y. I. Ismail and J. L. Neves, Exploiting on-chip induc-
tance in high speed clock distribution networks, IEEE Transactions on
Very Large Scale Integration (VLSI) Systems 9.
[48] E. G. Friedman Y. I. Ismail and J. L. Neves, Figure of merit to
characterize the importance of on-chip inductance, IEEE Transactions on
Very Large Scale Integration (VLSI) Systems 7.
[49] Jie Yao, Complete gabor transformation for signal representation, IEEE
Trans. Image Processing 2.
[50] Sungwook Yu and Jr. Earl.E.Swartzlander, A pipelined architecture for
the multidimensional dft, IEEE Trans. Signal Processing 9.