
Interconnect Dominant Design Methodology

for DSP Architectures

- A Mixed Number System Based Approach

Subramanian Rama

Vasanth Ramesan

Praveen Sathyanarayanan

A Thesis

Submitted to

Waran Research Foundation (WARF)

In Partial Fulfillment of the

Requirements for the

Research Training Program at WARF

March 2003

Interconnect Dominant Design Methodology

for DSP Architectures

- A Mixed Number System Based Approach

Work done by

Subramanian Rama

Vasanth Ramesan


Approved by:

Prof. N. Venkateswaran

Director WARF

Date Approved

Acknowledgement

We are eternally grateful to our guru and mentor Prof. N. Venkateswaran,

whom we affectionately call Waran, for initiating us into research. If it

were not for the countless hours of discussions that we had with him, this

thesis would have remained a dream. He is a constant source of inspiration

for numerous students like us. We will always treasure his philosophy and

friendship.

We are thankful to our institution Sri Sivasubramaniya Nadar College of

Engineering for their support and encouragement. We appreciate the help

rendered by Mr. Mahalingam Venkataraman of the VLSI Test group at WARF

in carrying out some of our simulations.

We would also like to thank Prof. Earl Swartzlander Jr. of the University

of Texas at Austin for clarifying our queries and Mr. Patrick Lysaght,

Senior Director, Xilinx Research Labs, for his encouragement of our research

proposals.

We are indebted to our parents for putting up with our odd working

hours. Their moral support has helped us stay focussed.

We thank the Almighty for giving us the strength and confidence to pursue

our goals.

Subramanian Rama

Vasanth Ramesan


To our parents

and

Guru Prof N. Venkateswaran

Abstract

In deep submicron (DSM) technology, gates have become smaller and faster,

whereas the amount of interconnect on a chip used to connect these small

and fast gates has grown exponentially. The ratio of interconnect delay to

gate delay continues to increase in favor of interconnect delay as DSM designs

continue to get smaller. The result is a shift in the design paradigm based

on interconnect delay dominance.

Buffer insertion techniques have been successful in reducing interconnect delay. However, the inserted buffers consume power and occupy a large amount of chip area. The power consumed by these delay-optimal devices and wires will increase as we go further into the DSM era.

This thesis investigates the DSM issues in the design of DSP algorithms and architectures. The DSM issues have been analyzed in great depth with respect to interconnect dominance in FFT algorithms and architectures, as well as in DFT. One of the main findings of the thesis is that the FFT architectures suffer from a high degree of interconnect dominance, making them unsuitable for DSM technology when compared with DFT.

High performance, accuracy and low power are the most important design

parameters of DSP architectures. In DSM based technology, while high performance

can be achieved, power becomes a critical factor, which needs either

a new architecture or even a new number representation. The computational

complexity of DSP algorithms leads to high power consumption particularly

in high performance applications. An architecture for Arithmetic Processor

based on a mixed number representation is presented. Here, the sign/log

number system is embedded into the residue number system. It is shown

that this mixed number representation called Logarithmic Residue Number

System (LRNS) achieves low power and high performance over the Binary,

Residue and sign/log number systems. It is further shown that unlike the

sign/log number system, LRNS maintains an accuracy of within 1 percent of

the binary number system. A special purpose power efficient instruction set

for the processor is proposed.

The work presented in this thesis is expected to help in developing high

performance low power DSP systems. As a case study, LRNS is shown to

reduce the computational complexity in time frequency transforms like the

Gabor.


Contents

Abstract

1 Introduction
  1.1 Discrete Fourier Transform
    1.1.1 Algorithm
    1.1.2 Architecture
  1.2 Fast Fourier Transform
    1.2.1 Algorithm
    1.2.2 Architecture
  1.3 Multidimensional DFT
    1.3.1 Algorithm
    1.3.2 Architecture
  1.4 DSM Technological Issues
    1.4.1 Interconnect Dominance
    1.4.2 Effect of Interconnects on Delay and Power
  1.5 Influence of Number System on Power
  1.6 Contribution of the Thesis
    1.6.1 FFT Vs DFT Interconnect Analysis
    1.6.2 Proposed Mixed Number Representation

2 Interconnect Complexity, Power and Delay of Arithmetic Units
  2.1 Interconnect Complexity of Basic Functional Blocks
    2.1.1 Full Adder
    2.1.2 Brent-Kung Dot Operator
  2.2 Adders
    2.2.1 Serial Adder
    2.2.2 Ripple Carry Adder
    2.2.3 Brent-Kung Carry Lookahead Adder
    2.2.4 Carry Save Adder
  2.3 Multipliers
    2.3.1 Parallel Array Multiplier
    2.3.2 Wallace Tree Multiplier

3 DFT: Power-Delay Analysis
  3.1 Interconnect and Hardware Complexity
  3.2 Hardware Power-Delay Analysis
  3.3 Interconnect Power-Delay Analysis

4 FFT: Power-Delay Analysis
  4.1 Interconnect and Hardware Complexity
    4.1.1 FFT Algorithm
    4.1.2 FFT Architecture
  4.2 Hardware Power-Delay Analysis
  4.3 Interconnect Power-Delay Analysis
  4.4 DFT and FFT Architectures - A DSM Perspective

5 Number Systems and DSP
  5.1 Characteristics of Number Systems
  5.2 Binary Number Systems
    5.2.1 Algorithms for multiplication and division
    5.2.2 Multiplication
    5.2.3 Division
  5.3 Residue Number Systems
    5.3.1 RNS representation of numbers
    5.3.2 Negative Number Representation
    5.3.3 Arithmetic Identities
    5.3.4 Code Conversions
    5.3.5 Conversion from RNS to BNS - The Chinese Remainder Theorem
    5.3.6 Arithmetic operations in RNS
  5.4 Logarithmic Number Systems
    5.4.1 LNS Representation
    5.4.2 Generation of logarithms for binary numbers
    5.4.3 Arithmetic Operations

6 Logarithmic Residue Number System
  6.1 Arithmetic operations
    6.1.1 Addition and Subtraction
    6.1.2 Multiplication
  6.2 LRNS: Area, Power, Performance
    6.2.1 LRNS vs Binary
      6.2.1.1 Addition/Subtraction
      6.2.1.2 Multiplication
    6.2.2 LRNS vs RNS
      6.2.2.1 Multiplication
    6.2.3 LRNS vs Sign/Log
      6.2.3.1 Addition/Subtraction
      6.2.3.2 Multiplication
  6.3 Accuracy Analysis
  6.4 LRNS Architecture for DFT and FFT

7 LRNS in Time-Frequency Transforms
  7.1 The 1-D Discrete Gabor Transform
  7.2 LRNS in Gabor Transform

8 Mixed number system Arithmetic Processor - MAP
  8.1 MAP Architecture
    8.1.1 Special Purpose Functional Units
    8.1.2 General Purpose Functional Units
  8.2 Instruction Set
  8.3 Execution Flow for MAP Instructions
  8.4 Verilog Simulation of MAP Instruction Set

9 Future Work
  9.1 Reconfigurable FFT Architecture for different Radices
  9.2 Low Power High Performance LRNS based Convolver Design
    9.2.1 LRNS Based Convolver Architecture

10 Conclusion

A The Generalized Gabor Transform
  A.1 1-D Discrete Gabor Transformation

B CAM - Content Addressable Memories

List of Figures

1.1 DFT Architecture - Pipelined Inner Product Processor (PIPP)
1.2 Sequential Processor
1.3 Pipeline Processor
1.4 Parallel Processor
1.5 Array Processor
1.6 Scaling effects on interconnect time delay limits
1.7 Delay for Local and Global Wiring versus Feature Size
1.8 Power for all repeaters and global interconnect where 50 percent of all devices are logic
2.1 Low-level Characterization of Adders
2.2 High-level Characterization of Adders - Area
2.3 High-level Characterization of Adders - Power-Delay Product
2.4 Characterization of Multipliers
2.5 Characterization of Multipliers - High level
2.6 Interconnects at Gate Level
2.7 A full adder
2.8 Gate implementation of a full adder
2.9 Gate implementation of a dot operator
2.10 A Serial Adder
2.11 A ripple carry adder
2.12 Carry Block of a Brent Kung CLA (n=8)
2.13 Carry Save Adder
2.14 Carry Save Adder Tree for eight operands
2.15 Parallel Array Multiplier
2.16 Multiplier Based on Wallace Tree
4.1 Algorithm level interconnect complexity of FFT
4.2 Radix 4 FFT architecture for 256 sample points
4.3 Radix 4 Computational Element
4.4 Delay Commutator Circuit
4.5 Communication Complexity of Clock Distribution in the DC Flip-Flops
5.1 Multiplier Hardware
5.2 Ancient verse of the Chinese Remainder Theorem
5.3 Generalized Block Arithmetic for Addition, Subtraction and Multiplication
5.4 Straight Line Approximation to Logarithmic Curve
5.5 Machine Organization to generate and use binary Logs
5.6 LNS Multiplier Divider Hardware
5.7 Hardware for Logarithmic Addition and Subtraction
6.1 Execution flow for addition/subtraction operations in LRNS
6.2 Execution Flow for Multiplication in LRNS
6.3 Flow chart depicting Architecture for Multiplication
6.4 PE of a 1-D DFT array
6.5 PE of a radix-4 FFT architecture
7.1 Computational flow for finding the Gabor Coefficients using Binary and LRNS
8.1 Mixed number system Arithmetic Processor (MAP)
8.2 Special purpose MAP Instruction Format
8.3 Execution Flow for RAD/RSU Instruction
8.4 Execution Flow for SLM/SLD Instruction
8.5 Execution Flow for LRM Instruction
8.6 Timing diagram for RAD Instruction
8.7 Timing diagram for RSU Instruction
8.8 Timing diagram for LRM Instruction
8.9 Timing diagram for SLD Instruction
9.1 Reconfigurable FFT Architecture
9.2 LRNS Based 1-D Convolver Architecture
B.1 Typical CMOS SRAM Memory Cell
B.2 Typical CAM memory cell

List of Tables

1.1 Interconnect and Transistor Scaling Properties
1.2 Average Logic Transitions in Multiplication and Division using LNS
3.1 Hardware and Interconnect Count for 1024 point DFT
3.2 Hardware and Interconnect Count for 4096 point DFT
3.3 Power-Delay Product for 1024 point DFT
3.4 Power-Delay Product for 4096 point DFT
4.1 Hardware and Interconnect Complexity for Radix 4 FFT
4.2 Hardware and Interconnect Complexity for Radix 8 FFT
4.3 Power-Delay Product for Radix-4 FFT
4.4 Power-Delay Product for Radix-8 FFT
5.1 RNS Representation Example
6.1 Performance variation for addition operation
6.2 Performance variation for multiplication operation
6.3 LRNS vs RNS: Performance variation for multiplication operation
6.4 LRNS vs LNS: Performance variation for add/sub operation
6.5 Performance variation for multiplication operation
6.6 Accuracy Analysis for 8 Sample point FFT-1
6.7 Accuracy Analysis for 8 Sample point FFT-2

Chapter 1

Introduction

Signal processing incorporates the acquisition, preparation and analysis

of signals. Advancements in digital computing have shifted the focus from

analog to digital signal processing (DSP) techniques. DSP refers to the

operation on discrete time, discrete amplitude signals.

One of the most fundamental operations in signal processing is transformation of signals from one domain to another. The Fourier transform (FT) is a mathematical tool that is used in the analysis and design of linear time invariant systems. The FT is based on the discovery that it is possible to take any periodic function of time x(t) and resolve it into an equivalent infinite summation of sine and cosine waves with frequencies that start at 0 and increase in multiples of a base frequency $f_0 = 1/T$, where T is the period of x(t).


1.1 Discrete Fourier Transform

The Fourier Transform X(ω) of a discrete signal x(n) is a continuous function of frequency and therefore is not computationally convenient. Representation of a sequence x(n) by samples of its spectrum X(ω) leads to the discrete Fourier transform (DFT). The DFT forms the basis of many application fields such as spectral analysis, digital filtering, image processing and video transmission. At present, 1-D and 2-D FT are of prime importance in speech processing, spectrum analysis, tomography and image processing. The 3-D FT is needed in nuclear magnetic resonance imaging algorithms. Hence, it is often desirable in modern signal processing applications to perform two-dimensional or higher dimensional Fourier Transforms. The increasing demand on the speed of performing such transforms necessitates efficient very large scale integration (VLSI) implementations.

1.1.1 Algorithm

A finite-duration sequence x(n) of length L has a FT [33]

$$X(\omega) = \sum_{n=0}^{L-1} x(n)\, e^{-j\omega n}, \qquad 0 \le \omega \le 2\pi \tag{1.1}$$

where the upper and lower indices in the summation reflect the fact that x(n) = 0 outside the range 0 ≤ n ≤ L − 1. When we sample X(ω) at equally spaced frequencies $\omega_k = 2\pi k/N$, k = 0, 1, 2, ..., N − 1, where N ≥ L, the resultant samples are

$$X(k) \equiv X\!\left(\frac{2\pi k}{N}\right) = \sum_{n=0}^{L-1} x(n)\, e^{-j2\pi kn/N} \tag{1.2}$$

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j2\pi kn/N}, \qquad k = 0, 1, 2, \ldots, N-1 \tag{1.3}$$

where, for convenience, the upper index in the sum has been increased from L − 1 to N − 1 since x(n) = 0 for n ≥ L.

The relation in (1.2) is a formula for transforming a sequence x(n) of

length L ≤ N into a sequence of frequency samples X(k) of length N. Since

the frequency samples are obtained by evaluating the Fourier transform X(ω)

at a set of N (equally spaced) discrete frequencies, the relation in (1.2) is

called the DFT of x(n).

For a complex-valued sequence x(n) of N points, the DFT is expressed as

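$$X_R(k) = \sum_{n=0}^{N-1} \left[ x_R(n) \cos\frac{2\pi kn}{N} + x_I(n) \sin\frac{2\pi kn}{N} \right] \tag{1.4}$$

$$X_I(k) = -\sum_{n=0}^{N-1} \left[ x_R(n) \sin\frac{2\pi kn}{N} - x_I(n) \cos\frac{2\pi kn}{N} \right] \tag{1.5}$$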

The direct computation of (1.4) and (1.5) requires:

1. $2N^2$ evaluations of trigonometric functions.

2. $4N^2$ real multiplications.

3. $4N(N-1)$ real additions.
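The direct evaluation of (1.4) and (1.5) can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation studied in this thesis; the function names `direct_dft` and `reference_dft` are hypothetical. The sketch tallies the trigonometric evaluations and real multiplications it performs, so the first two counts in the list above can be checked for a small N.

```python
import cmath
import math


def direct_dft(x_re, x_im):
    """Evaluate (1.4) and (1.5) directly.

    Returns the real parts, the imaginary parts, the number of
    trigonometric evaluations and the number of real multiplications.
    """
    n_points = len(x_re)
    big_x_re, big_x_im = [], []
    trig_evals = 0
    real_mults = 0
    for k in range(n_points):
        acc_re = 0.0
        acc_im = 0.0
        for n in range(n_points):
            angle = 2.0 * math.pi * k * n / n_points
            c = math.cos(angle)
            s = math.sin(angle)
            trig_evals += 2
            # One term of (1.4): x_R cos + x_I sin
            acc_re += x_re[n] * c + x_im[n] * s
            # One term of (1.5): -(x_R sin - x_I cos)
            acc_im -= x_re[n] * s - x_im[n] * c
            real_mults += 4
        big_x_re.append(acc_re)
        big_x_im.append(acc_im)
    return big_x_re, big_x_im, trig_evals, real_mults


def reference_dft(x):
    """Complex-exponential form (1.3), used only as a cross-check."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]
```

For a 4-point input the sketch performs 2 · 4² = 32 trigonometric evaluations and 4 · 4² = 64 real multiplications, in line with items 1 and 2 above.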


1.1.2 Architecture

The high computational complexity of DFT has led to the evolution of

efficient algorithms for VLSI implementation. Though many architectures

have been proposed for DFT, the most important one is the matrix-column

vector multiplication architecture.

The conventional 1-D DFT architecture executes matrix-column vector multiplication (see Figure 1.1). The basic processing unit of this non-systolic architecture is the inner product processor. The column vector (sample points) is preloaded in the inner product processor and the rows of the twiddle factor matrix are pipelined. For an N-point 1-D DFT ($N = \gamma^j$), the inner product processor has γ input inner product terms. A parallel array multiplier using a carry save adder (CSA) tree is employed in the inner product processor to achieve a pipelining delay of $t_{add}$, equal to the delay of one full adder stage of the CSA tree. This pipeline rate is achievable provided that the delay of the final stage carry propagate adder (CPA) is less than $t_{add}$. To achieve high-speed addition, a Brent-Kung carry lookahead adder (CLA) is employed. The number of pipelining cycles for the pipelined inner product processor (PIPP) is $j \times \gamma^j$.

Figure 1.1: DFT Architecture - Pipelined Inner Product Processor (PIPP)

For example, to compute a 1024-point 1-D DFT, we can partition the twiddle factor matrix into 64 blocks of 16 × 16 sub-matrices. The column vector is also partitioned into matrices of size 16 × 1. The multiplication of a 16 × 16 sub-matrix and a 16 × 1 column matrix is performed in the inner product processor, in which the matrix addition is carried out in the Brent-Kung accumulator.

Another way to reduce the hardware cost of implementing the DFT is to use the coordinate rotation digital computer (CORDIC) technique [22] [24]. The idea behind CORDIC computation is that the desired rotation angle is decomposed into the weighted sum of a set of predefined elementary rotation angles, so that rotation through each of them can be accomplished using simple shift-and-add operations. The drawback of a CORDIC-based design is its slow computing speed.
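The angle decomposition described above can be illustrated with a short Python sketch. It is a floating-point model under stated assumptions, not the hardware design of [22] or [24]: the function name `cordic_rotate` and the default of 24 iterations are hypothetical, and the multiplications by $2^{-i}$ stand in for the right shifts a fixed-point implementation would use. The elementary angles are taken as $\arctan 2^{-i}$, which is the usual choice for circular CORDIC.

```python
import math


def cordic_rotate(x, y, angle, iterations=24):
    """Rotate the vector (x, y) by `angle` radians using CORDIC steps.

    The angle must lie within the convergence range of the elementary
    angles (roughly -1.74 to +1.74 radians).
    """
    # Predefined elementary rotation angles atan(2^-i).
    elementary = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector; accumulate that gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    residual = angle
    for i in range(iterations):
        # Rotate towards zero residual angle: +1 or -1 weight.
        direction = 1.0 if residual >= 0.0 else -1.0
        shift = 2.0 ** -i  # a right shift by i bits in hardware
        x, y = x - direction * y * shift, y + direction * x * shift
        residual -= direction * elementary[i]
    # Compensate the accumulated gain once at the end.
    return x / gain, y / gain
```

Calling `cordic_rotate(1.0, 0.0, theta)` returns an approximation of (cos θ, sin θ). One micro-rotation is needed per bit of angular precision, which reflects the speed drawback noted above.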

Read Only Memory (ROM) based designs [46] [18] [19] are also efficient

choices for implementing 1-D DFT in certain applications. Among the ROM-

based designs, the distributed arithmetic-based designs [46] [18] and the

memory-based design [19] are two different approaches to realize multiplications

using ROMs. Recently, adder-based designs [6] have become popular

for DFT.

1.2 Fast Fourier Transform

Studies on the properties of the DFT resulted in an algebraic structure that could speed up the computation of the DFT by orders of magnitude. Such algorithms, which make the computation of the DFT faster and easier, are known as fast Fourier transform (FFT) algorithms [21]. The FFT algorithms simplify the computation of (1.3) by rearranging the input and output data and repeatedly partitioning them into smaller sets of sequences.

Among the two basic classes of FFT algorithms, viz., Decimation in Time

(DIT) and Decimation in Frequency (DIF) algorithms, the DIF algorithm is

found to have better signal-to-noise characteristics compared to the DIT [38].

1.2.1 Algorithm

The FFT is based on the divide-and-conquer approach. Consider the computation of an N-point DFT, where N can be factored as a product of two integers [33], that is,

$$N = LM \tag{1.6}$$

The input sequence x(n) can be stored in a 2-D array indexed by l for the row and m for the column, where 0 ≤ l ≤ L − 1 and 0 ≤ m ≤ M − 1. The sequence x(n) can be stored in a rectangular array in a variety of ways, each of which depends on the mapping of index n to the indices (l, m). The mapping

$$n = l + mL \tag{1.7}$$

stores the first L elements of x(n) in the first column, the next L elements in the second column and so on. A similar arrangement can be used to store the computed DFT values. The mapping

$$k = Mp + q \tag{1.8}$$

where 0 ≤ p ≤ L − 1 and 0 ≤ q ≤ M − 1, stores the DFT on a row-wise basis, where the first row contains the first M elements of the DFT X(k), the second row contains the next set of M elements and so on.

Consider that x(n) is mapped into a rectangular array x(l,m) and X(k)

is mapped into a corresponding rectangular array X(p, q). Then the DFT

can be expressed as a double sum over the elements of the rectangular array

multiplied by the corresponding phase factors. If we consider column wise

mapping for x(n) given by (1.7) and a row wise mapping for the DFT given

by (1.8) then

$$X(p, q) = \sum_{m=0}^{M-1} \sum_{l=0}^{L-1} x(l, m)\, W_N^{(Mp+q)(mL+l)} \tag{1.9}$$

But

$$W_N^{(Mp+q)(mL+l)} = W_N^{MLmp}\, W_N^{mLq}\, W_N^{Mpl}\, W_N^{lq} \tag{1.10}$$

However, $W_N^{Nmp} = 1$, $W_N^{mqL} = W_{N/L}^{mq} = W_M^{mq}$, and $W_N^{Mpl} = W_{N/M}^{pl} = W_L^{pl}$.

With these simplifications, (1.9) can be expressed as

$$X(p, q) = \sum_{l=0}^{L-1} \left\{ W_N^{lq} \left[ \sum_{m=0}^{M-1} x(l, m)\, W_M^{mq} \right] \right\} W_L^{lp} \tag{1.11}$$

The expression in (1.11) involves the computation of DFTs of length M and length L. The computational complexity is:

Complex multiplications: N(M + L + 1)

Complex additions: N(M + L − 2)

where N = ML. Thus the number of multiplications has been reduced from $N^2$ to N(M + L + 1) and the number of additions from N(N − 1) to N(M + L − 2).
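The three steps implied by (1.11) can be sketched in Python as follows: M-point DFTs along each row, multiplication by the phase factors $W_N^{lq}$, and L-point DFTs along each column. This is a minimal sketch under the index mappings (1.7) and (1.8); the function names `twiddle`, `dft_direct` and `dft_divide_and_conquer` are hypothetical, and the short DFTs are themselves evaluated directly rather than recursively.

```python
import cmath


def twiddle(n_points, exponent):
    """Phase factor W_N^exponent = exp(-j * 2 * pi * exponent / N)."""
    return cmath.exp(-2j * cmath.pi * exponent / n_points)


def dft_direct(x):
    """Direct N-point DFT, used as the reference."""
    n_points = len(x)
    return [
        sum(x[n] * twiddle(n_points, k * n) for n in range(n_points))
        for k in range(n_points)
    ]


def dft_divide_and_conquer(x, big_l, big_m):
    """Compute the N = L * M point DFT through (1.11)."""
    n_points = big_l * big_m
    if len(x) != n_points:
        raise ValueError("len(x) must equal L * M")
    # Column-wise storage (1.7): x(l, m) = x[l + m * L].
    # Step 1: M-point DFT of each row l, F(l, q) = sum_m x(l, m) W_M^{mq}.
    f = [
        [sum(x[l + m * big_l] * twiddle(big_m, m * q) for m in range(big_m))
         for q in range(big_m)]
        for l in range(big_l)
    ]
    # Step 2: scale by the phase factors W_N^{lq}.
    g = [
        [twiddle(n_points, l * q) * f[l][q] for q in range(big_m)]
        for l in range(big_l)
    ]
    # Step 3: L-point DFT of each column q, X(p, q) = sum_l G(l, q) W_L^{lp}.
    result = [0j] * n_points
    for p in range(big_l):
        for q in range(big_m):
            # Row-wise storage (1.8): k = M * p + q.
            result[big_m * p + q] = sum(
                g[l][q] * twiddle(big_l, l * p) for l in range(big_l)
            )
    return result
```

For example, with N = 12, L = 3 and M = 4, the output of `dft_divide_and_conquer` agrees with `dft_direct` to within floating-point rounding.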

1.2.2 Architecture

Dedicated FFT processors can be classified into four categories as shown in Figures 1.2, 1.3, 1.4 and 1.5. They differ mainly in the degree of parallelism in the computation. The sequential processor, the slowest of all, has a single arithmetic unit (AU) and performs elementary computations sequentially. The pipeline processor has more parallelism. It has $\log_r N$ arithmetic units, where r is the radix of processing and N is the number of input sample points. Hence, at any instant of time $\log_r N$ elementary computations can be done simultaneously. It requires N sequential operations to complete the N-point FFT computation. The parallel processor consists of N AUs and hence has to perform only $\log_r N$ operations in sequence to compute the FFT. The fastest of all is the array processor, which has as many AUs as there are computations, viz., $N \log_r N$, and all the computations are carried out in parallel.
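The trade-off between the four categories can be tabulated with a small Python helper. This is only a sketch: the function name `fft_architecture_costs` is hypothetical, the AU counts and the pipeline and parallel step counts are taken from the paragraph above, and the step counts for the sequential processor ($N \log_r N$) and the array processor (one) are assumptions inferred from the statement that there are $N \log_r N$ computations in total.

```python
def fft_architecture_costs(n_points, radix):
    """Return AU count and sequential steps for each FFT processor class."""
    # Compute log_r(N) with integers to avoid floating-point rounding.
    stages = 0
    size = 1
    while size < n_points:
        size *= radix
        stages += 1
    if size != n_points:
        raise ValueError("n_points must be a power of radix")
    total = n_points * stages  # N * log_r(N) elementary computations
    return {
        # Steps for "sequential" and "array" are inferred, not stated.
        "sequential": {"aus": 1, "steps": total},
        "pipeline": {"aus": stages, "steps": n_points},
        "parallel": {"aus": n_points, "steps": stages},
        "array": {"aus": total, "steps": 1},
    }
```

For N = 256 and r = 4, for instance, the pipeline processor uses 4 AUs and 256 sequential operations, while the parallel processor uses 256 AUs and 4 sequential operations.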


Figure 1.2: Sequential Processor

Figure 1.3: Pipeline Processor

Figure 1.4: Parallel Processor

Figure 1.5: Array Processor


The pipeline organization is best suited for high speed real-time FFT

processing. The pipeline processor has the distinct advantage that blocks

of data can be fed in succession into the processor, since the introduction

of delay elements in each stage takes care of the necessary independence

of computation carried out in the different stages. It is capable of high

throughput rates and the speed limitation is set by the slowest computing

element in the pipeline. The pipelining rate is independent of N . Only the

initial delay required in filling up the pipe is dependent on N or number of

stages in the pipe.

1.3 Multidimensional DFT

The computation of the multidimensional DFT is of great importance in

many DSP areas such as image processing, speech processing and spectrum

analysis. Although the multidimensional DFT is a very powerful tool for

analyzing multidimensional signals, it requires a huge amount of computation

and there is a need for efficient VLSI implementation of the algorithm.


1.3.1 Algorithm

In its most general form, the DFT of an M-dimensional sequence, $x(n_1, n_2, \ldots, n_M)$, is expressed as

$$X(k_1, k_2, \ldots, k_M) = \sum_{n_1, n_2, \ldots, n_M} x(n_1, n_2, \ldots, n_M)\, W_{N_1}^{n_1 k_1}\, W_{N_2}^{n_2 k_2} \cdots W_{N_M}^{n_M k_M} \tag{1.12}$$

where the indices $n_i$, $k_i$ run over the domain $[0, N_i - 1]$, $1 \le i \le M$, and

$$W_{N_i}^{n_i k_i} = e^{-j2\pi n_i k_i / N_i}$$

Several fast algorithms are known for efficiently computing this weighted sum. Among them, the row and column decomposition algorithm is highly modular, that is, one can compute the DFT of an M-dimensional signal by carrying it out M times over the indices $n_M, n_{M-1}, \ldots, n_2, n_1$ in an iterative manner. For example, if M = 2, then

$$X(k_1, k_2) = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x(n_1, n_2)\, W_{N_2}^{n_2 k_2}\, W_{N_1}^{n_1 k_1} \tag{1.13}$$

or if we let

$$G(n_1, k_2) = \sum_{n_2=0}^{N_2-1} x(n_1, n_2)\, W_{N_2}^{n_2 k_2} \tag{1.14}$$

then

$$X(k_1, k_2) = \sum_{n_1=0}^{N_1-1} G(n_1, k_2)\, W_{N_1}^{n_1 k_1} \tag{1.15}$$

Thus we can compute $X(k_1, k_2)$ in two steps by first computing (1.14) and then computing (1.15).
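The two-step evaluation of (1.14) followed by (1.15) can be sketched in Python as below. This is an illustrative sketch only: the function names `dft_1d` and `dft_2d_row_column` are hypothetical, the input is assumed to be a nested list `x[n1][n2]`, and each 1-D transform is evaluated directly.

```python
import cmath


def dft_1d(seq):
    """Direct 1-D DFT of a sequence of complex numbers."""
    n_points = len(seq)
    return [
        sum(seq[n] * cmath.exp(-2j * cmath.pi * n * k / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]


def dft_2d_row_column(x):
    """2-D DFT by row and column decomposition; x[n1][n2] is N1 x N2."""
    n1_size, n2_size = len(x), len(x[0])
    # (1.14): transform over n2 for each fixed n1, giving G[n1][k2].
    g = [dft_1d(row) for row in x]
    # (1.15): transform over n1 for each fixed k2, giving X[k1][k2].
    result = [[0j] * n2_size for _ in range(n1_size)]
    for k2 in range(n2_size):
        column = dft_1d([g[n1][k2] for n1 in range(n1_size)])
        for k1 in range(n1_size):
            result[k1][k2] = column[k1]
    return result
```

The intermediate array `g` is read by columns after being written by rows, which is the access pattern a transposition memory serves in hardware realizations of this method.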

1.3.2 Architecture

One of the most popular methods for implementing the 2-D DFT is the row column decomposition method, which requires a transposition memory between two 1-D transforms. The design in [16] is made up of two hybrid architectures, each using a butterfly array of processors to perform a 1-D DFT and an interconnection network to route the intermediate results back to the array between different phases of the algorithm. One of these architectures uses a perfect shuffle network [37] while the other uses a rotation network to align the outputs of the processor array between different phases of the computation. These networks are difficult to lay out in VLSI [13] because of the interconnection problems.

Another approach for implementing multidimensional DFTs is the systolic architecture [25] [36] [35]. Although these systolic implementations have the advantages of simple design due to the regularity of the PEs and the local interconnections between them, they require a large number of PEs.

Recently, an architecture for the multidimensional DFT which makes use of the conventional pipelined architecture for a 1-D FFT by using a new two level index mapping scheme has been proposed [50]. This architecture has fewer PEs than the systolic architectures. The number of multipliers needed is also smaller. It has a flexible area/throughput tradeoff and a regular structure with nearest neighbor interconnections only.

1.4 DSM Technological Issues

Deep submicron (DSM) effects have been identified as a potential impediment to the continuing advancements in integrated circuit performance. Examples of DSM effects include the rising RC delay of on-chip wiring, noise issues such as crosstalk and delay deterioration, and increasing power dissipation. These issues have been addressed in a number of recent works with the general conclusion that interconnect effects will dominate performance in DSM designs.

The International Technology Roadmap for Semiconductors (ITRS) projects that by 2011 over one billion transistors will be integrated into a single monolithic die [2]. The wiring system of this billion-transistor die will distribute clock and other signals and provide power/ground to and among the various circuit/system functions on a chip.

1.4.1 Interconnect Dominance

Miniaturization of transistors enhances their performance, but the same cannot be said of interconnect miniaturization. Scaling interconnects into the nanometer regime is plagued with many challenges, such as resistivity degradation, material integration issues, high-aspect-ratio via and wire coverage, planarity control, and reliability problems due to electrical, thermal, and mechanical stresses in a multilevel wire stack [2]. Even when these challenges are overcome, minimum interconnect scaling will still degrade interconnect delay. For example, Table 1.1 shows that the intrinsic interconnect delay of a 1-mm length interconnect at the 35-nm technology node overwhelms the transistor delay by two orders of magnitude [39].

Technology            MOSFET switching      Intrinsic delay of      Intrinsic delay of      W/F
                      delay (td = CV/I)     minimum scaled          reverse scaled
                                            1 mm interconnect       1 mm interconnect
1 µm (Al, SiO2)       ~20 ps                ~5 ps                   ~5 ps                   1
0.1 µm (Al, SiO2)     ~5 ps                 ~30 ps                  ~5 ps                   ~1.5
35 nm (Cu, low-k)     ~2.5 ps               ~250 ps                 ~5 ps                   ~4.5

Table 1.1: Interconnect and Transistor Scaling Properties

Scaling effects on interconnect latency have been investigated [28], [3] and are illustrated in the reciprocal length squared versus time delay plane in Figure 1.6 (after [3]).


Figure 1.6: Scaling effects on interconnect time delay limits

1.4.2 Effect of Interconnects on Delay and Power

When designers target a process at 0.5 micron and above, the majority of

chip delay resides in gates. But with DSM, the gates have become smaller

and faster, whereas the amount of interconnect on a chip used to connect

these small and fast gates has grown exponentially.

The result is a shift in the design paradigm based on interconnect delay

dominance. For example, in a 0.5-micron design, the ratio of gate delay

to interconnect delay is 4 to 1, or 80 percent to 20 percent. By the time

designs have reached 0.25 micron, the ratio has flip-flopped, with gate delay

accounting for only 20 percent of the total delay. The ratio continues to


increase in favor of interconnect delay as DSM designs continue to get smaller.

A distributed RC network can be used to model a single global on-chip interconnect. The latency of this interconnect is given by the distributed RC time delay (assuming $rcL^2 \ge L/v$) as

$$\tau = rcL^2 \tag{1.16}$$

where

r is the distributed resistance per unit length,
c is the distributed ground capacitance per unit length,
L is the interconnect length, and
v is the speed of electromagnetic wave propagation.

Figure 1.7 shows the delay of local and global wiring in future generations

[2].

Interconnect-driven timing optimization techniques, such as wire sizing,

buffer insertion and gate sizing have gained widespread acceptance in DSM

design. In particular, buffer insertion techniques have been successful in

reducing interconnect delay. To the first order, interconnect delay is proportional
to the square of the length of the wire. Inserting buffers effectively
divides the wire into smaller segments, which makes the interconnect delay
almost linear in the wire length (plus the buffer delays).
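The effect of segmentation can be illustrated with a short numerical sketch. It assumes that the wire is cut into equal segments, that each segment obeys the distributed RC delay of Equation 1.16, and that every inserted buffer adds a fixed delay. The function names and all numerical values are hypothetical and are chosen only for illustration.

```python
def unbuffered_delay(r, c, length):
    """Distributed RC delay of a plain wire: tau = r * c * L^2 (Equation 1.16)."""
    return r * c * length ** 2


def buffered_delay(r, c, length, segments, buffer_delay):
    """Delay of the same wire cut into `segments` equal pieces by buffers.

    Each piece contributes r * c * (L / segments)^2, and each of the
    (segments - 1) buffers adds `buffer_delay`. Both are modelling assumptions.
    """
    piece = unbuffered_delay(r, c, length / segments)
    return segments * piece + (segments - 1) * buffer_delay


if __name__ == "__main__":
    r, c, t_buf = 1.0, 1.0, 0.5  # hypothetical per-unit-length values and buffer delay
    for length in (4, 8, 16):
        # One buffer per unit length: the buffered delay grows linearly with L,
        # while the unbuffered delay grows quadratically.
        print(length,
              unbuffered_delay(r, c, length),
              buffered_delay(r, c, length, segments=length, buffer_delay=t_buf))
```

Under these assumptions the wire term drops from $rcL^2$ to $rcL^2/k$ for k segments, at the price of (k - 1) buffer delays.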

Figure 1.7: Delay for Local and Global Wiring versus Feature Size

Buffer insertion, however, consumes power and occupies a large amount of the
chip area. As shown in Figure 1.8, the power consumed by these delay-optimal
devices and wires will increase as we go further into the DSM era [39].

With the increase in signal frequencies and the corresponding decrease

in signal transition times, the interconnect impedance can behave induc-

tively [48], increasing the on-chip noise. Furthermore, considering inductance

within the design process increases the computational complexity of IC syn-

thesis and analysis tools. However, inductive behavior can also be useful. As

shown in [47], a properly designed inductive line can reduce the total power

dissipated by high-speed clock distribution networks. Clock networks can

dissipate a large portion of the total power dissipated within a synchronous

IC, ranging from 25 percent to 75 percent [11] [10].


Figure 1.8: Power for all repeaters and global interconnect where 50 percent

of all devices are logic

1.5 Influence of Number System on Power

The need for low power design is widely known and needs no elucidation.

In the DSM regime it is not just the device power that needs to be accounted

for but also the interconnect power. In designing VLSI signal processing ar-

chitectures especially for battery powered devices, the designer needs to take

the issue of low power very seriously. Power reduction can be looked at from

different angles. The most obvious ones are by decreasing the supply voltage,

by reducing the parasitic capacitance and by reducing the switching activity.

The choice of algorithm is the most highly leveraged decision in meeting the

power constraints. The ability for an algorithm to be parallelized is critical


and the basic complexity of the computation must be highly optimized. Shut-

ting down unused arithmetic blocks is also attractive in some applications.

But the most logical way of reducing power is by minimizing the number of

operations. This is normally done by neglecting multiplications or divisions

by 1s or js (in case of complex numbers), additions or subtractions with 0s

etc. If we can have a number representation that reduces the number of

operations our job will be much easier.

Many number systems are used in application specific integrated cir-

cuit (ASIC) design but the most important among them are binary number

system (BNS), residue number system (RNS) and sign/log number system

(LNS). The use of the RNS allows the decomposition of a given dynamic

range in slices of smaller range on which the computation can be efficiently

implemented in parallel. The power dissipation is reduced by taking advan-

tage of the speed-up due to the parallelism of the RNS structure. Arithmetic

operations like addition, subtraction and multiplication can be done much

faster and with fewer operations compared to binary. The typical drawback

presented by the RNS is related to the input-output conversion from binary

to RNS and vice versa. Many new efficient conversion techniques are being

proposed to tackle this problem. The logarithmic number system (LNS), on the
other hand, requires relatively less conversion overhead. With multiplications
becoming additions and divisions becoming subtractions, an LNS implementation
becomes efficient after a moderate number of multiply-add operations.
Table 1.2 gives a good measure of the reduced transitions using LNS for
multiplication and division.

Bits   Adder   Multiplier        Divider         Multiplier   Divider
               (Wallace/Dadda)   (SRT Radix 4)   Times Down   Times Down
16     90      573               3757            6.37         41.74
32     182     3874              7293            21.29        40.07
64     366     9548              14365           53.41        32.24

Table 1.2: Average Logic Transitions in Multiplication and Division using LNS
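The two ideas described above, channel-wise RNS arithmetic and LNS multiplication by adding logarithms, can be sketched as follows. This is a minimal illustration and not the scheme of this thesis: the moduli (3, 5, 7), the base-2 logarithm, the restriction to positive operands and all function names are assumptions made for the example.

```python
import math

MODULI = (3, 5, 7)  # hypothetical pairwise-coprime moduli; dynamic range 3 * 5 * 7 = 105


def to_rns(x, moduli=MODULI):
    """Binary-to-RNS conversion: one small residue per modulus."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    """Addition is carried out independently (in parallel) in each residue channel."""
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    """Multiplication is likewise channel-wise, with no carries between channels."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI):
    """RNS-to-binary conversion using the Chinese Remainder Theorem."""
    big_m = math.prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        total += r * m_i * pow(m_i, -1, m)  # pow(x, -1, m) is the modular inverse
    return total % big_m


def to_lns(x):
    """Logarithmic representation of a positive number (sign handling omitted)."""
    return math.log2(x)


def from_lns(e):
    return 2.0 ** e


def lns_mul(log_a, log_b):
    """Multiplication becomes an addition of logarithms."""
    return log_a + log_b


def lns_div(log_a, log_b):
    """Division becomes a subtraction of logarithms."""
    return log_a - log_b
```

The conversion routines `to_rns` and `from_rns` correspond to the input-output overhead mentioned in the text; the channel-wise operations themselves involve only small operands.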

RNS and LNS reduce the number of transitions only for certain operations.
For example, RNS is ill-suited for division, among other operations, and LNS is
not suitable for addition and subtraction. These disadvantages mean that
RNS is used only in applications having few divisions and LNS
in applications having few additions and subtractions. A single
number system which reduces the number of transitions for all the basic
arithmetic operations needs to be evolved. Such a number system can be of
great use in a variety of applications irrespective of the operations involved.
The concept of a mixed number system put forth in this thesis is an effort in
this direction.


1.6 Contribution of the Thesis

The impact of DSM technology on FFT and DFT architectures is dealt
with. In this connection a detailed analysis of interconnect complexity for
various arithmetic blocks commonly used in DSP kernels is carried out.

The DSM issues can also be tackled by resorting to a different number
representation which can reduce the interconnect complexity and power and achieve
increased speed. In this direction a new mixed number representation is put
forth and is extensively analyzed by applying it to the DFT, FFT and Gabor
transform.

1.6.1 FFT Vs DFT Interconnect Analysis

The importance of interconnects cannot be ignored under DSM. There is a
need to evolve architectures in which the interconnect is less dominant. The
analysis presented in the thesis explicitly shows the interconnect dominance
in FFT. It is noted that the power-delay product (hardware and interconnect)
increases with increase in radix. Our analysis shows that DFT rather
than FFT will be more suitable for multi-GHz DSM architecture
implementation.


1.6.2 Proposed Mixed Number Representation

A novel number representation called the Logarithmic Residue Number System
(LRNS) has been designed. It achieves higher speed and lower power
consumption than binary and RNS, and improved accuracy over LNS, for
arithmetic operations. The LRNS scheme can be embedded in a number of
DSP architectures performing both frequency transforms and time-frequency
transforms such as the Gabor transform, and also in DSP filters. Employing the
LRNS scheme in the Gabor transform drastically reduces the computational complexity.

Further, based on this mixed number system an arithmetic processor

meant for low power, high performance DSP applications is designed. A

Verilog simulation of the instruction set of this arithmetic processor is in-

cluded.


Chapter 2

Interconnect Complexity, Power

and Delay of Arithmetic Units

With rapid developments in VLSI technology and Computer Aided Design

(CAD) techniques, the ever increasing quest for high performance is placing

demands on interconnect performance and highlights the previously negligi-

ble effects of interconnects [2]. Hence, it is necessary that the interconnect

complexity of any circuit be analyzed, as now under DSM technology, the

interconnects contribute to most of the chip power.

In this chapter, the interconnect complexity of basic arithmetic units (AUs) like
adders and multipliers is modelled from a broader perspective. The hardware
complexity, delay and power analysis of these arithmetic units were also
performed. Graphs depicting the low-level (transistor power) and high-level
characterization (active area, interconnect area, total area and power-delay
product) of the arithmetic units, viz. adders and multipliers, are also presented
here.

The interconnect and hardware complexity can be modelled at different
levels, namely the algorithm level, architecture level, functional level, gate level,
circuit level and layout level. Interconnect complexity modelled at the gate
level is denoted as the $\ell_2$ level interconnect complexity. The functional level,
or the $\ell_1$ level, can be further classified as the full adder level or the register level.
This means considering the adder or register count as far as the hardware
complexity is concerned and using the interconnect count between them as
a measure of interconnect complexity. Interconnects across successive levels
are taken to be of unit length for the analysis. Unit delays are assumed for
such interconnects.

2.1 Interconnect Complexity of Basic Functional Blocks

Figure 2.1: Low-level Characterization of Adders

Figure 2.2: High-level Characterization of Adders-Area

Figure 2.3: High-level Characterization of Adders-Power-Delay Product

Figure 2.4: Characterization of Multipliers

Figure 2.5: Characterization of Multipliers-High level

In this section the interconnect complexity of some of the basic functional
blocks, viz. the full adder and the dot operator (Brent-Kung) [34], is dealt with in
detail. Interconnect complexity is calculated based on the interconnect count
$\lambda(m_{i,f}, n_{i,f})$ and the maximum interconnect length (MIL) as shown in
Figure 2.6. $\lambda(m,n)$ is taken as m horizontal interconnects and n vertical
interconnects, or a total of m+n interconnects each of unit length. The general
equation for the number of interconnects is given by

\[ \text{No. of interconnects } (IC) = \sum_{i=1}^{\text{No. of Gates}} \; \sum_{f=1}^{\text{No. of Fanouts}} \lambda(\Sigma m_{i,f}, \Sigma n_{i,f}) \tag{2.1} \]

where m and n are measures of horizontal and vertical interconnects
respectively, expressed as multiples of unit length.

The interconnect count is found to be the total sum of the $m_{i,f}$ and $n_{i,f}$ values.
In most cases it can be read directly from the gate level logic diagram.
Further,

f > 1 Multiple outputs per gate

f = 1 Single output per gate

E.g., for Figure 2.6 the interconnect count is given as

\[ IC = \lambda(2,0) + \lambda(1,1) + \lambda(1,1) \tag{2.2} \]

The interconnect count for Figure 2.6 is thus calculated as 6.
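The count of Equation 2.2 can be reproduced with a few lines of code. This is a sketch under stated assumptions: each net is written as a pair (m, n), and the grouping of the three nets of Figure 2.6 into gates is hypothetical, since only the three $\lambda$ terms are given in the text.

```python
def lam(m, n):
    """Interconnect count of one net: m horizontal plus n vertical unit-length wires."""
    return m + n


def interconnect_count(nets_per_gate):
    """Equation 2.1: sum lambda(m, n) over every fan-out net of every gate."""
    return sum(lam(m, n) for nets in nets_per_gate for (m, n) in nets)


# Equation 2.2 for Figure 2.6: lambda(2, 0) + lambda(1, 1) + lambda(1, 1).
# The split into two gates below is an assumption made for illustration.
figure_2_6 = [[(2, 0)], [(1, 1), (1, 1)]]
print(interconnect_count(figure_2_6))  # 6
```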

The total interconnect delay is the sum of the delays caused by
interconnects of maximum length in each stage.

\[ T_{D,I} \text{ or } MIL = \sum_{s=1}^{\text{Total No. of Stages}} MIL_s \tag{2.3} \]

where $MIL_s$ is the maximum interconnect length due to horizontal and
vertical interconnects causing the maximum delay in stage s, expressed
as multiples of unit length. These measures are calculated as a sum of the
horizontal and vertical interconnects in each stage that lead to a maximum
delay.

The interconnect delay product (IDP) is the product of the total delay
due to the interconnects ($T_{D,I}$, also written $T_{ID}$, or MIL) and the interconnect
count (IC). This reflects the Area-Time (AT) product due to the interconnects.

The power due to the interconnects is given as

\[ \text{Interconnect power } (IP) = \sum_{i=1}^{\text{No. of Gates}} \tau_i \sum_{f=1}^{\text{No. of Fanouts}} \lambda(\Sigma m_{i,f}, \Sigma n_{i,f}) \tag{2.4} \]

where $\tau_i$ is the average transition count due to the ith gate and f is the fan-out
of the ith gate.

The interconnect power-delay product (IPD) is given as

\[ IPD = \left( \sum_{i=1}^{\text{No. of Gates}} \tau_i \sum_{f=1}^{\text{No. of Fanouts}} \lambda(\Sigma m_{i,f}, \Sigma n_{i,f}) \right) \times T_{D,I} \tag{2.5} \]
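Equations 2.3 to 2.5 can be evaluated together once the nets driven by each gate and the per-stage maximum interconnect lengths are known. The sketch below assumes a hypothetical two-gate circuit; the transition counts, the nets and the stage lengths are illustrative values, not data from this thesis.

```python
def lam(m, n):
    """Interconnect count of one net: m horizontal plus n vertical unit-length wires."""
    return m + n


def interconnect_count(gates):
    """Equation 2.1 for a list of (tau_i, [(m, n), ...]) gate entries."""
    return sum(lam(m, n) for _, nets in gates for (m, n) in nets)


def total_interconnect_delay(stage_mils):
    """Equation 2.3: sum of the maximum interconnect length of every stage."""
    return sum(stage_mils)


def interconnect_power(gates):
    """Equation 2.4: each gate's nets weighted by its average transition count."""
    return sum(tau * sum(lam(m, n) for (m, n) in nets) for tau, nets in gates)


def interconnect_delay_product(gates, stage_mils):
    """IDP: interconnect count times total interconnect delay."""
    return interconnect_count(gates) * total_interconnect_delay(stage_mils)


def interconnect_power_delay(gates, stage_mils):
    """Equation 2.5: interconnect power times total interconnect delay."""
    return interconnect_power(gates) * total_interconnect_delay(stage_mils)


# Hypothetical circuit: (average transition count, nets driven by the gate).
example_gates = [(0.5, [(2, 0)]), (0.25, [(1, 1), (1, 1)])]
example_stage_mils = [2, 2]  # assumed maximum interconnect length of each stage
```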

2.1.1 Full Adder

The full adder (Figure 2.7) is a circuit that calculates the sum and carry
of three bits. Figure 2.8 shows the NAND and NOT implementation of a full
adder. From the figure the interconnect delay, interconnect complexity and
the power due to the interconnects are calculated as follows.

\[ \text{Interconnect count } (IC_{FA}) = 116 \tag{2.6} \]


Figure 2.6: Interconnects at Gate Level

Figure 2.7: A full adder


Figure 2.8: Gate implementation of a full adder


The total delay due to the interconnects in a full adder is

\[ T_{ID,FA} \text{ or } MIL_{FA} = \lambda(1,1) + \lambda(6,6) + \lambda(2,1) = 17 \text{ unit delays} \tag{2.7} \]

The interconnect delay product (IDP) is the product of $MIL_{FA}$ and
$IC_{FA}$.

The interconnect power calculation is as follows

\begin{align*}
IP_{FA} ={}& \tau_1(\lambda(1,2)+\lambda(1,3)+\lambda(1,1)) \\
&+ \tau_2(\lambda(3,1)+\lambda(3,3)+\lambda(3,4)+\lambda(3,6)+\lambda(1,1)) \\
&+ \tau_3(\lambda(5,1)+\lambda(5,2)+\lambda(5,5)+\lambda(5,6)+\lambda(1,1)) \\
&+ \tau_4(\lambda(2,3)+\lambda(2,4)+\lambda(2,5)+\lambda(2,6)) \\
&+ \tau_5(\lambda(4,4)+\lambda(4,6)) + \tau_6(\lambda(6,3)+\lambda(6,6)) \\
&+ \tau_7\lambda(2,1)+\tau_8\lambda(1,0)+\tau_9\lambda(1,0)+\tau_{10}\lambda(2,1) \\
&+ \tau_{11}\lambda(2,1)+\tau_{12}\lambda(1,0)+\tau_{13}\lambda(2,1)
\end{align*}

The above equation for the interconnect power can be further reduced to

\begin{align*}
IP_{FA} ={}& \tau_1(\lambda(1,2)+\lambda(1,3)+\lambda(1,1)) \\
&+ \tau_2(\lambda(3,1)+\lambda(3,3)+\lambda(3,4)+\lambda(3,6)+\lambda(1,1)) \\
&+ \tau_3(\lambda(5,1)+\lambda(5,2)+\lambda(5,5)+\lambda(5,6)+\lambda(1,1)) \\
&+ \tau_{NOT}(\lambda(2,3)+\lambda(2,4)+\lambda(2,5)+\lambda(2,6)+\lambda(4,4)+\lambda(4,6)+\lambda(6,3)+\lambda(6,6)) \\
&+ \tau_{NAND4}(\lambda(2,1)+\lambda(1,0)+\lambda(1,0)+\lambda(2,1)) \\
&+ \tau_{NAND3}(\lambda(2,1)+\lambda(1,0)+\lambda(2,1))
\end{align*}

The interconnect power delay product (IPD) is calculated as

\begin{align*}
IPD_{FA} ={}& 4\tau_1(\lambda(1,2)+\lambda(1,3)+\lambda(1,1)) \\
&+ 9\tau_2(\lambda(3,1)+\lambda(3,3)+\lambda(3,4)+\lambda(3,6)+\lambda(1,1)) \\
&+ 11\tau_3(\lambda(5,1)+\lambda(5,2)+\lambda(5,5)+\lambda(5,6)+\lambda(1,1)) \\
&+ \tau_{NOT}\bigl(8(\lambda(2,3)+\lambda(2,4)+\lambda(2,5)+\lambda(2,6)) + 10(\lambda(4,4)+\lambda(4,6)) + 12(\lambda(6,3)+\lambda(6,6))\bigr) \\
&+ \tau_{NAND4}(3\lambda(2,1)+\lambda(1,0)+\lambda(1,0)+3\lambda(2,1)) \\
&+ \tau_{NAND3}(3\lambda(2,1)+\lambda(1,0)+3\lambda(2,1))
\end{align*}

2.1.2 Brent-Kung Dot Operator

Figure 2.9 shows the gate implementation of a Brent-Kung dot operator.
From the figure the interconnect delay, interconnect complexity, interconnect
delay product, power due to the interconnects and the interconnect
power-delay product are calculated as follows.

\[ IC_{dot} = 20 \tag{2.8} \]

The total delay due to the interconnects in a dot operator is

\[ T_{ID,dot} \text{ or } MIL_{dot} = \lambda(5,2) = 7 \text{ unit delays} \tag{2.9} \]

The IDP for the dot operator is the product of $IC_{dot}$ and $MIL_{dot}$.

The interconnect power calculation is as follows

\[ IP_{dot} = \tau_1\lambda(5,2) + \tau_2(\lambda(3,1)+\lambda(3,0)) + \tau_3\lambda(2,1) + \tau_4\lambda(1,0) + \tau_5\lambda(2,1) \]

The IPD for a dot operator is found to be

\[ IPD_{dot} = 7\tau_1\lambda(5,2) + 4\tau_2(\lambda(3,1)+\lambda(3,0)) + 3\tau_3\lambda(2,1) + \tau_4\lambda(1,0) + 3\tau_5\lambda(2,1) \]


Figure 2.9: Gate implementation of a dot operator

2.2 Adders

Adders are the most common arithmetic units used in general purpose as
well as in DSP systems. Adders are also used to perform subtraction and
they are the basic components in multiplier and divider units. Adders are
chosen depending upon their speed, area, configuration, interconnect
complexity and power. The power-delay products associated with the hardware
and interconnects of selected adders, together with the IDP of some of
the adders, are discussed in this section. In analyzing the interconnect and
hardware complexity of the adders, the basic element considered is the full
adder and the complexities are represented as a function of this full adder
complexity.


Figure 2.10: A Serial Adder

2.2.1 Serial Adder

The serial adder calculates the sum and carry at each bit position. Figure
2.10 [7] shows a serial adder. The basic element in a serial adder is the full
adder shown in Figure 2.7 and Figure 2.8. The interconnect and hardware
complexity of the serial adder includes the full adder and the D flip-flop. A
serial subtractor can be constructed with minor modification of the serial
adder. The interconnect and hardware complexity of the serial subtractor is
about the same as that of a serial adder. The delay for an n-bit serial adder
is

\[ T_{D,ADD} = (n-1)(T_{D,FA,ic} + T_{D,FF}) + T_{D,FA,is} \tag{2.10} \]

where,

$T_{D,FA,is}$ is the delay between the input and sum output

$T_{D,FA,ic}$ is the delay between the input and carry output

$T_{D,FF}$ is the delay due to the D flip-flop

Assuming that $MIL_{FA}$ is the maximum interconnect length complexity of
a full adder at the gate level, $T_{ID,FA}$ is the delay due to the interconnects
within the full adder at the gate level, $MIL_{FF}$ is the maximum flip-flop
interconnect length complexity and $T_{ID,FF}$ is the flip-flop interconnect delay,
we have the interconnect delay product (IDP) of the serial adder at the gate
level ($\ell_2$ level) as

\[ IDP_{\ell_2} = IC_{FA} \times n\,T_{ID,FA} + IC_{FF} \times (n-1)\,T_{ID,FF} \tag{2.11} \]

Now the IDP for the serial adder at the full adder level ($\ell_1$) is given below

\[ IDP_{\ell_1} = 2n - 1 \tag{2.12} \]

The hardware power-delay product (HPD) for the serial adder for n-bit
addition is given by

\[ HPD = (n \times P_{FA} + (n-1) \times P_{FF}) \times T_{D,ADD} \tag{2.13} \]

where $P_{FA}$ is the power within the full adder and
$P_{FF}$ is the power within the flip-flop.

Figure 2.11: A ripple carry adder

The interconnect power delay product (IPD) of the serial adder for n-bit
addition is calculated from a broader perspective, namely at the full adder and
flip-flop output level ($\ell_1$ level), and is given by the following equation

\[ IPD_{\ell_1} = P_{O,FA} \times n + P_{O,FF} \times (n-1) \tag{2.14} \]

where $P_{O,FA}$ is the FA output line power and
$P_{O,FF}$ is the FF output line power.

The IPD is calculated only at the $\ell_1$ level throughout this chapter.
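The serial adder expressions (Equations 2.10 and 2.12 to 2.14) can be collected into small helper functions for trying out different wordlengths. This is only a sketch: the function and parameter names are hypothetical, and any delay or power values passed in are in arbitrary units chosen by the user.

```python
def serial_adder_delay(n, t_fa_carry, t_fa_sum, t_ff):
    """Equation 2.10: (n - 1) carry-plus-flip-flop steps, then one sum delay."""
    return (n - 1) * (t_fa_carry + t_ff) + t_fa_sum


def serial_adder_idp_l1(n):
    """Equation 2.12: IDP at the full adder (l1) level."""
    return 2 * n - 1


def serial_adder_hpd(n, p_fa, p_ff, t_add):
    """Equation 2.13: hardware power-delay product for n-bit addition."""
    return (n * p_fa + (n - 1) * p_ff) * t_add


def serial_adder_ipd_l1(n, p_out_fa, p_out_ff):
    """Equation 2.14: interconnect power-delay product at the l1 level."""
    return p_out_fa * n + p_out_ff * (n - 1)


if __name__ == "__main__":
    n = 8  # hypothetical wordlength
    t_add = serial_adder_delay(n, t_fa_carry=2, t_fa_sum=3, t_ff=1)  # assumed unit delays
    print(t_add, serial_adder_idp_l1(n), serial_adder_hpd(n, p_fa=2, p_ff=1, t_add=t_add))
```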

2.2.2 Ripple Carry Adder

An n-bit ripple carry adder is the simplest parallel adder, constructed by
cascading n full adders as shown in Figure 2.11. In a ripple carry adder the
carry output of a full adder is connected to the input line of the next full
adder. The hardware complexity of the ripple carry adder is proportional to
n. The worst case delay is also proportional to n.

The delay of an n-bit ripple carry adder is

\[ T_{D,ADD} = n\,T_{D,FA,ic} \tag{2.15} \]

where the notation is the same as in the previous section.

The interconnect complexity of the ripple carry adder at the $\ell_2$ level (i.e., the gate
level) is $n \times IC_{FA}$.

The IDP at the gate level is

\[ IDP_{\ell_2} = IC_{FA} \times T_{ID,FA} \times n \tag{2.16} \]

The $\ell_1$ level interconnect complexity excluding the I/O interconnects
($IC_{\ell_1}$) is

\[ IC_{\ell_1} = n - 1 \tag{2.17} \]

Now the IDP at the full adder level ($\ell_1$ level) is given below

\[ IDP_{\ell_1} = (n-1)^2 \tag{2.18} \]

The HPD is calculated for an n-bit ripple carry adder and found to be

\[ HPD = n \times P_{FA} \times T_{D,ADD} \tag{2.19} \]

The IPD calculated at the $\ell_1$ level is

\[ IPD_{\ell_1} = P_{O,FA} \times (n-1) \tag{2.20} \]
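The ripple carry adder measures can be tabulated for a given wordlength in the same way. The sketch below uses the full adder values $IC_{FA} = 116$ (Equation 2.6) and $T_{ID,FA} = 17$ (Equation 2.7) as defaults; the function names, and any delay or power values supplied by the caller, are assumptions for illustration.

```python
IC_FA = 116    # full adder interconnect count, Equation 2.6
T_ID_FA = 17   # full adder interconnect delay in unit delays, Equation 2.7


def ripple_delay(n, t_fa_carry):
    """Equation 2.15: the carry ripples through all n full adders."""
    return n * t_fa_carry


def ripple_ic_l1(n):
    """Equation 2.17: carry links between neighbouring full adders."""
    return n - 1


def ripple_ic_l2(n, ic_fa=IC_FA):
    """Gate level (l2) interconnect complexity: n full adders."""
    return n * ic_fa


def ripple_idp_l1(n):
    """Equation 2.18."""
    return (n - 1) ** 2


def ripple_idp_l2(n, ic_fa=IC_FA, t_id_fa=T_ID_FA):
    """Equation 2.16."""
    return ic_fa * t_id_fa * n


def ripple_hpd(n, p_fa, t_add):
    """Equation 2.19."""
    return n * p_fa * t_add


def ripple_ipd_l1(n, p_out_fa):
    """Equation 2.20."""
    return p_out_fa * (n - 1)
```

For example, `ripple_idp_l1(8)` and `ripple_idp_l2(8)` give the two IDP values for an 8-bit adder at the full adder level and at the gate level respectively.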


Figure 2.12: Carry Block of a Brent Kung CLA(n=8)

2.2.3 Brent-Kung Carry Lookahead Adder

With a structured CLA (the Brent-Kung CLA) [34], shown in Figure 2.12,
the delay in obtaining the carry bits can be reduced considerably. The
Brent-Kung CLA is actually a logarithmic adder which has a binary tree
and an inverse binary tree for its carry generation block. The binary tree
structure produces $k = 0, 1, \ldots, \log_2 n$ carry block functions for all powers of
two, $2^k - 1 < n$. The block carry functions for bit levels that lie between the
powers of two are evaluated with an inverse tree.

The hardware complexity of the Brent-Kung structure in terms of the
dot operator function is $(n-1) + (n-1-\log_2 n)$. The value within the
first pair of parentheses is for the binary tree and that within the second is for the
inverse tree. Note that the inverse binary tree has fewer operators than the binary
tree. This clearly reflects the hardware complexity of the Brent-Kung carry
generation block at the dot operator function level.

The total hardware complexity is equal to the hardware complexity of
the Brent-Kung structure plus that of the sum generation part.

From here the analysis is done only for the Brent-Kung structure, i.e., only
for the carry generation part; the analysis for the whole adder can be easily
extended from it. The time-critical path runs from $(g_0, p_0)$ via $(G_{n/2-1}, P_{n/2-1})$ and
$(G_{n-2}, P_{n-2})$ to the output $C_{n-1}$. This path contains the most operators
connected in series. A total of $2\log_2(n/2)$ operators are encountered along
this path. When computing the circuit delay, it should be noted that the
capacitive loads are not identical, but rather increase towards the middle.
The delay induced along the time-critical path can be expressed as

\[ T_{D,ADD} = T_{D,\,g_0 - G_{n-2}} = \sum_{j=1}^{\log_2(n/2)} \left[ T_{D,OP}(j+2) + T_{D,OP}(j) \right] \tag{2.21} \]

where $T_{D,OP}$ is the operator delay. The two summands represent the two
binary trees and the arguments of $T_{D,OP}$ specify the variable load capacitances.

Next, the interconnect complexity of the Brent-Kung structure is analyzed.
The interconnect complexity is analyzed at the dot function level,
which is the $\ell_1$ level in this case.

The interconnect count in the Brent-Kung structure is given below

\[ IC_{\ell_1} = n \times (2 \times \log_2 n - 1) + (n-1) + (n-1-\log_2 n) + 2 \times n + n \tag{2.22} \]

The interconnect complexity of the Brent-Kung structure at the gate level
is given by

\[ IC_{\ell_2} = ((n-1) + (n-1-\log_2 n)) \times IC_{dot} \tag{2.23} \]

With $IC_{dot}$ the interconnect count and $T_{ID,dot}$ the interconnect delay
within a dot function (Section 2.1.2), the interconnect
delay product at the gate level is

\[ IDP_{\ell_2} = (2\log_2 n - 1) \times IC_{dot} \times T_{ID,dot} \tag{2.24} \]

Now the interconnect delay product at the $\ell_1$ level, i.e., at the dot function
level, assuming that each interconnect offers unit delay, is given by

\[ IDP_{\ell_1} = (2\log_2 n - 2) \tag{2.25} \]

The hardware power delay product for the Brent-Kung structure is given
by

\[ HPD = ((n-1) + (n-1-\log_2 n)) \times P_{dot} \times T_{D,ADD} \tag{2.26} \]

where $P_{dot}$ is the power due to a dot function.

The interconnect power delay product calculated at the $\ell_1$ level is

\[ IPD_{\ell_1} = P_{O,dot} \times IC_{\ell_1} \tag{2.27} \]

where $P_{O,dot}$ is the power at the output of the dot function.
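The closed-form counts for the Brent-Kung carry block can be evaluated directly. The sketch below assumes that n is a power of two so that $\log_2 n$ is an integer, and uses $IC_{dot} = 20$ (Equation 2.8) and $T_{ID,dot} = 7$ (Equation 2.9) as defaults; the function names are hypothetical.

```python
IC_DOT = 20    # dot operator interconnect count, Equation 2.8
T_ID_DOT = 7   # dot operator interconnect delay in unit delays, Equation 2.9


def log2_exact(n):
    """Return log2(n) for a power of two n >= 2 (an assumption of this sketch)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    return n.bit_length() - 1


def bk_dot_operators(n):
    """Binary tree (n - 1) plus inverse tree (n - 1 - log2 n) dot operators."""
    return (n - 1) + (n - 1 - log2_exact(n))


def bk_ic_l1(n):
    """Equation 2.22: interconnect count at the dot function (l1) level."""
    k = log2_exact(n)
    return n * (2 * k - 1) + (n - 1) + (n - 1 - k) + 2 * n + n


def bk_ic_l2(n, ic_dot=IC_DOT):
    """Equation 2.23: interconnect complexity at the gate (l2) level."""
    return bk_dot_operators(n) * ic_dot


def bk_idp_l2(n, ic_dot=IC_DOT, t_id_dot=T_ID_DOT):
    """Equation 2.24."""
    return (2 * log2_exact(n) - 1) * ic_dot * t_id_dot


def bk_idp_l1(n):
    """Equation 2.25."""
    return 2 * log2_exact(n) - 2


if __name__ == "__main__":
    for n in (8, 16, 32):
        print(n, bk_dot_operators(n), bk_ic_l1(n), bk_ic_l2(n), bk_idp_l1(n), bk_idp_l2(n))
```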

2.2.4 Carry Save Adder

The adders dealt with up to this point are grouped under the term carry
propagate adders (CPA). In these adders the carry signal must be evaluated first
in order to determine the sum. For two-operand addition these adders are efficient. But
for multiple-operand addition, as encountered in some DSP applications like
the inner product calculation in a DFT, they are not very efficient.
For such addition the most viable alternative is to use carry
save adders (CSA) [32]. In a CSA the carry signal is not used for the current
addition, but rather for its successor.

Figure 2.13 shows an n-bit CSA for adding three operands. Here n full adders
without interconnects among them are used to merge the three operands X, Y
and Z into n sum bits $s_i$, which constitute the sum vector S, and n carry bits
$c_{i+1}$, which constitute the carry vector C. The carry vector is actually shifted
by one position. The representation of the sum and carry of
the CSA is redundant, as only n + 2 bits
are needed to represent the sum of three operands whereas 2n bits are
employed here. The result is complete only after the sum and the carry bits
are added in a CPA. Such an adder is popular because of its highly
structured and faster operation.


Figure 2.13: Carry Save Adder

A CSA which adds three operands is called a 3 × 2 CSA. In general we
can use any m × 2 CSA.

In case of multiple operand addition, CSAs are used until only a sum and
a carry vector remain. These two are then merged in a CPA to get the final
sum. Figure 2.14 shows a CSA tree for eight operands. The horizontal arrows
across every carry bit indicate the shifting of the carry bits to the left by one
position at every stage. The final adder at the last stage is a CPA.

The delay of an n-bit CSA is just the delay of a full adder. The delay of
a CSA tree for adding N operands depends on the number of stages in that
tree plus the delay of the final CPA stage. The delay of the CPA stage has
already been dealt with in detail in the previous sections.

The interconnect complexity, hardware complexity and the delay of an
m × 2 CSA tree for adding N operands are presented below.


Figure 2.14: Carry Save Adder Tree for eight operands


Notations followed throughout this section

N is the number of operands

m is the order of the CSA

n is the wordlength of the input operands

S is the number of CSAs in each stage

TCSA is the total number of CSAs in the tree, found by adding the number of CSAs
in each stage

d is the number of stages

$T_{D,ADD}$ is the delay of the adder tree

$T_{D,CPA}$ is the delay of the CPA

Algorithm to find the total number of interconnects and hardware
complexity in a CSA structure

Initialize TCSA = 0 and d = 0

Until (N − N mod m)/m = 0

    S = (N − N mod m)/m

    TCSA = TCSA + S

    N = 2[(N − N mod m)/m] + N mod m

    (the above equation is the way N, the number of inputs to each stage,
    decreases at every stage)

    d = d + 1

continue

Total number of CSAs = TCSA

Total number of interconnects at the $\ell_1$ level, $IC_{\ell_1} = (m \times TCSA \times n) + 2 \times n$

Interconnect complexity at the $\ell_2$ level, $IC_{\ell_2} = TCSA \times n \times IC_{FA}$

Delay due to the CSA tree part alone, $T_{D,CSA} = T_{D,FA} \times d$

Total delay of the adder tree, $T_{D,ADD} = T_{D,CSA} + T_{D,CPA}$

The interconnect delay product, power delay product and interconnect
power delay product are calculated for the adder tree part excluding the CPA
stage. The delay products for the CPA have already been dealt with thoroughly in
the previous sections.

The interconnect delay product at the $\ell_1$ level is

\[ IDP_{\ell_1} = d \tag{2.28} \]

The interconnect delay product at the $\ell_2$ level is

\[ IDP_{\ell_2} = T_{ID,FA} \times d \tag{2.29} \]

The hardware power delay product of the CSA tree is found to be

\[ HPD = TCSA \times n \times P_{FA} \times T_{D,CSA} \tag{2.30} \]

The interconnect power delay product of the CSA tree at the $\ell_1$ level is

\[ IPD_{\ell_1} = P_{O,FA} \times d \tag{2.31} \]
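The stage-by-stage reduction described in this section can be written as a short routine. The sketch below follows the iteration given in the text, assuming that the counters TCSA and d start at zero and that m is at least 3 (otherwise the number of inputs would never shrink). The function names, the dictionary keys and the default full adder values ($IC_{FA} = 116$ from Equation 2.6 and a unit full adder delay) are choices made for illustration.

```python
def csa_tree(num_operands, m=3):
    """Return (TCSA, d): total number of m x 2 CSAs and number of stages."""
    if m < 3:
        raise ValueError("m must be at least 3 for the reduction to terminate")
    total_csas, stages = 0, 0
    n_inputs = num_operands
    while n_inputs // m > 0:            # until (N - N mod m)/m = 0
        s = n_inputs // m               # CSAs in this stage
        total_csas += s
        n_inputs = 2 * s + n_inputs % m  # two outputs per CSA plus leftover inputs
        stages += 1
    return total_csas, stages


def csa_tree_metrics(num_operands, wordlength, m=3, ic_fa=116, t_d_fa=1):
    """Interconnect and delay measures of the CSA tree, excluding the final CPA."""
    tcsa, d = csa_tree(num_operands, m)
    return {
        "TCSA": tcsa,
        "d": d,
        "IC_l1": m * tcsa * wordlength + 2 * wordlength,
        "IC_l2": tcsa * wordlength * ic_fa,
        "T_D_CSA": t_d_fa * d,
        "IDP_l1": d,                    # Equation 2.28
    }


if __name__ == "__main__":
    # Eight operands with 3 x 2 CSAs, as in Figure 2.14: inputs go 8 -> 6 -> 4 -> 3 -> 2.
    print(csa_tree(8))  # (6, 4)
```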

2.3 Multipliers

Besides addition, multiplication is a heavily used core operation in signal
processing. In many DSP applications where high throughput is the
prime concern, fast multipliers are required. Such fast multiplier
configurations have high interconnect complexity in addition to higher hardware
complexity, and hence higher hardware cost and power. In the following
sections, the interconnect complexity, hardware complexity, delay and power
aspects of various multiplier configurations are presented.

2.3.1 Parallel Array Multiplier

This is the simplest parallel multiplier [4], in which the multiplier-multiplicand
bits are summed up one by one by means of a series of CSAs. The multiplier
has a two dimensional array structure of full adders as shown in Figure 2.15.
Each row except the final row forms a CSA and the final row forms a CPA
or a ripple carry adder.

The hardware complexity (number of logic gates) of the array multiplier
is proportional to $n^2$, where n is assumed to be the wordlength of the inputs
to the multiplier.

The interconnect complexity of the array multiplier at the $\ell_1$ level (full
adder level) is found to be equal to $3(n-1)^2 + 3n$.

The interconnect complexity of the array multiplier at the gate level,
namely the $\ell_2$ level, is

\[ IC_{\ell_2} = n^2 \times IC_{FA} \tag{2.32} \]

The delay of the array multiplier is proportional to n.

\[ T_{D,MUL} = n \times T_{D,FA} \tag{2.33} \]

The interconnect delay product at the full adder level is

\[ IDP_{\ell_1} = n \tag{2.34} \]

The interconnect delay product at the gate level is

\[ IDP_{\ell_2} = IC_{FA} \times T_{ID,FA} \times n \tag{2.35} \]

The hardware power delay product of the array multiplier is found to be

\[ HPD = n^2 \times P_{FA} \times T_{D,MUL} \tag{2.36} \]

The interconnect power delay product of the array multiplier at the $\ell_1$
level is

\[ IPD_{\ell_1} = P_{O,FA} \times n \tag{2.37} \]
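To see how quickly the array multiplier's interconnect count grows compared with its delay, the expressions of this section can be evaluated for a few wordlengths. The sketch assumes the full adder values of Equations 2.6 and 2.7 as defaults; the function names and any delay or power values are hypothetical.

```python
IC_FA = 116    # full adder interconnect count, Equation 2.6
T_ID_FA = 17   # full adder interconnect delay in unit delays, Equation 2.7


def array_mul_ic_l1(n):
    """Interconnect complexity at the full adder (l1) level: 3(n - 1)^2 + 3n."""
    return 3 * (n - 1) ** 2 + 3 * n


def array_mul_ic_l2(n, ic_fa=IC_FA):
    """Equation 2.32."""
    return n ** 2 * ic_fa


def array_mul_delay(n, t_fa):
    """Equation 2.33."""
    return n * t_fa


def array_mul_idp_l2(n, ic_fa=IC_FA, t_id_fa=T_ID_FA):
    """Equation 2.35."""
    return ic_fa * t_id_fa * n


def array_mul_hpd(n, p_fa, t_mul):
    """Equation 2.36."""
    return n ** 2 * p_fa * t_mul


if __name__ == "__main__":
    for n in (8, 16, 32):
        # The interconnect counts grow quadratically with n, the delay only linearly.
        print(n, array_mul_ic_l1(n), array_mul_ic_l2(n), array_mul_delay(n, t_fa=1))
```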

2.3.2 Wallace Tree Multiplier

A further alternative for the multiplier implementation is the Wallace Tree
Multiplier [43], in which the partial products are evaluated first and are then
added in a CSA tree. A CSA sums up three binary numbers and produces
two binary numbers (i.e., a partial sum and a partial carry). Therefore, using
n/3 CSAs in parallel, we can reduce the number of multiplicand-multiples
from n to about 2n/3. Then, using about 2n/9 CSAs, we can further reduce
it to 4n/9. Applying this principle repeatedly, the number of
multiplicand-multiples can be reduced to only two. As seen in Section 2.2.4, the final
multiplicand-multiples (the final carry and sum vectors) can be added in a
CPA. Figure 2.16 shows the block diagram of a multiplier based on the Wallace
Tree. The delay is small and is proportional to log n when a fast CPA with
O(log n) delay is used for the final addition. The number of logic gates is
about the same as that of an array multiplier and is proportional to $n^2$. The
hardware complexity, interconnect complexity, delay and power analysis are
similar to that carried out for the CSA tree presented in Section 2.2.4.

The CSAs used in the Wallace Tree need not be of the
3 × 2 type; in general they can be any CSA of order m × 2, depending upon
the suitability for VLSI realization. Note that as m increases the delay of
the m × 2 adder tree increases. The interconnect, hardware, delay and power
analysis provided for the CSA tree in Section 2.2.4 is for the general
m × 2 case.

Figure 2.15: Parallel Array Multiplier


Figure 2.16: Multiplier Based on Wallace Tree


Chapter 3

DFT: Power-Delay Analysis

DFT is a fundamental operation in signal processing. The design of VLSI

architectures for DFT undergoes major changes when DSM technological

implementations are considered. Based on the DSM issues presented in the

earlier chapters there is a strong need to analyze the interconnect, hardware,

power and delay complexities of the DFT architectures.

3.1 Interconnect and Hardware Complexity

In this section, the interconnect and hardware complexity of the DFT
algorithm and the corresponding architecture are presented. This DFT
architecture is realized using PIPP, as discussed in Chapter 1. The
interconnect complexity is modelled at the functional level ($\ell_1$ level). The model is
developed based on the interconnect complexities of the various arithmetic
functional units presented in Chapter 2.


Input vector partition size   Hardware Count   Interconnect Count
4                             8280             2392
16                            32856            9840
64                            131160           36304
256                           524376           134480
1024                          2097240          530256

Table 3.1: Hardware and Interconnect Count for 1024 point DFT

Input vector partition size   Hardware Count   Interconnect Count
4                             8280             2392
16                            32856            9840
64                            131160           36304
256                           524376           134480
1024                          2097240          530256
4096                          8388696          2109264

Table 3.2: Hardware and Interconnect Count for 4096 point DFT

The interconnect and hardware complexity depends on the partitioning
of the input vectors (see Figure 1.1). It is clear that the number of
interconnects increases with the input partitioning. On the other hand, it
will be seen in the next section that the interconnect power-delay product
variation and the hardware power-delay product variation do not follow the
same trend. The interconnect and hardware counts of the DFT architecture
for different input partitionings are shown in Tables 3.1 and 3.2 for 1024 and
4096 sample points.


3.2 Hardware Power-Delay Analysis

Power-delay product is an important parameter for an architecture. The
power consumed by a digital system is governed by the following equation.

\[ P = f C V^2 \tag{3.1} \]

where,

f is the average number of 0 → 1 transitions

C is the capacitive load

V is the supply voltage

As seen from the power equation (3.1), for a specific supply voltage and

a given capacitive load, the power consumption is decided by the average

number of transitions. However, the capacitive load is heavily dependent on

the interconnect complexity. The power consumed by an individual transition

could be in the sub micro-Watt range, depending on the technology.
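Equation 3.1 can be exercised numerically to see how each factor affects the power. The sketch below is only an illustration: the function name and the numerical values of f, C and V are hypothetical.

```python
def switching_power(transitions, capacitance, supply_voltage):
    """Equation 3.1: P = f * C * V^2."""
    return transitions * capacitance * supply_voltage ** 2


if __name__ == "__main__":
    f, c = 2.0, 3.0  # hypothetical transition count and capacitive load
    # Power is linear in f and C but quadratic in V: halving V quarters the power.
    print(switching_power(f, c, 2.0), switching_power(f, c, 1.0))
```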

The power delay product analysis presented for DFT architectures in this

section is based on the average transition count at a functional level, like a

full adder or a flip-flop. This power-delay product is classified into intercon-

nect power delay product (IPD), hardware power-delay product (HPD) and

hardware interconnect power-delay product (HIPD) [30].

An analysis of HPD for the DFT architecture is presented below. The
power consumed by the respective hardware units like full adders, flip-flops,
etc. is calculated using the average transition count. We employ a unit delay
model for the gates. Using this, the functional level delay is calculated.

Input vector partitioning size   IPD        HPD        HIPD
4                                1.75E+20   9.38E+14   6.13E+20
16                               4.47E+18   3.38E+14   2.01E+19
64                               1.78E+17   1.00E+14   9.79E+17
256                              9.59E+15   2.71E+13   5.97E+16
1024                             5.86E+14   7.62E+12   4.02E+15

Table 3.3: Power-Delay Product for 1024 point DFT

Input vector partitioning size   IPD        HPD        HIPD
4                                2.40E+17   1.13E+23   3.97E+23
16                               8.65E+16   2.06E+22   9.26E+22
64                               2.56E+16   7.29E+20   4.01E+21
256                              6.93E+15   3.91E+19   2.44E+20
1024                             1.95E+15   2.34E+18   1.64E+19
4096                             5.66E+14   1.46E+17   1.16E+18

Table 3.4: Power-Delay Product for 4096 point DFT

The analysis of hardware power and hardware delay for the PIPP
architecture for DFT with different input partitioning was performed. The results
are presented in Tables 3.3 and 3.4. From the tables it is clear that the HPD
decreases with the input partitioning for a specific number of sample points
and increases with the number of sample points.

58

3.3 Interconnect Power-Delay Analysis

In DSM technology, with the deep scaling down of the devices, it is the interconnects rather than the hardware count that will decide the overall performance. Hence, it is necessary to calculate the IPD for any architecture. Like the hardware power-delay product, the interconnect power-delay product is calculated from the interconnect power and the interconnect delay. The delay imposed by the global interconnects (without buffers) is more than an order of magnitude larger than the device delay, and this is predicted to increase by several orders of magnitude beyond the 90nm technology. This is due to the interconnect capacitance becoming larger than the gate capacitance.

The vast increase in the number of interconnects has made the intercon-

nect capacitance the cause of most of the on chip power dissipation. The

power consumed by the interconnects is calculated using the average number

of transitions of the gates driving the interconnects. With reference to (3.1)

the capacitive load is due to the interconnect complexity, length and fan out.

The delay due to the interconnects is calculated on the basis of unit delay

and unit length as defined in Chapter 2. From these the IPD is determined.

Tables 3.3 and 3.4 show the IPD values for different sample points.


Interpretation of Results

A study of Table 3.3 for the 1024 point DFT is presented below. For this case the IPD is found to be more dominant than the HPD and hence the value of the HIPD is primarily influenced by the IPD. The IPD, and hence the HIPD, decrease as the input partitioning size increases for a specific number of sample points.

In Table 3.4, for the 4096 point DFT, the HPD is found to be more dominant than the IPD and hence the overall HIPD is influenced by the HPD. The HPD, the IPD and hence the HIPD decrease as the input partitioning size increases for a specific number of sample points.

It can be interpreted from the tables that up to the 1024 point DFT the IPD dominates the HPD, whereas beyond 1024 sample points the HPD has the higher values.

Chapter 4

FFT: Power-Delay Analysis

The FFT has greatly reduced computational complexity over DFT. Under

the micron technology this had great impact in reducing the chip area and

power with increased performance. However, the merits of FFT are lost

under DSM technology. This is due to the dominance of interconnect in

FFT architectures. This chapter focusses on the power and delay aspects

due to interconnect and hardware complexity.

4.1 Interconnect and Hardware Complexity

In this section, the interconnect complexity is presented considering the

characteristics of FFT algorithm and the wordlength variation of the partial

results flowing across the butterfly computing stages. At the architecture

level, a detailed analysis of interconnect complexity within and across the

computing element (butterfly) and delay commutator stages is presented.


Further, the hardware complexity analysis for the computing element and

delay commutator is presented.

4.1.1 FFT Algorithm

The interconnect complexity (count) for an $r^j$ point 1-D FFT across the butterfly computing stages is given by the following recursive formula.

$$
\sum_{n_j=0}^{r-1} W_{r^j}^{n_j(k_1+\ldots+r^{j-2}k_{j-1})} \left[ \sum_{n_{j-1}=0}^{r-1} W_{r^{j-1}}^{n_{j-1}(k_1+\ldots+r^{j-2}k_{j-1})} \cdots \left[ \sum_{n_2=0}^{r-1} W_{r^2}^{n_2 k_1} \left[ \sum_{n_1=0}^{r-1} X(r^{j-1}n_1+\ldots+n_j)\, W_r^{n_1 k_1} \right] W_r^{n_2 k_2} \right] \cdots W_r^{n_{j-1}k_{j-1}} \right] W_r^{n_j k_j} \tag{4.1}
$$

where,

$X(r^{j-1}n_1 + \ldots + r\,n_{j-1} + n_j)$ is the word length of the input sample point

the $W$s are the wordlengths of the respective twiddle factors

$r$ is the radix

$n_1, n_2, n_3, \ldots, n_j \in (0, 1)$

$k_1, k_2, k_3, \ldots, k_j \in (0, 1)$ and

$j$ is the number of stages

Figure 4.1 shows the total number of interconnects across the stages for different sample points. An important inference from the graph is that a higher radix proves to be more efficient than a lower radix from the interconnect point of view. This is in spite of the fact that the hardware complexity within the computational element (CE) increases with the radix.

Figure 4.1: Algorithm level interconnect complexity of FFT

With increase in number of sample points, the interconnect complexity

varies sharply for lower radix values. This is evident from the slope of the

curve for radix-2. On the other hand, as the radix value increases the slope

decreases.


Figure 4.2: Radix 4 FFT architecture for 256 sample points

4.1.2 FFT Architecture

The basic FFT can be calculated using different radices, viz. radix 2, 4, 8, 16, etc. The radix-4 FFT architecture is shown in Figure 4.2. This architecture has interleaved computational elements (CE) (refer Figure 4.3) and delay commutators (DC) (refer Figure 4.4). In a radix-r pipelined architecture, the CE performs the r-point butterfly computation. Reordering of the input data stream to the next CE is performed in the DC.

Computational Element

The architecture of the CE is shown in Figure 4.3. This CE performs the radix-4 butterfly operation. In order to achieve a higher pipelining rate, the CE employs a pipelined Wallace tree multiplier built from CSA blocks. The Brent-Kung CLA is used in the final stage of this multiplier. The Wallace tree and Brent-Kung CLA architectures are discussed in Chapter 2.

Figure 4.3: Radix 4 Computational Element

Figure 4.4: Delay Commutator Circuit

Delay Commutator

The number of shift registers in each DC for radix-$r$ is $2(r-1)$ and the length of the registers is given by $k \times r^{\log_r M-(s+1)}$, where $s$ is the stage number and $k = 1, 2, 3, \ldots, r-1$. The shift register complexity (flip-flop count) of the DC is given by $Nm - r \;(\approx Nm)$.

For a 4096 point radix-4 FFT, the lengths of the different shift registers

needed on the input port and output port of the DCs are 768 word, 512

word, 256 word, 192 word, 128 word, 64 word, 48 word, 32 word, 16 word, 12

word, 8 word, 4 word, 3 word, 2 word and 1 word. With the increase in the

register length, the latency increases in a non-linear fashion as a function of

the radix and the number of sample points.
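The register lengths quoted above follow directly from the length expression for the DC shift registers. The sketch below is only an illustration: the function name is hypothetical, and the stage numbering s = 1, ..., log_r(M) - 1 is an assumption, chosen because it reproduces the fifteen lengths listed for the 4096 point radix-4 case.

```python
def dc_register_lengths(num_points, radix):
    """Shift register lengths (in words): k * radix**(log_r(M) - (s + 1))."""
    # Find the number of stages j such that radix**j == num_points.
    stages = 0
    size = 1
    while size < num_points:
        size *= radix
        stages += 1
    if size != num_points:
        raise ValueError("num_points must be a power of radix")

    lengths = []
    for s in range(1, stages):                 # assumed stage numbering
        base = radix ** (stages - (s + 1))
        for k in range(radix - 1, 0, -1):      # k = r-1, ..., 1
            lengths.append(k * base)
    return lengths


lengths_4096 = dc_register_lengths(4096, 4)
```

Summing the returned list gives the total number of register words under the same assumption about stage numbering.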

As the shift registers are very long, the power they consume (during transitions) while pipelining is also significant. Moreover, the latency is determined by the sum of the delays of the CEs and the delay through the serial shift registers of the DCs. The pipelining rate within the CEs is given by $t_{add}$. For $r^j$ sample points, where $j$ is any positive integer, the number of pipelining cycles is $j$. The total execution time is given by (Latency + $t_{add}(j-1)$), where the latency is the sum of the delays through the CEs and DCs.

Sample Points    Hardware Count    Interconnect Count
16               45321             11715
64               1724492           51976
256              1.4247E+8         214501
1024             1.15823E+10       865606
4096             9.30206E+11       3474199

Table 4.1: Hardware and Interconnect Complexity for Radix 4 FFT

Sample Points    Hardware Count    Interconnect Count
64               251369            247989
512              2.806936E+07      3734777
4096             9.314962E+09      58103185

Table 4.2: Hardware and Interconnect Complexity for Radix 8 FFT

The total delay through the DCs is given by

$$\sum_{s=1}^{\log_r N} \sum_{k=1}^{r-1} k \times r^{\log_r N-(s+1)}.$$

Tables 4.1 and 4.2 show the interconnect and hardware complexity for the radix 4 and radix 8 FFT architectures. The interconnect complexity of radix 8 is higher than that of radix 4. This trend differs from that at the algorithmic level because the DCs were not taken into account at the algorithm level, whereas at the architectural level they play an important role in deciding the overall interconnect complexity. The analysis of interconnect complexities reveals that FFT architectures, unlike DFT architectures, suffer from interconnect dominance.

Sample Points    IPD         HPD         HIPD
16               36675500    36675500    1.28E+08
64               2.93E+10    7.91E+08    7.87E+10
256              1.38E+13    2.85E+10    3.56E+13
1024             5.34E+15    1.06E+12    1.33E+16
4096             8.83E+21    4.69E+15    4.38E+22

Table 4.3: Power-Delay Product for Radix-4 FFT

4.2 Hardware Power-Delay Analysis

In this section the hardware power-delay product is analyzed. The power consumed by the respective hardware units, such as full adders, flip-flops and multiplexers, is calculated using the average transition count, as in Chapter 2. We employ a unit delay model for the gates to calculate the functional level delay.

The results of the HPD analysis are given in Tables 4.3 and 4.4. From the tables it is clear that as the radix increases the HPD increases initially and then decreases. Across different sample points the HPD increases within the same radix. Comparing the HPD of DFT and FFT, we find that FFT has a lower HPD than DFT.

Sample Points    IPD         HPD         HIPD
64               63980200    1.63E+09    2.86E+09
512              2.23E+10    6.26E+11    1.08E+12
4096             1.19E+13    2.63E+14    4.58E+14

Table 4.4: Power-Delay Product for Radix-8 FFT

4.3 Interconnect Power-Delay Analysis

The results of the IPD analysis are given in Tables 4.3 and 4.4. For a given

radix, the IPD increases with increase in the number of sample points. On

the other hand, as the radix increases the IPD decreases for a given number

of sample points.

Up to radix-4 FFT, the IPD dominates the HPD and the overall HIPD is decided by the IPD. For radix 8 and above, however, the HPD dominates the IPD and the overall HIPD is influenced by the HPD. Therefore we can determine the optimum radix value for a given set of sample points to achieve the best power-delay product.

4.4 DFT and FFT Architectures - A DSM Perspective

In DSM technology, with the deep scaling down of the devices, it is the interconnects rather than the hardware count that will decide the overall performance. Comparing the DFT and FFT architectures using Tables 3.1, 3.2, 3.3, 3.4, 4.1, 4.2, 4.3 and 4.4, we make the following observations. The total power-delay product is more pronounced in lower radix FFT than in DFT. On the other hand, for a higher radix FFT, the total power-delay product is dominant in DFT. The interconnect count is higher in FFT than in DFT irrespective of the radix value. Moreover, as the radix increases, the interconnect dominance in FFT increases rapidly.

Besides the dominance of local interconnects in FFT, the major technological drawback is the clock broadcasting to the large number of flip-flops of the DCs present in the different stages. The number of flip-flops to be driven by the clock against the number of sample points for radix-4 is shown by the graph in Figure 4.5. For a given number of sample points, however, the flip-flop count grows as a non-linear function of the radix.

The DFT PIPP architecture benefits from the absence of switching and

delay elements leading to lower power consumption and no clock broadcast-

ing. The above analysis can be extended to multidimensional DFT wherein

reduction in the interconnect complexity, latency and power will be substan-

tial.

The important factor critical to FFT architecture with regard to interconnect complexity is the presence of delay commutators. In general, a lower radix FFT architecture is preferred in order to avoid hardware overhead in the computing elements. However, for higher radix FFT, the IPD, HPD and HIPD are of relatively lower values, with the lowest being the IPD. The analysis further shows that, relative to lower radix FFT, the values of the above mentioned parameters are several orders of magnitude lower in DFT. As we move to higher radix FFT the parameter values, except the interconnect count, become higher in DFT. For the analysis presented in Tables 3.1, 3.2, 3.3, 3.4, 4.1, 4.2, 4.3 and 4.4 no word length truncation is performed.

Figure 4.5: Communication Complexity of Clock Distribution in the DC Flip-Flops

As a whole, the interconnect complexity in FFT increases considerably over DFT with increase in radix across different sample values. To worsen the situation further, the global interconnect complexity of the clock distribution in FFT is very high: it grows as a linear function of the number of sample points and as a non-linear function of the radix.

Based on the ITRS projection with the shift from device to interconnect

dominated design paradigm, DFT will be more suitable for multi GHz DSM

architecture implementation [30].


Chapter 5

Number Systems and DSP

Performance, speed and accuracy are important parameters of any AU. The number system that an AU employs has a significant effect on these parameters. The efficiency in the execution of computationally complex DSP algorithms greatly depends on the type of number system employed. The focus of this chapter is to discuss the various number systems implemented in arithmetic units.

5.1 Characteristics of Number Systems

Range: The range of a number system is defined as the interval over which

every defined digit can be uniquely represented by the system, i.e without

having two numbers with the same representation. A number system is said

to have infinite range if each defined digit can be uniquely represented. The

decimal number system is an example of a number system with infinite range.


Uniqueness: A number system is said to be unique if each number in the

system has only one representation.

Redundancy: A number system is defined to be redundant if there are fewer defined numbers than there are combinations of digits. Therefore, there could exist some combinations of the digits for which a defined number may not exist. Redundancy could also refer to the situation where there is more than one representation for a number. Hence, the absence of uniqueness implies that the system is redundant.

Weighted Number System: A number system is said to be weighted if there exists a set of weights such that the value of any number is expressed as

$$V(A) = \sum_{i=0}^{N-1} a_i r^{i} \tag{5.1}$$

where $a_i$ are the set of defined digits. If the value of $r$ is constant, then the number system has a fixed base or a fixed radix, e.g. the decimal system with base ten ($r = 10$). Number systems in which the weights are not powers of the same number are called mixed-radix systems. Weighted number systems are advantageous because magnitude comparison, sign detection and overflow detection are easy to perform.

5.2 Binary Number Systems

Most arithmetic processors conventionally employ weighted number systems. A number is represented by a series of digits, with a weight attached to each digit. The value of the number is computed by multiplying each digit by its associated weight and then adding the products.

For example, in the binary number system the numbers are represented in the form of 1’s and 0’s. The digit set I contains (0, 1). The base or radix for the binary number system is 2. Consider the number A, represented as $a_{N-1}a_{N-2} \ldots a_2 a_1 a_0$ where $a_i \in (0, 1)$, $i = 0 \ldots N-1$. The value of the number is then given by (5.1). The same format is used to represent fractional numbers of the form $a_{N-1}a_{N-2} \ldots a_2 a_1 a_0 a_{-1} \ldots a_{-m}$ where $a_i \in (0, 1)$, $i = -m \ldots N-1$. The value of the fractional number is given by

$$V(A) = \sum_{i=-m}^{N-1} a_i r^{i} \tag{5.2}$$
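As an illustration of the weighted sum for a fixed-radix number with a fractional part, the following minimal sketch evaluates a digit string exactly using rational arithmetic. The function name and the split into integer and fraction digit lists are assumptions made for this example only.

```python
from fractions import Fraction


def fixed_radix_value(integer_digits, fraction_digits, radix=2):
    """Weighted sum of the digits a_{N-1}...a_0 . a_{-1}...a_{-m}."""
    value = Fraction(0)
    n = len(integer_digits)
    # Integer digits are given most significant first: weights r^(N-1) ... r^0.
    for pos, digit in enumerate(integer_digits):
        value += digit * Fraction(radix) ** (n - 1 - pos)
    # Fraction digits a_{-1}, a_{-2}, ... carry weights r^-1, r^-2, ...
    for pos, digit in enumerate(fraction_digits, start=1):
        value += digit * Fraction(radix) ** (-pos)
    return value


binary_example = fixed_radix_value([1, 0, 1], [1])       # binary 101.1
decimal_example = fixed_radix_value([4, 2], [5], 10)     # decimal 42.5
```

Using `Fraction` avoids rounding, so the result can be compared exactly with the expected value.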

All arithmetic processors should be able to work on both positive and negative numbers. The binary number system accomplishes this by appending a sign bit to the beginning of the binary word. The sign bit $S_A$ is zero for positive numbers and one for negative numbers. Mathematically,

$S_A = 0$ for $A \geq 0$

$S_A = 1$ for $A < 0$

Arithmetic units employ complement or sign magnitude representation while operating on negative numbers. With complements, the subtraction of two numbers is converted into the addition of the positive number and the complement of the negative number. The two main types of complement employed for a radix-r system are the radix complement, or r’s complement, and the diminished radix complement, or (r-1)’s complement.

For a binary system, i.e. with radix 2, the two methods to find the complement are the 1’s complement and the 2’s complement. The 1’s complement, or (r-1)’s complement, of a number is found by changing all 1s to 0s and vice versa. The 2’s complement, or r’s complement, of any binary number is found by first computing the 1’s complement of the number and subsequently adding 1 to it. The use of complements in arithmetic operations converts all subtractions into additions and hence simplifies the type and amount of hardware used.
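The complement rules above can be sketched on fixed-width words as follows. The function names and the explicit `bits` parameter are assumptions for illustration; results are returned as unsigned bit patterns of the given width.

```python
def ones_complement(value, bits):
    """Flip every bit of a `bits`-wide word."""
    mask = (1 << bits) - 1
    return ~value & mask


def twos_complement(value, bits):
    """1's complement plus one, truncated to `bits` bits."""
    mask = (1 << bits) - 1
    return (ones_complement(value, bits) + 1) & mask


def subtract_via_complement(a, b, bits):
    """Compute a - b as a + (2's complement of b), discarding the carry out."""
    mask = (1 << bits) - 1
    return (a + twos_complement(b, bits)) & mask


difference = subtract_via_complement(9, 5, 4)   # bit pattern of 9 - 5
negative = subtract_via_complement(5, 9, 4)     # bit pattern of -4 in 4 bits
```

For a negative result the returned pattern is the 2’s complement encoding, so `negative` is the 4-bit word 1100.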

Representation of floating point numbers

Many DSP applications require operations on floating point numbers. The standard format for representing these numbers is the IEEE standard floating-point representation (IEEE 754). Single precision numbers are encoded in a 32-bit format. The most significant bit is reserved for the sign, 8 bits are allocated for representing the exponent and the remaining 23 bits are used to represent the fraction.
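The field layout described above can be inspected with Python's standard `struct` module. This is a small sketch with a hypothetical helper name; the exponent field is returned in its stored (biased) form, where the single precision bias is 127.

```python
import struct


def decode_single(x):
    """Split a value, rounded to IEEE single precision, into its three fields."""
    # Pack as a big-endian 32-bit float, then reinterpret as an unsigned int.
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    sign = word >> 31                 # 1 bit
    exponent = (word >> 23) & 0xFF    # 8 bits, biased by 127
    fraction = word & 0x7FFFFF        # 23 bits
    return sign, exponent, fraction


fields = decode_single(-6.5)   # -6.5 = -1.625 * 2**2
```

Here the stored exponent is 2 + 127 = 129 and the fraction encodes 0.625 as 0x500000.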

5.2.1 Algorithms for multiplication and division

The main drawback in multiplication and division algorithms is that they

suffer from a high degree of repetitive computations. However, these algo-

rithms achieve high performance and accuracy. Two of the classical algo-

rithms for multiplication and division are presented below [17].

5.2.2 Multiplication

Consider the multiplication of two n-bit binary numbers $x_{n-1}x_{n-2} \ldots x_2x_1x_0$ and $y_{n-1}y_{n-2} \ldots y_2y_1y_0$. These numbers are stored in registers C and D. Register P is used to store the product and is initially set to zero. The following two steps are repeated n times.

Step 1: If the LSB of the value stored in register C is 1, then the contents of register D are added to P; otherwise zero is added to P. The sum is stored back into P.

Step 2: The contents of registers P and C are shifted right and the carry out of the previous addition is moved into the high-order bit of P. The low-order bit of P is moved into register C, and the rightmost bit of C is shifted out and is not used further in the execution of the algorithm.

After n steps, the value stored in P and C is the product, with register C holding the lower order bits. The hardware to perform the above operation is shown in Figure 5.1.

Figure 5.1: Multiplier Hardware
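The two steps can be simulated in software as shown below. This is a behavioural sketch, not a description of the hardware in Figure 5.1: the registers are modelled as Python integers, the function name is hypothetical, and P is allowed to hold n+1 bits for one step so that the carry out of the addition is shifted back in.

```python
def shift_add_multiply(x, y, n):
    """Multiply two unsigned n-bit numbers using registers P, C and D."""
    mask = (1 << n) - 1
    c = x & mask   # register C: multiplier, later the low half of the product
    d = y & mask   # register D: multiplicand
    p = 0          # register P: high half of the product
    for _ in range(n):
        # Step 1: add D to P when the LSB of C is 1 (otherwise add zero).
        if c & 1:
            p += d                       # carry out is kept as bit n of p
        # Step 2: shift (P, C) right; the low bit of P enters the top of C.
        c = (c >> 1) | ((p & 1) << (n - 1))
        p >>= 1                          # carry out moves into the top of P
    return (p << n) | c                  # product is the concatenation P:C


product = shift_add_multiply(11, 13, 4)
```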

5.2.3 Division

The SRT division algorithm described below is an example of a non-restoring algorithm. This class of algorithms achieves high performance. We consider the division of two n-bit numbers x and y (x/y). These numbers are loaded into registers X and Y. Register P stores the final remainder and is initially set to zero.

Step 1: If the value stored in Y has k leading zeros, then the registers are shifted left by k bits so that Y has no leading zeros.

Step 2: Repeat (2a), (2b) and (2c) for i = 0, ..., n-1.

(2a) If the top three bits of P are equal, then assign $q_i = 0$ and shift registers P and X left by one bit.

(2b) If the top three bits of P are not all equal and P is negative, set $q_i = -1$, shift registers P and X left by one bit and add Y.

(2c) Else set $q_i = 1$, shift registers P and X left by one bit and then subtract Y.

Step 3: If the final remainder is negative, it is corrected by adding Y and the quotient is corrected by subtracting one from $q_0$. The remainder is finally shifted k bits to the right, where k is the initial shift.
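The steps above can be sketched in software as follows. The sketch rests on several assumptions that are not part of the description: the registers are modelled as unbounded Python integers, P is treated as a signed (n+1)-bit quantity so that "top three bits equal" becomes a range check, the initial shift is applied to the combined (P, X) register pair as well as to Y, and the function and variable names are hypothetical.

```python
def srt_divide(x, y, n):
    """Radix-2 SRT division of unsigned n-bit x by y (y > 0, n >= 2)."""
    mask = (1 << n) - 1
    # Step 1: normalise so that Y has no leading zeros.
    k = n - y.bit_length()
    b = y << k
    combined = x << k                # (P, X) shifted left together
    p = combined >> n                # register P (signed during the loop)
    a = combined & mask              # register X
    q = 0
    small = 1 << (n - 2)             # top three bits of P equal <=> -small <= P < small
    for _ in range(n):
        msb = (a >> (n - 1)) & 1     # bit moving from X into P on the shift
        a = (a << 1) & mask
        shifted = 2 * p + msb
        if -small <= p < small:      # (2a) quotient digit 0
            digit, p = 0, shifted
        elif p < 0:                  # (2b) quotient digit -1, add Y
            digit, p = -1, shifted + b
        else:                        # (2c) quotient digit +1, subtract Y
            digit, p = 1, shifted - b
        q = 2 * q + digit            # accumulate the signed-digit quotient
    # Step 3: correct a negative remainder, then undo the initial shift.
    if p < 0:
        p += b
        q -= 1
    return q, p >> k


quotient, remainder = srt_divide(8, 3, 4)
```

Accumulating the digits as `2 * q + digit` converts the signed-digit quotient to an ordinary integer on the fly; a hardware design would typically keep the positive and negative digits separately.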


The binary number system has been popularly used because of its high accuracy and speed. However, a closer inspection of the algorithms employed for multiplication and division reveals that the binary number system is inefficient for these operations. Moreover, the large word length required to represent numbers of high magnitude and the problems associated with carry propagation have led to the use of alternative number systems for simplifying the arithmetic operations.

5.3 Residue Number Systems

The Residue number system (RNS) [15] [40] is based on the principle of breaking down a given number into a set of smaller word length residues. Smaller wordlengths lead to the absence of carry propagation and to a lower memory requirement for storage.

5.3.1 RNS representation of numbers

A number in RNS is represented by associating it with a set of radices or bases. However, unlike a fixed-radix number system, the base for residue numbers is not a single radix but an N-tuple of integers $(m_1, m_2, \ldots, m_N)$, where each element of the set is called a 'modulus'.

A k bit number is represented as $(x_{k-1} | \ldots | x_2 | x_1 | x_0)$.

Let M represent the N-tuple set of position weights (moduli) that are mutually prime, $M = (m_{k-1}, m_{k-2}, \ldots, m_1, m_0)$, with the condition that

$$m_{k-1} > m_{k-2} > \ldots > m_1 > m_0$$

The residue code for the number is given by

$$r_i = x \bmod m_i; \quad r_i \in [0, m_i - 1] \tag{5.3}$$

Number    3    5    7        Number    3    5    7
-15       0    0    6        0         0    0    0
-14       1    1    0        1         1    1    1
-13       2    2    1        2         2    2    2
-12       0    3    2        3         0    3    3
-11       1    4    3        4         1    4    4
-10       2    0    4        5         2    0    5
-09       0    1    5        6         0    1    6
-08       1    2    6        7         1    2    0
-07       2    3    0        8         2    3    1
-06       0    4    1        9         0    4    2
-05       1    0    2        10        1    0    3
-04       2    1    3        11        2    1    4
-03       0    2    4        12        0    2    5
-02       1    3    5        13        1    3    6
-01       2    4    6        14        2    4    0

Table 5.1: RNS Representation Example

The Residue Number System is inherently redundant as it is periodic. Any number can be uniquely represented in RNS only if the number lies within the dynamic range of the system. The dynamic range is equal to the product of the moduli and is also called the interval of definition. Table 5.1 shows the representation of numbers from -15 to +14 with the moduli set (3, 5, 7).
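The residue codes of Table 5.1 can be reproduced with a few lines of code. The sketch below uses a hypothetical helper name and relies on the fact that Python's `%` operator returns a non-negative result for a positive modulus, so negative numbers map directly to residues in the range [0, m-1].

```python
def to_rns(x, moduli):
    """Residue code of x: one residue per modulus."""
    return tuple(x % m for m in moduli)


MODULI = (3, 5, 7)
table = {x: to_rns(x, MODULI) for x in range(-15, 15)}

# Every value in one full dynamic range (3 * 5 * 7 = 105) has a distinct code.
distinct_codes = len({to_rns(x, MODULI) for x in range(105)})
```

Outside the dynamic range the codes repeat, which is the periodicity mentioned above: `to_rns(0, MODULI)` and `to_rns(105, MODULI)` are identical.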


5.3.2 Negative Number Representation

Variables often take negative values in arithmetic calculations. Two ways of representing negative numbers are listed, with the latter being the more commonly used. The first technique is to represent the absolute magnitude of a number in residue code and use an external sign bit to represent the sign. In the second method, the sign of the number is included within the residue code, similar to the complement representation in binary. In the dynamic range M, residue numbers in the range [0, (M/2)-1] are taken as positive and residue numbers in the range [M/2, M-1] are considered negative. Therefore, if X is represented as $(r_1, r_2, \ldots, r_N)$, -X is represented by $((m_1 - r_1), (m_2 - r_2), \ldots, (m_N - r_N))$.

Given the RNS representation of a number X, the representation of -X is calculated by complementing each of the digits $x_i$ with respect to its modulus $m_i$ (0 digits will remain unchanged).
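The digit-wise complement can be sketched as below. The helper name is hypothetical; the final reduction modulo m is what keeps a zero digit unchanged instead of turning it into m.

```python
def rns_negate(residues, moduli):
    """Residue code of -X, obtained by complementing each digit w.r.t. its modulus."""
    return tuple((m - r) % m for r, m in zip(residues, moduli))


minus_23 = rns_negate((2, 3, 2), (3, 5, 7))   # (2, 3, 2) is the code of 23
```

The result agrees with reducing -23 directly by each modulus, and negating twice returns the original code.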

Representational Efficiency

The representational efficiency of the residue number system is defined as the ratio of the dynamic range of the system to the total number of states that the bits in the residue number system can represent.

E.g. the dynamic range of RNS (7|5|3) is 7 × 5 × 3 = 105.

Each number encoded with the residue set above requires 3+3+2 = 8 bits.

8 bits can represent 256 unique states.

Hence, the representational efficiency = 105/256 = 41.02 percent.

Selection of Moduli

The selection of appropriate moduli is critical since it affects the represen-

tational efficiency and the complexity of the arithmetic processor involved.

The magnitude of the largest modulus directly affects the speed of the arith-

metic operations. To improve the efficiency, the moduli must be made com-

parable such that each modulus is nearly as large as the modulus of largest

magnitude. This does not affect the speed of the arithmetic operation. The

above conditions are bound by the constraint that the moduli so selected

must be relatively prime to each other.

Consider the RNS with the moduli (17|13|11|7|3|2). The maximum range that is represented is given by the dynamic range M = 102102.

The number of bits required is 5+4+4+3+2+1 = 19.

The speed of arithmetic operations is dictated by the largest wordlength, i.e. 5 bits. The moduli 2 and 13, and 3 and 7, can be combined with no apparent penalty, resulting in an RNS with the moduli set (26|21|17|11). The latter representation still needs 5+5+5+4 = 19 bits. However, there are two fewer moduli, leading to better hardware and computational efficiency.
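The figures in this example, and the representational efficiency defined earlier, can be checked with a short script. All function names are illustrative, and the bit count per residue is taken as the number of bits needed to store the largest residue m-1.

```python
from itertools import combinations
from math import gcd, prod


def pairwise_coprime(moduli):
    """True when every pair of moduli is relatively prime."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))


def dynamic_range(moduli):
    return prod(moduli)


def total_bits(moduli):
    # Bits needed to hold the largest residue (m - 1) of each modulus.
    return sum((m - 1).bit_length() for m in moduli)


def representational_efficiency(moduli):
    return dynamic_range(moduli) / 2 ** total_bits(moduli)


original = (17, 13, 11, 7, 3, 2)
combined = (26, 21, 17, 11)
small_efficiency = representational_efficiency((7, 5, 3))
```

Both moduli sets give the same dynamic range and the same 19-bit total, which is why combining the small moduli costs nothing in range or storage while reducing the number of channels.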


5.3.3 Arithmetic Identities

Additive inverse

The additive inverse of a number, X, is a number that when added to

X yields a result of zero. Since the modulus is congruent to 0, the additive

inverse is defined as X + (-X) = m or (-X) = m - X.

Eg. The additive inverse of 2 over a modulus of 5 is (−X) = 5− 2 = 3.

This property is used to define negative numbers, with the top half of the

range being the negative numbers of the bottom half of the range.

Multiplicative inverse

The multiplicative inverse of a number, X, is a number that when multiplied by X yields the result of 1. Unlike the additive inverse, the multiplicative inverse of a number does not always exist. The multiplicative inverse of a number, X, exists only if X is relatively prime to the modulus M (i.e. GCD(X, M) = 1).
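The existence condition can be illustrated with Python's built-in `pow`, which accepts an exponent of -1 together with a modulus (Python 3.8 or later). The wrapper name is hypothetical, and it returns `None` when no inverse exists.

```python
from math import gcd


def multiplicative_inverse(x, modulus):
    """Inverse of x modulo `modulus`, or None when gcd(x, modulus) != 1."""
    if gcd(x, modulus) != 1:
        return None
    return pow(x, -1, modulus)


inverse_of_2_mod_5 = multiplicative_inverse(2, 5)   # 2 * 3 = 6 = 1 (mod 5)
no_inverse = multiplicative_inverse(2, 4)           # gcd(2, 4) = 2
```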

5.3.4 Code Conversions

Conversion to Residue Code

The residue of a number is defined by (5.3). In conventional computers, this calculation is performed by dividing X by $m_i$ and determining the remainder. In a residue computer, which is capable of residue addition, multiplication, etc., a more efficient method is used to determine the residue representation. A number is represented in the binary system as

$$X = 2^n b_n + \ldots + 2^2 b_2 + 2^1 b_1 + b_0 \tag{5.4}$$

where $b_i$ are the binary digits of the integer X. Taking the modulo yields

$$|X|_{m_i} = \left| (2^n \bmod m_i)\, b_n + \ldots + (2^2 \bmod m_i)\, b_2 + (2^1 \bmod m_i)\, b_1 + b_0 \right|_{m_i} \tag{5.5}$$

If the powers of 2 modulo $m_i$ are directly available, either through look up tables or through special purpose hardware, then the value of $|X|_{m_i}$ is calculated by adding, modulo $m_i$, those modulo values of the powers of 2 where the bits $b_i$ are one.
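A minimal sketch of this table-driven conversion is given below. The function name and the `width` parameter (the number of binary digits examined) are assumptions for illustration; the look up table is rebuilt on every call for brevity, whereas a residue computer would hold it in fixed hardware or memory.

```python
def residue_from_bits(x, modulus, width):
    """Residue of x, built only from a table of powers of 2 and modular addition."""
    # Look up table of 2**i mod modulus for every bit position.
    table = [pow(2, i, modulus) for i in range(width)]
    total = 0
    for i in range(width):
        if (x >> i) & 1:                          # bit b_i is one
            total = (total + table[i]) % modulus  # modular addition only
    return total


residue = residue_from_bits(23, 7, 8)
```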

5.3.5 Conversion from RNS to BNS - The Chinese Remainder Theorem

The Chinese scholar Sun-Tsu, in the first century A.D., described a rule called t’ai-yen (great generalization) to determine a number having the remainders 2, 3, and 2 when divided by the numbers 3, 5, and 7. When the secret of this general technique to determine numbers based on residues was discovered, it became known as the Chinese Remainder Theorem (CRT). This theorem has been extensively used to convert numbers represented in the Residue Number system to a weighted number system.

The mathematical form of the CRT is given as follows

$$n = \left[ \sum_{i=1}^{L} \hat{m}_i \cdot \left( (x_i \cdot \hat{m}_i^{-1}) \bmod m_i \right) \right] \bmod M \tag{5.6}$$

where n is the resulting number, L is the number of sub-rings, $x_i$ is an element of the ith sub-ring, $m_i$ is the modulus of the ith sub-ring, M is the modulus of the overall system ring, $\hat{m}_i = M/m_i$ and $\hat{m}_i^{-1}$ is the multiplicative inverse of $\hat{m}_i$ over the ring modulo $m_i$. The CRT is explained with the help of a suitable example.

Figure 5.2: Ancient verse of the Chinese Remainder Theorem

Example: The number 23 is represented using the moduli 3, 5, and 7 as (2, 3, 2). These are the residues (remainders) after integer division by the moduli (e.g. 23/3 = 7 remainder 2). To convert (2, 3, 2) back to a decimal representation we use the CRT.

M = 3 × 5 × 7 = 105

$\hat{m}_i = 105/m_i$ = (35, 21, 15)

35 mod 3 = 2 and 2 × 2 mod 3 = 1, therefore $\hat{m}_1^{-1} = 2$

21 mod 5 = 1, therefore $\hat{m}_2^{-1} = 1$

15 mod 7 = 1, therefore $\hat{m}_3^{-1} = 1$

n = [(35)((2 × 2) mod 3) + (21)((3 × 1) mod 5) + (15)((2 × 1) mod 7)] mod 105

= (35 + 63 + 30) mod 105

= 23
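The worked example can be checked with a short routine that follows the same steps: form M, divide out each modulus, find the inverse of the quotient and accumulate the terms. The function and variable names are illustrative, and `pow(..., -1, m)` (Python 3.8 or later) is used for the multiplicative inverse.

```python
from math import prod


def crt(residues, moduli):
    """Recover n in [0, M) from its residues, for pairwise coprime moduli."""
    big_m = prod(moduli)
    total = 0
    for x_i, m_i in zip(residues, moduli):
        m_hat = big_m // m_i                  # M / m_i
        m_hat_inv = pow(m_hat, -1, m_i)       # inverse of M / m_i modulo m_i
        total += m_hat * ((x_i * m_hat_inv) % m_i)
    return total % big_m                      # the final mod M operation


n = crt((2, 3, 2), (3, 5, 7))
```

The final reduction modulo M is the wide operation that a residue machine, built around the small moduli, does not provide directly.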

Mixed Radix Conversion

The Chinese Remainder Theorem is one method of converting residue numbers to binary. The disadvantage of this method is the mod M operation, which makes it unsuitable for residue machines that are designed to perform modulo $m_i$ operations. The mixed radix conversion presented here, on the other hand, is easier to implement in a residue machine, since it involves only mod $m_i$ operations.

The mixed radix representation is of great importance in residue compu-

tation due to two reasons. 1) The mixed radix system is a weighted system

and hence is used in magnitude comparison. 2) Conversion from residue to

certain mixed-radix systems is relatively fast in residue computers.

Mixed Radix System

A number may be expressed in mixed radix form as

$$x = a_N \prod_{i=1}^{N-1} R_i + \ldots + a_3 R_1 R_2 + a_2 R_1 + a_1 \tag{5.7}$$

where $R_i$ are the radices and the $a_i$ represent the mixed radix digits. For a given set of radices, the mixed radix representation of x is denoted by $\langle a_N, a_{N-1}, \ldots, a_1 \rangle$. The multipliers of the mixed radix digits are the weights.

Conversion to the Mixed-Radix System

Consider a system where, for a set of moduli $m_1, m_2, m_3, \ldots, m_N$, a set of radices is chosen such that $m_i = R_i$; the mixed radix system is then said to be associated with the residue number system. The equation for mixed radix systems is transformed into

$$x = a_N \prod_{i=1}^{N-1} m_i + \ldots + a_3 m_1 m_2 + a_2 m_1 + a_1 \tag{5.8}$$

where the $a_i$ are the mixed radix coefficients. The coefficients are determined sequentially starting with $a_1$. Taking mod $m_1$ of the above equation yields $a_1$, since the remaining terms are multiples of $m_1$; $a_1$ is therefore the first residue digit. To obtain $a_2$, first the residue code of $x - a_1$ is formed. This quantity is divisible by $m_1$. It is seen that

$$a_2 = \left| \frac{x - a_1}{m_1} \right|_{m_2} \tag{5.9}$$

In the same manner all the other mixed radix digits are obtained. In general the mixed radix digits are found for $i > 1$ by

$$a_i = \left| \left\lfloor \frac{x}{m_1 m_2 \ldots m_{i-1}} \right\rfloor \right|_{m_i} \tag{5.10}$$

where $\lfloor \cdot \rfloor$ denotes the integer part.
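The sequential procedure can be sketched directly on the residue digits, so that only mod $m_i$ operations are needed. In the sketch below the subtraction of the current digit and the division by the current modulus are carried out in every remaining channel, with the division implemented as multiplication by a multiplicative inverse. The function names are illustrative, and `pow(..., -1, m)` requires Python 3.8 or later.

```python
def rns_to_mixed_radix(residues, moduli):
    """Mixed radix digits [a_1, a_2, ..., a_N] of the number with these residues."""
    res = list(residues)
    digits = []
    for i, m in enumerate(moduli):
        a = res[i]                    # current digit is the leading residue
        digits.append(a)
        # Subtract a and divide by m in every remaining channel (mod m_j only).
        for j in range(i + 1, len(moduli)):
            inv = pow(m, -1, moduli[j])
            res[j] = ((res[j] - a) * inv) % moduli[j]
    return digits


def mixed_radix_value(digits, moduli):
    """Evaluate a_1 + a_2*m_1 + a_3*m_1*m_2 + ... from the digits."""
    value, weight = 0, 1
    for a, m in zip(digits, moduli):
        value += a * weight
        weight *= m
    return value


digits_23 = rns_to_mixed_radix((2, 3, 2), (3, 5, 7))
```

For the moduli (3, 5, 7) the weights are 1, 3 and 15, so the digits of 23 satisfy 23 = 2 + 2·3 + 1·15.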


Mixed Radix Systems for Overflow detection, Comparison and Base Extension.

Overflow detection, scaling and base extension are RNS operations that

are easier than general division, but are considerably more difficult to im-

plement than addition, subtraction and multiplication. In all three cases the

mixed radix converter forms the basis of the operation, since a mixed radix

representation is required as an intermediate step in the procedure.

Overflow detection

In order to determine if overflow has occurred, it is necessary to provide additional dynamic range in the RNS. The result of a computation is then checked for overflow into the "extra" range. The overflow needs to be checked only when the residues are converted back to binary. This is because overflow has no meaning within residue arithmetic.

Adding a redundant modulus whose purpose is overflow detection provides the extra range needed for this purpose. A necessary and sufficient condition to check for overflow with one redundant modulus is that it should be the largest modulus. The occurrence of overflow is then detected if $a_{L+1} \neq 0$, where $a_{L+1}$ is the highest order mixed radix digit of the redundant RNS representation of X. This assumes that the quantity being tested, which has possibly overflowed the original RNS range, is not so large as to overflow the augmented range of the redundant system. This illustrates that overflow detection requires a mixed radix converter, designed to accommodate the augmented residue representation needed for redundancy.
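The overflow test can be illustrated with a small sketch. The moduli set (3, 5, 7) augmented by the redundant modulus 11 is an assumption chosen for the example, as are the function names; the highest order mixed radix digit is obtained with the sequential subtract-and-divide procedure, using `pow(..., -1, m)` (Python 3.8 or later) for the inverses.

```python
def highest_mixed_radix_digit(residues, moduli):
    """Highest order mixed radix digit of a residue code."""
    res = list(residues)
    for i, m in enumerate(moduli):
        a = res[i]
        # Remove the current digit and divide by m in the remaining channels.
        for j in range(i + 1, len(moduli)):
            res[j] = ((res[j] - a) * pow(m, -1, moduli[j])) % moduli[j]
    return res[-1]


def has_overflowed(residues, moduli):
    """Overflow of the original range, with the last modulus being redundant."""
    return highest_mixed_radix_digit(residues, moduli) != 0


AUGMENTED = (3, 5, 7, 11)           # original range 105, redundant modulus 11
a = tuple(60 % m for m in AUGMENTED)
total = tuple((x + x) % m for x, m in zip(a, AUGMENTED))   # residue code of 60 + 60
overflow = has_overflowed(total, AUGMENTED)
```

The sum 120 exceeds the original range of 105 but stays inside the augmented range of 1155, so the highest order digit is non-zero and the overflow is detected.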

Overflow detection and mixed radix conversion are similar in complexity

and are both considerably more complicated than RNS addition and mul-

tiplication. It is fortunate that overflow detection is a relatively infrequent

operation in many signal processing problems, in contrast to much more fre-

quently required addition and multiplication.

Scaling

In conventional fixed-radix arithmetic, two commonly used operations are multiplication and division by a power of the base. These operations are implemented easily in a digital computer by shifting the operand. Since shifting is fast, multiplication and division by a power of the radix offer obvious advantages over multiplying or dividing by an arbitrary number. In terms of residue arithmetic, the analogous operation is division by a predetermined number that is a product of any of the moduli which comprise the dynamic range M. This division operation is referred to as scaling.

Extension of Base

Frequently it is necessary to find the residue representation of a number

in one base depending on its representation in another base. In most cases,

the new base will be the extension of the original base, with one or more

91

extra moduli. The procedure, termed extension of base, is a mixed radix

conversion with an additional step. The new mixed radix representation of

the numbers with higher dynamic range is given by

$$x = a_{N+1} \prod_{i=1}^{N} m_i + a_N \prod_{i=1}^{N-1} m_i + \ldots + a_3 m_1 m_2 + a_2 m_1 + a_1 \qquad (5.11)$$

For any number in the original range, the value of aN+1 is zero.
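The mixed radix digits of (5.11) can be computed directly from the residues. The following is a minimal sketch, not the converter used in this work: the function name `mixed_radix_digits` and the moduli (3, 5, 7, 11) are illustrative assumptions, with 11 playing the role of the redundant (largest) modulus. It relies on Python 3.8 or later for the modular inverse `pow(x, -1, m)`.

```python
def mixed_radix_digits(residues, moduli):
    """Return [a_1, a_2, ...] with x = a_1 + a_2*m_1 + a_3*m_1*m_2 + ...

    Assumes the moduli are pairwise coprime (illustrative helper, not from the thesis).
    """
    r = list(residues)
    digits = []
    for i, m_i in enumerate(moduli):
        a = r[i] % m_i
        digits.append(a)
        # Subtract the digit and divide by m_i in every remaining channel.
        for j in range(i + 1, len(moduli)):
            m_j = moduli[j]
            r[j] = ((r[j] - a) * pow(m_i, -1, m_j)) % m_j
    return digits


MODULI = (3, 5, 7, 11)   # original base (3, 5, 7), extended by the redundant modulus 11

in_range = mixed_radix_digits([23 % m for m in MODULI], MODULI)     # 23 < 105
overflowed = mixed_radix_digits([200 % m for m in MODULI], MODULI)  # 200 >= 105
```

For 23, which lies inside the original range of 105, the highest order digit is zero; for 200 it is non-zero, which is the overflow condition described earlier in this section.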

5.3.6 Arithmetic operations in RNS

Addition and Subtraction

RNS is superior to weighted number systems for addition and subtraction, since the absence of carries between residue digits inherently results in higher speeds. In weighted number systems, extensive hardware is needed to implement carry look-ahead logic in order to limit carry propagation. In RNS, the hardware required for conversion takes the place of this additional hardware. Note, however, that the sum is obtained modulo M. If the true result exceeds M, an ambiguity arises, since the numbers $a$ and $a + kM$ have the same residue representation. Hence M must be chosen large enough to guarantee that results stay within the dynamic range and overflow is avoided.

Multiplication

Multiplication, like addition and subtraction, is done by performing the

modulo multiplication of the corresponding residues.
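The channel-wise nature of these operations can be illustrated with a short sketch. The moduli (3, 5, 7) and all function names below are assumptions chosen for illustration; conversion back to binary uses the CRT, with the modular inverse taken from Python's built-in `pow` (Python 3.8 or later).

```python
from math import prod

MODULI = (3, 5, 7)     # illustrative pairwise coprime moduli
M = prod(MODULI)       # dynamic range, here 105


def to_rns(x):
    return tuple(x % m for m in MODULI)


def rns_add(a, b):
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))


def rns_sub(a, b):
    return tuple((x - y) % m for x, y, m in zip(a, b, MODULI))


def rns_mul(a, b):
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


def from_rns(residues):
    """Chinese Remainder Theorem reconstruction."""
    total = 0
    for r_i, m_i in zip(residues, MODULI):
        M_i = M // m_i
        total += r_i * M_i * pow(M_i, -1, m_i)
    return total % M


sum_ok = from_rns(rns_add(to_rns(17), to_rns(30)))         # 47, inside the range
product_ok = from_rns(rns_mul(to_rns(9), to_rns(11)))      # 99, inside the range
sum_overflow = from_rns(rns_add(to_rns(60), to_rns(50)))   # 110 wraps around modulo 105
```

Each channel works only on its own residue, so no carry crosses between channels. The last line shows the ambiguity discussed above: 110 and 5 share the same residue representation when M = 105.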

Figure 5.3: Generalized Block Arithmetic for Addition, Subtraction and Multiplication

The representation of data in RNS is attractive as the absence of carry

propagation leads to a large speed increase for multiplication operations.

RNS provides a way of partitioning large dynamic range operations into

completely independent and parallel smaller dynamic range operations. Ad-

dition and multiplication operations are performed faster. RNS offers many

benefits, such as being able to skew clocks to reduce overall switching cur-

rent/power and system noise. RNS has been used to implement a number of

DSP related architectures [27] [5] [12].

RNS is not a weighted number system. Hence, it lacks many of the advantageous properties of weighted number systems, such as simple magnitude comparison, sign detection and overflow detection. Systems with many channels also tend to be unbalanced, i.e., the channels with larger moduli carry larger loads than the channels with smaller moduli.

5.4 Logarithmic Number Systems

A majority of the early arithmetic units utilized variations of the weighted binary number system for fractional, integer and floating-point representation. The conventional architectures for multiplication and division are of high circuit complexity and suffer from a speed-complexity compromise. Various number systems like RNS and LNS have been proposed to improve the computational efficiency.

LNS proposed in [9] greatly speeds up the multiplication and division

processes. The early techniques utilized ROMs to perform multiplication

[42]. However these techniques were inefficient because of the large number

of memory bits required for storage.

[9] is an early concise description of a general logarithmic number sys-

tem, capable of representing a wide range of both positive and negative real

numbers. Algorithms for the four basic operations (addition, subtraction,

multiplication and division) are discussed.

5.4.1 LNS Representation

Any number A is represented by its sign $S_A$ and the binary logarithm of its magnitude.

$$S_A = \begin{cases} 1 & \text{if } A < 0 \\ 0 & \text{if } A > 0 \\ 0 \text{ or } 1 & \text{if } A = 0 \end{cases}$$

The representation of numbers in logarithmic format has the disadvantage that negative logarithms, which arise for magnitudes less than one, cannot be represented. Scaling solves this problem: the number is multiplied by a scaling factor τ chosen such that the scaled number has a positive logarithm.

$$L_A = \begin{cases} \log_2(|\tau A|) & \text{if } |A| \geq \frac{1}{\tau} \\ 0 & \text{if } |A| \leq \frac{1}{\tau} \end{cases} \qquad (5.12)$$

Let $\Sigma_A$ represent the number A. Then

$$\Sigma_A = (1 - 2S_A) L_A$$

is the value of the number represented in LNS. The original number is recovered by the following formula:

$$A = (1 - 2S_A)\left(\frac{1}{\tau}\right) 2^{L_A} \qquad (5.13)$$

To represent the number $L_A$ in a discrete fashion, it is quantized to form the number $K_A$.

$$K_A = \frac{\left\lfloor \frac{1}{2} + 2^{b-1} \log_2(|\tau A|) \right\rfloor}{2^{b-1}} \qquad (5.14)$$

Here $\lfloor X \rfloor$ denotes the floor function, which returns the largest integer not greater than X. The $\frac{1}{2}$ term rounds the result to the nearest quantization level, so the error is smaller than with plain truncation.

To represent $K_A$ with finite precision, it is stored in n bits:

$$K_A = (K_n K_{n-1} \ldots K_b \ldots K_1) = \sum_{i=1}^{n} K_i 2^{i-b}$$
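The quantization in (5.14) and the recovery formula (5.13) can be sketched as follows. The function names and the default values τ = 1 and b = 8 are assumptions made for illustration only; floating point stands in for the n-bit fixed-point word.

```python
import math


def lns_encode(a, tau=1.0, b=8):
    """Return (S_A, K_A) following (5.12) and (5.14)."""
    sign = 1 if a < 0 else 0
    scaled = abs(tau * a)
    if scaled <= 1.0:
        # (5.12): the logarithm is clamped to zero for |A| <= 1/tau
        return sign, 0.0
    step = 2 ** (b - 1)
    return sign, math.floor(0.5 + step * math.log2(scaled)) / step


def lns_decode(sign, k, tau=1.0):
    """Recover the (approximate) number following (5.13), with K_A in place of L_A."""
    return (1 - 2 * sign) * (1.0 / tau) * 2.0 ** k


s, k = lns_encode(-6.0)
approx = lns_decode(s, k)
```

Because the logarithm is rounded to the nearest multiple of $2^{-(b-1)}$, the recovered value differs from the original by a factor of at most $2^{2^{-b}}$, which is about 0.27 percent for b = 8.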

5.4.2 Generation of logarithms for binary numbers

Many methods have been used to obtain the base-two logarithms of binary numbers. Both memory-based and hardware circuits have been explored for this purpose. While memory-based methods are inconvenient because of the large memory sizes required, hardware implementations suffer from the increased machine time needed to calculate logarithms. Methods for the generation of binary logarithms are discussed below.

Figure 5.4: Straight Line Approximation to Logarithmic Curve

The principal idea of the method in [29] is to substitute the log curve

between integer values by a straight line approximation. This introduces an

error but the simplicity of the proposed method makes it attractive in some

applications. The algorithm and its hardware implementation are discussed

below.

Figure 5.5: Machine Organization to generate and use binary Logs

Algorithm for the generation of approximate logarithm using shift and count principle

Registers A and B contain the two operands, each 8 bits wide. Hence the largest characteristic is 7, and the counters (x3, x2, x1) and (y3, y2, y1) each initially contain 111.

Step 1: Shift A and B left until their most significant "ONE" bits are in the leftmost position, counting down (x3, x2, x1) and (y3, y2, y1) during shifting.

Step 2: Bits 0-6 of A and B are shifted to C and D respectively. Now C and D contain the values of the approximate logarithms. These approximate logarithms are added for multiplication and subtracted for division. The logarithm of the result is stored in a new register, say E.

Step 3: Decode z4, z3, z2, z1 and insert a "ONE" in the appropriate position of F, with the remaining bits of E placed immediately to the right of this "ONE". F now contains the result.
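The shift and count steps above can be mimicked in software. The sketch below is an assumption-laden illustration rather than a model of the register-level hardware: the function names are hypothetical, the characteristic is obtained from `int.bit_length()` instead of a down-counter, and 7 fractional bits are used to match 8-bit operands. The straight line approximation takes log2 of N = 2^k (1 + m) to be k + m.

```python
FRAC_BITS = 7  # mantissa width for 8-bit operands (assumption)


def approx_log2(n):
    """Fixed-point approximate log2: characteristic in the high bits, mantissa below."""
    if n <= 0:
        raise ValueError("logarithm is only defined for positive integers")
    k = n.bit_length() - 1              # position of the leading ONE (characteristic)
    mantissa = n - (1 << k)             # bits to the right of the leading ONE
    if k >= FRAC_BITS:
        mantissa >>= k - FRAC_BITS
    else:
        mantissa <<= FRAC_BITS - k
    return (k << FRAC_BITS) | mantissa


def approx_antilog2(log_value):
    """Insert the leading ONE at the decoded position, mantissa to its right."""
    k = log_value >> FRAC_BITS
    mantissa = log_value & ((1 << FRAC_BITS) - 1)
    value = (1 << FRAC_BITS) | mantissa
    return value << (k - FRAC_BITS) if k >= FRAC_BITS else value >> (FRAC_BITS - k)


def approx_multiply(a, b):
    return approx_antilog2(approx_log2(a) + approx_log2(b))


exact_case = approx_multiply(16, 8)   # powers of two are reproduced exactly
worst_case = approx_multiply(3, 3)    # true product is 9
```

With this approximation the product is never overestimated, and the relative error stays within one ninth (about 11 percent), the worst case occurring when both mantissas equal one half, as in 3 x 3.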

5.4.3 Arithmetic Operations

Multiplication and Division

The biggest advantage of the sign/log number system is that it transforms computationally complex multiplication and division into simple addition and subtraction operations. For example, to multiply 789 and 234, the logarithms of the numbers are added to obtain the logarithm of the result, which can then be suitably decoded to obtain the output. The sign of the result is obtained by taking the XOR of the individual sign bits of the operands. A similar technique is adopted for dividing two numbers: the logarithm of the divisor is subtracted from the logarithm of the dividend to yield the logarithm of the output. The scaling factor that was initially applied to the operands is removed from the result.

Figure 5.6: LNS Multiplier Divider Hardware
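As a worked illustration of the 789 x 234 example, the following sketch performs multiplication and division in the sign/log domain. The function names are hypothetical, the scaling factor is taken as 1, and double-precision logarithms stand in for the quantized ROM values, so the quantization error of a real implementation does not appear here.

```python
import math


def to_sign_log(a):
    if a == 0:
        raise ValueError("zero has no logarithm and needs separate handling")
    return (1 if a < 0 else 0), math.log2(abs(a))


def from_sign_log(sign, log_mag):
    return (1 - 2 * sign) * 2.0 ** log_mag


def lns_multiply(a, b):
    (sa, la), (sb, lb) = to_sign_log(a), to_sign_log(b)
    return from_sign_log(sa ^ sb, la + lb)   # XOR of signs, addition of logarithms


def lns_divide(a, b):
    (sa, la), (sb, lb) = to_sign_log(a), to_sign_log(b)
    return from_sign_log(sa ^ sb, la - lb)   # XOR of signs, subtraction of logarithms


product = lns_multiply(789, 234)      # close to 184626
quotient = lns_divide(-789, 234)      # close to -3.3718
```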

Addition and Subtraction

Addition and subtraction in LNS are cumbersome and involve significant hardware complexity. They are performed by converting the add or subtract expression into equivalent multiplication or division operations.

Addition/subtraction Algorithm

Consider the addition of two numbers A and B.

$$S = A + B$$

$$S = A\left(1 + \frac{B}{A}\right) \qquad (5.15)$$

The addition expression is now converted into a multiplication of the first number with a function of the ratio of the two numbers, $S = A\,\psi\left(\frac{B}{A}\right)$, where $\psi(X) = 1 + X$.

First, the ratio $\frac{B}{A}$ is calculated. The value of ψ is then determined. The value returned by the function ψ is then multiplied by A.

The sign of the output is the sign of the operand with the larger magnitude. $K_A$ and $K_B$ denote the logarithms of the numbers A and B.

If $K_A \geq K_B$:

$$S_S = S_D = S_A$$

$$K_S = K_A + \beta(K_B - K_A)$$

$$K_D = K_A + \gamma(K_B - K_A) \qquad (5.16)$$

If $K_A \leq K_B$:

$$S_S = S_D = S_B$$

$$K_S = K_B + \beta(K_A - K_B)$$

$$K_D = K_B + \gamma(K_A - K_B) \qquad (5.17)$$

where $\beta(X) = \log_2(1 + 2^X)$ and $\gamma(X) = \log_2(1 - 2^X)$.

The above equations are realized through a comparator, adder, subtractor and ROM. The size of the ROM is a limiting factor in the implementation of circuits for large word length operations.

Figure 5.7: Hardware for Logarithmic Addition and Subtraction
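The role of the functions β and γ can be checked numerically. In the sketch below the function names are assumptions, only the magnitudes are handled (sign selection is left out), and β and γ are evaluated with floating point instead of being read from a ROM.

```python
import math


def beta(x):
    return math.log2(1.0 + 2.0 ** x)


def gamma(x):
    return math.log2(1.0 - 2.0 ** x)   # defined only for x < 0


def log_of_sum(ka, kb):
    """log2(A + B) from ka = log2(A), kb = log2(B), with A, B > 0."""
    hi, lo = max(ka, kb), min(ka, kb)
    return hi + beta(lo - hi)


def log_of_difference(ka, kb):
    """log2(|A - B|) from ka = log2(A), kb = log2(B), with A, B > 0 and A != B."""
    hi, lo = max(ka, kb), min(ka, kb)
    if hi == lo:
        raise ValueError("the difference is zero, so its logarithm is undefined")
    return hi + gamma(lo - hi)


k6, k10 = math.log2(6), math.log2(10)
total = 2.0 ** log_of_sum(k6, k10)              # close to 16
difference = 2.0 ** log_of_difference(k6, k10)  # close to 4
```

The argument passed to β and γ is never positive, which is why the larger logarithm is selected first by the comparator.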

Chapter 6

Logarithmic Residue Number System

This chapter explores the emergence of a new number system to overcome

the computational complexity of conventional binary number systems. RNS

has the advantage of carry free arithmetic operations. Thus, it is greatly

useful in speeding up addition, subtraction and multiplication when com-

pared to BNS. However, RNS suffers from inefficient division and magnitude

comparison operations.

Unlike RNS, sign/log number system achieves higher speed and much

reduced hardware complexity, particularly for multiplication and division

operations at the cost of accuracy. Considering the fact that RNS involves a

hardware intensive multiplication and an inefficient division operation, it is

attractive to embed the sign/log number system in RNS. Such a mixed number system, namely LRNS, will give better performance, lower power consumption and improved accuracy. The improvement in accuracy arises because the sign/log number system is applied to the residue code, which has a far shorter bit length than its binary counterpart.

The following sections detail the architectures used in arithmetic operations and list the advantages of LRNS over binary, RNS and LNS with respect to power, performance and area.

6.1 Arithmetic operations

This section describes the execution of the various arithmetic operations in LRNS. Addition and subtraction under this new scheme are performed in RNS, while multiplication is performed in LNS.

6.1.1 Addition and Subtraction

The operands for the addition/subtraction operation are converted to residue codes. This representation of operands as many segments of smaller word length is advantageous to the addition operation, as the individual residues can be operated on in parallel. Moreover, the smaller word length means that no carries propagate between residues and less memory is required for storage. The execution flow for addition/subtraction operations under LRNS is shown in Figure 6.1.

Figure 6.1: Execution flow for addition/subtraction operations in LRNS

6.1.2 Multiplication

Most DSP algorithms have a high degree of multiplicative complexity. LRNS plays a vital role in the efficient execution of such algorithms. The advantage of using LRNS is due to two main reasons: 1) the conversion of the operands into residues allows them to be processed in parallel, leading to higher speed; 2) the use of logarithms to compute the product greatly reduces the processing time, as the multiplication is replaced by a single addition. The inherent inaccuracy of using logarithms is also reduced, as the bit length of the operands is reduced.

The circuits for the addition of the logarithms are discussed in Chapter 5. A factor that has to be taken into account when designing the architectures for LRNS is the computation of log(0). Quite often, a residue of an operand takes the value 0. As the logarithm of zero is not defined, the following set of rules is used.

$$\log(0) = X, \qquad \operatorname{antilog}(X) = 0 \qquad (6.1)$$

A detailed flowchart depicting a possible architecture for the multiplication operation is shown in Figure 6.3.

Figure 6.3 includes a pre-checker circuit. This circuit is used to check if any of the operands is zero in the case of multiplication. If such a condition arises, the pre-checker circuit bypasses the computational units and drives the output to zero. This step saves valuable computation time. In the case of division, this circuit can be used to check whether the divisor is zero and correspondingly drive an error bit high.

Figure 6.2: Execution Flow for Multiplication in LRNS

Figure 6.3: Flow chart depicting Architecture for Multiplication

A checker circuit is employed to check if the generated residue codes

are zero. If any of the residues is zero, then the computation of logs and

antilogs follow the rule stated in (6.1). The checker and pre-checker circuits

are essential for the correct and efficient working of the LRNS multiplication

algorithm.
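The multiplication flow, including the pre-checker and checker, can be sketched in software as below. This is an illustration under several assumptions and not the architecture of Figure 6.3: the moduli (16, 15, 7) are the set $(2^k, 2^k - 1, 2^{k-1} - 1)$ with k = 4, the function names are hypothetical, double-precision logarithms replace the log and anti-log memories (so the result is exact here, whereas finite-precision tables introduce error), and the reduction of each channel product modulo its modulus after the anti-log step is an assumption of this sketch.

```python
import math

MODULI = (16, 15, 7)        # (2^k, 2^k - 1, 2^(k-1) - 1) with k = 4 (illustrative)
M = math.prod(MODULI)       # dynamic range, here 1680


def to_residues(x):
    return tuple(x % m for m in MODULI)


def channel_multiply(ra, rb, m):
    # Checker: a zero residue has no logarithm, so rule (6.1) forces a zero result.
    if ra == 0 or rb == 0:
        return 0
    log_sum = math.log2(ra) + math.log2(rb)   # one addition replaces the multiplication
    return round(2.0 ** log_sum) % m          # anti-log, then reduction (assumption)


def crt(residues):
    total = 0
    for r_i, m_i in zip(residues, MODULI):
        M_i = M // m_i
        total += r_i * M_i * pow(M_i, -1, m_i)
    return total % M


def lrns_multiply(a, b):
    # Pre-checker: bypass the computational units when an operand is zero.
    if a == 0 or b == 0:
        return 0
    channels = zip(to_residues(a), to_residues(b), MODULI)
    return crt(tuple(channel_multiply(x, y, m) for x, y, m in channels))


result = lrns_multiply(23, 41)            # 943, inside the dynamic range
with_zero_residue = lrns_multiply(16, 15) # several residues are zero here
```

In the second call two of the three channels see a zero residue, so the checker path is exercised while the operands themselves are non-zero.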

The memory modules for log and anti-log are of the Content Addressable Memory (CAM) type, thereby greatly reducing the memory access time. The residue codes of the results from multiplication or addition/subtraction operations flow down the architecture and are converted back to binary code in the final stage using the Chinese Remainder Theorem (CRT). This theorem is explained with an example in Chapter 5.

6.2 LRNS: Area, Power, Performance

The new mixed number representation LRNS has been projected as superior to BNS and RNS for certain computationally intensive DSP algorithms. To quantify the performance improvement of LRNS, it was compared with other number systems with respect to hardware complexity, interconnect complexity, delay, accuracy and power. The results of the analysis are presented below.

Number System   Full Adder Complexity   Gate Complexity   Interconnect Complexity   Delay    Accuracy   Power
Binary          64                      1056              576                       Tadd     Highest    High
RNS             65                      1104              613                       ∼Tadd    Highest    High

Table 6.1: Performance variation for addition operation

6.2.1 LRNS vs Binary

LRNS was compared with binary for different operations like addition, subtraction and multiplication.

6.2.1.1 Addition/Subtraction

In LRNS, addition and subtraction operations are carried out in RNS,

thus giving the system the capability of performing dynamic parallel compu-

tation on the individual residues, with the additional advantage of carry-free

operations. The hardware complexity is calculated at the functional level

(full adder level) and also at the logic level (gate level). The interconnect

count (IC) has been taken across the functional units. While estimating the

power, BNS was taken as a benchmark and labelled ’high’, and the power

involved in arithmetic computations in other number systems was compared

against this benchmark.

The results of the analysis are listed in Table 6.1. It shows the variation in the different parameters for 64-bit operations under the moduli set $(2^k, 2^k - 1, 2^{k-1} - 1)$. The hardware complexity at the full adder level is almost the same, whereas there is a slight increase in gate count for RNS. The increased gate count leads to a larger interconnect complexity across the gates. We have employed a CSA tree based adder architecture. Hence, the delay for this architecture will be the pipelining delay, equal to the delay of a single full adder stage. RNS maintains perfect accuracy as long as the operations are carried out within its dynamic range. This dynamic range can be extended by increasing the number and magnitude of the relatively prime moduli used. As the hardware complexity in RNS for add/sub operations is nearly comparable to BNS, the power consumed can also be termed 'high'.

6.2.1.2 Multiplication

Multiplication is a cumbersome operation in BNS. The basic operation

involved is the repeated addition of the multiplier to the multiplicand. This

repeated addition is carried out using CSA tree based structures. In LRNS, the repetitive additions are replaced by a single addition of the corresponding logarithms of the operands.

The hardware used for multiplication in LRNS is the logarithmic multiplier shown in Chapter 5. LRNS therefore greatly reduces the hardware required to perform multiplication. An order of magnitude reduction in hardware can be achieved. The reduced hardware count also reduces the interconnect count. Moreover, the delay of the circuit is much lower on account of the smaller number of computations to be performed. The low hardware count results in a reduction in the power consumed.

Number System   Full Adder Complexity   Gate Complexity   Interconnect Complexity   Delay     Accuracy   Power
Binary          3968                    47616             12032                     7Tadd     Highest    High
LRNS            65                      1104              613                       ∼2Tadd    Higher     Low

Table 6.2: Performance variation for multiplication operation

The reduction in hardware, interconnects, delay and power make LRNS

very attractive when compared with BNS for the execution of DSP algorithms

which are dominated primarily by multiplication.

6.2.2 LRNS vs RNS

LRNS differs from RNS only in its use of logarithms for performing multiplication operations. Hence, there is no performance variation between RNS and LRNS for addition and subtraction operations.

The performance variation for the multiplication operation when performed in RNS and LRNS is shown in Table 6.3. By using the property of logs, LRNS achieves a very high degree of reduction in hardware complexity, interconnect complexity, power etc. The power reduction is by virtue of reduced hardware count and computational complexity. The vast decrease in interconnect count is of great importance under DSM technology. However, the accuracy is not as high as in RNS due to truncation effects. The minor loss in accuracy is a small price to pay when compared to the drastic improvement in other performance parameters in LRNS.

Number System   Full Adder Complexity   Gate Complexity   Interconnect Complexity   Delay      Accuracy   Power
RNS             1279                    15348             3967                      < 7Tadd    Highest    High

Table 6.3: LRNS vs RNS: Performance variation for multiplication operation

6.2.3 LRNS vs Sign/Log

Though the sign/log number system provides a computationally efficient method for performing multiplication and division operations, it cannot be used to perform efficient addition/subtraction operations. Another major drawback of the sign/log number system is its inaccuracy. This is mainly due to truncation effects, which are more pronounced in operations on large word operands. LRNS reduces this truncation error, as the initial conversion to residue codes reduces the word length of the operand.

Number System   Full Adder Complexity   Gate Complexity   Interconnect Complexity   Delay    Accuracy   Power
Sign/log        128                     2517              1285                      4Tadd    Low        Higher
RNS             65                      1104              613                       ∼Tadd    Highest    High

Table 6.4: LRNS vs LNS: Performance variation for add/sub operation

6.2.3.1 Addition/Subtraction

The hardware to perform addition in LNS is quite complex and is dis-

cussed in Chapter 5. The performance difference between LRNS and Sign/log

number systems for the computation of addition/subtraction operation is

listed in Table 6.4.

As is evident from the results in Table 6.4, LNS is not suited to perform

addition or subtraction operations. LNS suffers from high power consump-

tion due to increased hardware count. The delay associated with add/sub

operation is 4Tadd, which is four times the delay associated with a similar

operation in BNS.

While the high delay, low accuracy and increased power make LNS unattractive, LRNS maintains the same performance as binary, as the add/sub operations are performed in RNS.

Number System   Full Adder Complexity   Gate Complexity   Interconnect Complexity   Delay    Accuracy   Power
Sign/Log        64                      1056              576                       2Tadd    Low        Low

Table 6.5: Performance variation for multiplication operation


LNS is highly efficient for multiplication and division operations. A comparison between LNS and LRNS for multiplication is shown in Table 6.5.

The results shown in Table 6.5 reveal that the conversion to residue codes does not affect the hardware or interconnect to a large extent. It does, however, help increase the accuracy of the system, as the truncation errors on small word lengths are smaller.

6.3 Accuracy Analysis

It is well known that the sign/log number system suffers from a high degree of inaccuracy. LRNS provides better accuracy than LNS. This is because the error in calculating the logarithm is smaller for operands of small word length. Tables 6.6 and 6.7 show the results for an example 8-point radix-2 FFT operation. The accuracy of LRNS is within 1 percent of the corresponding values in the binary number system.

Sample Point    Binary FFT     LRNS FFT
(16,16)         (128,128)      (128,128)
(16,16)         (0,0)          (0,0)
(16,16)         (0,0)          (0,0)
(16,16)         (0,0)          (0,0)
(16,16)         (0,0)          (0,0)
(16,16)         (0,0)          (0,0)
(16,16)         (0,0)          (0,0)
(16,16)         (0,0)          (0,0)

Table 6.6: Accuracy Analysis for 8 Sample point FFT-1

Sample Point    Binary FFT     LRNS FFT
(2,0)           (12,0)         (12,0)
(1,0)           (1,-2.414)     (1,-2.42)
(2,0)           (0,0)          (0,0)
(1,0)           (1,-0.414)     (1,-0.42)
(2,0)           (0,0)          (0,0)
(1,0)           (1,-0.414)     (1,-0.42)
(2,0)           (0,0)          (0,0)
(1,0)           (1,2.414)      (1,2.42)

Table 6.7: Accuracy Analysis for 8 Sample point FFT-2

6.4 LRNS Architecture for DFT and FFT

Figure 6.4 shows a PE for a 1-D DFT computed using a matrix-column vector multiplication architecture. It consists of an LRNS multiplier and an RNS adder. The weighting factor W acts as a time-dependent weighting function, as the index m changes cyclically with each clock pulse. Since W is complex, complex valued results must be calculated in the PE. One complex multiplication can be broken down into four real multiplications and two real additions. By employing LRNS, the four real multiplications become logarithm additions, so one complex multiplication is converted into six additions. The partial product addition is carried out by using RNS addition.
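The count of six additions can be made concrete with a small sketch. The function names are hypothetical, plain floating point replaces both the log/anti-log memories and the RNS adder, and the zero check plays the role of the checker circuit from Section 6.1.2.

```python
import math


def log_multiply(x, y):
    """Real multiplication done as one addition of logarithms."""
    if x == 0 or y == 0:
        return 0.0                                   # log(0) is undefined
    negative = (x < 0) ^ (y < 0)                     # XOR of the sign bits
    magnitude = 2.0 ** (math.log2(abs(x)) + math.log2(abs(y)))
    return -magnitude if negative else magnitude


def complex_multiply(a, b, c, d):
    """(a + jb)(c + jd): four logarithm additions plus two ordinary additions."""
    real = log_multiply(a, c) - log_multiply(b, d)
    imag = log_multiply(a, d) + log_multiply(b, c)
    return real, imag


re_part, im_part = complex_multiply(3, 4, 1, -2)   # (3 + 4j)(1 - 2j) = 11 - 2j
```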

Figure 6.5 shows the LRNS architecture for a radix-4 butterfly element.

The inputs are converted from binary to RNS in the first stage and then to

LNS using the log ROM. Multiplication with twiddle factors is done with the

help of LRNS multipliers. A ROM is used to store the LRNS values of the

twiddle factors. Anti-log ROM is used to convert the results back to RNS.

The binary additions are replaced by RNS additions. The results are finally

converted back to binary.

Figure 6.4: PE of a 1-D DFT array

Figure 6.5: PE of a radix-4 FFT architecture

Chapter 7

LRNS in Time-Frequency Transforms

A traditional approach to the frequency domain analysis of time and space domain signals is the Fourier transform (FT). This transform is reversible, but applicable only to stationary signals. With this transform it is not possible to track the variation of frequency components with time in non-stationary signals such as speech and EEG signals. In the frequency domain no time information is available: it is impossible to tell where in time a given frequency component occurs; we only know that it is present in the signal. Hence what we need is a transform giving a time-frequency representation.

The traditional solution is to use a Short Term (Windowed) Fourier Transform (STFT), which divides the signal into small segments (windows) in which the signal can be assumed to be stationary, and computes the FT within each window.

Such time-frequency transforms are applied in areas like speech and image processing. The Gabor transform (Appendix A) is one such transform and is much better than most of its counterparts for image representation. Its major advantage is that it achieves the lower bound on the joint entropy. It is found that the majority of mammalian visual profiles match this type of representation quite well. The main problem that prevents its widespread usage is the computational complexity involved in finding the Gabor coefficients. The following section presents a fast and efficient method for the computation of the Gabor transformation by changing the number representation.

7.1 The 1-D Discrete Gabor Transform

The discrete Gabor transformation is expressed in matrix notation. The Gabor coefficients are found by multiplying the inverse of the Gabor matrix by the signal vector. The Gabor matrix is decomposed into the product of a sparse constant complex matrix (which has a known inverse) and another sparse matrix which depends only on the window function.

For a finite 1-D signal f(x), x = 0, 1, ..., X−1, X = KM, the complete Gabor transformation [49] is expressed as

$$f(x) = \sum_{m=0}^{K-1} \sum_{r=0}^{M-1} a_{mr}\, g_{mr}(x) \qquad (7.1)$$

which means that a discrete f(x) with KM sample points has KM coefficients $a_{mr}$, m = 0, 1, ..., K−1, r = 0, 1, ..., M−1.

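The Gabor transformation (7.1) written in matrix form is

$$f = G a \qquad (7.2)$$

where

$$f = \begin{bmatrix} f(0) \\ f(1) \\ \vdots \\ f(KM-1) \end{bmatrix} \quad \text{and} \quad a = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_{KM-1} \end{bmatrix} \qquad (7.3)$$

and G is a KM × KM matrix. G can be expressed as

$$G = \begin{bmatrix} G_{00} & G_{01} & \cdots & G_{0,K-1} \\ G_{10} & G_{11} & \cdots & G_{1,K-1} \\ \vdots & \vdots & & \vdots \\ G_{K-1,0} & G_{K-1,1} & \cdots & G_{K-1,K-1} \end{bmatrix} \qquad (7.4)$$

The elements of the G matrix are M × M matrices. Expanding further, (7.2) becomes

$$f = C D \begin{bmatrix} E^* & 0 & \cdots & 0 \\ 0 & E^* & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & E^* \end{bmatrix} a \qquad (7.5)$$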

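where

$$C = \begin{bmatrix} I & 0 & 0 & \cdots & 0 \\ 0 & (-1)^{M-1} I & 0 & \cdots & 0 \\ 0 & 0 & I & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & (-1)^{M-1} I \end{bmatrix} \qquad (7.6)$$

$$D = \begin{bmatrix} D_0 & D_{-1} & D_{-2} & \cdots & D_{-(K-1)} \\ D_1 & D_0 & D_{-1} & \cdots & D_{-(K-2)} \\ D_2 & D_1 & D_0 & \cdots & D_{-(K-3)} \\ \vdots & \vdots & \vdots & & \vdots \\ D_{K-1} & D_{K-2} & D_{K-3} & \cdots & D_0 \end{bmatrix} \qquad (7.7)$$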

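Note that the matrix D is a block-Toeplitz matrix. It follows from (7.5) that

$$a = \frac{1}{M} \begin{bmatrix} E & 0 & \cdots & 0 \\ 0 & E & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & E \end{bmatrix} D^{-1} C f \qquad (7.8)$$

where $D^{-1}$ is the inverse matrix of D, written as

$$D^{-1} = \begin{bmatrix} D^{(0)}_{0} & D^{(0)}_{-1} & D^{(0)}_{-2} & \cdots & D^{(0)}_{-(K-1)} \\ D^{(1)}_{1} & D^{(1)}_{0} & D^{(1)}_{-1} & \cdots & D^{(1)}_{-(K-2)} \\ D^{(2)}_{2} & D^{(2)}_{1} & D^{(2)}_{0} & \cdots & D^{(2)}_{-(K-3)} \\ \vdots & \vdots & \vdots & & \vdots \\ D^{(K-1)}_{K-1} & D^{(K-1)}_{K-2} & D^{(K-1)}_{K-3} & \cdots & D^{(K-1)}_{0} \end{bmatrix} \qquad (7.9)$$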

The computation time of an algorithm for the inversion of an nth order Toeplitz matrix is bounded by $O(n^2)$. However, algorithms for fast inversion of banded Toeplitz matrices by circular decompositions are proposed in [1]. According to [1], the computation time for the inverse of an nth order banded Toeplitz matrix can be reduced to $O(n \log n)$.

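Rewriting equations (7.5) and (7.8), the Gabor coefficients are computed from the following equations (7.10), (7.11) and (7.12), which involve only matrix multiplication.

$$x = C f \qquad (7.10)$$

$$y = D^{-1} x \qquad (7.11)$$

$$a = \frac{1}{M} \begin{bmatrix} E & 0 & \cdots & 0 \\ 0 & E & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & E \end{bmatrix} y \qquad (7.12)$$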

The computational complexity of the Gabor transformation in the binary number system is analyzed at the algorithm level as follows. The computation of x in (7.10) is a matrix column vector multiplication in which the matrix C consists of 0s and ±1s only. Hence no multiplication is involved in this case and this equation can be computed in O(MK) time.

In the computation of y in (7.11), $O(M^2K^2)$ multiplications and $O((M-1)K^2)$ additions are involved, and the time complexity is O(K log K).

In the computation of a in (7.12), the order of multiplications is $O(M^2K)$ and the order of additions is $O((M-1)K)$, and the computation of a can be achieved in O(MK) time.
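The three stages (7.10), (7.11) and (7.12) can be sketched as plain matrix-vector products. Everything named below is an assumption for illustration: the matrices E and $D^{-1}$ are treated as inputs supplied by the caller, C is passed as its diagonal of ±1 entries, and the toy inputs at the end are identity matrices chosen only to exercise the data flow.

```python
def matvec(A, v):
    """Dense matrix-vector product: len(A) * len(v) multiplications."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def block_diag_matvec(E, y, K):
    """Apply diag(E, ..., E) with K copies of an M x M block: M*M*K multiplications."""
    M = len(E)
    out = []
    for k in range(K):
        out.extend(matvec(E, y[k * M:(k + 1) * M]))
    return out


def gabor_coefficients(C_diag, D_inv, E, f, K):
    M = len(E)
    x = [c * v for c, v in zip(C_diag, f)]    # (7.10): sign changes only
    y = matvec(D_inv, x)                      # (7.11): KM x KM product
    return [v / M for v in block_diag_matvec(E, y, K)]   # (7.12)


# Toy data flow check with K = 2, M = 2 and identity matrices (not a real Gabor basis).
K, M = 2, 2
identity = lambda n: [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
coeffs = gabor_coefficients([1.0] * (K * M), identity(K * M), identity(M), [2.0, 4.0, 6.0, 8.0], K)
```

The dense product in the second stage touches all (KM)^2 entries, while the block diagonal stage touches only M^2 K entries, which matches the multiplication counts quoted above.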

The total computational complexity is found by adding the individual computational complexities of each matrix multiplication. For the calculation of all the $a_{mr}$, the overall computational complexity gets multiplied by M/2 (not by M, because half of the coefficients are complex conjugates of the other half). The overall time complexity of the Gabor transformation algorithm is found to be O(MK log MK) (in terms of multiplications), which is comparable to that of a discrete FFT. The computational complexity of the Gabor transformation is thus very high, which restricts its widespread use.

7.2 LRNS in Gabor Transform

Using LRNS in the Gabor transform eliminates all the multiplications involved, as the multiplications become additions. The computational complexity of y is then only an addition complexity of $O(M^2K^2)$. The addition complexity of the final stage becomes $O(M^2K)$. The overall computational complexity becomes approximately $O(M^2K^2)$ additions. Hence the time complexity of the Gabor transformation is reduced from the original multiplicative complexity of O(MK log MK) to the corresponding addition complexity.

The computational flow for finding the Gabor coefficients of the given function f(t) is shown at the functional level, both in BNS and LRNS, in Figure 7.1.

The functional level schematic depicted in Figure 7.1 brings out the superiority of LRNS in reducing the time complexity involved in executing the different arithmetic stages of finding Gabor coefficients. The application of LRNS can also be extended to the multidimensional discrete Gabor transformation, wherein the reductions in computational complexity and time complexity are even more pronounced.

Figure 7.1: Computational flow for finding the Gabor Coefficients using Binary and LRNS

Chapter 8

Mixed number system Arithmetic Processor-MAP

High performance, accuracy and low power are the most important de-

sign parameters of DSP architectures. In DSM based technology, while high

performance can be achieved, power becomes a critical factor, which needs

either a new architecture or even a new number system. The computational

complexity of DSP algorithms leads to high power consumption particularly

in high performance applications. We present an architecture for arithmetic

processor based on a mixed number system [31] to achieve low power and

improved accuracy without sacrificing on performance in the context of DSM

technology. The uniqueness of this architecture is that unlike conventional

architectures, there is support for four different types of number systems.

This architecture is expected to fill the gap between the conventional architectures and the DSM technology, achieving reduced interconnect complexity and low power.

8.1 MAP Architecture

The Architecture for the Mixed number system based Arithmetic Proces-

sor (MAP) is shown in Figure 8.1. This architecture has quite a few special

functional units corresponding to RNS, sign/log and Binary system. These

functional units help achieve very high performance with reduced power con-

sumption for executing computationally intensive DSP algorithms.

The on-MAP memory modules for storing RNS and BNS data are included. The RNS memory could be of an SRAM cache type and the BNS memory of DRAM type. The reason for including an on-chip cache is that the volume of RNS data is bound to be much larger than that of BNS data. This is because the computational complexity in DSP algorithms is dominated by multiplication operations, which are performed using LRNS in the MAP architecture.

The processor is supported by a uni-bus architecture consisting of data, control and address buses. The data bus is capable of handling both RNS and binary data. The RNS data is carried as a set of residue codes, requiring a bus width larger than that of the binary data. The control unit carries out loading of either the RNS residue code or the binary data by appropriately driving the respective bus lines to tri-state.

Figure 8.1: Mixed number system Arithmetic Processor (MAP)

8.1.1 Special Purpose Functional Units

In the majority of signal processing applications, the input to the processor comes from an analog-to-digital converter (ADC). The output of the ADC is converted to residue codes in the BRC unit. This unit performs the modulo division operation. The divider unit performs multiplicative division using a pipelined architecture. Another major functional unit of this processor is the LMU. It receives inputs from the LAU, which is a CAM system. The basic operations performed in the LMU are addition and subtraction. The PMA performs RNS addition and subtraction.

8.1.2 General Purpose Functional Units

Besides these special purpose functional units, a general purpose binary ALU is provided to perform any pre-processing needed in executing particular DSP algorithms, as well as modulo division. The binary ALU does not contain any high speed multiplier or divider unit. After execution of an algorithm, the results that are in RNS form are converted back to BNS in the RBC [14] [44] unit.

Figure 8.2: Special purpose MAP Instruction Format

8.2 Instruction Set

This arithmetic processor includes a powerful instruction set having both

special purpose and general-purpose characteristics. The general-purpose

instructions include the conventional arithmetic, load, store and control in-

structions. The special purpose instructions are used for performing opera-

tions under sign/log, RNS and LRNS.

The list of special purpose instructions is given below and their instruction

format is given in Figure 8.2. The instruction format includes individual

fields corresponding to different number systems. This is bound to make the

decoding logic simple.

RNS Instructions

RAD- RNS Addition

RSU- RNS Subtraction

MOD- Modulo operation

RBC- RNS to BNS Conversion

BRC- Binary to RNS Conversion

Sign/Log Instructions

SMU- sign/log Multiplication

SDIV- sign/log Division

LOG- Calculates logarithm

ALO- Calculates anti-logarithm

LRNS Instructions

LRM- LRNS Multiplication

The instruction set includes several special purpose instructions for faster execution, so complex addressing modes are best avoided. The special purpose instructions above support only the Immediate and Direct addressing modes. A MAP compiler can be designed to map the execution of an algorithm into a balanced set of instructions that carry out the different arithmetic operations in the relevant number systems.

8.3 Execution Flow for MAP Instructions

The data and control flow for the special purpose instructions are given

in Figures 8.3, 8.4 and 8.5. The bold lines represent the control flow and the

ordinary lines represent the data flow.

Figure 8.3: Execution Flow for RAD/RSU Instruction

Figure 8.4: Execution Flow for SLM/SLD Instruction

Figure 8.5: Execution Flow for LRM Instruction

RAD/RSU

Figure 8.3 shows the execution flow for the RAD/RSU instructions. The operands to be added are converted into residues using an appropriate moduli set. The residues are added (in the case of subtraction, complementary addition is performed) and the resultant residue set is converted to binary using the CRT.
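The RAD/RSU flow can be illustrated with a minimal Python sketch. It is not the MAP hardware or its Verilog model: the function names are hypothetical, the moduli set (3,5,7) is taken from the simulations of Section 8.4, and the signed decoding (upper half of the dynamic range read as negative) is an assumed convention.

```python
from math import prod

MODULI = (3, 5, 7)    # moduli set quoted for the simulations in Section 8.4
RANGE = prod(MODULI)  # dynamic range M = 105


def to_residues(x, moduli=MODULI):
    """Binary-to-RNS conversion (the role of the BRC unit)."""
    return tuple(x % m for m in moduli)


def rns_add(ra, rb, moduli=MODULI):
    """RAD: add residue by residue, with no carry between channels."""
    return tuple((a + b) % m for a, b, m in zip(ra, rb, moduli))


def rns_sub(ra, rb, moduli=MODULI):
    """RSU: add the additive inverse (m - b) mod m of each residue."""
    return tuple((a + (m - b) % m) % m for a, b, m in zip(ra, rb, moduli))


def crt(residues, moduli=MODULI):
    """RNS-to-binary conversion by the Chinese Remainder Theorem."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        x += r * partial * pow(partial, -1, m)  # modular inverse, Python 3.8+
    return x % total


def to_signed(x, total=RANGE):
    """Assumed sign convention: the upper half of the range is negative."""
    return x - total if x > total // 2 else x


print(crt(rns_add(to_residues(10), to_residues(5))))             # 10 + 5
print(to_signed(crt(rns_sub(to_residues(3), to_residues(14)))))  # 3 - 14
```

Under these assumptions the two calls print 15 and -11. Each residue channel is processed independently, which is the property that removes carry propagation between channels.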

SLM/SLD

Figure 8.4 shows the execution flow for the SLM/SLD instructions. The logarithms of the operands, read from the Log/Anti-log ROM, are added (SLM) or subtracted (SLD). The anti-log of the result is then read from the Log/Anti-log ROM to obtain the result in binary.
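A rough Python sketch of the sign/log flow follows. It assumes base-2 logarithms and uses floating point `math.log2` and exponentiation in place of the Log/Anti-log ROM, which in hardware would hold fixed-point values. The function names and the boolean sign encoding are hypothetical.

```python
import math


def to_sign_log(x):
    """Encode x as (is_negative, log2|x|); stands in for the Log ROM lookup."""
    if x == 0:
        raise ValueError("zero has no logarithm and needs separate handling")
    return (x < 0, math.log2(abs(x)))


def from_sign_log(negative, log_value):
    """Decode (is_negative, log2|x|); stands in for the Anti-log ROM lookup."""
    magnitude = 2.0 ** log_value
    return -magnitude if negative else magnitude


def sl_mul(a, b):
    """Sign/log multiplication: add the logarithms, XOR the signs."""
    (na, la), (nb, lb) = to_sign_log(a), to_sign_log(b)
    return from_sign_log(na != nb, la + lb)


def sl_div(a, b):
    """Sign/log division: subtract the logarithms, XOR the signs."""
    (na, la), (nb, lb) = to_sign_log(a), to_sign_log(b)
    return from_sign_log(na != nb, la - lb)


print(sl_mul(6, -3))   # close to -18, up to floating point rounding
print(sl_div(-8, -4))  # both logarithms are exact here, so this prints 2.0
```

The multiplier and divider are replaced by one adder/subtractor plus table lookups. The price is the rounding introduced by finite-precision logarithms, visible above whenever the operands are not powers of two.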

LRM

Figure 8.5 shows the execution flow for the LRM instruction. The operands to be multiplied are converted into residues using an appropriate moduli set. The logarithms of the corresponding residues, read from the Log/Anti-log ROM, are added. The anti-log of the resultant residue set is read from the Log/Anti-log ROM and is finally converted to binary using the CRT.
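One way to picture the LRM flow in software is the sketch below. It makes several assumptions that the text above does not confirm: the logarithm of a residue is taken to be its discrete logarithm (index) with respect to a primitive root of each prime modulus, zero residues are handled by a separate check, and negative results are read from the upper half of the dynamic range. All names are hypothetical, and the ROM contents of the actual design may differ.

```python
from math import prod

MODULI = (3, 5, 7)  # prime moduli, as used in the simulations of Section 8.4


def primitive_root(p):
    """Smallest g whose powers generate every nonzero residue mod p."""
    for g in range(2, p):
        if len({pow(g, k, p) for k in range(p - 1)}) == p - 1:
            return g
    raise ValueError("no primitive root found")


def build_tables(p):
    """Return (log, antilog) tables playing the role of the Log/Anti-log ROM."""
    g = primitive_root(p)
    antilog = [pow(g, k, p) for k in range(p - 1)]
    log = {value: k for k, value in enumerate(antilog)}
    return log, antilog


TABLES = {p: build_tables(p) for p in MODULI}


def residue_mul(a, b, p):
    """Multiply two residues mod p by adding their indices mod (p - 1)."""
    if a == 0 or b == 0:  # zero has no index: assumed special case
        return 0
    log, antilog = TABLES[p]
    return antilog[(log[a] + log[b]) % (p - 1)]


def crt(residues, moduli=MODULI):
    """RNS-to-binary conversion by the Chinese Remainder Theorem."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        x += r * partial * pow(partial, -1, m)
    return x % total


def lrm(x, y):
    """LRNS multiplication of two small signed integers."""
    residues = tuple(residue_mul(x % p, y % p, p) for p in MODULI)
    value, total = crt(residues), prod(MODULI)
    return value - total if value > total // 2 else value


print(lrm(6, -3))
```

With these assumptions `lrm(6, -3)` prints -18. The only arithmetic performed per channel is one small modular addition of indices, which is the motivation for combining the two number systems.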

8.4 Verilog Simulation of MAP Instruction Set

To verify the architectural behavior of the MAP Processor, the execution

of selected arithmetic operations in RNS, LNS and LRNS is simulated in

Verilog and the data flow timing diagrams are provided.

RAD

The timing simulation for the RAD instruction is shown in Figure 8.6. The signals a and b are the operands, which take the values 10 and 5 respectively. The signals m1, m2 and m3 represent the moduli set used, (3,5,7). The output signal is ad1, which has the value 15, and f is the sign flag of the output, which goes high to indicate a positive sign. The output occurs after 2 ns.

Figure 8.6: Timing diagram for RAD Instruction

RSU

The timing simulation for the RSU instruction is shown in Figure 8.7. The signals a and b are the operands, which take the values 3 and -14 respectively. The signals m1, m2 and m3 represent the moduli set used, (3,5,7). The output signal is d1, which has the value 11, and f is the sign flag of the output, which goes low to indicate a negative sign. The output occurs after 2 ns.

LRM

The timing simulation for the LRM instruction is shown in Figure 8.8. The signals a and b are the operands, which take the values 6 and -3 respectively. The signals m1, m2 and m3 represent the moduli set used, (3,5,7). The output signal is d1, which has the value -18, and f is the sign flag of the output, which goes low to indicate a negative sign. The output occurs after 4 ns.

Figure 8.7: Timing diagram for RSU Instruction

Figure 8.8: Timing diagram for LRM Instruction

SLD

The timing simulation for the SLD instruction is shown in Figure 8.9. The signals a1 and b1 are the operands, which take the values -8 and -4 respectively. The signals y1 and y2 indicate the sign of the operands a1 and b1 respectively. Both y1 and y2 take the value 0, as both a1 and b1 are negative. The signals m1, m2 and m3 represent the moduli set used, (3,5,7). The output signal is c1, which has the value -18, and f is the sign flag of the output, which goes high to indicate a positive sign. The output occurs after 2 ns.

Figure 8.9: Timing diagram for SLD Instruction


Chapter 9

Future Work

The LRNS mixed number representation presented in this thesis can find

wide applications in image and digital signal processing.

9.1 Reconfigurable FFT Architecture for different Radices

Depending on the application and the number of sample points, FFT architectures based on different radices may be needed. Designing an FFT architecture in which the different radices can be brought in by proper reconfiguration of the hardware is a difficult task. Such a design will involve enormous hardware resources and hence will consume more power. The application of LRNS in such a complex reconfigurable FFT architecture will be of great importance: the overhead involved in reconfiguration can be largely compensated by the introduction of LRNS, which eliminates the multiplier units.


The reconfiguration across different radices can be achieved by employing a suitable multi-stage interconnection network (MIN) such as Delta, Omega, Banyan or Clos. The efficiency of interconnection for FFT reconfiguration across different radices will have to be investigated for these MIN structures. A possible functional level architecture for this radix-reconfigurable FFT is shown in Figure 9.1. It has alternating stages of butterflies, corresponding to different radices, and MIN structures. The different butterfly structures from the input stages to the output stages can be connected according to the reconfiguration strategy and the capability of the MIN structures. A detailed investigation of the design and simulation of this architecture will be taken up at a later stage.

Figure 9.1: Reconfigurable FFT Architecture

9.2 Low Power High Performance LRNS based Convolver Design

The convolution of two sequences x(n) and w(n) in LRNS is defined as

\[ y(n) = \sum_{k=-\infty}^{\infty} \mathrm{ALOG}\big(x_{LRNS}(k) + w_{LRNS}(n-k)\big) \tag{9.1} \]

Suppose x(n) and w(n) are causal sequences, each of finite length N, i.e., n = 0, 1, 2, ..., N − 1. The (linear) convolution of these two sequences is then a causal sequence, computed in LRNS as

\[ y(n) = \sum_{k=0}^{N-1} \mathrm{ALOG}\big(x_{LRNS}(k) + w_{LRNS}(n-k)\big) \tag{9.2} \]

where n = 0, 1, 2, ..., 2N − 2.

By using LRNS, the computational complexity is reduced to that of 2N − 1 additions. In the binary case it is N multiplications and N − 1 additions [23].
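The structure of the LRNS convolution sum can be sketched for a single residue channel. The sketch assumes a prime modulus P = 7 with primitive root G = 3, treats the LRNS code of a nonzero residue as its discrete logarithm, and handles zero samples by a separate check. These choices and all names are illustrative, not taken from the thesis.

```python
P = 7  # assumed prime modulus of one residue channel
G = 3  # assumed primitive root of P

ANTILOG = [pow(G, k, P) for k in range(P - 1)]           # index -> residue
LOG = {value: k for k, value in enumerate(ANTILOG)}      # residue -> index


def alog_of_sum(a, b):
    """ALOG(log a + log b): the product a*b mod P without a multiplier."""
    if a == 0 or b == 0:  # zero has no logarithm: assumed special case
        return 0
    return ANTILOG[(LOG[a] + LOG[b]) % (P - 1)]


def lrns_convolve_channel(x, w):
    """Linear convolution mod P with every product done in the log domain."""
    y = [0] * (len(x) + len(w) - 1)
    for n in range(len(y)):
        for k in range(len(x)):
            if 0 <= n - k < len(w):
                y[n] = (y[n] + alog_of_sum(x[k] % P, w[n - k] % P)) % P
    return y


def direct_convolve_mod(x, w):
    """Reference: ordinary multiply-accumulate convolution, reduced mod P."""
    y = [0] * (len(x) + len(w) - 1)
    for i, xi in enumerate(x):
        for j, wj in enumerate(w):
            y[i + j] = (y[i + j] + xi * wj) % P
    return y


x, w = [1, 2, 3], [4, 5, 6]
print(lrns_convolve_channel(x, w) == direct_convolve_mod(x, w))
```

The comparison prints True for this input. A full LRNS convolver would run one such channel per modulus and recombine the channel outputs with the CRT.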

The 2-D convolution formula in LRNS is given as follows:

\[ y(n_1, n_2) = \sum_{k_1=0}^{n_1} \sum_{k_2=0}^{n_2} \mathrm{ALOG}\big(x_{LRNS}(k_1, k_2) + w_{LRNS}(n_1 - k_1, n_2 - k_2)\big) \tag{9.3} \]

where $n_1, n_2 \in \{0, 1, \ldots, 2N-2\}$.

The number of computations in 2-D convolution is usually very large for binary. By employing LRNS the number of computations is reduced to $(2N-1)^2$ additions alone.

9.2.1 LRNS Based Convolver Architecture

The systolic architecture for 1-D convolution employing LRNS is shown in Figure 9.2. The hardware delay and power analysis of the LRNS based convolver architecture and a comparative study against the binary based architecture will be taken up.

Figure 9.2: LRNS Based 1-D Convolver Architecture

Chapter 10

Conclusion

The thesis deals with interconnect dominance in FFT architectures and shows the need for a shift in the design paradigm from device dominated design to interconnect dominated design methodologies.

A mixed number representation called LRNS is developed by embedding the sign/log number system into the RNS. Major advantages of the mixed number representation over the binary, residue and sign/log number systems are demonstrated. Based on this mixed number system, an arithmetic processor (MAP) has been designed. The MAP instruction set, comprising general purpose and special purpose instructions, has been presented.

MAP is expected to fill the gap between conventional architectures and DSM technology, achieving reduced interconnect complexity and low power without sacrificing performance. LRNS can be widely applied in embedded DSP systems to reduce the computational complexity of multiplication and the corresponding power. A major application could be the development of a high performance, low power multidimensional convolver, which has several important applications in DSP and image processing.

Appendix A

The Generalized Gabor Transform

For a given function $f(t)$, $t \in \mathbb{R}$, the generalized Gabor transform [20] finds a set of coefficients $a_{mr}$ such that

\[ f(t) = \sum_{m=-\infty}^{\infty} \sum_{r=-\infty}^{\infty} a_{mr}\, g(t - mT) \exp\left(i\,\frac{2\pi r t}{T'}\right) \tag{A.1} \]

where $T, T' > 0$. In some of the literature,

\[ \Omega = \frac{2\pi}{T'} \tag{A.2} \]

is also used in the exponential term in (A.1). When $T\Omega = 2\pi$, or $T = T'$, (A.1) becomes the original Gabor transform proposed by D. Gabor [8], which is also called the critical sampling Gabor transform. When $T\Omega < 2\pi$, or $T < T'$, (A.1) is called the over sampling Gabor transform.

The 1-D generalized Gabor elementary functions (GEF) are defined as

\[ g_{mr}(t) = g(t - mT) \exp\left(i\,\frac{2\pi r t}{T'}\right) \tag{A.3} \]

where $T, T' > 0$ as seen above, and $g(t) \in L^2(\mathbb{R})$. For a given function $f(t)$, $t \in \mathbb{R}$, the 1-D generalized Gabor transform finds a linear expansion against the above set of Gabor elementary functions:

\[ f(t) = \sum_{m=-\infty}^{\infty} \sum_{r=-\infty}^{\infty} a_{mr}\, g_{mr}(t) \tag{A.4} \]

The coefficients $a_{mr}$ are called the Gabor coefficients of the function $f(t)$. One well known method for computing the Gabor coefficients of an arbitrary function $f(t)$ is to introduce an auxiliary biorthogonal function $\gamma(t)$, depending on $g(t)$, such that the Gabor coefficients can be expressed as [26]

\[ a_{mr} = \int_{-\infty}^{\infty} f(t)\, \gamma^{*}(t - mT) \exp\left(-i\,\frac{2\pi r t}{T'}\right) dt \tag{A.5} \]

A.1 1-D Discrete Gabor Transformation

For a finite 1-D signal $f(x)$, $x = 0, 1, \ldots, X-1$, $X = KM$, the complete Gabor transformation [49] is expressed as

\[ f(x) = \sum_{m=0}^{K-1} \sum_{r=0}^{M-1} a_{mr}\, g_{mr}(x) \tag{A.6} \]

That is, a discrete $f(x)$ with $KM$ sample points has $KM$ coefficients $a_{mr}$, $m = 0, 1, \ldots, K-1$, $r = 0, 1, \ldots, M-1$.

For any integer $M > 0$, if $g$ is a real function,

\[ g_{mr}(x) = g^{*}_{m,M-r-1}(x) \tag{A.7} \]

and if both $f$ and $g$ are real functions,

\[ a_{mr} = a^{*}_{m,M-r-1} \tag{A.8} \]

We first introduce two $M \times M$ matrices $E$ and $E^{*}$. For $u, v = 0, 1, \ldots, M-1$, element $(u,v)$ of $E$ is defined as

\[ e_{uv} = \exp\left(-i\,\frac{2\pi h(u) h(v)}{M}\right) \tag{A.9} \]

$E^{*}$ is the conjugate of $E$. That is, element $(u,v)$ of $E^{*}$ is defined as

\[ e^{*}_{uv} = \exp\left(i\,\frac{2\pi h(u) h(v)}{M}\right) \tag{A.10} \]

It is easy to verify that EE∗ = E∗E = MI, where I is a unit matrix.
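The identity can be checked numerically with a small Python sketch. The index function h(u) is not defined in this appendix, so two candidate forms, h(u) = u and h(u) = u + 1/2, are tried purely as assumptions. The helper names are hypothetical.

```python
import cmath


def build_E(M, h):
    """Matrix E of (A.9): e_uv = exp(-i 2 pi h(u) h(v) / M)."""
    return [[cmath.exp(-2j * cmath.pi * h(u) * h(v) / M) for v in range(M)]
            for u in range(M)]


def conjugate(A):
    """Elementwise complex conjugate, giving E* of (A.10)."""
    return [[z.conjugate() for z in row] for row in A]


def matmul(A, B):
    """Plain square matrix product."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def is_scaled_identity(P, scale, tol=1e-9):
    """True if P equals scale * I within the tolerance."""
    n = len(P)
    return all(abs(P[i][j] - (scale if i == j else 0)) < tol
               for i in range(n) for j in range(n))


M = 6
for h in (lambda u: u, lambda u: u + 0.5):  # both forms of h are assumptions
    E = build_E(M, h)
    E_star = conjugate(E)
    print(is_scaled_identity(matmul(E, E_star), M),
          is_scaled_identity(matmul(E_star, E), M))
```

Both assumed forms of h satisfy the identity, because the off-diagonal sums reduce to sums of the M-th roots of unity, which vanish.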

We can rewrite the Gabor transformation (A.6) as

\[ f(x) = \sum_{l=0}^{KM-1} a_{l}\, g_{l}(x) \tag{A.11} \]

where $a_l$ corresponds to the original $a_{mr}$ with $m = \lfloor l/M \rfloor$ and $r = l \bmod M$, and $g_l(x)$ corresponds in the same way to $g_{mr}(x)$. The Gabor transformation (A.11) written in matrix form is

\[ \mathbf{f} = G\,\mathbf{a} \tag{A.12} \]

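where

\[ \mathbf{f} = \begin{pmatrix} f(0) \\ f(1) \\ \vdots \\ f(KM-1) \end{pmatrix} \quad \text{and} \quad \mathbf{a} = \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_{KM-1} \end{pmatrix} \tag{A.13} \]

and $G$ is a $KM \times KM$ matrix. $G$ can be expressed as

\[ G = \begin{pmatrix} G_{00} & G_{01} & \cdots & G_{0,K-1} \\ G_{10} & G_{11} & \cdots & G_{1,K-1} \\ \vdots & \vdots & & \vdots \\ G_{K-1,0} & G_{K-1,1} & \cdots & G_{K-1,K-1} \end{pmatrix} \tag{A.14} \]

where each $G_{pq}$, $p, q = 0, 1, \ldots, K-1$, is an $M \times M$ matrix.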

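Let $g^{(pq)}_{uv}$ be the element $(u,v)$ of matrix $G_{pq}$. Then

\[ g^{(pq)}_{uv} = g_{qv}(u + pM) = (-1)^{p(M-1)}\, g\big(h(u) + (p-q)M\big) \exp\left(i\,\frac{2\pi h(v) h(u)}{M}\right) \tag{A.15} \]

where $u, v = 0, 1, \ldots, M-1$. We can see that

\[ G_{pq} = (-1)^{p(M-1)} D_{p-q} E^{*} \tag{A.16} \]

where $D_{p-q}$ is an $M \times M$ diagonal matrix with the $u$th diagonal element being

\[ d^{(p-q)}_{uu} = g\big(h(u) + (p-q)M\big) \tag{A.17} \]

where $u = 0, 1, \ldots, M-1$. Thus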

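\[ \mathbf{f} = C D \begin{pmatrix} E^{*} & 0 & \cdots & 0 \\ 0 & E^{*} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & E^{*} \end{pmatrix} \mathbf{a} \tag{A.18} \]

where

\[ C = \begin{pmatrix} I & 0 & 0 & \cdots & 0 \\ 0 & (-1)^{M-1} I & 0 & \cdots & 0 \\ 0 & 0 & I & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & (-1)^{M-1} I \end{pmatrix} \tag{A.19} \]

\[ D = \begin{pmatrix} D_{0} & D_{-1} & D_{-2} & \cdots & D_{-(K-1)} \\ D_{1} & D_{0} & D_{-1} & \cdots & D_{-(K-2)} \\ D_{2} & D_{1} & D_{0} & \cdots & D_{-(K-3)} \\ \vdots & \vdots & \vdots & & \vdots \\ D_{K-1} & D_{K-2} & D_{K-3} & \cdots & D_{0} \end{pmatrix} \tag{A.20} \]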

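The following is derived from (A.18) (note that $C = C^{-1}$ and that $EE^{*} = MI$ gives $(E^{*})^{-1} = E/M$):

\[ \mathbf{a} = \frac{1}{M} \begin{pmatrix} E & 0 & \cdots & 0 \\ 0 & E & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & E \end{pmatrix} D^{-1} C\, \mathbf{f} \tag{A.21} \]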

$D^{-1}$ is the inverse of $D$. Since most of the elements in $D$ are zeros, it is relatively easy to find the inverse $D^{-1}$. If $g$ is a real function, $D$ is a real matrix, which makes the computation even easier. $D^{-1}$ can be written in a similar format:

\[ D^{-1} = \begin{pmatrix} D^{(0)}_{0} & D^{(0)}_{-1} & D^{(0)}_{-2} & \cdots & D^{(0)}_{-(K-1)} \\ D^{(1)}_{1} & D^{(1)}_{0} & D^{(1)}_{-1} & \cdots & D^{(1)}_{-(K-2)} \\ D^{(2)}_{2} & D^{(2)}_{1} & D^{(2)}_{0} & \cdots & D^{(2)}_{-(K-3)} \\ \vdots & \vdots & \vdots & & \vdots \\ D^{(K-1)}_{K-1} & D^{(K-1)}_{K-2} & D^{(K-1)}_{K-3} & \cdots & D^{(K-1)}_{0} \end{pmatrix} \tag{A.22} \]

where each $D^{(p)}_{p-q}$, $p, q = 0, 1, \ldots, K-1$, is also an $M \times M$ diagonal matrix.

Since $D$ is a block Toeplitz matrix, it can be inverted with any Toeplitz matrix inversion algorithm. Fast algorithms for finite Toeplitz matrix inversion are suggested in [45] [41] [1].

The fast Gabor transformation can be viewed in two ways. First, since the matrices $D$, $E$ and $C$ are independent of the input signal $\mathbf{f}$, these matrices and their products can be precalculated for any fixed window function $g(t)$ and fixed values of $K$ and $M$. The Gabor transformation then becomes the product of a constant matrix and the input signal $\mathbf{f}$. Second, we can rewrite (A.18) and (A.21) as

\[ \mathbf{x} = C\,\mathbf{f} \tag{A.23} \]

\[ \mathbf{y} = D^{-1}\mathbf{x} \tag{A.24} \]

\[ \mathbf{a} = \frac{1}{M} \begin{pmatrix} E & 0 & \cdots & 0 \\ 0 & E & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & E \end{pmatrix} \mathbf{y} \tag{A.25} \]

The computation and time complexity of the above equations are discussed in Chapter 7 for both binary and LRNS. (A.21) can be rewritten as

\[ \mathbf{a}_m = \frac{1}{M} E \sum_{k=0}^{K-1} (-1)^{k(M-1)} D^{(m)}_{m-k}\, \mathbf{f}_k \tag{A.26} \]

where

\[ \mathbf{f}_m = \begin{pmatrix} f(0 + mM) \\ f(1 + mM) \\ \vdots \\ f(M-1+mM) \end{pmatrix} \quad \text{and} \quad \mathbf{a}_m = \begin{pmatrix} a_{m0} \\ a_{m1} \\ \vdots \\ a_{m,M-1} \end{pmatrix} \tag{A.27} \]

If both $f$ and $g$ are real functions, $E$ is the only complex matrix in (A.26), so the real and imaginary parts of $a_{mr}$ can be calculated separately. In this case we can easily verify that, for $m = 0, 1, \ldots, K-1$ and $r = 0, 1, \ldots, M-1$,

\[ a_{m,M-r-1} = a^{*}_{mr} \tag{A.28} \]

This means that half of the coefficients are complex conjugates of the other half, so only half of them need to be computed.

Appendix B

CAM-Content Addressable Memories

A CAM (Content-Addressable Memory) is an advanced memory device that has many applications. It is highly advantageous in applications that require fast searches of a database, list or pattern, such as image and voice recognition, or computer and communication networks. CAMs obtain an order-of-magnitude reduction in search time over other memory search algorithms, such as binary or tree-based searches, by simultaneously comparing the desired information against the entire list of stored values.

The working of a CAM is better understood by comparing it with a RAM. RAM is an acronym for Random Access Memory, which emphasizes the ability to examine each stored piece of data independently of any other. Data is stored at a particular location, called an address. In a RAM, the address is supplied and the data at that location is retrieved. The depth of the memory, or number of locations, is limited by the ability to address the memory.

For example, if the address bus is eight bits wide, only 256 memory locations can be addressed, since in binary 2^8 = 256. Binary logic is used because signal lines normally have only two states, HIGH and LOW.

Figure B.1: Typical CMOS SRAM Memory Cell

RAM chips are composed of arrays of cells of transistors. Each cell repre-

sents one bit and contains one or more transistors depending on the type of

RAM. CMOS Static RAMs commonly use six transistors per cell, as shown

in figure B.1; four are cross-coupled to store the state of the bit, and two are

used to alter or read out the state of the bit.


This configuration is called Static because the state of the bit remains

at one level or the other until deliberately changed, or power is removed.

Dynamic RAMs, on the other hand, get their name from the transient nature

of their storage mechanism, which commonly consists of a single transistor

along with a capacitor to store the bit information.

During a read, the charge on the capacitor is drained to the bit line, re-

quiring a rewrite of the bit, called a restore operation. Additionally, because

the DRAM capacitor is not perfect, it loses charge over time, and needs to

have its charge refreshed at regular intervals. Thus, dynamic memories are

accompanied by controller circuits to rewrite the bit and refresh the stored

charge on a regular basis.

Neither SRAMs nor DRAMs retain information when power is removed,

but SRAMs are often used to store important configuration information, with

battery back-up as SRAM does not require refreshing.

CAMs are organized differently. In a CAM, data is stored in locations in a somewhat random fashion. The locations can be selected by an address bus, or the data can be written directly into the first empty location, because every location has a pair of special status bits. These bits keep track of whether the location holds valid information or is empty and available for overwriting. The data stored in memory is located by comparing every bit in memory with data placed in a special Comparand register. If there is a perfect match for every bit that is compared, a Match Line is asserted to indicate that the data in the Comparand register is found in memory.

A priority encoder is used to retrieve the address of the matching location

that has highest priority. Thus, with a CAM, the data is supplied and the

address is retrieved. As the CAM doesn’t need address lines to find data, the

depth of a memory system using CAMs can be extended as far as desired,

but the width (wordsize) is limited by the size of the chip. The depth can be

easily extended as the addressing is all self-contained. Extending the width

takes additional routines due to the difficulty in extending 1024 match lines

from chip to chip.

CAMs are based on memory cells that have been modified by the addition

of extra transistors that compare the state of the bit stored with the state

stored in a Comparand register. Logically, CAMs perform an exclusive-NOR

function, so that a match is only indicated if both the stored bit and the

corresponding Comparand bit are in the same state.

The static CAM cell shown in Figure B.2 is composed of a six-transistor SRAM memory cell plus four transistors to accomplish the exclusive-NOR function and match line driving operation.

Figure B.2: Typical CAM memory cell

For writing and reading, each static CAM cell acts like a normal SRAM cell, with differential bit lines to latch the value into the cell when writing, and sense amplifiers to detect the stored value when reading. When writing, the word line is energized, turning on the pass transistors, which then force the cross-coupled transistors to the levels on the bit lines. When the word line is de-energized, the cross-coupled transistors remain in the same states.

For reading, the bit lines are precharged to the same intermediate voltage

level, the word line is energized, and the bit lines are forced to the levels

stored by the cross-coupled transistors. The sense amplifiers respond to the

difference in the bit lines and report the stored state.

For comparing, the match line is precharged to a high level, the bit lines

are driven by the levels of the bit stored in the Comparand register, but the

word line is not energized, so the state of the cross-coupled transistors is not

affected. The exclusive-NOR transistors compare the internally stored state

of the cross-coupled transistors with the levels of the Comparand bit, and if

they don’t agree, the Match line is pulled down, indicating a non-matching

bit. All the bits in a stored entry are connected to the same Match line, so

that if any bit in a word doesn’t match with its corresponding Comparand

bit, that Match line is pulled down. Only the entries where the Match line

stays HIGH are considered matches. All the Match lines are fed to a Priority

encoder that determines whether any match exists, whether more than one

match exists, and which matching location is considered the highest priority.
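The search behaviour described above can be mimicked with a small behavioural model in Python. It is only a sketch: the class and method names are hypothetical, a single valid flag stands in for the pair of status bits, and the lowest matching address is assumed to have the highest priority.

```python
class CAM:
    """Behavioural model of a CAM: data in, address out."""

    def __init__(self, depth, width):
        self.width = width
        self.mask = (1 << width) - 1
        self.words = [0] * depth
        self.valid = [False] * depth  # simplified stand-in for the status bits

    def write(self, data):
        """Store data in the first empty location and return its address."""
        for address, used in enumerate(self.valid):
            if not used:
                self.words[address] = data & self.mask
                self.valid[address] = True
                return address
        raise MemoryError("CAM is full")

    def match_lines(self, comparand):
        """One match line per entry; it stays high only if every bit agrees."""
        lines = []
        for word, used in zip(self.words, self.valid):
            xnor = ~(word ^ comparand) & self.mask  # 1 where the bits agree
            lines.append(used and xnor == self.mask)
        return lines

    def search(self, comparand):
        """Priority encoder: (highest-priority address or None, multiple?)."""
        hits = [a for a, high in enumerate(self.match_lines(comparand)) if high]
        return (hits[0] if hits else None, len(hits) > 1)


cam = CAM(depth=4, width=8)
for value in (0x3C, 0xA5, 0x3C):
    cam.write(value)
print(cam.search(0x3C))  # two entries hold 0x3C
print(cam.search(0xFF))  # no entry holds 0xFF
```

With the three writes shown, the searches print (0, True) and (None, False). In hardware all match lines are evaluated simultaneously, whereas this model loops over the entries, so it reproduces the function but not the speed advantage.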

Although a CAM is expensive and consumes high power, it is employed wherever high performance is an important criterion.


Bibliography

[1] A.K.Jain, Fast inversion of banded toeplitz matrices by circular decom-

position, IEEE Trans. Acoust., Speech, Signal Processing ASSP-26.

[2] Semiconductor Industry Association, International technology roadmap

for semiconductors: 1999, Austin, TX: SEMATECH (1999).

[3] H. Bakoglu, Circuits, interconnections, and packaging for vlsi, Reading,

MA: Addison-Wesley, 1990.

[4] C.R. Baugh and B.A. Wooley, A two’s complement parallel array multiplier, IEEE Trans. Comput. C-22.

[5] G.A. Jullien Ben-Dau Tseng and William C.Miller, Implementation of

fft structures using the residual number system, IEEE Trans. Comput.

C-28.

[6] T. S. Chang and C.W. Jen, Hardware efficient transform designs with

cyclic convolution and subexpression sharing, Proc. ISCAS (1998), 398–

401.


[7] Wai-Kai Chen, The vlsi handbook, CRC Press LLC, Florida, 2000.

[8] D.Gabor, Theory of communications, J.Inst.Elec.Engr. 93.

[9] E.E.Swartzlander and A.G.Alexopolous, The sign/logarithm number

system, IEEE Trans. Comput. C-24.

[10] C.J. Anderson et al., Physical design of a fourth-generation power ghz

microprocessor, Proceedings of the IEEE International Solid-State Cir-

cuits Conference (2001), 232–233.

[11] P. E. Gronowski et al., High-performance microprocessor design, IEEE

Journal of Solid-State Circuits 33.

[12] A. Skavantzos F.J. Taylor, G. Papadourakis and A. Stouratis, A radix-4

fft using complex rns arithmetic, IEEE Trans. Comput. C-34.

[13] M.A. Franklin, Vlsi performance of banyan and crossbar communication

networks, IEEE Trans. Comput. C-30.

[14] M. Re G. C. Cardarilli and R. Lojacono, Rns-to-binary conversion for

efficient vlsi implementation, IEEE Trans. Circuit Syst 47.

[15] Harvey L. Garner, The residue number system, IRE Transactions on Electronic Computers (1959), 140–147.

[16] I. Gertner and M. Shamash, Vlsi architectures for multidimensional

fourier transform processing, IEEE Trans. Comput. C-36.

[17] John L Hennessy and David Patterson, Computer architecture-a quan-

titative approach, Morgan Kaufmann Publishers, Inc., California, 2000.

[18] T. Aboulnasr J. A. Beraldin and W. Steenart, Efficient one-dimensional

systolic array realization of discrete Fourier transform, IEEE Trans. on

Circuits and Systems 36(1) (1989), 95–100.

[19] C.M. Liu J. I. Guo and C.W.Jen, The efficient memory-based vlsi array

designs for DFT and DCT, IEEE Trans. on Circuits and Systems Part

II, 39(10) (1992), 723–733.

[20] Patrick Krolak Jie Yao and Charlie Steele, The generalized gabor trans-

form, IEEE Trans. Image Processing 4.

[21] J.W.Cooley and J.W.Tukey, An algorithm for the machine calculation

of complex Fourier series, Math. Comput. 19.

[22] K. Kocsis, A fully pipelined high speed DFT architecture, Proc. ICASSP

(1991), 1569–1572.

[23] S.Y Kung, VLSI array processors, Prentice Hall, New Jersey, 1988.


[24] L. Hartimo L. Wang and T.Laakso, A novel double decomposition method

for systolic implementation of DFT, Proc. ISCAS (1992), 1085–1088.

[25] H.S. Lim and Jr. E.E. Swartzlander, A systolic array for 2-D DFT and

2-DDCT, Proc. Int. Conf. Application-Specific Array Processors (1994),

123–131.

[26] M.Bastiaans, Gabor’s expansion of a signal into gaussian elementary

signals, Opt.Eng. 20.

[27] Mahesh Mehendale and Sunil D.Shrelekar, Vlsi synthesis of dsp kernels

algorithmic and architectural transformations, Kluwer Academic Pub-

lishers, Boston, 2001.

[28] J. Meindl, Low power microelectronics: Retrospective and prospect, Proc.

IEEE 83.

[29] J.N. Mitchell, Computer multiplication and division using binary loga-

rithms, IRE Transactions on Electronic Computers EC-11.

[30] N. Venkateswaran, S. Praveen, R. Subramanian and Vasanth Ramesan,

Emerging impact of DSM technology of DFT and FFT architectures,

selected for publication at International Signal Processing Conference,

Dallas, USA (2003).


[31] N. Venkateswaran, Vasanth Ramesan, R. Subramanian and S. Praveen,

A mixed number system based low power, high performance arithmetic

processor for DSP applications, selected for publication at International

Signal Processing Conference, Dallas, USA (2003).

[32] Peter Pirsch, Architectures for digital signal processing, John Wiley and

Sons, New York, 1998.

[33] John G. Proakis and Dimitriz G. Manolakis, Digital signal processing:

Principles, Algorithms and Applications, Prentice Hall of India, New

Delhi, 1997.

[34] H. T. Kung R. P. Brent, A regular layout for parallel adders, IEEE

Trans. Comput. C-31.

[35] T.D. Roziner and M.G. Karpovsky, Multidimensional fourier transform

by systolic architectures, J. VLSI Signal Process. 4.

[36] I. Sedukhin S. Peng and S. Sedukhin, Design of array processors for 2-d

discrete Fourier transform, IEICE Trans. Inform. Syst. E80-D.

[37] H.S. Stone, Parallel processing with perfect shuffle, IEEE Trans. Com-

put. C-20.

[38] M. Sundaramurthy and V. Umapathy Reddy, Some results in fixed point

fast Fourier transform error analysis, IEEE Trans. Comput. C-26.


[39] Dennis Sylvester and Kurt Keutzer, A global wiring paradigm for deep

submicron design, IEEE Transactions on Computer Aided Design of

Integrated Circuits and Systems 19.

[40] N.S. Szabo and R.I Tanaka, Residue arithmetic and its applications to

computer arithmetic, Mc Graw-Hill, New York, 1967.

[41] S.Zohar, Toeplitz matrix inversion: The algorithm of w.f.trench,

J.Assoc.Comput.Mach. 16.

[42] John C Becker Thomas A. Brubaker, Multiplication using logarithms

implemented with read- only memory, IEEE Trans. Comput. C-24.

[43] C. S. Wallace, A suggestion for a fast multiplier, IEEE Trans. Elec.

Comput. EC-13.

[44] M. O. Ahmad Wei Wang, M. N. S. Swamy and Yuke Wang, A high-speed

residue-to-binary converter for three-moduli (2k, 2k−1, 2k−1 − 1) rns and

a scheme for its vlsi implementation, IEEE Trans. Circuit Syst 77.

[45] W.F.Trench, An algorithm for the inversion of finite toeplitz matrices,

J.SIAM 12.

[46] S. A. White, Applications of distributed arithmetic to digital signal pro-

cessing: a tutorial review, IEEE ASSP Magazine 6 (1989), 4–19.


[47] E. G. Friedman Y. I. Ismail and J. L. Neves, Exploiting on-chip induc-

tance in high speed clock distribution networks, IEEE Transactions on

Very Large Scale Integration (VLSI) Systems 9.

[48] , Figure of merit to characterize the importance of on-chip in-

ductance, IEEE Transactions on Very Large Scale Integration (VLSI)

Systems 7.

[49] Jie Yao, Complete gabor transformation for signal representation, IEEE

Trans. Image Processing 2.

[50] Sungwook Yu and Jr. Earl.E.Swartzlander, A pipelined architecture for

the multidimensional dft, IEEE Trans. Signal Processing 9.

