Post on 25-Apr-2020
FO DGITAL SIGNAL PPOCESSIV,
Approved or Tayoreae
Distrbutin UnlmIte
Flo
APONIIIR I~fUEFR LULSGA UPP~C~Sl 1F
VME m-W:
-o l_ --
,-
-1 -- _
J.-. L..,
BestAvai~lable
copy
~t~u~sv CL %1,V~A~i1, 409 w
A tEl W LTtIPI.?I. !,TRUC-TIIRE FOR DIGITAL. Sr'iJAI. YNLf~HOC;IN F~r I tic p4 vr~ oy'o mat e4fe d..
I~7 'Ir 010ft------. COaN T*1ATt 0 tR GA NT N
Dr.' Fr'md J./Tny Ior F)6?7-f'.,
I Pl[U0400M*4C 0OA6411ATION NAME AND ADDRESS10 PROGR ME'? A,~,~AEA 4 WORK UNIT NUW0 AIS
U~niversity of Cincinnati VDepartment of Electrical and Computer Engineeri --61102F /7Cincinnati, OH! 45221 4 23041A6
11. CONTROLLIN.J OFFVICE NAME ANO ADDRESS T-DAIL.
Air Force Office of Scientifjx.. a.o, h/NM 198Boiling AFB, DC 20332 ).Numkf
-T. MONIVORINC, AOENCY NAME 4 AORW0ffr-inf hforn Cntoilinlml flhllee) IS, SECURITY CLASS, (of this report)
UNCLASSIFIED
1T OEC I.. AS-IOPICATION/DO*NORtAO1.0scMko EOLE
16. DISTRIBUTION STATEMENT (of thoe Report)
APPROVED FOR PUBLIC RELEASE,_DISTRIBUTION UNLIMITED.
7,. DISTRIBUTION STATEMENT (oft he Abetract entered in nleek 20, It dillecent from Report)
1S. SUPPLEMENTARY NOTES
19, KCEY WORDS (Continue On reverae ald# it necoa ry arnd Identify by block nimber)
20 ASTRCT (ORIsideOf ecesaryendIdentlify by block number)
'The charge e4f-nFQGr /rant wstesu fanwrasomemory intensive digital arithmetic units based ofl modular algebra. Thenew class of arithmetic units, developed under this grant, operate at veryhigh speeds, admit VLSI and tit-slice realizations, and can be integrated intodigital signal processing systems,
DD A Nm 1473 901flm of I NOV 45 is 0UBOL 175
gEC-jRITY CL ASSIPICATie)N Ow Tmilt P ACE (1NOM~ flre Frrfered)
FINAL REPORT
AFOSR Grant F49620-79-C-0066
Title: A New Multiplier Structure For Digital Signal Processing
Principal Investigator: Dr. Fred J. TaylorDepartment of Electrical andComputer Engineering
'Jniversity of CincinnatiCincinnati, Ohio 45221
Grant Monitor: Dr. J. BrainAFOSR/NMBolling Air Force Base, DCWashington, DC 20332
__ DTCAccessi on Fo TMI~S GRA&I
DTIC TAB ciS3E3190
Distribut 'on/
Availability Codes IFOCNOT1z orTrafflMI2TTAL TO PDC
Dib ! peas].This techniical report ha.s been reviewed and Isapproved for p~ublic release IMW AIR 190-12 (7b).
I A II Distribution is unlimited,11111A. D. BLOSI
i ~ ~ Te ehnioal InfUatoV Offioe'w
TABLE OF CONTENTS
PagePART I Introduction P
Residue Arithmetic 3
RNS Capabilities 8
Large Moduli Au 11
Modulo p Adder 18
VLSI-RNS Multipliers 23
VLSI-RNS Multiplier Structure 23
RNS to Decimal Conversion 29
Overflow Tolerant RNS Multipliers 41
Error Analysis 43PART II System Design In the RNS 49
M-S Residue Architecture (MSR) 54
Finite Wordlength Effects 55
M-K RNS Filter Architecture 56
RNS Filter Design 62
MK Architecture 62
JKL-RNS Architecture 69
Distributed RNS Filter 69PART III Applications 81
PART IV Summary 81
References 82
Appendix A
Appendix B
Appendix C
iF
INTRODUCTION
The charge of AFOSR Grant F49620-79-C-0066 was the study of a new class
of memory intensive digital arithmetic units based on modular algebra. The
newi class of arithmetic units, developed under this grant, operate at very
high speeds, admit VLSI and bit-slice realizations, and can be integrated
into digital signal processing systems.
Numerous authors have demonstrated the potential of residue arithmetic
for replizing high speed signal processing and computational systems. [1'6 ]
These methods are memory intensive in that the table lookup operations are
used to perform modular arithmetic. However, there is a possible flaw in
this contemporary residue arithmetic work and it is our dependence on high
speed memory. Admittedly, memory is becoming available with higher densities
and access speeds. However, they present a non-trivial power demand on the
system and are very expensive. For example, Intel's HMOS 1K x 4 memory, having
access times of 55, 70, and 80 ns cost on the order of $82, $76, and $62 per
copy. INMOS 16K (4k x 4) static RAM is available at 35 ns and Fairchild markets
a 1K ECL (high power dissipation) RAM at higher per unit costsEil. Our research
has determined that by using K x I HM0S memories, an equivalent 12 x 12 RNS
multiplier could be configured which has a pipelined throughput of 35 ns, but
it would require 9.9 watts of active pok~er and 1.65 watts standby. Furthermore,
the cost per moduli would exceed $1,000. Tf-erefore, high performance residue
based signal processing systems may carry a high Drice tag as well. It may
49 therefore be wise to rethink our dependence on memory intensive arithmetic.
Footnote [ij: This condition will be strongly influenced by the results ofDOD's VHSIC Program.
2
It would seem advantageous to architecL future residue arithmetic
based systems on those technologies which will provide the highest performance
in terms o.:
1. speed
2. cost
3. power dissipation
4. packaging
5. availability
6. system compatability
parameters. The last two parameters are unfortunately often neglected in
exploratory research efforts. It would reflect poor engineering practice
to develop a technology dependent theory which is incompatable with it's
elcctronic environment. The technology which seems to yield the greatest
promise is the VLSI.J i High performance arithmetic unit5 are already
available in VLSI. For example, the TRW-VLSI carry-save 2's complement
multiplier line breaks down as follows:
TAB'[>L E I
. UNTIT SIZE PIfUS SPEED(ns) POWER (watts)
IMPY8HJ-l 8 x 8 40 45 1.5
MPY-12HJ 12 x 12 64 80 2.7
MPY-16HJ 16 x 16 64 100 4.0
MPY-2411J 24 x 24 64 200 5.0
By comparison, the 12 x 12 35 ns RNS multiplier is more than twice as fast
as the VLSI unit hut consumes more than 3.5 times the power. However, the
above VLSI multiplier units are designed to work in 2's complement and
Footnote [i: Small Scale Integration [SSI] = 50 gates, MSI 50-100 gatesVLSI : 4000 gates/chip
S 3
therefore do not support residue arithmetic directly. Since these basic
fixed point 2's complement VLSI multipliers offer outstanding performance
tor the price, it is desirable to integrate them into a residue number
system (RNS) structure.
RESIDUE ARITHMETIC
Before a mcaningful dislog on residue arithmetic units and systems
can be established, the fundamental properties of this numbering system
should be reviewed.
Residue number system (RNS) is mature mathematical study. A serious
study of the RNS was offered by Gauss in the 19th century. In 1968 Szabo
and Tanaka published the book "Residue Arithmetic and Its Applications to
Computer Technology". [7 ] Due to the technological limitations of the period,
the book did not receive wide-spread ecognition and was soon out of print.
However, due to the recent availability of high density high performance
Read Only Memory (ROM) and Random Access Memory (RAM), the RNS is being re-
investigated for the application in digital filter design, implemlentation
of fast transforms, convolution, and optical computation.
Let P=(plP 2,...,pL) be a set of relatively prime integers, and let xL
be any integer in [0, M-l] (called dynamic range) where M = 11 p.. Theni =I l
by the Eucledian algorithm, there exists ki,xieI(integers), such that
i ~~ x:ii+Xi i=l,2,...,L 1
The quantity x is called the ith residue of x, and is usually denoted as
Jxi or xmod pi.pi
.4
It is easy to show that x and M+x have the same residue representation.
Only if xe[O, M-I], can x be uniquely determined by the L-tuple (xlx 2,...,XL.
In this case denote R=(Xlx 2 ... A).
Another signed encoding scheme can also be used. In this case, the
dynamic range is [-(M-l)/2, (M-I)/2] with a negative number -lxi coded as
M-(xj. There is a trivial the isomorphism wnich maps [0, M-1] onto [-(M-l)/2,
(M-1)12]. This second coding scheme has the advantag,: t:,at sign detection
is not required during arithmetic operations and the sign of the result will
take care of itself providing that no overflow (out of dynamic range) had
occurred.
The following ire somde identities which will be used later. The proofs
are straight forward and will not be presented.
lx+yl, Ixitiyj 2.
IXyIp I IXplylp. 3.=1 4
I-xlp 1p-xp 4.
Let Zp be the set of integers x such that O<x<p (ie. residue class).
It is well known that Z is an abelian ring under addition and multiplicationp
modulo p. For any integer xeZp, the inverse of x is the integer ycZ suchp
that Ixylp = 1. It is also known that if x is relatively prime to p, then x
has an unique inverse, denoted x-l[p]. For example in Z6 ,5-1 [6]=5.
Arithmetic operations in RNS are defined in a straightforward manner.
Let x, ycZ, x, ye[O, M-1] and x=(xl,X2,...,xL), Y Then
z:koy:(zl,z2,..,zL) where zi:(xioyi ) vid pi, for i=l,2,...,L, and "o" denote
llz~l'g"!
5
the operation x, + or -.
It is clear that the sub-operation within each modulus is independent
to each other. That is, no carry information is necessary between n;oduli.
The arithmetic is also exact and therefore free of round-off error. The z
is exact if O<z<M-l, however if xoy>M (overflow) then the answer will be
incorrect. Hence it is critical to know beforehand that the result will
not exceed the finite RNS dynamic range. Division in RNS is known to be
lifficult. Therefore RNS is considered to be best applied to system where
division is not the dominant operations.
Another RNS induced scheme is called the mixed-radix numbering system
(MRNS). Given the moduli set P=(plP2,..pL), any integer xe[O, M-1] can
be expressed uniquely as
X=RI + 2PI+R 3PIP 2 .+ '.. 4 XL(PIP 2 . PL-1 ) 5.
i-ilet ql=l and qi= Ii P., eq. (5) can be written as
j=l
x=Xl ql +x2 q2+x3q3+. "" +'qL 6.
or equivalently, x can be represented uniquely by the L-tuple x=<R1 , 2 .... XL>.
The Ri's are called the mixed radix digits with O<_<Pi-l. The mixed-radix
number system is a weighted number system. Therefore carries between digits
are necessary in arithmetic operations. A property of a weighted number system
is -hat magnitude comparison is trivial.
It is often necessary to compute the M.R. digits <R ,'2.
from given set residue digits (xl,x2 , .. XL). Here
(x 1 ,x 2
l + ..... + Ll i 7.
Hence,
L-1'x JPl : x l =P1+x2Pl + .... + +L i1Ii Pi ixIpl l 8.
i=l 1
After subtracting Rl from both sides of eq. (7), one obtains
L-1x-x=x 2Pi ...... RL R P "
9L-I
1x-R P2IR2l .....1 =+XIL if Pil ={ P I
i TP2 "PIP2
Upon multiply both sides by pl-1[p 2] which exists by the reativeiy prime
property of p1 and P2 1 one obtains
S(x-R1)PI-2I [P2] 1 2= A Pl 1 [P 2 p2 2 0.
This process can be carried out successively until all Ri's areobtained. Actually, the iterate process can be realized in paralel formdue to the independence of residues. An algorithm found in Szabo and
Tanaka [7] can be used.
It was noted that division is difficult in RNS. However, in the casethat the divisor is a fixed constant c (where c is relatively prime toPi,i .. L), there is known to exist some simplification of the scaling
task. The scaling operation is formly defined as follows: GivenP=(PlP2 ,pL) and x-(xl,x ..... xL), what is the residue representation
of I ? (where I I denotes the rounding to th-e closest integer operation.)
From the Eucledian Algorithm, namely
x - i W -ij 10.
c '
it follows that
c, c Ii - c__ 1
which is of integer value. The residue representation of this integer is
given in teri.m of a "scaling kernel" satisfying,
xi -:c [p l " -1I~i iII Jl i x Ii~ -~ i i ,Pi lx-lxIcIc [PillP 12-
Thus, if xi~c is known, then the residue representation of ,- can be
Tui XC C
obtained using one subtraction and one multiplication. Since c are relatively
prime, c- [pi ] exists w.r.t. pi. Usually Ixic will not be given and nave to
be found by a base extension algorithm.
The integer value of a residue representation can also be obtained
through the use of Chinese Remainder Theorem.
Given P = (pP 2 .... pL); where Pi, Pj are relatively pri;:e for i j,
the CRT states;
Theorem (Chinese Remainder Theorem)
L
1xIM I E m 13.
where
8
M = f pj and jmim i [pi]lp i 13.Pi j=l
jfi
Proof: Since a residue number represents an integer uniquely in the dynamic
range [0, M-1]. It is enough to show that the right-hand side of eq. (13)
has residues (xl,x2, .... ,XL). Since
L -1 1 L
3 PjPl pip 3i= 3i:jmj Jxjmj'l I[pj]lpj i pj= Ixj I pj 2:Xj V :I,11...,IL
The claim follows.
*Notice that the left-hand side of eq. (13) is in the form of IXIM. That is,
the resulting integer will be unique if O<x<M.
There is yet another method which may be used to decode a residue tuple.
This method has been independently reported by Jenkin[8] and Julian.[9]
Starting from the residue representation (x1,x2 .... 'Y' <Xl'X2' . 9 >
is obtained through a M.R. conversion. Then eq. (5) will be used to recon-
struct x. This method is called M.R. reconstruction.
RNS CAPABILITIES
Interest in the RNS is due to its ability to perform high-speed arithmetic.
Speed is achieved through the use of a high degree of parallelism and an
absence of carry information requirements. Recall that the arithmetic composi-
tion of two integers, say x- (x x... ,x and Y (YI" given by x y (whereI'" L L
9
o denotes addition, subtraction, or multiplication) satisfies xoy--(xl,,Yl,
.... ,xLoYL). It can be seen each residue digit, namely xioy i can be computed
independent of all others (ie: no carry information requirements). In
practice, the mapping of xi and yi into xioyi is accomplished using table
lookups where the table residue on randomly accessed read-only memory. Typical
high-speed memory modules, which are currently available, are:
TABLE 2
Device Type Technology Configuration Access-Speed
10149 ROM ECL 256x4 20 nsSN54S ROM TTL 1024x4 35 ns214711-1 RAM HMOS 4096x1 30 ns2125H-1 RAM HMOS 1024xi 20 ns12167 RAM 1IMOS 16384xi 45 nsIMS1400 RAM MOS 16384x4 30 ns
The product of two residues modulo pi, Pi< 2n can be precomputed and
stored in a 2mnxn-bit memory unit where m=2n. Using a large existing high-
speed memory (4Kxl at 30 ns), residues having up to six bit integer values
can be used (ex: P = {64,63,...}). Thus, fixed-point multipliers having a
dynamic range of [-M/2,M/2) can be architected which have execution rates
in the low nanoseconds.
The disadvantages of the residue number systems are manifold. Since
the RNS possess no most significant digit, decimal to residue conversion,
division, magnitude comparison, and arithmetic shift operations are cumber-
some dnd should be avoided. Register overflow, due to its finite dynamic
range, impose a severe constraint on the RNS operations. Unlike weighted
numbers (decimal, binary, etc.) where rounding or truncating least significant
digits can control overflow, such is not the case in the RNS. Since there
is an absence of least significant digits, the more general and inefficient
.. . .... .
10
operation known as scaling must be used. Since scaling is a form of division,
its use should be discouraged. To gain insight into this problem, consider
the inner product of two 31-dimensional real vectors Y and y whose entries
are encoded as residue digits with respect to P = 32,31,29,27}. Without
scaling, the worst-case value of x and y would be limited to V-5 where
V = (M/2)/31 = 25056. Therefore, to insure that no worst case overflow
can occur, a 7.3-bit (ie: V-5 = 158 2 dynamic range limitation must
be imposed on x and y. With scaling,larger input ranges can be used at
the expense of statistical accuracy in the output space (analogous to
roundoff errors).
Due to the dynamic range limitation of RNS systems, one is generally
forced to accept one of the following two overflow prevention strategies.
1. Increase the dynamic range to a sufficiently large value
by adding more moduli to P, or
2. Make scaling a more efficient operation.
The first option represents a brute force attack to the problem. Such
an approach will increase to cost and complexity metrics of a filter. In
addition, the moduli set P must be tailored to unique filter. The other
approach appears to be the most popular at this time. Szabo and Tanaka,
and others, have concentrated on the scaling efficiency through the choice
of the three-tuple moduli set P : 2n- 2n 2n 1 } This moduli set has
the ability to efficiently scale a residue number by any one of the chosen
moduli. However, there is an intrinsic limitation plaguing this method and
it is its dynamic range. Using a large high-speed memory unit, say 4Kxl,
the input addressing space is limited to 212. This means that a moduli pi
is technically limited to pi< 26 (ie: xi-yi<2 12). Therefore, the dynamic
range of any modular operation is given by M=(2nl)(2n)(2n+l)-23n=.18
In many applications, an 18-bit resolution is insufficient resolution.
LARGE MODULI AU
It is desirable to keep the previously discussed three moduli structure
for purposes of potential scaling needs. However, in order to overcome the
existing disadvantages of this system, that of dynamic range, a new archi-
tecture is called for. Since it is unrealistic to assume substantially larger
density high-speed memories will continue to become available, it is incum-
bent that more memory efficient residue arithmetic unit be designed. An
efficient algorithm, which is ideally suited for this application, is known
as the quarter-square multiplier.[lO-12]
= <P(s+)-4(s)> 14.<p
where @(s) = <s2>~ with s+ = (x+y)/2 and s =(x-y)/2.
The quarter-squared multiplier has been studied by J.M. Pollard (1976)
in a Galois field. Questions of hardware implementation were not considered
and, due to the Galois field limitation, only prime moduli could be considered.
H. Nussbaumer (1976) studied the quarter-square multiplier over real fields
for use in ROM intensive digital filters. Soderstrand and Fields (1977) made
brief reference to this multiplier for residue arithmetic but offered rio
satisfactory hardware realization. Our research has produced a practical
residue arithmetic quarter-squared modular multiplier in commercially available
hardware.
12
A problem that would seem to plague the quarter-square multiplier is
the need to realize the division by two the sums and differences. In general,
the existence of an N-1 , such that <N-tIl>p=l, can only be guaranteed if N
is relatively prime to p. Since one of the chosen moduli is p=2n, multi-
plicative inverse of 2 cannot be guaranteed to exist. Therefore, the quarter
cannot be directly interpreted as the equation <<1/4> <(x+y) 2-(x-y)2 > >Pi i piP
The potential problem of dividing the sum of differences, found in equation 4,
by 2, will be explicitly and efficiently treated for the first time later in
this paper. For a 2m word memory unit, the direct product architecture (ie:
xy) would limit the maximal rnoduli to be bounded by 2n, n=m/2. In fact, this
claim can be extended to the case where p = 2n+1 through use of the following
modification. Observe that if xi = 0, then it automatically follows that
<xiYi>Pi = 0. Therefore, if xi =O.OAO ...0 (which is detectable condition in
that the (n+l)st bit and remaining n-bit block is zero (0+OAO0...0)) the out-
put register would be automatically cleared. Therefore, the lookup table need
not be accessed for this case. Instead, the all zero n-bit portion of the
table address, allocated to xi , can be used to represent xi=2n where xi
2nlA 00 ...0. Here, the table would be programmed to map yi into <2nyi>
using only a 2m word memory.
The memory requirements associated with the quarter-square multiplier
are substantially less than those of direct mechanizations. First, it should
be apparent that the integers s+ and s, found in equation 14, are bounded
from above by 2n+l. Therefore, only a (n+l)-bit table addressing space is
required to realize (s) versus the 2n-bit space needed for direct architectures.
It would appear however, that there is an exception to this rule. Since one
13
of the moduli chosen is p 2n+,. Here the maximal value of s+(or s-) is 2n+l
which would technically require a (n+2)-bit address. However, by using the
protocol found in Figure 2, which is an adaptation of the network found in
Figure 1, the table size can be reduced to 2n+l words for all moduli. Here,
the overflow bit serves to differentiate s+=O from 2n+l
The quarter-squared architecture is abstracted in Figure 3. It uses a
2n+l word high-speed memory for modular arithmetic lookup operations. Using,
for example, the previously referenced 4K-30 ns device, moduli having an 11-bit
dynamic range (vs. 6-bit in the direct form) can be mechanized. This would
yield a three-moduli dynamic range on the order of 23(11)-8.6-I09. That is,
without an increase in memory size (and therefore access time), the dynamic
range of the quarter-squared is 233/218=215 times larger than that obtainable
through direct means! This large increase in dynamic range makes the RNS a
viable alternative to traditional filter design methods. Both improved pre-
cision and throughput (through the reduction or absence of traditional scaling
operations) can be achieved.
Several versions of the multiplier algorithm can be considered. They are
summarized in Figure 4. The first, called the sequential form, would have an
estimated throughput rate of 240 ns based on a 60 ns lookahead adder and memory
having an access time of 30 ns with a cycle time of 60 ns. The second archi-
tecture, called the parallel form, would run at a 180 ns rate. The parallel
architecture is preferred because its higher speed, simpler control. A 60 ns
pipelined execution rate can be purchased at a small hardware cost.
Upon closer investigation of the table lookup data base, a potential
nuisance can be found. It can be examplified by observing that if s- is,
14
MSE3RAM or ROMvZ=Xy MOD P CLEARPz 2ri
R=2IF X L 0Yy IF Q 0
2=?' IF Y L 0
Fgureo 101110 21141 ATAl
flnifO R~ +l
SELEC s h2
P'igire2. cmov Copr(ssi~l od p-
c
CL
N. >
0.
+Lf) I )
17
x AA
1 12
A 1 X-y CI A 2 C EUKTAIl2
- 6OS0f4A. l80ns.(S-
Fiur 6. A 0r6hi, -N' 24
A4
18
18
odd and p'2, then (st)=<s2/4>32=x.25. Therefore, it may be required that
two additional fractional bits may need to be added to the table's output
wordiength. However, this is not the case as suggested by the following
theorem:
Theorem: Let Il vii denote the integer value of v. Then z<Ii (s+)ii- i-(s-)i >
That is, only the integer value of need be used and the fractional bits of
4(s±) can be ignored.
Proof: For x, y and k integers, one may define two rational numbers, namely
(x+y)/2-v+k/2; (x-y)/2=q+b/2 where k=O or 1. Then z=<<(x+y) 2/4> p-<(x-y) 2/4>p >p
<<v~kv+b2/4>p-<q+dv+k2/4> >p=<<v+kv>p+k2/4-q+k> 2/4>p=< (s+)- (s-)>pp p' p p-1 p
As a result, the parallel architecture is equivalent to that shown in
Figure 5. Furthermore, by deriving the above theorem over a rational field,
and showing that the results pertain to the integers, several classical pro-
blems are overcome:
1. The quarter-squared multiplier is not restricted to the
Galois fields suggested by Pollard.
2. The question of the existence of the multiplicative
inverse of 4 is now moot.
MODULO p ADDER
The Quarter-square multiplier requires a modulo p adder be used to com-
bine the two component parts of the solution (namely 4(s+) and 6(s-)). Modulo
p adders pose an interesting design problem. Unless a fast modulo p adder
can be fabricated, the overhead associated with addition will offset any gain
in throughput achieved through table lookups. For the moduli chosen, 2n-l, 2n,
and 2n+l, only the modulo 2n adder can be realized directly (n-bit adder with
tr, 19
D 1 0 12
iI
V)~
LIJ
a- Q
c~
LiO
20
ignored overflow). It would however, be desirable to use a n-bit adder to
realize the modulo 2 n-1 and 2+l adder as well. For the purpose of clarity,
let s be defined to be the sum of (s+) and f(s). The following observation
then follows:
TABLE 3
Dynamic Integer Modulo 214 Adder Modu1o p. Adder Example:U=3 iCase Range of S <s>2 1 OVF-BIT ) i '-s> s < s>
- - -- i
I s=O 0 0 2 l 10 0 0
2 1<s<2-2 s 0 2 '-1 s 4 4
3 s=2N- 1 s 0 2- I 0 7 0
4 S=2N 0L I 1 s- 2 +I 8 1
5 2N+I <s< 2N-4 s-2 N 1 2-1 s-2 N+1 10 3
6 s=O 0 0 2I 0 0 0
7 l<s<2 -I s 0 2 S 4 4
NN8 s=2 0 1 2 0 ,8 0
9 2 +I<s<2-2 s-2 N 1 2N s-2 10 2
10 s=O 0 0 2+1 0 0 0
11 1<s<2 -1 s 0 2 +1 s 4 4
12 s=2 0 1 2N 1 s 8 8
13 2 l+1..s< 2 N1I s- 2N 1 2 I s- 2N-1 10 1
14 s-2 N I I 0 0 2?%I is-21-1 16 7(special case)
Using n-bit AID gates to sense the zero condition of <s>2N, the overflow
bit OVF the sign bits of (s+) and *(s-) combinational logic can be defined
which will <s>2N into <s>pi It can be noted from the data found in Table 12*
21
that the mapping requirements are:
1. for p = 2141 map s to s or s-2N+]=<<s>N+> -.5.
2. for p = 2N map s to s-2N<S>2N
3. for p = 2"+I, map s to s or s 2Nl=<s> N>
Mapping two is trivially satisfied with an n-bit adder. The other two
mappings require that s remains unchanged or it is decremented or incremented
by unity. There are several ways to approach this problem. Bioul, Davis,
and Quisquater have presented an unorthodox architecture for a modulo (2 l)
adder using two-input gates.L Modulo (2n+1) adders can also be realized
through the use of end-around-carries. However, compared to inodulo 2n addi-
tion, this approach would almost double the addition delay. This extended
delay problem can be overcome through added complexity (ie: tine multiplexing
two end-around-carry adders). Mapping one and three can be efficiently
realized in the manner suggested by the example found in Appendix A. The
functional operation of adding one (mapping 1) or subtracting one (mapping 3)
from the output of an n-bit adder is performed by a PLA. The PLA will provide
an overlay mask which accomplishes the required task. !he derivation and
utility of the mask can be understood in the context of the following example.
Example: Suppose s is an 11-bit word having a decimal value of slO = 92 or
s2=O000101lO0. If slo-I = 91 or (slo-l)2+ 00001011 lll; is desired, one notes
that only the 3-LSB's of s2 need be altered. In general, for n=12, only the
i = following i3 distinct birary masks are required to for.n (slo-1)2.
22
MSB Pattern LSD Uotatiun
X X X X X X X X X XA 'A leave corresponding bit
X X X X X X X X X X X 0 location of s2 unchanged ](or 0)X X X X X X X X X X 0 1 - change corresponding bit
location of s to I (or 0)
0 1 1 1 1 1 1 1 1 1 1 1 Table II. MASK
Suppose the moduli p = 2n+l, n : 12, is to be implemented. By using two
commercially available 16x9 PLA's in parallel, the 12-bit output of an n-bit
adder (shown as <s> in Table I) and the four previously specified control2N
bits, can be converted to 13-bit mask. The mask would transform the output
of a high-speed n-bit adder to s or s-2nl, depending on the state of the 4
control bits. Based on a 25-ns 12-bit Schottky lookahead adder, a 20-ns
16x9 PLA, and 10-ns FET mask switches (in notation comments of Table II) a
65-ns modulo p adder, for p=2n-l, 2n , and 2n+l can be realized. The presence
of a 65-ns modulo p adder will now allow a 140-ns large moduli residue multi-
plier based on 35-ns 4Kxl HMOS memory units. (See Figure 5) For a moduli set
{2l2-1, 212, 2l2+1}, a fixed poirt multiplier, having an output dynamic range
of 236-212, can thus be fabricated having a word rate :f 7.143 M multiplications
per-second. This compares favorably with new 16xl6 VLSI multipliers. Using a
pipelined architecture, which requires the insertion of the storage registers
found in Figure 5, a ve.y impressive throughput figure of 28.5 M multiplica-
tions per second. It is important, and fortunate to realize that the Intel
A HMOS memory unit, used in this analysis, has a cycle time equal to the access
I. _ M
23
time. If, as is often found in practice, a memory unit has a cycle time
approximately twice the access time, then pipeline delay would increase
from 35-ns to 70-ns.
VLSI RNS MULTIPLIERS
As previously noted, 16--bit three-moduli 35 ns pipelined multiplier is
more than three times as fast as the VLSI unit but consumes more than four
times the power and is significantly more complex. However, the above VLSI
multiplier units are designed to work in 2's complement are therefore do
not support residue arithmetic directly. In this paper, the algebraic
elegance and speed of the RNS is combined with the technological advantages
of VLSI to achieve high-performance modular multiplier.
Since the RNS is an exact numbering system, the nesting of modular
arithmetic operations can result in register overflow. Register overflow
occurs when the result of an arithmetic operation exceeds the admissible
dynamic range M. For a set of relatively prime moduli set P={pl,...,pL },
M=1Ipi,i=l,2,...,L. Overflow prevention in the RNS is accomplished through
the use of a relatively inefficient operation referred to as scaling. This
can be mechanized using the mixed-radix conversion algorithm or the Chinese
Remainder Theorem.[5 To insure that there will be some degree of efficiency
in the scaling operation, the moduli set must be carefully chosen.[6] A
particularly useful moduli set is P={ 2nl_, 2n, 2n+11. Based on this choice
of moduli, a VLSI based residue multiplier can be realized in commercially
available hardware.
VLSI-RNS MULTIPLIER STRUCTURE
This structure will be presented as three special cases.
24
Moduli P=2n
For the purpose of discussion, consider p=2 n to be a moduli and x,
xeZp, to be the composite number
<X>p=X=2 mXHI+XLO; Xm=XHI or XLO, Qxrng m-1 16.
where m=n/2. Here xHl and xLO are m-bit positive integers and Zp is the
residue class of integers modulo p. For a y having the same format, it
follows that z=<xy>p is given by
z=<Xy>p=<2na+b+2m(c+d)>p 17.
where: m-)2,2n-.
b~xLOYLO; O<b<(2m-l) 2<2n_,
C=xHIYLO; O<c<(2m-1)2 <2n-
d=xLoYI1; O<d<(2m-l) 2<2n 1
v=c+d; O<V<2(2m-l)2
n mnnmUnder the hypothesis that p=2 n , and noting 2 M=2 /2, z computes to be
0
Z=<<2n a2n+<b>2n+2n c+d)/2m>2n>2n 19.
The last term in equation 19 may seem to pose a potential hardware realiza-
tion difficulty. However, this need not be the case in light of the follow-
ing interpretation. Suppose that the (n+l)-bit binary representation of the
?4
25
positive integer v=(c+d) has the form xx.. .x (x=f or 1). Then V/2m can 6e
formed by simply defining the binary point to precede the mth LSB. That is,
V/2m = IV+.XV where IV is the integer part of V/2m and XV the fractional
part. Thus <2nv/2m>2n = <2n'V+.XV)>2n = <2n--' IV>2n+<2n(XV)> 2n. Computing
<2n(XV)>2n could promise to be an inefficient operation if conventional
digital methods are used. However, this need not be the case since XV is
known to be a m bit word where m is n/2. For example, if a 24-bit moduli is
desired (which represents a substantial improvement over the 5-bit moduli
typically found in the literature), then m=12, and a 4K x 1 high speed (35ns)
memory can be used to implement the mapping <2n(xv)>2n as a table lookup
operation. The partial product terms could then be combined by a moduli 2n
adder to form z=<xy> .
Moduli p=2 n-
Equation 16 can be rewritten in terms of the following set of relation-
shipsshipsi: 2n=(2n-l)+l
ii. 2m=2 n/2m=(2n-)/ 2m+/ 2m 20.
with
z=<((2n-l)a+a)+b+((2n-l)(V/2m)+V/2m)>p 21.
From the previous analysis, one notes that <a>p=a, <b>p=b and V/2mIV+.XV,
with
<(2nl)V/2m>p =<(2n-l)(IV+.XV)>p=<(2n-l).XV> . 22.
AUsing lookup operations and a 2m word memory unit the modular mapping can
again be implemented directly. The term V/2m is, as previously stated, is
26
simply reassignment of the binary point of V. Again the partial product
terms would be recombined using a moduli p adder.
Moduli p=2n+l
This case requires special attention since it is not completely
analogous to the previous case considered. In particular, riot all the
residues in the residue class Z can be encoded into an n-bit word andP
represented as x=2mxHi +XLo. In other words, the admissible residue x=2n
does not conform to the accepted data format. However, x=2n is an easily
detected case since it is represented by x+l000... 0 (ie: test MSB for I
and AND with n-LSB's of O's). If x is detected to have a value of 2n
then only the following events are admissible
TABLE 4
x y z=<xY>p p=21+1 example, n=6
n <2n (2+1 )Y-Y 2n 1 = -Y>'2n+1 <64(y=5)> 65
I= _65-6=60
2n <22 n>2n+l
=<( 2 l 1 )2 -2( 2n+,)I > =1 <4096>65=1
These two possible events can be separately programncd without reducina
throughput. That is, upon receipt of x (or y) = 2n, the output will be
immediately set to <-y>p (or 1).
An architecture capable of realizing the proposed large moduli multi-
plier in VLSI is suggested in Figure 5. This system is composed of four
commercially available VLSI multipliers, one custom VLSI Quad moduli p adder,
and memory units for table lookup use. More will be said on the structure
of the modulo p adder in the next section. For values of n=24 or 16-bits,
and based on commercial multiplier specifications, a three moduli multiplier
system can be built having a dynamic range on the order of 72 to 48-bits.
Furthermore, based on these parameters, a multiplier can be partitioned into
four 100 ns operations. This translates into a real-time throughput of 2.5 M
multipliers per second for a serial realization or 10 M multiplier per second
when pipelined (a most impressive 72 to 48-bit multiplication rate). The multi-
plier, suggested in Figure 5 performs the following ierations.
28
TABLE 5
O peration p=2 n p=?"-I p=2nil Level Remarks
Si : <20 a -a 1 VLSI mul tipl ier
S2-x LOyLO>p- h ) b b 1 VLSI multiplier
T1 :x 1lYLO-" c c c 1 VLSI multiplier
T2d d I d 1 VLSI multiplier
U:<a+b>0- 1) <a+)>p b-a 2 mod p adder
V:c+d.. IV IV IV 2 adder-shift register
V': 0 +IV/2m -IV/ 2' 2 adder-shift register
XV: .XV .XV .XV 2 adder-shift register
WlI<U+V>p Wl W 3 mod p adder
W2:<p.XV> w2 w2 3 table lookup
Zk <w, +w2> p 2 2 2 4 mod p adder
Fxample: n=6, ren/2=3, p2 6 64 <xy>6 4 <558> 64 46
Let x = 1 = 2(8) 4.2 2:2 (I!I:LO) <xy>63=<558>63 =54
6 3y 31 = 3(8)-17 3:7 (11:1.0) xy-6. , 5, 3P
a=6, 1)=14, c::14, (1-6
29
TABLE 6
Opera ti on p-64 p=63 p-65
S1:a -64(6)- 10 <63(6) '6:. -6 .65(G)-6-pz-6S2:b 1).14>.1 •14 .'14 'P= 4 .14 > = 14
T1:c 14 14 14
T2:d 6 6 6
U:u 14 20
V:v 20-001OOO0 20,0I1O100 20-,00101O0
VI: SETrO 20/8-l0010.100 -20/8--0010.100
.XV .100 .100 .10
Wl :w1 144,O001110 22.5-,001011O.1O0 5.5-000011.100
W2:w 2(looku5) 32,0100000 31.5.,0011111.1O0 32.5--0100000.1O0
Z:z 14+32=46 22.5+31.5=54 5.5+32.5=38
RNS TO DECIMAL CONVERSION
One of the major disadvantages of the residue numbering system (RNS)
is its inability to efficiently perform magnitude comparison. Magnitude
comparisons are critical to general purpose RNS operations since they are
generic to the management of dynamic range .(register) overflow and conditional
branching. Unlike weighted numbering systems, where overflow can be efficiently
handled by comparing data fields starting at the most significant-digit, RNS
I..-,. ,30
" I I ,JI ,
- ' li1 'A
1A1
I Lb Vj
ty rIltI:T~ IIl I. '-
wJ I f .t I
,-A I ' 0
Si• 41°U clii
. .~r* (i(1' It I f
r .~
attI .4
n. C ',II
1 , J , ,,...,I "' I-' ,,, , ..
ItI
11; C1
t _ _ __ ,, , _ _ .. . . .<,Itl I"ANL - I_). I
J (.1 4
gal ri + C
(A I
ItI
.. ~ ., -,#A
-4 O~gcd t'i ~r~r -' '..~) esuy)I, 0 j tfl~ i -J
31
[41procedures are complex and time consuming.[4 Various versions of theseRNS-to-deciial routines have been published which make use of modular tablelook-up operations and distributed (bit-slice) arithmetic.[13,14] However,the methods reported in the literature require a significant hardware invest-ment and consume a disaproportionate amount of run-time compared to otherRNS computational operations (viz: addition, subtraction, and multiplication).In the AFOSR program, a three new RNS-to-decimal has been developed which issignificantly more efficient than existing techniques.
With respect to the moduli set P = {p ,p2,P3 } there exists, for O<x<M,three unique mixed radix conversion (MRC) digits x MRC FX,2,3 such that
(x,2, 3)suhta
x x + + X 3P2P3 23.where x RNS (X,2,x 3 ) with
= X
S2 =<P2-I~p3]*<X3-x7> P3> P4P3
3 24.
More specifically, for the choice of moduli P={2n_, 2n,2n+j1 , it follows that
p2 -1[pl:I; p2 -1[p3 ]2; p-~[p1]=2n-l=2n; 25.
Upon substituting these multiplicative inverses into equation 2, one obtains
72 <2n*<x = + 26.
3-2 > n = -x3 )2n+, 2n+ < x -x >2n+l 6
%n
32
= -<X -X >
T3 < n l < _2 2 n _ -< 3 x2 2 21+ t J 2'- 1
Functionally, it can be seen that
XI = fl(x2 ) ; V2:f2(xl'x 3); i3(xl'x 2,x3) 27.
The MRC algorithm can of course be realized vy using sequential methods. Here,
nexted modulo pi adders, and p 1[j] multipliers would be used to compute
xl,x2,yx. The three-tuple of mixed radix digits would be used to compute
i (equation 23) using these multiplications and additions. The disadvantage
of the direct appioach is execution speed due to the sequential complexity
of the algorithm. Throughput improvements and a reduction in complexity can
be achieved by using memory based table lookup operations to replace some
arithmetic. If high speed is to be achieved, high-speed menory units must be
employed. Such memories have a fairly restrictive input addressing space (5
to 12-bits). If -aapping f3 is to be realized, by presenting all three
12residues to a 4K-35ns RAM or ROM, then n=4, and M<2
Consider again the three moduli case where P={ 2n+1 ,2 ,s -1) which
specifies an RNS dynamic range M=plP 2P3. Based on a 4K-word high-speed
memory model, the previous medium moduli RNS-to-decimal converter was
practically limited to a size of six-bits per moduli (ie: M2 18). The
method presented in this research targets a 12-bit moduli for practical use
(ie: M2 36). The developed large moduli scheme can be easily motivated by
the data found in Figures 6a and 6b plus Table 7. The data found in these
figures and tables are based on the moduli set P=(5,4,31 and M=60. The first
three entries found in Table 7, namely x3,x2 and xi, are the residues digits
of x for x monotonically increasing over [0,59]. The fourth and fifth entries
namely Jl and J3, are hybrid parameters. Since 1p2-Pl1=1p 3-P21=l, the values
of Jl and J3 will increase by unity (in a moduli pi sense) for monotonically
I 33
iii L----- CA
L rI
IIV-
___~ L
1 I Lii L j.
L~j LL
Cl CD
LiI [71 II I n
Li I HL~1I~j ils vC1i- 0g C:) LXI Tr
34
____- AL3P~±JflN ) D
-:9 A. A:f 1
-14Fv~Fhrv Ii
rnNJ~ N
-- j-jm 1 T ~~jI j B~
It ___~Io- .J
U,
I~ 13
I I -N le
35
- 7 T - .f~,,f
- % z C. - - -j -4C
* ..
fn W7 I I n ' n
I J I
0. o Io ! --
o 0 c . O 7 . 7?- C , oL -o-. .-A 0,
cli C -40-~ - -
t~I CI~ %. ,
36
increasing values of x. The important observation is that J1 and J2 naturally
decompose into a system of cyclic patterns which shall be denoted Si2 and $32
over a subcover of M, say S2. More specifically,
S 2 = three sets of five subsets of four elements each and 0(S112 )=60
$32 = five sets of three subsets of four elements each and O(S 32)=60
S2 {kP21 10k<plP 3}
In general, for P={ 2n+l, 2n ,2n-l}{pP 2 ,P3}:
Sl2 = P3 sets of p1 subsets of P2 elements each and O(Sl 2)=M=.fpi
$32 = p, sets of P3 subsets of P2 elements each and O(S32)=M= Pi
Using more traditional algebraic terminology S1 2, $32 and S2 are ideals
in the ring of integers modulo M (ie: ZM). It is well known that in general
the mapping
xRNS (xl,X2,...,XL), xi=<x> 1 28.*1
is an onto homomorphism with kernel j'Ij. For Ii={kD i} (as is this case
here), Ij=O and Maher has shown the mapping to be isomorphic.[ 163 It should
also be apparent that due to the cyclic nature of J1 and J2, that any x
belonging to the subcover S2 has the block representation
x .~2 29.i x : {(plll+Jl)*P2; O<ll<P3; O<_Jl<pl,'o 29
or
= {(P313+J3)*P 2; O<13<pl; O<33<p3 E 30
2)
37
where Il,13,Jl, and ,3 are integers. Equating equations 12 and 13, one
obtains
P3 13-pjIl = (JI-J3) 31.
which is of the form ax+by=c. Equation 14 possesses a very important pro-
perty which will now be derived.
Lemma 1:[ 17] If a, b, c are integers and at least one a, b is nonzero,
set d=gcd(a,b), then a solution to the Diophantine equation
ax + by = c 32.
exists for integer values of x and y if and only if d/c.
Lemma 2: If b is relatively prime to a, then the congruence by = c
mod a has an integral solution x. Any solutions x, and x2 are congruent
modulo a.
From these two lemnias, the following theorem can be stated:
Theorem: Given the Diophantine equation 14
Pll-P313 = (Jl-J3) = c (see Figure 3b) 33.
the solution two-tuple (11,13) is unique.
The proof is straightforwarded and is based on the fact that p1 and P3
are relatively prime, l[O,p3-l], and 13E[O,p.-l]. Therefore, by specifying
c, the block indices (11,13) can be uniquely determined. Observe that xS 2
can be derived from knowledge of the two-tuple (11,13). However, (11,13) is
uniquely determined by c=Jl-J3. Therefore, upon presenting a (n+l)-bit word
c to a (n+l)-bit high-speed RAM or ROM, the Drecomputed value of Sl=P2P11
or S3=p2P313 can be outputted. The corresponding value of TES2 can be
realized by adding to s, the integer P2Jl to Si or P2J3 to S3. Lastly, if
38
xc[O,M), one only needs to add x2 to x. The decimal value x can therefore
be computed in the composite form
x = x2+P2Jl+llPlP 2 34.
where, due to uniqueness, the mixed radix digits ere (x2 ,Jl,nl).
In general, for P={ 2n+1 ,2n ,2n 1 }, the routine would proceed as follows:
1. Accept x RNS (xX 2,X3)
2. Form Jl:<x 2-x1 > and J2=<x3-X2>P2
3. Form J1-J3 = c
4. Map p(c)=pIP211=Sl or p3P2 13=S3
5. FORM _P2J'+SI or x=P 2J3+S3; -XS2
6. FORM x=+x2;
These steps are numerically examplified in Table 7 and diagrammed in Figure 7
for the {5,4,3} system.
Compared to conventional RNS-to-decimal conversion algorithms, the
derived algorithm possesses the following attributes:
1. no modulo M addition required as in the case of CRT or MRC methods
2. practical realization of very large moduli RNS systems
3. simple architecture and reduced complexity.
Additional refinements in the proposed method can also be obtained. First,
observe that c, the hybrid parameter which defines the argument of the mapping
4 (item 4) is a signed integer such that cc[-(2n-2),2n]. Technically, to do
the mapping (c), a (n+l)-bit high-speed RAM or ROM would be needed. This
39
+
x 1 +
XX I 2 or
GENERAL ARCH17TCTURE
FIGURE 7
40
suggests that the largest admissible moduli is 11-bits (using a 4K-memory
model). Furthermore, since cmax= 2n , it would appear as though the output
register for the signed-adder found at sten 3 would have (n+2)-bits if a
standard binary weighted code is to be used (eg: 2's complement). This pro-
blem can be overcome through the following modifications.
1. Using an (n+l)-bit (at least) sign-magnitude adder, the sum c=JI-J3
can be represented as a (n+l)-bit word having the format MSB:xx--.x.LSB.
The sum c can be partitioned into two sets V and Z given by:
xCV if MSB of x = "0i
xCZ if MSB of x = "1"
More specifically, V is a set of 2n_, elements determined by
V = {y I y = x for xc[0,2nl]}. Also, z is a set of 2n-l elements
determined by Z={zlz=x if xc[-(2n-2),-l], z=O if x=2n}. It can be
seen that the sets Y and Z are defined by the magnitude digits of
the signed magnitude value of c with the membership to Y and Z
determined by the MSB (sign-bit location) of c. The importance of
this partition is that two 2n word tables can be used to map c into
@(c). The device select line would be tied to the MSB of c as
suggested in Figure 8.
2. Another efficiency can be realized by using data packing. More
specifically, the term P2Jl+x 2 , for the considered choice of moduli,
can be rewritten to read 2 Jl+x 2. Since O<x 2<2n , and O<Jl< 2n, the
the term 2nJl+x 2 can be directly, and uniquely encoded into a (2n+l)-
bit register. This is suggested in Figure 8.
3. The proposed architecture, as in the direct realization of the mixediL
41
mixed radix conversion, requires moduli p. for Pi= 2n_-I or 2n+l.
Several such modulo adders have bpen reported in the open literature.
A very efficient 40nsec modulo 2 +1 and 2n- adder, for n<12, has
been reported in reference 18.
OVERFLOW TOLERANT RNS MULTIPLIERS
In order to extend the dynamic range of the autoscale multiplier to a
more useful size (say 12 to 16 bits), based on a 4K memory model, data com-
pression will be required. A suitable compression algorithm, based on the
quarter-square algorithm has been reported in an earlier section. Further-
more, the theoretical foundation of a compression scheme has been motivated
in the previous section. Here, data compression will be studied in the con-
text of the popular three-moduli system P={ 2n+l, 2n,2n-l} such that M=P1 P2P3=23n-2n.
Any integer over [O,M) has the unique RNS representation x RNS (x Xx).
Consider now a subcover of the range [O,M) generated by all numbers x having
an RNS representation x RNS . Obviously is defined over a
subcover of [O,M), say S2 where S = {kP 2IOk<plP 3}. The utility of this
operation is that of data compression. More specifically, only 2n-bits of
data are needed to uniquely quantify i-(ie: x-%_(XI,x 3)) versus 3n-bits for
x ie: x'_(xlx 2,X3)). The digits of _ can conveniently be defined to be:
1l x2-"2
x2 = x2-x2=O 35.
3 x2x 3
As a point of interest, this is also an operation found in the residue to
42
0 N 2N-'!
ILO L HI
n TRI- I +X2 __ 4 DATA STATEI BUS
4+3J SELECT
REFINED ARCHITECTUJRE
Figure 8
t
43
mixed radix conversion algorithm used to determine the mixed radix coeffi-
cients of the weighted representation
x Xl+x2p2+A3).j-3 36.
ERROR ANALYSIS
The mean and error variance for the extended ranae autoscale multi-
plier is a function of the chosen moduli set. Even the simplest analysis
becomes burdened with nested sums and binomial coefficients. Instead, the
error statistics of the multiplier was studied using numerical simulation.
A general purpose FORTRAN program, written on a PDP 11/60 under RSX-II, is
reported in Figure 9. In Figure 10, the product of x=16 and yc[0,29], for
P={3,4,51 is reported. The parameter Zl is the autoscaled product over
S2' Z2 is the theoretical autoscaled product, with the last column exhibiting
the error. The test software operations in either a deterministic or statis-
tical mode. In either mode, the user specifies the choice of moduli (ie:
P={p1,P2,P3}) and the number of fractional bits used to define the table
lookup data. That is, the output wordlength is given by [l1g2M]+ number of
fractional bits. In the deterministic mode, all possible values of x and y
over [O,N) are tested. However, if N is large, long execution delays can
result. To overcome this problem, a statistical approach may be used where
the integer value of x and y is randomly chosen from a uniformly distributed
process over [O,N). The test is repeated M times and the statistics analyzed.
The software presents to the user both error mean and error standard deviation.
*For example, for P={7,8,9} and zero fractional bits of accuracy, the determin-
istics error mean and standard deviation was determined to be e=-.00011476 and
l ....
ANALYSIS 'BAlICE CODE
%L '~44
-oi T~t C , ill
~T-1 11
~wfc 7* II.11
MITi ( 1 31
:fl'11T77 )1I ~pA ~I r. \O C.Atr,
'l T IIr r
-! M I T I I 6
INA r.II OF 9RC*nl- 1.IT
.1': ?IV IIT r2 12 I
1r.LCi2s -T
I S.Efl. V I !If;.0- 5TCS Is/2
n,.) TO iO
* r IYr E10 11.)T 3 3
rA I AlIJ '.1.1n
11TO 310
ml I I
J C1.v2 ITy .1/1 2T) CO I I
13') IT3.PtT (T2l)TI I a L ( , I ) I -TI . v( ( I JI)IV -*2T I _c 1 -T1 )I. L
*I'jL X'rT CO 1 Ill
'M TM ,20 %1g 2,',lI
I F !iT FIGURE92
45
X Y Z Z ERROR
,:'. I) * oI ?Iof:I)o L . rH I~ ~ )I .* (fl.(
I,~* ' c' ' IIq ' ' * C" ": ": ":"' ~ * • : 41. n: C.I.,I
I • ., 1, ., .(f.Ii.. ,*.*,&,o - , _ ': .' .
.6 . , ,2' 1 3. (:'-,C c'c'U 11 7~332l I ~-.46. .
1/, 4. "' : r. "c'c,'i'c' ') ' 1 ' ' ":"/'.4 -r ,'::33,!. ,4
.1,4 .. . ...;.r I 'I,' C!'vdI I"., 1" r;'O . ;AOO..'47
-, :'6. "' rO0000':'- 4 . 6.67 - 1733::::
.. 1 ....... .... 1 • :7- ,./-,,7,'., o ,'-, "i . ! ':r'"IJ ,'C ' 4 . . .. ' . 1 " """'"'-'
I..00'00 4.. 000 -1.200000
1. ti:: ;8. 000000 6,.9 30 ':>""":":":' .
16~~~~ .1 -0.,.,.0 P.3,3 ,IA 6 6. ,
J6.' ,71..0;..'. 000000 :9'.066667r' ' ), .. ... 9333 r)') 0 *0
16 Is, 1..,00000 ,.000'1 -0.40001..: t9 10",.. ,000000 0 1333-1... 0.1/.6 20- 1:3' ". 00000:0 10.66666/:,4,':,7 - 2..:."'""": ,=,, ,.', :16,' 21 1 ..:,,000000 11].,2..0'000''0.. A - MMON''!.. . .
16 , 2 :74 1,..".000000 1, .. 80:'0000 .. .- 2"-20,: x:00.16, "2" I.,00000,:0 1 3: "' ........3 33 -,:.:.,=,: "I 1. .,6 :,716:, ..: 26. 1..5,000000 1t .8 666:, 6 :.:,7-"Ii.,,..:.,16 2 7 1,.000000 ' 14.400000 -0.6:00000I/: 2S:''' J 7. 000000 14.9 3 1 -2.0"6.....
I .. ' ::,: 7. 000000 1 A-;.. ... ,-,6.1 .'I..,,,,,.,.:*'" '
MULTIPLIER OPEIRATlONl
FIGURE 10
46
e=-e00 379230. In the statistical mode, the results were e=-.00023643 andae=.00396498, which can be seen to be in close agreement. Table 8 andFigure 11 summarizes the results of several experintents. They are:
1. Deterministic for P={3,4,5}
2. Statistical for P={7,8,9}
3. Statistical for P={15,15,17)
4. Statistical for P={31,32,33)
for various choices of fractional-bit accuracy (denoted NN). The errorstandard deviation data has been interpreted in graphical form in Figure 6
and compared to usual theoretical model 'cien by ae2 Q2/12 or a e=Q/AT.Here Q is the quantization step size which, over S2, is given by Q=I/pIp 3.
The data is shown to be in close agreement with the theoretical model.
Lastly, it can be observed that the multipliers performance is more-or-less
invariant to the number of fractional bits used to generate the tables.
47
rm f~'ri vo i 'i ':i'i i r D 'h r v r A T r Ni\ *. *rC 7 0 *~*'i ~ ::w r opr NN '
m:rcAm o, i ti 7;r nI*~rvF~'rfI4 *~*'u:~~:7~. OR A TN -0 0 0 09 0 .1= 6284 PMRi NII4a 4MRAI'N ANDIi DID.* KVyIATT'Or':N *-0.001'I7W~ 00116i4"4 FO:R. MN#.,Mr * 1 ii r, v r:r. ;Tp w r T ION **: 0 A6 . t0. *i' v17 4 FOrR 1'I'fvMF(AN~~c:'Ii.Y( AND M. UNWAION -0.00920::y''94 J , 0 1563645'4 F'I~v - J f
I'wOD"L Ti CHO;I CE r 1 t *r'; . 7. S.
ANT ii i: i&rr; ~WAV~rION -. 0 , 00700 0.00414101('4 FRm NN'"WAN~i ~s"':i AND OTD i.rTI -0.00 :'iWo 0.0008302~(: Ci FO '?~. .?R NN,'
r:.ju i:1 AND M. MWATO R?-0.063 j.0894 FO NN* j * 1
I 1L(.II AN AM DEsrV IION - 0'nQI 1, 0.0009C06M7~y: FRii NNNI 4 .t 'rr () p i ::n ,, i' im rO -0 0 0 I 4 0 0 0 8 0 r R N :
EXPERIMENTATION SUMMARY
TABLE 83
3_ _ _ _ _ _ x
xx
.002 TH[EORY~
0
4 00013
-1IM=R15)x16 x17 '
0.000282M3lx 32 x K
.00021o 2 1
N U 1BER OF FRACTINAL BITSERROR ANALYSIS
FIGURE 11
49
PART II SYSTEM DESIGN IN THE RNS
Under the AFOSR grant, new residu-e arithmetic units were conceptualized,
researched, and tested. Key breaktbrnughs were an eflici2nt RPIS to decimal
c.onverter and an autoscale multiplier. In this section. thec t1ui~ding block5_
will be put to use in designing high speed digital systems.
The utility of the RNS in digital filtering was forwarded by Jenkins and
Leon [1] through their work in non-recursive filtering (FIR: finite impulse
,esponse). In this case the problem of register overflow, in the RNS, was
overcome through the use of a P"' norm argument. Given a FIR, satisfyir.g
yn:aix-,i=l ,...,', with jx()k11, it follows that ix(n) 1,<2ai=v
over all i. In order to insure that dynamic range overflow will not occur,
the RNS dynamic range M=wp i would be chosen so that M>V. However, the design
of recursive filters (FIR: infinite im.pulse response) is significantly more
complex. Soderstrand [4] approaches t:, problem through base-extension mthods.
Other authors have used the Chinese Remainder Theorem (CRT) or mixed radix
conversion algorithm to control dynamic range. This has been strongly
criticized because of the -;erhead normally associated with these operation:;.
The PJS concepts, developed in -art I, overcome many of these objections.
_J
50
The classic digital filter architecture, often referred to as the
Jackson-Kaiser-McDonald (JKM) filter, realizes a filter in terms of general
purpose multipliers, adders, and shift-registers. In the mid-1970's, several
new memory-intensive linear shift-invariant digitl filter architectures were
introduced. First, the distributed filter [Peled-Liu], or PL filter, was
introduced in 1974.[lO] Next, the Monkewich-Stunaart, or M-S, filter was
reported in 1975.[20] All three architectures are summarized in Figure 12.
Compared to conventional architectures, this class of memory inten.sive filters
offer the potential for high throughput. Execution speed is acheved by
replacing the relatively slow process of general digital multiplication with
table lookup scaling operations. Jenkins and Leon, in 1977, studied a memory
[I ]intensive filter architecture based on residue [modular] arithmetic. J By
exploiting the parallel nature of the residue numbering system, and using
table lookup operations to mechanize modular arithmetic, ultra-high speed
digital JKM filters were realized (see Figure 13). In most reported cases,
a residue arithmetic filter is organized into a decimal to residue encoder
stage, arithmetic-filter section, and residue to decimal converter stage.E4 ,6 ]
In this work, the fundamental structure of the residue arithmetic-filter sec-
tion will be developed and new results presented.
One of the principal limitations to the residue concept is its intolerance
of register overflow. This is a consequence of finite ring theory. Specifically,
for a set of relatively prime moduli, say P = {PP2" 'PL the residue
representation for a signal integer i, is given by i--(il,i2,...,iL), where
Ii mod pj if i>Oi j
(M-jij)mod pj if i<01
51X(n) ______________
DIRECT if 0 2
T.. T + y(n)
0KM
x~n) COEFF. RAM/ROM
a. ,5.SRM .x ACCUM. y(n)
PL
LOKU /R ACCIJM.
S/R Ln
A R C H I T E T U R E S
52
x (n) INU
DECIMIAL TO RN'S CONVERTER
RNS PATHS
RAM/ROm RAM/ROM RAM/ROM COEFFICIENTr
RAMs or ROMsRNS RNSRNALU ALU AUS
ACCUM ACCtUM ACCUM ACCUMULATORS
FIR RNS FILTER ARCHITECURE
FIGURE '13
53
kand OCij<pj, M- pk, k=l,2,... ,L. The integer i will have a unique repre-
sentation if and only if -M/2<i<M/2.
In order to insure the satisfactory performance of a residue arithmetic
filter, dynamic range overflow cannot be tolerated. If for example, a shift-
invariant filter of the form y(k+l)=Zaiy(k-i)+Zbix(k-l) is considered, the
k00 bound on y(j) (ie: max (y(j)) for all j) must be less than M/2 otherwise
uniqueness can be guaranteed. As a result, the JKM residue arithmetic filter
suffered from a severe dynamic range restriction. For example, in order for
z=ax+by to be correctly represented in a residue system z must be bounded by M.
For O<a<A, O<b<B, O<x<X, O<y<Y, then AX+BY<M. If A,B,x,y are on the order of
r-bits of precision, then M must be on the order of 2r+l-bits in range. How-
ever, this is not the only constraint. If highspeed RAM or ROM is to be used
to perform the algebra (in a lookup sense), then for n=14, a table addressing
space of 29-bits would be required. Suppose, for the sake of discussion,
MQ3-bits and rm16 bits. Using a 16K highspeed memory unit (30-50 nsec) as
ria model,[ the maximum value of a moduli pi is 7= 128. In order to achieve
the 33-bit system dynamic range (ie: M), at least 5 (ie: [33/7]) moduli, on
the order of 7 bits each, must be used. This means that five parallel paths,
complete in memory and logic, must be configured and integrated into a complete
system.
Footnote 1: l6Kxl units: INTEL 2167: access line=4Ons, enable time=4Ons,cycle time=40ns, active power 500mW, standby power=75mW:
= INMOS 1400, access time=3Ons, enable time=35ns, cycle time30ns, active power=375ns, standby power=35ns.
ii I,
L I
54
M-S RESIDUE ARCHITECTURE (MSR)
The algebraic operations found in a linear shift invariant filter are
data delays, additions, and scaling (in lieu of general multiplication).
Replacing general multipliers with residue scalars has been proposed by
several authors. [1 ,2 ,14 '18 '19] A residue multiplier would present a 2r-bit
two tuple (ai ,xi) to a ?r word memory unit and respond with the precomputed
value of (aix i) mod pi. Using the same 2r-word memory unit, a scalar would
accept a 2r-bit representation of xi. The table would respond with the
precomputed value of (,aix i) mod pi where ai is known apriori. For example,
using three 16K NMOS 30 nsec static RAMs and three moduli of the form
P={2n-l, 2 n, 2n+l , a scalar having a dynamic range on the order of 42-bits
can be realized. Using such scalars, the M-S filter architecture found in
Figure 12 can be realized.
55
FINITE WORDLENGTH EFFECTS
Generally, a digital filter is a finite precision approximation to some
user defined discrete filter defined over a real coefficient field. The
errors, due to finite wordlength effects, are v- 1 documented. It is
generally assumed that the expected truncation error variance, per multi-
plier, is given by Q/2 and Q2/12 respectively (0 represents quantization
level). However, in a residue arithmetic filter must be defined over a ring
of integers. Real numbers cannot be tolerated. For example, suppose
a=3.251 and x=l0, then ax-32.51. Rounding this results would yield an
estimated product 33. In a residue sense, with respect to a moduli set
P=(3,4,5); M=60, x RNS -- (xlx 2 ,x3 )(l,2,0), one may make two sets of calcu-
lations, namely (i) and (ii).
i) a = 3.251 ii) [a] = 3
ax = 32.51 ax = 30
<axI> 3 = <3.251>3 = .251; [251] = 0 <ax 1>3
= <3>3 = 0
<ax 2>4 <6.502>4 = 2.502; [2.502] = 3 <ax 24 = <6>4 = 2
<ax 3>5 = <0>5 = 0; [0] = 0 <ax 3>5 = <0>5 = 0
The calculations found in column i use the decimal value of a in forming
product (ax) modulo pi. The resulting products are then rounded. The final
residue digits are (0,3,0) which is equivalent to a decimal value of 15. In
column ii, the integer value of a is used to form the product ax in the usual
residue arithmetic sense. The result is seen to produce the correct truncated
value of product, namely (0,2,0) RN-S -30. Therefore, since all filter para-
meters are to be integer value over [0,M], traditional finite wordlength
error modeling and analysis techniques apply.
56
If a large dynamic range is required, in limited RNIS hardware, magnitude
scaling is required. A similar strategy is used in designing filters using
weighted fixed point arithmetic where rounding or trunication is used to
control the growth of dynamic range. In a RNS system, the problem is com-
pounded by the fact that the magnitude of a number must be known, if it is
to be scaled, and magnitude determination in the RNS is difficult. That is,
in order to support scaling in the residue number systemn, some sort of residue-
to-decimal conversion is required. Most existing residue scaling routines
makes use of base extension or mixed radix conversion schemes. ['] It has
been shown in reference [15] that in a realistic RNS system, a ten to twenty-
fold increase in computational overhead can be expected if scaling is present.
As a result, the overall throughput of an IIR-RNS filter would be compromised.
In order to achieve high data rates, over realistic dynamic ranges. in
limited hardware, a new low-overhead RNS arithmetic unit must be developed.
In the next section, such a unit will be presented.
M-K RNS FILTER ARCHITECTURE
In order to insure the uniqueness of a modular product of two numbers of
dynamic range V, the modular dynamic range must be bounded by V2 ie: V2>M%
This can be achieved through the use of a newly developed auto-scale policy.
The auto-scale arithmetic units will be shown to support memory intensive
filter architectures. Assumed that there is a practical memory wordsize
constraints. For high-speed (-,30 ns) applications, memory size is presently
limited to 4 to 16K words (ie: 12 to 14-bits).
For reasons that will become self-evident later in this section, the
n_, n, n+-,,, will b osdrdthree moduli system, given by P={pl=2 P2=2n, be considered
w m m m w m m n m m m n m m m m)
57
Using such a moduli choice, signed integers Xc[[L-M/2], [M/2]] arc uniquely
represented by the three-tuple (x1 ,x2,x3) where x1 =x i mod p. In order to
scale x, using memory table lookup operations, the magnitude of x nust be
known. That is, the RNS three-tuple (x l x21 x3 ) must be simultaneously
presented to a memory module which is programmed to output (cx) nxd pj
where cx>[[L-M/2], LM/2]]. For high-speed application, the limiting 16K
214 memory units require IPi23n,24. That is, the design would be con-
strained to consider moduli on the order of 4-bits each. Also, the dynamic
range is limited to 14-bits. Referring to Figure 7, it can be observed that
an integer "xC[L-M/2], LM/2]], satisfying x=kP 2 , O<k<pl p 3-1, has the unique
RNS description
R ( (XsIX2)mod P1 (x2-x2)mod P2 ' (x3-x2)mod P3) 37.
= ((Xl-x2)mod Pi, 0, (x2-x2)mod p3)
- (x l'0' 3 )
Observe that x is defined over a subcover of [L-/2], LM/2]], and it can be
uniquely represented as the two-tuple (XI, X3). The two tuple approximation
2n 14of x, namely x, establisnes a memory size constraint given by 2 <2 or
n<7. Now 7-bit (vs. 4-bit) n;oduli are admissible) with the dynamic range
extended to 3n or 21-bits. The memory units can be programmed to output
[cx]mod pi where c is a user specified constant. Overflow prevention can
be achieved by introducing a scale factor K so that [cx/K]=[c'x] will not
exceed the permitted RNS dynamic range. For the application under study
(viz: IIR-RNS filtering), it shall be assumed that all system variables and
58
constants belong to the integer range C-M/2, M/2) where M=pP2 =A 2 [2.
Therefore, the scale factor K needed to insure the absence of arithmetic
overflow, is K:M/2.
The error statistics associated with each auto-scaled multiplication
22 2can be shown to be bounded in mean by p2c/2M and in variance by a =p2 /36M
This has been experimentally verified: For example, for P = (15,16,17) and
(255,256,257), the error variance for the integer valued product y=cx, for
cc[O, M] given and xcFO,M] randomly choosen, is plotted in Figures 14 and
15. The error is defined t be e=(cx/M-[c:/M]) where i=kp,, kE(O,pP 13-1].
A M-S recursive RNS filter can be architected using the e.-o-scale
arithmetic unit. A dedicated auto-scale unit must be configured for each
unique filter coefficient. Each unit, in the three-moduli case, would
require three memory devices each. For example, a 9 coefficient 21-bit
resolution filter, based on 4Kxl RAM/ROM, would require
N = 9 3 6 162 (16Kxl) memory units
coef. moduli wordlength per moduli
A detailed description of an autoscaled arithmetic unit is found in Figure 16.
The modulo 2n-l and 2n+1 adders can be realized in the manner suggested by
Taylor 1 9] This architecture uses PLA's to augment a conventional n-bit
integer adder. Other realizations have been reported in the open literature. 1 3
For example, a modulo 2n-l adder can be realized in a simple straightforward
Footnote i: Unlike a 2's compliment system, where partial sum overflowcan be tolerated if the complete sum of products is bounded,each partial sum must be bounded to [-M/2, M/2) in the RNS.
)4
ci
400
I.-1
If))
CN C
U-
9lfh 1- 62 is- el 'L- T Gz- LLCi9- 09 tIL- EtOQ-980 NJ 3JNtIWU8A L40UH
C-Y)
0010)
i i
cc
cc
C~CD4t7~
UU-
ID0
00N 3alU O9
MODP2ADDER
+ C
I-
C-w > I I
[LJ
X 3 X
x RAMA>MEM ~oroMAPPIN CO
THRE MODUL EXENE RAMAUOAL ARCITETUR
ADE FIGUR RA6
61
manner. Consider the term (xl-x2) and x2 Z2n. For(x1-x2} positive, (x1-x2) mod 2 -l=(xl-x2) but for (xl-x2) negative,(x1-x 2) mod 2"-=, - x.c where1x-x-md[ 2 1 Ix1-x2q2 denotes the complement of thebinary representation of Ix1-X2 1. For example, 5 mod 7
If a real-time, or pipelined architecture is required, then it'sdesired to design the modular adder which have identical propogation delay.Using the PLA-supported architecture, modulo 2n*l adders can be realized incommercially available hardware, for n<12, having a 30 nsec delay.
A second arithmetic operation found in the three moduli namely thecomputation of v=(2n-l A) mod 2n-l, where A={(xl-x2)mod 2nl) - f(x n+,
n 2 ~ 2-x3)mod2 4can be simply comcuted. It is directly verifiable that Ac[-2 ,2 -2]. Consider
\ 2A1 + 0 if A>0?
38-- 2A.1 - if A<08
where A0CZ2 and AEcZ For .>O,v can 5e computed usinq the following
scheme: n-l I
A : =V
Example A=6, n=3, (4.6) mod 7=3.
For A<O, v is given by [c denotes complement]n-l 1
phantom sign-bit
t Example: =-6, n=3, (-6-4) mod 7=4
62
i 1 1 1I O -c-I I
or - - 0 =
As a result, v can be computed with negligable overhead and hardware.
thon
A ( 2i-' A)m)d2 I A(t i n.,ry) A )(i nary) A '( i_I 1.)
6 <24>7=3 i10 011 3
5 <20> 7= 6 101 110 6I
4 <16>7 =2 t00 010 2
3 <12>7=5 ol 101 5
2 <8>7=1 010 001 1
1 <4 > 7=4 00 1 100 4
0 <0 7=(0 00() 000 0
The discussed parmutation, from a modulo 2 n-1 adder to a buffer register,
can be realized by a hardware mapping. Here the LSB of the adder is
connected to the MB of the buffer. The other (n-l)-bits are mapped to the
buffer with shift of one bit location.
RNS FILTER DESIGN
MK Architecture
In a M-K architecture, eacn filter coefficient ci is realized with a
dedicated RNS table lookup unit. Based on the three moduli MRC algorithm
and a given ci, a MRC multiplier unit similar to the one detailed in Figure
_ would have to be co;ifigured. Here, for convenience, the multiplier 2
63
is imbedded into a lookup table. Each unit would consist of nine 2nxn-bit
memory units. It must be stated that in o-der to uv. a 2nYn meriory in a
mOuuiu ( 2 n+l) operation, some form of data compressi: is requind. The
simplist compression routine would differenrtiate between the two external
number in Z , namely 0 and 2n Those two numbers have a (n+l)-bit (.e:2nl
common n-bit data bus plus 1-bit control line) representation of 00.. .00
and 10...00 respectively. By ANgn the n LSP's and sensing the MSB, the
two conditions can be easily identified. For x=O, it follows that [cx/M]=O
and the output registers would be zeroed without any memory (table lookup)
action. This means that one of the 2n memory addresses, namely x:O, is super-
fluous and can be assigned to x=2_ +l.
It follows directly from the MRC representation that
-j p Imod p + [x3ciMPP3]mod p.)mod p. 39.[-M1 mOd P= : 1- -L m d P + ii .--- _--Jm__P
That is, the outputs of modular tables (viz: [xlci/Mjmod pa,..., 1 x3 ciPlP3/M]
mod p.) must
From a design standpoint, it is desired to configure a system which has
minimum complexity. The 3=6 possible permutations of a three moduli set are
summarized in Table 9 in general and for the specific example x=l00 for
P=(7,8,9). A key feature of the general architecture are the propogation
delay paths dt (total delay), d, and d2 (see Figure 17). For sequential
operation, the MRC digits will be available for use after td seconds. The
j lengh od delay is due primarily to the nesting of four multipliers. In
addition, td is a function of the multiplication philosophy used (bit-serial,
64
MRC
+ + X I o2)modpX d
RNS X 2"---' I
(X3-X2 )modP3 3 [2
P2 [P31 R2 P2 [Pl]
d
PIPLINED
d! d2
0)(O 2()o)dl -d- 3(01
2X3(2)
X22)R1 (2) R2 (2)
S d1
X1 (3) ooo
MRC DIAGRAM AND TIMMING
FIGURE 17
4
65
lookup, general purpose, etc.). If high throughputs and low complexity is
desired, td should be minimized. One could ai:o consider a pipelined
architecture of depth two having an effective throughput rate of t2 second
per MRC. The design of an efficient pipelined MRC processor is promised on
the condition that t2 is small and tI and t2 are comparable. Referring to
the data summarized in Table 9, it would appear as though the first ordering
admits the best design. Therefore, this ordering will be used as the model
throughout this section. Based on this model
xI =x2 40.
x2 =(x2 -x 3 ) mod (2n+1 )
_X3=(2 n - {(Xl-X2)mod(2n-l )-(x2-x3) mod(2 n+, )})mod(2n-l).
The 2n+1 adders found in this architecture have been previously discussed.
In a M-K architecture, each filter coefficient c. is realized with a
I
dedicated RNS table lookup unit. Based on the three moduli MRC algrithm
and a given ci , a MRC multiplier-scalar can be configured as suggested in
Figure 18. Here, for convenience, the multiplie- 2 n-1 is imbedded onto a
lookup table. Each unit would consist of nine 2nxn-bit memory unit. It
must be stated that in order to use a 2nxn memory in a modulo (2n+1) operation,
some form of data compression is required. The simplist compression routine
would differentiate between the two external number in Zn. namely 0 and
2n. Those two numbers have a (n+l)-bit (ie: common n-bit data bus plus 1-bit
control line) representation of 00...00 and 10.. .00 respectively. By ANDing
and n LSB's and sensing the MSB, the two conditions can be easily identified.
(XI- x2 )mod2~- cx module
x2
(X 2 -X)mod~+l _____________
x(IE FITE
MUSC FILTER ARHTCUE
FIGUREmod n_8 2
67
For x=o, it follows that [cx/M]=O and the output registers would be zeroed
without any memory (table lookup) action. This means that one of the 2n
memory addresses, namely x=o, is superfluous and can be assigned to x=2n+l.
It follows directly from the MRC representation that
Iimod pj~d; iCrImod p.+ '2i'ltmod nil 3~l~mod p.)/mod pj
That is, the outputs of modular tables (viz: 'Xic /Mimod pj,..., x3ciPlP 3/Ml
mod pj) must be recombined, in an additive modulo pj sense. Again, a
sequential or pipelined architecture can be realized.
Example
Using an MS architecture, a 4th order Chebyshev filter was realized.
The response is reported in Figure 19. It has been experimentally determined
that it requires a 16-bit moduli three-tuple to achieve satisfactory
performance.
68
r- MLa 0
cuLn
1030 0h9 odS-i 09109- a t-so 3on.rNotfw
FIGURE 19
t 9
69
JKL-RNS Architecture
The disadvantage of the MRC-RNS-MK filter is the rapid growth of
hardware as a function of filter order. In order to reduce hardware com-
plexity associated with RNS-FIR filterinq, reusable (undedicated) overflow-
free RNS arithmetic units may be considered. Again, if overflow scaling is
imbedded the table lookup operations, the magnitude of RNS coded numbers
must be known. As previously noted, this has been the historical obstical
to fIR filtering in the RNS. The architecture which can achieve this goal
is detailed in Figure 18. The multiplier 2n-1 as previodsly noted, is a
zero-overhead operation. A timing diagram is offered in the referenced
figure. It is assumed that the modular addition delays are less than the
lookup table access times (say tM). The difficulty with this proposed
architecture are the delays associated with reprogramming the tables, for
each ci, from high-density low-speed (comparatively) main memory or mass
storage. As a result, the overall throughput of this architecture will be
unattractive.
Distributed RNS Filter
A powerfull linear constant coefficient filter policy is distributed
arithmetic (alias: bit-serial, bit-slice, or Peled and Liu filter). In
B-bit radix-2 binary weighted, the output of a discrete filter, sacisfyinj
n ny(n)= . aiY(n-i)+ Z bix(n-i) .
i=l i=O
is given
7O
B-1 n n n ny(n) ( aiY(n-i;j)+ Z bix(r-i;j))2i-( 7 aiy(n-i.B)- E bix(n-i.8)2'
j=l i= i=O j- i J:O
with j denoting bit location. An equivalent statement for RNS systems can
be made in the RNS using the MRC. Here, a system variable would be given in
MRC form as
ZZ+Z2P +Z3 PIP2
There is a minor structural between a distributed filter using a MRC and
radix-2 format. For a three-moduli system, distributed partial products
must be recombined modulo plP 2 , and P3. Table requirements, for this
architecture, are correlated directly to n, the order of the filter.
Bit Slice
Example: A 2nd order discrete Chelyshev filter was designed in the
usual way. The infinite precision response used double pr in floating
point arithmetic. It can be noted, from the da.ta displayed in Figure 20
that three -8 bit moduli filter performs better than a 12-bit fixed point
filter and closely approximates 16-bit precision using a 4th order model,
8-bit moduli can aqain be seen to offer acceptable performance (see Figure
21).
75!I
71
gc
12
-jj
U.. In
2J
ral !;d- evou ti.-qj es zZ5
SO Wflf N9111
FIUR 2.
-
a,
U3L
Cl
so 9r4
.~ .u
zu Li
60 3QnlfNfkII4
FIGURE 20b
731
10In1
o 3 C.-
4.1-Vi i - Sr-Ep
FIGRE20
74
soJ
4n C)
vft
Uri #i U- i n- iBO a1(~N!t
FIGURE 20
cu
U.U
90 C2
cu CD~
so 30111;ILJ
FIUE 0
I r ;
c5
Ps Ln
s0
FIGURE 20f
77
-4CI'S
Ccl
t-0
-4n
i Ln
C
120 '30nirfluiW
FIGURE 21a
Ia
78
zuIi1-4
0n
IS Eq f~a o I
FJUE 1
79
U)
ca
I~Ll
(u E
CU4v-fn 9
ILJ v 0=. = (a
FIGURE 21c
80
-d
cu
u
Aui
., j
ca
U3 I-3--
oo O- 26,81- I6s- 56eg hSL .l-£ 1-
M- FIGUJRE 21d
I~- --------- _ _
81
PART III APPLICATIONS
The RNS arithmetic, developed under this AFOSR grant, has been tested
in a MS, JKM, and distributed architecture. Several applications were
considered. The uniqueness of these applications, are on to themselves,
worthy of special treatment. Therefore, have been included in this
report in appendices. Appendix A treats the problem of realizing a
real-time Kalman filter. The development filter (submitted for publica-
tions) represents an original approach to this class of problem. In
Appendix B, a linear adaptive noise canceller is presented. Appendix C
contains other papers published, or under revicw. which contain an AFOSR
credit line.
PART IV SUMMARY
Under an AFOSR grant, RNS arithmetic, hardware, and architectures
have made major strides. Using a three moduli system, practical ultra-
high speed RNS units have been developed. The major problem of RNS-to-
decimal conversion plus magnitude scaling has been successfully treated.
In addition, new filter architectures were derived and analyzed.
The future of the RNS is considerably brighter as a result of this
study. In particular, the RNS techniques developed during this grant
period, will be further inhanced with the advent of VLSI.
82
References
1. W.K. Jenkins and B.J. Leon, "The Use Of Residue Number System in theDesign of Finite Impulse Response Filters," IEEE C&S, CA5-24, April 1977.
2. G.A. Jullien, "Residue Number Scaling and Other Operations Using ROMArrays," 1EEE Trans. Computers, Vol C-27, No. 4, pp 325-337, April 1978.
3. C.H. Huang and F.J. Taylor, "Memory Compression Scheme For ModularArithmetic, to be published, IEEE ASSP.
4. M.A. Soderstrand, "A High Speed Low-Cost Recursive Filter Using ResidueArithmetic," Proc. IEEE, Vol. 65, No. 7, July 1977.
5. N.S. Szabo and R.I. Tanaka, Residue Arithmetic and Its Applications ToComputer Technology, McGraw-HT 1T, 1967.
6. W.J. Jenkins, "A Highly Efficient Residue-Combinatorial Architecture forDigital Filters," Proc. of the IEEE (Letter), Vol. 66, No. 6, June 1978.pp 700-702.
7. N.S. Szabo and R.I. Tanaka, Residue Arithmetic and Its Applications toComputer Technology, New York, McGraw-Hifl, 1967.
8. W.K. Jenkins, "Techniques for Residue-to-Analog Conversion For Residue-Encoded Digital Filters," IEEE Trans. on Circuits and Systems, Vol. CAS-25,No. 7, July 1978.
9. A. Baraniecka and G.A. Jullien, "On Decoding Techniques for ResidueNumber System Realizations of Digital Signal Processing Hardware,"IEEE Trans. on Circuits and Systems, Vol. CAS-25, No. 11, November 1978.
10. J.M. Pollard, "Implementation of Number-Theoretic Transforms," ElectronicsLetters, Vol. 12, No. 22, July 1976, pp 378-379.
11. H. Nussbaumer, "Digital Filters Using Read-Only Memories," ElectronicsLetters, Vol. 11, 1976, pp 294-295.
12. M.A. Soderstrand and E.L. Fields, "Multiplier for Residue-Number-ArithmeticDigital Filters," Electronic Letters, Vol. 13, No. 6, March 17, 1977.
13. G. Bioul, M. Davis, and J.J. Quisquarter," A Computation Scheme for anAdder Modulo (2n-l), '' Digital Process, Georgi Publisher, Switzerland,No. 1, (1975), pp 309-318.
IJ
-
83
14. W.K. Jenkins, "A New Algorithm for Scaling in Residue Numbers Systemswith Applications to Recursive Digital Filtering," Proc. 1977, IEEEInt. Symp. on Circuits and Systems, Phoenix, AZ, pp. 56-59.15. H. Huang and F.J. Taylor, "High Speed DFT's Using Residue Numbers,"IEEE Int. Conf. on Acoustics, Speech & Signal Processing, Denver,Colorado, April, 1980, pp. 238-242.16. D.P. Maher, "Mathematical Background For Generalized, Partial, andincomplete Discrete Fourier Transforms, Proceedings of the 1980ASSP Conference, Denver, Colorado, April 1980.17. J.A. McClellan and C.M. Rader, Number Theory in Digital Signal Pro-s , Prentice Hall, 1979, Ch. I 1.18. F.J. Taylor, "Large VLSI Module Multipliers," IEEE Int. Symp. onCircuits and Systems, Houston, TX, April 1980.19. F.J. Taylor, "Large Moduli Multipliers," Proceedings of the ICASSP 80,Denver, Colorado, April 1980.20. A. Antonou, Digital Filter: Analysis and Design, McGraw-Hill, 1979.
:I
APPENDIX A
THE PEALIZATION OF ADAPTIVE KALMIAN FILTERS
George Papadourakis
Fred J. Taylor
Department of Electrical and Computer Engineering
University of Cincinnati
Cincinnati., Ohio 45221
Portions of this work was supported under AFOSR Grantr/1 7047-0066
Keywords: Ada pti vu Ci Ite:rs , ro: - ti me compu tat i oil,
whi to stochastic process, Pfcj fo-r-13 iankcenshi p
al gorithm.
A T .' A C T
An ndap t ive Ialman f i I to r i s cons Idrered f or ,h i ci thIe n
i inpu t noise c ovn ri nce nd n the output coviria-ncn are unk-
n o,.:,n .A nw..: procnlu rn is p rosen tnd for the i dnnt i r icI io
or the ton'notin cov.ar imcos . The prolosed i (InLiCi c-iLi on II1-
~or ithri us cs thc nu toco rre Ia t Ion i n fo riit ion con t.1I nn in
the innovation error SOCILunce to dletormine the rntio of the
unlmnov.n noise covarinncos rind to ohtain the op t ima n:in ,,11.ln
f II tor -.c in. The p roposel vn thoci cian '.oe .,s 1 i mr) 1 omen tc'J
0 rou'h the list, of h1Ii -h s pccl .-I I! i tra zru toco rrel t I on inl o r-
I !' s:Mh c' onarlntn -t, ' t d,.itat ra-ts.
The ~ ,n:u~ fo rhiulat ion of the i n i rin 1 wir i once
f i I t ..ri ;1, , C.,: 1o 1 1!-. sr, 1' 1 s c ot p I ater. a pr io ri '- ~n o%!.r d- c o f th1e
input )n(4 output noise covorimicos, spy P -,n! I". . In 205ost
prncticnI -pplications fl nnd R. an nithnr %ssuic to be iin!-
nouin or 4-pproxlrnvtacl. Several au thorrs have p r n -;t tn >
0schomesnr for the Identif Ication of tht, un!k-novrn covnrlanceso
with soria stccesq (2), 00i, (7) ind (11) %-.i th the usc or thne
lnnnvtlIoni senluorco I n, the idtnt I f Ient ion of tile un!.no-.-:n co- C
va~f1~,introiucec hyV :CU,
4-<
Von m I fI CIt t10 a : i F.16 rc i a A' 1 , -09 wpn
I bIo t:ho fi rs t -11i v.- 1110c of Opr nu f1(rr" m L, tion 'unr.-
*tioni of thr ininovt ion s-iionc to dictr~rmine! tiir! rn t i o or
O ir', 'n rv.n rocv.; , *~iron ' - t !,oe p re' ;nr r ! is' i pnl i cii! I
ro,* O ns rysL,) , ilchi rfx'*! lt -. nhi t- i l .' r in t C cnns trnt
1, 'f I "h spe.cd rn 1i-time nl ,or I tri ror tiin coriptitrit ion of
t I r mutocorrn )t Io "minct ion !ii 'jr prennntnd by Phn ' ,r
a n d I I an k cn sh I p ( 112) 1 C i n Ti :- pnd o~l t h 17 ~or i t, hm can
f fuir t he r I i .ip r o v m th r o t n!i th. isq (-f ip" I r~ co :rs (I
i n-lIi no~, tb rrn.dnd, ,:fv) t tpd ) poi~r ~c~ sHifl1 eL )'i: c0ipu t'-
tion of th-i ititororrelation fu.nction, in rc,,-1 tie ~ic)
rf(n) f (n~k) "
n:i
if r(k~ is dm, I rti -f or k: on th c o rclnr o f /2~ thn
tir.I n- IF-Tr to coripti't IflFT'-(~)()m~: is co.:iptntion-
, IJIy or :I-I nl 1 A tcot~lo 1l1t2"+! roriplex ::i 1 Pies nrce
reau ,' Cin co~il.- in tIJmmu y t I 1m1~ r.It rr!,il mII tip IP1 10
P rteect inompu Lt1ttIton or thel Amtoy' rocuir Vein r am I muti 0-
-
The, S version of'211 the atocorreltion al::.orithm sts
f i s:
P-1 K
i'~~1 ~)j=O i=i
For ::>>P tile "P 11,orith.- essnntially hal)vpes the numberof multiplications normally alssociated viith direct co~mpta-'tion. Thle multiplication count for !'>>P can also Im co. si.-crahi'ly ]-ass than the associnsted with FIFTI mechan ati ins.
The speed of the al-lorithm can be furthe.- i;mproverI 1b,I i -i n ti the tii.ie C nswJ1-!fj to ro-nP:to data invariant in-
dices (ie: (2Th) (2ki) (2ki2; . Ts can bePch iePved th r ou g h the tise of so cil led in-l ine code, I n a n
in-line code, the code is arrang-ed in a top-down fashion.fiere entry is made at the top and w..itLhout loopin-, run t~ocomp Ie t ion. The desired in-line code havin-g all data invar-
Lan acees c-n Se properly. comiputedcf ad properly se-'nded Lbrou::h the use of ALJTO.SIV! -iethodol ol-2 (1!;).
I Ano the r a? te rna t ive i s pos s i l1e th rou:.,h the use ofI ~~threaded codle. I t repli-ces a sta-nda*rd protgram with a series
of m~odt11leS ... ich a re thre,-:d to c thor e trea .s a rcou~te? aa -Aryi hc allI reruI's t nifbpit i on i:s.
~oun~- Thcz array 4;s selriotly L:ne ILrntiead ~
'ore r Pmo v r, th r o v r r h a f;o c a tri i n- ~1 r n ri n a C owl-
nutt ion of the paz r t P r--. Comipa red to t' 1, n -lIi n e c o: n
t h threaded rode is twice ns fast and occunites less riorx'
-\not:ier ontion is avai 1ahlre andi is !.nov.ri : a knotted
Code. ho ts -w in h p t i ed in the thedinlici-tin- that thr,
procr:;- !.-A 1 1 move clot-n the thr- nd, "run-,rotind" i ns ie a
1-not for a wil then continune dox'n thr r~~ to t!-(- next
knot. T'-., *-:not represents a suloprol-rara in t:hir-h thI-r e ma .1
z ~xist nele:---,ntary loo;,in, . u.'sin, the knotte.-I cod-ll the ;mo'
requ~tirrrC"---rns rn-Iticed considerably (I';)- I t has 0.n sho,-n
in referecer( it; that a 12."- point time series can be auiLocor-
rlae, nu to a delay of 11 sarmples) iusin-, a pnp 11 -rin na-
co;Thputflr the folloi-wing ti-c intervals:
PPP 11/55 pnp 11/7.* pnp 11!r(0r* Proran Size
Conventional .9 G9 9.72 17. 35In- inn F, 2 75 4~!21.1 75
* ache, ::ahines
I nen-~gol i z in tha t the, co;.;u ta t ion of the :m ,oco rela t ion
function cun hi- pe r f o rme- a t ren l t i v speed~s .-we can
procned to thin anail-~sis of' the rle;._
-I
Adi scret- tc~i ~nii ytw a orpcs 1 ~
+
F :nmnr s tz t n tr.:nc i tio~n riatri' (cons tant)
!z ~'IF~ veso s~ us n in, %:11 co n oiz-rt- P ut:fLt v c or
='r:! v,-ctor of ^auss i an esre-.wnt errors ( -!i t)
--I iS ~sucd t
I,~T
1=-L tin, or-PTA t 1-i con trol 1 !hIl c, -,nei-ovil
-1} coraicnti 6- 6 VTntn' t-t O .. -and
- Vp,1iic P5
1St _ _ .P (-) f'kit U
cal in-lor:-rtion of -11% inPut noise covwiriznce CIl an- ou tpu tnoise COY~rance in cf stimate of thIe Stajte of tle stn
Sfn~ 0y(.) i~o ta i n c ( s criurmt Ia for k=1,?2, ~
S t,- 11, st n rrd KnlIuw-an filt r:
2. r-ror COvari~nnce Extrapolation:
+*'Ct -'ti jto
*U~p
4~. E-rror CovarianCe lin-ate:,
PEK(+)=tI-::Kx"i) MEK 3.7
___ - PTK(!)WKK(-2K'-' T . K)
_!c nt i t ion
IT
y 9 . ~~ L c roc--r. I C*
r :. rnoz:to oro r~ r~ -.-o Za r. c c.~ ns *-j: !-.c
-K .: K*-::K-
orror_ T,:lo
in..; I C.'- .n:n io,.:*i ' I r--ct ru'st -t io P
t -t S -Iia sel tt- -,)i rran be (eIndi crv
or t a rtio o f t ,i t ru sta t is t ics nfhQ)=~ ) I n thi s
caSe PE.(-) is directly proportional to PT(-) by a factorA
i~~ r:r
!'Y oxn-: the In f in it ion of 'T( - i n Cy 1)u
P'T(-) PPT+ ,.c
PTr(+) = ifiP() t.)
1e ro lownr obsorvat ions can be madle re-,ardin~ equa-
1. PT(- ;Ic (nns (.,)pl1ic it Iy jupon the output apriort
st ItistiCs Of the Sy'Ste-4 Model (viz.: I rv-'.n ,,I n)
arid the true noisn covaripncps and 1
2. PE(-) diipnnds upon apriori sta-tistics only._
Consilder the -case in toh~ andk- ~ I.I'~ -nIt can hf
sec ~ ~ ~ 6-, eqerl Iro _ut i 6ns. 13.~,{i) ni 1;7 hiut
cn, I~ 'J r ~ Is Ilta -~ T >T he 7 c~-t-
_,~ -lip - I____ Tsas
~weser ies wri tn h 1. = r2 : jI rh tnn. SL~Ystao co~
var ianr as r c plotted for e~ach i tera-t ion vi th the, r)t io of
-1/21 distribuited loaihicyfromn .01 to1.
VicTh crossover po-int of the curves is thI-e or)timnl
ca seo in v.,hich we obtrin Od~ optimail !.,ilman -.ain.
2- -The theorn t-ical covn r inc 5- inc r63 s i n- f fe r the.
c-rossover roint.- it can he expl-inadby t-he !'%,rlcin
f iILer --ain wh-ich is- derash n each- i-t ra t ion.
Thereforn, the innovation error is ariid
~.It s-hould !)- nenti-onei that in a ranl sj,'stc!-rTC-
.- n not h)ec ex p ci ft 1 dt r:r mn ne.
A1t -each i tarn-t-fon in-- the- above-r0 mi.ulatio6n th av6 uto-
--i eht~ fuCto of th nnovp tiron prbtcso- :as -al-culcat-
ec!'i cir-ty Osrv tat tfh - s!pco- -- r- of thn !ttocorre-,
tiltiori- -Fn -. --i -fr -) -tr -L~ -:~t -'-iiC -fl~r~ ~r
ieg I a~ tj As pr-iosl f i v a- v, risrio
) tl4 -o fl- (fli
hen-~Fi~ t? i - - - __6_ r
ISI
Tho~~ I'oit:i ;o pn "hoi~ othr c; r c~ t
:i On OF hinnov'r~t'O:1 PIO^CS.
In tha first ftI~t~n ih fr s airi~ 'I CThe tZ: E1
hl o~1c . 95c)i: 1 n t t - sa 2U) -n s , c ri-tt am' vort
''i I s of ll zirci chose;,n so "t~.(2 6~) 2.ftr c ach i tar--
ero h i ~ ly- "-'-. of ' ! c iit o
fo 0!,Is ca,-lLcu1tC4 as CS c' 0 7 of- s1 r~i It :hn
-a ~ prcs s ove r, a ctl ve citr ~ hs oos~l i
o r r a, e ,pra,.;:--. -tion is Uec-~n' _S S C-1 Vc-!
tiory alj-ori thm, tho val't.fc -o~ tt -:ici ' () c -r c Re
;J~ii~ I zV U1cn fla a 'do-cb Cu ri t -iro :;.l n Cno-~~
,~~~ricQ~ :::-1_1 i!t ~~~i-, t s. Jons- thei-tc a~n
Ti*: c- ri :;:r*-~
r% (.' p tn pol C! fn An dptiv I y -'1;ri I tw can bc el-r % ye:!
.1hich, ~ce~~~upon t~- cUrren.- loccotion oilI min OYr t n!-
-c!!rvn, c~m ccnvcr:e to the opti;.m.1 Cnni t i I: o rr I
a.wwo al Ori Call~ c ,r. cii 1- I; vnt~.in ni -
c rop r0c 0s so r eILue to 5 -cs ii'ip] it'. 2en 111 0. 1 :ntI
ca i be 0'c r d uSing th'2 a? i1o ri thr.C to cZ1 -n~c h~f s
,.-.,o rI'las 0f 1 -c nutoorlto uci
Y!I-t o r cI- c11 tc n p~s fi nzr -I:s-; 1 dOn
a n4 : re:
Ha -''l *1 V.4
F : .- T
Mu~, -
-fa 15 11.% Iv .i t 1
r y
AWN-
(c) i1ut~t terew t;The dots reprz25ent th 1 CttZ31,
%nue- 1f - 1.) -t cinch i teri-t ion.. The cciordi n tnn Of then--
f arc r. o v)rc!s e ntc d Th e c ur ve f i t t in, o f tes -)o in ts i s
.l r s n L2Jhn c ntin ti nIn m I1 5
:~e cna S- ' o 5 I conio 1- ir! oi, n-e r ~nn a o I 6 11
cor ~ ;li ii ~or _ re aJev i~n~o - *;~~ nd
tlh,.2 I r es tii.;mt-on is siho::n Io o-- Lthen,. 2e I, 1 CS ~r : Ood
Kit iniix- errors.
7 . sr:;:;_ VM. CO-IWZ I 0!!
a v ta r:n :i0 ol_;^ COVZXr ;!iOCC5 II Cl IN 1 ~ Ilter- i..i th thc*
ti~ ~ ~ :~-~1~n ~' t e tr sh o loih f:e Ino h
- rs n. d, ec n oc I
Cue ~I o vctio prcss 5: .ec~ii~ r rsi, I C rat io of
Li~unhn ao~r ~n c s th ir r-c tuel I.'zva s c an b e cn ICu It -
~ ' t~lof th-Tai'-ori Min is, possi!;.- Cr r.dteal
AA- ia tl ou hpt c-m- e a.chiVcd v.,Ith thed use of fst C.toL
cdr:r Z 1 oin I; 1-or! th- s. V: nuliae'r iCal- 0~~ ampj 41 usCratcs thez 1 r
r.-! I*- no *n. le ai ~ ier.I svr' h1
I! t - -inp_, 6
(T~~*~n~ r~ I ~ ~Z! Iir!~' C<i l-ino't
"T -T! *-Tt s -4 4- -
ij#:
1 2~~XC I,,1;v c.t
2. ~ ~ C ~'I-pi to inpft noisc rin.-i o i! %,;- vr~'~ i
r if-, nnt t.cr cra rit ::~s of nlolc of a
notsn covr-.rtminc2 is v:ery cti-~
r; rrc i on z a 'i fterinr thzor,, in Proc. .:
0" : :u ctin -.: or a
171
ft c, u pt.2 s ~ ,~ Ii T P'o rn ,t I '2i 7
r -~r~-.,j --in AC -: r, A -'ri cai, pp
f L L- ws
-, s as" s -;0 , : r- s L
a n --- r- anS ,i r aitjfl
:; . . Cr^o nd vol~. 1':r," :n- icfo
z ~ ofoi- itrst-i-t~~~~l~o vt:i
~' ;n:n'oi~rdr ~ie, EAl~
__f am
____ * ;ij onr..x;l - -
in-tI
o r 2i r f-b 'fr:* nr- i n-: no s~
S ~tcs -fE E:7r~s ~'~2~ Con r, vol.
- A ~-2?, '~c~;~r V7!;,; 2 ;
~~ ~ ;~t:Lo: Ia ronr. 'o.'C-2 1,
mu .- time'r Ybr% In i ro-t~~~~ ; V. I-
77.7
LC Iii:' 7r -ms. 1
.:~co1ri, 1,f 0 C~r f;- I --
-:S
-hm. * - f:~ v~ rr~ !~~~~~
mw
w
C n i P- N C.< ' N' -. c r .iv i I CCNie c iF.c uR N . U-c c.( I I N tooNI,
11 11 If It it I it 11 11 if 1; 11 T1 i TI iti t i t I , it !I 1 I; it it It It 11 I f 11 11 11 It I t I f It- -4 ~ ('l IN Ol * -I. j 1 4 (1 0 ~ ri (I C-1( ~ il0 NcI "% ( I c ll (11 cl I5 (I rA A (I "I (1 0l 0* CI C
<1 1I H1 if 1I Il 11 i II I i 11 H H H 1 1 11 1 11 Ht 1 1 H if It IfI t it It R H H iH It H1 11 1 it II It II It
ttzzzzz-Zzzzzzzzz zzz4ZZ ZZzZZ ZZZ ZZ ZZZZ ZZz ZZ
00
0G) H LO)
LLII
*Y 4D~ tJ-
it. r,~r i If' 0.1 C: C
* * C el .. N JI -C. U-1 -' *- 4 -. f- r-.. . . ..... ,- . . . q , -- 4 ;; v. ,,+-+!~~ .' .- .- . . . .. . . . . . . . . ... . . . . . . . . ._ _. o _ _ _
H If I I If II I t I I f it II ffI II ft II II H II I f I I I I "1 I t I II I t Ii I f it if II H II tI i i f I t I 1C -: C. : (-i C l C-1 Cl (I Cl Ct.: , I C I '| C -o l CI , .c C I C +:C C; C ' Cl ' C: C.I t C l .A Il, C ' C! CI cl
LC CC CC u CC CC C0C CC C CC C CC C C. C. CC CC :C CC CC C. 0C ZP C nC CC ,C CC C.CC C CC.- - ' c- (4 C¢" , t' - C C, ., 'O).++ .,O . , +..C -, .( ,: -< , +,e . '-, +) ',-" ."" C (I. fj b) lI',? , C1 l --- CA+ +-"
I--S Ii i t I t i i t i t Ii f I I I t It I t It ii If ii i t i t It Ii It ii It I ti Ii f It If it I| I I It I I t I f
CCZZZZZ ZZ ZZZ ?ZZ ZZZ ZZZZ ZZZZZZZZZ ZZZ ZZZZZZZZ
GS)
W •
,*,
&
SZ
1-7la. . I .,--R
0-0i)a.i i- I-+;---
& 0I G (JC+
1 m + - - <" 1 £
C C.
.-.. .. .~ r~.. .CC vaC.
S i t U I If i 11 11 11 . It It 11 if ": 1 I I if i if ! I i i u if 11 .11 q n H It :1 11 1 : lup-1((1 ( 3 c- l* 3~ l C J Cl 171 iCS Ci -, j1 I (1(1( 1 Cl Cl c3 cl I: Cl C s Cl ( I
l - 'C -IC ~_~ l C. CIr P'. 4 ~_ _'a C - - - jr ~t~ f l( -(
0- -. C. --.. (C, C. if' !C
q( 1 q aI H i i t 11 11 11 .1 it it is HI I U 11 it i t 1 11 11 11 1- 11 -. it 11 -1 11 i t 11 it it N(izzz Zzzzzz z z zz z z zZzZZZZZZZ ZZ-ZZZ
* o
fix Z
tLi
i-: z * -
CUC
.(t .- ......
LLI)
I
-1CL H C
PASS
U') Cl)
R-1 --.
lti
z
J, z(I-
A-1Hat -
X T77
F- (0
CLIC,)J
z Io
414
APPENDIX B
ADAPTIVE NOISE CANCELLING
Adaptive Noise cancelling is a method of estimat-ing signals corrupted by additive noise or interference.This method uses a 'Primary' input containing the corruptedsignal and a 'Reference' input containing Noise which issimilar to the primary noise.. The reference input is adap-tively filtered and subtracted from the primary Input to ob-tain the signal estimate..
Adaptive filtering before subtraction allows thetreatment of inputs that are deterministic or stochastic,stationary or time variable. When the reference input Isfree of signal and when certain other conditions are met,the noise in the primary input can be eliminated withoutdistortion. It is further shown that in treating periodicinterference, the adaptive noise canceller acts as a notchfilter with narrow bandwidth and the capability of trackingthe exact frequency of interference.
Noise cancelling is a variation of optimal filter-Ing that is highly advantageous in many applications. Itmakes use of a reference input derived from one or more sen-sors located at points in the noise field where the signalis very weak or undetectable.. This input is filtered andsubtracted from the primary input containing both the signalandthe noise.. As a result, the primary noise attenuated orelliminated by cancellation.,
If done'improperly, subtracting noise from a re-ceived signal, would result in an increase in the outputnoise power. However when the filtering and subtraction arecontrolled by an appropriate adaptive process, noise reduc-tion If not complete noise elimination, can be acomplished.Adactive filtering may not be applicble In all of thefiltering situations. This adaptive filtering would not bepossible when, for example, the reference noise input Isunavailable.. In circumstances where the adaptive noise can-celling. is applicable, the levels of noise reduction areoften attainable, that would be difficult to achieve in di-rect filtering.
=
PAGE 2
V)
la-F-
LiiF--
U))
I-j
-)En c0
C0C
LiJ (n c
w w
PAGE 4
LnJ
CIQ w-M 71
cm)
D&
<L
cmI
- ..- c i
PAGE 5
II
r-"
In the noise cancelling systems, the system outputZ=S+nO-Y should be a best fit in the least squares sense tothe signal S. This Is acomnp-lished by feeding the systemoutput back to the adaptive filter and adjusting the filterthrough an LMS adaptive algorithm to minimise the total sys-tem output ter. Thus the system output serrves as theerror signal - r the adaptive process.
THE LMS ;ADAPTIVE FILTER:
The LMS adaptive filter Is the basic element ofthe adaptive noise cancelling systems. . The principal compo-nant of most adaptive systems is the adaptive linear com-biner shown in fig1.1.. The combiner weighs and sums a setof input signals to form the output signal. The input sig-nal vector X is- defined as:
~X~j
Xj--
Xnj
The input signal componants are assumed to appearsimultaneously on all input lines a discrete times indexedby the subscript "'. The componant X is a constant nor-mally set to 1 unless biasing is desired. The weightingcoefficients WOWI,W2... Wn are adjustible as shown in fig1.1. The weight vector is:
(' W0W RIWhere WO is the bias weight. The output Y isthe innerproduct of W and X
That Is: y(j)=xTW=WTX.j 3
Error e(j) is defined as the difference betweenthe desired response d(j) and the actual response Y(j). Inthe noise cancellin- systems , d(j) is the primary input it-self.
e(j) )d(j) -XT=d(j) 1TJ,
PAGE 7
ui w.
ce-4
LiiiC34
LL)4
ceI
LLI
CL.
cic.
at>
z c
'i
C,~
P44
IT id
PAGE
The adaptive algorithm has to adjust the weightsof the adaptive linear combiner to minimise the least meansquare error. The adaptive linear combiner aong with thetapped delay line forms the adaptive filter shown in fig1.2. As before, the input signal vector is:
x. - J
~Xj-n+l1
The componants of this vector are delayed versionsof the input signal X . This filter permits the adjust-ment of gain and phase at many frequencies simultaneously.The total length of the delay line is determined by the re-ciprocal of the desired filter frequency resolution.
ADAPTIVE NOISE CANCELLER AS A NOTCH FILTER:
The notch filter is required in the situationwhere the primary input Is corrupted by an additive unde-sired sinusoidal interference. A notch filter can easilyreallsed by an adaptive noise canceller. The advantage isthat it offers easy control of bandwidth and the capabilityof tracking the exact frequency of interference.
Fig 2 shows the single frequency noise cancellerwith two adaptive weights. The primary input is assumed tobe any kind of signal-- periodic or transient or stochasticor the combination of these. The reference input assumed tobe a pure-sine wave C cos(wO+0 ). The primary and thereference inputs are digitised at 2*pi/T rad/sec sampling.The reference input is also phase shifted by 90 deg andagain digitised..
Fig 3 is the flow diagram. It shows the operationof the LMS algorithm. The weights are updated as shown Inthe diagram by,
wTl(j+I) wT1(j) k2ue(j)A(j)
wT2 (j+l) =wT2 (jY 2ue(j )B(j )
PAGE j
where e(j) is the error
The reference Inputs are:
A(I)=C*Cos(w0 j Tg)
B(1)=C*sin(wo 1 T +9r)
The isolated impulse response from the error e(j)to the filter output is obtained with the feedback loop frompoit 'DI to Point '81 being assumed to be broken..
Let an impulse of amplitude 'at be applied at thePoint of error signal that is at Point 'C' at a discretetime j=k.
That is: at i=0, e(!)=a;
i~e. e(j) a d )
and Afj I for 1 0;
&00for 1 0; j
Therefore e(j) =a * Aj -k)-(3
The response at Point 'E' Is then:
e(j)*A(j) a *C cos(wO kT.~i ) for jk
=0 for =k
This Is the input Impulse scaled In amplitude bythe Instantaneous value of A(j) at j=k.
The signal flow path f rom- t-he- go Int tE to point
AmI
PAGE i I
IF' is that of a digital integrator with transfer functionof 2u/(z-1) and the impulse response 2 *)p* U(I-1) where U(J)is the discrete unit step function.
The response at point 'F* is obtained by convolv-Ing 2u * U(J-i) with e(j)A(j).
i.e. A(j) = 2 a C cos(wC kT+O) where jek+l -(5)
This step function which was scaled at 'E' and de-layed at IF' Is now multiplied by A(j)to yield the responseat 'G' as
yl(j)-2,uaCcos(wOjT+ J) cos(wOkT+ 0) -(6)
where j k+1
the response at point OK' is also obtained. Thesignal flow path from point 'H' to point f'I would show animpulse response of 2p U(j-1) with UMC) being the step func-tion. The response at point 'V' is then
Bj=2ji a C sin(wO kT+ f) -- (7)
where j 2 k+1
This again is multiplied by B to obtain the response at 'K'as
y2(j) = a es~n(w0 jT+) sin(wO kt+ J) --(8)
where j k+1
= The combination of YI(j) and Y2(j) yields the response atthe filter output point 'DO as
2Y(J) = 2-a C cos w0T(j-k) -- (9)
2=2 aC uI(j-k-1)cosw0T(-k ) -(10)
This is a time Invariant Impulse response which isproportiona to the input Imiulse. The Linear transfer
-4 a Liner- trgnsfe
PAGE IL.
function for the noise canceller can now be derived as fol-lows. If the time k is set to zero, the unit impulse res-ponse fromi the point 'C' to 'D' is
2YMj = 2y C U(j-1) cos (wO .iT) --(1D)
The transfer function of this path is
G(z)=2uC 2(z(z-cos(wO0T)
I (Z 2 -2zcos(w 0T)+1
=2 uc 2 0 zcos(w 0 T) - 1
Z 2_2zcos(w 0T) +1 3
This function can be expressed in terms of radian samplingfrequencyA 2 PIT as
2 zcos (21rVw f.) - 1G~) 2uC~f z-
Z _2zcos (2rwf )X +10
If the feed back point from 'D' to '8' is now closed, thetransfer function H(Z) is obtained from the formulae H
G/(1+ as z2 + 2z cos (21Fw.O., 1 -1H(z) =
(Z 2_2zcos (21Ww~l ) + 10
Equationl15 shows that the single frequency noise cancellerhas the properties of a notch filter at the reference fre-quency wO .This Is also shown experimentally..
APPLICATIONS:
There are a variety of practical applications suchlike the cancellation of Noise in speech. signals, cancella-
ij tion of antenna sidelbbe tnterference cancellation o-f sev-
ieral idso nefrence In Electrocardiography etc... Thesimulated experiment.-And their results witl nw !)e shown..These would Indicate the use of the adaptiVe noise cancellerIn varlous- environments.
Ni
PAGE 1
Filter 1 is the hypothetical case where variousfixed frequencies are present and the undesired freqvency isto be cancelled. Consider an input of fixed signals at60c/s, 350c/s, 400c/s and 450c/s., If the 60 c/s signal isto be eiliminated, the program for Ftiter 1 is shown alongwith the output plots for the filter.
Filter 2 is another form of notch fflter. Thisremoves the 60 hz signal from a primary input of varyingfrequencies. Signal varying frequencies in the range of 300hz to 800 hz and 40 hz to 70 hz are generated as the primaryInput to the filter. As before the 60 hz will be removedadaptively.. The Filter 2 and it's output plots ae shown.
Adaptive filter is equally applicable to filterany type of varying signals and varying frequencies. ThisIs shown by Filter 3.. The primary Input has the signal Invarying freqs as before and the lower frequencies in therage of 40 hz to 70 hz are completely eliminaed. The res-ponse for Filter 3 is also shown
Li
Nr~i.i I N I v'N
I'f' I - f'J I
til 10
n j I
!.)I rig r/,. t, o
.a..
IIf V Pi1Iil) IF/
CC FILTER 1:TIIIS FILTER HIAS FIXED PRIMARY INPUT FREQUENCIESC AT 60HZ,3501Z,1;O00lZA1D 45011Z WHICH ARE ALLC SINUSOIDAL. .TIIE 60HZ IS BEING CONSIDERED AS THE NOISEC FREQUENCY TO ABE REMOVED ADAPTIVELY.C T14E OUTPUT IS A PLOT OF POWER SPECTRUM IN DB.,C THE FILTER 1/P AND 0/P PLOTS ARE EXACTLY IN SAME SCALEC
COMPLEX PI,P2,X2,X3,Xlt,X6,X7,X9COMPLEX Y1(257),S(257),RF1C257),RF2(257),Xl(257)T=o.P1=(0. .0.)T2=0.P2=(0.,.O. )X2 =(O.,0. )
X4=(0. ,.O.)X7=(0O...X9=(O.,O.).D09O 1=1025YDO(9 1=10,257
RF1(I )=(O.,0.).RF2(t )=(0.,O.).X1( I ) =(0. ".)
9 CONTINUEA=0.F =350.*0,DO 1 1=1,256
C --- PRIMARY INPUT --------IF(I.GE.7.5) F=4iO9.0.
A=15.0*SIN(T)B=3.0*SI N(T2)
P21 =CfP LX (AB, 0 .
S(I )=P1+P2
T2=T2+2 .0*3.14*60.O./256.0,1 CONTINUE
A=0.C ---- REFERENCE INPUT--
T=0.DO 2 1=1,256A=2.0*S[N(T)RF1( I )=CfMPLX(A,0.)T=T+2.0.*3. 11*60.O./256.0,
2 CONTINUET=0.
C ---- 90 DEG PHASE SHIFTED REFERENCE INPUT----AW.DO 3 I'-1,256A=2 . 0 *COS (T4)RF2( f)=Ct-MPLX (AO.)
Filter 1:contdT-=T+2.0,- 3 . 14L*60 .0./256.0,3 CONTINUE
C ---THE FILTER--,-Y1(1)=(0.,O .DO 5 1=1,256X1(l )=S(I )-Y1(I )X2=X1( I )*RF I( I)X 2=X2*0.*1.25XG=X7X7=X2+X6X2=X7*RF1( I)
X3=X3*0.1i25X 9= X IXli=X3+XgJX3=XII*RF2( I)Y1(( I1)=X2+X3
5 CONT INUEIM8
CALL FPT(Xl, 114)CALL. FFT(Y1, 11-1)CALL FFT(S,lM)
c S(1 PRESENTLY HAS TIHE FFT OF 1/P S(1D0 50 1=1,256GG=REAL(XIil))**2+AlrtiAGx(XI i))**2GG=GGIZ 000.0.AI(M )=CMPLX(GG,0.0)GK=REAL.(YI( ) )**2+A;IMAG(YI( ) )**2G K =GK /1000.0,Y((1)=CIAPLX(GK,o.)PP=REAL (SC ) )**2+Alt4AG(S( I))**2PPPP/ 1000 .0S( I )=CM.PLXPP,O.)
50 CONTiNUECALL INITT(120)CALL PLQTS(IGUFT,5)
c CALL AX1Sr(0./A.',,I AXIS' ,-Gr io. ,0.0 O.4. NfC ~CALL AYIS(0.,,o.,.'Y AX I S1,6,u-. ,90 i J558 FORMAT(n)WIRITE (1,660)
660 FORMAT(' ','TO START THE PLOT, HIT RETURN 1)-~ READ (1,558) lEDX=O.CALL PLOT (0.0,0.0,-3)DO 303 (=1,128Y=REAL(s( I))X=X+6J C=2CALL FACTOR(O.01)CALL PLOT (X.Y,I'-)
303 CONT I Pt)E- C P. S. .0 F 0/P (G(I)Y) IS PLOTTED
FilIter 1:contd
x=o.C CALL AXISC0.,o.,,'POwq SPEC',8§,.OlC CALL AXIS(0.,o.,t ,1,8.f,9 .. 1.)f0
CALL PLOT (0-0,0.O.,-3)DO 90 1=1,.123Y=REAL(Xlf I))IC=2X=X+6CALL FACTOR (0.01J)CALL PLOT (X,YIC)
90 CONTIUECALL ANMODECALL FJNITT(0,0)STOPEND
1CIE
[LiL
ILI ~£~ -------
z Vd
o__o
oII
C ****k*******k*-r1*******.*
CCc FILTER 2: THIS FILTER I,:-AS A PRIMARY INPUT OF VARYINGo FREQUEIICIES Itl THE RANGE OF 1-10%CIS TO 70C/S ANDC 300 C/S TO 8000/S. THE FREQUENCY CF THE GENERATED SIGNALC IS CONTINUOUSLY VARYING SINUSOIDALLY.C THE AMPLITUDE VARIATION OF THE SIGNAL IS ALSO SINUSOIDALC 6OC/S IS AiSSUMED TO BE THE NOISE FREQ TO BE ELLIMINATEDC THE OUT IS A.PLOT OF POWER SPECTRUM IN D8.C THE I/P AND 0/P PLOTS ARE IN TUE SAME SCALECCC
COMPLEX P1,P2,X2,X3.X4l,XGX7,X9COMPLEX Y1(257),S(257),RFI(257),RF2(257),Xl(257)T=0.Pl=(0.,0.)T2=0.P2=(0.,0. )X2=(0.,.0.)X3=(O.,O..XI = (0. '0.)
X7=(0. ,0.).
DO 9 1=1,257Y1( I)=(0.,0. ),RFI(I )=C0.,O.),flF2(I )=(0.,0.,X 1,(1 )=(...
9 CONTINUEA=O.U=0.GA=O.OU2=0..GA2=O.O
C --- PRIMARY INPUT (NOISE CURRUPTED)---DO 1 1=1,256U=U+4 .0GA=U/ 100F=IiOO.0+(SIN(GA)*100.0)U2=U2+'i.0GA2=U2/100 .0.F2=60.0+(SI N(GA)*10.0)A=5 .0 *SlI N( T)B=3.0*SINCT2)P1=CM-.PLX(A,O.)P2=Ct-PLX(1), 0.)SCI )=Pl+P?T=T+2.*.3. Th*F/ 1000.0.T2=T2+2. '1*3. 14*F2/256.0
1 CONTINUEC --- REFERENCE INPUT------
A=0.T-0.
DO 2 1 =1,256f I er2cntA=2.0*SIN(T)flFl(l)=CtMPLX(A,o. )T=T+2.0*3. 11I*0.0!256.o,
2 COJT! IJUEC ---- PHASE SHIFTED REFERENCE INPUT ---
T=O.A=0.DO 3 1=1,256A=2.0*COS(T)RF 2( I ) =CMPLX (A,0.)T=T+2.0*3. 11*60.01256.0
3 CONTINUEC --- THlE FI LTER----
Y1(1)=(0.,0. ),DO 5 1=1,256X1( I)=S( I)-Y1( I)X2=Xl(lI)*RF1( I)X2=X2*0.125146=X 7X7=X2+X6X2=X7*RF1( I)X3=X1(I )*RF2(i )X3=X3*0 .125A'9=X4i
X4s=X3+X9X3=X14*RF2( 1)Yl1 1)=X2+X3
5 CONTINUEI ft4 = 8CALL FFT(X1,!M)CALL FFT(S,IM)
C SO9) PRESENTLY HAS THE FFT OF I/P S(1)DO 50 1=1,f[23GG=REAL(XI('I))**2+AIt.iAG(Xl(, ))**2PP=REAL( ) )**2+At .1AG(S(1I )**2
520 FORMAT( I, 'INPUT=',4X.E4.5.,OX,TOUTpV4XE14 5.)S(l )=CMPLX(PP,o.,)
X1(I )=CfPLX(GG,o.0o)50 CON4TINUE
CALL INITT(120)CALL PLOTS(IBUF,1,5)CALL PLOT (1.0.,i-O.,-3)WITE(1,660)READ(1,558) lED
C CALL AXIS(0.,,0.,-'lNPUT POW SPEC IN DI-09,010110)C CALL AXS(0.,.O.' 1,90.,.1.5 5 FORMAT(W~660 FORI4AT(' ','TO START TH4E PLOT, 141T RETURN '
.x=O.C CALL PLOT (1,0,1.0.,-3)
IC=2DO 303 1=1,128
Filter 2:con.td
PP=REAL(S( I))PP=20.0*LOG1O( PP)Y=PPX =X+ 3CALL FACTOR(0.02)CALL PLOT (X,Y,IC)
303 CONTINUEC P.S- .OF 0/P (G(I)) IS PLOTTED
x=0.READ (1,'45u) IZK
1 5 8 FOIRhAT(I14)C CALL PLOT C1.0,1.0,-3)
IC=2c CALL AXIS(0..,0..,,'POVW SPEC IN DB',-14,9.,,0.,O.,1.)C CALL AXIS(O.,0.1. 1,1,7.,90.,O,1.)
CALL PLOT (1.0,1.0.,-3)DO 90 1=1,128GG=REAL(X1(I ))GG =20.0.* LOG 10(CGG )Y =GGX =X+3CALL FACTOR (0.02)CALL PLOT (XY',IC)
90 CONITINUECALL ANMODE'CALL FINITTCO,O)STOPEND
-CL: -
CL (S)
IA- (
W
a.J a.
CL..
Fter 2:contd
PP=4rAL(S(J ))PP='20.O*L0G10( PP)y=pPX=Xi-3CALL FACTOR(0.02)CALL PLOT (X,Y,IC)
303 COfiTINUEC P.S. OF 0/P WMJ) IS PLOTTED
READ C1")IZK1158 F 0 RAT(I It
C CALL PLOT (1A.,1.0o,-3)I C<
c CAL:_ AXIS(0.._0,.,_P~lw SPEC IN 3'149.,,0,)C CALL AXIS(0.,o._ '1,7.1,90~.1.l)
CALL PLOT (1.0,1.0-,-3)00 90 I11,28GG=R.EAL(X1(I )GG=2 . 0_*LOG10 (GG)
CALL FACTOR (0.02)CALL PLOT (XY,IC)
90 CONTINUECALL ANM.ODE-CALL FINiITT(0,0)STOPEND
DO 2 11,256filter 24:contdA=2.O*S I t(T)ZF I(1I ):-CtPLX (A, 0.)
2 T=T+2.0*3. 1[*CiO.0/256.02CO IT Ir1U EC ---- PHASE SHIFTED REFERENCE INPUT ---
T=O.A=O.D0 3 1=1,256A=2.O*COS(T)
T=T4-2.0*3. 11*0.0/256.03 CONTINUEC --- TIIE FILTER----
00 5 1=1,256X1( I)=S( I)-Y1(I )X2=X1( I)*RFI(I)X2=X2* 0.1,251'G=X7X7=X2+XGX2=X7*RFI( I)X3=Xl( I)*RF2( I)X3=X3*0.125X 9= X11X4~=X3+X9X3=XI.*RF2( i)YI1I+1)=X2+X3
5 CONTINUEI IM= 8CALL FFT(X1, Im)CALL FFT(S,IA)
C S(I PRESENTLY HAS THE FFT OF I/P S(I)DO 50 1=1,123GC=REAL(XI( ) )**2+AII4P.G(X1( I))**2PP=REAL(S(I) )**2+AI..IAG(S(I) )**'2520 FORf-AT(' r-lINPITr,4X,E1I.5.X,10XhvUTpU,XE14.)SC! )=CtIPLX(PP,.)X1( I )-=CMPLX(GG,0.0)
50 CO4T I MU ECALL JNITT(120)CALL PLOTS(IBUF,i,S)CALL PLOT (1.0,1..,-3)WRil TE(1,660)REAO(1,.558) IEDC CALL AXlS(0.,_0.,.' g1PUT POW SPEC IN' B,2....,0,10)C -CALL AXIS(o., o.,' 1,7,g.o,
550 FORMtAT(II)660 - F11MATC' I 'TO START THE PLOT, HIT RETURN F)
__- C CALL PLOT (1. 0, 1,,.,-3)
-DO 303 1=1,128
-A/
cI
C FILTER 2: THIS FILTER 11 -A$k A PRIMARY IIIPUT OF VARYINGC FREQUENCIES IN ThIE RAINGE OF I:OC/S TO 70CIS AIDC 300 C/S TO 800C/S. THlE FREQUENCY OF THlE GENERATED SIGNALC IS CONTINUOUSLY VARYING- SINUSOIDALLY.C THlE AMP~LITUDE VARIATION OF THE SIGNAL IS ALSO SINUSOIDALC 60C/S IS ASSUMED TO BE THIE NOISE FREQ TO BE ELLIMINATED)C THE OUT IS A PLOT OF POWER SPECTRUM IN OB.C TME I/P AND 0/P PLOTS ARE I N THlE SAMlE SCALrCC
COMPLEX P1,P2,X2,X3,XII,XG,X7,X9COUIPLEX Yl(257 ),S(257) RFI(257),RF2c25-),Xl(
2 57)
P1=(0.,O.)T2=().P2=(0.,0.).
X3=(O.,.Oj.-
X3=(0.,0.).
X7-(0.O.)
DO 9 1=1,257
RF2 (I )=(0.,..R( I )=(o. ,o0)
9 CONTINUEA=O.U=0.GA=O. 0.U2=0.0GA2=0.0
C ---- PRIMARY INPUT (NOISE CURRUPTED) -----DO 1 1=1,256U=U4.0GA=U/ 100F=I4 00.0+ (S I(GA)* 100.0N'U2=U2+4s.0.2A2=U2/100.0F2=G0.0+'SIN(GA)*10.0)A=5.0*SiII(T)
Pl=CiU.PLX(A,O.)P2=CMPLX(LI,0.)S(1 ):P1+P2T=T+2 .*.3* 14*F/3.000.0-T2=T2+2.0.*3.l4-*F2/25
6.01 CONTINUE
C -- REFERENCE INPUT--------A=o.
T~il .zj
zz...........
orcoto
zL
rL
o o
- Appendix C
1. A VLSI Residue Arithmetic multiplier, IEEE Transactions of Circuitsand Systems, W.K. Jenkins Editor, revised paper acceptea forpublication, approx. 10 pages.
2. Large Multiplier Multipliers, ICASSP 80 Proceedings Denver,Colorado, April 8-11, 4 pages.
3. Large Mioduli Multipliers, 1980 International Symiposium onCircuits and Systems, Houston, Texas, April 28-30, 3 pages.
4. A New Technique For WFTA Input/Output Reordering, InternatinnalJournal of Comput.er and Informnation Scierces, J. Tou editor,accept2d for Vol. 10, Number 1, approx. 15 pages.
5. .ihe Realization of Adaptive Kalmnan Filter, pending ACTA, M. HamzaEditor.
PULiSLISED INi 1K PROCEEDINGS& Of TilE ICASSP 80
ONlVER, OLORAOAPRIL 9,10,11
t~nv~rityof ..od J. Taylor Oi 52
spart,-.ent of Electrical and Computer Engine~eringUnivrsit ofCincinnrati, cincinnati, ho 52
_______required, the residue structurg, alwais rrovidesIn th,- work tf~ tmesent a "ow table loku better perforit-anc with reard t ut~p.cation.
mhenallriuti~lcatonsmustbe ounedthe Z's;tcj~,~scteme3003 clss f aL l~l(ripcotrolement structure provides betzer perfornance.~.tltbliers cipable of iorkin,,, with exact Sinc- most nion-trivial filter and 'rainsformi'dulr) ulderig ss,.n..applications reqtui.e i high plurality of multiply
kleror, savings associated with the fler look-up and arid operations (almost always insuring themnultiplie", when compared to contemporary overflow of the limited integer 4ynamic rangesfletr'ods, are shown to Le on the order of-2/1' currently being implemented (<71 tyo.)) the,hi-re %=2n, neinpJt wor~flength. Th!'ougnp~.t is luture of residue based digital systeq mav apoears'ic-mi to-: be e.-!al -0 that oDotined ising VLSI1 limrited at first glanice. We s1hall, in tnis work,4nd classic architectures. present some new results which overcome this
contemporary deficien,.y and Hi fact -rake theputential of mrdular arithmetic systems even moreexc.e.ing.
:ntr~ucton:Residue ALU's.),.;tal signal processing is a study inder.going Tedsdatgso h eiu ~~brss~naccelerated growtn, acceptance, and appliczation. Tre didanld. Sinc the rel ses riobe siotn-4ith the possible exception of number theoretic ficant digit, decimal to residue conversion,tansfolms, digital slignal processing has b~en ivsomgtuec prsnad rth tcprincipally advanced through technological dvsomgiuecmaioadaiheiacnievements. These incluce the mricroprocessor, shift operations are cumbersome and snould below cost high performance nmer.ory, and the read avoided. Rigister overflow, due to its finite
onlymemry ROM. Te aailbilty f te ~dynamic range, impose a severe cnstraint on tnehias Lhailenged our traditional attitude towards, R.-S operations. Uinlike weighted numbers (deciffld,
perfrmig dgitl aithmtic Inparicuarbinary, etc.) where rounding or truncating leastthe rt f fied ointnulioliatin ha uner- significant digits can control overflow, such is
gone a partial rwetamorphosis through tire use of irt the case in the RNS. Since there is anROM based table look-ups. Since multiplication ibsence of least significant digits, the more
nas eena prncial seedcostandcompexiy general and inefficient operation known as scalinglimittin to dricital fildterng, adacment m~ust be used. SincL s:aling is a forn of division,4liintin are hiave bee ltrly , racevae ent its use should be discourageo. To 13in insight
in tis rea avebeenwarly rceied.into this problem, consider the inner product ofE:!SI;4 LOK-U ARTHMTICTECUIQES:two 31-dimensional real vectors x ar'd y whoseEXISINGLOOKUP RITHETI TECNIQES:entries are encoded as residue digits with respect
"'dch of the reported work on ROM based fixed to P=(32,31.29,27). Without scaling, the dynamicpoint multiplication has been i's support of range of x and y would be limited to V-3 wherelinear shift invariant digital filtering. Authors V=f.Ir/2)/3W25O56. Therefore, to insure that nosucn as Jenkirs and Leon, SolerStrand, etc., have worst c se overflow can occur, a 7.3-oit (ie: V. 5=studied the cost-speed metrics of digital filters 158,Q71) dynamic range limitation mrust be imposedusing the residue numbering system. The princioal on 7 and y. With scaling, larger input ranges caradvaintage of the rezidue numibering system ;s thrat be *jsed at the expense of statistical accuracy in.jrpport! fixed point multiplication and addition !the output space (analogous to rounooff er'rors).
wirchout need of preserving "car'ry information". Dt~e to the dynamic ran'ie of RNS systems, one isThus, parallel operations are idraissible. Inadaition, -nodular multipliers wyere shown to e generall1% forcea to accept one of the followingrealizable using table ;ook-,rp methods and RUM'1s. two overflow Prevention strategies.
J~enkins recently questioned winetnrer the perform- 1. Increase the dynamic range to a sufl'icientlyInce af tne res idue nuemrni nc System'r was due to large value by adding m~ore mnoduli to '. orthe intrinsic properties of the system or the use ~ '~ cln oeefcetoeainOf look-up nv lt ipl iers. It aas concluded that'it The first nir,~ni reprpsanfs i htri't- f-rc, Y-appears that irnen no roundinrg (scalinq) is to ttie problen. Such an approach wiill increase to
cost a~nd cc rlexit! .-etr'c; j-.' a 1*nr:-. tni el uction o absenceaddition. tne noculi sot Z ;ust be tailoried to a -,f raufitionai 3catinl ocer-3ons) can beunique filter. Tne otnier , roacn app~ears to ce 4chieved. 7rinally. se-,eral versions of tre *the yiost pOODIar 4t t.iiS i. Szaibo ad I~rtA can he ccnside~ed Thav Airt saiarizedTanaka, and others, nave concentrated on the in figu-e 3.scaling efficiency through the Mc.~ie of thethree-tirnle mod-ili se 1, 2 n 1j 2 n.7n~~ po closer iniestigation of the table look-up
iroduli set has the ability to efficiently scale j~t bsaPotential nui;ance can be foond. It
a residue number by any one of the cnosen moduli. can b1e e-(am~plified y observing that if ;-9.Noweve' , there is an intrinsic limitation P=32, then !s-)z<9 14>3,20.23. Therefore, it
plaguing this method and it is its dynamic m~ay be required that t.~ocadditional fractional
range. Using a large high-speed miemory unit, bits -..ay need to be 3dded to the table's otutputsay I YxI the input addressing space ;s limited word length. 4owever, this is not the case as
.o 2 2. This means that a godull P. is suqgested by the following theorem:Tehiel iitdt .? (e ~~~21)riorem,: L,2t 'jv!1 deote the integer value of v.
Thierefore, the dynamic raige of ar~y zodular 21cperation is giver . __(~- - )(21Y1nl)i i 3 -M-z~[(s-s)~in iiany applications, an 18-bt, resolution is That is, only *ne integer valup of p need be usedlrsuf'iclent resolution. and the fractional bits of 6(s-) can be ignored.
:1ew Results Proof: Let (-<+y)/2 4-v+k/2; (X-y)/2 qb/2 where
Tw'o new mewsory efficient algorithms have been k=O or 1. Then zl<<(x+y) /4> 2
ae'-ived and is based on a no~el factorizition --!%_y)414> N :<:vkvb /1> -<Q+kvc i /4> >of a bilinear for-i. Over a real field it i; . p~i'4p q- P. pt( )-
which ~n modular form. becomes As A re sult, the rarallel architecture is equiva-
<x~=<(+) s-,2. lent to that shown in figure 4.
where t~s)=,s2> 3 with s"=1x+y)I2 and s5%(x-y)/2. X*odu)lo ± dder3tfistglnc tisalorth. hihshall be The 1-f1 multiplier requires a modulo p Adder be
rtefrste ln t i s a lma orhm dla muhiih used to combine the two component parts of therfre 4toa inra eoymdlrmli solution (namely (s+) and ~(). Modulo p
plier (; ), would appear to be counterproductive adders pose an interesting de-sign problem.with respect to a memory conservation metric. nesafs iouopadrcnb arctdThe memory requirements associated with thej
4 nesafs ouopadrcnb arctdwill be shown to be substantially less than the overheadassociated with addition will offset
those of direct mechanizations. First, it should any gain in throughput achieved through tables~ and ~look-ups. For the moaduli chosen, 2n~1, 2n, andbe apparent that the integer sans-foulqd 9n4l. only the modulo 2tn adder can be realizedin equation 2 is bounded from above by 2n" d.ecl (nbtAdrwt goe vrlw. I
Therfor, ony a(n~l-bi tab~ adresingwould however, be desirable to use a n-bit adderspace is required to realize (s-) versus the to realize the modula 2n-1 and 2n41 adder as well.2n-bit space needed for direct architectures.It would appear however, thiat there is an Usirg n-bit AND gates to sense the zero conditionexception to this rule. Since one of the mioduli Of <S> M, the overflow bit OVF, and the sign bitschosen is ,~2n.., Her- the -naxi-aal value of of tbi)and bs-, a combinational logics+(or s-) is 2n+1 which would techniczjly routine can be defined -which will convert sz4require a (n+2)-bit address. However, by using into <s> .It can be noted that the mappingthe 9rntco.l fo~rnd in fio3ure 1, the table size requiremBAts are:can be reduced to 2n+I Words for all moduli. 1.ff ='.,mp oso - 4I=4s
Wrthe ov~rlow bit serves to differentiate 1.'2p2~lmpst or s2+s S 14s-C fom .2. for r-2 . map s to s-.2 l<s>
The 14 system architecuie i.s abstracted in 2Figure 2. This uses 2n )~rd liigsi-soeed memory 3. o ~ 11, '.a o rs2 1-"s -lfor modular arithmetic look-hip operations. Using, 2'for exaimple. the previously referenced 4K-3O0ns S'jooose the !joduli pr 41. nzI2, is to be ivpedievice, mnoduli having An Il-bit dynaic rangn ricnted. iPy using two conimerci.:lly avjilable
Q~.6-bit in the direct for-;) can be mechanized. 150g PLA's in -arallel. the 12-bit )utcomt. of anThis wvould yild~ Whree-m~duli dynamic rantle on n-bit adder and the four control bits, can letha order of V L8,~. 6-1W1. That is, without converted to 13-bit mask. The mask would trans-an increaise in inenoFy size (and t ereforn access form the iutplit of a hinh-speed n-bit adler to
- - i~'), te dnami rage o th M~ s ~s or s-2n-1, depending on the state of the 4* times larger than that obtainable through direct .control bits. O3ased on a 25-ns 12-bit Schottky
Teans: This large increase in dynamic range look-ahead adder, a 20-ns 16x9 ?LA. and l0-nsn~akes the RIJS a viable alternative to traditional --E -ask ;witches i 6E-ns .cdjlo 3 adder, for
fter design methods. Botn improved precision 7~~1 n, and 2 10 can be rea-lized. The
]
oresence of a 65-n- n'oduio 3 )dle- ,ill no'v aic-a l4C-ns Ilire 'moduli -esijtue -ulrt.o iier -toe -b.sed on 35-n 4Kxl U'OS -- or/ units. or -.moduli set Q2*-), 21'4', -, a fited poll't'qlti~jier, having in ,utput d'ynamic ran~ge 3. ,4 -21, can tus be faDricated havinq a .ord -
rate of 7.14311 multiplicatilons per second r-28.5: -mt ilcto- per second if a pipe- ., "' = 'I
lined architecture is used. -I 1Summary: : "The residue number systen offers the pozential - .for high speed parallel aritnnetic. This classof arithmetic has been denonstrated to be usefulin designing recursive al)jorithms, t-ansforms. ' '.-
and digital filters. Cne of tbe principallimitations to its u-e is its limited practical Figure 1
dynamic range. To overcome this problem,.large moduli multiplier, for the moduli set2n-l, 2n, 2n~l), was designed. Inis high-speed
large noduli systen was the product of the newM4' algorithm and new technologies (RAIl and PLA's).The oractical resioue multiolier is capabl ofspipoc,.ting a pipelined execution rate of 28.5:. 1-T--.--.'ultiojiers per second.
References
1. G.A. Jullien. "Residue Number Scaling andOther Operations Using ROM Arrays," IEE JTrans. Computers, Vol C-27, No. 4, " .. ,pp 325-337, April 1978. ' ,= :
2. C.d. Huang and F.J. Taylor, "Memory Figure 2Compression Scheme For Modular Arithmetic,to be published, IEEE ASSP.
3. ,!.A. Soderstrand, "A High Speed Low-CostRecursive Filter Using Residue Arithm.etic,"Proc. IZEE, Vol. 65, No. 7, July 1977.
4. U.S. Szabo and R.I. Tanaka, Residue Arithme-tic and Its Applications To Tomrp-uter -,-,." ..-..
Tecnnoloyo, ;cr4 11aw-Hil, 1967.
5. W.4. Jenkins and B.J. Leon, "The Use Of .,_t--..Residue lumber System in the Design of ..
Finite impulse Response Filters," IEEE: -7
CS:, CA5-24, April 1977. ,T. . I.rm , -4-14
6. W.K. Jenkins, "Techniques for Residue-to-Analog Conversion for Residue-EncodedDigital Filters," IEEE C&S, CA5-25,July 1978.
7. A. Peled and S. Liu, ",A Vew Hardware.Realization of Digital Filters," IEEE ASSP,
S 3. W4.J. Jenkins, 'A% ll~hly ,Efficient Qesid,je- -- "°Co-ibinatorial Architecture for Digit3%
Filters," Proc. of the IEEE (Letter), - I:-- I '. I . .. ,I Vol. 66, Nlo. 6, Jne 1973. pp 700-702.
This work was partialiy suopvorted under AFOSP Figure 3
Grant :o. AF49620-79-C-0066.
/70
"1t
LW
TI
IJ I -- ":
U~vf )J
mI
LARGE MODULI MULTIPLIERS
Fred J. Taylor
Department of Electrical and Computer EngineeringUniversity of Cincinnati, Cincinnati, Ohio 45221
ABSTRACT: required, the residue structure always providesIn this work we present a new table look-up better performance with regard to multiplication.storage scheme and a class of table looK-up When all multiplications must be rounded, the 2'smultipliers capable of working with exact complement structure provides better performance.(modular) numbering systems. Since most non-trivial filter and transform
applicat-ons require a high plurality of multiply
Memory savings associated with the new look-up and add operations (almost always insuring themultiplier, when compared to contemporary overflow of the limited integer dynamic rangesmethods, are shown to be on the order of 2/N currently being implemented (<216 typ.)) thewhere N=2n, n=input wordlength. Throughput is future of residue based digital system may appearshown to be equal to that obtained using VLSI limited at first glance. We shall, in this work,and classic architectures. present some new results which overcome this
contemporary deficiency and in fact make thepotential of molular arithmetic systems even moreexciting.
Introduction: Residue ALU's
Digital signal processing is a study undergoing The disadvantages of the residue number systnisaccelerated growth, acceptance, and application. are diad n e the re ses no signi-With the possible exception of number theoretic are anifod. Since the RNS possess no signi-transforms, digital signal processing has been ficant digit, decimal to residue conversion,principally advanced through technological division, magnitude comparison, and arithmeticachievements. These include the microprocessor, shift operations are cumbersome and should below cost high performance memory, and the read avoided. Register overflow, oue to its finiteonly memory (ROM). The availability of the ROM dynamic range, impose a severe constraint on thehas challenged our traditional attitude towards RNS operations. Ul1ike weighted numbers (decimal,hasrchallnged ouritraiti .nl atiubinary, etc.) where rounding or truncating leastperforming digital arithmetic. In particular, sgiiatdgt a oto vrlw uhithe art of fixed point multiplication has under- sigificant digits cn control verflow, such is
gone a partial metamorphosis through the use of not the case in the RNS. Since there is anROM based table look-ups. Since multiplication absence of least significant digits, the morehas been a principal speed-cost-and complexity genrai anzd incfficient operation known as scalinglimitation to digital filtering, advancements must be used. Since scaling is a form of division.
hnwarmly received, its use should be discouraged. To gain insightin this area have been rinto this problem, consider the inner product ofEXISTING LOOK-UP ARITHMETIC TECHNIQUES: two 31-dimensional real vectors x and y whose
entries are encoded as residue digits with respectMuch of the reported work on ROM based fixed to P=(32,31,29,27). Without scaling, t e dynamicpoint multiplication has been in support of range of x and y would be limited to V-- wherelinear shift invariant digital filtering. Authors V=(M/2)/31=25O56. Therefore, to insure that nosuch as Jenkins and Leon, Soderstrand, etc., have worst c se overflow can occur, a 7.3-bit (ie: V- -studied the cost-speed metrics of digital filters l58%27.i) dynamic range limitation must be imposedusing the residue numbering system. The nrincipal on x and y. With scaling, larger input ranges canadvantage of the residue numbering system is that be used at Zae expense of statistical accuracy insupports fixed point multiplication and addition the output space (analogous to roundoff errors).
i without need of preserving "carry information".wThut naae oprerting arry adinfos ti. Due to the dynamic range of RNS systems, one isThus, parallel operations are admissible. Inaddition, modular multipliers'were shown to be generally forced to accept one of the followingrealizable using table look-up methods and ROM's. two overflow prevention strategies.
Jenkins recently questioned whether the perform- 1. increase the dynamic range to a sufficientlylarge value by adding more moduli to P, orance of the residue numbering system was due to lare valueb a more dui to.the intrinsic pioperties of the system or the useof look-ap multipliers. It was concluded that "it The first option represents a brute force attackappears that when no rounding (scaling) is to the problem. Such an approach will increase to
j H 792i 'IlS ' 4K{IiW | 074)25(X1.7i ' ic l ! :1+IP I
cost and complexity ,etrics of a filter. In .nd throughput (thrcugh the reduction or absenceaddition, the moduli set P must be tailored to a 3V traditional scali.ng operations) can beunique filter. The otner approach appears to be achieveJ. finally, several versions or the M4
the most popular at this time. Szabo a,.d algoritim can be considered. They are swmarizedTanaka, and others, have concentrated on the in figure 3.scaling efficiency through the choice of thethree-tupie ;noduli set P={2n-i,2n,2ndl). This Upon closer investigation of th~e table look-up
Smoduli set hasI the ab2lty to eff2ntly. Thscl data base, a potential nuisance can be foynd. Itmoduli set has the ability to efficiently scale can be examplified Ljy observing that if s-=9,
a residue number by any one of the chosen moduli. can be e4i 12 o r Thtefor-,9iHowever, there is an intrinsic limitation p32, then (s-)'q e /4>t2at .5. Therefore, itplaguing this method and it is its dynamic may be required that tAA additional fractionalrange. Using a large high-speed memory unit, bits may need to be added to the table's outputsayix the input addressing space is limited word length. However, this is not the case assy Is n suggested by the following theorem:to 2". lThis en that a oduli p- ;s 1technically limited to p.<2 (ie: x.&y.<2 ). Theorem: Let Ilvil denote the integer value of v.Therefore, the dynamic rAnge of anyImo,ular .-operation is given by M=ni-)(2 n)(2n )02 3 8.2 l8
. Then z=< j4(s )I'-I(s)Ii>p •
In many applications, an 18-bit resolution is That is, only the integer valu of 0 need be usedinsufficient resolution. and the fractional bits of *(s-) can be ignored.
New Results Proof: Let (x+y)/2-v+k/2; (x-y)/2-q+b'/2 where
Two new memory efficient algorithms have been k=O or 1. Then z<<(xfy) 2/4>p -
derived and is based on a novel factorization <(x-y)2/4> > -<<,,kv~b2/4> -<q4kv+k 2/1> >of a bilinear forn. Over a real field it is Pp " 2 p P
obvious that =<<v+kv> pk /4-<q+kq> -b /4> p=<(s )-xy=((x+y)/2) 2_((x-y)/2) 2 |. V(s,)> p
which in modular form, becomes As a result, the parallel architecture is equiva-
<xy>p=< (S +)-0(s')>p 2. lent to that shown in figure 4.
where o(s)=<s2 >p with s*=(x+y)/2 and s"=(x-y)/2. Modulo p Adder
At first glance this algorithm, which shall be The M 4 multiplier requires a modulo p adder beused to combine the two component parts of the
referred4 to as a minimal memory modular multi- solution (namely O(s4) and *(s_)). Modulo pplier (M ), would appear to be counterproductive adders pose an interesting design problem.with respect to a memory conservation metric. adr oea neetn einpolmwthespectoa memory reqiren s eoatith methc Unless a fast modulo p adder can be fabricated,The memory requirements associated with the th v rH"-s o i t d wi h a d t o i l o f ewill be shown to be substantially less than the overhead associated with addition will offset
wil b shwntobe ubtanialyles tanany 9jin in throughput achieved through tablethose of direct mechanizations, First, it should any gi r thughpu chee thrug tabe apparent that the integer s ad s look-ups. For the moduli chosen, 2-l, 2, and
in equation 2 is bounded from above by 2 n. 2ir, only the modulo 2 n adder can be realizedTherefore, only a (n~l)-bit tb addressing directly (n-bit adder with ignored overflow). It
eis requred to reaizet a ddessi would however, be desirable to use a n-bit adderspace isrqie*t elz (s-) versus the to realize the modulo 2n~l and 2nf, adder as well.2n-bit space needed for direct architectures,It would a')pear however, that there is an Using n-bit AND gates to sense the zero conditionexception to this rule. Since one of the moduli of <s>N, the overflow bit OVF, and the sign bitschosen is p=2nt1 Here the maximal value of of f(s+) and *(si), a combinational logicst(or s-) is 2n*1 which hould technically routine can be defined which will conv.rt <S>2Nrequire a (n+2)-bit address. Ilowever, by using into <s> . It can be noted that the mappingthe protocol found in figure 1, the table size requirenAts are:can be reduced to 2n+l words for all moduli. N N_Hre, the oveflow bit serves to differentiate . N N
s-~ fom .2. for p=.2 map s to s-2 <=S>1 2NThe t44 system architectyre is abstracted in 2NFigure 2. This uses 2n,1 word high-speed memory 3. for p=2 N1, map s to s or s-2-l-"-<<SN-I>2Nfor modular arithmetic look-up operations. Using, 2 2for example, the previously referenced 4K-3Ons Suppose the moduli p2 n 1, n=12, is to be imple-device, moduli having an il-bit dynamic range mented. By using two comnercially available(vs. 6-bit in the direct form) can be mechanized. 16x9 PLA's in parallel, the 12-bit outcome of anThis would yie ,Ithree-mduli dynamic range on n-bit adder and the four control bits, can be
o Ithe order of 21 -)%8.6-10'. That is, without converted to 13-bit mask. The mask would trans-1an increase in memory size (and therefor access form the output of a high-speed n-bit adder to
i- time), the dynamic range of the 1" is 233/218=215 s or s 2 n, depending on the state of the 4times larger than that obtainable through direct control bits. Based on a 25-ns 12-bit Schottky
Ameans! This large increase in dynamic range look-ahead adder, a 20-ns 16x9 PLA, and 10-nsmakes the RNS a viable alternative to traditional FET mask switches-a 65-ns modulo p adder, forfilter design methods. Both improved precision p-2P-l, 2n, and 2n+1 can be realized. The
793
presence of a 65-ns modulo p adder will now allowa 140-ns large moduli residue multiplier to bebased on 35-ns Kxl 1IOS nenory units. For amoduli set 122- 2WZ, 2m +1), a fixed po;nt , "
rtlutiplier, having an output dynamic range of j•24 -*>1, can thus be fabricated having a word -
rate of 7.143M mutiplications per second orW1028.5M multiplications per second if a pipe-lined architecture is used. :f 01"
SummlaryLThe residue number system offes the potentialfor high speed parallel arithmietic. This classof arithmetic has been demonstrted to be useful
MW-
ir designing recursive algorith ,, transforms, !$VA T PA
and digital filters. One of the principallimitations to its use is its limited practical Figure 1dynamic range. To overcome this problem, alarge moduli multiplier, for the moduli set{2n-1 , 2n
, 2 n+l}, was designed. This high-speed
large noduli system was the product of the newM4 algorithm and new technologies (RAM and PIA's).The practical residue multiplier is capable o;supporting a pipelined execution rate of 28.5Kmultipliers per second.
References
1. G.A. Jullien, "Residue Number Scaling andOther Operations Using ROM Arrays," IEEE e 1Trans. Computers, Vol C-27, No. 4,pp 325-337, April 1978. -I o , U) WUHUTISI
2. C.H. Huang and F.J. Taylor, "Memory Figure 2Compression Scheme For Modular Arithmetic,to be published, IEEE ASSP.
3. M.A. Soderstrand, "A High Speed Low-CostRecursive Filter Using Residue Arithmetic,"Proc. IEEE, Vol. 65, No. 7, July 1977.
4. N.S. Szabo and R.I. Tanaka, Residue Arithme-tic and Its Applications To Computer - ."-.&t3 P
Technoloqy, McGraw-Hill, 167-.
5. W.K. Jenkins and B.J. Leon, "The Use Of " *°Jl a rj- uResidue Number System in the Design of " t ','iFinite Impulse Response Filters," IEEEC&S, CAS-24, April 1977. IS Im Won It I -Wt
6. W.K. Jenkins, "Techniques for Residue-to-Analog Conversion for Residue-EncodedDigital Filters,' zEE. C&S, CA5-25,July 1978. "1C;W
7. A. Peled and B. Liu, "A New HardwareRealization of Digital filters," IEEE ASSP,ASSP-22, December 1974.
8. 1.J. Jenkins, "A Highly Efficient Residue- ,.:W)Combinatorial Architecture for Digital
Filters," Proc. of the IEEE (Letter), I A. W. j omI um" .' rnwsVol. 66, No. 6, June 1978, pp 700-702.
This work was partially supported under AFOSR Figure3-Grant No. AF49620-79-C-0066.
794 4
tA- j
I ( 'j~rI II) -)
4 1:IAI Ii~i
TI I *IIyC4
00
I T-ALi
Figre
795.
LANGE~ M~OiJIII I miui.rPL IER StOR SINAL PROCISSINC
by
F.J. 7aylor
11nivor~i Ly of CincinnatLi
Abs tracr
The residue numbher system has recently been showi4 to be a
vifable signal processing media. However, it dfoes possess limitLa-
tions. One of the most serious is overflow prevention through
magnitude scaling. One muethod of overcoming this defect is to
increase the dynamic range of the nuibering system. To this end
a new high-speed large moduli multiplier has been developed. The
zwltipl icr, which is the resuilt of comhbininq the quarter squared
algorithm with recent breakthrouqhs in dfe-vice technology. As
a result, equivalent 1s-bit fju precision produicts can 1)0 obtained
at a pipelined riate of 28.511 multiple- per second.
This work was partially supported under AFOSR grant f:9620-79-C-0066.
I. IIJTRODUCT ION
Recently, thle residue number sys'evi (141$) has received rene-wed attention
in the lit-erature [1-3]. This inathetinatically maiture study was, until tile
,present, in thie background of diglital system'l design because of its historic in-
abI)ility of r(lii La ha r~ida LO to iIIJort UNS £1 1i UJICi ct 14 Hoevr r OCC ccei)
break throughis in thie a rea of read-onlyu memnory LeChnlog0y hdr, sign ifCicantly
al tered tlis case. Us i ng hi gb-speed bi polar ROM' s, the ability of tile RW1S
to support ultra high-speed digital filtering has been experimentally
demonstrated [5]. The question of decimal-to-residue 1/0 operations has also
been addressed [6]. However, a major obstical to the cause of I4S fil tering
has been register overflow protection. In order to guarantee that system
registers do not overflow during run-timie, an inefficient operation referred
to as scal i'g las to be perforiied. If scaling were not required, 144S fil ters
was shown to possess higher throughput rates than those obtainable using
distributed arithmetic fie: bit-slice; ref [7]) [8]. However, when scaling
is required, the RNS architecture was shown to be at a disadvantage. It
should be remiembered however that thle distributed arithmetic filter is a
constant coefficient device (ie: shift-irvariant) whereas thle linear RNS
filter is general (ie: variable coefficient). Therefore thle RIJS provides
the user with the versatility needed to pt~rform adaptive, optimal (ex:
minimal variance in a non-stationary stochastic environment), frequency
tuneable filI er-ng which cannot be supported in a hi t-sIice configuratLion.
In this work, a new multiplier architecture is dheveloped which
significantly enhances the case for RNS fillers bly significantly reducing
jscaling overhead. The high-speed residue multiplier will be shown to
Aicrease the dynamic range of the RNS to a value which ei ther reduces th6
iitmi1bcr of scalIi 11( opera Lionis to a smiall fraction of their originalI number
or make scaling -innecessary. AH li- is eccouniished without increasing
the memory budget over above that found in contemwporary RS designs.
II. RNS OVERVIEI
Interest in the RNS is due to its ability to perform high-speed arith-
metic. Speed is achieved throug. the use of a high degr e of parallelism
and an absence of carry information requirements. These two attributes are
a byproduct of the fact that there does not exist a most (least) significznt
residue digit. That is, all residue digits are of equal i,,portance. More
specifically, if P is a mduli set such t-at P {p1 ,.,I, and the pi's
are relatively prime, then if x: [-M/2, 14/2), x is uniquely re)resented by
the L- tupl e
" x ](x , .... xL)
wi th
x;- i f !, 1)
xi 2.two.inteqe>, o thiesi si
L
where <x> demotes x Podulo pi and M = i pi. The bilinear composition ofPii
two inteqlers, say x-v(x , .... X) and yy ... , is given by xy (where
o denotes addition, subtraction, or multiplicatio,) is given by
xoY-.(xJl ... x ,XLYt 3. _- :
It can be seen each residue digit, namely xioy i can be computed i.de-
pflendfhnt of ,:ii ol. (i ,: no carry infoimaw !tion r,'i rmveaI. ). In p1-acti: +,
the ,'W))ing Of x i and yi into xioYi is accamlished using table lookups where
the table residue on randomly accessed read-only memory. Typical high-speed
memory modules, whici are currently available, are:-
Device :rip. Technology Co-figuration Access-SIed
0149 RO- EC. 256x4 20 n3SN14,S H ,.1- TTI. 0 1-.,^., 352147111 11'-! HIMS 4 096X I 30 n
The iirt)d",ct of two reidiie- ImodunIo. i, Pi p I Con he preoIIJL( d (ind
stored in a 2 xn-hit memory uni where ,,=mZn. Using a large exisLing high-
speed memory ('4Kxl at 30 ns), residues having up to six bit integer values
can be used (ex: P = (64,63,...}). Thus, fixed-point multipliers having a
dynaeic range of [-t4/2,M/2) can be architected which have execution rates
in the low nanoseconds.
The disadvantages of the residue number systems are mnifold. Since
the RNS possess no most significant digit, decinal to residue conversion,
division, magnitude comparison, and arithmetic shi ft operations are c.]umber-
some and shouid be avoided. Register overflow, due to its finite dynamic
range, impose a severe constraint on the RNS operations. Unlike weighted
numbers (decimal, binary, etc.) where rounding or truncating least significant
digits can control overflow, such is not the case in the RNS. Since there
is an absence of least significant digits, the more general and inefficient
operation kno.wn as scaling must be used. Since scaling is a formi of division,
its use should be discouragad. To gain insigilt into this problem, consider
the inner product of lwo 31-dlimensinnal real vcftors x an:d y whose entries
are encoded as residue digits with respect to 1 '3,31,29,27). Witimnul sca-
ing, the worst-case value of x and y would be limited to V" where
V (1/2)/31 = 25056. Therefore, to insure that no worst case overflow
5 7can occur, a 7.3-bit (ie:. V 158 -2 dynamic range limitation must
be imposed on x and y. With scaling, larger input rang.s can be used at
agc e
the expense of tatistical a c. a;T" - - - ..
roundof f errors).
Dlue to the vynamic ranre imI "a in n R -It onr 0s m 1
forced to arcno:l one of n-, fn ir !..n n-r , _-y-pt inn no ,
I. i tN S C L(II dI,,. iC r', 'u- 1 c i 1i, i ,,; - r ;q l -Vald-- bi y ,1 nq
more ;-nodul i to P. or
2. ;:ake scaling a 1o-.-_e efficient opera tion.-
The first option represents a brute force attack to the problcmi. Such
an approach will increase to cost and complexity metrics of a filter. In
addition, the noduli set P must be tailored to unique filter. The other
approach appears to 1- 'e -ost popular at this time. Sazho and Tanaka,
and others, -ha'Te concentrated or: the scaling efficiency thirouch the choice
of the three-tuple moduli set P - , 2 2n1+1 This moduli set has
the ability to efficiently scale a residue number by any one of the chosen
moduli. However, there is an intrinsic limitation plauging this method and
it is its dynamic range. Using a large high-speed memory unit, say 4Kxi,
the input addressing space is limited to 2 This means that a maduli .
612is technically limiited to P- < 2 (ie: x-Y. _2 T dynamic
range of any modular operation is given by M = (2h2-l)(2n)(2+l) t 2
In many applications, an 18-bit resolution is insufficient resolution, i >
III. Principal result
It is desiranle to keep the previous y discussed tihr w moduli st-rucLure
for purposes of- potential scaling needs. However, in order to overcour 1be
existing disadvantages of this syste, ,that of dynamic range, a new approach
is called for. Since it is unrealistic to- assure substantially larger
density higu,-speed P"nories- will continue to lbecote available, it is
_ i-nctrabent t at more m~emory efficient . esidtje aritlinjtic uni t -fe designed. --
An of ficient aloori tin, which is ideal ly sui Led for this aIppl icat ion, is
known as the (IItate-r-sq(Iuare Put I tip Iit cr [9- 111].
Over a real field it is obvious that
xy =((xI-y)/2) 2- ((x-y)I2V) 2'1.
which in modular formi. it becomes
where (;q(s) = s with s = (x-1y)/2 and s-= (x-y)12.
The. quarter-squanred mul tiplier has been studied by JJM. Poll ard (1976) in a
Galois fi'21d. Questions of hardware impl ementa Lion wcre not considered anti,
due to the Galois field limitation, only prime moduli could be considered.
It. flussbaum-i- (1976) studied the quarter-square musltipl ier over real fields
for use in ROM intensive diqi tal fil ters. Soderstrand and Fields (1977) m~ade
brief reference to th-is mul tipl ier for residtie arithmetic but offered no satis-
factory hardware realization. fn this paper, a practical residue arithmetic
c(arter-squared muiltiplieir wil 11be archi tected usinq commerciil ly available
hardware.
A problem that would seem to plaugle the quarter-square multiplier is the
need to realize the division by twq the sums and differences found in Eq. '1.-1 -
fin general, the existance of an N , such that <Pi N> =1, can only be guaranteedp
n~f M is realitively prime to p. Since one of the chosen moduli is p=2 ,mul ti-
plicative inverse of 2 cannot be guaranteed to exist. Therefore, equation 4
cano fe n eLred ais thkl equation m~/ 'xt)-xy >..Te poten-
tial problem of dividinq the sum of differences, found in equation 4, by
2, will be explicitly and efficiently treated for the first tinie later in this
'paper. For a 2 word memory unit, the direct product architecture (ie.: xy)
would 1limi t the maximal modul i to be hounded by 2, n=m/2. In fact, this
clim can hoe x teraried to the case when p 2 (1 thrniicii use 011 the Foll1owingir
modification. Observe that if x. 0, then, it automatically followas that
<X yi~i =0. Therefore, if x i =,1-1,0AO~ (w0 hich is dete, table condition in
that the (n+l)st hit and remaining n-hit block is zero (04A\00 ... 0) ) the out-
put register would he automatically cleared. Therefore,, the lookup table, need
not be accessed for this case. Instead, thre all zero n-hit portion of thle
table addrev-allocated to x,, can he us-ed to represent x . 2 1 where x.
2 .I A 00-. 0 (see Figcure 1) . Here , the table woul d he programed to map
Yi into <2 n Yi>P using only a 211 word memory.
The memory requi rements associa ted with the quarter-squoare mul tiplier are
substantially less than those of direct mechanizations. First, it should be
apparent that thle integers s and s, found in equation 5, are bounded from
above by 2n~ Therefore, only a (n+l)-bit table addressing space is required
to realize (s-) versus the 2n-bi t space needed for direct archi tectures. It
would appear however, that there is an exception to this rule. Since one
of thle modul i chosen is p = 2 n+1 tHere thle maximal value of s+ (or s-) j- 2" +1
which would technically require (ni-2)-bi t address. However, by using thle
protocol foun:d in Figure 2, which is an adaptation of thle network found in
Figure 1, the table size can be reduced to 2'~ words for all moduli. Here,
the overflow bit serves to differentiate s =0 from 2nlIThle quarter-squared architecture is abstracted in Figure 3. I t uses a
2 word high-speed memory for modular arithn!.etic lookup operations. Using,
for example, thle previouisly re ferencedI '1K-30ns device, modul i havinig an 11-b it
dynamic range (vs. 6-hit in thle direct form) can be inechdnized. This would
yield a three-moduli.dynamic range on the order of 231)" 8.6-10'. That
is, withhout ain increase in memnory size (and therefore access time) , the
7
dynamic range oF the quarter-squared R 2 33/2 1 215 Limes larger than that
obtainable through direct means! This large increase in dynamic range makes
the RNS a viable alternative to traditional filter design methods. Both
improved precis ion and l:hrnuqhpu I (Lthrouqh the reduction or absepce or tradi-
tional scaling operalions) can be achieved.
Several versions or the multiplier algorithm, can he considered. They are
sumwnarized in Figure 4. The first, called the sequential form, would have
an estimated throughput rate of 240 ns based on a 60 ns lookahead adder and
memory having an access Lime of 30 ns with a cycle time of 60 ns. The second
architecture, called the parallel form, would run at a 180 ns rate. The
parallel architecture is preferred because its higher speed, simpler control.
A 60 ns pipelined execution rate can he purchased at a small hardware cost.
Example: p - 211 . 2048, x = 1040, y = 352, then
Z = <xy>p 1536S + +
Al:s = 1376; (j(s+) = <484416>p = 1088;
Al:s = 688; A(s-) = <118336>p = 1600;
A2 QS <(S)-l(S-)>p <-512>p = 1536
Upon closer investigation of the table lookup data base, a potential
nuisance can be found. It can be examplified by observing that if s- 9,
92p = 32, then +(s ) = /4>32 = 20.25. Therefore, it may be required that
two additional fractional hits may need to le added to the table's output
wordl ng L. However, this is not the Case ds sUg.l(lLed by the following
theorem:
Theorem: Let vIl denote the integer value of v. Then z +<p(s+)U-lP(sX5>p.
That is, only the integer value or , need he used and the fractional bits of
l,( s+) can be ignored.
I
Proof: For x, y and k integeri'S, one nmy define two rational numbers, namely
(x4y)/2 A v+k/2; (x-.y)/2 A qih/2 where k = 0 or i. Then z = <xy ) P
- <(x-y) 2/4> > <<v4kv+b 2/4> -<q+dv+k 2/4> " <<vrkv> + k 2/4-q k > -b 2 /4>p p p p p p p p
As a result, the parallel archilecture is equivalent to t.hat shown in
Fiqure 5. rurLhermore, by derivi||q the above Lhi,,rem over a ra t.ional field,
and showing that the results pertain to the integers, several
classica, probtemsare overcome:
1. The quarter-squared multiplier is not restricted to the Galois fields
sucigested by Pollard.
2. The question of the existence of the multiplicative inverse of 4 is
now moot.
I -
. .... .....
N- X\ f IYI J)1 lA
2 11
ANTA~Pl~ip (4 sIINd1
'~l1It fyL)
i r 0VH--=1
N11)IIy (Y~lq~ln; m mm ~UIs
10-
The quarter-square I11111 tipl ieor rCq:jires a mo(l o p adder he used to Comb inethe two compollont partls of the solution (namely ,]( s) and ,,( s- ) ) Ml(hil o( p
adders pose an interesting dc-ign problem. 0nless a fasL rodulo ) adder canbe fabricated, tle overhead associated with additionll will offeL ally gail illthrouqhpil achievI thr(qli h il) Ic Ui0 . I 1( J ildu I (.h lookiis, )1-- 1 1''
and 2i1+1, only the milOdt;!1 u " adder can be realized directly (n-bit adderwit, ignored overflow). It would however, be desirable to use a n-bit adderto realize the modulo 2"-1 and 2nfl adder as well. For the purpose of clarity,let s be defined to be the suin of ,(s + ) and ,(s-). The following observation
then follows:
TABLE 1Dynamic Integer Modulo 2 Adder Modul o J) Adder Example:N=3Case Range of S <s> 2N1 OVF-BIT Pi %sPi s <s> 1
1 s=O 0 0 2fl-1 0 0 02 1<s< 2N-2 S 0 2
- 1 4 43 s= 2 1-I S 0 -1 0 7 04 s=2 N 0 1 20 1 s-2N+1 8 15 2N+ ~I<.s<2 i4 s-N
2N -I 2N 10 3s--
6 s=O 0 0 2 0 0 07 I<s<2 N-I s 0 2 s 4 4
N (N N (19 2N+<s<2N 2 s-2N 1 2 s- 10 2
10 s=O 0 0 2N+1 0 0 oI 1 <s<2N I s , +1 s 4 4
1-20 1 i s 8 8?:i[ 13 2N+I<s< 2N !-I s_2 N 1I~.S2S-2 1 2N+1 s-2N1 10 1
4(spec i4 I ca 1v_ _ _ _ _-_
Using n-bit AIl) gates to sense the zero condition of - 211, the overflow
bit OVi the sign bits of 1,(s+ ) and ,,(s-), combinatienal logic
can be defined ,.:hich will <s> 2N into <s> . It can be noted from ilhe
data found in Table 1 Lhat the mapping requirements are:
1. for p 2 N1, map s to S or s-2l : .<s>.NI>
2. for p 2 map s to s-2 =s,
3. for p 211+1, ra) s to s or s-2 fl-I = -l> ,
Mapping two is trivially satisfied with an n-hit adder. The other two
mappings require that s remains unchanged or it is decremented or incremented
by unity. There, are several ways to approach this problem. Bioul , Davis, arid
Quisquater have presented an, unorthodox architecture for a modulo (2n1-1) adder
using two-input qates[12] Modulo (2nl) adders can also I)e realized through
the use of end-aroun(-carries. lowever, compared to modulo 2 addition, this
approach would almost double the addition delay. This exLen(ded delay problem
can be overcome through added complexity (ie: time multiplexing two end-around-
carry adders). Mapping one and three can be efficiently realized in the manner
suggested by the example found in Appendix A. The functional operation of
adding one (mapping 1) or subtracLi n one (mapping 3) from the output of an
n-bit adder is performed by a PLA. The PLA will provide an overlay mask which
accomplishes the required task. The derivation and utility of the mask can be |
understood in the context of the following example. Exampl e: Suppose s is an
ll-bit word having a decimal valin: of ql(1 92 or s2 00001011100. If Sl)-l
91 or (r, 101 12 100 10011 is desired, one notes that only the 3-l.SII's of s
need be altered. in gereral, for n=12, only the following 13 distinct binary
masks are renuired fo form (s 1o-i)2
0SI a tietrn S 1 I.S8 L i oil
XX XX X X XX X X x X =leave corresponding bitX X X X X X X X X X X 0 location of s? unichanged l(or 0)X X X X X X X X X X 0 1 -- change correspondIing hit
locf-iLion of' s to I (or 0)
0i 1 1 1 1 1 1 1 I 1 1 1 Table 11. MM K
Suppose thle moduli p) 2' .11 n 1 2, i s to be iml) (21111tcdl. Bly us incj
two coiuiercial ly available 16x9 PLA's in parallel, thle 12-hit outpt of anl
n-hit adder (shown as <> in Table 1) and the four priviously specified
control hi ts,* call he conver Led to 13-hi t miask. The mask would( trans form the
Output Of a hi gb-speed n-hi t adder to s or s-2n 1i, dependinrg Onl tile s ta te of
thle 4 control hits. Based o;-- i 25-tis 12-bit Schottky lookahead adder, a
?0-tis 1 6Y9 PLA * and 1 O-ns 1-1T miask swiklches (it) nloLaLi on conII.:; ts of Table 11)
a 65-ns modulo p) adJder, for p) 20..-1, 2'~ and 2 1-1 can b~e realI i zed. The
presence of a 65-ois modulo p adder will now allow a 1'10-nis large moduli
residue multiplier- based on 35-ns 'lKxl lIMOS memory units. (See Figure 5)
12 1? 12For oul e 1,l 2 *,2 M-l, a fixed point nil ile, aiga
output dynamic range of 23 2 can thus be fabricated having a word rate
of 7.14") 1. mul tipl ica tions per second. This compares favorably with new
16x16 VLSI multipliers. Using a pipelined architecture, which requires the
insertion of the storage registers found in Figure 5, a very impressive
throughput figure of 28.51Flmul tip] ications per second. It is JimpIortant, and
fortunate to realize timat tile I tel lI140S memory unit, used in this analysis,
has a cycle time equal to the access timie. If, as is often found in practice,
a memory unit has a cycle time approximately twice the access time, thenl
pipeline delay would increase from 35- us to 70-is.
Suulmary:
The residue number system offer , t he potentia f or high1 speed para le I
a r i imhe Li c. lThis class of ari 1.h1Lic has bee, demoms tra led Lo be useful if)
designing recursive algorithms, transforms, and digital filters. One of
the principal l ii tations to its use is its 1 imi Led practical dylami c rale.
To overcome Lihis problm('I, a lar(g , modtili mill.i p ir, for tLh, mo l i set
(2 -I, 2 , 21 1l I, ds ( esiune(I. Ihlis high-speed large modul i system, was
the product of the novel algorithm and new technologies (RAM and PLA's).
The practical residue multiplier is capable of supporting a pipel ined
execution rate of 28.5 l. multipliers per secon(.
LasLly, the performance of the residue multiplier is :noted to be technology
dependent. As memory densities increase and speed improve, the multiplier per-
formance will directly benefit. As a result, the higher speeds associated
with the next cicneration of submicron techln6logy devices can provide a speed-up
of two to five. In the more distant future, when and if the Josephsen tech-
nology becomes a viable design tool, residue multiplication rates, using the
proposed methodology, may approach SOOM mui tip Ii cat ion0 per Seo.l
Re ferences
1. G.A. Jullien, "Residue Number Scaling and Other Operations Using ROMArrays," I[LE Trans. Computers, Vol C-27, No. 4, pp 325-337, April 1978.
2. C.11. Hluang and F.J. Taylor, "Memory Compression Scheme for Modular
Arithmetic, to be published, IEEE ASSP.
3. M.A. SodersLrand, "A lii gh Speed l.ov-Cost Reclrsive l'iltLer Isinog ResidueArithueLi(:," Proc. II.1., Vol. (5, No. 7, July 19/1.
4. N.S. Szabo and R.I. Tanaka, Residue Arithmetic and Its App] icaLionsTo Coj mLer 1'echnol gy, Mc raw-Ill I , . ...
5. W.K. Jenkins and B.J. Leon, "TheiUse Of Residue Number System in the_ Design of inite impulse Response Filters," IEEE- CAS , CA5-24, April 1977.
6. W.K. Jenkins, "Techniques for Residue-to-Analog Conversion for Residue-)incoded Figital Filte sp" IE CFS, CA5-25, July 1971% C - Apl9
,. ~ . 1'C o~ at ! P. L.iu:, "A flevi 1Iardi- iie 1) a . 'd o Of Oigi La! ri I Ic "JULL ISSP, ASSP- 22. Deccemb~er 19/4,.B. i~iJodnk ims, "A li ghl E~Cff den' R eS i dte-C0lad utoila] Ardl~i tec Lur-e forNOW La IM crs,' PI-0C. of the ILI.. (Letter), Vol1. 66, [Io. f; Ju~Ile 19 B,PP) 700- 702.
9. ~ 11 Po l . rd , I pl 0Iicn La Li on of P u e - hcr' Tic Iran sforms , 1.1 -C t1r,11icshtters, V10. 12, hi "I. Jl 17) p3839
LetteS, Vol. Il, 1976, p)p1 2 94 11 - Ie22m c
Digital Fi 1 ers, Llectronic Letters, Vol. 13, No C6, Prh17, 1977.12. G. lBioul , M. OwyS, and J.j. Quisq'wartcr," A Conmputat 1011 Schcz,.,Ie for ailAdder Ml'lculo (2t~-1I) ," 1);r a1!ops CrJ1PbI hrS:Ite ni( 97!-)~ 309-318
I PP Jill.
APIH-I-N TX A:
An exazapIe of a PLA controlledI 2n+! adder, for n=3, isdiagrammed in Figure A.T. Tn this figtulre, tlhe sm A=5 and l5
Inodulo (231+1) (i: (5+6) modulo 9=2) i. eult li n ed. Also, theaddli ion dl 'y for = 2. ha.,ed on Commolrcially availale hle dvare,J co!p}uted to be I O+2 +5=G5nsec. The ,-rafI a lh. i Lecture of tlheadder is diagrammed in Figure A.2.
FTGURiE CAPTTONS:
Figure 1: Modulo 2n+IALU
Figure 2: Ifemory Comprossion for S--
Fi lure 3: Modulo p. MliiLp ier
FigTure 4 : Archi tectures
FI" g re 5: T arge Modul1i Iu, Lip1Ier
F rure A. 1 Exanmpl e Problem
Figure A.2: General Architecture
I
I
iI -
z4
xI
0I
z2 x n + I Z b0 i
y - Xy I F X ~L n
S=yIF Y Lt
Ln.L
Figr H0~o~,1111 2 +1- ALU
n+1
xz
fl+. if OVR=O0 n~y l
n~l If OVfli
SELECT
Figure 2. Memory Compression For s+
CL
cNI I
CI-,
GO.
y +
'AyX-Ys + A 2 :C SEQUENTIAL
2'ns
PARALLEL
A1:X+Y y4~L
A1:X-y 2
-60- 60 60 18.9ns.
C)Vid "
YI)
QQ)Cl) -I
-4
(U
-1)U)
0I
bD*1-4I
leri
A B3
DE LAY
__ FOR N :i2
S:0V MOD 2~
* LS[3j
A BSW
MSB PLA /0 20 n
IAA/DP DP DP n
0X DON'T CARE
SIATESDP= DOUB3LE POLE
_00
=21 ELECTRONIC.00
Fgu ro A.].: E:'"ImPle Protaem
0I
Qc:
Co +)
I< Cs- + (Ili-- ~-.- u -
Co-)C (c)
c C) >
CD)
00
biD
- z -o
______-#~ ".