NTTs
Transcript of NTTs
-
8/6/2019 NTTs
1/77
Number Theoretic
Transforms
G.A. Jullien
Extract from Number Theoretic Techniques in Digital Signal
Processing
Book Chapter
Advances in Electronics and Electron Physics, Academic Press Inc., (Invited), vol. 80, Chapter 2, pp. 69-163, 1991.
Introduction
This chapter discusses the application of number theory to the intensive computations required in
many Digital Signal Processing (DSP) systems. Number theory was always looked upon as one of
the last vestiges of pure mathematical pursuit, where the idea of applying the theories to practical
purposes was irrelevant. Over the last few decades we have been discovering very practical
applications for number theory; among these are applications in coding theory, communications,
physics and digital information (Schroeder (1985)). Our specific interest in this work is the
application of number theory to the mechanics of arithmetic computations themselves. Modern-day
computer technology is heavily dependent on the manipulation of numbers; in fact, the automation
of arithmetic operations was the quest of all the early computer pioneers, from Pascal to Babbage to
von Neumann. The involvement of number theory in this endeavor, however, has been strangely
limited. The use of number theory in helping to perform the basic arithmetic operations has not
been accepted by the commercial world. Perhaps the reason for this is the pervasiveness of the binary
number system in the mechanization of computer arithmetic. Many hardware designers
have little or no knowledge that alternative ways of computing arithmetic exist!
There is one field in which the high speed arithmetic manipulation of numbers is of paramount
importance. This is the field of Digital Signal Processing (DSP), and it is here that alternative
arithmetic hardware may find a niche. This is indeed the subject of this paper.
Why is it that some DSP systems are so computationally intensive? Typically the DSP system we
are discussing will be involved with the processing of streams of data that are indeterminately long;
the requirement of the system is to produce processed output data at the same rate that the input
data is entering the processor. This is referred to as real-time processing. Often the arithmetic
manipulations are simple in form (cascades of additions and multiplications in a well-defined
structure) but the number of operations that have to be computed every second can be enormous. It
will be constructive at this point to consider an example to illustrate the processing power required
by a typical DSP system.
Let us consider a basic problem; the filtering of television pictures in real-time. We will assume that
a two-dimensional filter is required and is given by the following equation:
y_{m,n} = \sum_{i=-2}^{2} \sum_{j=-2}^{2} w_{i,j} \, x_{m-i,n-j}
Here w_{i,j} is the filter kernel, and represents the weights of a two-dimensional averaging procedure
around each output sample (y_{m,n}). The averaging is based on a neighborhood of input samples
{x_{i,j}}, which extends 5 samples in each dimension and is centered on the output pixel position. The double
summation, in fact, represents two-dimensional convolution. Convolution is a basic operation to all
linear signal processing problems since it represents the mapping function of an input signal to an
output signal through a linear system. The real time constraint is important since, as we will see, an
extremely large computational requirement will be necessary in order to construct the filter.
In order to see the speed of operation required by our filter, we will perform some elementary
calculations based on typical requirements for the filter. We assume a 525 line scan with 500
picture elements (pixels) in each line. We also assume that each primary colour is to be filtered
independently with the 5 x 5 filter. With an interlaced system we need to filter 30 images every
second. Each output pixel (made up of three primary colour sub-pixels) requires 75
multiply/accumulate operations (M/A Ops) to implement the filter. There are 262,500 pixels in each
image and so 1.96875 × 10^7 M/A Ops are required for each image. Since we need to filter 30
images per second, the total number of M/A Ops per second is 5.9 × 10^8. Assuming that an M/A
Op is considered a basic operation in the processor used to filter the image, then we need a machine
that delivers 590 Mega Operations per second (MOPS). In these terms we require a super
computer to deliver this performance!
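The operation counts above are easy to check; a minimal sketch of the arithmetic, using only the figures quoted in the text:

```python
# Back-of-envelope throughput estimate for the 5x5 real-time video filter.
lines, pixels_per_line = 525, 500   # scan format assumed in the text
frames_per_second = 30              # interlaced system
taps = 5 * 5                        # 5x5 filter kernel
colours = 3                         # each primary colour filtered independently

ops_per_pixel = taps * colours                # 75 M/A Ops per output pixel
pixels_per_image = lines * pixels_per_line    # 262,500 pixels
ops_per_image = ops_per_pixel * pixels_per_image
ops_per_second = ops_per_image * frames_per_second

print(ops_per_image)    # 19687500, i.e. 1.96875 x 10^7
print(ops_per_second)   # 590625000, i.e. about 5.9 x 10^8 (590 MOPS)
```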
Clearly the implementation has to be somewhat different from a large mainframe supercomputer; it
would certainly be difficult to fit one into a normal-sized television set!
The filtering operation has the advantage of not requiring the full features of a general purpose
computer (the filtering operation does not have to be performed within the typical multi-user,
complex operating system environment of a main frame machine) and the computations are simply
repeated arithmetic operations in a predetermined structure. In fact, it is rather pointless to talk about
the computational requirements in normal general-computer terms (MIPS, MOPS, MFLOPS, etc.);
it is better to talk in direct terms of the data flow requirements, that is, the throughput rate of the
filter in samples per second (in this example, a sample is a pixel). For our example filter we have to
filter pixels at a rate of 7.875 × 10^6 pixels per second. If we provide arithmetic hardware which
contains 75 multiply/accumulate circuits, then each circuit has to operate within 120 ns. If we use
currently available integrated circuits that operate within 60 ns, then we can build a circuit with only
38 integrated circuits for the basic arithmetic operations, each performing two operations
sequentially. We have therefore reduced the original requirement of a supercomputer to a
system containing a relatively small number of 'chips'.
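For reference, the filter equation itself can be modelled in a few lines of software (a behavioural sketch only, not the hardware structure under discussion; treating out-of-range samples as zero at the image border is an assumption):

```python
# Direct implementation of y[m,n] = sum_{i,j} w[i,j] x[m-i,n-j] for a 5x5 kernel.
def filter2d(x, w):
    """x: 2-D list of samples; w: 5x5 kernel indexed w[i+2][j+2] for i,j in -2..2."""
    rows, cols = len(x), len(x[0])
    y = [[0.0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            acc = 0.0
            for i in range(-2, 3):
                for j in range(-2, 3):
                    mi, nj = m - i, n - j
                    # samples outside the image are taken as zero (an assumption)
                    if 0 <= mi < rows and 0 <= nj < cols:
                        acc += w[i + 2][j + 2] * x[mi][nj]
            y[m][n] = acc
    return y
```

With the averaging kernel w_{i,j} = 1/25, every interior pixel of a constant image is returned unchanged.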
It is the purpose of this paper to discuss ways in which number theoretic techniques can be used to
perform DSP operations, such as this example filter, by reducing the amount of hardware involved
in the circuitry. Since we are going to be directly concerned with the implementation of DSP
arithmetic, it might be helpful to finish off our example by considering the arithmetic details of each
Multiply/Accumulate operation.
The representation of each primary colour pixel by 8-bits is more than adequate for very high
definition of the image and the quantization of each filter weight by 8-bits is also very generous.
This naturally leads to a 16-bit multiplier structure. Since 25 such products are accumulated
for each output pixel, an extra 5 bits are required. With 21 bits we can, in
fact, compute each output pixel with no error in the computation process. Since 21 bits is a
pessimistic upper bound for the computational dynamic range (only required if both the filter
coefficients and the input pixels are at their maximum values - a very rare occurrence), we can easily
specify a 20-bit range for accurate computation.
This calculation leads us to an important observation: we do not need floating-point arithmetic.
We can, in fact, perform error-free computation within the dynamic range of a typical floating-point
standard mantissa length, with no manipulation of the exponent required at all! This example
represents a fine-grained computation (one in which successive parts of the computation are of a
limited dynamic range) and is typical of many DSP algorithms. The limitations we are able to
impose on the arithmetic calculations will be a great help in reducing the hardware required in the
very high-speed throughput systems to be discussed in this paper.
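The word-length bookkeeping above can be restated in a couple of lines; the 5 guard bits come from ceil(log2 25):

```python
import math

pixel_bits = 8                            # input pixel word length
weight_bits = 8                           # filter coefficient word length
product_bits = pixel_bits + weight_bits   # 16-bit products
guard_bits = math.ceil(math.log2(25))     # growth from accumulating 25 products

print(product_bits + guard_bits)          # 21 bits: error-free accumulation
```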
For the purposes of this paper we will restrict ourselves to the computation of linear filter and
transform algorithms which have, for the one-dimensional case, the inner product form:
y_n = \sum_{m=0}^{N-1} x_m \, w^{f(n,m)}

In the case of linear filtering (convolution), f(n,m) = n-m; for the case of a linear transform with the
convolution property, w is a primitive Nth root of unity. For the DFT, w^{f(n,m)} = e^{-i(2\pi/N)nm}.
Although this limitation appears overly restrictive, this general inner-product form probably
accounts for the vast majority of digital signal processing functions implemented commercially, and
also leads us to a body of literature that uses number theory to indirectly compute the convolution
form of the inner product formulation. We will also restrict our investigation of the application of
number theoretic techniques to those that use finite field arithmetic directly in the computations.
This will not cover those algorithms whose structure is derived via number theoretic techniques, and
whose computation is performed over the usual complex field (e.g. Winograd Fourier Transforms);
there is a vast amount of literature already available on this separate subject, and the reader is
directed to the work by McClellan and Rader (1979) as a starting point.
Our discussion in this review will concentrate on two areas: the use of finite ring/field operations
for computing ordinary integer computations (Residue Number System) for high speed signal
processing and the use of algorithms defined over finite rings/fields in which not only the arithmetic
is performed over such finite structures, but also the properties of the algorithm depend on the
properties of the finite field/ring over which the algorithm is defined. This is an important
differentiation. Our sole concern here is to examine arithmetic and general computations over finite
mathematical systems.
Our starting point should be the recognized origin of the modern field of digital signal
processing, the Fast Fourier Transform (FFT). The FFT is not a new transform, but really a name
for a set of algorithms that compute the Discrete Fourier Transform (DFT) with far fewer
arithmetic operations. In particular, multiplications, traditionally a hardware- and/or time-consuming
arithmetic operation, are reduced considerably. With the publication of the first FFT algorithm
(Cooley and Tukey (1965)), digital signal processing began to emerge as a field in its own right.
Only a few years after that landmark paper, the first applications of finite field transforms to
computing convolution appeared. We will not attempt to retrace the history here, but will discuss
these two discoveries as initial steps on the road to the application of number theory to the digital
processing of signals. The discovery of efficient ways of computing the DFT allowed both the
computation of spectra, and the implementation of finite impulse response filters to be carried out
with much reduced computational time. The most interesting use of the FFT, in the initial years, was
in the latter application. The ability of the DFT to indirectly compute the inner product form for FIR
filters comes from its convolution property. The result is surprising because it seems, on the
surface, to involve more work in that three transforms have to be computed. It is also surprising that
a transform with a basis set involving trigonometrical functions can be used to efficiently compute
convolutions between, say, integer sequences, but it does indeed work.
In order to examine this remarkable property we will briefly discuss FIR filters and then introduce
the DFT.
A. Finite Impulse Response Digital Filtering using DFTs
Finite impulse response (FIR) digital filters produce an output based on a weighted sum of present
and past inputs. Mathematically this is a weighted averaging scheme represented by:
y_k = \sum_{n=0}^{N-1} h_n \, x_{k-n}
where {hn} is the weighting sequence, and {xn} is the input sequence.
The weighting sequence, {hn} characterizes the filter, and it is the determination of the sequence that
corresponds to the design of the filter. The sequence is sometimes referred to as the impulse
response based on the equivalent characterization of analog filters. In the digital case the impulse
function is a perfectly realizable sequence {1, 0, 0, 0, .....}. If we use this impulse function as the x
sequence, the output of the filter will be the {h_n} sequence. We can also define a frequency
response for the filter by using an input sequence that is a complex sinusoid, x_n = e^{iωn}. Here, the
frequency is simply an angular increment, ω, which locates the value of the input, at sample n, on
the unit circle at an angle of ωn. As with continuous systems, we can invoke the use of a transform
to map convolution into multiplication. The transform we use is the z-transform (Jury (1964))
which is defined as:
X(z) = \sum_{n=0}^{N-1} x_n \, z^{-n}
Using this definition, and the convolution property, we are able to define a transfer function for the
FIR filter as:
H(z) = Y(z)/X(z)

By letting z = e^{iω}, we can find the magnitude and phase shift (polar coordinates) of H(e^{iω}), and
hence the frequency response (H(e^{iω}) vs ω). We note that a simple inversion procedure exists; we
simply examine the coefficients of H(z) and assign them to delayed samples of the impulse
response. H(z) can also be obtained by a polynomial division of Y(z) by X(z); for the FIR filter we
know that this division will always have finite length and so H(z) (for z an indeterminate complex
number) has only zeros. For filters which have an infinite length response (infinite impulse
response (IIR) filters), H(z) contains poles as well as zeros; i.e. we represent H(z) as a rational
function of polynomials. Closed solutions exist for determining the impulse response for IIR
filters based on the roots of H(z) (Jury (1964)). We will not consider IIR filters here since our
interest is in implementing a finite convolution process. IIR filters possess the property that they
are able to implement arbitrary frequency magnitude responses with far fewer multiplications
than equivalent FIR filters. They have unfortunate side-effects, however: possible instability due
to inexact multipliers; the possibility of producing limit-cycle oscillations; the possibility of overflow
oscillations; and difficulty in implementing linear phase responses. Although a great deal of effort has
been spent on studying such filters (see, for example, Rabiner and Gold (1975)), it is not clear that
the implementation efficiency suitably compensates for the problems. Perhaps the best solution is
to concentrate on efficient implementation strategies for FIR filters. This is the main subject of this
paper.
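The FIR convolution sum, and the impulse-response observation above, can be sketched directly (coefficient values here are arbitrary; samples before k = 0 are taken as zero):

```python
# Direct FIR filter y[k] = sum_n h[n] x[k-n] -- a minimal sketch of the
# convolution sum; samples before k = 0 are taken as zero.
def fir(h, x):
    y = []
    for k in range(len(x)):
        acc = 0.0
        for n in range(len(h)):
            if k - n >= 0:
                acc += h[n] * x[k - n]
        y.append(acc)
    return y

# Feeding in the digital impulse {1, 0, 0, ...} returns the weighting
# sequence {h_n} itself, as the text observes.
h = [0.5, 0.3, 0.2]          # arbitrary example coefficients
impulse = [1.0, 0.0, 0.0, 0.0]
print(fir(h, impulse))       # [0.5, 0.3, 0.2, 0.0]
```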
II. Discrete Fourier Transforms
The DFT is defined as a transformation that maps a sequence x_n → X_k where:

X_k = \sum_{n=0}^{N-1} x_n e^{-j2\pi nk/N} = \sum_{n=0}^{N-1} x_n e^{-jW_N nk}    (1)

with W_N = 2\pi/N. (In what follows we will also write W_N(x) as a shorthand for the kernel e^{-jW_N x}.)
In terms of the Fourier transform, this may be regarded as an approximation to the Fourier integral,
but it turns out that it is a transform in its own right with exact properties. Two of the properties that
are important to the subject of this paper are the inverse and convolution property. In fact we need
the first to prove the second.
A. Inverse and Convolution Property
One important property is that the transform is reversible; i.e. the inverse mapping X_k → x_n exists,
given by:

x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k e^{jW_N nk}    (2)
The proof that the inverse exists yields the basis for the second important property, the convolution
property.
The easiest way to prove that (2) is the inverse of (1) is by direct substitution. We will make the
substitution using an appropriate dummy variable, m; the result should show that the sum collapses
to the single term with m = n.

x_n = \frac{1}{N} \sum_{k=0}^{N-1} \sum_{m=0}^{N-1} x_m e^{-jW_N mk} e^{jW_N nk}
interchanging the order of summation and gathering terms we find:

x_n = \frac{1}{N} \sum_{m=0}^{N-1} x_m \sum_{k=0}^{N-1} e^{-jW_N k(m-n)}
or

x_n = \frac{1}{N} \sum_{m=0}^{N-1} x_m \sum_{k=0}^{N-1} W_N(k[m-n])    (3)
The function W_N(k[m-n]) displays rather an interesting property.
W_N(k[m-n]) is a function on the unit circle in the complex plane with N unique values: 1, e^{-j[m-n]W_N},
e^{-j2[m-n]W_N}, ..., e^{-j(N-1)[m-n]W_N} for k = 0, 1, 2, ..., N-1. For [m-n] = KN, where K is some integer,
all values of the function are 1 and so the sum \sum_{k=0}^{N-1} W_N(k[m-n]) evaluates to N. For [m-n] ≠ KN
we can invoke the fact that the function yields a finite geometric series with the sum:

\sum_{k=0}^{N-1} W_N(k[m-n]) = \frac{1 - e^{-jN[m-n]W_N}}{1 - e^{-j[m-n]W_N}} = 0,

since the numerator term e^{-jN[m-n]W_N} = e^{-j2\pi[m-n]} = 1 while the denominator is non-zero.
Thus, the only non-zero values generated by (3) are for [m-n]=KN, and for these values we obtain
the result xn=xn+KN. If we restrict both the sample and transformation domain to be periodic, with
period N, then we have proved that the transform is invertible.
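The invertibility just proved is easy to confirm numerically; a brute-force sketch of (1) and (2) (an O(N^2) evaluation, exact up to floating-point rounding):

```python
import cmath

# Forward DFT, equation (1).
def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

# Inverse DFT, equation (2).
def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * n * k / N) for k in range(N)) / N
            for n in range(N)]

x = [1, 2, 3, 4]
x_back = idft(dft(x))
print([round(v.real, 10) for v in x_back])   # [1.0, 2.0, 3.0, 4.0]
```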
Based on this result for the existence of an exact inverse for the DFT, we can now proceed to
examine the convolution property.
We start by considering the result of multiplying two transform domain sequences together.
Assume that the sample domain sequences are xn and yn with transform domain sequences of Xk
and Yk. The result of multiplying the transform domain sequences is a new sequence Zk = Xk . Yk .
Using the DFT of the x sequence explicitly, we have:

Z_k = \sum_{n=0}^{N-1} x_n \, W_N(kn) \cdot Y_k    (4)

Inverting equation (4) yields:

z_m = \frac{1}{N} \sum_{k=0}^{N-1} \sum_{n=0}^{N-1} x_n \, W_N(kn) \cdot Y_k \cdot W_N(-km)    (5)

We now interchange the order of summation to find:
z_m = \sum_{n=0}^{N-1} x_n \left( \frac{1}{N} \sum_{k=0}^{N-1} Y_k \, W_N(k[n-m]) \right)    (6)

The inner expression of equation (6), in parentheses, is the sequence y_{[m-n]}. Since {y} is periodic,
we can retain values from a single period by forming y_{[m ⊕_N (-n)]}. Note that here we are using the
notation ⊕_N to indicate addition modulo N. This notation will be expanded later.

From (6) we obtain the final result for the output sequence:

z_m = \sum_{n=0}^{N-1} x_n \, y_{[m ⊕_N (-n)]}    (7)
zm is therefore the convolution of the sequence {x} with the sequence {y}. It is the convolution of
two periodic sequences, and is often called cyclic, or periodic, convolution. As noted before, we only
need to know the samples of a single period in order to perform the calculation. If we need to
determine the periodic convolution of two sequences, then we can perform the calculation indirectly
by taking DFTs of each sequence, multiplying in the transform domain, and inverting the result. If
we require the convolution of two aperiodic sequences, it turns out that we can still use periodic
convolution, provided that we are prepared to perform several periodic convolutions on sub-sequences
of the input, and assemble the final sequence from partial outputs of the periodic
convolutions; these are the so-called overlap/save and overlap/add procedures. We are not able to
give full details of such procedures here, but there are many fine textbooks which detail the two
approaches (e.g. Rabiner and Gold (1975)).
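The indirect route can be sketched end to end: transform both sequences, multiply term by term, and invert, then compare with the direct cyclic convolution sum of equation (7) (toy O(N^2) DFTs, not fast algorithms):

```python
import cmath

# DFT with selectable kernel sign: -1 gives the forward transform,
# +1 (scaled by 1/N outside) gives the inverse.
def dft(x, sign=-1):
    N = len(x)
    return [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * n * k / N)
                for n in range(N)) for k in range(N)]

# Direct evaluation of the cyclic convolution sum, equation (7).
def cyclic_conv_direct(x, y):
    N = len(x)
    return [sum(x[n] * y[(m - n) % N] for n in range(N)) for m in range(N)]

# Indirect evaluation: DFT both sequences, multiply, invert the product.
def cyclic_conv_dft(x, y):
    N = len(x)
    Z = [Xk * Yk for Xk, Yk in zip(dft(x), dft(y))]
    return [v / N for v in dft(Z, sign=+1)]

x, y = [1, 2, 3, 4], [5, 6, 7, 8]
direct = cyclic_conv_direct(x, y)
indirect = [round(v.real) for v in cyclic_conv_dft(x, y)]
print(direct)    # [66, 68, 66, 60]
print(indirect)  # matches the direct result
```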
The use of an indirect computation of convolution is advantageous when the amount of computation
involved is considerably less than that required using the direct convolution sum. An N-point
periodic convolution requires N^2 multiplications and N^2 - N two-operand additions. Can we
reduce this by an indirect computation involving DFTs? The answer is yes, but only if the
transform can be computed efficiently. An algorithm that efficiently computes a DFT is called a fast
algorithm, and the computation of a DFT with such an algorithm is called a Fast Fourier Transform
(FFT). Note that the FFT is not a separate transform, just a fast (or efficient) way of computing a
DFT.
For simplicity we will discuss the development of fixed-radix decimation algorithms that use a
power of two as the radix. These were the first fast algorithms to be developed, and they still represent
the major implementation procedure for commercial hardware and software solutions.
III. Fast Fourier Transform (FFT) Algorithms
We will briefly examine the classical decimation-in-time algorithm to appreciate the magnitude of
savings that can occur in computing the DFT using a fast algorithm.
A. Decimation in Time FFT Algorithm
There are two basic decimation (sequence slicing) procedures: decimation over the sample domain
(decimation in time, DIT) or decimation over the transform domain (decimation in frequency, DIF).
We will consider the first procedure, with the assumption that N contains some power of 2 factors.
The transformation x_n → X_k can be written:

X_k = \sum_{n=0}^{N-1} x_n \, W_N(nk)    (8)
and we can decompose the computation by computing (8) for the even and odd samples, separately,
and then put the two results back together. This yields:
X_k = \sum_{n=0}^{N/2-1} x_{2n} \, W_{N/2}(nk) + W_N(k) \sum_{n=0}^{N/2-1} x_{2n+1} \, W_{N/2}(nk)    (9)
We have now reduced the computational requirement for an N-point DFT to that required for two
N/2-point DFTs and an extra N multiplications. If we are interested in reducing multiplications, then
we have reduced the requirement from N^2 multiplications to N^2/2 + N multiplications. Although this
may not represent a large reduction, we are able to repeat the process by decimating each of the
N/2-point DFTs to two N/4-point DFTs. We can continue in this way until we run out of factors of N.
The greatest decimation occurs for N a power of 2, in which case the basic computational element
reduces to a 2-point DFT (which can be computed using real numbers). In this case the number of
multiplications reduces to (N/2)log_2 N (in fact some of these multiplications are trivial, but we will use
this figure as an upper bound). The calculation reduces to the calculation of (N/2)log_2 N two-sample
DFTs, each with a multiplication by a power of W. This multiplier is popularly referred to as a 'twiddle
factor'.
The basic computational element is shown in the signal flow graph in Fig 1 below, and is often
referred to as a 'butterfly' based on its structure. The two inputs A and B produce the two outputs:

C = A + B·W_N^r
D = A - B·W_N^r

Figure 1 Butterfly calculation with twiddle factor multiplier
Notice that there is only one complex multiplication and two complex additions per butterfly. An
example is shown in Figure 2 for an 8-point transform using (N/2)log_2 N = 4 × 3 = 12 butterfly
calculations. Appropriate pairs of shaded nodes represent the addition/subtraction network of the
butterfly calculation.
Note that although there are 12 multiplications indicated, in fact only 5 of these are non-trivial (i.e.
not a multiplication by unity). Before we leave this brief introduction to fast algorithms it can be
observed that the output sequence has been scrambled in the process of reducing the number of
multiplications, and that the structure is not the same from stage to stage. Algorithms are available
that do not scramble the output data (or require the input data to be scrambled), and there are also
algorithms that maintain a constant structure, but not both together (a sort of Heisenberg uncertainty
principle for FFTs!).
[Signal flow graph: three stages of shaded butterfly nodes map the naturally ordered inputs x_0 ... x_7 to the bit-reversed output order X_0, X_4, X_2, X_6, X_1, X_5, X_3, X_7; twiddle factors W_8^0 ... W_8^3 appear on the butterfly lower branches, with -1 labels on the subtract paths.]

Figure 2 FFT Signal Flow Graph for 8-point DFT transform
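The complete decimation-in-time recursion can be sketched as follows: each call splits its input into even and odd samples, transforms the halves, and recombines them with butterflies and twiddle factors (a readable recursive sketch, not an optimized in-place implementation; N is assumed to be a power of 2):

```python
import cmath

# Recursive radix-2 decimation-in-time FFT, following equation (9).
def fft(x):
    N = len(x)
    if N == 1:
        return list(x)
    even = fft(x[0::2])    # N/2-point DFT of the even samples
    odd = fft(x[1::2])     # N/2-point DFT of the odd samples
    X = [0j] * N
    for k in range(N // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / N) * odd[k]
        X[k] = even[k] + twiddle            # butterfly: C = A + B*W_N^r
        X[k + N // 2] = even[k] - twiddle   # butterfly: D = A - B*W_N^r
    return X

X = fft([1, 1, 1, 1, 0, 0, 0, 0])
print(round(abs(X[0]), 6))   # 4.0, the DC term (sum of the samples)
```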
It seems to be a law of nature that as one attempts to reduce the number of arithmetic operations (or
at least the most costly: multiplication) the structure suffers. This comes back to haunt us when we
examine VLSI implementations of such algorithms.
There are many works available on the myriad of fast algorithms that have been discovered since the
initial algorithm was published (for example Blahut (1985)). It has been our intent here simply to
point out that these algorithms exist and to briefly discuss one of the earliest.
B. Use of the DFT in indirect computation of convolution
Based on the fact that DFTs can be computed very efficiently, it becomes clear that the operation of
convolution can also be computed much more efficiently than directly implementing the convolution
sum. Assuming that we wish to compute circular convolution, then an N-point convolution can be
computed using 2 DFTs and an inverse DFT. Since an inverse DFT is of the same complexity as a
forward DFT, the number of multiplications involved is 3N log_2 N + N. The reduction factor in
multiplications for an N-point circular convolution is (1/N){1 + 3 log_2 N}. For the case of
filtering with a fixed set of weights, we can store the DFT of the weight sequence and obtain a
greater reduction of (1/N){1 + 2 log_2 N} multiplications. These savings are reduced if one is
interested in aperiodic convolution (the normal filtering operation), but for large filter kernels (large
N) the savings can be significant enough to warrant using this indirect approach.
We note one unfortunate fact: even if we are convolving only real sequences, it is still necessary to
use complex arithmetic. In fact our reduction calculations above are predicated on the need for
complex sequence convolution. There is another unfortunate effect: since the roots of unity are
obtained from transcendental functions, we are not able to compute convolutions without error; even
the convolution between sequences of integers, which in the direct form can be easily computed
without invoking any errors, will have to be computed using approximations to the roots, and will, in
general, invoke errors in the final convolution sequence.
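The effect is easy to observe: cyclic convolution of two small integer sequences through the DFT yields results that are only approximately integer and must be rounded back (the sequence values are arbitrary):

```python
import cmath

# DFT with selectable kernel sign; the roots of unity come from cmath.exp
# and are therefore only floating-point approximations.
def dft(x, sign=-1):
    N = len(x)
    return [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * n * k / N)
                for n in range(N)) for k in range(N)]

x, y = [1, 2, 3, 4], [4, 3, 2, 1]
Z = [a * b for a, b in zip(dft(x), dft(y))]
z = [v / 4 for v in dft(Z, sign=+1)]          # inverse DFT of the product

errors = [abs(v.real - round(v.real)) for v in z]
print(max(errors))                 # tiny, but in general non-zero
print([round(v.real) for v in z])  # [24, 22, 24, 30], exact after rounding
```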
There are mathematical systems in which roots of unity can be generated and used in calculations
without any approximations. In fact, approximations are meaningless. The mathematical systems
are finite algebras, and we will devote the remainder of this chapter to discussing their use in the
computation of DSP algorithms.
It is important to establish some basic concepts and theories relating to the subject of finite algebra;
the following section is included for that purpose.
IV. Finite Algebras
For our specific purposes we will be using algebraic structures that emulate normal integer
computations. The entities to be discussed are groups, rings and fields which contain a finite
number of integers as their elements and allow arithmetic operations that are modifications of
normal integer arithmetic operations. This will not be an exhaustive review; the reader is referred to
standard texts on number theory (e.g. Dudley, 1969, and Vinogradov, 1954) as well as texts on
modern algebra (e.g. Burton, 1970, and Schilling and Piper, 1975).
As a starting point, we will formally define the notion of a ring and field.
A. Rings and Fields
Definition 1:
If R is a non-empty set on which there are defined binary operations of addition, +, and
multiplication, ·, such that the following postulates (I) - (VI) hold, we say that R is a ring:
(I) Commutative law of addition: a+b = b+a
(II) Associative law: (a+b)+c = a+(b+c)
(III) Distributive law: a·(b+c) = a·b + a·c
(IV) Existence of an element, denoted by the symbol 0, of R such that a+0 = a for every a ∈ R
(V) Existence of an additive inverse. For each a ∈ R, there exists x ∈ R such that a+x = 0
(VI) Closure, where it is understood that a, b, c are arbitrary elements of R in the above
postulates

A ring in which multiplication is a commutative operation is called a commutative ring. A ring with
identity is a ring in which there exists an identity element, 1, for the operation of multiplication:
a·1 = 1·a = a for all a ∈ R.
The ring of integers is a well known example of a commutative ring with identity. In abstract
algebra, the elements of a ring are not necessarily the integers, or even necessarily numbers; e.g. they
can be polynomials. Given a ring R with identity 1, an element a ∈ R is said to be invertible, or to
be a unit, whenever a possesses an inverse with respect to multiplication. The multiplicative inverse,
a^{-1}, is an element such that a^{-1}·a = 1. The set of all invertible elements of a ring is a group with
respect to the operation of multiplication and is called a multiplicative group.
Definition 2:
If R is a ring and 0 ≠ a ∈ R, then a is called a divisor of zero if there exists some b ≠ 0, b ∈ R, such
that a·b = 0.
Definition 3:
An integral domain is a commutative ring with identity which has no divisors of zero.
Definition 4:
A ring, F, is said to be a field provided that the set F-{0} is a commutative group under
multiplication.
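These definitions can be illustrated by brute force over small residue rings: Z_6 contains divisors of zero (e.g. 2·3 ≡ 0 (mod 6)), so it is not a field, while in Z_7 every nonzero element is invertible:

```python
# Brute-force search over a small residue ring Z_m (exhaustive, for illustration).
def zero_divisors(m):
    return [a for a in range(1, m)
            if any((a * b) % m == 0 for b in range(1, m))]

def inverses(m):
    # map each invertible a to its multiplicative inverse a^{-1} (mod m)
    return {a: b for a in range(1, m) for b in range(1, m) if (a * b) % m == 1}

print(zero_divisors(6))   # [2, 3, 4] -- Z_6 is not a field
print(inverses(7))        # {1: 1, 2: 4, 3: 5, 4: 2, 5: 3, 6: 6} -- Z_7 is a field
```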
Viewed otherwise: a field is a commutative ring with identity in which each non-zero element
possesses an inverse under multiplication. It follows (McCoy, 1972) that every field is an integral
domain. Rings (fields) with a finite number of elements are called finite rings (fields). The ring of
integers modulo m, denoted here as Z_m, is an example of a finite ring.
In every finite field with p elements, the nonzero elements form a multiplicative group. This
multiplicative group of order p-1 is cyclic; i.e. it contains an element, a, whose powers exhaust the
entire group. This element is called a generator, or a primitive root of unity, and the period or order
of a is p-1. The order of any element x in the multiplicative group is the least positive integer t such
that x^t = 1, with x^s ≠ 1 for s ∈ [1, t). The order t is a divisor of p-1 and x is called a primitive t-th root of
unity.

For every element x of the set F-{0}, the mapping defined by x = a^t, t ∈ {0, 1, ..., p-2}, is an
isomorphism between the additive group of integers with addition modulo p-1 and the multiplicative
group of the field F. The integer t is called the index of x relative to the base a, denoted ind_a x.
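These facts can be checked by brute force for a small prime (p = 11 is an arbitrary choice): find a generator by computing element orders, then tabulate the index ind_a(x) of every nonzero element:

```python
# Order of a in the multiplicative group of Z_p: least t with a^t = 1.
def order(a, p):
    t, x = 1, a % p
    while x != 1:
        x = (x * a) % p
        t += 1
    return t

p = 11
generators = [a for a in range(2, p) if order(a, p) == p - 1]
a = generators[0]
# Index table: x = a^t (mod p)  <->  t = ind_a(x)
index = {pow(a, t, p): t for t in range(p - 1)}

print(a)                                 # 2, the smallest primitive root mod 11
print([index[x] for x in range(1, p)])   # [0, 1, 8, 2, 4, 9, 7, 3, 6, 5]
```

Every t in {0, ..., p-2} appears exactly once in the table, mirroring the isomorphism described above.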
1. The ring of residue classes
In order to describe the system, the notion of congruence should be introduced. The basic theorem,
known as the division algorithm will be established first.
Division Algorithm
For given integers a and b, b not zero, there exist two unique integers, q and r, such that a = b·q + r,
r ∈ [0, |b|). It is clear that q is the integer part of the quotient a/b. The quantity r is the least
positive (integer) remainder of the division of a by b and is designated as the residue of a modulo b,
written |a|_b. We will say that b divides a (written b|a) if there exists an integer k such that a = b·k.
Definition 5:
Two integers c and d are said to be congruent modulo m, written
c ≡ d (mod m), if and only if m|(c-d).
Since m|0, c ≡ c (mod m) by definition.
Another alternative definition of congruence can be stated as follows:
two integers c and d are congruent modulo m if and only if they leave the same remainder when
divided by m, |c|_m = |d|_m.
Definition 6:
A set of integers containing exactly those integers which are congruent, modulo m, to a fixed integer
is called a residue class, modulo m.
The residue classes (mod m) form a commutative ring with identity with respect to modulo m
addition and multiplication, traditionally known as the ring of integers modulo m or the residue ring,
and denoted Z_m. The ring of residue classes (mod m) contains exactly m distinct elements. The
ring of residue classes (mod m) is a field if and only if m is a prime number, because multiplicative
inverses, denoted a^(−1) (mod m) or ⟨a^(−1)⟩_m, then exist for each nonzero a ∈ Z_m. Thus the nonzero classes
of Z_m form a cyclic multiplicative group of order m−1, {1, 2, ..., m−1}, under multiplication modulo
m, isomorphic to the additive group {0, 1, ..., m−2} under addition modulo m−1.
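The cyclic structure of the nonzero residues modulo a prime is easy to verify by brute force; the following sketch (illustrative names, not from the chapter) finds the generators of the multiplicative group of Z_17 and checks the isomorphism with addition modulo 16:

```python
# For prime m, the nonzero residues {1, ..., m-1} form a cyclic group under
# multiplication mod m: some generator g produces them all as its powers.
def is_generator(g, m):
    return {pow(g, t, m) for t in range(m - 1)} == set(range(1, m))

m = 17
gens = [g for g in range(1, m) if is_generator(g, m)]
assert 3 in gens           # 3 generates the multiplicative group mod 17
assert 2 not in gens       # 2 has order 8, not 16, so it is not a generator

# Powers of a generator realise the isomorphism with Z_{m-1} under addition:
# multiplying elements corresponds to adding their indices modulo m-1.
assert pow(3, 5, m) * pow(3, 9, m) % m == pow(3, (5 + 9) % (m - 1), m)
```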
If m is composite, Z_m is not a field. Multiplicative inverses do not exist for those nonzero elements
a ∈ Z_m for which gcd(a, m) ≠ 1. Euler's totient function, denoted φ(m), is a widely used
number theoretic function and is defined as the number of positive integers less than m and relatively prime to it. It follows that the number of invertible elements of Z_m is equal to φ(m). If m
has the prime power factorization m = m1^e1 · m2^e2 · ... · mL^eL then we can write:

φ(m) = m (1 − 1/m1) (1 − 1/m2) ... (1 − 1/mL)
The Euler-Fermat theorem states that if c is an integer, m is a positive integer and
gcd(c, m) = 1, then c^φ(m) ≡ 1 (mod m).
The important consequence of this theorem is that there is an upper limit on the order of any
element a ∈ Z_m. Specifically, the order t is a divisor of φ(m).
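These two facts can be checked together in a short sketch (assumed helper names, not from the chapter): φ computed from the prime-power factorization formula, and the Euler-Fermat theorem verified for every invertible element of a small ring:

```python
from math import gcd

# Euler's totient via the factorization formula:
#   phi(m) = m * (1 - 1/m1) * (1 - 1/m2) * ... * (1 - 1/mL)
def phi(m):
    result, n, p = m, m, 2
    while p * p <= n:
        if n % p == 0:
            result -= result // p      # multiply result by (1 - 1/p)
            while n % p == 0:
                n //= p
        p += 1
    if n > 1:                          # a remaining prime factor > sqrt(m)
        result -= result // n
    return result

def order(a, m):
    t, x = 1, a % m
    while x != 1:
        x = x * a % m
        t += 1
    return t

m = 20
assert phi(m) == 8                        # 1,3,7,9,11,13,17,19 coprime to 20
for c in range(1, m):
    if gcd(c, m) == 1:
        assert pow(c, phi(m), m) == 1     # Euler-Fermat: c^phi(m) ≡ 1 (mod m)
        assert phi(m) % order(c, m) == 0  # the order of c divides phi(m)
```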
2. Galois Fields
For any prime m and any positive integer n, there exists a finite field with m^n elements. This
(essentially unique) field is commonly denoted by the symbol GF(m^n) and is called a Galois field
in honor of the French mathematician Évariste Galois. Since any finite field with m^n elements is a
simple algebraic extension of the field Z_m, a brief review of the basic concepts of the extensions of
a given field will be presented.
Let F be a field. Then any field K containing F is an extension of F. If λ is algebraic over F, i.e. if λ
is a root of some irreducible polynomial f(x) ∈ F[x] such that f(λ) = 0, then the extension field
arising from the field F by the adjunction of the root λ is called a simple algebraic extension, denoted
F(λ).
Each element of F(λ) can be uniquely represented as a polynomial:

a0 + a1 λ + ... + a_(n−1) λ^(n−1),  a_i ∈ F.

This unique representation closely resembles the representation of a vector in terms of the vectors
of a basis 1, λ, ..., λ^(n−1). Vector space concepts are therefore sometimes applied to extension fields, and F(λ) is considered as a
vector space of dimension n over F.
The field of complex numbers is an example of an extension of the field of real numbers; it is
generated by adjoining a root j = √−1 of the irreducible polynomial x² + 1.
If f(x) is an irreducible polynomial of degree n over Z_m, m a prime, then the Galois field with m^n
elements, GF(m^n), is usually defined as the quotient field Z_m[x] / (f(x)), i.e., the field of residue
classes of polynomials of Z_m[x] reduced modulo f(x). All fields containing m^n elements are
isomorphic to each other. In particular, Z_m[x] / (f(x)) is isomorphic to the simple algebraic
extension Z_m(λ), where λ is a root of f(x) = 0.
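A concrete illustration of this construction (a sketch under the assumption that elements are stored as bit vectors of polynomial coefficients; none of these names appear in the chapter) is GF(2³) = Z2[x]/(x³ + x + 1):

```python
# GF(2^3) built as Z2[x]/(x^3 + x + 1): each element a0 + a1*x + a2*x^2
# over Z2 is encoded as a 3-bit integer (bit i holds coefficient a_i).
IRRED = 0b1011  # x^3 + x + 1, irreducible over Z2

def gf_mul(a, b):
    # shift-and-xor polynomial multiplication over Z2, reduced mod IRRED
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0b1000:       # degree reached 3: subtract (xor) x^3 + x + 1
            a ^= IRRED
    return r

elements = list(range(8))    # the 2^3 field elements
# Every nonzero element has a multiplicative inverse, so this is a field:
for a in elements[1:]:
    assert any(gf_mul(a, b) == 1 for b in elements[1:])
# The root lambda = x (encoded 0b010) generates the 7-element
# multiplicative group, illustrating the simple algebraic extension Z2(lambda):
powers, p = set(), 1
for _ in range(7):
    p = gf_mul(p, 0b010)
    powers.add(p)
assert powers == set(elements[1:])
```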
Now that we have some basic background in finite algebraic structures, we can return to the
problem of indirectly computing convolution via transforms that exhibit the cyclic convolution
property. We will expand our definition of such transforms from the DFT to the NTT (Number
Theoretic Transform).
V. Number Theoretic Transforms
Number Theoretic Transforms (NTT's) were discovered independently by a number of researchers
(Nicholson 1971; Pollard 1971; Rader 1972) as a generalization of the standard DFT with respect
to the ring over which the transform is defined. The transform retains the same cyclic convolution
property as the DFT, but allows error free computation with the promise of much faster hardware
solutions.
The NTT has the same form as the DFT, with the exception that it is computed over a finite ring, or
field, rather than over the field of complex numbers.
X_k = Σ_p (n = 0 to N−1) [ x_n ⊗_p α^(nk) ]        (10)
where the ring, or field, is R(p) or GF(p) respectively. Note that a new notation has been introduced
here. In order to explicitly identify arithmetic operations over finite fields, or rings, we will use the
symbols ⊕_p and ⊗_p to represent addition and multiplication modulo p, respectively, and Σ_p to
represent summation modulo p. Where it is clear that a single ring or field is being referred to, the
subscript may be dropped (except on the summation). This notation will be used throughout the
rest of the paper, and will be extended where appropriate.
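A direct O(N²) realisation of the transform pair of equation (10) is easy to sketch (the function names are illustrative; `pow(n, -1, p)` computes a modular inverse and requires Python 3.8+):

```python
# Direct NTT over GF(p): X_k = sum over n of x_n * alpha^(nk), all mod p.
def ntt(x, alpha, p):
    N = len(x)
    return [sum(x[n] * pow(alpha, n * k, p) for n in range(N)) % p
            for k in range(N)]

# Inverse NTT: x_n = N^{-1} * sum over k of X_k * alpha^(-nk), all mod p.
def intt(X, alpha, p):
    N = len(X)
    inv_N = pow(N, -1, p)                  # N^{-1} mod p
    inv_alpha = pow(alpha, -1, p)          # alpha^{-1} mod p
    return [inv_N * sum(X[k] * pow(inv_alpha, n * k, p) for k in range(N)) % p
            for n in range(N)]

# alpha = 9 has order 8 in GF(17), so an 8-point transform exists:
x = [1, 2, 3, 4, 0, 0, 0, 0]
assert ntt(x, 9, 17) == [10, 16, 6, 11, 15, 13, 7, 15]
assert intt(ntt(x, 9, 17), 9, 17) == x     # the pair inverts exactly
```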
The NTT that has probably been studied more than any other is the Fermat Number Transform
(FNT). This, in large part, is due to the ability to use modified binary arithmetic to perform the
computations. It is not clear, given the present body of knowledge on VLSI implementations, that
this ability to use modified binary hardware is necessarily a desirable attribute of number theoretic
algorithms. It is the intent of this paper to present alternative forms of implementation of number
theoretic techniques that do not have to be predicated on the existence of standard computational
elements. This opens up the possibility of using a wider variety of number theoretic transforms. Let
us, however, start off by exploring number theoretic transforms based on Fermat numbers.
A. Fermat Number Transforms
Although the NTT allows convolution to be computed without error (unlike the DFT which
introduces errors due to the requirement for arithmetic with coefficients obtained from
transcendental functions), the use of such a finite field (or ring) transform is only appropriate when
the hardware has no greater complexity than normal computations over the reals. Satisfying this
requirement can be a problem when the field modulus is not chosen with the implementation in
mind. Among the earliest implementations considered (and still occupying a small, but current,
research interest) are Fermat Number Transforms (FNT's).
The FNT is initially defined over a Galois field GF(F_t), where the modulus F_t = 2^(2^t) + 1 is the t'th
Fermat number (FN) and the FN is prime (Fermat primes do not exist for all values of t). This
transform brings about three main implementation simplicities (Agarwal and Burrus, 1974):
1 The arithmetic is very similar to ordinary binary arithmetic because the modulus is close to a
power of 2.
2 Simple forms can be found for the multiplicative group generator that reduces the
multiplications required to compute the transform to modified bit-pattern shifts.
3 A standard Fast algorithm, obtained from an appropriate FFT, can be used to compute the
transform.
The idea of being able to compute over a finite field but with only slightly modified binary hardware
is obviously appealing, albeit constraining.
The FNT pair is defined as:

X_k = Σ_(F_t) (n = 0 to N−1) [ x_n ⊗_(F_t) α^(nk) ]        (11)

x_n = N^(−1) ⊗_(F_t) Σ_(F_t) (k = 0 to N−1) [ X_k ⊗_(F_t) α^(−nk) ]        (12)
There are some points that we should note:
1 Since N | F_t − 1, it is clear that we can generate a maximally composite N = 2^(2^t) and so use
one of the fast algorithms available for the DFT, to optimize a computational structure for the FNT.
2 If we restrict N to be even, N^(−1) will always exist.
3 The Fermat numbers F_1 = 5, F_2 = 17, F_3 = 257 and F_4 = 65537 are prime, and 3 is a primitive root for all four of these
primes.
From 1 and 3 we find that 3^(2^(2^t)) ≡ 1. If we wish to find a generator, α, for a multiplicative group of
order 2^b (this will allow a transform of length 2^b), then α^(2^b) ≡ 3^(2^(2^t)) ≡ 1; hence α ≡ 3^(2^(2^t − b)). This
last expression allows us to determine the relationship between suitable values of α (for ease of
implementation) and possible transform lengths. The easiest value of α to choose is clearly α = 2.
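These claims can be verified by brute force; the sketch below (illustrative helper names) checks the primality of F_1 through F_4, the primitivity of 3, and the generator formula α ≡ 3^(2^(2^t − b)):

```python
# Verify: F1..F4 are prime, 3 is a primitive root of each, and
# alpha = 3^(2^(2^t - b)) generates a multiplicative group of order 2^b.
def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def order(a, m):
    n, x = 1, a % m
    while x != 1:
        x = x * a % m
        n += 1
    return n

for t in range(1, 5):
    Ft = 2 ** (2 ** t) + 1
    assert is_prime(Ft)
    assert order(3, Ft) == Ft - 1              # 3 is a primitive root
    for b in range(1, 2 ** t + 1):
        alpha = pow(3, 2 ** (2 ** t - b), Ft)
        assert order(alpha, Ft) == 2 ** b      # a length-2^b transform exists

# e.g. for F2 = 17 and b = 3, alpha = 3^2 = 9: the 8-point generator used later
assert pow(3, 2 ** (2 ** 2 - 3), 17) == 9
```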
This will allow multiplication by powers of α to be replaced by shifts modulo F_t. Since F_t is close
to a power of 2, the hardware should be only moderately more complex than shifts modulo a power
of 2 (binary shifts).
It turns out that for α = 2, the order is 2^(t+1). This can be verified, using ordinary arithmetic, from:

2^(2^(t+1)) = (2^(2^t) + 1) · (2^(2^t) − 1) + 1 ≡ 1 (mod F_t)
The largest Fermat prime is F4, for which the order is 32; the arithmetic is modified 17-bit binary.
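Both the identity and the resulting order are easy to confirm numerically (a sketch with an assumed `order` helper):

```python
# Verify that the order of alpha = 2 in GF(F_t) is 2^(t+1), via
# 2^(2^(t+1)) = (2^(2^t) + 1)(2^(2^t) - 1) + 1, which is ≡ 1 (mod F_t).
def order(a, m):
    n, x = 1, a % m
    while x != 1:
        x = x * a % m
        n += 1
    return n

for t in range(1, 5):                    # the Fermat primes F1..F4
    Ft = 2 ** (2 ** t) + 1
    # the ordinary-arithmetic identity quoted in the text:
    assert 2 ** (2 ** (t + 1)) == (2 ** (2 ** t) + 1) * (2 ** (2 ** t) - 1) + 1
    assert order(2, Ft) == 2 ** (t + 1)  # e.g. order 32 for F4 = 65537
```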
If we examine F5 and F6, we find that they are composite numbers of the form F_t = K·2^(t+2) + 1.
We have to pause at this moment to reflect on the fact that F_t, t ∈ {5, 6}, is not a prime number.
Clearly we will not be computing over a field, so will the NTT still work? It turns out that the
conditions for the existence of the transform still hold when computing over a ring. The restriction
is that the transform has to exist over each field constructed from each of the prime factors of the
composite modulus. The transform length, N, therefore satisfies N | F_t^(i) − 1 ∀i, where
F_t = Π_i F_t^(i) (this assumes no powers of primes as factors). For F5 and F6 we can show that N = 2^(t+2) is the maximum transform length supported by these Fermat numbers; as a necessary
condition, we can see that 2^(t+2) | (F_t − 1). Although we are only computing over a ring of integers
Z_(F_t), 2^(t+2) is relatively prime to K·2^(t+2) + 1, and so N^(−1) exists. The length supported by α = 2 is 2^(t+1), as before.
For all the Fermat numbers considered here, it turns out that the transform length N = 2^(t+2) can be
achieved by using a value of α = √2 (Agarwal and Burrus (1974)); i.e. α² = 2. The expression for
α = √2 is given by:

α = 2^(2^(t−2)) · (2^(2^(t−1)) − 1)

which is clearly representable by a 2-bit (2 ones in a field of zeros) generator. For a fast DIT or
DIF algorithm, only the first, or last, stage will require odd powers of α, and so most of the
multiplications will only require single-bit multipliers. Given the use of α = √2 and the fifth FN,
we have a dynamic range of just over 32 bits, with a transform length of N = 128. It is clear that
there is an unfortunate link between dynamic range and transform length, assuming the requirement
for a fast algorithm.
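The √2 construction can be confirmed for the Fermat primes where the formula applies (a sketch; the `order` helper is an assumed name):

```python
# Agarwal-Burrus root: alpha = 2^(2^(t-2)) * (2^(2^(t-1)) - 1) is a square
# root of 2 modulo F_t, giving a transform of length N = 2^(t+2).
def order(a, m):
    n, x = 1, a % m
    while x != 1:
        x = x * a % m
        n += 1
    return n

for t in range(2, 5):                        # F2 = 17, F3 = 257, F4 = 65537
    Ft = 2 ** (2 ** t) + 1
    alpha = (2 ** (2 ** (t - 2)) * (2 ** (2 ** (t - 1)) - 1)) % Ft
    assert alpha * alpha % Ft == 2           # alpha is indeed sqrt(2) mod F_t
    assert order(alpha, Ft) == 2 ** (t + 2)  # transform length N = 2^(t+2)
```

For F2 = 17 the formula gives α = 2·3 = 6, and 6² = 36 ≡ 2 (mod 17), so α has twice the order of 2.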
It might be helpful to take an example of convolution over GF(F2) in order to reinforce the results
presented here. In order to illustrate the difference between performing convolution using an NTT
and an 8-point DFT, we will choose α = 3² = 9 in order that the NTT may also generate an 8-point
transform length.
The transform pair is:

X_k = Σ_17 (n = 0 to 7) [ x_n ⊗_17 9^(nk) ]

x_n = 8^(−1) ⊗_17 Σ_17 (k = 0 to 7) [ X_k ⊗_17 9^(−nk) ]
We can concisely show the calculation using matrix notation (the modulo arithmetic is implied):
X = T x (13)
where T is the transformation matrix, whose elements are given by:
T_kn = ⟨9^(kn)⟩_17
Note that the corresponding DFT transformation matrix has elements given by:
T_kn = e^(−jπkn/4)
We can immediately see the major difference in the computation of the two transforms. Although the
DFT uses normal arithmetic, the calculations are over the complex field, with elements that are not
exactly representable in a finite machine. The NTT, although requiring finite field (modulo) arithmetic,
performs its calculations over an integer field using elements that are defined without error.
As an example calculation, in order to compare the two approaches, we will cyclically convolve the
sequence {x} = {1, 2, 3, 4, 0, 0, 0, 0} with itself. This example is chosen to illustrate several points,
as can be seen if we compute the cyclic convolution directly:

y_k = Σ (n = 0 to 7) x_n · x_⟨k−n⟩_8    or    {y} = {1, 4, 10, 20, 25, 24, 16, 0}

If we now compute the aperiodic (normal) convolution we find:

y_k = Σ (n = 0 to 7) x_n · x_(k−n)    or    {y} = {1, 4, 10, 20, 25, 24, 16, 0}
We can observe that the two computed sequences are identical, which shows that it is possible to
generate normal aperiodic convolution from periodic convolution by padding the sequences to be
convolved with zeros. This forms the basis for the classical overlap-add and overlap-save techniques
for using cyclic convolution to compute normal convolution (Gold and Rader (1969)). As a
second point, we also notice that some of the elements of the convolution output lie outside the
elements of the field over which the NTT is to be computed, GF(17). This will be discussed after
the example calculations.
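The zero-padding equivalence is straightforward to demonstrate in code (a sketch with illustrative function names):

```python
# Zero-padding makes cyclic convolution reproduce aperiodic convolution:
# pad {1,2,3,4} with four zeros and the 8-point cyclic result equals the
# ordinary convolution of the length-4 sequence with itself.
def cyclic_conv(x, h):
    N = len(x)
    return [sum(x[n] * h[(k - n) % N] for n in range(N)) for k in range(N)]

def aperiodic_conv(x, h):
    y = [0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for m, hm in enumerate(h):
            y[n + m] += xn * hm
    return y

x = [1, 2, 3, 4, 0, 0, 0, 0]
assert cyclic_conv(x, x) == [1, 4, 10, 20, 25, 24, 16, 0]
assert aperiodic_conv([1, 2, 3, 4], [1, 2, 3, 4]) == [1, 4, 10, 20, 25, 24, 16]
```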
We will generate the forward transform as the matrix multiplication of equation (13).
[10]   [ 1  1  1  1  1  1  1  1]       [1]
[16]   [ 1  9 13 15 16  8  4  2]       [2]
[ 6]   [ 1 13 16  4  1 13 16  4]       [3]
[11] = [ 1 15  4  9 16  2 13  8] ⊗_17  [4]
[15]   [ 1 16  1 16  1 16  1 16]       [0]
[13]   [ 1  8 13  2 16  9  4 15]       [0]
[ 7]   [ 1  4 16 13  1  4 16 13]       [0]
[15]   [ 1  2  4  8 16 15 13  9]       [0]
The symbol ⊗_17 indicates matrix multiplication over GF(17). Since we are convolving the
sequence with itself, we generate the transform domain point-wise multiplication (over GF(17)) as
below, and feed this result to the input of the inverse transform calculation.
[15]   [10]       [10]
[ 1]   [16]       [16]
[ 2]   [ 6]       [ 6]
[ 2] = [11] ⊗_17  [11]
[ 4]   [15]       [15]
[16]   [13]       [13]
[15]   [ 7]       [ 7]
[ 4]   [15]       [15]
[ 8]   [ 1  1  1  1  1  1  1  1]       [15]
[15]   [ 1  2  4  8 16 15 13  9]       [ 1]
[12]   [ 1  4 16 13  1  4 16 13]       [ 2]
[ 7] = [ 1  8 13  2 16  9  4 15] ⊗_17  [ 2]
[13]   [ 1 16  1 16  1 16  1 16]       [ 4]
[ 5]   [ 1 15  4  9 16  2 13  8]       [16]
[ 9]   [ 1 13 16  4  1 13 16  4]       [15]
[ 0]   [ 1  9 13 15 16  8  4  2]       [ 4]
The inverse transform requires a final multiplication by 8^(−1) = 15 to obtain the cyclic convolution
output.
[ 1]                [ 8]
[ 4]                [15]
[10]                [12]
[ 3] = 8^(−1) ⊗_17  [ 7]
[ 8]                [13]
[ 7]                [ 5]
[16]                [ 9]
[ 0]                [ 0]
We note that some of the elements of the output convolution do not equal the original calculation
using the direct convolution sum. They are, in fact, the result of finding the least positive residue,
modulo 17, of the correct output. In fact the results are not wrong in the accepted sense; they are
simply a mapping of the correct values onto GF(17). We will see later that such results can still be
used, and in fact will provide a technique for removing some of the problems of the linkage of
dynamic range with transform length.
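The whole worked example can be reproduced end to end in a few lines (a sketch; variable names are illustrative):

```python
# Reproduce the GF(17) worked example: forward NTT, point-wise squaring,
# inverse NTT, and comparison with the direct convolution reduced mod 17.
p, alpha, N = 17, 9, 8
x = [1, 2, 3, 4, 0, 0, 0, 0]

X = [sum(x[n] * pow(alpha, n * k, p) for n in range(N)) % p for k in range(N)]
assert X == [10, 16, 6, 11, 15, 13, 7, 15]      # forward transform

Y = [Xk * Xk % p for Xk in X]
assert Y == [15, 1, 2, 2, 4, 16, 15, 4]         # point-wise square

inv_N, inv_alpha = pow(N, -1, p), pow(alpha, -1, p)   # 15 and 2 mod 17
y = [inv_N * sum(Y[k] * pow(inv_alpha, n * k, p) for k in range(N)) % p
     for n in range(N)]
assert y == [1, 4, 10, 3, 8, 7, 16, 0]
# identical to the true convolution {1,4,10,20,25,24,16,0} mapped onto GF(17):
assert y == [v % p for v in [1, 4, 10, 20, 25, 24, 16, 0]]
```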
It will be instructive at this point to examine the use of the DFT in performing the same
convolution. Because we are performing the calculations over the complex field, we will treat the
'real' and 'imaginary' parts of the sequences and the W elements as the elements of a 2-tuple; the
interaction between the 'real' and 'imaginary' parts will use the usual rules of complex arithmetic.
The forward transform is shown below:
With T_kn = e^(−jπkn/4) (entries shown to two decimals) and x = (1, 2, 3, 4, 0, 0, 0, 0)^T:

X = ( 10.00 + j0.00,  −0.41 − j7.24,  −2.00 + j2.00,   2.41 − j1.24,
      −2.00 + j0.00,   2.41 + j1.24,  −2.00 − j2.00,  −0.41 + j7.24 )
The transform domain input to the inverse transform is the point-wise multiplication of the forward
transform with itself:
X ⊙ X = ( 100.00 + j0.00,  −52.28 + j6.00,    0.00 − j8.00,    4.28 − j6.00,
            4.00 + j0.00,    4.28 + j6.00,    0.00 + j8.00,  −52.28 − j6.00 )
The inverse transform is computed below with a final division by 8.
Applying the inverse transformation matrix, with entries e^(+jπkn/4), to the point-wise product gives

8y = ( 8.00, 32.00, 80.00, 160.00, 200.00, 192.00, 128.00, 0.00 )

with all imaginary parts zero to the precision shown; the final division by 8 yields

y = ( 1.00, 4.00, 10.00, 20.00, 25.00, 24.00, 16.00, 0.00 )
The calculations were performed at high precision, and so the effect of quantization noise is not
seen in this limited precision format. The quantization noise is present both in the inexact
representation of the W coefficients and in the scaling procedures used to keep the data within the
dynamic range of the computational system. Even at high precision, zero elements become non-
zero, albeit very small, values. The fact that the input sequence is purely real shows up in the
Hermitian symmetry of the input to the inverse transform (even symmetry in the real part about the
center sample (sample 4) and odd symmetry in the imaginary part). Although we have explicitly
shown all details of the calculations (not normally shown in most texts), an examination of the DFT
calculation compared to the GF(17) NTT calculation demonstrates the simplicity and error-free
behavior of computing indirect convolution over finite fields.
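The contrast can be seen by running the same convolution through a floating-point DFT (a sketch using the standard library `cmath` module; correctness here depends on rounding, unlike the exact finite field computation):

```python
import cmath

# The same convolution via a floating-point DFT: the integer answers are
# recovered only after rounding away small quantization errors.
N = 8
x = [1, 2, 3, 4, 0, 0, 0, 0]

X = [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
     for k in range(N)]
Y = [Xk * Xk for Xk in X]
y = [sum(Y[k] * cmath.exp(2j * cmath.pi * n * k / N) for k in range(N)) / N
     for n in range(N)]

rounded = [round(v.real) for v in y]
assert rounded == [1, 4, 10, 20, 25, 24, 16, 0]
# results are close to, but not in general exactly, the integer answers
assert all(abs(v - r) < 1e-9 for v, r in zip(y, rounded))
```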
We are now in a position to summarize the features about each approach to computing indirect
convolution. This is presented in Table I below:
Table I  Comparison of DFT and FNT integer sequence convolution

Feature            DFT              FNT                                             Comments
Transform Length   No restriction   Restricted by requirement N | F_t^(i) − 1 ∀i
Dynamic Range      No restriction   Restricted by requirement |X_k|