
    Number Theoretic

    Transforms

    G.A. Jullien

    Extract from Number Theoretic Techniques in Digital Signal

    Processing

    Book Chapter

Advances in Electronics and Electron Physics, Academic Press Inc. (Invited), vol. 80, Chapter 2, pp. 69-163, 1991.


    Introduction

    This chapter discusses the application of number theory to the intensive computations required in

    many Digital Signal Processing (DSP) systems. Number theory was always looked upon as one of

    the last vestiges of pure mathematical pursuit, where the idea of applying the theories to practical

    purposes was irrelevant. Over the last few decades we have been discovering very practical

    applications for number theory; among these are applications in coding theory, communications,

    physics and digital information (Schroeder (1985)). Our specific interest in this work is the

application of number theory to the mechanics of arithmetic computations themselves. Modern-day computer technology is heavily dependent on the manipulation of numbers; in fact, the automation of arithmetic operations was the quest of all the early computer pioneers, from Pascal to Babbage to von Neumann. The involvement of number theory in this endeavor, however, has been strangely limited. The use of number theory to help perform the basic arithmetic operations has not been accepted by the commercial world. Perhaps the reason for this is the pervasiveness of the binary number system in the mechanization of computer arithmetic. Many hardware designers have little or no knowledge that alternative ways of computing arithmetic exist!

    There is one field in which the high speed arithmetic manipulation of numbers is of paramount

    importance. This is the field of Digital Signal Processing (DSP), and it is here that alternative

    arithmetic hardware may find a niche. This is indeed the subject of this paper.

Why is it that some DSP systems are so computationally intensive? Typically the DSP system we are discussing will be involved with the processing of streams of data that are indeterminately long; the requirement of the system is to produce processed output data at the same rate that the input

    data is entering the processor. This is referred to as real-time processing. Often the arithmetic

manipulations are simple in form (cascades of additions and multiplications in a well-defined structure) but the number of operations that have to be computed every second can be enormous. It


    will be constructive at this point to consider an example to illustrate the processing power required

    by a typical DSP system.

Let us consider a basic problem: the filtering of television pictures in real time. We will assume that a two-dimensional filter is required and is given by the following equation:

y_{m,n} = Σ_{i=-2}^{2} Σ_{j=-2}^{2} w_{i,j} x_{m-i,n-j}

Here w_{i,j} is the filter kernel, and represents the weights of a two-dimensional averaging procedure around each output sample (y_{m,n}). The averaging is based on a neighborhood of input samples {x_{i,j}} that extends 5 samples in each dimension and is centered on the output pixel position. The double

    summation, in fact, represents two-dimensional convolution. Convolution is a basic operation to all

    linear signal processing problems since it represents the mapping function of an input signal to an

    output signal through a linear system. The real time constraint is important since, as we will see, an

    extremely large computational requirement will be necessary in order to construct the filter.

    In order to see the speed of operation required by our filter, we will perform some elementary

    calculations based on typical requirements for the filter. We assume a 525 line scan with 500

    picture elements (pixels) in each line. We also assume that each primary colour is to be filtered

    independently with the 5 x 5 filter. With an interlaced system we need to filter 30 images every

    second. Each output pixel (made up of three primary colour sub-pixels) requires 75

    multiply/accumulate operations (M/A Ops) to implement the filter. There are 262,500 pixels in each

image, and so 1.96875 × 10⁷ M/A Ops are required for each image. Since we need to filter 30 images per second, the total number of M/A Ops per second is 5.9 × 10⁸. Assuming that an M/A

    Op is considered a basic operation in the processor used to filter the image, then we need a machine

that delivers 590 Mega Operations per second (MOPS). In these terms we require a supercomputer to deliver this performance!
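The arithmetic in this paragraph can be reproduced with a short sketch (Python here purely for illustration; the constants are the assumptions stated above):

```python
# Throughput estimate for the 5 x 5 colour-image filter of the text.
LINES, PIXELS_PER_LINE = 525, 500     # assumed scan format
COLOURS = 3                           # primary colour sub-pixels
KERNEL_TAPS = 5 * 5                   # taps per colour
FRAMES_PER_SECOND = 30                # interlaced system

pixels_per_image = LINES * PIXELS_PER_LINE              # 262,500
ops_per_pixel = COLOURS * KERNEL_TAPS                   # 75 M/A Ops
ops_per_image = pixels_per_image * ops_per_pixel        # 1.96875e7
ops_per_second = ops_per_image * FRAMES_PER_SECOND      # ~5.9e8

print(pixels_per_image, ops_per_image, ops_per_second)
```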


Clearly the implementation has to be somewhat different from a large mainframe supercomputer; it would certainly be difficult to fit one into a normal-sized television set!

    The filtering operation has the advantage of not requiring the full features of a general purpose

    computer (the filtering operation does not have to be performed within the typical multi-user,

    complex operating system environment of a main frame machine) and the computations are simply

repeated arithmetic operations in a predetermined structure. In fact, it is rather pointless to talk about the computational requirements in general computer terms (MIPS, MOPS, MFLOPS, etc.); it is better to talk in direct terms of the data-flow requirements, that is, the throughput rate of the

    filter in samples per second (in this example, a sample is a pixel). For our example filter we have to

filter pixels at a rate of 7.875 × 10⁶ pixels per second. If we provide arithmetic hardware which contains 75 Multiply/Accumulate circuits, then each circuit has to operate within 120 ns. If we use currently available integrated circuits that operate within 60 ns, then we can build a circuit with only 38 integrated circuits for the basic arithmetic operations, and perform two operations, sequentially,

    with one circuit. We have therefore reduced the original requirement of a supercomputer to a

    system containing a relatively small number of 'chips'.

    It is the purpose of this paper to discuss ways in which number theoretic techniques can be used to

    perform DSP operations, such as this example filter, by reducing the amount of hardware involved

    in the circuitry. Since we are going to be directly concerned with the implementation of DSP

    arithmetic, it might be helpful to finish off our example by considering the arithmetic details of each

    Multiply/Accumulate operation.

The representation of each primary colour pixel by 8 bits is more than adequate for very high definition of the image, and the quantization of each filter weight to 8 bits is also very generous. This naturally leads to a 16-bit multiplier structure. Since 25 such products are accumulated for each output pixel, an extra 5 bits are required. With 21 bits we can, in fact, compute each output pixel with no error in the computation process. Since 21 bits is a pessimistic upper bound for the computational dynamic range (only required if both the filter


    coefficients and the input pixels are at their maximum values - a very rare occurrence), we can easily

    specify a 20-bit range for accurate computation.
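The bit-width argument can be checked directly (Python for illustration; unsigned magnitudes are assumed, which is the pessimistic case described above):

```python
import math

# Worst-case dynamic range for the 5 x 5 filter with 8-bit pixels
# and 8-bit weights.
max_pixel = 2**8 - 1                     # 255
max_weight = 2**8 - 1                    # 255
taps = 25

max_product = max_pixel * max_weight     # one product needs 16 bits
extra_bits = math.ceil(math.log2(taps))  # 5 extra bits for 25 accumulations
max_output = max_product * taps          # worst-case accumulator value

print(max_product.bit_length(), extra_bits, max_output.bit_length())
```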

This calculation leads us to an important observation: we do not need floating point arithmetic.

    We can, in fact, perform error-free computation within the dynamic range of a typical floating point

    standard mantissa length - no manipulation of the exponent being required at all! This example

    represents a fine-grained computation (one in which successive parts of the computation are of a

    limited dynamic range) and is typical of many DSP algorithms. The limitations we are able to

    impose on the arithmetic calculations will be a great help in reducing the hardware required in the

    very high-speed throughput systems to be discussed in this paper.

    For the purposes of this paper we will restrict ourselves to the computation of linear filter and

    transform algorithms which have, for the one-dimensional case, the inner product form:

y_n = Σ_{m=0}^{N-1} x_m w_{f(n,m)}

In the case of linear filtering (convolution), f(n,m) = n-m and {w} is the weight sequence; for the case of a linear transform with the convolution property, f(n,m) = nm and w_{f(n,m)} = α^{nm}, where α is a primitive Nth root of unity. For the DFT, w_{f(n,m)} = e^{-i2πnm/N}.

Although this limitation appears overly restrictive, this general inner-product form probably

    accounts for the vast majority of digital signal processing functions implemented commercially, and

    also leads us to a body of literature that uses number theory to indirectly compute the convolution

    form of the inner product formulation. We will also restrict our investigation of the application of

    number theoretic techniques to those that use finite field arithmetic directly in the computations.

    This will not cover those algorithms whose structure is derived via number theoretic techniques, and

    whose computation is performed over the usual complex field (e.g. Winograd Fourier Transforms);

    there is a vast amount of literature already available on this separate subject, and the reader is

    directed to the work by McClellan and Rader (1979) as a starting point.
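The inner-product form can be sketched directly (Python for illustration; `inner_product_form` is our name, not the chapter's). Choosing f(n,m) = nm and w a primitive Nth root of unity, w = e^{-i2π/N}, yields the DFT:

```python
import cmath

def inner_product_form(x, w, f):
    """y_n = sum over m of x_m * w**f(n, m): the general form of the text."""
    N = len(x)
    return [sum(x[m] * w ** f(n, m) for m in range(N)) for n in range(N)]

x = [1, 2, 3, 4]
N = len(x)

# w = exp(-i 2*pi/N) (a primitive Nth root of unity) and f(n, m) = n*m
# specializes the form to the DFT.
dft = inner_product_form(x, cmath.exp(-2j * cmath.pi / N), lambda n, m: n * m)
```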


Our discussion in this review will concentrate on two areas: the use of finite ring/field operations for performing ordinary integer computations (the Residue Number System) for high-speed signal processing, and the use of algorithms defined over finite rings/fields, in which not only is the arithmetic performed over such finite structures, but the properties of the algorithm also depend on the properties of the finite field/ring over which it is defined. This is an important

    differentiation. Our sole concern here is to examine arithmetic and general computations over finite

    mathematical systems.

Our starting point should be the recognized origin of the modern field of digital signal processing: the Fast Fourier Transform (FFT). The FFT is not a new transform, but really a name

    for a set of algorithms that compute the Discrete Fourier Transform (DFT) with much reduced

    arithmetic operations. In particular, multiplications, traditionally a hardware and/or time consuming

arithmetic operation, are reduced considerably. With the publication of the first FFT algorithm

    (Cooley and Tukey (1965)) digital signal processing began to emerge as a field in its own right.

    Only a few years after that landmark paper, the first applications of finite field transforms to

    computing convolution appeared. We will not attempt to retrace the history here, but will discuss

    these two discoveries as initial steps on the road to the application of number theory to the digital

    processing of signals. The discovery of efficient ways of computing the DFT allowed both the

    computation of spectra, and the implementation of finite impulse response filters to be carried out

    with much reduced computational time. The most interesting use of the FFT, in the initial years, was

    in the latter application. The ability of the DFT to indirectly compute the inner product form for FIR

    filters comes from its convolution property. The result is surprising because it seems, on the

    surface, to involve more work in that three transforms have to be computed. It is also surprising that

    a transform with a basis set involving trigonometrical functions can be used to efficiently compute

    convolutions between, say, integer sequences, but it does indeed work.

    In order to examine this remarkable property we will briefly discuss FIR filters and then introduce

    the DFT.


    A. Finite Impulse Response Digital Filtering using DFTs

    Finite impulse response (FIR) digital filters produce an output based on a weighted sum of present

    and past inputs. Mathematically this is a weighted averaging scheme represented by:

y_k = Σ_{n=0}^{N-1} h_n x_{k-n}

    where {hn} is the weighting sequence, and {xn} is the input sequence.
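A minimal direct-form realization of this sum can be sketched as follows (Python for illustration; `fir_filter` is our name, and treating inputs before k = 0 as zero is an assumption):

```python
def fir_filter(h, x):
    """Direct-form FIR: y_k = sum over n of h_n * x_(k-n), zero for k-n < 0."""
    y = []
    for k in range(len(x)):
        y.append(sum(h[n] * x[k - n] for n in range(len(h)) if k - n >= 0))
    return y

h = [1, 2, 3]                      # a hypothetical weighting sequence
impulse = [1, 0, 0, 0, 0]
print(fir_filter(h, impulse))      # → [1, 2, 3, 0, 0]: the output is {h_n}
```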

The weighting sequence {hn} characterizes the filter, and it is the determination of this sequence that constitutes the design of the filter. The sequence is sometimes referred to as the impulse

    response based on the equivalent characterization of analog filters. In the digital case the impulse

    function is a perfectly realizable sequence {1, 0, 0, 0, .....}. If we use this impulse function as the x

    sequence, the output of the filter will be the {hn} sequence. We can also define a frequency

response for the filter by using an input sequence that is a complex sinusoid, x_n = e^{iωn}. Here, the frequency is simply an angular increment, ω, which locates the value of the input, at sample n, on the unit circle at an angle of ωn. As with continuous systems, we can invoke the use of a transform

    to map convolution into multiplication. The transform we use is the z-transform (Jury (1964))

    which is defined as:

X(z) = Σ_{n=0}^{N-1} x_n z^{-n}

    Using this definition, and the convolution property, we are able to define a transfer function for the

    FIR filter as:

Y(z)/X(z) = H(z)

By letting z = e^{iω}, we can find the magnitude and phase shift (polar coordinates) of H(e^{iω}), and hence the frequency response (H(e^{iω}) vs ω). We note that a simple inversion procedure exists; we


    simply examine the coefficients of H(z) and assign them to delayed samples of the impulse

    response. H(z) can also be obtained by a polynomial division of Y(z) by X(z); for the FIR filter we

    know that this division will always have finite length and so H(z) (for z an indeterminate complex

    number) has only zeros. For filters which have an infinite length response (infinite impulse

    response (IIR) filters), H(z) contains poles as well as zeros; i.e. we represent H(z) as a rational

    function of polynomials. Closed solutions exist for determining the impulse response for IIR

    filters based on the roots of H(z) (Jury (1964)). We will not consider IIR filters here since our

    interest is in implementing a finite convolution process. IIR filters possess the property that they

are able to implement arbitrary frequency magnitude responses with far fewer multiplications than equivalent FIR filters. They have unfortunate side effects, however: possible instability due to inexact multipliers; the possibility of producing limit cycle oscillations; the possibility of overflow oscillations; and difficulty in implementing linear phase responses. Although a great deal of effort has

    been spent on studying such filters (see, for example, Rabiner and Gold (1975)), it is not clear that

    the implementation efficiency suitably compensates for the problems. Perhaps the best solution is

    to concentrate on efficient implementation strategies for FIR filters. This is the main subject of this

    paper.

    II. Discrete Fourier Transforms

The DFT is defined as a transformation that maps a sequence x_n to a sequence X_k, where:

X_k = Σ_{n=0}^{N-1} x_n e^{-j2πnk/N} = Σ_{n=0}^{N-1} x_n e^{-jW_N nk}   (1)

with W_N = 2π/N. For compactness we will also write W_N(x) for e^{-jW_N x}, so that (1) becomes X_k = Σ_{n=0}^{N-1} x_n W_N(nk).

    In terms of the Fourier transform, this may be regarded as an approximation to the Fourier integral,

    but it turns out that it is a transform in its own right with exact properties. Two of the properties that


are important to the subject of this paper are the inverse and the convolution properties. In fact we need

    the first to prove the second.

    A. Inverse and Convolution Property

One important property is that the transform is reversible; i.e., an inverse mapping from X_k back to x_n exists, given by:

x_n = (1/N) Σ_{k=0}^{N-1} X_k e^{jW_N nk}   (2)

    The proof that the inverse exists yields the basis for the second important property, the convolution

    property.

The easiest way to prove that (2) is the inverse of (1) is by direct substitution. We will make the substitution using an appropriate dummy variable, m; the result should collapse back to x_n.

x_n = (1/N) Σ_{k=0}^{N-1} ( Σ_{m=0}^{N-1} x_m e^{-jW_N mk} ) e^{jW_N nk}

Interchanging the order of summation and gathering terms we find:

x_n = (1/N) Σ_{m=0}^{N-1} x_m Σ_{k=0}^{N-1} e^{-jW_N k(m-n)}

or

x_n = (1/N) Σ_{m=0}^{N-1} x_m Σ_{k=0}^{N-1} W_N(k[m-n])   (3)

The function W_N(k[m-n]) displays rather an interesting property.


W_N(k[m-n]) is a function on the unit circle in the complex plane, taking N unique values: 1, e^{-j[m-n]W_N}, e^{-j2[m-n]W_N}, ..., e^{-j(N-1)[m-n]W_N} for k = 0, 1, 2, ..., N-1. For [m-n] = KN, where K is some integer, all values of the function are 1, and so the sum Σ_{k=0}^{N-1} W_N(k[m-n]) evaluates to N. For [m-n] ≠ KN we can invoke the fact that the function yields a finite geometric series with the sum:

Σ_{k=0}^{N-1} W_N(k[m-n]) = (1 - e^{-jN[m-n]W_N}) / (1 - e^{-j[m-n]W_N}) = 0,

since the numerator contains e^{-j2π[m-n]} = 1. Thus, the only non-zero contributions to (3) occur for [m-n] = KN, and for these values we obtain the result x_n = x_{n+KN}. If we restrict both the sample and transform domains to be periodic, with period N, then we have proved that the transform is invertible.
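The orthogonality argument above can be checked numerically (Python sketch; N = 8 is chosen arbitrarily):

```python
import cmath

N = 8
W = 2 * cmath.pi / N   # the scalar W_N of the text

def orthogonality_sum(d):
    """sum over k = 0..N-1 of exp(-j * W_N * k * d): the inner sum of (3)."""
    return sum(cmath.exp(-1j * W * k * d) for k in range(N))

# N when d = m - n is a multiple of N; (numerically near) zero otherwise.
print(abs(orthogonality_sum(0)), abs(orthogonality_sum(8)), abs(orthogonality_sum(3)))
```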

    Based on this result for the existence of an exact inverse for the DFT, we can now proceed to

    examine the convolution property.

We start by considering the result of multiplying two transform-domain sequences together. Assume that the sample-domain sequences are x_n and y_n, with transform-domain sequences X_k and Y_k. The result of multiplying the transform-domain sequences is a new sequence Z_k = X_k · Y_k. Using the DFT of the x sequence explicitly, we have:

Z_k = Σ_{n=0}^{N-1} x_n W_N(kn) · Y_k   (4)

Inverting equation (4) yields:

z_m = (1/N) Σ_{k=0}^{N-1} Σ_{n=0}^{N-1} x_n W_N(kn) · Y_k · W_N(-km)   (5)

    We now interchange the order of summation to find:


z_m = Σ_{n=0}^{N-1} x_n ( (1/N) Σ_{k=0}^{N-1} Y_k W_N(k[n-m]) )   (6)

The inner expression of equation (6), in parentheses, is the sequence y_{[m-n]}. Since {y} is periodic, we can retain values from a single period by forming y_{[m ⊕_N (-n)]}. Note that here we are using the notation ⊕_N to indicate addition modulo N. This notation will be expanded later.

From (6) we obtain the final result for the output sequence:

z_m = Σ_{n=0}^{N-1} x_n y_{[m ⊕_N (-n)]}   (7)

    zm is therefore the convolution of the sequence {x} with the sequence {y}. It is the convolution of

    two periodic sequences, and is often called cyclic, or periodic, convolution. As noted before, we only

need to know the samples of a single period in order to perform the calculation. If we wish to determine the periodic convolution of two sequences, then we can perform the calculation indirectly

    by taking DFT's of each sequence, multiplying in the transform domain, and inverting the result. If

we require the convolution of two aperiodic sequences, it turns out that we can still use periodic convolution, provided that we are prepared to perform several periodic convolutions with sub-sequences of the original sequences, and assemble the final sequence from partial outputs of the periodic

    convolutions; these are the so-called overlap/save and overlap/add procedures. We are not able to

    give full details of such procedures here, but there are many fine text books which detail the two

    approaches (e.g. Rabiner and Gold (1975)).
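The indirect route just described can be sketched end to end (Python for illustration; `dft` and `circular_convolution` are our names, and the O(N²) DFT stands in for a fast algorithm):

```python
import cmath

def dft(x, inverse=False):
    """Direct O(N^2) evaluation of equations (1) and (2)."""
    N, sign = len(x), 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / N)
               for m in range(N)) for k in range(N)]
    return [v / N for v in out] if inverse else out

def circular_convolution(x, y):
    """Direct evaluation of (7): z_m = sum over n of x_n * y_(m-n mod N)."""
    N = len(x)
    return [sum(x[n] * y[(m - n) % N] for n in range(N)) for m in range(N)]

x, y = [1, 2, 3, 4], [1, 1, 0, 0]
direct = circular_convolution(x, y)

# Indirect route: transform both, multiply pointwise, invert.
X, Y = dft(x), dft(y)
indirect = dft([a * b for a, b in zip(X, Y)], inverse=True)
```

Note that `indirect` comes back as (near-integer) complex values: the floating-point roots of unity introduce the small errors discussed later in this section.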

    The use of an indirect computation of convolution is advantageous when the amount of computation

involved is considerably less than that required using the direct convolution sum. An N-point periodic convolution requires N² multiplications and N² − N two-operand additions. Can we reduce this by an indirect computation involving DFT's? The answer is yes, but only if the transform can be computed efficiently. An algorithm that efficiently computes a DFT is called a fast


    algorithm, and the computation of a DFT with such an algorithm is called a Fast Fourier Transform

    (FFT). Note that the FFT is not a separate transform, just a fast (or efficient) way of computing a

    DFT.

    For simplicity we will discuss the development of fixed radix decimation algorithms that use a

    power of two as the radix. These were the first Fast algorithms to be developed, and still represent

    the major implementation procedure for commercial hardware and software solutions.

    III. Fast Fourier Transform (FFT) Algorithms

We will briefly examine the classical decimation-in-time algorithm to appreciate the magnitude of savings that can occur in computing the DFT using a fast algorithm.

    A. Decimation in Time FFT Algorithm

    There are two basic decimation (sequence slicing) procedures: decimation over the sample domain

    (decimation in time, DIT) or decimation over the transform domain (decimation in frequency, DIF).

    We will consider the first procedure, with the assumption that N contains some power of 2 factors.

The transformation from x_n to X_k can be written:

X_k = Σ_{n=0}^{N-1} x_n W_N(nk)   (8)

    and we can decompose the computation by computing (8) for the even and odd samples, separately,

    and then put the two results back together. This yields:

X_k = Σ_{n=0}^{N/2-1} x_{2n} W_{N/2}(nk) + W_N(k) · Σ_{n=0}^{N/2-1} x_{2n+1} W_{N/2}(nk)   (9)


We have now reduced the computational requirement for an N-point DFT to that required for two N/2-point DFTs and an extra N multiplications. If we are interested in reducing multiplications, then we have reduced the requirement from N² multiplications to N²/2 + N multiplications. Although this may not represent a large reduction, we are able to repeat the process by decimating each of the N/2-point DFTs into two N/4-point DFTs. We can continue in this way until we run out of factors of N. The greatest decimation occurs for N a power of 2, in which case the basic computational element reduces to a 2-point DFT (which can be computed using real numbers). In this case the number of multiplications reduces to (N/2)log₂N (in fact some of these multiplications are trivial, but we will use this figure as an upper bound). The calculation reduces to the computation of (N/2)log₂N two-sample DFT's, each with a multiplication by a power of W. This multiplier is popularly referred to as a 'twiddle factor'.
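The decimation just described can be sketched as a recursive routine (Python for illustration; this is the textbook radix-2 DIT recursion, not code from the chapter):

```python
import cmath

def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])   # the two N/2-point DFTs of (9)
    out = [0j] * N
    for k in range(N // 2):
        twiddled = cmath.exp(-2j * cmath.pi * k / N) * odd[k]  # twiddle factor
        out[k] = even[k] + twiddled              # butterfly sum
        out[k + N // 2] = even[k] - twiddled     # butterfly difference
    return out

x = [0, 1, 2, 3, 4, 5, 6, 7]
spectrum = fft(x)
```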

The basic computational element is shown in the signal flow graph in Fig. 1 below, and is often referred to as a 'butterfly', based on its structure.

[Figure 1: Butterfly calculation with twiddle factor multiplier. Inputs A and B; B is multiplied by the twiddle factor W_N^r, and the two outputs are C = A + B·W_N^r and D = A − B·W_N^r.]


Notice that there is only one complex multiplication and two complex additions per butterfly. An example is shown in Figure 2 for an 8-point transform, using (N/2)log₂N = 4 log₂ 8 = 4 × 3 = 12 butterfly calculations. Appropriate pairs of shaded nodes represent the addition/subtraction network of the butterfly calculation.

    Note that although there are 12 multiplications indicated, in fact only 5 of these are non-trivial (i.e.

    not a multiplication by unity). Before we leave this brief introduction to fast algorithms it can be

    observed that the output sequence has been scrambled in the process of reducing the number of

    multiplications, and that the structure is not the same from stage to stage. Algorithms are available

    that do not scramble the output data (or require the input data to be scrambled) and there are also

    algorithms that maintain a constant structure, but not together (a sort of Heisenberg uncertainty

    principle for FFT's !).

[Figure 2: FFT Signal Flow Graph for 8-point DFT transform. Three stages of butterflies connect the input samples x₀ ... x₇ to the outputs X₀, X₄, X₂, X₆, X₁, X₅, X₃, X₇ (a bit-reversed ordering), with twiddle-factor multipliers that are powers of W₈.]


It seems to be a law of nature that as one attempts to reduce the number of arithmetic operations (or at least the most costly: multiplication), the structure suffers. This comes back to haunt us when we

    examine VLSI implementations of such algorithms.

    There are many works available on the myriad of fast algorithms that have been discovered since the

    initial algorithm was published (for example Blahut (1985)). It has been our intent here simply to

    point out that these algorithms exist and to briefly discuss one of the earliest.

    B. Use of the DFT in indirect computation of convolution

    Based on the fact that DFT's can be computed very efficiently, it becomes clear that the operation of

    convolution can also be computed much more efficiently than directly implementing the convolution

sum. Assuming that we wish to compute circular convolution, an N-point convolution can be computed using 2 DFTs and an inverse DFT. Since an inverse DFT has the same complexity as a forward DFT, the number of multiplications involved is (3N/2)log₂N + N. The reduction in multiplications for an N-point circular convolution is therefore a factor of (1/N)(1 + (3/2)log₂N). For the case of filtering with a fixed set of weights, we can store the DFT of the weight sequence and obtain a greater reduction, to a factor of (1/N)(1 + log₂N). These savings are reduced if one is interested in aperiodic convolution (the normal filtering operation), but for large filter kernels (large N) the savings can be significant enough to warrant using this indirect approach.
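The operation counts can be tabulated with a short sketch (Python; the formulas are the upper bounds used in this section, counting (N/2)log₂N multiplications per FFT):

```python
import math

def direct_mults(N):
    """Multiplications for direct N-point circular convolution."""
    return N * N

def indirect_mults(N, fixed_weights=False):
    """Two or three (N/2)log2(N) FFTs plus N point-wise products."""
    ffts = 2 if fixed_weights else 3
    return ffts * (N // 2) * int(math.log2(N)) + N

for N in (8, 64, 1024):
    print(N, direct_mults(N), indirect_mults(N), indirect_mults(N, True))
```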

We note one unfortunate fact: even if we are convolving only real sequences, it is still necessary to use complex arithmetic. In fact, our reduction calculations above are predicated on the need for complex sequence convolution. There is another unfortunate effect: since the roots of unity are

    obtained from transcendental functions, we are not able to compute convolutions without error; even

    the convolution between sequences of integers, which in the direct form can be easily computed

    without invoking any errors, will have to be computed using approximations to the roots, and will, in

    general, invoke errors in the final convolution sequence.


    There are mathematical systems in which roots of unity can be generated and used in calculations

    without any approximations. In fact, approximations are meaningless. The mathematical systems

    are finite algebras, and we will devote the remainder of this chapter to discussing their use in the

    computation of DSP algorithms.

    It is important to establish some basic concepts and theories relating to the subject of finite algebra;

    the following section is included for that purpose.

    IV. Finite Algebras

For our specific purposes we will be using algebraic structures that emulate normal integer computations. The entities to be discussed are groups, rings and fields, which contain a finite number of integers as their elements and allow arithmetic operations that are modifications of

    normal integer arithmetic operations. This will not be an exhaustive review; the reader is referred to

    standard texts on number theory (e.g. Dudley, 1969, and Vinogradov, 1954) as well as texts on

    modern algebra (e.g. Burton, 1970, and Schilling and Piper, 1975).

As a starting point, we will formally define the notions of a ring and a field.

    A. Rings and Fields

    Definition 1:

If R is a non-empty set on which there are defined binary operations of addition, +, and multiplication, ·, such that the following postulates (I) - (VI) hold, we say that R is a ring:

(I) Commutative law of addition: a + b = b + a

(II) Associative law: (a + b) + c = a + (b + c)

(III) Distributive law: a·(b + c) = a·b + a·c


(IV) Existence of an element, denoted by the symbol 0, of R such that a + 0 = a for every a ∈ R

(V) Existence of an additive inverse: for each a ∈ R, there exists x ∈ R such that a + x = 0

(VI) Closure, where it is understood that a, b, c are arbitrary elements of R in the above postulates

A ring in which multiplication is a commutative operation is called a commutative ring. A ring with identity is a ring in which there exists an identity element, 1, for the operation of multiplication: a·1 = 1·a = a for all a ∈ R.

The ring of integers is a well-known example of a commutative ring with identity. In abstract algebra, elements of a ring are not necessarily the integers, nor even necessarily numbers; e.g., they can be polynomials. Given a ring R with identity 1, an element a ∈ R is said to be invertible, or to be a unit, whenever a possesses an inverse with respect to multiplication. The multiplicative inverse, a⁻¹, is an element such that a⁻¹·a = 1. The set of all invertible elements of a ring is a group with respect to the operation of multiplication and is called a multiplicative group.

    Definition 2:

If R is a ring and a ∈ R, a ≠ 0, then a is called a divisor of zero if there exists some b ∈ R, b ≠ 0, such that a·b = 0.

    Definition 3:

    An integral domain is a commutative ring with identity which has no divisors of zero.

    Definition 4:

    A ring, F, is said to be a field provided that the set F-{0} is a commutative group under

    multiplication.


    Viewed otherwise: a field is a commutative ring with identity in which each non-zero element

    possesses an inverse under multiplication. It follows (McCoy, 1972) that every field is an integral

    domain. The rings (fields) with a finite number of elements are called finite rings (fields). A ring of

    integers modulo m, denoted here as Zm, is an example of a finite ring.

    In every finite field with p elements, the nonzero elements form a multiplicative group. This

    multiplicative group of order p-1 is cyclic; i.e. it contains an element, a, whose powers exhaust the

    entire group. This element is called a generator, or a primitive root of unity, and the period or order

of a is p-1. The order of any element x in the multiplicative group is the least positive integer t such that x^t = 1, with x^s ≠ 1 for s ∈ [1, t). The order t is a divisor of p-1 and x is called a primitive t-th root of

    unity.

For every element x of the set F-{0}, the mapping defined by x = a^t, t ∈ {0, 1, ..., p-2}, is the isomorphism of the additive group of integers with addition modulo p-1 and the multiplicative group of the field F.

The integer t is called the index of x relative to the base a, denoted ind_a x.
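As a concrete illustration of generators and indices, here is a small editorial sketch in modern Python (the helper names `element_order` and `primitive_root` are ours, not from the chapter):

```python
# Hypothetical illustration: find a generator of the multiplicative group
# of GF(p) and build the index (discrete log) table described above.
def element_order(x, p):
    """Least t > 0 with x^t = 1 (mod p)."""
    t, y = 1, x % p
    while y != 1:
        y = (y * x) % p
        t += 1
    return t

def primitive_root(p):
    """Smallest generator a whose powers exhaust GF(p) - {0}."""
    return next(a for a in range(2, p) if element_order(a, p) == p - 1)

p = 17
a = primitive_root(p)                            # 3 is the smallest primitive root mod 17
index = {pow(a, t, p): t for t in range(p - 1)}  # ind_a(x) for each nonzero x
```

For p = 17 the smallest primitive root is 3, and the dictionary gives ind_3(x) for every nonzero x in GF(17).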

    1. The ring of residue classes

    In order to describe the system, the notion of congruence should be introduced. The basic theorem,

    known as the division algorithm will be established first.

    Division Algorithm

For given integers a and b, b not zero, there exist two unique integers, q and r, such that a = b·q + r, r ∈ [0, |b|). It is clear that q is the integer value of the quotient a/b. The quantity r is the least positive (integer) remainder of the division of a by b and is designated as the residue of a modulo b, or |a|b. We will say that b divides a (written b|a) if there exists an integer k such that a = b·k.
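In Python, for example, the built-in divmod already yields this least positive remainder when b > 0 (a small editorial sketch, not part of the original text):

```python
# Division algorithm for b > 0: a = b*q + r with r in [0, b),
# even when a is negative.
a, b = -7, 3
q, r = divmod(a, b)          # Python's floor division gives the least positive residue
assert a == b * q + r and 0 <= r < b
# Here -7 = 3*(-3) + 2, so the residue of -7 modulo 3 is 2.
```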

    Definition 5:


Two integers c and d are said to be congruent modulo m, written

c ≡ d (mod m), if and only if m|(c-d)

Since m|0, c ≡ c (mod m) by definition.

Another alternative definition of congruence can be stated as follows: two integers c and d are congruent modulo m if and only if they leave the same remainder when divided by m, |c|m = |d|m.

    Definition 6:

    A set of integers containing exactly those integers which are congruent, modulo m, to a fixed integer

    is called a residue class, modulo m.

    The residue classes (mod m) form a commutative ring with identity with respect to the modulo m

    addition and multiplication, traditionally known as the ring of integers modulo m or the residue ring,

    and denoted Zm . The ring of residue classes (mod m) contains exactly m distinct elements. The

    ring of residue classes (mod m) is a field if and only if m is a prime number, because multiplicative

inverses, denoted a^(-1) (mod m) or |a^(-1)|m, exist for each nonzero a ∈ Zm. Thus the nonzero classes of Zm form a cyclic multiplicative group of order m-1, {1, 2, ..., m-1}, with multiplication modulo m, isomorphic to the additive group {0, 1, ..., m-2} with addition modulo m-1.
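These claims are easy to check by machine; the following editorial Python sketch (the helper name `inverse` is ours) tests that Z_17 is a field while Z_15 has zero divisors:

```python
# Inverses exist in Z_m for exactly those a with gcd(a, m) = 1;
# when m is prime, that is every nonzero a.
from math import gcd

def inverse(a, m):
    """Multiplicative inverse of a in Z_m, or None if it does not exist."""
    return pow(a, -1, m) if gcd(a, m) == 1 else None

assert all(inverse(a, 17) is not None for a in range(1, 17))  # Z_17 is a field
assert inverse(6, 15) is None                                 # gcd(6, 15) = 3
assert (6 * 5) % 15 == 0                                      # 6 and 5 are divisors of zero in Z_15
```

The modular inverse via `pow(a, -1, m)` requires Python 3.8 or later.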

If m is composite, Zm is not a field. Multiplicative inverses do not exist for those nonzero elements a ∈ Zm for which gcd(a, m) ≠ 1. Euler's totient function, denoted φ(m), is a widely used number theoretic function and is defined as the number of positive integers less than m and relatively prime to it. It follows that the number of invertible elements of Zm is equal to φ(m). If m has the prime power factorization m = m1^e1 · m2^e2 · ... · mL^eL, then we can write:

φ(m) = m (1 - 1/m1) (1 - 1/m2) ... (1 - 1/mL)


The Euler-Fermat theorem states that if c is an integer, m is a positive integer and gcd(c, m) = 1, then c^φ(m) ≡ 1 (mod m).

The important consequence of this theorem is that there is an upper limit on the order of any element a ∈ Zm. Specifically, the order t is a divisor of φ(m).
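Both the totient product formula and the Euler-Fermat theorem are easy to verify numerically; a short editorial Python sketch (the helper name `phi` is ours):

```python
# Euler's totient computed from the prime factorization, followed by a
# check of c^phi(m) = 1 (mod m) for every c relatively prime to m.
from math import gcd

def phi(m):
    result, d, n = m, 2, m
    while d * d <= n:
        if n % d == 0:
            result -= result // d       # multiply running result by (1 - 1/d)
            while n % d == 0:
                n //= d
        d += 1
    if n > 1:                           # leftover prime factor
        result -= result // n
    return result

m = 20                                  # 20 = 2^2 * 5, so phi(20) = 20*(1/2)*(4/5) = 8
assert phi(m) == 8
assert all(pow(c, phi(m), m) == 1 for c in range(1, m) if gcd(c, m) == 1)
```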

    2. Galois Fields

For any prime m and any positive integer n, there exists a finite field with m^n elements. This (essentially unique) field is commonly denoted by the symbol GF(m^n) and is called a Galois field in honor of the French mathematician Evariste Galois. Since any finite field with m^n elements is a simple algebraic extension of the field Zm, a brief review of the basic concepts of the extensions of

    a given field will be presented.

Let F be a field. Then any field K containing F is an extension of F. If λ is algebraic over F, i.e. if λ is a root of some irreducible polynomial f(x) ∈ F[x] such that f(λ) = 0, then the extension field arising from a field F by the adjunction of a root λ is called a simple algebraic extension, denoted F(λ).

Each element of F(λ) can be uniquely represented as a polynomial:

a0 + a1λ + ... + a(n-1)λ^(n-1), ai ∈ F.

This unique representation closely resembles the representation of a vector in terms of the vectors of a basis

1, λ, ..., λ^(n-1).

The vector space concepts are sometimes applied to extension fields, and F(λ) is considered as a vector space of dimension n over F.


The field of complex numbers is an example of an extension of the field of real numbers; it is generated by adjoining a root j = √-1 of the irreducible polynomial x^2 + 1.

If f(x) is an irreducible polynomial of degree n over Zm, m a prime, then the Galois field with m^n elements, GF(m^n), is usually defined as the quotient field Zm[x] / (f(x)), i.e., the field of residue classes of polynomials of Zm[x] reduced modulo f(x). All fields containing m^n elements are isomorphic to each other. In particular, Zm[x] / (f(x)) is isomorphic to the simple algebraic extension Zm(λ), where λ is a root of f(x) = 0.
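As an editorial sketch of this quotient-field construction, the following Python fragment multiplies elements of GF(2^3) represented as coefficient lists (lowest degree first) reduced modulo the irreducible polynomial f(x) = x^3 + x + 1 (our choice of f(x) for illustration):

```python
# Minimal sketch of GF(m^n) arithmetic: polynomials over Z_m reduced
# modulo an irreducible f(x); here m = 2, f(x) = x^3 + x + 1.
m = 2
f = [1, 1, 0, 1]                          # x^3 + x + 1, irreducible over Z_2

def poly_mulmod(a, b):
    # ordinary polynomial product with coefficients in Z_m
    prod = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            prod[i + j] = (prod[i + j] + ai * bj) % m
    # reduce modulo f(x) by repeatedly cancelling the leading term
    while len(prod) >= len(f):
        lead = prod.pop()
        for k in range(len(f) - 1):
            prod[-len(f) + 1 + k] = (prod[-len(f) + 1 + k] - lead * f[k]) % m
    return prod

# x * x^2 = x^3, and x^3 = x + 1 in this field, i.e. [1, 1, 0]
assert poly_mulmod([0, 1, 0], [0, 0, 1]) == [1, 1, 0]
```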

    Now that we have some basic background in finite algebraic structures, we can return to the

    problem of indirectly computing convolution via transforms that exhibit the cyclic convolution

    property. We will expand our definition of such transforms from the DFT to the NTT (Number

    Theoretic Transform).

    V. Number Theoretic Transforms

    Number Theoretic Transforms (NTT's) were discovered independently by a number of researchers

    (Nicholson 1971; Pollard 1971; Rader 1972) as a generalization of the standard DFT with respect

    to the ring over which the transform is defined. The transform retains the same cyclic convolution

    property as the DFT, but allows error free computation with the promise of much faster hardware

    solutions.

    The NTT has the same form as the DFT, with the exception that it is computed over a finite ring, or

    field, rather than over the field of complex numbers.

Xk = Σp (n=0 to N-1) |xn|p ⊗p a^(nk)        (10)

    where the ring, or field, is R(p) or GF(p) respectively. Note that a new notation has been introduced

    here. In order to explicitly identify arithmetic operations over finite fields, or rings, we will use the


symbols ⊕p and ⊗p to represent addition and multiplication modulo p, respectively, and Σp to represent summation modulo p. Where it is clear that a single ring or field is being referred to, then the subscript may be dropped (except on the summation). This notation will be used throughout the rest of the paper, and will be extended where appropriate.
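A direct O(N^2) rendering of the NTT pair in Python may make the notation concrete (an editorial sketch; the function names `ntt` and `intt` are ours):

```python
# Direct computation of the NTT pair of equation (10) over GF(p),
# with a an element of multiplicative order N.
def ntt(x, a, p):
    N = len(x)
    return [sum(xn * pow(a, n * k, p) for n, xn in enumerate(x)) % p
            for k in range(N)]

def intt(X, a, p):
    N = len(X)
    Ninv = pow(N, -1, p)                 # N^(-1) must exist in GF(p)
    return [(Ninv * sum(Xk * pow(a, -n * k, p) for k, Xk in enumerate(X))) % p
            for n in range(N)]

# Round trip over GF(17) with a = 9, an element of order 8:
x = [1, 2, 3, 4, 0, 0, 0, 0]
assert intt(ntt(x, 9, 17), 9, 17) == x
```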

    The NTT that has probably been studied more than any other is the Fermat Number Transform

    (FNT). This, in large part, is due to the ability to use modified binary arithmetic to perform the

    computations. It is not clear, given the present body of knowledge on VLSI implementations, that

    this ability to use modified binary hardware is necessarily a desirable attribute of number theoretic

    algorithms. It is the intent of this paper to present alternative forms of implementation of number

    theoretic techniques that do not have to be predicated on the existence of standard computational

    elements. This opens up the possibility of using a wider variety of number theoretic transforms. Let

    us, however, start off by exploring number theoretic transforms based on Fermat numbers.

    A. Fermat Number Transforms

    Although the NTT allows convolution to be computed without error (unlike the DFT which

    introduces errors due to the requirement for arithmetic with coefficients obtained from

    transcendental functions), the use of such a finite field (or ring) transform is only appropriate when

    the hardware has no greater complexity than normal computations over the reals. Satisfying this

    requirement can be a problem when the field modulus is not chosen with the implementation in

    mind. Among the earliest implementations considered (and still occupying a small, but current,

    research interest) are Fermat Number Transforms (FNT's).

The FNT is initially defined over a Galois field GF(Ft), where the modulus Ft = 2^(2^t) + 1 is the t'th Fermat Number (FN), and the FN is prime (primes do not exist for all values of t). This transform brings about three main implementation simplicities (Agarwal and Burrus, 1974):


    1 The arithmetic is very similar to ordinary binary arithmetic because the modulus is close to a

    power of 2.

    2 Simple forms can be found for the multiplicative group generator that reduces the

    multiplications required to compute the transform to modified bit-pattern shifts.

    3 A standard Fast algorithm, obtained from an appropriate FFT, can be used to compute the

    transform.

    The idea of being able to compute over a finite field but with only slightly modified binary hardware

    is obviously appealing, albeit constraining.

    The FNT pair is defined as:

Xk = ΣFt (n=0 to N-1) xn ⊗Ft a^(nk)        (11)

xn = N^(-1) ⊗Ft ΣFt (k=0 to N-1) Xk ⊗Ft a^(-nk)        (12)

There are some points that we should note:

1 Since N | (Ft - 1), it is clear that we can generate a maximally composite N = 2^(2^t) and so use one of the fast algorithms available for the DFT to optimize a computational structure for the FNT.

2 If we restrict N to be even, N^(-1) will always exist.

3 The first four Fermat Numbers are prime, and 3 is a primitive root for all four of these primes.

From 1 and 3 we find that 3^(2^(2^t)) ≡ 1. If we wish to find a generator, a, for a multiplicative group of order 2^b (this will allow a transform of length 2^b), then a^(2^b) ≡ 3^(2^(2^t)) ≡ 1; hence a ≡ 3^(2^(2^t - b)). This last expression allows us to determine the relationship between suitable values of a (for ease of implementation) and possible transform lengths. The easiest value of a to choose is clearly a = 2.
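These properties are easy to verify numerically; an editorial Python sketch (the helper name `fermat` is ours):

```python
# Verify that 3 is a primitive root for the Fermat primes F_1..F_4, and
# build a generator of order 2^b from a = 3^(2^(2^t - b)) mod F_t.
def fermat(t):
    return 2 ** (2 ** t) + 1

for t in range(1, 5):                    # F_1 = 5, ..., F_4 = 65537 are prime
    Ft = fermat(t)
    # Since F_t - 1 = 2^(2^t), 3 has full order iff 3^((F_t-1)/2) = -1 (mod F_t)
    assert pow(3, (Ft - 1) // 2, Ft) == Ft - 1

# A generator of order 2^b = 8 in GF(F_2) = GF(17):
t, b = 2, 3
a = pow(3, 2 ** (2 ** t - b), fermat(t))     # 3^2 = 9
assert a == 9 and pow(a, 8, 17) == 1 and pow(a, 4, 17) != 1
```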


This will allow multiplication by powers of a to be replaced by shifts modulo Ft. Since Ft is close to a power of 2, the hardware should be only moderately more complex than shifts modulo a power of 2 (binary shifts).

It turns out that for a = 2, the order is 2^(t+1). This can be verified, using ordinary arithmetic, from:

2^(2^(t+1)) = (2^(2^t) + 1) · (2^(2^t) - 1) + 1 ≡ 1 (mod Ft)

The largest Fermat prime is F4, for which the order is 32; the arithmetic is modified 17-bit binary.
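The same identity can be checked by machine for the first few Fermat numbers (editorial sketch):

```python
# Confirm that 2 has multiplicative order exactly 2^(t+1) modulo F_t:
# 2^(2^(t+1)) = 1 but 2^(2^t) = F_t - 1, i.e. -1 (mod F_t).
for t in range(1, 5):
    Ft = 2 ** (2 ** t) + 1
    order = 2 ** (t + 1)
    assert pow(2, order, Ft) == 1
    assert pow(2, order // 2, Ft) == Ft - 1     # = -1, so no smaller power gives 1
```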

If we examine F5 and F6, we find that they are composite numbers of the form Ft = K·2^(t+2) + 1. We have to pause at this moment to reflect on the fact that Ft, t ∈ {5, 6}, is not a prime number. Clearly we will not be computing over a field, so will the NTT still work? It turns out that the conditions for the existence of the transform still hold when computing over a ring. The restriction is that the transform has to exist over each field constructed from each of the prime factors of the composite modulus. The transform length, N, therefore satisfies N | (Ft^(i) - 1) ∀i, where Ft = Π Ft^(i) (this assumes no powers of primes as factors). For F5 and F6 we can show that N = 2^(t+2) is the maximum transform length supported by these Fermat numbers; as a necessary condition, we can see that 2^(t+2) | (Ft - 1). Although we are only computing over a ring of integers ZFt, 2^(t+2) does not divide K·2^(t+2) + 1, and so N^(-1) exists. The length supported by a = 2 is 2^(t+1), as before.

For all the Fermat Numbers considered here, it turns out that the transform length, N = 2^(t+2), can be achieved by using a value of a = √2 (Agarwal and Burrus (1974)); i.e. a^2 = 2. The expression for a = √2 is given by:

a = 2^(2^(t-2)) · (2^(2^(t-1)) - 1)

which is clearly representable by a 2-bit (2 ones in a field of zeros) generator. For a fast DIT or DIF algorithm, only the first, or last, stage will require odd powers of a, and so most of the multiplications will only require single-bit multipliers. Given the use of a = √2 and the fifth FN,


we have a dynamic range of just over 32 bits, with a transform length of N = 128. It is clear that

    there is an unfortunate link between dynamic range and transform length, assuming the requirement

    for a fast algorithm.
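The closed form for √2 can be verified for both the prime and composite Fermat numbers (editorial Python sketch):

```python
# Squaring 2^(2^(t-2)) * (2^(2^(t-1)) - 1) modulo F_t recovers 2, and the
# resulting element has order 2^(t+2), double that of a = 2.
for t in range(2, 7):                         # includes the composite F_5 and F_6
    Ft = 2 ** (2 ** t) + 1
    root2 = (2 ** (2 ** (t - 2)) * (2 ** (2 ** (t - 1)) - 1)) % Ft
    assert pow(root2, 2, Ft) == 2
    assert pow(root2, 2 ** (t + 2), Ft) == 1
    assert pow(root2, 2 ** (t + 1), Ft) != 1
```

For t = 2 this gives √2 ≡ 6 (mod 17), and indeed 6^2 = 36 ≡ 2 (mod 17).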

    It might be helpful to take an example of convolution over GF(F2) in order to reinforce the results

    presented here. In order to illustrate the difference between performing convolution using an NTT

and an 8-point DFT, we will choose a = 3^2 = 9 in order that the NTT may also generate an 8-point

    transform length.

    The transform pair is:

Xk = Σ17 (n=0 to 7) |xn|17 ⊗17 9^(nk)

xn = 8^(-1) ⊗17 Σ17 (k=0 to 7) |Xk|17 ⊗17 9^(-nk)

    We can concisely show the calculation using matrix notation (the modulo arithmetic is implied):

    X = T x (13)

    where T is the transformation matrix, whose elements are given by:

Tkn = |9^(kn)|17

    Note that the corresponding DFT transformation matrix has elements given by:

Tkn = e^(-jπkn/4)

    We can immediately see the major difference in the computation of both transforms. Although the

    DFT uses normal arithmetic, the calculations are over the complex field, with elements that are not

    representable in a finite machine. The NTT, although requiring finite field (modulo) arithmetic,

    requires calculations over an integer field using elements that are defined without error.


    As an example calculation, in order to compare the two approaches, we will cyclically convolve the

    sequence {x} = {1, 2, 3, 4, 0, 0, 0, 0} with itself. This example is chosen to illustrate several points,

    as can be seen if we compute the cyclic convolution directly:

yk = Σ (n=0 to 7) xn · x(|k-n|8) or {y} = {1, 4, 10, 20, 25, 24, 16, 0}

If we now compute the aperiodic (normal) convolution we find:

yk = Σ (n=0 to 7) xn · x(k-n) or {y} = {1, 4, 10, 20, 25, 24, 16, 0}

    We can observe that both the computed sequences are identical, which shows that it is possible to

    generate normal aperiodic convolution from periodic convolution by padding the sequences to be

    convolved with zeros. This forms the basis for the classical techniques of overlap-add and overlap-

    save for using cyclic convolution to compute normal convolution (Gold and Rader (1969)). As a

    second point, we also notice that some of the elements of the convolution output lie outside the

    elements of the field over which the NTT is to be computed, GF(17). This will be discussed after

    the example calculations.
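The zero-padding observation is captured in a few lines of Python (editorial sketch; the helper name `cyclic_convolve` is ours):

```python
# Cyclic convolution of the zero-padded sequence equals the aperiodic
# convolution of the original 4-point sequence.
def cyclic_convolve(x, h):
    N = len(x)
    return [sum(x[n] * h[(k - n) % N] for n in range(N)) for k in range(N)]

x = [1, 2, 3, 4, 0, 0, 0, 0]          # {1, 2, 3, 4} padded with four zeros
assert cyclic_convolve(x, x) == [1, 4, 10, 20, 25, 24, 16, 0]
```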

    We will generate the forward transform as the matrix multiplication of equation (13).

[10]   [ 1  1  1  1  1  1  1  1]      [1]
[16]   [ 1  9 13 15 16  8  4  2]      [2]
[ 6]   [ 1 13 16  4  1 13 16  4]      [3]
[11] = [ 1 15  4  9 16  2 13  8] ⊗17  [4]
[15]   [ 1 16  1 16  1 16  1 16]      [0]
[13]   [ 1  8 13  2 16  9  4 15]      [0]
[ 7]   [ 1  4 16 13  1  4 16 13]      [0]
[15]   [ 1  2  4  8 16 15 13  9]      [0]

The symbol ⊗17 indicates matrix multiplication over GF(17). Since we are convolving the sequence with itself, we generate the transform domain point-wise multiplication (over GF(17)) as below, and feed this result to the input of the inverse transform calculation.


[15]   [10]      [10]
[ 1]   [16]      [16]
[ 2]   [ 6]      [ 6]
[ 2] = [11] ⊗17  [11]
[ 4]   [15]      [15]
[16]   [13]      [13]
[15]   [ 7]      [ 7]
[ 4]   [15]      [15]

[ 8]   [ 1  1  1  1  1  1  1  1]      [15]
[15]   [ 1  2  4  8 16 15 13  9]      [ 1]
[12]   [ 1  4 16 13  1  4 16 13]      [ 2]
[ 7] = [ 1  8 13  2 16  9  4 15] ⊗17  [ 2]
[13]   [ 1 16  1 16  1 16  1 16]      [ 4]
[ 5]   [ 1 15  4  9 16  2 13  8]      [16]
[ 9]   [ 1 13 16  4  1 13 16  4]      [15]
[ 0]   [ 1  9 13 15 16  8  4  2]      [ 4]

The inverse transform requires a final multiplication by 8^(-1) = 15 to obtain the cyclic convolution

    output.

[ 1]              [ 8]
[ 4]              [15]
[10]              [12]
[ 3] = 8^(-1) ⊗17 [ 7]
[ 8]              [13]
[ 7]              [ 5]
[16]              [ 9]
[ 0]              [ 0]

    We note that some of the elements of the output convolution do not equal the original calculation

    using the direct convolution sum. They are, in fact, the result of finding the least positive residue,

modulo 17, of the correct output. The results are not wrong in the accepted sense; they are

    simply a mapping of the correct values onto GF(17). We will see later that such results can still be


    used, and in fact will provide a technique for removing some of the problems of the linkage of

    dynamic range with transform length.
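The whole GF(17) example can be reproduced end to end in a few lines of Python (editorial sketch, following the forward transform, point-wise squaring, and scaled inverse above):

```python
# 8-point NTT over GF(17) with a = 9, point-wise squaring, then the
# inverse NTT with a final multiplication by 8^(-1) = 15.
p, a, N = 17, 9, 8
x = [1, 2, 3, 4, 0, 0, 0, 0]

X = [sum(x[n] * pow(a, n * k, p) for n in range(N)) % p for k in range(N)]
assert X == [10, 16, 6, 11, 15, 13, 7, 15]

Y = [(Xk * Xk) % p for Xk in X]                    # point-wise multiplication
y = [(pow(N, -1, p) * sum(Y[k] * pow(a, -n * k, p) for k in range(N))) % p
     for n in range(N)]
assert y == [1, 4, 10, 3, 8, 7, 16, 0]             # {1,4,10,20,25,24,16,0} mod 17
```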

It will be instructive at this point to examine the use of the DFT in performing the same

    convolution. Because we are performing the calculations over the complex field, we will treat the

    'real' and 'imaginary' parts of the sequences and W elements as the elements of a 2-tuple; the

    interaction between the 'real' and 'imaginary' parts will use the usual rules of complex arithmetic.

    The forward transform is shown below:

[10.00+0.00j]   [1    1     1    1     1    1     1    1  ]   [1.00]
[-0.41-7.24j]   [1   c-cj  -j  -c-cj  -1  -c+cj   j   c+cj]   [2.00]
[-2.00+2.00j]   [1   -j    -1    j     1   -j    -1    j  ]   [3.00]
[ 2.41-1.24j] = [1  -c-cj   j   c-cj  -1   c+cj  -j  -c+cj]   [4.00]
[-2.00+0.00j]   [1   -1     1   -1     1   -1     1   -1  ]   [0.00]
[ 2.41+1.24j]   [1  -c+cj  -j   c+cj  -1   c-cj   j  -c-cj]   [0.00]
[-2.00-2.00j]   [1    j    -1   -j     1    j    -1   -j  ]   [0.00]
[-0.41+7.24j]   [1   c+cj   j  -c+cj  -1  -c-cj  -j   c-cj]   [0.00]

where c = 0.71 (≈ 1/√2).

    The transform domain input to the inverse transform is the point-wise multiplication of the forward

    transform with itself:


[100.00+0.00j]   [10.00+0.00j]   [10.00+0.00j]
[-52.28+6.00j]   [-0.41-7.24j]   [-0.41-7.24j]
[  0.00-8.00j]   [-2.00+2.00j]   [-2.00+2.00j]
[  4.28-6.00j] = [ 2.41-1.24j] × [ 2.41-1.24j]
[  4.00+0.00j]   [-2.00+0.00j]   [-2.00+0.00j]
[  4.28+6.00j]   [ 2.41+1.24j]   [ 2.41+1.24j]
[  0.00+8.00j]   [-2.00-2.00j]   [-2.00-2.00j]
[-52.28-6.00j]   [-0.41+7.24j]   [-0.41+7.24j]

    The inverse transform is computed below with a final division by 8.

[  8.00+0.00j]   [1    1     1    1     1    1     1    1  ]   [100.00+0.00j]
[ 32.00+0.00j]   [1   c+cj   j  -c+cj  -1  -c-cj  -j   c-cj]   [-52.28+6.00j]
[ 80.00+0.00j]   [1    j    -1   -j     1    j    -1   -j  ]   [  0.00-8.00j]
[160.00+0.00j] = [1  -c+cj  -j   c+cj  -1   c-cj   j  -c-cj]   [  4.28-6.00j]
[200.00+0.00j]   [1   -1     1   -1     1   -1     1   -1  ]   [  4.00+0.00j]
[192.00+0.00j]   [1  -c-cj   j   c-cj  -1   c+cj  -j  -c+cj]   [  4.28+6.00j]
[128.00+0.00j]   [1   -j    -1    j     1   -j    -1    j  ]   [  0.00+8.00j]
[  0.00+0.00j]   [1   c-cj  -j  -c-cj  -1  -c+cj   j   c+cj]   [-52.28-6.00j]

with c = 0.71 (≈ 1/√2) as before.


[ 1.00]         [  8.00]
[ 4.00]         [ 32.00]
[10.00]         [ 80.00]
[20.00] = 1/8 · [160.00]
[25.00]         [200.00]
[24.00]         [192.00]
[16.00]         [128.00]
[ 0.00]         [  0.00]

(the imaginary parts of both vectors are zero)

The calculations were performed at high precision, and so the effect of quantization noise is not visible in this limited-precision display format. The quantization noise is present in both the inexact representation of the W coefficients, and in the scaling procedures used to keep the data within the dynamic range of the computational system. Even at high precision, zero elements become non-zero, albeit very small, values. The fact that the input sequence is purely real shows up in the Hermitian symmetry of the input to the inverse transform (even symmetry in the real part about the center sample (sample 4) and odd symmetry in the imaginary part). Although we have explicitly shown all details of the calculations (not normally shown in most texts), an examination of the DFT calculation compared to the GF(17) NTT calculation demonstrates the simplicity and error-free behavior of computing indirect convolution over finite fields.
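For comparison, the same convolution via the complex-field DFT can be sketched in Python (editorial example; note the result is only correct to floating-point tolerance, unlike the exact NTT result):

```python
# Indirect convolution over the complex field: forward DFT, point-wise
# squaring, inverse DFT, final division by N = 8.
import cmath

def dft(x, sign):
    N = len(x)
    return [sum(xn * cmath.exp(sign * 2j * cmath.pi * n * k / N)
                for n, xn in enumerate(x)) for k in range(N)]

x = [1, 2, 3, 4, 0, 0, 0, 0]
X = dft(x, -1)                                  # forward transform
y = [Yk / 8 for Yk in dft([Xk * Xk for Xk in X], +1)]

exact = [1, 4, 10, 20, 25, 24, 16, 0]
# The recovered values match only to within floating-point rounding error.
assert all(abs(yk - ek) < 1e-9 for yk, ek in zip(y, exact))
```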

    We are now in a position to summarize the features about each approach to computing indirect

    convolution. This is presented in Table I below:

Table I Comparison of DFT and FNT integer sequence convolution

Feature             DFT              FNT                                              Comments
Transform Length    No restriction   Restricted by requirement N | (Ft^(i) - 1) ∀i


Dynamic Range       No restriction   Restricted by requirement |Xk|