
    Number Theoretic

    Transforms

    G.A. Jullien

    Extract from Number Theoretic Techniques in Digital Signal

    Processing

    Book Chapter

Advances in Electronics and Electron Physics, Academic Press Inc. (Invited), vol. 80, Chapter 2, pp. 69-163, 1991.


    Introduction

    This chapter discusses the application of number theory to the intensive computations required in

    many Digital Signal Processing (DSP) systems. Number theory was always looked upon as one of

    the last vestiges of pure mathematical pursuit, where the idea of applying the theories to practical

    purposes was irrelevant. Over the last few decades we have been discovering very practical

    applications for number theory; among these are applications in coding theory, communications,

    physics and digital information (Schroeder (1985)). Our specific interest in this work is the

application of number theory to the mechanics of arithmetic computations themselves. Modern-day computer technology is heavily dependent on the manipulation of numbers; in fact, the automation of arithmetic operations was the quest of all the early computer pioneers, from Pascal to Babbage to von Neumann. The involvement of number theory in this endeavor, however, has been strangely limited. The use of number theory to help perform the basic arithmetic operations has not been accepted by the commercial world. Perhaps the reason for this is the pervasiveness of the binary number system in the mechanization of computer arithmetic. Many hardware designers have little or no knowledge that alternative ways of computing arithmetic exist!

    There is one field in which the high speed arithmetic manipulation of numbers is of paramount

    importance. This is the field of Digital Signal Processing (DSP), and it is here that alternative

    arithmetic hardware may find a niche. This is indeed the subject of this paper.

Why is it that some DSP systems are so computationally intensive? Typically the DSP system we are discussing will be involved with the processing of streams of data that are indeterminately long; the requirement of the system is to produce processed output data at the same rate that the input

    data is entering the processor. This is referred to as real-time processing. Often the arithmetic

manipulations are simple in form (cascades of additions and multiplications in a well-defined structure) but the number of operations that have to be computed every second can be enormous. It


    will be constructive at this point to consider an example to illustrate the processing power required

    by a typical DSP system.

Let us consider a basic problem: the filtering of television pictures in real time. We will assume that a two-dimensional filter is required and is given by the following equation:

y_{m,n} = Σ_{i=-2}^{2} Σ_{j=-2}^{2} w_{i,j} x_{m-i,n-j}

Here w_{i,j} is the filter kernel, and represents the weights of a two-dimensional averaging procedure around each output sample (y_{m,n}). The averaging is based on a neighborhood of input samples {x_{i,j}} that extends 5 samples in each dimension and is centered on the output pixel position. The double

    summation, in fact, represents two-dimensional convolution. Convolution is a basic operation to all

    linear signal processing problems since it represents the mapping function of an input signal to an

    output signal through a linear system. The real time constraint is important since, as we will see, an

    extremely large computational requirement will be necessary in order to construct the filter.

    In order to see the speed of operation required by our filter, we will perform some elementary

    calculations based on typical requirements for the filter. We assume a 525 line scan with 500

    picture elements (pixels) in each line. We also assume that each primary colour is to be filtered

    independently with the 5 x 5 filter. With an interlaced system we need to filter 30 images every

    second. Each output pixel (made up of three primary colour sub-pixels) requires 75

    multiply/accumulate operations (M/A Ops) to implement the filter. There are 262,500 pixels in each

image, and so 1.96875 × 10⁷ M/A Ops are required for each image. Since we need to filter 30 images per second, the total number of M/A Ops per second is 5.9 × 10⁸. Assuming that an M/A

    Op is considered a basic operation in the processor used to filter the image, then we need a machine

that delivers 590 Mega Operations per second (MOPS). In these terms we require a supercomputer to deliver this performance!
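The arithmetic in this paragraph can be reproduced with a short sketch (Python here purely for illustration; the constants are the assumptions stated above):

```python
# Throughput estimate for the 5 x 5 colour-image filter of the text.
LINES, PIXELS_PER_LINE = 525, 500     # assumed scan format
COLOURS = 3                           # primary colour sub-pixels
KERNEL_TAPS = 5 * 5                   # taps per colour
FRAMES_PER_SECOND = 30                # interlaced system

pixels_per_image = LINES * PIXELS_PER_LINE              # 262,500
ops_per_pixel = COLOURS * KERNEL_TAPS                   # 75 M/A Ops
ops_per_image = pixels_per_image * ops_per_pixel        # 1.96875e7
ops_per_second = ops_per_image * FRAMES_PER_SECOND      # ~5.9e8

print(pixels_per_image, ops_per_image, ops_per_second)
```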


Clearly the implementation has to be somewhat different from a large mainframe supercomputer; it would certainly be difficult to fit one into a normal-sized television set!

    The filtering operation has the advantage of not requiring the full features of a general purpose

    computer (the filtering operation does not have to be performed within the typical multi-user,

    complex operating system environment of a main frame machine) and the computations are simply

repeated arithmetic operations in a predetermined structure. In fact, it is rather pointless to talk about the computational requirements in general computer terms (MIPS, MOPS, MFLOPS, etc.); it is better to talk in direct terms of the data-flow requirements, that is, the throughput rate of the

    filter in samples per second (in this example, a sample is a pixel). For our example filter we have to

filter pixels at a rate of 7.875 × 10⁶ pixels per second. If we provide arithmetic hardware which contains 75 Multiply/Accumulate circuits, then each circuit has to operate within 120 ns. If we use currently available integrated circuits that operate within 60 ns, then we can build a circuit with only 38 integrated circuits for the basic arithmetic operations, and perform two operations, sequentially,

    with one circuit. We have therefore reduced the original requirement of a supercomputer to a

    system containing a relatively small number of 'chips'.

    It is the purpose of this paper to discuss ways in which number theoretic techniques can be used to

    perform DSP operations, such as this example filter, by reducing the amount of hardware involved

    in the circuitry. Since we are going to be directly concerned with the implementation of DSP

    arithmetic, it might be helpful to finish off our example by considering the arithmetic details of each

    Multiply/Accumulate operation.

The representation of each primary colour pixel by 8 bits is more than adequate for very high definition of the image, and the quantization of each filter weight to 8 bits is also very generous. This naturally leads to a 16-bit multiplier structure. Since 25 such products are accumulated for each output pixel, an extra 5 bits are required. With 21 bits we can, in fact, compute each output pixel with no error in the computation process. Since 21 bits is a pessimistic upper bound for the computational dynamic range (only required if both the filter


    coefficients and the input pixels are at their maximum values - a very rare occurrence), we can easily

    specify a 20-bit range for accurate computation.
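The bit-width argument can be checked directly (Python for illustration; unsigned magnitudes are assumed, which is the pessimistic case described above):

```python
import math

# Worst-case dynamic range for the 5 x 5 filter with 8-bit pixels
# and 8-bit weights.
max_pixel = 2**8 - 1                     # 255
max_weight = 2**8 - 1                    # 255
taps = 25

max_product = max_pixel * max_weight     # one product needs 16 bits
extra_bits = math.ceil(math.log2(taps))  # 5 extra bits for 25 accumulations
max_output = max_product * taps          # worst-case accumulator value

print(max_product.bit_length(), extra_bits, max_output.bit_length())
```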

This calculation leads us to an important observation: we do not need floating point arithmetic.

    We can, in fact, perform error-free computation within the dynamic range of a typical floating point

    standard mantissa length - no manipulation of the exponent being required at all! This example

    represents a fine-grained computation (one in which successive parts of the computation are of a

    limited dynamic range) and is typical of many DSP algorithms. The limitations we are able to

    impose on the arithmetic calculations will be a great help in reducing the hardware required in the

    very high-speed throughput systems to be discussed in this paper.

    For the purposes of this paper we will restrict ourselves to the computation of linear filter and

    transform algorithms which have, for the one-dimensional case, the inner product form:

y_n = Σ_{m=0}^{N-1} x_m w_{f(n,m)}

In the case of linear filtering (convolution), f(n,m) = n-m and {w} is the weight sequence; for the case of a linear transform with the convolution property, f(n,m) = nm and w_{f(n,m)} = α^{nm}, where α is a primitive Nth root of unity. For the DFT, w_{f(n,m)} = e^{-i2πnm/N}.

Although this limitation appears overly restrictive, this general inner-product form probably

    accounts for the vast majority of digital signal processing functions implemented commercially, and

    also leads us to a body of literature that uses number theory to indirectly compute the convolution

    form of the inner product formulation. We will also restrict our investigation of the application of

    number theoretic techniques to those that use finite field arithmetic directly in the computations.

    This will not cover those algorithms whose structure is derived via number theoretic techniques, and

    whose computation is performed over the usual complex field (e.g. Winograd Fourier Transforms);

    there is a vast amount of literature already available on this separate subject, and the reader is

    directed to the work by McClellan and Rader (1979) as a starting point.
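The inner-product form can be sketched directly (Python for illustration; `inner_product_form` is our name, not the chapter's). Choosing f(n,m) = nm and w a primitive Nth root of unity, w = e^{-i2π/N}, yields the DFT:

```python
import cmath

def inner_product_form(x, w, f):
    """y_n = sum over m of x_m * w**f(n, m): the general form of the text."""
    N = len(x)
    return [sum(x[m] * w ** f(n, m) for m in range(N)) for n in range(N)]

x = [1, 2, 3, 4]
N = len(x)

# w = exp(-i 2*pi/N) (a primitive Nth root of unity) and f(n, m) = n*m
# specializes the form to the DFT.
dft = inner_product_form(x, cmath.exp(-2j * cmath.pi / N), lambda n, m: n * m)
```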


Our discussion in this review will concentrate on two areas: the use of finite ring/field operations for performing ordinary integer computations (the Residue Number System) for high-speed signal processing, and the use of algorithms defined over finite rings/fields, in which not only is the arithmetic performed over such finite structures, but the properties of the algorithm also depend on the properties of the finite field/ring over which it is defined. This is an important

    differentiation. Our sole concern here is to examine arithmetic and general computations over finite

    mathematical systems.

Our starting point should be the recognized origin of the modern field of digital signal processing: the Fast Fourier Transform (FFT). The FFT is not a new transform, but really a name

    for a set of algorithms that compute the Discrete Fourier Transform (DFT) with much reduced

    arithmetic operations. In particular, multiplications, traditionally a hardware and/or time consuming

arithmetic operation, are reduced considerably. With the publication of the first FFT algorithm

    (Cooley and Tukey (1965)) digital signal processing began to emerge as a field in its own right.

    Only a few years after that landmark paper, the first applications of finite field transforms to

    computing convolution appeared. We will not attempt to retrace the history here, but will discuss

    these two discoveries as initial steps on the road to the application of number theory to the digital

    processing of signals. The discovery of efficient ways of computing the DFT allowed both the

    computation of spectra, and the implementation of finite impulse response filters to be carried out

    with much reduced computational time. The most interesting use of the FFT, in the initial years, was

    in the latter application. The ability of the DFT to indirectly compute the inner product form for FIR

    filters comes from its convolution property. The result is surprising because it seems, on the

    surface, to involve more work in that three transforms have to be computed. It is also surprising that

    a transform with a basis set involving trigonometrical functions can be used to efficiently compute

    convolutions between, say, integer sequences, but it does indeed work.

    In order to examine this remarkable property we will briefly discuss FIR filters and then introduce

    the DFT.


    A. Finite Impulse Response Digital Filtering using DFTs

    Finite impulse response (FIR) digital filters produce an output based on a weighted sum of present

    and past inputs. Mathematically this is a weighted averaging scheme represented by:

y_k = Σ_{n=0}^{N-1} h_n x_{k-n}

    where {hn} is the weighting sequence, and {xn} is the input sequence.
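A minimal direct-form realization of this sum can be sketched as follows (Python for illustration; `fir_filter` is our name, and treating inputs before k = 0 as zero is an assumption):

```python
def fir_filter(h, x):
    """Direct-form FIR: y_k = sum over n of h_n * x_(k-n), zero for k-n < 0."""
    y = []
    for k in range(len(x)):
        y.append(sum(h[n] * x[k - n] for n in range(len(h)) if k - n >= 0))
    return y

h = [1, 2, 3]                      # a hypothetical weighting sequence
impulse = [1, 0, 0, 0, 0]
print(fir_filter(h, impulse))      # → [1, 2, 3, 0, 0]: the output is {h_n}
```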

The weighting sequence {hn} characterizes the filter, and it is the determination of this sequence that constitutes the design of the filter. The sequence is sometimes referred to as the impulse

    response based on the equivalent characterization of analog filters. In the digital case the impulse

    function is a perfectly realizable sequence {1, 0, 0, 0, .....}. If we use this impulse function as the x

    sequence, the output of the filter will be the {hn} sequence. We can also define a frequency

response for the filter by using an input sequence that is a complex sinusoid, x_n = e^{iωn}. Here, the frequency is simply an angular increment, ω, which locates the value of the input, at sample n, on the unit circle at an angle of ωn. As with continuous systems, we can invoke the use of a transform

    to map convolution into multiplication. The transform we use is the z-transform (Jury (1964))

    which is defined as:

X(z) = Σ_{n=0}^{N-1} x_n z^{-n}

    Using this definition, and the convolution property, we are able to define a transfer function for the

    FIR filter as:

Y(z)/X(z) = H(z)

By letting z = e^{iω}, we can find the magnitude and phase shift (polar coordinates) of H(e^{iω}), and hence the frequency response (H(e^{iω}) vs ω). We note that a simple inversion procedure exists; we


    simply examine the coefficients of H(z) and assign them to delayed samples of the impulse

    response. H(z) can also be obtained by a polynomial division of Y(z) by X(z); for the FIR filter we

    know that this division will always have finite length and so H(z) (for z an indeterminate complex

    number) has only zeros. For filters which have an infinite length response (infinite impulse

    response (IIR) filters), H(z) contains poles as well as zeros; i.e. we represent H(z) as a rational

    function of polynomials. Closed solutions exist for determining the impulse response for IIR

    filters based on the roots of H(z) (Jury (1964)). We will not consider IIR filters here since our

    interest is in implementing a finite convolution process. IIR filters possess the property that they

are able to implement arbitrary frequency magnitude responses with far fewer multiplications than equivalent FIR filters. They have unfortunate side effects, however: possible instability due to inexact multipliers; the possibility of producing limit cycle oscillations; the possibility of overflow oscillations; and difficulty in implementing linear phase responses. Although a great deal of effort has

    been spent on studying such filters (see, for example, Rabiner and Gold (1975)), it is not clear that

    the implementation efficiency suitably compensates for the problems. Perhaps the best solution is

    to concentrate on efficient implementation strategies for FIR filters. This is the main subject of this

    paper.

    II. Discrete Fourier Transforms

The DFT is defined as a transformation that maps a sequence x_n to a sequence X_k, where:

X_k = Σ_{n=0}^{N-1} x_n e^{-j2πnk/N} = Σ_{n=0}^{N-1} x_n e^{-jW_N nk}   (1)

with W_N = 2π/N. For compactness we will also write W_N(x) for e^{-jW_N x}, so that (1) becomes X_k = Σ_{n=0}^{N-1} x_n W_N(nk).

    In terms of the Fourier transform, this may be regarded as an approximation to the Fourier integral,

    but it turns out that it is a transform in its own right with exact properties. Two of the properties that


are important to the subject of this paper are the inverse and the convolution properties. In fact we need

    the first to prove the second.

    A. Inverse and Convolution Property

One important property is that the transform is reversible; i.e., an inverse mapping from X_k back to x_n exists, given by:

x_n = (1/N) Σ_{k=0}^{N-1} X_k e^{jW_N nk}   (2)

    The proof that the inverse exists yields the basis for the second important property, the convolution

    property.

The easiest way to prove that (2) is the inverse of (1) is by direct substitution. We will make the substitution using an appropriate dummy variable, m; the result should collapse back to x_n.

x_n = (1/N) Σ_{k=0}^{N-1} ( Σ_{m=0}^{N-1} x_m e^{-jW_N mk} ) e^{jW_N nk}

Interchanging the order of summation and gathering terms we find:

x_n = (1/N) Σ_{m=0}^{N-1} x_m Σ_{k=0}^{N-1} e^{-jW_N k(m-n)}

or

x_n = (1/N) Σ_{m=0}^{N-1} x_m Σ_{k=0}^{N-1} W_N(k[m-n])   (3)

The function W_N(k[m-n]) displays rather an interesting property.


W_N(k[m-n]) is a function on the unit circle in the complex plane, taking N unique values: 1, e^{-j[m-n]W_N}, e^{-j2[m-n]W_N}, ..., e^{-j(N-1)[m-n]W_N} for k = 0, 1, 2, ..., N-1. For [m-n] = KN, where K is some integer, all values of the function are 1, and so the sum Σ_{k=0}^{N-1} W_N(k[m-n]) evaluates to N. For [m-n] ≠ KN we can invoke the fact that the function yields a finite geometric series with the sum:

Σ_{k=0}^{N-1} W_N(k[m-n]) = (1 - e^{-jN[m-n]W_N}) / (1 - e^{-j[m-n]W_N}) = 0,

since the numerator contains e^{-j2π[m-n]} = 1. Thus, the only non-zero contributions to (3) occur for [m-n] = KN, and for these values we obtain the result x_n = x_{n+KN}. If we restrict both the sample and transform domains to be periodic, with period N, then we have proved that the transform is invertible.
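The orthogonality argument above can be checked numerically (Python sketch; N = 8 is chosen arbitrarily):

```python
import cmath

N = 8
W = 2 * cmath.pi / N   # the scalar W_N of the text

def orthogonality_sum(d):
    """sum over k = 0..N-1 of exp(-j * W_N * k * d): the inner sum of (3)."""
    return sum(cmath.exp(-1j * W * k * d) for k in range(N))

# N when d = m - n is a multiple of N; (numerically near) zero otherwise.
print(abs(orthogonality_sum(0)), abs(orthogonality_sum(8)), abs(orthogonality_sum(3)))
```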

    Based on this result for the existence of an exact inverse for the DFT, we can now proceed to

    examine the convolution property.

We start by considering the result of multiplying two transform-domain sequences together. Assume that the sample-domain sequences are x_n and y_n, with transform-domain sequences X_k and Y_k. The result of multiplying the transform-domain sequences is a new sequence Z_k = X_k · Y_k. Using the DFT of the x sequence explicitly, we have:

Z_k = Σ_{n=0}^{N-1} x_n W_N(kn) · Y_k   (4)

Inverting equation (4) yields:

z_m = (1/N) Σ_{k=0}^{N-1} Σ_{n=0}^{N-1} x_n W_N(kn) · Y_k · W_N(-km)   (5)

    We now interchange the order of summation to find:


z_m = Σ_{n=0}^{N-1} x_n ( (1/N) Σ_{k=0}^{N-1} Y_k W_N(k[n-m]) )   (6)

The inner expression of equation (6), in parentheses, is the sequence y_{[m-n]}. Since {y} is periodic, we can retain values from a single period by forming y_{[m ⊕_N (-n)]}. Note that here we are using the notation ⊕_N to indicate addition modulo N. This notation will be expanded later.

From (6) we obtain the final result for the output sequence:

z_m = Σ_{n=0}^{N-1} x_n y_{[m ⊕_N (-n)]}   (7)

    zm is therefore the convolution of the sequence {x} with the sequence {y}. It is the convolution of

    two periodic sequences, and is often called cyclic, or periodic, convolution. As noted before, we only

need to know the samples of a single period in order to perform the calculation. If we wish to determine the periodic convolution of two sequences, then we can perform the calculation indirectly

    by taking DFT's of each sequence, multiplying in the transform domain, and inverting the result. If

we require the convolution of two aperiodic sequences, it turns out that we can still use periodic convolution, provided that we are prepared to perform several periodic convolutions with sub-sequences of the original sequences, and assemble the final sequence from partial outputs of the periodic

    convolutions; these are the so-called overlap/save and overlap/add procedures. We are not able to

    give full details of such procedures here, but there are many fine text books which detail the two

    approaches (e.g. Rabiner and Gold (1975)).
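The indirect route just described can be sketched end to end (Python for illustration; `dft` and `circular_convolution` are our names, and the O(N²) DFT stands in for a fast algorithm):

```python
import cmath

def dft(x, inverse=False):
    """Direct O(N^2) evaluation of equations (1) and (2)."""
    N, sign = len(x), 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / N)
               for m in range(N)) for k in range(N)]
    return [v / N for v in out] if inverse else out

def circular_convolution(x, y):
    """Direct evaluation of (7): z_m = sum over n of x_n * y_(m-n mod N)."""
    N = len(x)
    return [sum(x[n] * y[(m - n) % N] for n in range(N)) for m in range(N)]

x, y = [1, 2, 3, 4], [1, 1, 0, 0]
direct = circular_convolution(x, y)

# Indirect route: transform both, multiply pointwise, invert.
X, Y = dft(x), dft(y)
indirect = dft([a * b for a, b in zip(X, Y)], inverse=True)
```

Note that `indirect` comes back as (near-integer) complex values: the floating-point roots of unity introduce the small errors discussed later in this section.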

    The use of an indirect computation of convolution is advantageous when the amount of computation

involved is considerably less than that required using the direct convolution sum. An N-point periodic convolution requires N² multiplications and N² − N two-operand additions. Can we reduce this by an indirect computation involving DFT's? The answer is yes, but only if the transform can be computed efficiently. An algorithm that efficiently computes a DFT is called a fast


    algorithm, and the computation of a DFT with such an algorithm is called a Fast Fourier Transform

    (FFT). Note that the FFT is not a separate transform, just a fast (or efficient) way of computing a

    DFT.

    For simplicity we will discuss the development of fixed radix decimation algorithms that use a

    power of two as the radix. These were the first Fast algorithms to be developed, and still represent

    the major implementation procedure for commercial hardware and software solutions.

    III. Fast Fourier Transform (FFT) Algorithms

We will briefly examine the classical decimation-in-time algorithm to appreciate the magnitude of savings that can occur in computing the DFT using a fast algorithm.

    A. Decimation in Time FFT Algorithm

    There are two basic decimation (sequence slicing) procedures: decimation over the sample domain

    (decimation in time, DIT) or decimation over the transform domain (decimation in frequency, DIF).

    We will consider the first procedure, with the assumption that N contains some power of 2 factors.

The transformation from x_n to X_k can be written:

X_k = Σ_{n=0}^{N-1} x_n W_N(nk)   (8)

    and we can decompose the computation by computing (8) for the even and odd samples, separately,

    and then put the two results back together. This yields:

X_k = Σ_{n=0}^{N/2-1} x_{2n} W_{N/2}(nk) + W_N(k) · Σ_{n=0}^{N/2-1} x_{2n+1} W_{N/2}(nk)   (9)


We have now reduced the computational requirement for an N-point DFT to that required for two N/2-point DFTs and an extra N multiplications. If we are interested in reducing multiplications, then we have reduced the requirement from N² multiplications to N²/2 + N multiplications. Although this may not represent a large reduction, we are able to repeat the process by decimating each of the N/2-point DFTs into two N/4-point DFTs. We can continue in this way until we run out of factors of N. The greatest decimation occurs for N a power of 2, in which case the basic computational element reduces to a 2-point DFT (which can be computed using real numbers). In this case the number of multiplications reduces to (N/2)log₂N (in fact some of these multiplications are trivial, but we will use this figure as an upper bound). The calculation reduces to the computation of (N/2)log₂N two-sample DFT's, each with a multiplication by a power of W. This multiplier is popularly referred to as a 'twiddle factor'.
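The decimation just described can be sketched as a recursive routine (Python for illustration; this is the textbook radix-2 DIT recursion, not code from the chapter):

```python
import cmath

def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])   # the two N/2-point DFTs of (9)
    out = [0j] * N
    for k in range(N // 2):
        twiddled = cmath.exp(-2j * cmath.pi * k / N) * odd[k]  # twiddle factor
        out[k] = even[k] + twiddled              # butterfly sum
        out[k + N // 2] = even[k] - twiddled     # butterfly difference
    return out

x = [0, 1, 2, 3, 4, 5, 6, 7]
spectrum = fft(x)
```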

The basic computational element is shown in the signal flow graph in Fig. 1 below, and is often referred to as a 'butterfly', based on its structure.

[Figure 1: Butterfly calculation with twiddle factor multiplier. Inputs A and B; B is multiplied by the twiddle factor W_N^r, and the two outputs are C = A + B·W_N^r and D = A − B·W_N^r.]


Notice that there is only one complex multiplication and two complex additions per butterfly. An example is shown in Figure 2 for an 8-point transform, using (N/2)log₂N = 4 log₂ 8 = 4 × 3 = 12 butterfly calculations. Appropriate pairs of shaded nodes represent the addition/subtraction network of the butterfly calculation.

    Note that although there are 12 multiplications indicated, in fact only 5 of these are non-trivial (i.e.

    not a multiplication by unity). Before we leave this brief introduction to fast algorithms it can be

    observed that the output sequence has been scrambled in the process of reducing the number of

    multiplications, and that the structure is not the same from stage to stage. Algorithms are available

    that do not scramble the output data (or require the input data to be scrambled) and there are also

    algorithms that maintain a constant structure, but not together (a sort of Heisenberg uncertainty

    principle for FFT's !).

[Figure 2: FFT Signal Flow Graph for 8-point DFT transform. Three stages of butterflies connect the input samples x₀ ... x₇ to the outputs X₀, X₄, X₂, X₆, X₁, X₅, X₃, X₇ (a bit-reversed ordering), with twiddle-factor multipliers that are powers of W₈.]


It seems to be a law of nature that as one attempts to reduce the number of arithmetic operations (or at least the most costly: multiplication), the structure suffers. This comes back to haunt us when we

    examine VLSI implementations of such algorithms.

    There are many works available on the myriad of fast algorithms that have been discovered since the

    initial algorithm was published (for example Blahut (1985)). It has been our intent here simply to

    point out that these algorithms exist and to briefly discuss one of the earliest.

    B. Use of the DFT in indirect computation of convolution

    Based on the fact that DFT's can be computed very efficiently, it becomes clear that the operation of

    convolution can also be computed much more efficiently than directly implementing the convolution

sum. Assuming that we wish to compute circular convolution, an N-point convolution can be computed using 2 DFTs and an inverse DFT. Since an inverse DFT has the same complexity as a forward DFT, the number of multiplications involved is (3N/2)log₂N + N. The reduction in multiplications for an N-point circular convolution is therefore a factor of (1/N)(1 + (3/2)log₂N). For the case of filtering with a fixed set of weights, we can store the DFT of the weight sequence and obtain a greater reduction, to a factor of (1/N)(1 + log₂N). These savings are reduced if one is interested in aperiodic convolution (the normal filtering operation), but for large filter kernels (large N) the savings can be significant enough to warrant using this indirect approach.
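The operation counts can be tabulated with a short sketch (Python; the formulas are the upper bounds used in this section, counting (N/2)log₂N multiplications per FFT):

```python
import math

def direct_mults(N):
    """Multiplications for direct N-point circular convolution."""
    return N * N

def indirect_mults(N, fixed_weights=False):
    """Two or three (N/2)log2(N) FFTs plus N point-wise products."""
    ffts = 2 if fixed_weights else 3
    return ffts * (N // 2) * int(math.log2(N)) + N

for N in (8, 64, 1024):
    print(N, direct_mults(N), indirect_mults(N), indirect_mults(N, True))
```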

We note one unfortunate fact: even if we are convolving only real sequences, it is still necessary to use complex arithmetic. In fact, our reduction calculations above are predicated on the need for complex sequence convolution. There is another unfortunate effect: since the roots of unity are

    obtained from transcendental functions, we are not able to compute convolutions without error; even

    the convolution between sequences of integers, which in the direct form can be easily computed

    without invoking any errors, will have to be computed using approximations to the roots, and will, in

    general, invoke errors in the final convolution sequence.


    There are mathematical systems in which roots of unity can be generated and used in calculations

    without any approximations. In fact, approximations are meaningless. The mathematical systems

    are finite algebras, and we will devote the remainder of this chapter to discussing their use in the

    computation of DSP algorithms.

    It is important to establish some basic concepts and theories relating to the subject of finite algebra;

    the following section is included for that purpose.

    IV. Finite Algebras

For our specific purposes we will be using algebraic structures that emulate normal integer computations. The entities to be discussed are groups, rings and fields, which contain a finite number of integers as their elements and allow arithmetic operations that are modifications of

    normal integer arithmetic operations. This will not be an exhaustive review; the reader is referred to

    standard texts on number theory (e.g. Dudley, 1969, and Vinogradov, 1954) as well as texts on

    modern algebra (e.g. Burton, 1970, and Schilling and Piper, 1975).

As a starting point, we will formally define the notions of a ring and a field.

    A. Rings and Fields

    Definition 1:

If R is a non-empty set on which there are defined binary operations of addition, +, and multiplication, ·, such that the following postulates (I) - (VI) hold, we say that R is a ring:

(I) Commutative law of addition: a + b = b + a

(II) Associative law: (a + b) + c = a + (b + c)

(III) Distributive law: a·(b + c) = a·b + a·c


(IV) Existence of an element, denoted by the symbol 0, of R such that a + 0 = a for every a ∈ R

(V) Existence of an additive inverse: for each a ∈ R, there exists x ∈ R such that a + x = 0

(VI) Closure, where it is understood that a, b, c are arbitrary elements of R in the above postulates

A ring in which multiplication is a commutative operation is called a commutative ring. A ring with identity is a ring in which there exists an identity element, 1, for the operation of multiplication: a·1 = 1·a = a for all a ∈ R.

The ring of integers is a well-known example of a commutative ring with identity. In abstract algebra, elements of a ring are not necessarily the integers, nor even necessarily numbers; e.g., they can be polynomials. Given a ring R with identity 1, an element a ∈ R is said to be invertible, or to be a unit, whenever a possesses an inverse with respect to multiplication. The multiplicative inverse, a⁻¹, is an element such that a⁻¹·a = 1. The set of all invertible elements of a ring is a group with respect to the operation of multiplication and is called a multiplicative group.

    Definition 2:

If R is a ring and a ∈ R, a ≠ 0, then a is called a divisor of zero if there exists some b ∈ R, b ≠ 0, such that a·b = 0.

    Definition 3:

    An integral domain is a commutative ring with identity which has no divisors of zero.

    Definition 4:

    A ring, F, is said to be a field provided that the set F-{0} is a commutative group under

    multiplication.


    Viewed otherwise: a field is a commutative ring with identity in which each non-zero element

    possesses an inverse under multiplication. It follows (McCoy, 1972) that every field is an integral

    domain. The rings (fields) with a finite number of elements are called finite rings (fields). A ring of

    integers modulo m, denoted here as Zm, is an example of a finite ring.

    In every finite field with p elements, the nonzero elements form a multiplicative group. This

    multiplicative group of order p-1 is cyclic; i.e. it contains an element, a, whose powers exhaust the

    entire group. This element is called a generator, or a primitive root of unity, and the period or order

of a is p-1. The order of any element x in the multiplicative group is the least positive integer t such that x^t = 1, with x^s ≠ 1 for s ∈ [1, t). The order t is a divisor of p-1 and x is called a primitive t-th root of

    unity.

For every element x of the set F-{0}, the mapping defined by x = a^t, t ∈ {0, 1, ..., p-2}, is the isomorphism of the additive group of integers with addition modulo p-1 and the multiplicative group of the field F.

The integer t is called the index of x relative to the base a, denoted ind_a x.
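As a concrete illustration of generators and indices, here is a small editorial sketch in modern Python (the helper names `element_order` and `primitive_root` are ours, not from the chapter):

```python
# Hypothetical illustration: find a generator of the multiplicative group
# of GF(p) and build the index (discrete log) table described above.
def element_order(x, p):
    """Least t > 0 with x^t = 1 (mod p)."""
    t, y = 1, x % p
    while y != 1:
        y = (y * x) % p
        t += 1
    return t

def primitive_root(p):
    """Smallest generator a whose powers exhaust GF(p) - {0}."""
    return next(a for a in range(2, p) if element_order(a, p) == p - 1)

p = 17
a = primitive_root(p)                            # 3 is the smallest primitive root mod 17
index = {pow(a, t, p): t for t in range(p - 1)}  # ind_a(x) for each nonzero x
```

For p = 17 the smallest primitive root is 3, and the dictionary gives ind_3(x) for every nonzero x in GF(17).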

    1. The ring of residue classes

    In order to describe the system, the notion of congruence should be introduced. The basic theorem,

    known as the division algorithm will be established first.

    Division Algorithm

For given integers a and b, b not zero, there exist two unique integers, q and r, such that a = b·q + r, r ∈ [0, |b|). It is clear that q is the integer value of the quotient a/b. The quantity r is the least positive (integer) remainder of the division of a by b and is designated as the residue of a modulo b, or |a|b. We will say that b divides a (written b|a) if there exists an integer k such that a = b·k.
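In Python, for example, the built-in divmod already yields this least positive remainder when b > 0 (a small editorial sketch, not part of the original text):

```python
# Division algorithm for b > 0: a = b*q + r with r in [0, b),
# even when a is negative.
a, b = -7, 3
q, r = divmod(a, b)          # Python's floor division gives the least positive residue
assert a == b * q + r and 0 <= r < b
# Here -7 = 3*(-3) + 2, so the residue of -7 modulo 3 is 2.
```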

    Definition 5:


Two integers c and d are said to be congruent modulo m, written

c ≡ d (mod m), if and only if m|(c-d)

Since m|0, c ≡ c (mod m) by definition.

Another alternative definition of congruence can be stated as follows: two integers c and d are congruent modulo m if and only if they leave the same remainder when divided by m, |c|m = |d|m.

    Definition 6:

    A set of integers containing exactly those integers which are congruent, modulo m, to a fixed integer

    is called a residue class, modulo m.

    The residue classes (mod m) form a commutative ring with identity with respect to the modulo m

    addition and multiplication, traditionally known as the ring of integers modulo m or the residue ring,

    and denoted Zm . The ring of residue classes (mod m) contains exactly m distinct elements. The

    ring of residue classes (mod m) is a field if and only if m is a prime number, because multiplicative

inverses, denoted a^(-1) (mod m) or |a^(-1)|m, exist for each nonzero a ∈ Zm. Thus the nonzero classes of Zm form a cyclic multiplicative group of order m-1, {1, 2, ..., m-1}, with multiplication modulo m, isomorphic to the additive group {0, 1, ..., m-2} with addition modulo m-1.
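These claims are easy to check by machine; the following editorial Python sketch (the helper name `inverse` is ours) tests that Z_17 is a field while Z_15 has zero divisors:

```python
# Inverses exist in Z_m for exactly those a with gcd(a, m) = 1;
# when m is prime, that is every nonzero a.
from math import gcd

def inverse(a, m):
    """Multiplicative inverse of a in Z_m, or None if it does not exist."""
    return pow(a, -1, m) if gcd(a, m) == 1 else None

assert all(inverse(a, 17) is not None for a in range(1, 17))  # Z_17 is a field
assert inverse(6, 15) is None                                 # gcd(6, 15) = 3
assert (6 * 5) % 15 == 0                                      # 6 and 5 are divisors of zero in Z_15
```

The modular inverse via `pow(a, -1, m)` requires Python 3.8 or later.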

If m is composite, Zm is not a field. Multiplicative inverses do not exist for those nonzero elements a ∈ Zm for which gcd(a, m) ≠ 1. Euler's totient function, denoted φ(m), is a widely used number theoretic function and is defined as the number of positive integers less than m and relatively prime to it. It follows that the number of invertible elements of Zm is equal to φ(m). If m has the prime power factorization m = m1^e1 · m2^e2 · ... · mL^eL, then we can write:

φ(m) = m (1 - 1/m1) (1 - 1/m2) ... (1 - 1/mL)


The Euler-Fermat theorem states that if c is an integer, m is a positive integer and gcd(c, m) = 1, then c^φ(m) ≡ 1 (mod m).

The important consequence of this theorem is that there is an upper limit on the order of any element a ∈ Zm. Specifically, the order t is a divisor of φ(m).
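Both the totient product formula and the Euler-Fermat theorem are easy to verify numerically; a short editorial Python sketch (the helper name `phi` is ours):

```python
# Euler's totient computed from the prime factorization, followed by a
# check of c^phi(m) = 1 (mod m) for every c relatively prime to m.
from math import gcd

def phi(m):
    result, d, n = m, 2, m
    while d * d <= n:
        if n % d == 0:
            result -= result // d       # multiply running result by (1 - 1/d)
            while n % d == 0:
                n //= d
        d += 1
    if n > 1:                           # leftover prime factor
        result -= result // n
    return result

m = 20                                  # 20 = 2^2 * 5, so phi(20) = 20*(1/2)*(4/5) = 8
assert phi(m) == 8
assert all(pow(c, phi(m), m) == 1 for c in range(1, m) if gcd(c, m) == 1)
```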

    2. Galois Fields

For any prime m and any positive integer n, there exists a finite field with m^n elements. This (essentially unique) field is commonly denoted by the symbol GF(m^n) and is called a Galois field in honor of the French mathematician Evariste Galois. Since any finite field with m^n elements is a simple algebraic extension of the field Zm, a brief review of the basic concepts of the extensions of

    a given field will be presented.

Let F be a field. Then any field K containing F is an extension of F. If λ is algebraic over F, i.e. if λ is a root of some irreducible polynomial f(x) ∈ F[x] such that f(λ) = 0, then the extension field arising from a field F by the adjunction of a root λ is called a simple algebraic extension, denoted F(λ).

Each element of F(λ) can be uniquely represented as a polynomial:

a0 + a1λ + ... + a(n-1)λ^(n-1), ai ∈ F.

This unique representation closely resembles the representation of a vector in terms of the vectors of a basis

1, λ, ..., λ^(n-1).

The vector space concepts are sometimes applied to extension fields, and F(λ) is considered as a vector space of dimension n over F.


The field of complex numbers is an example of an extension of the field of real numbers; it is generated by adjoining a root j = √-1 of the irreducible polynomial x^2 + 1.

If f(x) is an irreducible polynomial of degree n over Zm, m a prime, then the Galois field with m^n elements, GF(m^n), is usually defined as the quotient field Zm[x] / (f(x)), i.e., the field of residue classes of polynomials of Zm[x] reduced modulo f(x). All fields containing m^n elements are isomorphic to each other. In particular, Zm[x] / (f(x)) is isomorphic to the simple algebraic extension Zm(λ), where λ is a root of f(x) = 0.
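As an editorial sketch of this quotient-field construction, the following Python fragment multiplies elements of GF(2^3) represented as coefficient lists (lowest degree first) reduced modulo the irreducible polynomial f(x) = x^3 + x + 1 (our choice of f(x) for illustration):

```python
# Minimal sketch of GF(m^n) arithmetic: polynomials over Z_m reduced
# modulo an irreducible f(x); here m = 2, f(x) = x^3 + x + 1.
m = 2
f = [1, 1, 0, 1]                          # x^3 + x + 1, irreducible over Z_2

def poly_mulmod(a, b):
    # ordinary polynomial product with coefficients in Z_m
    prod = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            prod[i + j] = (prod[i + j] + ai * bj) % m
    # reduce modulo f(x) by repeatedly cancelling the leading term
    while len(prod) >= len(f):
        lead = prod.pop()
        for k in range(len(f) - 1):
            prod[-len(f) + 1 + k] = (prod[-len(f) + 1 + k] - lead * f[k]) % m
    return prod

# x * x^2 = x^3, and x^3 = x + 1 in this field, i.e. [1, 1, 0]
assert poly_mulmod([0, 1, 0], [0, 0, 1]) == [1, 1, 0]
```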

    Now that we have some basic background in finite algebraic structures, we can return to the

    problem of indirectly computing convolution via transforms that exhibit the cyclic convolution

    property. We will expand our definition of such transforms from the DFT to the NTT (Number

    Theoretic Transform).

    V. Number Theoretic Transforms

    Number Theoretic Transforms (NTT's) were discovered independently by a number of researchers

    (Nicholson 1971; Pollard 1971; Rader 1972) as a generalization of the standard DFT with respect

    to the ring over which the transform is defined. The transform retains the same cyclic convolution

    property as the DFT, but allows error free computation with the promise of much faster hardware

    solutions.

    The NTT has the same form as the DFT, with the exception that it is computed over a finite ring, or

    field, rather than over the field of complex numbers.

Xk = Σp (n=0 to N-1) |xn|p ⊗p a^(nk)        (10)

    where the ring, or field, is R(p) or GF(p) respectively. Note that a new notation has been introduced

    here. In order to explicitly identify arithmetic operations over finite fields, or rings, we will use the


symbols ⊕p and ⊗p to represent addition and multiplication modulo p, respectively, and Σp to represent summation modulo p. Where it is clear that a single ring or field is being referred to, then the subscript may be dropped (except on the summation). This notation will be used throughout the rest of the paper, and will be extended where appropriate.
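A direct O(N^2) rendering of the NTT pair in Python may make the notation concrete (an editorial sketch; the function names `ntt` and `intt` are ours):

```python
# Direct computation of the NTT pair of equation (10) over GF(p),
# with a an element of multiplicative order N.
def ntt(x, a, p):
    N = len(x)
    return [sum(xn * pow(a, n * k, p) for n, xn in enumerate(x)) % p
            for k in range(N)]

def intt(X, a, p):
    N = len(X)
    Ninv = pow(N, -1, p)                 # N^(-1) must exist in GF(p)
    return [(Ninv * sum(Xk * pow(a, -n * k, p) for k, Xk in enumerate(X))) % p
            for n in range(N)]

# Round trip over GF(17) with a = 9, an element of order 8:
x = [1, 2, 3, 4, 0, 0, 0, 0]
assert intt(ntt(x, 9, 17), 9, 17) == x
```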

    The NTT that has probably been studied more than any other is the Fermat Number Transform

    (FNT). This, in large part, is due to the ability to use modified binary arithmetic to perform the

    computations. It is not clear, given the present body of knowledge on VLSI implementations, that

    this ability to use modified binary hardware is necessarily a desirable attribute of number theoretic

    algorithms. It is the intent of this paper to present alternative forms of implementation of number

    theoretic techniques that do not have to be predicated on the existence of standard computational

    elements. This opens up the possibility of using a wider variety of number theoretic transforms. Let

    us, however, start off by exploring number theoretic transforms based on Fermat numbers.

    A. Fermat Number Transforms

    Although the NTT allows convolution to be computed without error (unlike the DFT which

    introduces errors due to the requirement for arithmetic with coefficients obtained from

    transcendental functions), the use of such a finite field (or ring) transform is only appropriate when

    the hardware has no greater complexity than normal computations over the reals. Satisfying this

    requirement can be a problem when the field modulus is not chosen with the implementation in

    mind. Among the earliest implementations considered (and still occupying a small, but current,

    research interest) are Fermat Number Transforms (FNT's).

The FNT is initially defined over a Galois field GF(Ft), where the modulus Ft = 2^(2^t) + 1 is the t'th Fermat Number (FN), and the FN is prime (primes do not exist for all values of t). This transform brings about three main implementation simplicities (Agarwal and Burrus, 1974):


    1 The arithmetic is very similar to ordinary binary arithmetic because the modulus is close to a

    power of 2.

    2 Simple forms can be found for the multiplicative group generator that reduces the

    multiplications required to compute the transform to modified bit-pattern shifts.

    3 A standard Fast algorithm, obtained from an appropriate FFT, can be used to compute the

    transform.

    The idea of being able to compute over a finite field but with only slightly modified binary hardware

    is obviously appealing, albeit constraining.

    The FNT pair is defined as:

Xk = ΣFt (n=0 to N-1) xn ⊗Ft a^(nk)        (11)

xn = N^(-1) ⊗Ft ΣFt (k=0 to N-1) Xk ⊗Ft a^(-nk)        (12)

There are some points that we should note:

1 Since N | (Ft - 1), it is clear that we can generate a maximally composite N = 2^(2^t) and so use one of the fast algorithms available for the DFT to optimize a computational structure for the FNT.

2 If we restrict N to be even, N^(-1) will always exist.

3 The first four Fermat Numbers are prime, and 3 is a primitive root for all four of these primes.

From 1 and 3 we find that 3^(2^(2^t)) ≡ 1. If we wish to find a generator, a, for a multiplicative group of order 2^b (this will allow a transform of length 2^b), then a^(2^b) ≡ 3^(2^(2^t)) ≡ 1; hence a ≡ 3^(2^(2^t - b)). This last expression allows us to determine the relationship between suitable values of a (for ease of implementation) and possible transform lengths. The easiest value of a to choose is clearly a = 2.
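These properties are easy to verify numerically; an editorial Python sketch (the helper name `fermat` is ours):

```python
# Verify that 3 is a primitive root for the Fermat primes F_1..F_4, and
# build a generator of order 2^b from a = 3^(2^(2^t - b)) mod F_t.
def fermat(t):
    return 2 ** (2 ** t) + 1

for t in range(1, 5):                    # F_1 = 5, ..., F_4 = 65537 are prime
    Ft = fermat(t)
    # Since F_t - 1 = 2^(2^t), 3 has full order iff 3^((F_t-1)/2) = -1 (mod F_t)
    assert pow(3, (Ft - 1) // 2, Ft) == Ft - 1

# A generator of order 2^b = 8 in GF(F_2) = GF(17):
t, b = 2, 3
a = pow(3, 2 ** (2 ** t - b), fermat(t))     # 3^2 = 9
assert a == 9 and pow(a, 8, 17) == 1 and pow(a, 4, 17) != 1
```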


This will allow multiplication by powers of a to be replaced by shifts modulo Ft. Since Ft is close to a power of 2, the hardware should be only moderately more complex than shifts modulo a power of 2 (binary shifts).

It turns out that for a = 2, the order is 2^(t+1). This can be verified, using ordinary arithmetic, from:

2^(2^(t+1)) = (2^(2^t) + 1) · (2^(2^t) - 1) + 1 ≡ 1 (mod Ft)

The largest Fermat prime is F4, for which the order is 32; the arithmetic is modified 17-bit binary.
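The same identity can be checked by machine for the first few Fermat numbers (editorial sketch):

```python
# Confirm that 2 has multiplicative order exactly 2^(t+1) modulo F_t:
# 2^(2^(t+1)) = 1 but 2^(2^t) = F_t - 1, i.e. -1 (mod F_t).
for t in range(1, 5):
    Ft = 2 ** (2 ** t) + 1
    order = 2 ** (t + 1)
    assert pow(2, order, Ft) == 1
    assert pow(2, order // 2, Ft) == Ft - 1     # = -1, so no smaller power gives 1
```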

If we examine F5 and F6, we find that they are composite numbers of the form Ft = K·2^(t+2) + 1. We have to pause at this moment to reflect on the fact that Ft, t ∈ {5, 6}, is not a prime number. Clearly we will not be computing over a field, so will the NTT still work? It turns out that the conditions for the existence of the transform still hold when computing over a ring. The restriction is that the transform has to exist over each field constructed from each of the prime factors of the composite modulus. The transform length, N, therefore satisfies N | (Ft^(i) - 1) ∀i, where Ft = Π Ft^(i) (this assumes no powers of primes as factors). For F5 and F6 we can show that N = 2^(t+2) is the maximum transform length supported by these Fermat numbers; as a necessary condition, we can see that 2^(t+2) | (Ft - 1). Although we are only computing over a ring of integers ZFt, 2^(t+2) does not divide K·2^(t+2) + 1, and so N^(-1) exists. The length supported by a = 2 is 2^(t+1), as before.

For all the Fermat Numbers considered here, it turns out that the transform length, N = 2^(t+2), can be achieved by using a value of a = √2 (Agarwal and Burrus (1974)); i.e. a^2 = 2. The expression for a = √2 is given by:

a = 2^(2^(t-2)) · (2^(2^(t-1)) - 1)

which is clearly representable by a 2-bit (2 ones in a field of zeros) generator. For a fast DIT or DIF algorithm, only the first, or last, stage will require odd powers of a, and so most of the multiplications will only require single-bit multipliers. Given the use of a = √2 and the fifth FN,


we have a dynamic range of just over 32 bits, with a transform length of N = 128. It is clear that

    there is an unfortunate link between dynamic range and transform length, assuming the requirement

    for a fast algorithm.
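The closed form for √2 can be verified for both the prime and composite Fermat numbers (editorial Python sketch):

```python
# Squaring 2^(2^(t-2)) * (2^(2^(t-1)) - 1) modulo F_t recovers 2, and the
# resulting element has order 2^(t+2), double that of a = 2.
for t in range(2, 7):                         # includes the composite F_5 and F_6
    Ft = 2 ** (2 ** t) + 1
    root2 = (2 ** (2 ** (t - 2)) * (2 ** (2 ** (t - 1)) - 1)) % Ft
    assert pow(root2, 2, Ft) == 2
    assert pow(root2, 2 ** (t + 2), Ft) == 1
    assert pow(root2, 2 ** (t + 1), Ft) != 1
```

For t = 2 this gives √2 ≡ 6 (mod 17), and indeed 6^2 = 36 ≡ 2 (mod 17).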

    It might be helpful to take an example of convolution over GF(F2) in order to reinforce the results

    presented here. In order to illustrate the difference between performing convolution using an NTT

and an 8-point DFT, we will choose a = 3^2 = 9 in order that the NTT may also generate an 8-point

    transform length.

    The transform pair is:

Xk = Σ17 (n=0 to 7) |xn|17 ⊗17 9^(nk)

xn = 8^(-1) ⊗17 Σ17 (k=0 to 7) |Xk|17 ⊗17 9^(-nk)

    We can concisely show the calculation using matrix notation (the modulo arithmetic is implied):

    X = T x (13)

    where T is the transformation matrix, whose elements are given by:

Tkn = |9^(kn)|17

    Note that the corresponding DFT transformation matrix has elements given by:

Tkn = e^(-jπkn/4)

    We can immediately see the major difference in the computation of both transforms. Although the

    DFT uses normal arithmetic, the calculations are over the complex field, with elements that are not

    representable in a finite machine. The NTT, although requiring finite field (modulo) arithmetic,

    requires calculations over an integer field using elements that are defined without error.


    As an example calculation, in order to compare the two approaches, we will cyclically convolve the

    sequence {x} = {1, 2, 3, 4, 0, 0, 0, 0} with itself. This example is chosen to illustrate several points,

    as can be seen if we compute the cyclic convolution directly:

yk = Σ (n=0 to 7) xn · x(|k-n|8) or {y} = {1, 4, 10, 20, 25, 24, 16, 0}

If we now compute the aperiodic (normal) convolution we find:

yk = Σ (n=0 to 7) xn · x(k-n) or {y} = {1, 4, 10, 20, 25, 24, 16, 0}

    We can observe that both the computed sequences are identical, which shows that it is possible to

    generate normal aperiodic convolution from periodic convolution by padding the sequences to be

    convolved with zeros. This forms the basis for the classical techniques of overlap-add and overlap-

    save for using cyclic convolution to compute normal convolution (Gold and Rader (1969)). As a

    second point, we also notice that some of the elements of the convolution output lie outside the

    elements of the field over which the NTT is to be computed, GF(17). This will be discussed after

    the example calculations.
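The zero-padding observation is captured in a few lines of Python (editorial sketch; the helper name `cyclic_convolve` is ours):

```python
# Cyclic convolution of the zero-padded sequence equals the aperiodic
# convolution of the original 4-point sequence.
def cyclic_convolve(x, h):
    N = len(x)
    return [sum(x[n] * h[(k - n) % N] for n in range(N)) for k in range(N)]

x = [1, 2, 3, 4, 0, 0, 0, 0]          # {1, 2, 3, 4} padded with four zeros
assert cyclic_convolve(x, x) == [1, 4, 10, 20, 25, 24, 16, 0]
```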

    We will generate the forward transform as the matrix multiplication of equation (13).

[10]   [ 1  1  1  1  1  1  1  1]      [1]
[16]   [ 1  9 13 15 16  8  4  2]      [2]
[ 6]   [ 1 13 16  4  1 13 16  4]      [3]
[11] = [ 1 15  4  9 16  2 13  8] ⊗17  [4]
[15]   [ 1 16  1 16  1 16  1 16]      [0]
[13]   [ 1  8 13  2 16  9  4 15]      [0]
[ 7]   [ 1  4 16 13  1  4 16 13]      [0]
[15]   [ 1  2  4  8 16 15 13  9]      [0]

The symbol ⊗17 indicates matrix multiplication over GF(17). Since we are convolving the sequence with itself, we generate the transform domain point-wise multiplication (over GF(17)) as below, and feed this result to the input of the inverse transform calculation.


[15]   [10]      [10]
[ 1]   [16]      [16]
[ 2]   [ 6]      [ 6]
[ 2] = [11] ⊗17  [11]
[ 4]   [15]      [15]
[16]   [13]      [13]
[15]   [ 7]      [ 7]
[ 4]   [15]      [15]

[ 8]   [ 1  1  1  1  1  1  1  1]      [15]
[15]   [ 1  2  4  8 16 15 13  9]      [ 1]
[12]   [ 1  4 16 13  1  4 16 13]      [ 2]
[ 7] = [ 1  8 13  2 16  9  4 15] ⊗17  [ 2]
[13]   [ 1 16  1 16  1 16  1 16]      [ 4]
[ 5]   [ 1 15  4  9 16  2 13  8]      [16]
[ 9]   [ 1 13 16  4  1 13 16  4]      [15]
[ 0]   [ 1  9 13 15 16  8  4  2]      [ 4]

The inverse transform requires a final multiplication by 8^(-1) = 15 to obtain the cyclic convolution

    output.

[ 1]              [ 8]
[ 4]              [15]
[10]              [12]
[ 3] = 8^(-1) ⊗17 [ 7]
[ 8]              [13]
[ 7]              [ 5]
[16]              [ 9]
[ 0]              [ 0]

    We note that some of the elements of the output convolution do not equal the original calculation

    using the direct convolution sum. They are, in fact, the result of finding the least positive residue,

modulo 17, of the correct output. The results are not wrong in the accepted sense; they are

    simply a mapping of the correct values onto GF(17). We will see later that such results can still be


    used, and in fact will provide a technique for removing some of the problems of the linkage of

    dynamic range with transform length.
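The whole GF(17) example can be reproduced end to end in a few lines of Python (editorial sketch, following the forward transform, point-wise squaring, and scaled inverse above):

```python
# 8-point NTT over GF(17) with a = 9, point-wise squaring, then the
# inverse NTT with a final multiplication by 8^(-1) = 15.
p, a, N = 17, 9, 8
x = [1, 2, 3, 4, 0, 0, 0, 0]

X = [sum(x[n] * pow(a, n * k, p) for n in range(N)) % p for k in range(N)]
assert X == [10, 16, 6, 11, 15, 13, 7, 15]

Y = [(Xk * Xk) % p for Xk in X]                    # point-wise multiplication
y = [(pow(N, -1, p) * sum(Y[k] * pow(a, -n * k, p) for k in range(N))) % p
     for n in range(N)]
assert y == [1, 4, 10, 3, 8, 7, 16, 0]             # {1,4,10,20,25,24,16,0} mod 17
```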

It will be instructive at this point to examine the use of the DFT in performing the same

    convolution. Because we are performing the calculations over the complex field, we will treat the

    'real' and 'imaginary' parts of the sequences and W elements as the elements of a 2-tuple; the

    interaction between the 'real' and 'imaginary' parts will use the usual rules of complex arithmetic.

    The forward transform is shown below:

[10.00+0.00j]   [1    1     1    1     1    1     1    1  ]   [1.00]
[-0.41-7.24j]   [1   c-cj  -j  -c-cj  -1  -c+cj   j   c+cj]   [2.00]
[-2.00+2.00j]   [1   -j    -1    j     1   -j    -1    j  ]   [3.00]
[ 2.41-1.24j] = [1  -c-cj   j   c-cj  -1   c+cj  -j  -c+cj]   [4.00]
[-2.00+0.00j]   [1   -1     1   -1     1   -1     1   -1  ]   [0.00]
[ 2.41+1.24j]   [1  -c+cj  -j   c+cj  -1   c-cj   j  -c-cj]   [0.00]
[-2.00-2.00j]   [1    j    -1   -j     1    j    -1   -j  ]   [0.00]
[-0.41+7.24j]   [1   c+cj   j  -c+cj  -1  -c-cj  -j   c-cj]   [0.00]

where c = 0.71 (≈ 1/√2).

    The transform domain input to the inverse transform is the point-wise multiplication of the forward

    transform with itself:


[100.00+0.00j]   [10.00+0.00j]   [10.00+0.00j]
[-52.28+6.00j]   [-0.41-7.24j]   [-0.41-7.24j]
[  0.00-8.00j]   [-2.00+2.00j]   [-2.00+2.00j]
[  4.28-6.00j] = [ 2.41-1.24j] × [ 2.41-1.24j]
[  4.00+0.00j]   [-2.00+0.00j]   [-2.00+0.00j]
[  4.28+6.00j]   [ 2.41+1.24j]   [ 2.41+1.24j]
[  0.00+8.00j]   [-2.00-2.00j]   [-2.00-2.00j]
[-52.28-6.00j]   [-0.41+7.24j]   [-0.41+7.24j]

    The inverse transform is computed below with a final division by 8.

[  8.00+0.00j]   [1    1     1    1     1    1     1    1  ]   [100.00+0.00j]
[ 32.00+0.00j]   [1   c+cj   j  -c+cj  -1  -c-cj  -j   c-cj]   [-52.28+6.00j]
[ 80.00+0.00j]   [1    j    -1   -j     1    j    -1   -j  ]   [  0.00-8.00j]
[160.00+0.00j] = [1  -c+cj  -j   c+cj  -1   c-cj   j  -c-cj]   [  4.28-6.00j]
[200.00+0.00j]   [1   -1     1   -1     1   -1     1   -1  ]   [  4.00+0.00j]
[192.00+0.00j]   [1  -c-cj   j   c-cj  -1   c+cj  -j  -c+cj]   [  4.28+6.00j]
[128.00+0.00j]   [1   -j    -1    j     1   -j    -1    j  ]   [  0.00+8.00j]
[  0.00+0.00j]   [1   c-cj  -j  -c-cj  -1  -c+cj   j   c+cj]   [-52.28-6.00j]

with c = 0.71 (≈ 1/√2) as before.


[ 1.00]         [  8.00]
[ 4.00]         [ 32.00]
[10.00]         [ 80.00]
[20.00] = 1/8 · [160.00]
[25.00]         [200.00]
[24.00]         [192.00]
[16.00]         [128.00]
[ 0.00]         [  0.00]

(the imaginary parts of both vectors are zero)

The calculations were performed at high precision, and so the effect of quantization noise is not visible in this limited-precision display format. The quantization noise is present in both the inexact representation of the W coefficients, and in the scaling procedures used to keep the data within the dynamic range of the computational system. Even at high precision, zero elements become non-zero, albeit very small, values. The fact that the input sequence is purely real shows up in the Hermitian symmetry of the input to the inverse transform (even symmetry in the real part about the center sample (sample 4) and odd symmetry in the imaginary part). Although we have explicitly shown all details of the calculations (not normally shown in most texts), an examination of the DFT calculation compared to the GF(17) NTT calculation demonstrates the simplicity and error-free behavior of computing indirect convolution over finite fields.
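For comparison, the same convolution via the complex-field DFT can be sketched in Python (editorial example; note the result is only correct to floating-point tolerance, unlike the exact NTT result):

```python
# Indirect convolution over the complex field: forward DFT, point-wise
# squaring, inverse DFT, final division by N = 8.
import cmath

def dft(x, sign):
    N = len(x)
    return [sum(xn * cmath.exp(sign * 2j * cmath.pi * n * k / N)
                for n, xn in enumerate(x)) for k in range(N)]

x = [1, 2, 3, 4, 0, 0, 0, 0]
X = dft(x, -1)                                  # forward transform
y = [Yk / 8 for Yk in dft([Xk * Xk for Xk in X], +1)]

exact = [1, 4, 10, 20, 25, 24, 16, 0]
# The recovered values match only to within floating-point rounding error.
assert all(abs(yk - ek) < 1e-9 for yk, ek in zip(y, exact))
```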

    We are now in a position to summarize the features about each approach to computing indirect

    convolution. This is presented in Table I below:

Table I Comparison of DFT and FNT integer sequence convolution

Feature             DFT              FNT                                              Comments
Transform Length    No restriction   Restricted by requirement N | (Ft^(i) - 1) ∀i


Dynamic Range       No restriction   Restricted by requirement |Xk|