
IEEE TRANSACTIONS ON COMPUTERS, VOL. C-29, NO. 10, OCTOBER 1980

Implementation of Multiplication, Modulo a Prime Number, with Applications to Number Theoretic Transforms

G. A. JULLIEN, MEMBER, IEEE

Abstract-This paper discusses a technique for multiplying numbers, modulo a prime number, using look-up tables stored in read-only memories. The application is in the computation of number theoretic transforms implemented in a ring which is isomorphic to a direct sum of several Galois fields, parallel computations being performed in each field.

The look-up table technique uses the addition of indexes within a ring that contains at least twice as many elements as the field. The ring is chosen so that the number of elements factors either into two relatively prime submoduli, or into a power of two. The first factorization is useful for implementing multiplication by interconnecting ROM's, the second for using look-up tables stored in a microcomputer ROM. Techniques are given for generating table entries, for interconnecting ROM's, and for constructing microcomputer algorithms. Specific examples are given for multiplication modulo 19 using ROM arrays, and multiplication modulo 13 using an 8048 single chip microcomputer.

Index Terms-Indexes, number theoretic transforms, parallel microcomputer systems, residue number systems, ROM arrays, table look-up.

I. INTRODUCTION

THERE has been considerable interest for several years in a class of signal processing algorithms called number theoretic transforms (NTT's) or, more generally, generalized discrete Fourier transforms (GDFT's). The transforms are similar in form to the classical DFT, except that they are computed over finite Galois fields [1], [2]. The form of the transform is shown in (1):

F_k = Σ_{i=0}^{N-1} f_i a^{ik}    (1)

where a is an Nth root of unity in GF(M), with the restriction that the multiplicative inverse of N exists. Although the transform domain has no known use, the transform possesses the same cyclic convolution property as the classical DFT, and can thus be used for filtering and correlation operations. Because the calculations are performed in a Galois field, there are no errors in the final result; the only problem with this is

Manuscript received June 4, 1979; revised November 20, 1979 and May 13, 1980. This work was supported by an Operating Grant from the Natural Sciences and Engineering Research Council of Canada.
The author is with the Department of Electrical Engineering, University of Windsor, Windsor, Ont., Canada.

that the result of the convolution operation should have an upper bound Σ_{n=0}^{N-1} x_n y_{k-n} < M¹ in order to ensure that the final result is correct, and not simply a mod M residue of the true value. Traditionally, Fermat or Mersenne primes² have been used for M in order to ease the computational burden of computing arithmetic operations, modulo prime numbers, using binary arithmetic. For the same reason, a has also been restricted to some power of two. Because of a direct link between M and N (large M required for large N), extension fields GF(M²) have been built using second-order irreducible polynomials over GF(M). Implementing NTT's in these extension fields allows larger values of N for a given prime number M [3].

It has been suggested several times [4], [5] that NTT's be implemented in rings which are isomorphic to a direct sum of Galois fields:

R ≅ GF(m_1^{n_1}) ⊕ GF(m_2^{n_2}) ⊕ ··· ⊕ GF(m_L^{n_L})

where the {m_i} are primes and the n_i represent the degrees of the extension fields. The results of the operation can be recovered by either using the Chinese remainder theorem or a mixed radix conversion algorithm [6]. This is tantamount to implementing the transform using the residue number system (RNS), and so we may consider applying some recent high-speed RNS architectures using ROM arrays [7]. Using such architectures allows the selection of a and m_i to be based on purely number theoretic considerations, without suffering the added constraint that a and m_i be suitable for high-speed binary implementations. Even with this latter constraint removed, there is still the need for selecting relatively large values of m_i, given the requirement of a reasonable transform length (N > 32). From the form of the transform given by (1), we can see that the two arithmetic operations required are those of addition and multiplication modulo a prime m_i. This paper discusses the efficient ROM array implementation of multiplication, modulo a prime number, with addition included as a subset of the array. An extension of the idea to parallel microprocessor implementations is also given.
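The cyclic convolution property of (1) is easy to check numerically. The following is a minimal C sketch, not taken from the paper: it assumes M = 19, N = 6, and a = 8 (which has multiplicative order 6 modulo 19), transforms two short sequences, multiplies them point by point, and inverts the transform with N^{-1} = 16; the result matches the direct cyclic convolution as long as the convolution values stay below M.

/* Minimal sketch (illustrative only, not from the paper): transform (1)
 * over GF(19) with N = 6 and a = 8, used to check the cyclic
 * convolution property numerically. */
#include <stdio.h>

#define M 19   /* prime modulus */
#define N 6    /* transform length; N divides M - 1 = 18 */
#define A 8    /* a = 8 has multiplicative order 6 mod 19 */

static int powm(int b, int e) {           /* b^e mod M */
    int r = 1;
    while (e--) r = (r * b) % M;
    return r;
}

static void ntt(const int *f, int *F, int a) {   /* F_k = sum f_i a^(ik) mod M */
    for (int k = 0; k < N; k++) {
        F[k] = 0;
        for (int i = 0; i < N; i++)
            F[k] = (F[k] + f[i] * powm(a, (i * k) % (M - 1))) % M;
    }
}

int main(void) {
    int f[N] = {1, 2, 0, 1, 0, 0}, g[N] = {3, 1, 0, 0, 0, 0};
    int F[N], G[N], H[N], h[N];
    int ainv = 12;      /* 8^-1 mod 19 */
    int ninv = 16;      /* 6^-1 mod 19; must exist for the inverse transform */

    ntt(f, F, A);
    ntt(g, G, A);
    for (int k = 0; k < N; k++) H[k] = (F[k] * G[k]) % M;
    ntt(H, h, ainv);
    for (int i = 0; i < N; i++) h[i] = (h[i] * ninv) % M;

    /* direct cyclic convolution mod 19 for comparison */
    for (int k = 0; k < N; k++) {
        int s = 0;
        for (int n = 0; n < N; n++) s = (s + f[n] * g[(k - n + N) % N]) % M;
        printf("k=%d  ntt=%2d  direct=%2d\n", k, h[k], s);
    }
    return 0;
}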

¹ Here we assume that the sequences {x_n}, {y_n} are positive semidefinite.
² The use of a prime modulus ensures that N^{-1} exists.

0018-9340/80/1000-0899$00.75 © 1980 IEEE



II. ROM ARRAY IMPLEMENTATION

For a given modulus m_i < 32, the operations of multiplication and addition, modulo m_i, of two numbers can be computed by looking up the result in a 1k × 8-bit ROM [7]. Using the same approach, operations modulo m_i, 32 < m_i < 64, would require a 4k × 8-bit ROM (or four 1k × 8-bit ROM's), and operations modulo m_i, 64 < m_i < 128, would require a 16k × 8-bit ROM (or sixteen 1k × 8-bit ROM's). Since, for practical NTT's, we will require prime moduli in this latter range, the use of sixteen 8k-bit ROM's does not seem a very efficient approach. In order to increase the implementation efficiency of multiplication, we can use the isomorphism between a multiplicative group g having elements {g_n}, n = 1, 2, 3, ..., m_i - 1, with multiplication modulo m_i, and the additive group k having elements {k_n} = {0, 1, 2, ..., m_i - 2} with addition modulo m_i - 1; m_i being restricted to the primes. The mapping is given by g_n = p^{k_n}, where p is a primitive root of the prime m_i. Thus,

|g_n·g_j|_{m_i} = |p^{|k_n + k_j|_{m_i - 1}}|_{m_i}

where |x|_{m_i} indicates the least positive residue of x, modulo m_i. We can, therefore, perform multiplication by the following steps.

1) Find the index k_i for each number.
2) Add indexes, mod (m_i - 1).
3) Perform the inverse index operation.

There is an immediate simplification that can be made to this algorithm. Since every multiplication in the transform of (1) is between some arbitrary data and a^l (where l depends on the position in the general summation), the {a^l} sequence can be prestored with the mapping already applied. We therefore only need consider performing an index look-up on the arbitrary data, and so step 1) is correspondingly simplified.

Although we do not appear to have made the multiplication any more efficient with this technique, we have replaced multiplication with addition, and this can lead to substantial savings. These savings accrue because we can perform the addition in a modulus other than the prime modulus of the multiplication. The only proviso is that the new modulus has to be at least twice the original modulus. The result of the addition in the larger modulus can easily be corrected to the value that would have been obtained in the smaller modulus, by using a look-up table. It is obvious that all possible outcomes of the addition are contained in this larger modulus, and so no errors will occur. There is no limit to the form of this larger modulus, other than its minimum size. We can, for example, use the highly composite modulus obtained with a binary adder. If we are interested in preserving an "all ROM" array, then a submodular [8] ROM adder can be considered. Here the modulus is decomposed into two relatively prime moduli, and the addition is carried out within this two-moduli system. The final result is reconstructed using a look-up table. As an extra savings, this reconstruction table can include:

1) submodular reconstruction,
2) modulus overflow correction,
3) inverse index look-up.

In summary, then, the entire multiplication process consists of three steps:

1) generating the submodular residues of the indexes (2 ROM's),
2) performing submodular index addition (2 ROM's),
3) reconstruction of the required result (1 ROM).

The size of the five ROM's required depends on the prime modulus with which the result is being reduced.
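As a concrete check of the index-addition idea (before the submodular decomposition of Section III), the following minimal C sketch multiplies modulo 19 by adding indexes with an ordinary binary adder and folding the |·|_18 correction and the inverse index look-up into a single table. It is illustrative only and not the paper's ROM wiring; the primitive root 2 is the one used in Section III, and the zero sentinel value of 63 is an assumption.

#include <stdio.h>
#include <assert.h>

#define P 19          /* prime modulus */
#define ROOT 2        /* primitive root of 19, as selected in Section III */
#define ZERO_CODE 63  /* assumed sentinel index for a zero operand */

int main(void) {
    int index[P];        /* index[g] = k such that |2^k|_19 = g */
    int correct[128];    /* combined |.|_18 correction and inverse index table */

    /* Step 1: build g = |2^k|_19 and interchange contents and addresses. */
    for (int k = 0, g = 1; k < P - 1; k++, g = (g * ROOT) % P)
        index[g] = k;
    index[0] = ZERO_CODE;                       /* zero has no index */

    /* Indexes are held in 6 bits and added with an ordinary binary adder,
     * so a valid sum lies in 0..34; any sum involving ZERO_CODE exceeds 34
     * and the correction table maps it to zero. */
    for (int s = 0; s < 128; s++) {
        if (s > 34) { correct[s] = 0; continue; }
        int k = s % (P - 1), g = 1;
        while (k--) g = (g * ROOT) % P;
        correct[s] = g;
    }

    /* Check every product against ordinary modular multiplication. */
    for (int x = 0; x < P; x++)
        for (int y = 0; y < P; y++)
            assert(correct[index[x] + index[y]] == (x * y) % P);
    puts("all 361 products agree");
    return 0;
}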

III. GENERATION OF THE LOOK-UP TABLES

In this section we will discuss techniques for determining the entries in the ROM's for the three steps of the multiplication process.

Step 1-Submodular Index Table: Based on the selection of a primitive root p for the prime modulus under consideration, we construct a table based upon the mapping g = |p^k|_{m_i}. By interchanging contents and addresses of this table, we have a table of indexes. The two required tables are found by reducing the index table, modulo the two submoduli. As an example, consider m_i = 19 with submoduli {6, 7}. Note that 2m_i < 6 × 7. We will select [10] the primitive root p = 2. The following table is determined from the mapping g = |2^k|_19:

k |  0  1  2  3   4   5  6   7  8   9  10  11  12  13  14  15  16  17
g |  1  2  4  8  16  13  7  14  9  18  17  15  11   3   6  12   5  10

Interchanging contents and addresses we obtain the index table, from which the submodulo residues can be determined:

g     |  1  2   3  4   5   6  7  8  9  10  11  12  13  14  15  16  17  18
k     |  0  1  13  2  16  14  6  3  8  17  12  15   5   7  11   4  10   9
|k|_6 |  0  1   1  2   4   2  0  3  2   5   0   3   5   1   5   4   4   3
|k|_7 |  0  1   6  2   2   0  6  3  1   3   5   1   5   0   4   4   3   2
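A short C sketch of this table generation (illustrative only, using the same m_i = 19, p = 2, and submoduli {6, 7}): it prints the k-to-g table, interchanges contents and addresses, and reduces the resulting index table modulo 6 and modulo 7, reproducing the rows above.

#include <stdio.h>

int main(void) {
    const int m = 19, p = 2, m1 = 6, m2 = 7;
    int power[18], index[19];

    for (int k = 0, g = 1; k < m - 1; k++, g = (g * p) % m) {
        power[k] = g;      /* the k -> g table, g = |2^k|_19 */
        index[g] = k;      /* contents and addresses interchanged */
    }

    printf("k:");
    for (int k = 0; k < m - 1; k++) printf(" %2d", k);
    printf("\ng:");
    for (int k = 0; k < m - 1; k++) printf(" %2d", power[k]);
    printf("\n\ng   k  |k|%d  |k|%d\n", m1, m2);
    for (int g = 1; g < m; g++)
        printf("%2d %3d  %4d  %4d\n", g, index[g], index[g] % m1, index[g] % m2);
    return 0;
}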


Step 2-Submodular Addition Tables: The addresses of these tables are found by concatenating the two input submodulo residues to be added. The contents of the table address is the submodulo residue addition of the input residues. The concatenation of the binary patterns associated with these residues means that there will be unused addresses in the table corresponding to invalid residue bit patterns. For example, if we use 3 bits to represent the residues modulo 6, then the patterns 110, 111 will never arise, and so addresses corresponding to these patterns will never be accessed. The example in Section IV will clarify this point.
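A minimal sketch of one such Step 2 table in C (an illustration of the addressing only, not the ROM contents of Fig. 1): the mod 6 addition table is addressed by the concatenation of two 3-bit residues, and the addresses whose fields hold the invalid patterns 110 or 111 are simply never generated.

#include <stdio.h>

int main(void) {
    int table[64];                                  /* 6-bit address: r1 in bits 5-3, r2 in bits 2-0 */
    for (int a = 0; a < 64; a++) table[a] = -1;     /* addresses that are never accessed */
    for (int r1 = 0; r1 < 6; r1++)
        for (int r2 = 0; r2 < 6; r2++)
            table[(r1 << 3) | r2] = (r1 + r2) % 6;  /* contents = |r1 + r2|_6 */

    /* residues 4 and 5 concatenate to the address 100 101 (binary) = 37 */
    printf("table[37] = %d\n", table[(4 << 3) | 5]);   /* prints 3, i.e. |9|_6 */
    return 0;
}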

Step 3-Reconstruction Table: As explained in Section II, the reconstruction table entries are found in a three-step process.

a) Submodular reconstruction: Let the submoduli be m_1^(i) and m_2^(i) and the corresponding residues be r_1^(i) and r_2^(i). To reconstruct the number, modulo m_1^(i)·m_2^(i), we can either




use the Chinese remainder theorem [6]

r^(i) = | Σ_{j=1}^{2} A_j^(i)·|r_j^(i)·(A_j^(i))^{-1}|_{m_j^(i)} |_{m_1^(i)·m_2^(i)}

where

A_j^(i) = m_1^(i)·m_2^(i) / m_j^(i)

or the equivalent metric vector summation

r^(i) = | Σ_{j=1}^{2} r_j^(i)·U_j^(i) |_{m_1^(i)·m_2^(i)}

where the metric vector

U_j^(i) = A_j^(i)·|(A_j^(i))^{-1}|_{m_j^(i)}.

This can be written most straightforwardly as

r^(i) = | r_1^(i)·m_2^(i)·|(m_2^(i))^{-1}|_{m_1^(i)} + r_2^(i)·m_1^(i)·|(m_1^(i))^{-1}|_{m_2^(i)} |_{m_1^(i)·m_2^(i)}.

b) Modulus overflow correction: Since there will be no overflow of the modulus m_1^(i)·m_2^(i), we can correct the overflow of the modulus m_i - 1 by performing the operation

r_i = |r^(i)|_{m_i - 1}.

c) Inverse index look-up: For this operation we use the mapping

x_i = |p^{r_i}|_{m_i}.

As an example of applying the above steps, consider again the modulus m_i = 19, m_1^(i) = 6, and m_2^(i) = 7. We find

|(m_2^(i))^{-1}|_{m_1^(i)} = |7^{-1}|_6 = 1,   |(m_1^(i))^{-1}|_{m_2^(i)} = |6^{-1}|_7 = 6

so that the metric vectors are U_1^(i) = 7 × 1 = 7 and U_2^(i) = 6 × 6 = 36. For example, the result for input variables r_1^(i) = 1, r_2^(i) = 6 is

r^(i) = |7 + 216|_42 = 13.

Applying the other two steps we find

x_i = |2^{|13|_18}|_19 = |2^13|_19 = 3.

In this way we can construct the complete reconstruction table.
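The three parts of a reconstruction-table entry can be sketched in a few lines of C (illustrative only; the helper name is ours). It uses the metric vectors U_1 = 7 and U_2 = 36 derived above and reproduces the worked example.

#include <stdio.h>

/* One reconstruction-table entry for m = 19, submoduli {6, 7}. */
static int reconstruct(int r1, int r2) {
    int r42 = (r1 * 7 + r2 * 36) % 42;   /* a) submodular reconstruction, U1 = 7, U2 = 36 */
    int r18 = r42 % 18;                  /* b) modulus overflow correction, mod m - 1 */
    int g = 1;                           /* c) inverse index look-up, |2^r18|_19 */
    for (int i = 0; i < r18; i++) g = (g * 2) % 19;
    return g;
}

int main(void) {
    printf("r1 = 1, r2 = 6  ->  %d\n", reconstruct(1, 6));   /* prints 3, as in the text */
    return 0;
}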

IV. LOOK-UP TABLE INTERCONNECTION

The interconnection of the table ROM's for a modulo 19 multiplier is shown in Fig. 1, along with the complete table entries. As in the previous section, submoduli {6, 7} are used. Although the index of zero does not exist (and hence multiplication by zero, using indexes, is theoretically invalid) we store in this location a code of 7. Since neither of the submoduli is greater than 7, 7 will never appear as a valid submodular result, and so the row and column corresponding to 7 can contain another code (again, 7) which will result in an output of zero from the final look-up table.

Fig. 1. A modulo 19 ROM array multiplier (index look-up, addition tables, inverse look-up).

As a specific example of this modulo 19 multiplier, consider x = 12, y = 17. The result is |12 × 17|_19 = 14. Assuming that 17 is already stored in index form (|k(17)|_6 = 4, |k(17)|_7 = 3), the progress through the tables is shown by the outlined memory locations in each table. The reader can verify that the system gives correct results for other operands; in particular, either operand can be zero and the zero detection coding will produce the correct look-up of zero at the final table.
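Tracing this example through the tables of Section III (a derivation from those tables, not part of the original text): the index of 12 is 15, giving submodular residues (|15|_6, |15|_7) = (3, 1), while the prestored index of 17 is 10, giving (4, 3). The submodular additions give (|3 + 4|_6, |1 + 3|_7) = (1, 4); reconstruction gives |1 × 7 + 4 × 36|_42 = |151|_42 = 25; overflow correction gives |25|_18 = 7; and the inverse index look-up gives |2^7|_19 = 14, the expected product.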

Fig. 2 shows the structure for a general prime modulus m_i < 14 × 15/2 = 105. Note that we cannot allow a submodulus to be some power of 2 (in this case, 16) because this would not allow for zero operand correction. Fig. 2 also shows the ease with which the structure can be pipelined to preserve the throughput rate of the single ROM multiplier, used when m_i < 32. Each ROM is buffered by a register which is operated from a common latch pulse, thus allowing a shift of the data through the array at the access plus latch time of the ROM. As an example of a typical throughput rate, we can consider the Monolithic Memories high-performance Schottky 4- and 8-bit wide PROM's with built-in registers [9]; these will accept clock frequencies in the 30 MHz region.

Addition (subtraction) is obviously a subset of the previously discussed ROM array. A complete radix-2 FNTT³ butterfly is shown in Fig. 3. This butterfly computes A = a + a^l·b and B = a - a^l·b for m_i < 105. The extra look-up tables for the previously discussed example, using m_i = 19, are shown in Fig. 4. The difference between the addition and subtraction correction tables is due to the difference in the dynamic range at the input. For the addition correction table, the range is 0 ≤ x ≤ 36, and for the subtraction correction table, the range is -18 ≤ x ≤ 18. As a specific example consider a^l = 17, b = 12, a = 7

∴ A = 2, B = 12 (= -7).

³ Fast NTT, similar to the structure of an FFT.
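A behavioural C sketch of this butterfly (illustrative only, not the ROM array of Fig. 3; it simply reproduces the arithmetic and the example just given):

#include <stdio.h>

/* A = |a + a^l*b|_m, B = |a - a^l*b|_m */
static void butterfly(int a, int al, int b, int m, int *A, int *B) {
    int t = (al * b) % m;       /* multiplier output, 0..m-1 */
    *A = (a + t) % m;           /* addition correction: input range 0..2(m-1) */
    *B = (a - t + m) % m;       /* subtraction correction: input range -(m-1)..m-1 */
}

int main(void) {
    int A, B;
    butterfly(7, 17, 12, 19, &A, &B);
    printf("A = %d, B = %d\n", A, B);   /* prints A = 2, B = 12 */
    return 0;
}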



Fig. 2. General configuration for a pipelined multiplier using a modulus < 105 (index look-up, addition tables, common latch pulse).

The appropriate look-up table addresses are shown outlined in Figs. 1 and 4. Note that the inverse look-up table of Fig. 4 is a submodular version of that in Fig. 1.

ROM arrays can also be used for converting from binary to the residue form at the beginning of the transform, and for converting from residue form to a weighted magnitude form at the end of the transforms. These techniques are discussed in detail elsewhere [7].

V. MICROPROCESSOR ARRAY IMPLEMENTATIONS

The hardwired ROM arrays of Figs. 2 and 3 can be replaced by a software algorithm in a microprocessor. Using the modulo 256 adder/subtractor built into an 8-bit microprocessor, we can eliminate the submoduli addition and subtraction tables. We could even consider eliminating the addition/subtraction correction tables and replacing them with extra instructions; the resulting speed/ROM storage tradeoff would have to be taken into account in order to make such a decision. The complete speed/cost tradeoff between a ROM array and a microcomputer implementation depends on the particular components chosen. For example, a Schottky bipolar ROM can be "read" in a matter of a few tens of nanoseconds whereas a typical NMOS microcomputer will require several microseconds if one considers the extra work of setting up memory addresses. Using a Schottky bipolar bit sliced microcomputer, the time can be reduced to a few hundred nanoseconds; however, the chip count will be larger than for the NMOS system. For systems that only require low data rates, the microcomputer realization is obviously the best choice; for very high-speed data rates, the ROM array realization is the only choice. For the former requirement, a very interesting solution is to use a single chip microcomputer for each NTT butterfly. With the use of such a device a complete parallel computation can be performed with a very low package count. A microcomputer ideally suited to this application is the Intel 8048 series, which has some relatively powerful table look-up instructions in its instruction set. We will use this microcomputer in the next section to illustrate the techniques required for table look-up arithmetic operations modulo a prime number.

Fig. 3. Configuration for a pipelined NTT butterfly using a modulus < 105.


VI. MICROPROCESSOR TABLE LOOK-UP TECHNIQUES

As an example of microprocessor routines to perform both addition and multiplication modulo a prime number, consider the use of a single chip 8-bit microcomputer from the Intel 8048 series. The microcomputer has an on-board ROM (or PROM) of 1k × 8 bits which is divided into 4 pages of 256 bytes each. This can be used to store look-up tables as well as algorithms. The algorithm for addition of two numbers modulo m_i reduces to

1) y = |a + b|_256,
2) compute |y|_{m_i}.

Step 2) can be computed by a set of instructions, or from a correction look-up table as discussed previously. In order to compare both techniques, consider an addition routine that is entered with "a" stored in the accumulator (A) and "b" stored in register 2 (R2). The result is to be placed in the accumulator. Assume m_i = 103. For the correction table algorithms, the table is stored in the top 256-byte page (page 3) of the on-board ROM. The correction table will have the form

Address   0  1  2  3  4  ...  102  103  104  ...  203  204
Contents  0  1  2  3  4  ...  102    0    1  ...  100  101


The remaining 52 bytes at the top of the table can be used for algorithm storage.

There is a single byte instruction available to use page 3 as an efficient look-up table (MOVP3 A, @A), which uses the accumulator contents as a page 3 address, and fetches the contents of this address to the accumulator.

Without Look-Up Table

Operation          Mnemonic        Number of Bytes   Number of Cycles
|a + b|_256 = y    ADD A, R2       1                 1
y - 103 = z        ADD A, #-103    2                 2
                   JP END          2                 2
                   ADD A, #103     2                 2
                   END,
                                   7                 7



Fig. 4. A modulo 19 ROM array NTT butterfly (addition tables, subtraction tables, correction look-up, and inverse look-up (a, b)).

With Look-Up Table

Operation          Mnemonic        Number of Bytes   Number of Cycles
|a + b|_256 = y    ADD A, R2       1                 1
|y|_103            MOVP3 A, @A     1                 2
                                   2                 3

The look-up table routine is over two times as fast as the first routine.
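In C terms the two routines compare as follows; this is a behavioural sketch only (it assumes operands already reduced, 0 ≤ a, b ≤ 102), not 8048 code, but it makes the branch-free nature of the table method explicit.

#include <stdio.h>
#include <assert.h>

static unsigned char table[256];     /* page-3 correction table: table[y] = |y|_103 */

static int add_no_table(int a, int b) {
    unsigned char y = (unsigned char)(a + b);   /* |a + b|_256 */
    int z = y - 103;                            /* y - 103 */
    if (z < 0) z += 103;                        /* restore if the subtraction went negative */
    return z;
}

static int add_table(int a, int b) {
    unsigned char y = (unsigned char)(a + b);   /* |a + b|_256 */
    return table[y];                            /* |y|_103 in a single look-up */
}

int main(void) {
    for (int y = 0; y <= 204; y++) table[y] = (unsigned char)(y % 103);
    for (int a = 0; a < 103; a++)
        for (int b = 0; b < 103; b++)
            assert(add_no_table(a, b) == add_table(a, b) &&
                   add_table(a, b) == (a + b) % 103);
    puts("both routines agree on all 103*103 sums");
    return 0;
}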

We can now consider the implementation of modulo m_i multiplication, m_i a prime. Assuming that the same microcomputer chip is to have both addition and multiplication algorithms, we will not be able to use page 3 for storage associated with index addition, since it is already in use as a correction table for direct addition. The look-up table requirements for multiplication are an index table and an addition, correction, and inverse index combination table. We will assume that the index table is stored in the lower part of page 1 and the combination table in page 2. The upper half of page 1 is available to store the multiplication routine and, possibly, other algorithms. A special instruction for accessing current page information is available in the 8048 instruction set (MOVP A, @A), similar to the page 3 instruction (MOVP3 A, @A). In order to use both pages for look-up tables, the routine will have to cross the page boundary. This will ensure that the appropriate look-up is in the current page. We will store the multiplication algorithms with a starting point positioned near the top of page 1, so that the page boundary occurs in between the page 1 and page 2 table look-ups. If we restrict the maximum modulus to be 127, then the maximum index will be 125 and the minimum will be zero. The index sum will therefore range from 0 to 250, which leaves five locations free at the top of page 2. In order to move these locations to the bottom of page 2, we can add five to each element of the index look-up table on page 1. The routine that multiplies the multiplicand, in A, by the index of the multiplier, in R2, with the result being left in A, is as follows:

Operation                       Mnemonic       Number of Bytes   Number of Cycles
Find index of multiplicand      MOVP A, @A     1                 2
|I_1 + I_2|_256 = Y             ADD A, R2      1                 1
(page boundary)
|p^(|Y|_(m_i - 1))|_(m_i)       MOVP A, @A     1                 2
return                          RET            1                 2
                                               4                 7

The routine does not correct for multiplication by zero; however, a change to the look-up tables (similar to the special code used in the ROM array implementation) can provide a simple indication of multiplication by zero. Since there is no valid result for an index look-up for zero, we can store FF_16 in this position. When this value is used in the addition, the result will either be FF_16, or a carry will result. If the displacement of 5, required to shift the table in page 2, is shared between the index look-up of page 1 and the prestored multiplier index in R2 (say 2 and 3), then the result of the addition will always produce a carry. This is a direct indication of an invalid result that should be zero.
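A C sketch of this zero-detection scheme for m_i = 13 (illustrative only; the primitive root 2, the split of the displacement as 2 and 3, and the table names are our assumptions about the layout, and the 8-bit carry is modelled by a sum exceeding 255):

#include <stdio.h>
#include <assert.h>

#define M 13

static unsigned char page1[M];       /* multiplicand -> index + 2, FF for zero */
static unsigned char page2[256];     /* index sum (already displaced by 5) -> product */

int main(void) {
    for (int k = 0, g = 1; k < M - 1; k++, g = (g * 2) % M)
        page1[g] = (unsigned char)(k + 2);
    page1[0] = 0xFF;                               /* no valid index for zero */

    for (int s = 5; s <= 5 + 2 * (M - 2); s++) {   /* valid displaced sums: 5..27 */
        int k = (s - 5) % (M - 1), g = 1;
        while (k--) g = (g * 2) % M;
        page2[s] = (unsigned char)g;
    }

    for (int x = 0; x < M; x++)
        for (int y = 1; y < M; y++) {              /* the prestored a^l side is never zero */
            unsigned char r2 = (unsigned char)(page1[y] + 1);  /* index + 3, prestored */
            unsigned sum = (unsigned)page1[x] + r2;
            int prod = (sum > 0xFF) ? 0 : page2[sum & 0xFF];   /* carry -> result is zero */
            assert(prod == (x * y) % M);
        }
    puts("all products agree");
    return 0;
}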

In order to demonstrate the above techniques, consider the following example in which m_i = 13 and we assume that the ROM storage has four pages of 32 bytes/page. Fig. 5 shows the memory map in the four pages, with combined look-up tables and algorithm storage. The multiplication routine has been modified to correct for zero. If a carry is detected after the first addition, the routine jumps to X, which sets the accumulator to zero and returns. Page 0 contains a routine to perform a radix-2 butterfly calculation.




A look-up table is provided along with the routine to allow subtraction. The algorithm operates as follows. The routine computes A = a + a^l·b and B = a - a^l·b; the routine is entered with b in the accumulator, the index of a^l in R2, and a in R3:

Operation                         Mnemonic
|a^l·b|_13 = y                    CALL Y
Acc. ↔ R3                         XCH A, R3
|a + y|_256                       ADD A, R3
|a + y|_13 = A                    MOVP3 A, @A
Acc. ↔ R3                         XCH A, R3
Find subtraction table address    ADD A, #19
-2y                               MOVP A, @A
|a + y - 2y|_256                  ADD A, R3
|a + y - 2y|_13 = B               MOVP3 A, @A

A specific example is shown below, where a = 11, b = 8, and a^l = 2 (stored value = 6). The results are

A = |11 + (8 × 2)|_13 = 1 (stored in R3)
B = |11 - (8 × 2)|_13 = 8 (stored in A).

Main Routine

CALL Y
XCH A, R3
ADD A, R3
MOVP3 A, @A
XCH A, R3
ADD A, #19
MOVP A, @A
ADD A, R3
MOVP3 A, @A

Subroutine Y

MOVP A, @A
ADD A, R2
JC X
MOVP A, @A
RET

(The original also tabulates the contents of A and R3 after each instruction.)

The reader can verify that the butterfly calculation works with other values, including operands that are zero.
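The subtraction-by-table trick above can be sketched in C (illustrative arithmetic only, not the 8048 code): once |a + y|_13 is formed, adding the stored value |-2y|_13 and reducing again yields |a - y|_13, so the same adder and correction table produce both butterfly outputs.

#include <stdio.h>

int main(void) {
    const int m = 13, a = 11, y = 3;               /* y = |a^l * b|_13 from the example */
    int Aout = (a + y) % m;                        /* A = |a + y|_13 = 1 */
    int minus2y = (m - (2 * y) % m) % m;           /* subtraction table entry |-2y|_13 = 7 */
    int Bout = (Aout + minus2y) % m;               /* |a + y - 2y|_13 = |a - y|_13 = 8 */
    printf("A = %d, B = %d\n", Aout, Bout);
    return 0;
}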

For a high-speed 8048, the instruction time is 2.5 μs/cycle. If we compute the maximum time to perform each of the routines discussed, the following times are obtained:

addition          7.5 μs
multiplication    17.5 μs
NTT butterfly     62.5 μs.

Although the addition time is not as fast as a modulo 256 addition using the ALU, both the multiplication and NTT butterfly times are very impressive considering the large instruction time and limited capabilities of the device.

Fig. 5. Algorithms and look-up tables for a modulo 13 NTT butterfly using an 8048 microcomputer (memory map of ROM pages 0-3, 32 bytes/page, showing the combined look-up tables and routines).

For a 64-point transform, in which there are 192 butterfly calculations, the calculation overhead for each sample is 192 × 62.5/64 μs = 187.5 μs. It is not unreasonable to expect that the device could perform real-time, modulo m_i, convolution at a data rate of 2 kHz.

VII. COMMENTS AND CONCLUSIONS

It is clear from the preceding discussion that multiplication, modulo a prime, can be performed very effectively using an approach based on look-up tables. Although the idea of index table look-up has been mentioned previously [1], the mechanics of making the idea work have not been previously discussed. This paper has shown that efficient implementations can be obtained using both ROM arrays, for high-speed operation, and look-up tables stored in limited microprocessor program memory, for low-cost implementation. As far as the latter technique is concerned, it is clear that a variety of cost versus speed tradeoffs can be obtained via the use of different microprocessor systems. For example, the use of a Schottky bipolar bit sliced system would provide a high-speed implementation, and the example provided in Section VI (8048 system) would provide a low-speed, low-cost implementation. Finally, although the application used in this paper has been multiplication applied to NTT structures, there is no reason why the techniques cannot be used for integer multiplication within a general RNS architecture, where each of the moduli is prime.



REFERENCES

[1] J. M. Pollard, "The fast Fourier transform in a finite field," Math. Comput., vol. 25, pp. 365-374, Apr. 1971.

[2] R. C. Agarwal and C. S. Burrus, "Fast convolution using Fermat number transforms with applications to digital filtering," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-22, pp. 87-97, Apr. 1974.

[3] I. S. Reed and T. K. Truong, "The use of finite fields to compute convolutions," IEEE Trans. Inform. Theory, vol. IT-21, pp. 208-213, Mar. 1975.

[4] P. J. Nicholson, "Algebraic theory of finite Fourier transforms," J. Comput. Syst. Sci., vol. 5, pp. 524-547, 1971.

[5] E. Dubois and A. N. Venetsanopoulos, "The discrete Fourier transform over finite rings with application to fast convolution," IEEE Trans. Comput., vol. C-27, pp. 586-593, July 1978.

[6] N. S. Szabo and R. I. Tanaka, Residue Arithmetic and its Applications to Computer Technology. New York: McGraw-Hill, 1967.

[7] G. A. Jullien, "Residue number scaling and other operations using ROM arrays," IEEE Trans. Comput., vol. C-27, pp. 325-336, Apr. 1978.

[8] G. A. Jullien and W. K. Jenkins, "The application of residue number systems to digital signal processing," submitted for publication.

[9] "Bipolar LSI Data Book," Monolithic Memories, 1978.

[10] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions, Nat. Bureau of Standards, AMS 55, p. 864, May 1968.

G. A. Jullien (M'70) was born in Wolverhampton, England, on June 16, 1943. He received the B.Tech. degree from Loughborough University of Technology, England, in 1965, the M.Sc. degree from the University of Birmingham, England, in 1967, and the Ph.D. degree from Aston University, England, in 1969, all in electrical engineering.

From 1961 until 1966 he worked for English Electric Computers at Kidsgrove, England, first as a Student Apprentice and then as a Data Processing Engineer. From 1967 until 1969 he was employed as a Research Assistant at Aston University in England. Since 1969 he has been with the Department of Electrical Engineering, University of Windsor, Windsor, Ont., Canada, and currently holds the rank of Professor. From 1975 until 1976 he was a Visiting Senior Research Engineer at the Central Research Laboratories of EMI Ltd., Hayes, Middlesex, England. He is currently engaged in research in the areas of digital signal processing and high-speed digital hardware. He also teaches courses on electronic circuits, microcomputer systems, and digital signal processing.

Performance of a Simulated Dataflow Computer

KIM P. GOSTELOW AND ROBERT E. THOMAS, STUDENT MEMBER, IEEE

Abstract-Our goal is to devise a computer comprising large numbers of cooperating processors (LSI). In doing so we reject the sequential and memory cell semantics of the von Neumann model, and instead adopt the asynchronous and functional semantics of dataflow. We briefly describe the high-level dataflow programming language Id, as well as an initial design for a dataflow machine and the results of detailed deterministic simulation experiments on a part of that machine. For example, we show that a dataflow machine can automatically unfold the nested loops of n × n matrix multiply to reduce its time complexity from O(n³) to O(n) so long as sufficient processors and communication capacity is available. Similarly, quicksort executes with average O(n) time demanding O(n) processors. Also discussed are the use of processor and communication time complexity analysis and "flow analysis," as aids in understanding the behavior of the machine.

Index Terms-Asynchronous execution, concurrency, dataflow, distributed computer, functionality, large-scale integration, locality, multiprocessor architecture, parallel computer.

Manuscript received April 12, 1979; revised October 15, 1979 and March 25, 1980. This work was supported by NSF Grant MCS-7815467: The UCI Dataflow Architecture Project.

K. P. Gostelow was with the Department of Information and Computer Science, University of California, Irvine, CA 92717. He is now with the General Electric Research and Development Center, Schenectady, NY 12301.

R. E. Thomas is with the Department of Information and Computer Science, University of California, Irvine, CA 92717.

I. INTRODUCTION

THE ability of large-scale integration (LSI) technology to inexpensively produce large numbers of identical, small, yet complex devices should make possible a general-purpose computer comprising hundreds, perhaps thousands, of asynchronously operating processors. Within such a machine each processor accepts and performs a small task generated by the program, produces partial results, and sends these results on to other processors in the system. Many processors thus cooperate, asynchronously, to complete the overall computation. A natural consequence of such behavior should be decreased time for problem solution as new processor modules are added to the machine. This paper describes the results of simulation experiments on an initial design for such a machine based on the principle of dataflow.

A. Concurrency and the von Neumann Model

Several computers have been devised in attempts to synthesize a single large machine from a collection of smaller processors, e.g., Illiac IV [10], Cm* [16], and C.mmp [38]. However, multiprocessor machines have not yet achieved the ease of programming and level of performance sought. For example, the programmer should not be concerned with how

0018-9340/80/1000-0905$00.75 © 1980 IEEE
