New realisation technique of high-speed discrete Fourier transform described by distributed...

New realisation technique of high-speed discrete Fourier transform described by distributed arithmetic

W.C. Siu, AP(HK), M.Phil., C.Eng., M.I.E.R.E., Mem.I.E.E.E., and Prof. C.F. Chen, B.S., M.S., Ph.D., C.Eng., F.I.E.E., Sen.Mem.I.E.E.E.

Indexing terms: Mathematical techniques, Microprocessors, Discrete Fourier transforms

Abstract: This paper presents the results of a study using distributed arithmetic on a microprocessor to implement a prime-number-based discrete Fourier transform (DFT). The matrix equation of the DFT can be reordered in a convolution form suitable for distributed arithmetic. Further simplification of the equation can be achieved by noting some simple properties of number theory and the DFT. It was found that the total computation time for a 61-point DFT, using a 2 MHz clock 6800 microprocessor, was only 3.6 ms, and the computation time increases directly in proportion to the number of points in the DFT. This fast realisation technique should be suitable both for microprocessor-based systems and for the direct hardware implementation.

1 Introduction

The discrete Fourier transform (DFT) has a great many potential applications in many areas of engineering science. However, until the fast Fourier transform (FFT) [1] was developed, the DFT was not widely used because of the excessive time it takes to compute. With the development of the FFT many facets of scientific analysis have been completely revolutionised. Although much work has already been done on the implementation of the DFT on minicomputers, large computers and dedicated hardware, there is little work reported [2-5] on the implementation of the DFT using microprocessors. This is mainly because the execution time for a microprocessor-based DFT, or even FFT, is much too long for many applications. It took a few milliseconds to compute a 64-point FFT using two-byte, floating-point, complex arithmetic with real input data [5].

In 1968 Rader [6] found that the computation of the DFT can be changed into a circular convolution by rearranging the complex coefficients W, where $W = e^{-j(2\pi/N)}$, when N is prime. Kolba and Parks [7], making use of the combination of the multidimensional transform with prime factors and Winograd's discrete Fourier transform (WDFT) [8], found that a length-252 DFT requires 3.2 s on an 8080 microprocessor and that the radix-2 FFT subroutine is 70% slower than this prime-factor DFT.

In this paper we present a study of the use of distributed arithmetic [9] on a microprocessor to implement a prime-number-based DFT. We found that this new realisation is very promising and is faster than the FFT algorithm for a 61-point DFT.

2 Theory

Paper 2794E, first received 18th August 1982 and in revised form 18th August 1983

Prof. Chen is, and Mr. Siu was, with the Department of Electronics of the Chinese University of Hong Kong. Mr. Siu is now on leave from the Department of Electronic Engineering of the Hong Kong Polytechnic and is with the Department of Electrical Engineering, Imperial College of Science and Technology, London SW7 2BT, England

IEE PROCEEDINGS, Vol. 130, Pt. E, No. 6, NOVEMBER 1983

Consider a field of integers: F(P) = {0, 1, 2, ..., P - 1}, where P is a prime number. It is always possible to find a primitive root to generate all nonzero elements inside the field modulo P. To see how this may be done consider the case of P = 7. A primitive root of 7 is g = 3. The elements inside the set {n: n = 1, 2, ..., 6} can be generated by a mapping of n to $g^n$ modulo P. The residue of $g^n$ modulo P can be written as $\langle g^n \rangle_P$. The set formed by $\langle 3^n \rangle_P$ is {3, 2, 6, 4, 5, 1}. This shows that the transformation of n to $\langle g^n \rangle_P$ is just a rearrangement of the elements inside {1, 2, ..., P - 1} and $\langle g^n \rangle_P$ forms a cyclic group of P - 1 elements.
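The mapping of n to the residue of g^n modulo P can be checked with a few lines of code. The following Python sketch is illustrative only; the helper names `powers_mod` and `is_primitive_root` are introduced here and do not come from the paper.

```python
def powers_mod(g, p):
    """Return [<g^1>_p, <g^2>_p, ..., <g^(p-1)>_p]."""
    return [pow(g, n, p) for n in range(1, p)]


def is_primitive_root(g, p):
    """For prime p, g is a primitive root if its powers hit every nonzero residue once."""
    return sorted(powers_mod(g, p)) == list(range(1, p))


print(powers_mod(3, 7))           # [3, 2, 6, 4, 5, 1]
print(is_primitive_root(3, 7))    # True
print(is_primitive_root(2, 7))    # False: the powers of 2 only reach {2, 4, 1}
print(is_primitive_root(2, 61))   # True
```

The last line checks the pair P = 61, g = 2 that is used later for the microprocessor implementation.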

Consider now a P-point discrete Fourier transform,

$$X(k) = \sum_{n=0}^{P-1} x(n) W^{nk} \tag{1}$$

where $W = e^{-j(2\pi/P)}$ and k = 0, 1, 2, ..., P - 1.

Writing

$$X(0) = \sum_{n=0}^{P-1} x(n)$$

and

$$X(k) = \sum_{n=1}^{P-1} x(n) W^{nk} + x(0) = X'(k) + x(0)$$

where

$$X'(k) = \sum_{n=1}^{P-1} x(n) W^{nk}, \qquad k = 1, 2, \ldots, (P-1)$$

The existence of a primitive root g allows us to make a renumbering of the DFT equation by mapping k to $\langle g^k \rangle_P$. Thus,

$$X'(\langle g^k \rangle_P) = \sum_{n=1}^{P-1} x(n) W^{n \langle g^k \rangle_P}, \qquad k = 1, 2, \ldots, P-1$$

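The primitive root g also allows us to make a reordering of the terms in the DFT equation.

Define $\langle g^{-n} \rangle_P$ as $\langle g^{\phi(P)-n} \rangle_P$

where

$\phi(P)$ = Euler's totient function

= P - 1 if P is prime

and

$\langle g^{\phi(P)} \rangle_P = 1$ (Euler's theorem)

Let n map to $\langle g^{-n} \rangle_P$, thus

$$X'(\langle g^k \rangle_P) = \sum_{n=1}^{P-1} x(\langle g^{-n} \rangle_P) W^{\langle g^{k-n} \rangle_P} \tag{2}$$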
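Eqn. 2 can be checked numerically against the direct definition of eqn. 1. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, floating-point complex arithmetic replaces any fixed-point hardware, and negative exponents of g are reduced modulo P - 1 as defined above.

```python
import cmath


def dft(x):
    """Direct evaluation of eqn. 1."""
    P = len(x)
    W = cmath.exp(-2j * cmath.pi / P)
    return [sum(x[n] * W ** (n * k) for n in range(P)) for k in range(P)]


def rader_dft(x, g):
    """P-point DFT (P prime, g a primitive root of P) evaluated through eqn. 2."""
    P = len(x)
    W = cmath.exp(-2j * cmath.pi / P)
    X = [0j] * P
    X[0] = sum(x)
    for k in range(1, P):
        acc = 0j
        for n in range(1, P):
            x_index = pow(g, (P - 1) - n, P)           # <g^{-n}>_P
            w_exponent = pow(g, (k - n) % (P - 1), P)  # <g^{k-n}>_P
            acc += x[x_index] * W ** w_exponent
        X[pow(g, k, P)] = acc + x[0]                   # X(k) = X'(k) + x(0)
    return X


x_demo = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.5]
error = max(abs(a - b) for a, b in zip(dft(x_demo), rader_dft(x_demo, 3)))
print(error < 1e-9)
```

The inner loop is a circular convolution of length P - 1: both the input index and the exponent of W depend only on powers of g.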

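Eqn. 2 represents a backward circular convolution of (P - 1) points. In matrix notation:

$$\begin{bmatrix} X'(\langle g^{1} \rangle_P) \\ X'(\langle g^{2} \rangle_P) \\ \vdots \\ X'(\langle g^{P-1} \rangle_P) \end{bmatrix} = \begin{bmatrix} W^{\langle g^{0} \rangle_P} & W^{\langle g^{-1} \rangle_P} & \cdots & W^{\langle g^{-P+2} \rangle_P} \\ W^{\langle g^{1} \rangle_P} & W^{\langle g^{0} \rangle_P} & \cdots & W^{\langle g^{-P+3} \rangle_P} \\ \vdots & \vdots & & \vdots \\ W^{\langle g^{P-2} \rangle_P} & W^{\langle g^{P-3} \rangle_P} & \cdots & W^{\langle g^{0} \rangle_P} \end{bmatrix} \begin{bmatrix} x(\langle g^{-1} \rangle_P) \\ x(\langle g^{-2} \rangle_P) \\ \vdots \\ x(\langle g^{-P+1} \rangle_P) \end{bmatrix} \tag{3}$$

By making the substitution of -l = k - n, eqn. 2 can be rewritten as

$$X'(\langle g^k \rangle_P) = \sum_{l=1}^{P-1} x(\langle g^{-l-k} \rangle_P) W^{\langle g^{-l} \rangle_P} \tag{4}$$

In matrix notation:

$$\begin{bmatrix} X'(\langle g^{1} \rangle_P) \\ X'(\langle g^{2} \rangle_P) \\ \vdots \\ X'(\langle g^{P-1} \rangle_P) \end{bmatrix} = \begin{bmatrix} x(\langle g^{-2} \rangle_P) & x(\langle g^{-3} \rangle_P) & \cdots & x(\langle g^{-P} \rangle_P) \\ x(\langle g^{-3} \rangle_P) & x(\langle g^{-4} \rangle_P) & \cdots & x(\langle g^{-P-1} \rangle_P) \\ \vdots & \vdots & & \vdots \\ x(\langle g^{-P} \rangle_P) & x(\langle g^{-P-1} \rangle_P) & \cdots & x(\langle g^{-2P+2} \rangle_P) \end{bmatrix} \begin{bmatrix} W^{\langle g^{-1} \rangle_P} \\ W^{\langle g^{-2} \rangle_P} \\ \vdots \\ W^{\langle g^{-P+1} \rangle_P} \end{bmatrix} \tag{5}$$

To clarify the ideas of this Section, consider the following example with P = 7 and g = 3. Eqns. 1 and 2 become

$$X(k) = \sum_{n=0}^{7-1} x(n) W^{nk}$$

where

k = 0, 1, 2, ..., 6

and

$$X'(\langle 3^k \rangle_7) = \sum_{n=1}^{7-1} x(\langle 3^{-n} \rangle_7) W^{\langle 3^{k-n} \rangle_7}$$

The matrix representation of this equation can be written as

$$\begin{bmatrix} X'(3) \\ X'(2) \\ X'(6) \\ X'(4) \\ X'(5) \\ X'(1) \end{bmatrix} = \begin{bmatrix} W^1 & W^5 & W^4 & W^6 & W^2 & W^3 \\ W^3 & W^1 & W^5 & W^4 & W^6 & W^2 \\ W^2 & W^3 & W^1 & W^5 & W^4 & W^6 \\ W^6 & W^2 & W^3 & W^1 & W^5 & W^4 \\ W^4 & W^6 & W^2 & W^3 & W^1 & W^5 \\ W^5 & W^4 & W^6 & W^2 & W^3 & W^1 \end{bmatrix} \begin{bmatrix} x(5) \\ x(4) \\ x(6) \\ x(2) \\ x(3) \\ x(1) \end{bmatrix} \tag{6}$$

Applying eqn. 4, we can rewrite eqn. 6 as

$$\begin{bmatrix} X'(3) \\ X'(2) \\ X'(6) \\ X'(4) \\ X'(5) \\ X'(1) \end{bmatrix} = \begin{bmatrix} x(4) & x(6) & x(2) & x(3) & x(1) & x(5) \\ x(6) & x(2) & x(3) & x(1) & x(5) & x(4) \\ x(2) & x(3) & x(1) & x(5) & x(4) & x(6) \\ x(3) & x(1) & x(5) & x(4) & x(6) & x(2) \\ x(1) & x(5) & x(4) & x(6) & x(2) & x(3) \\ x(5) & x(4) & x(6) & x(2) & x(3) & x(1) \end{bmatrix} \begin{bmatrix} W^5 \\ W^4 \\ W^6 \\ W^2 \\ W^3 \\ W^1 \end{bmatrix} \tag{7}$$

Usually x(n) is a sampled input value of the signal to be processed. The implementation of eqn. 3 using distributed arithmetic is not practical since the x(n)s are variables and the row values $W^{\langle g^{k-n} \rangle_P}$ inside the square matrix vary from one row to another. However, eqn. 7 is much easier to implement. The row values inside the square matrix still vary from row to row, but the column matrix of $W^{\langle g^{-l} \rangle_P}$ is actually a fixed pattern if the transform length P is known. This fixed pattern may be stored in read-only memory (ROM) as a look up table for use in distributed arithmetic, as is discussed below.

Let $x(\langle g^{-l-k} \rangle_P)$ be represented by a 2s complement form and normalised to fractional numbers. $x(\langle g^{-l-k} \rangle_P)$ can be represented by

$$x(\langle g^{-l-k} \rangle_P) = -x(\langle g^{-l-k} \rangle_P)_0 + \sum_{m=1}^{M-1} x(\langle g^{-l-k} \rangle_P)_m 2^{-m}$$

where M = word length of the $x(\langle g^{-l-k} \rangle_P)$s, $x(\langle g^{-l-k} \rangle_P)_0$ is the sign bit for the 2s complement representation and $x(\langle g^{-l-k} \rangle_P)_m$ = 0 or 1. Thus eqn. 4 can be written as

$$X'(\langle g^k \rangle_P) = \sum_{l=1}^{P-1} \left( \sum_{m=1}^{M-1} x(\langle g^{-l-k} \rangle_P)_m 2^{-m} - x(\langle g^{-l-k} \rangle_P)_0 \right) W^{\langle g^{-l} \rangle_P}$$

$$= \sum_{m=1}^{M-1} \left( \sum_{l=1}^{P-1} x(\langle g^{-l-k} \rangle_P)_m W^{\langle g^{-l} \rangle_P} \right) 2^{-m} - \sum_{l=1}^{P-1} x(\langle g^{-l-k} \rangle_P)_0 W^{\langle g^{-l} \rangle_P} \tag{8}$$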

This is a binary formulation of the sampled input values and is a standard form for the distributed arithmetic. The most significant bits of all the $x(\langle g^{-l-k} \rangle_P)$s multiply the corresponding $W^{\langle g^{-l} \rangle_P}$s, the second most significant bits of all the $x(\langle g^{-l-k} \rangle_P)$s multiply the same set of $W^{\langle g^{-l} \rangle_P}$s and so on. The term $2^{-m}$ in eqn. 8 simply means that the sum register must be shifted before each addition. Note that if the term

$$\sum_{l=1}^{P-1} x(\langle g^{-l-k} \rangle_P)_m W^{\langle g^{-l} \rangle_P}$$

can be precalculated, the computation necessary to find each $X'(\langle g^k \rangle_P)$ is M - 2 additions and one subtraction. In the next Section, we are going to introduce a method to find these sum-of-product terms and describe a method of simplifying the procedure.
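Eqn. 8 can be exercised in software to see the shift-and-add structure. The following Python sketch is an illustration only, assuming M-bit 2s complement fractions in [-1, 1) and floating-point sums in place of ROM look ups; the names `to_bits`, `from_bits` and `da_xprime` are hypothetical.

```python
import cmath


def to_bits(value, M):
    """2s complement bits [b0 (sign), b1, ..., b_{M-1}] of a fraction in [-1, 1)."""
    q = round(value * 2 ** (M - 1))      # integer in [-2^(M-1), 2^(M-1) - 1]
    q &= (1 << M) - 1                    # wrap negative values into M bits
    return [(q >> (M - 1 - m)) & 1 for m in range(M)]


def from_bits(bits):
    """Inverse of to_bits: -b0 + sum of b_m * 2^-m."""
    return -bits[0] + sum(b * 2.0 ** -m for m, b in enumerate(bits) if m > 0)


def da_xprime(x, g, k, M=8):
    """X'(<g^k>_P) from eqn. 8: one sum of products per bit plane, then shift and add."""
    P = len(x)
    W = cmath.exp(-2j * cmath.pi / P)
    # Fixed coefficient column W^{<g^-l>_P}, l = 1..P-1.
    w = [W ** pow(g, (P - 1) - l, P) for l in range(1, P)]
    # Bits of the reordered inputs x(<g^{-l-k}>_P), l = 1..P-1.
    xb = [to_bits(x[pow(g, (-l - k) % (P - 1), P)], M) for l in range(1, P)]

    def plane(m):
        # In hardware this sum would be a single look up addressed by the m-th bits.
        return sum(b[m] * wl for b, wl in zip(xb, w))

    acc = 0j
    for m in range(M - 1, 0, -1):        # least significant bit plane first
        acc = (acc + plane(m)) / 2       # halving realises the 2^-m weights
    return acc - plane(0)                # the sign-bit plane is subtracted


x_demo = [0.5, -0.25, 0.125, 0.75, -0.5, 0.0, 0.25]
print(from_bits(to_bits(-0.25, 8)))      # -0.25
print(da_xprime(x_demo, 3, 1))           # X'(3) for P = 7, g = 3
```

For M = 8 the loop performs seven bit-plane accesses followed by one for the sign plane, which matches the count of M - 2 additions and one subtraction once the first access simply loads the register.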

3 Implementation techniques

As g and P are relatively prime, Euler's theorem says

$$\langle g^{\phi(P)} \rangle_P = 1$$

If

$$\langle g^{P-1} \rangle_P = 1 \tag{9}$$

then

$$\langle g^{(P-1)/2} \rangle_P = \pm 1 \tag{10}$$

$\langle g^{(P-1)/2} \rangle_P = 1$ is contradictory with g being a primitive root, because if g is a primitive root then g is of order $\phi(P) = P - 1$, that is P - 1 is the least integer such that

$$\langle g^{P-1} \rangle_P = 1$$

Therefore,

$$\langle g^{(P-1)/2} \rangle_P = -1 \tag{11}$$

This equation is particularly useful for the implementation of eqn. 8.
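The half-period property of a primitive root is easy to confirm for the two primes used in this paper. The short Python check below is illustrative; `half_power` is a hypothetical helper name.

```python
def half_power(g, p):
    """Return <g^((p-1)/2)>_p for an odd prime p."""
    return pow(g, (p - 1) // 2, p)


print(half_power(3, 7))    # 6, i.e. -1 modulo 7
print(half_power(2, 61))   # 60, i.e. -1 modulo 61
print(half_power(2, 7))    # 1: 2 is not a primitive root of 7
```

The third line shows the contrary case: when g is not a primitive root the half power can equal +1.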

Recalling the symmetry of the real and the antisymmetry of the imaginary components of $W^{nk}$ about the horizontal axis one may show from eqn. 1 that, for real input data x(n),

$$X(k) = X^*(P - k) \tag{12}$$

where k = 0, 1, ..., (P - 1)/2 and * denotes the complex conjugate. Eqn. 12 shows that for the implementation of eqn. 1 only 'half plus one' of the X(k) terms need to be found. The other half can be found by making use of eqn. 12. We may either find all the X(k)s or all the X(P - k)s, for k = 0, 1, ..., (P - 1)/2. However, the actual guideline for the implementation is that we simply have to find either X(k) or X(P - k) for each conjugate pair [X(k), X(P - k)].

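Substituting eqns. 9 and 11 into the first column matrix of eqn. 5, we obtain

$$\begin{bmatrix} X'(\langle g^{1} \rangle_P) \\ \vdots \\ X'(\langle g^{(P-1)/2-1} \rangle_P) \\ X'(\langle g^{(P-1)/2} \rangle_P) \\ X'(\langle g^{(P-1)/2+1} \rangle_P) \\ X'(\langle g^{(P-1)/2+2} \rangle_P) \\ \vdots \\ X'(\langle g^{P-1} \rangle_P) \end{bmatrix} = \begin{bmatrix} X'(\langle g^{1} \rangle_P) \\ \vdots \\ X'(\langle g^{(P-1)/2-1} \rangle_P) \\ X'(-1) \\ X'(-\langle g^{1} \rangle_P) \\ X'(-\langle g^{2} \rangle_P) \\ \vdots \\ X'(-\langle g^{(P-1)/2} \rangle_P) \end{bmatrix} = \begin{bmatrix} X'(\langle g^{1} \rangle_P) \\ \vdots \\ X'(\langle g^{(P-1)/2-1} \rangle_P) \\ X'(-1) \\ X'^*(\langle g^{1} \rangle_P) \\ X'^*(\langle g^{2} \rangle_P) \\ \vdots \\ X'^*(\langle g^{(P-1)/2} \rangle_P) \end{bmatrix} \tag{13}$$

The last column matrix of eqn. 13 is obtained by noting the property of periodicity of the DFT. To make the point more clear, consider the case P = 7. The first column of eqn. 7 can be written as

$$\begin{bmatrix} X'(3) \\ X'(2) \\ X'(6) \\ X'(4) \\ X'(5) \\ X'(1) \end{bmatrix} = \begin{bmatrix} X'(3) \\ X'(2) \\ X'(6) \\ X'^*(3) \\ X'^*(2) \\ X'^*(6) \end{bmatrix}$$

This shows that only half of the terms need to be found for the implementation of eqns. 5 and 8.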
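The saving from conjugate symmetry can be sketched as follows. This Python fragment is a minimal illustration assuming real input data; `xprime` and `xprime_all_from_half` are hypothetical names, and the direct sum stands in for whatever fast method computes the first half.

```python
import cmath


def xprime(x, k):
    """Direct evaluation of X'(k) = sum over n = 1..P-1 of x(n) W^(nk)."""
    P = len(x)
    W = cmath.exp(-2j * cmath.pi / P)
    return sum(x[n] * W ** (n * k) for n in range(1, P))


def xprime_all_from_half(x, g):
    """Compute X'(<g^k>_P) for k = 1..(P-1)/2 and fill the rest by conjugation."""
    P = len(x)
    half = (P - 1) // 2
    out = {}
    for k in range(1, half + 1):
        idx = pow(g, k, P)
        out[idx] = xprime(x, idx)
        # <g^(k + (P-1)/2)>_P = P - <g^k>_P because <g^((P-1)/2)>_P = -1.
        out[P - idx] = out[idx].conjugate()
    return out


x_demo = [0.3, -1.2, 0.7, 2.0, -0.4, 0.9, 1.1]
result = xprime_all_from_half(x_demo, 3)
print(sorted(result))   # [1, 2, 3, 4, 5, 6]
```

For P = 7 and g = 3 the directly computed terms are X'(3), X'(2) and X'(6); the other three follow by conjugation.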

In order to make a fast implementation of eqn. 8, an efficient method has to be used to calculate

$$X''_m = \sum_{l=1}^{P-1} x(\langle g^{-l-k} \rangle_P)_m W^{\langle g^{-l} \rangle_P}, \qquad m = 0, 1, \ldots, (M-1) \tag{14}$$

Each summation over l involves the multiplication of the (P - 1) bits formed from the mth bits of all $x(\langle g^{-l-k} \rangle_P)$s by the $W^{\langle g^{-l} \rangle_P}$s of B bits in length. Since the length of each input string from the $x(\langle g^{-l-k} \rangle_P)_m$s is (P - 1) bits, there are $2^{P-1}$ possible input patterns. Hence there are $2^{P-1}$ possible values for $X''_m$. For ROM implementation of $X''_m$, a capacity of $2^{P-1}$ words is required for the look up values for a length-P DFT. The (P - 1) bits from the $x(\langle g^{-l-k} \rangle_P)_m$s form $2^{P-1}$ possible addresses to map $2^{P-1}$ outputs from ROMs. Note also that the word length of the $W^{\langle g^{-l} \rangle_P}$s can be as long as desired. However we have to round off the $X''_m$s to form ROM tables of proper word length.
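A full look up table of this kind can be generated for a small prime to make the address-to-value mapping concrete. The Python sketch below is illustrative: the bit ordering of the address (bit i selects l = i + 1) is an assumption, `build_rom` is a hypothetical name, and values are kept in floating point rather than rounded to a ROM word length.

```python
import cmath


def build_rom(P, g):
    """Full table of 2^(P-1) words: address bit i (i = 0..P-2) selects W^{<g^-(i+1)>_P}."""
    W = cmath.exp(-2j * cmath.pi / P)
    w = [W ** pow(g, (P - 1) - l, P) for l in range(1, P)]
    rom = []
    for addr in range(2 ** (P - 1)):
        # Sum the coefficients whose address bit is set.
        rom.append(sum(w[i] for i in range(P - 1) if (addr >> i) & 1))
    return rom


rom = build_rom(7, 3)
print(len(rom))          # 64 words for P = 7
print(abs(rom[63] + 1))  # all bits set: the six nontrivial 7th roots of unity sum to -1
```

The table size doubles with every extra point, which is why a full table is only practical for very short transforms.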

The number $2^{P-1}$ becomes excessively large for larger (P - 1). This problem may be solved by partitioning the (P - 1) input bit patterns into blocks. Hence,

$$X''_m = \sum_{i=0}^{(P-1)/D-1} \sum_{d=1}^{D} x(\langle g^{-(iD+d)-k} \rangle_P)_m W^{\langle g^{-(iD+d)} \rangle_P}$$

where D is the number of bits per block. Software could be used to add up the partial sum of products of blocks in a serial basis. However, hardware adders may be used to maintain one access to ROMs for each $X''_m$. A three-block implementation scheme is illustrated in Fig. 1.

Fig. 1 Three-block implementation scheme (a shift register bank supplies P - 1 bits, divided into three groups of (P - 1)/3 bits that address ROM 1, ROM 2 and ROM 3; two adders combine the one-word ROM outputs)

It can be seen that some delay problems may be caused by the adders. These problems can be solved without difficulty by using bipolar adders and high speed ROMs (or PROMs), as the access time of a high speed ROM (PROM) and the propagation delay time of a bipolar adder are usually ten or more times shorter than one machine cycle of a microprocessor. Note that if we use $2^{P-1}$ words of ROM, no hardware adder is required. On the other hand, the minimum ROM size for the implementation could just be (P - 1) words, but, in this case, the number of adders necessary would be (P - 2). In practice a trade-off has to be made between ROM size, the number of adders used, the ROM access time and the adder delay time.
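The trade-off between ROM size and adders can be illustrated by splitting the coefficient column into blocks. In this Python sketch the names `coefficients`, `build_block_roms` and `lookup` are hypothetical, the additions in `lookup` play the role of the hardware adders, and the last block is allowed to be shorter when D does not divide P - 1 (an assumption beyond the text).

```python
import cmath


def coefficients(P, g):
    """W^{<g^-l>_P} for l = 1..P-1, the fixed coefficient column."""
    W = cmath.exp(-2j * cmath.pi / P)
    return [W ** pow(g, (P - 1) - l, P) for l in range(1, P)]


def build_block_roms(P, g, D):
    """One small ROM per block of D coefficients; each ROM has 2^D words."""
    w = coefficients(P, g)
    roms = []
    for start in range(0, P - 1, D):
        block = w[start:start + D]
        roms.append([sum(c for j, c in enumerate(block) if (addr >> j) & 1)
                     for addr in range(2 ** len(block))])
    return roms


def lookup(roms, bits, D):
    """Add the partial sums addressed by each D-bit slice of the input bit pattern."""
    total = 0
    for i, rom in enumerate(roms):
        chunk = bits[i * D:(i + 1) * D]
        addr = sum(b << j for j, b in enumerate(chunk))
        total += rom[addr]               # one adder stage per extra block
    return total


roms = build_block_roms(7, 3, 2)
print(sum(len(r) for r in roms))         # 12 words instead of 64
bits = [1, 0, 1, 1, 0, 1]
print(lookup(roms, bits, 2))
```

With P = 7 and D = 2 the three blocks need 12 words and two additions per access, compared with 64 words and no additions for the full table.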

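The ROM size for the implementation can further be reduced by noting that the last column matrix of eqn. 5 is similar to the first column matrix of the same equation. The lower half of the matrix involving the $W^{\langle g^{-l} \rangle_P}$s can be written as the complex conjugate of the upper half. From eqn. 11, for l = 1, 2, ..., (P - 1)/2,

$$W^{\langle g^{-l-(P-1)/2} \rangle_P} = W^{-\langle g^{-l} \rangle_P} \tag{15}$$

or

$$W^{\langle g^{-l-(P-1)/2} \rangle_P} = \left( W^{\langle g^{-l} \rangle_P} \right)^* \tag{16}$$

Thus, eqn. 7 can be written as

$$\begin{bmatrix} X'(3) \\ X'(2) \\ X'(6) \\ X'^*(3) \\ X'^*(2) \\ X'^*(6) \end{bmatrix} = \begin{bmatrix} x(4) & x(6) & x(2) & x(3) & x(1) & x(5) \\ x(6) & x(2) & x(3) & x(1) & x(5) & x(4) \\ x(2) & x(3) & x(1) & x(5) & x(4) & x(6) \\ x(3) & x(1) & x(5) & x(4) & x(6) & x(2) \\ x(1) & x(5) & x(4) & x(6) & x(2) & x(3) \\ x(5) & x(4) & x(6) & x(2) & x(3) & x(1) \end{bmatrix} \begin{bmatrix} W^5 \\ W^4 \\ W^6 \\ W^{5*} \\ W^{4*} \\ W^{6*} \end{bmatrix} \tag{17}$$

The equation to be implemented is

$$\begin{bmatrix} X'(3, 0) \\ X'(2, 0) \\ X'(6, 0) \\ X'(3, 1) \\ X'(2, 1) \\ X'(6, 1) \end{bmatrix} = \begin{bmatrix} x(4) & x(6) & x(2) \\ x(6) & x(2) & x(3) \\ x(2) & x(3) & x(1) \\ x(3) & x(1) & x(5) \\ x(1) & x(5) & x(4) \\ x(5) & x(4) & x(6) \end{bmatrix} \begin{bmatrix} W^5 \\ W^4 \\ W^6 \end{bmatrix} \tag{18}$$

where

X'(3) = X'(3, 0) + X'*(3, 1)

X'(6) = X'(6, 0) + X'*(6, 1)

X'(2) = X'(2, 0) + X'*(2, 1)

This shows that only X'(3), X'(2) and X'(6) need to be calculated directly and a table involving $W^5$, $W^4$ and $W^6$ is sufficient for the implementation.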
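The half-table scheme can be tried out numerically. The Python sketch below is a minimal illustration assuming real input data; `half_table_xprime` and its inner helper `partial` are hypothetical names, and only the first (P - 1)/2 coefficients are ever formed.

```python
import cmath


def half_table_xprime(x, g):
    """All X'(k), k = 1..P-1, using only the coefficients W^{<g^-l>_P}, l = 1..(P-1)/2."""
    P = len(x)
    half = (P - 1) // 2
    W = cmath.exp(-2j * cmath.pi / P)
    w = [W ** pow(g, (P - 1) - l, P) for l in range(1, half + 1)]

    def partial(k):
        # Row k of the reduced matrix: x(<g^{-l-k}>_P) against the half table.
        return sum(x[pow(g, (-l - k) % (P - 1), P)] * w[l - 1]
                   for l in range(1, half + 1))

    out = {}
    for k in range(1, half + 1):
        # X'(<g^k>_P) = X'(<g^k>_P, 0) + conj(X'(<g^k>_P, 1))
        value = partial(k) + partial(k + half).conjugate()
        idx = pow(g, k, P)
        out[idx] = value
        out[P - idx] = value.conjugate()
    return out


x_demo = [0.3, -1.2, 0.7, 2.0, -0.4, 0.9, 1.1]
print(half_table_xprime(x_demo, 3)[3])   # X'(3) for P = 7, g = 3
```

For P = 7 the table holds W^5, W^4 and W^6 only, and the six partial sums are taken from consecutive shifts of the same reordered input sequence.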

4 Microprocessor-based implementation

Table 1 in Appendix 8 shows the sequence {$g^0, g^1, g^2, \ldots, g^{P-1}$} generated by P = 61 with a primitive root of g = 2. The reader may check that the entries in this Table agree with eqn. 16. Note also the symmetry and antisymmetry of the real and imaginary components of the Ws about the mid-point of the Table. With reference to eqn. 15, the data that needs to be employed to form the ROM table runs from N = 59 to N = 30. Fig. 2 shows the block diagram for the implementation of this 61-point DFT using the 8-bit 6800 microprocessor. The $W^{\langle g^{-l} \rangle_P}$s are partitioned into four blocks as shown, hence a total of 1.5 K x 8-bit ROM is required. Three adders are used to add up the partial sums and for each access either all the ROMs containing the real parts or all the ROMs containing the imaginary parts need to be activated. Data of 8-bit length are sent from the MPU and converted to serial form by the parallel-to-serial converter chip. A register bank of 29 eight-bit serial registers and the parallel-to-serial converter chip are used to provide 30 input bits for the look up tables. The 256-bit shift register is used to shift back the data for the lower half of the operation. The tri-state buffer is used to isolate results of adder 3 from the data bus when the MPU is not accessing the partial sum. In order to save execution time, the shift register bank is shifted to the right after each access to the sum of adder 3 without using an extra instruction. This can be done by simultaneously reading the partial sum from adder 3 and starting to shift the register bank to the right.

The initial 30 bytes of data can be loaded into the 29 eight-bit shift registers by using a dummy read from adder 3 or by using 29 eight-bit shift registers with parallel input control. The latter method is faster than the former, but some extra wiring is necessary in this case.

Since hardware adders are used to add up the partial sums from the look up tables, subtraction is time consuming and difficult. Therefore, it is desirable to add a positive bias to all the $X''_m$s of eqn. 14 so as to make them all positive numbers. In order to reduce the ROM size (see eqn. 18), $X''_m$ may be modified:

$$X''_m = \sum_{l=1}^{(P-1)/2} x(\langle g^{-l-k-h(P-1)/2} \rangle_P)_m W^{\langle g^{-l} \rangle_P}, \qquad m = 0, 1, \ldots, (M-1) \tag{19}$$

In our design, all the $X''_m$s are scaled to a fractional number. We divide these values by 2 and then add 0.5 to shift the zero point to 0.5. That is

$$X''_m \rightarrow 0.5 + 0.5 X''_m$$

Making this substitution and from eqns. 19 and 8, we obtain

$$X'(\langle g^k \rangle_P, h) = 2 \left( \sum_{m=1}^{M-1} X''_m 2^{-m} - X''_0 \right) + 2^{-(M-1)}, \qquad \text{for } h = 0, 1 \tag{20}$$

where the $X''_m$s on the right-hand side are the biased values stored in the ROMs. The last term in eqn. 20, $2^{-(M-1)}$, is small in value and may sometimes be neglected. However, if better accuracy is required, the $2^{-(M-1)}$ term can be added back to the result register.
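The origin of the small residual term can be traced step by step. The derivation below is a sketch; the symbol $\hat{X}''_m$ for the biased ROM value is introduced here for clarity and is not the paper's notation.

```latex
\begin{align*}
\hat{X}''_m &= 0.5 + 0.5\,X''_m
  \quad\Longrightarrow\quad X''_m = 2\hat{X}''_m - 1 \\[4pt]
\sum_{m=1}^{M-1} X''_m\,2^{-m} - X''_0
  &= 2\Bigl(\sum_{m=1}^{M-1} \hat{X}''_m\,2^{-m} - \hat{X}''_0\Bigr)
     - \sum_{m=1}^{M-1} 2^{-m} + 1 \\[4pt]
  &= 2\Bigl(\sum_{m=1}^{M-1} \hat{X}''_m\,2^{-m} - \hat{X}''_0\Bigr)
     + 2^{-(M-1)}
\end{align*}
```

The last step uses $\sum_{m=1}^{M-1} 2^{-m} = 1 - 2^{-(M-1)}$. For M = 8 the residual is $2^{-7} \approx 0.0078$, which is the small term that may be neglected or added back to the result register.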

To calculate each $X'(\langle g^k \rangle_P, h)$ in eqn. 20 eight accesses to the look up table are required for M = 8. The 6800 assembly language program for calculating $X'(\langle g^k \rangle_P, h)$ is

    * FOR GETTING 8-BIT RESULT
    *
            LDAA  RSLT    LOAD THE LEAST SIGNIF. BYTE
            LSRA          HALVE; SHIFTED-OUT BIT GOES TO CARRY
            ADDA  RSLT    2ND BYTE
            RORA          HALVE SUM; ADD CARRY ENTERS BIT 7
            ADDA  RSLT    3RD BYTE
            RORA
            ADDA  RSLT    4TH BYTE
            RORA
            ADDA  RSLT    5TH BYTE
            RORA
            ADDA  RSLT    6TH BYTE
            RORA
            ADDA  RSLT    7TH BYTE
            RORA
            ADCA  #$0     ROUND: ADD LAST SHIFTED-OUT BIT
            SUBA  RSLT    8TH BYTE (SIGN-BIT TERM); RESULT STORED IN A
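The behaviour of this shift-and-add sequence can be followed with a small software model. The Python sketch below is an illustration, not part of the original design: it assumes the usual 6800 semantics for LSRA, ADDA, RORA, ADCA and SUBA, treats successive reads of RSLT as successive unsigned bytes, and uses the hypothetical names `emulate_result` and `as_signed`.

```python
def emulate_result(rslt):
    """Model accumulator A for the listed program.

    rslt: eight unsigned bytes in the order the program reads RSLT
    (least significant bit plane first, sign-bit plane last).
    """
    assert len(rslt) == 8 and all(0 <= b <= 0xFF for b in rslt)
    a = rslt[0]                      # LDAA RSLT
    carry = a & 1                    # LSRA: bit 0 -> carry, 0 -> bit 7
    a >>= 1
    for byte in rslt[1:7]:
        total = a + byte             # ADDA RSLT
        carry = total >> 8
        a = total & 0xFF
        carry_out = a & 1            # RORA: carry -> bit 7, bit 0 -> carry
        a = (carry << 7) | (a >> 1)
        carry = carry_out
    a = (a + carry) & 0xFF           # ADCA #$0: round with the last bit shifted out
    a = (a - rslt[7]) & 0xFF         # SUBA RSLT: subtract the sign-bit plane
    return a


def as_signed(byte):
    """Interpret an 8-bit value as a 2s complement integer."""
    return byte - 256 if byte >= 128 else byte


rslt = [200, 10, 255, 3, 128, 77, 90, 100]
exact = sum(b * 2.0 ** -(7 - i) for i, b in enumerate(rslt[:7])) - rslt[7]
print(as_signed(emulate_result(rslt)), exact)   # -10 -9.875
```

The carry produced by each addition is rotated back into bit 7, so the intermediate nine-bit sums are not lost, and the final ADCA rounds the result to the nearest integer step.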

This program segment shows that it takes 41 clock cycles (= 20.5 µs with a 2 MHz clock) to find each $X'(\langle g^k \rangle_{61}, h)$.

Fig. 2 Block diagram for microprocessor-based implementation of 61-point DFT (labels recovered from the figure: 256-bit shift register; data bus of microprocessor)

The total number of accesses to this program segment to find all the $X'(\langle g^k \rangle_{61}, h)$s, where h = 0 or 1, is 60 for both the real and imaginary coefficients. This implies that the total time spent on data accessing is 2.4 ms. However some additional time has also to be spent on: (i) rearranging the input and output data, (ii) loading and storing the data, (iii) complementing $X'(\langle g^k \rangle_{61}, 1)$ and calculating

$$X'(\langle g^k \rangle_{61}) = X'(\langle g^k \rangle_{61}, 0) + X'^*(\langle g^k \rangle_{61}, 1)$$

The time required for these parts depends mainly on whether or not we use program loops and this time varies between 1.2 ms and 1.8 ms.

5 Conclusion

We have seen that by reordering the matrix equation, the prime-number-based discrete Fourier transform in convolution form can be calculated using distributed arithmetic. This formulation is suitable for microprocessor-based implementations of the DFT. The total computation time for a 61-point DFT using a 2 MHz 6800 microprocessor was found to be only 3.6 ms. The computation time increases in direct proportion to the number of points (N) in the DFT. This is obviously better than in the case of the FFT, for which the computation time is proportional to $\tfrac{1}{2} N \log_2 N$.

The present design can, of course, be applied to all DFTs with prime sequence lengths. However, it is most efficient for the computation of DFTs with relatively short sequence lengths N. The amount of additional hardware required increases with N (= P). Hence, if the sequence length is a very large number, the hardware design may be too large for a direct implementation using this technique. Note also that this technique cannot be directly applied to DFTs with composite sequence lengths. However, one may use a two-dimensional formulation [10] of the DFT equations for long and composite sequence lengths. The first dimension can be found by the present technique whereas the second dimension can be found by the prime factor algorithm [7] or Winograd's Fourier transform algorithm [8]. For example, a length 122-point DFT can be computed by a 2 x 61-point two-dimensional DFT. The whole computation requires just two 61-point DFTs together with 122 extra additions.
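The 2 x 61 decomposition can be sketched with a prime-factor index mapping. The Python code below is an illustration under stated assumptions: it uses the standard index maps for coprime factors (which may differ in detail from the formulation of Reference 10), a direct DFT stands in for the distributed-arithmetic 61-point transform, and the names `dft` and `dft_2xP` are hypothetical.

```python
import cmath


def dft(x):
    """Direct DFT, used here both as the P-point building block and as a reference."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * W ** ((n * k) % N) for n in range(N)) for k in range(N)]


def dft_2xP(x, P=61):
    """Length-2P DFT (P odd) from two P-point DFTs plus 2P additions."""
    N = 2 * P
    assert len(x) == N and P % 2 == 1
    # Input map: n = (P*n1 + 2*n2) mod N, with n1 in {0, 1} and n2 in 0..P-1.
    rows = [[x[(P * n1 + 2 * n2) % N] for n2 in range(P)] for n1 in range(2)]
    A, B = dft(rows[0]), dft(rows[1])        # the two P-point DFTs
    inv2 = (P + 1) // 2                      # inverse of 2 modulo P
    X = [0j] * N
    for k2 in range(P):
        for k1 in range(2):
            # Output map: k is congruent to k1 modulo 2 and to k2 modulo P.
            k = (P * k1 + 2 * inv2 * k2) % N
            # The 2-point DFT needs one addition and one subtraction per k2.
            X[k] = A[k2] + B[k2] if k1 == 0 else A[k2] - B[k2]
    return X


x_demo = [((7 * n) % 13) / 13 - 0.5 for n in range(122)]
error = max(abs(a - b) for a, b in zip(dft_2xP(x_demo), dft(x_demo)))
print(error < 1e-6)
```

No twiddle multiplications appear between the two stages because 2 and 61 are coprime; the only extra work is the 2 x 61 = 122 complex additions of the 2-point transforms.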

Although the 6800 microprocessor was used for this implementation, this technique is also suitable for other microprocessors. We suggest that this method should also be suitable for the hardware implementation of the DFT and this could form the basis for the design of a new single-chip digital signal processor.

6 Acknowledgment

The authors wish to thank John Morris of the Imperial College of Science & Technology for his comments and assistance in preparing this paper and to thank the referees for their helpful comments and recommendations.

7 References

1 COOLEY, J.W., and TUKEY, J.W.: 'An algorithm for the machine calculation of complex Fourier series', Math. Comput., 1965, 19, pp. 297-301

2 PELED, A., and LIU, B.: 'Digital signal processing: theory, design and implementation' (John Wiley & Sons, 1976)

3 KOBYLINSKI, R.A., STIGALL, P.D., and ZIEMER, R.E.: 'A microcomputer-based data acquisition system with hardware capabilities to calculate a fast Fourier transform', IEEE Trans., 1979, ASSP-27, pp. 202-203

4 WALLINGFORD, E.E., and COLLINS, W.R.: 'A dynamic electroencephalogram frequency analyzer', ibid., 1978, IM-27, pp. 70-73

5 LUK, W.K., and LI, H.F.: 'Microcomputer-based real-time online f.f.t. processor', IEE Proc. E, Comput. & Digital Tech., 1980, 127, (1), pp. 18-23

6 RADER, C.M.: 'Discrete Fourier transforms when the number of data samples is prime', Proc. IEEE, 1968, 56, pp. 1107-8

7 KOLBA, D.P., and PARKS, T.W.: 'A prime factor FFT algorithm using high-speed convolution', IEEE Trans., 1977, ASSP-25, pp. 281-294

8 WINOGRAD, S.: 'On computing the discrete Fourier transform', Math. Comput., 1978, 32, pp. 175-99

9 BURRUS, C.S.: 'Digital filter structure described by distributed arithmetic', IEEE Trans., 1977, CAS-24, pp. 674-80

10 BURRUS, C.S.: 'Index mapping for multi-dimensional formulation of the DFT and convolution', IEEE Trans., 1977, ASSP-25, pp. 239-242

8 Appendix

Table 1: Sequence {$g^0, g^1, g^2, \ldots, g^{P-1}$} generated by P = 61 with a primitive root of g = 2

     N    K = <2^N>_61    Real (T)       Imag. (T)
     0     0               1              0
     1     2               0.9788557     -0.2045521
     2     4               0.9163169     -0.4004539
     3     8               0.6792733     -0.7338854
     4    16              -0.07717542    -0.9970175
     5    32              -0.9880381      0.1539005
     6     3               0.9526354     -0.3041148
     7     6               0.8150283     -0.5794211
     8    12               0.3285424     -0.9444892
     9    24              -0.7841162     -0.6206101
    10    48               0.2296877      0.9732644
    11    35              -0.8944762      0.4470957
    12     9               0.6002143     -0.7998392
    13    18              -0.2794855     -0.9601499
    14    36              -0.8437693      0.5366973
    15    11               0.4239144     -0.9057023
    16    22              -0.6405921     -0.7678806
    17    44              -0.1792807      0.983798
    18    27              -0.9356984     -0.3527555
    19    54               0.7511319      0.6601521
    20    47               0.1283983      0.9917227
    21    33              -0.9669972      0.254677
    22     5               0.8702852     -0.4925481
    23    10               0.5147928     -0.8573146
    24    20              -0.4699764     -0.8826788
    25    40              -0.5582432      0.8296771
    26    19              -0.3767277     -0.926324
    27    38              -0.7161502      0.6979445
    28    15               0.02574793    -0.9996685
    29    30              -0.9985942     -0.05149529
    30    60               0.9946999      0.102821
    31    59               0.9788557      0.2045521
    32    57               0.9163169      0.4004539
    33    53               0.6792733      0.7338854
    34    45              -0.07717546     0.9970175
    35    29              -0.9880381     -0.1539006
    36    58               0.9526354      0.3041149
    37    55               0.8150283      0.5794211
    38    49               0.3285424      0.9444892
    39    37              -0.7841162      0.6206101
    40    13               0.2296878     -0.9732644
    41    26              -0.8944762     -0.4470958
    42    52               0.6002143      0.7998393
    43    43              -0.2794856      0.9601499
    44    25              -0.8437692     -0.5366973
    45    50               0.4239144      0.9057023
    46    39              -0.6405921      0.7678806
    47    17              -0.1792807     -0.983798
    48    34              -0.9356984      0.3527555
    49     7               0.7511319     -0.6601521
    50    14               0.1283984     -0.9917227
    51    28              -0.9669972     -0.2546771
    52    56               0.8702852      0.4925481
    53    51               0.5147928      0.8573146
    54    41              -0.4699765      0.8826788
    55    21              -0.5582431     -0.8296771
    56    42              -0.3767278      0.926324
    57    23              -0.7161502     -0.6979445
    58    46               0.02574789     0.9996635
    59    31              -0.9985943      0.05149527
    60     1               0.9946999     -0.102821

Base = 61, primitive root = 2, $T = W^{\langle 2^N \rangle_{61}}$
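Entries of Table 1 can be regenerated from the definition of T. The Python sketch below is illustrative; `table_row` is a hypothetical helper, and double-precision arithmetic is used, so the last printed digit may differ slightly from the tabulated values. The N = 0 row is printed in the Table with K = 0 and T = 1, which this formula does not reproduce.

```python
import cmath


def table_row(N, P=61, g=2):
    """Return (K, Re T, Im T) with K = <g^N>_P and T = W^K, W = exp(-j*2*pi/P)."""
    K = pow(g, N, P)
    T = cmath.exp(-2j * cmath.pi * K / P)
    return K, T.real, T.imag


for N in (1, 30, 60):
    K, re, im = table_row(N)
    print(N, K, round(re, 7), round(im, 7))
```

Rows N and N + 30 are complex conjugates of each other, which is the symmetry about the mid-point of the Table used to halve the ROM contents.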

Chih-fan Chen received the B.S. degree from Pei-yang University, China, the M.S. degree from the University of Pennsylvania, USA and the Ph.D. degree from Cambridge University, UK. He was Professor of Electrical Engineering of the University of Houston from 1966 to 1977 and has been Professor and Chairman of the Electronics Department of the Chinese University of Hong Kong since 1977. During the summers of 1981-1983 he was visiting Professor of Boston University, and visiting scientist at MIT of Cambridge, Ma., USA in the summers of 1976, 1977 and 1983. He is a Fellow of the IEE.

Wan-chi Siu received the Associateship of the Hong Kong Polytechnic in Electronic Engineering in 1975 and the M.Phil. degree in Electronics from the Chinese University of Hong Kong in 1977. From 1975 to 1980 he taught, and subsequently became an Electronic Engineer, in the Department of Electronics of the Chinese University of Hong Kong. Since 1980 he has been with the Department of Electronic Engineering of the Hong Kong Polytechnic as a lecturer. He is now on leave from the Hong Kong Polytechnic and with the Department of Electrical Engineering, Imperial College of Science and Technology, England. His research interests are in transform techniques, hardware and software implementations of digital signal processors, microprocessor architectures and fabrication technology. He is a Chartered Engineer and a member of the IERE and the IEEE.
