Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)
Algorithms for Discrete Fourier Transform and Convolution
Second Edition
Springer New York Berlin Heidelberg Barcelona Budapest Hong Kong London Milan Paris Santa Clara Singapore Tokyo
Signal Processing and Digital Filtering
Synthetic Aperture Radar J.P. Fitch
Multiplicative Complexity, Convolution and the DFT M.T. Heideman
Array Signal Processing S.U. Pillai
Maximum Likelihood Deconvolution J.M. Mendel
Algorithms for Discrete Fourier Transform and Convolution Second Edition T. Tolimieri, M. An, and C. Lu
Algebraic Methods for Signal Processing and Communications Coding R.E. Blahut
Electromagnetic Devices for Motion Control and Signal Processing Y.M. Pulyer
Mathematics of Multidimensional Fourier Transform Algorithms Second Edition R. Tolimieri, M. An, and C. Lu
Lectures on Discrete Time Filtering R.S. Bucy
Distributed Detection and Data Fusion P.K. Varshney
Richard Tolimieri Myoung An Chao Lu
Algorithms for Discrete Fourier Transform
and Convolution Second Edition
C.S. Burrus, Consulting Editor
Springer
Richard Tolimieri, Department of Electrical Engineering, City College of CUNY, New York, NY 10037, USA
Myoung An, A.J. Devaney Associates, 52 Ashford Street, Allston, MA 02134, USA
Chao Lu, Department of Computer and Information Sciences, Towson State University, Towson, MD 21204, USA
Consulting Editor Signal Processing and Digital Filtering
C.S. Burrus, Professor and Chairman, Department of Electrical and Computer Engineering, Rice University, Houston, TX 77251-1892, USA
Library of Congress Cataloging-in-Publication Data Tolimieri, Richard, 1940-
Algorithms for discrete Fourier transform and convolution / Richard Tolimieri, Myoung An, Chao Lu.
p. cm. — (Signal processing and digital filtering) Includes bibliographical references (p. — ) and index. ISBN 0-387-98261-2 (alk. paper) 1. Fourier transformations—Data processing. 2. Convolutions (Mathematics)—Data processing. 3. Digital filters (Mathematics) I. An, Myoung. II. Lu, Chao. III. Title. IV. Series. QA403.5.T65 1997 515'.723—dc21 97-16667
Printed on acid-free paper.
© 1997 Springer-Verlag New York, Inc. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York, NY 10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use of general descriptive names, trade names, trademarks, etc., in this publication, even if the former are not especially identified, is not to be taken as a sign that such names, as understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely by anyone.
Production managed by Anthony Battle; manufacturing supervised by Johanna Tschebull. Photocomposed pages prepared from the authors' LaTeX files. Printed and bound by Braun-Brumfield, Inc., Ann Arbor, MI. Printed in the United States of America.
9 8 7 6 5 4 3 2 1
ISBN 0-387-98261-2 Springer-Verlag New York Berlin Heidelberg SPIN 10629424
Preface
This book is based on several courses taught during the years 1985-1989 at the City College of the City University of New York and at Fudan University, Shanghai, China, in the summer of 1986. It was originally our intention to present to a mixed audience of electrical engineers, mathematicians and computer scientists at the graduate level a collection of algorithms that would serve to represent the vast array of algorithms designed over the last twenty years for computing the finite Fourier transform (FFT) and finite convolution. However, it was soon apparent that the scope of the course had to be greatly expanded. For researchers interested in the design of new algorithms, a deeper understanding of the basic mathematical concepts underlying algorithm design was essential. At the same time, a large gap remained between the statement of an algorithm and the implementation of the algorithm. The main goal of this text is to describe tools that can serve both of these needs. In fact, it is our belief that certain mathematical ideas provide a natural language and culture for understanding, unifying and implementing a wide range of digital signal processing (DSP) algorithms. This belief is reinforced by the complex and time-consuming effort required to write code for recently available parallel and vector machines. A significant part of this text is devoted to establishing rules and procedures that reduce and at times automate this task.
In chapter 1, a survey is given of basic algebra. It is assumed that much of this material is not new; in any case, the facts are easily described. The tensor product is introduced in chapter 2. The importance of the tensor product will be a recurring theme throughout this text. Tensor product factors have a direct interpretation as machine instructions on many vector
and parallel computers. Tensor product identities provide linguistic rules for manipulating algorithms to match specific machine characteristics. Inherent in these rules are certain permutations, called stride permutations, which can be implemented on many machines by a single instruction. The tedious effort of readdressing, required in many DSP algorithms, is greatly reduced. Also, the data flow is highlighted, which is especially important on supercomputers where the data flow is usually the major factor that determines the efficiency of the computation.
The design of fast FFT algorithms can be dated back historically to the time of Gauss (1805) [1]. The collected work of some unpublished manuscripts by Gauss contained the essentials of the Cooley-Tukey FFT algorithm, but it did not attract much attention. In 1965, when Cooley and Tukey published their paper [2], known as the fast Fourier transform algorithm, computing science entered a revolutionary era. Since then, many variants of the Cooley-Tukey FFT algorithm have been developed. In chapters 3 and 4 of this book, the Cooley-Tukey FFT algorithm, along with its many variants, is unified under the banner of the tensor product. From one point of view, these algorithms depend on mapping a one-dimensional array of data onto a multidimensional array of data (depending on the degree of compositeness of the transform size). Using the tensor product, we need to derive only the simplest case of mapping into a two-dimensional array. Tensor product identities can then be used to derive the general case. An explicit description of the data flow is automatically given along with rules for varying this data flow, if necessary. In chapter 5, the Good-Thomas prime factor algorithm is reformulated by tensor product.
In chapters 6 and 7, various linear and cyclic convolution algorithms are described. The Chinese Remainder Theorem (CRT) for polynomials is the major tool. Matrix and tensor product formulations are used wherever possible. Results of Cook and Toom and Winograd are emphasized. The integer CRT is applied in chapter 7 to build a large convolution algorithm from efficient small convolution algorithms (Agarwal-Cooley).
The scene changes in chapter 8. Various multiplicative FFT algorithms (depending on the ring structure of the indexing set) are described. The prime size algorithms are due to Rader. Winograd generalized Rader's method to composite transform size. We emphasize a variant of the Rader-Winograd method. Tensor product language is used throughout, and tensor product identities serve as powerful tools for obtaining several variants that offer arithmetic and data flow options.
In chapter 13, we consider the duality between periodic and decimated data established by the Fourier transform. This duality is applied to the computation of the Fourier transform of odd prime power transform sizes, say p^k. The ring structure of the indexing set has an especially simple ideal structure (local ring). The main result decomposes the computation of the Fourier transform into two pieces. The first is a Fourier transform of transform size p^(k-2). The description of the second is the main objective
of chapters 14 and 15, where we introduce the theory of multiplicative characters and derive formulas for computing the Fourier transform of multiplicative characters.
The authors are indebted to the patience and knowledge of L. Auslander, J. Cooley and S. Winograd, who, over the years at IBM, Yorktown Heights, and at the City University of New York, have taken time to explain their works and ideas. The authors wish to thank C. S. Burrus, who read the manuscript of the book and suggested many improvements. We wish to thank DARPA for its support during the formative years of writing this book, and AFOSR for its support during the last two years, in which time the ideas of this book have been tested and refined in applications to electromagnetics, multispectral imaging, and imaging through turbulence. Whatever improvements appear in this revision are due to the routines written in these applications.
Richard Tolimieri Myoung An
Chao Lu
Contents
Preface
1 Review of Applied Algebra 1 1.1 Introduction 1 1.2 The Ring of Integers 2 1.3 The Ring Z/N 5 1.4 Chinese Remainder Theorem (CRT) 7 1.5 Unit Groups 11 1.6 Polynomial Rings 13 1.7 Field Extension 17 1.8 The Ring F[x]/f(x) 18 1.9 CRT for Polynomial Rings 21 References 23 Problems 23
2 Tensor Product and Stride Permutation 27 2.1 Introduction 27 2.2 Tensor Product 28 2.3 Stride Permutations 33
2.4 Multidimensional Tensor Products 40 2.5 Vector Implementation 44 2.6 Parallel Implementation 50 References 53 Problems 53
3 Cooley-Tukey FFT Algorithms 55 3.1 Introduction 55 3.2 Basic Properties of FT Matrix 56 3.3 An Example of an FT Algorithm 57 3.4 Cooley-Tukey FFT for N = 2^M 59 3.5 Twiddle Factors 61 3.6 FT Factors 63 3.7 Variants of the Cooley-Tukey FFT 64 3.8 Cooley-Tukey FFT for N = ML 66 3.9 Arithmetic Cost 68 References 69 Problems 70
4 Variants of FT Algorithms and Implementations 71 4.1 Introduction 71 4.2 Radix-2 Cooley-Tukey FFT Algorithm 72 4.3 Pease FFT Algorithm 76 4.4 Auto-Sorting FT Algorithm 79 4.5 Mixed Radix Cooley-Tukey FFT 81 4.6 Mixed Radix Agarwal-Cooley FFT 84 4.7 Mixed Radix Auto-Sorting FFT 85 4.8 Summary 87 References 89 Problems 90
5 Good-Thomas PFA 91 5.1 Introduction 91 5.2 Indexing by the CRT 92 5.3 An Example, N = 15 93 5.4 Good-Thomas PFA for the General Case 96 5.5 Self-Sorting PFA 98 References 99 Problems 100
6 Linear and Cyclic Convolutions 101 6.1 Definitions 101 6.2 Convolution Theorem 107 6.3 Cook-Toom Algorithm 111 6.4 Winograd Small Convolution Algorithm 119 6.5 Linear and Cyclic Convolutions 125 6.6 Digital Filters 131 References 133 Problems 134
7 Agarwal-Cooley Convolution Algorithm 137 7.1 Two-Dimensional Cyclic Convolution 137 7.2 Agarwal-Cooley Algorithm 142 References 145 Problems 145
8 Multiplicative Fourier Transform Algorithm 147 References 153
9 MFTA: The Prime Case 155 9.1 The Field Z/p 155 9.2 The Fundamental Factorization 157 9.3 Rader's Algorithm 162 9.4 Reducing Additions 163 9.5 Winograd Small FT Algorithm 167 9.6 Summary 169 References 171 Problems 171
10 MFTA: Product of Two Distinct Primes 173 10.1 Basic Algebra 173 10.2 Transform Size: 15 175 10.3 Fundamental Factorization: 15 176 10.4 Variants: 15 178 10.5 Transform Size: pq 181 10.6 Fundamental Factorization: pq 183 10.7 Variants 185 10.8 Summary 189 References 190 Problems 191
11 MFTA: Composite Size 193 11.1 Introduction 193 11.2 Main Theorem 193 11.3 Product of Three Distinct Primes 196 11.4 Variants 197 11.5 Transform Size: 12 198 11.6 Transform Size: 4p, p odd prime 198 11.7 Transform Size: 60 199 References 202 Problems 202
12 MFTA: p^2 203 12.1 Introduction 203 12.2 An Example: 9 203
12.3 The General Case: p^2 206 12.4 An Example: 33 212 References 214 Problems 215
13 Periodization and Decimation 217 13.1 Introduction 217 13.2 Periodic and Decimated Data 220 13.3 FT of Periodic and Decimated Data 223 13.4 The Ring Z/p^m 225 Problems 227
14 Multiplicative Characters and the FT 229 14.1 Introduction 229 14.2 Periodicity 232
14.2.1 Periodic Multiplicative Characters 232 14.2.2 Periodization and Decimation 235
14.3 F(p) of Multiplicative Characters 237 14.4 F(r) of Multiplicative Characters 239
14.4.1 Primitive Multiplicative Characters 239 14.4.2 Nonprimitive Multiplicative Characters 240
14.5 Orthogonal Basis Diagonalizing F(p) 242 14.6 Orthogonal Basis Diagonalizing F(pm) 245
14.6.1 Orthogonal Basis of W 245 14.6.2 Orthogonal Diagonalizing Basis 246
References 247 Problems 248
15 Rationality 249 15.1 An Example: 7 250 15.2 Prime Case 252 15.3 An Example: 3^2 254 15.4 Transform Size: p^2 256 15.5 Exponential Basis 260 15.6 Polynomial Product Modulo a Polynomial 260 15.7 An Example: 33 262 References 264
Index 265
1 Review of Applied Algebra
1.1 Introduction
In this chapter we will give a brief account of several important results from applied algebra necessary to develop the algorithms in this text. In particular, we will describe the main properties of the following rings:
• Ring of integers Z.
• Quotient ring Z/N of integers modulo N.
• Ring of polynomials F[x] over the field F.
• Quotient ring F[x]/f(x) of polynomials.
The Chinese Remainder Theorem for the ring of integers and the ring of polynomials will be treated in detail with special emphasis on the use of complete systems of idempotents to define the Chinese Remainder ring-isomorphism. This ring-isomorphism diagonalizes convolutional operations, in a sense to be described in later chapters.
In the next chapter, we will introduce the tensor or Kronecker product of matrices, a subject in applied linear algebra, and develop the algebra of tensor products, especially the commutation theorem of tensor products. This algebra, along with the algebra of stride permutations, will provide powerful tools for modeling a wide range of algorithms and for constructing large classes of algorithmic variants with well-defined parameters quantifying computational and communication characteristics.
1.2 The Ring of Integers
The ring of integers Z satisfies the following important condition:
Divisibility Condition. If a and b are integers with b > 0, then we can write

a = bq + r, 0 ≤ r < b, (1.1)

where q and r are uniquely determined integers. The integer q is called the quotient of the division of a by b, and it is the largest integer satisfying bq ≤ a.
The integer r is called the remainder of the division of a by b and is given by the formula

r = a - bq.
If r = 0 in (1.1), then a = bq,
and we say that b divides a or that a is a multiple of b, and we write
b | a.
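The quotient and remainder of the divisibility condition are easy to compute; the following sketch (Python, with a helper name `divide` that is ours, not the book's) assumes b > 0 and uses floor division, so it is valid for negative a as well.

```python
def divide(a, b):
    """Return (q, r) with a = b*q + r and 0 <= r < b, assuming b > 0."""
    q = a // b       # floor division: the largest integer q with b*q <= a
    r = a - b * q    # the remainder r = a - b*q lies in [0, b)
    return q, r
```

For example, divide(17, 5) returns (3, 2), and divide(-7, 3) returns (-3, 2), since -7 = 3·(-3) + 2.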
An integer p > 1 is called a prime if its only divisors are ±1 and ±p. An integer c is called a common divisor of integers a and b if c | a and c | b. The integers 1 and -1 are common divisors of any two integers. If 1 and -1 are the only common divisors of a and b, we say that a and b are relatively prime. There are only a finite number of common divisors of two integers a and b as long as a ≠ 0 or b ≠ 0. Denote the largest common divisor of integers a and b by

(a, b).

We call (a, b) the greatest common divisor of a and b. If b = p, a prime, then either (a, p) = 1 and a and p are relatively prime or (a, p) = p and p | a.

Fix an integer N, and set
(N) = NZ = {Nk : k ∈ Z},
the set of all multiples of N. The set NZ is an ideal of the ring Z in the sense that it is closed under addition,
Nk + Nl = N(k + l),
and closed under multiplication by Z,
m(Nk) = (mN)k = N(mk).
We will now prove a fundamental property of the ring Z.
Lemma 1.1 Every ideal of Z has the form NZ, for some integer N ≥ 0.
Proof Suppose that M is an ideal in Z. If M = (0) = 0Z, we are done. Otherwise, M contains positive integers and a smallest positive integer, say N. Take any c ∈ M and write, using (1.1),

c = Nq + r, 0 ≤ r < N.

Since c and N are in M,

r = c - Nq

is also in M. However, 0 ≤ r < N, which contradicts the definition of N unless r = 0. Thus c = Nq and M = NZ, proving the lemma.
We see from the proof of lemma 1.1 that any ideal M ≠ (0) in Z can be written as NZ, where N is the smallest positive integer in M. We will use lemma 1.1 to give an important description of the greatest common divisor.
Lemma 1.2 For nonzero integers a and b,

(a, b) = ax0 + by0,

for some integers x0 and y0.
Proof The set

M = {ax + by : x, y ∈ Z}
is an ideal of Z. By lemma 1.1,
M = dZ,
where d is the smallest positive integer in M. In particular, since a and b are in M, d is a common divisor of a and b. Now write
d = ax0 + by0,
and observe that any common divisor of a and b must divide d, proving the lemma.
From the proof of lemma 1.2, we have that every common divisor of a and b divides (a, b), which can also be characterized as the smallest positive integer in the set
M = {ax + by : x, y ∈ Z}.
As a consequence, (a, b) = 1 if and only if ax0 + by0 = 1, for some integers x0 and y0. We will use this to prove the following result.
Lemma 1.3 If a | bc and (a, b) = 1, then a | c. In particular, if p is prime and p | bc, then p | b or p | c.
Proof Since (a, b) = 1,

ax0 + by0 = 1,

for some integers x0 and y0. Then

c = cax0 + cby0.
Since a | cax0 and, by assumption, a | bc, we have that a | c. To prove the second statement, we observe that if p does not divide b, then (p, b) = 1. Applying the first part completes the proof.
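Lemma 1.2 is constructive: the extended Euclidean algorithm produces the integers x0 and y0 along with (a, b). A minimal Python sketch (the function name `extended_gcd` is ours, not the book's; for simplicity it assumes a, b ≥ 0, not both 0):

```python
def extended_gcd(a, b):
    """Return (d, x, y) with d = (a, b) and d = a*x + b*y, for a, b >= 0 not both 0."""
    if b == 0:
        return a, 1, 0
    d, x, y = extended_gcd(b, a % b)
    # d = b*x + (a mod b)*y, and a mod b = a - (a//b)*b,
    # so d = a*y + b*(x - (a//b)*y)
    return d, y, x - (a // b) * y
```

For example, extended_gcd(12, 18) returns (6, -1, 1), exhibiting (12, 18) = 6 = 12·(-1) + 18·1.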
We have all of the ingredients needed to prove the fundamental prime factorization theorem.
Theorem 1.1 If N > 1 is an integer, then N can be written uniquely, up to ordering of the factors, as

N = p1^a1 p2^a2 ⋯ pr^ar,

where p1, ..., pr are distinct primes and a1 > 0, ..., ar > 0 are integers.
Proof We first prove the existence of such a factorization. If N is prime, we are done. Otherwise, write N = N1N2, where 1 < N1, N2 < N. By mathematical induction, assume that N1 and N2 have factorizations of the given form. Then their product N = N1N2 can be written as a product of primes. Collecting the like primes, N has a factorization of the given form. Suppose that

N = q1^b1 ⋯ qs^bs

is a second factorization of the same form. Then q1 | N. If q1 ≠ pj, 1 ≤ j ≤ r, then (q1, pj) = 1. Lemma 1.3 now implies that q1 does not divide pj^aj. In this case, a second application of lemma 1.3 implies that q1 does not divide N, a contradiction. It follows that q1 = pj for some 1 ≤ j ≤ r. Continuing in this way, and reordering the factors if necessary, we have s ≤ r and
qk = pk, 1 ≤ k ≤ s.
Reversing the roles of the prime factors, we have s = r and

qk = pk, 1 ≤ k ≤ r.
Suppose that a1 < b1. Applying the above argument to the integer

m = N / p1^a1 = p2^a2 ⋯ pr^ar = p1^(b1-a1) p2^b2 ⋯ pr^br,

we have b1 = a1. Continuing in this way, uniqueness follows, completing the proof of the theorem.
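The factorization of Theorem 1.1 can be computed by trial division for moderate N; a sketch (the function name `factor` is ours), returning the exponents ai keyed by the primes pi:

```python
def factor(N):
    """Trial-division prime factorization: N = p1^a1 * ... * pr^ar (theorem 1.1)."""
    factors = {}
    p = 2
    while p * p <= N:
        while N % p == 0:
            factors[p] = factors.get(p, 0) + 1
            N //= p
        p += 1
    if N > 1:                  # the leftover factor is prime
        factors[N] = factors.get(N, 0) + 1
    return factors             # e.g. factor(360) == {2: 3, 3: 2, 5: 1}
```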
Take an integer N > 1 and a prime p dividing N. Suppose that a is the largest positive integer satisfying

p^a | N.

By the proof of Theorem 1.1, p^a appears in the prime factorization of N. If q is another prime divisor of N, and b is the largest positive integer satisfying

q^b | N,

then p^a q^b appears in the prime factorization of N. This discussion leads to the following corollary of Theorem 1.1.
Corollary 1.1 If a | c, b | c and (a, b) = 1, then

ab | c.
Proof Since (a, b) = 1, the prime factors of a and b are distinct. Repeated application of the above discussion proves the corollary.
1.3 The Ring Z/N
Fix an integer N > 1. For any integer a, set
a mod N
equal to the remainder of the division of N into a. In particular,
0 ≤ (a mod N) < N.

Set

Z/N = {0, 1, 2, ..., N - 1}.
Define addition in Z/N by
(a + b) mod N, a, b ∈ Z/N,
and multiplication in Z/N by
(ab) mod N, a, b ∈ Z/N.
Straightforward computation shows that Z/N becomes a commutative ring with identity 1 under these operations.
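These operations are exactly reduction mod N after integer addition and multiplication, and for small N the ring axioms can be checked exhaustively. A Python sketch (helper names ours; N = 6 is an arbitrary choice):

```python
N = 6

def add_mod(a, b):
    """Addition in Z/N."""
    return (a + b) % N

def mul_mod(a, b):
    """Multiplication in Z/N."""
    return (a * b) % N

# Exhaustive check of distributivity over all of Z/6:
ZN = range(N)
assert all(mul_mod(a, add_mod(b, c)) == add_mod(mul_mod(a, b), mul_mod(a, c))
           for a in ZN for b in ZN for c in ZN)
```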
Consider the mapping

η : Z → Z/N,

defined by

η(a) = a mod N.
The mapping η is a ring-homomorphism in the sense that

η(a + b) = (η(a) + η(b)) mod N,
η(ab) = (η(a)η(b)) mod N.
Two integers a and b are said to be congruent mod N if η(a) = η(b) or, equivalently,
N | (a - b).
In this case we write
a ≡ b mod N.
The unit group of Z/N, denoted by U(N), consists of all elements a ∈ Z/N that have multiplicative inverses b ∈ Z/N:
1 = (ab) mod N.
To show that a ∈ U(N), it suffices to find an element b ∈ Z such that

1 ≡ ab mod N, (1.2)
since it then follows that
1 = (a(b mod N)) mod N.
Straightforward verification shows that U(N) is a group under the ring-multiplication in Z/N.
Example 1.1 Take N = 9. Then
U(9) = {1, 2, 4, 5, 7, 8}.
The group table of U(9) under multiplication mod 9 is as follows:
Table 1.1 Multiplication table of U(9).
        1   2   4   5   7   8
    1   1   2   4   5   7   8
    2   2   4   8   1   5   7
    4   4   8   7   2   1   5
    5   5   1   2   7   8   4
    7   7   5   1   8   4   2
    8   8   7   5   4   2   1
Lemma 1.4 U(N) = {a ∈ Z/N : (a, N) = 1}.
Proof By the remarks following lemma 1.2, (a, N) = 1 if and only if

ax0 + Ny0 = 1,

for some integers x0 and y0. Equivalently, (a, N) = 1 if and only if

ax0 ≡ 1 mod N,

which, by (1.2), implies that x0 mod N is the multiplicative inverse of a in Z/N, proving the lemma.
Example 1.2 Take N = 15. Then
U(15) = {1, 2, 4, 7, 8, 11, 13, 14}.
As a special case of lemma 1.4, if p is a prime, then
U(p) = {1, 2, ..., p - 1},
and every nonzero element in Z/p has a multiplicative inverse. Since Z/p is a commutative ring with identity, it follows that Z/p is a finite field.
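Lemma 1.4 gives a direct way to list U(N) and, via Python's built-in three-argument `pow` (Python 3.8+), to invert its elements; a sketch (function names are ours):

```python
from math import gcd

def unit_group(N):
    """U(N) = {a in Z/N : (a, N) = 1}, by lemma 1.4."""
    return [a for a in range(1, N) if gcd(a, N) == 1]

def inverse(a, N):
    """Multiplicative inverse of a in Z/N; requires (a, N) = 1."""
    return pow(a, -1, N)   # built-in modular inverse, Python 3.8+
```

Here unit_group(9) gives [1, 2, 4, 5, 7, 8] and unit_group(15) gives [1, 2, 4, 7, 8, 11, 13, 14], matching examples 1.1 and 1.2; inverse(7, 9) is 4, since 7 · 4 = 28 ≡ 1 mod 9.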
Lemma 1.5 Z/p is a finite field if and only if p is a prime.
Proof We have shown that if p is a prime, then Z/p is a field. Suppose that N is not a prime, and write N = N1N2, where

1 < N1, N2 < N.

By lemma 1.4, since (N1, N) = N1 ≠ 1, N1 does not have a multiplicative inverse in Z/N and Z/N is not a field, completing the proof.
1.4 Chinese Remainder Theorem (CRT)
Suppose that N = N1N2, where (N1, N2) = 1. Form the ring direct product
Z/N1 × Z/N2. (1.3)
A typical element in (1.3) is an ordered pair
(a1, a2), a1 ∈ Z/N1, a2 ∈ Z/N2.
Addition and multiplication in (1.3) are taken as componentwise addition
(a1, a2) + (b1, b2) = ((a1 + b1) mod N1, (a2 + b2) mod N2)
and componentwise multiplication
(a1, a2)(b1, b2) = ((a1b1) mod N1, (a2b2) mod N2).
The CRT constructs a ring-isomorphism
Z/N ≅ Z/N1 × Z/N2. (1.4)
We will construct the ring-isomorphism using idempotents. Since N1 and N2 are relatively prime,

N1f1 + N2f2 = 1, f1, f2 ∈ Z. (1.5)

Set

e1 = (N2f2) mod N, (1.6)
e2 = (N1f1) mod N. (1.7)

Rewrite (1.6) as

e1 = N2f2 + Nq, q ∈ Z. (1.8)
We see from (1.5) that

e1 ≡ 1 mod N1, (1.9)

and from (1.8) that

e1 ≡ 0 mod N2. (1.10)

In the same way,

e2 ≡ 0 mod N1, (1.11)
e2 ≡ 1 mod N2. (1.12)
The element e1 ∈ Z/N is uniquely determined by conditions (1.9) and (1.10). Suppose that a second element g1 ∈ Z/N can be found satisfying

g1 ≡ 1 mod N1, g1 ≡ 0 mod N2.
Then g1 ≡ e1 mod N1 and g1 ≡ e1 mod N2, implying that

N1 | g1 - e1, N2 | g1 - e1.

Since (N1, N2) = 1, corollary 1.1 implies that

N = N1N2 | g1 - e1. (1.13)
Without loss of generality, we assume that g1 - e1 ≥ 0. We have

0 ≤ g1 - e1 < N,

which, in light of (1.13), implies that g1 = e1. This same argument shows that if a and b are elements in Z/N satisfying

a ≡ b mod N1, a ≡ b mod N2,

where N = N1N2 and (N1, N2) = 1, then

a = b.
Conditions (1.9)-(1.12) uniquely determine the set

{e1, e2}, (1.14)

which is called the system of idempotents for the ring Z/N corresponding to the factorization N = N1N2, (N1, N2) = 1.
Example 1.3
Table 1.2 Examples of idempotents.

    N    N1   N2   e1   e2
    6    2    3    3    4
    10   2    5    5    6
    12   3    4    4    9
    15   3    5    10   6
    21   3    7    7    15
    28   4    7    21   8
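The construction (1.5)-(1.7) translates directly into code: f2 is an inverse of N2 mod N1, and e2 can be taken as 1 - e1 mod N. A Python sketch (function name ours; requires Python 3.8+ for the built-in modular inverse):

```python
def idempotents(N1, N2):
    """Idempotents e1, e2 for N = N1*N2 with (N1, N2) = 1 (formulas 1.5-1.7)."""
    N = N1 * N2
    f2 = pow(N2, -1, N1)        # from N1*f1 + N2*f2 = 1: f2 = N2^(-1) mod N1
    e1 = (N2 * f2) % N          # e1 = 1 mod N1, e1 = 0 mod N2
    e2 = (1 - e1) % N           # e2 = 0 mod N1, e2 = 1 mod N2
    return e1, e2
```

For example, idempotents(3, 5) returns (10, 6) and idempotents(4, 7) returns (21, 8), reproducing rows of table 1.2.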
The system of idempotents given in (1.14) will be used to define a ring-isomorphism (1.4). First we need the following result.
Lemma 1.6 If {e1, e2} is the system of idempotents for Z/N corresponding to the factorization N = N1N2, (N1, N2) = 1, then
e1^2 ≡ e1 mod N, e2^2 ≡ e2 mod N, (1.15)
e1e2 ≡ 0 mod N, (1.16)
e1 + e2 ≡ 1 mod N. (1.17)
Proof By (1.8) and (1.9), N1 | (e1 - 1) and N2 | e1. Thus

N1 | e1(e1 - 1), N2 | e1(e1 - 1).

Since (N1, N2) = 1 and e1^2 - e1 = e1(e1 - 1),

N = N1N2 | (e1^2 - e1),

proving e1^2 ≡ e1 mod N.
In the same way, e2^2 ≡ e2 mod N, proving (1.15). Since N2 | e1 and N1 | e2, (N1, N2) = 1 implies that N = N1N2 | e1e2 and that

e1e2 ≡ 0 mod N.

Finally, N1 | (e1 - 1) and N1 | e2, implying that N1 | (e1 + e2 - 1). In the same way, N2 | (e1 + e2 - 1). Again, (N1, N2) = 1 implies that N | (e1 + e2 - 1) and that

e1 + e2 ≡ 1 mod N,

completing the proof.
Define the mapping

φ : Z/N1 × Z/N2 → Z/N

by the formula

φ(a1, a2) = (a1e1 + a2e2) mod N, a1 ∈ Z/N1, a2 ∈ Z/N2.
Theorem 1.2 φ is a ring-isomorphism of Z/N1 × Z/N2 onto Z/N.
Proof Take

a = (a1, a2), b = (b1, b2) ∈ Z/N1 × Z/N2.

We will write addition and multiplication in Z/N1 × Z/N2 by a + b and ab. Straightforward computation shows that

φ(a + b) = (φ(a) + φ(b)) mod N.
Lemma 1.6 implies that
φ(ab) = (φ(a)φ(b)) mod N.
By (1.15) and (1.16),

φ(a)φ(b) ≡ a1b1e1^2 + (a1b2 + a2b1)e1e2 + a2b2e2^2
≡ a1b1e1 + a2b2e2 mod N.
Formula (1.17) implies that

φ(1, 1) ≡ 1 mod N,

proving that φ is a ring-homomorphism. To prove that φ is onto, take any k ∈ Z/N and observe that

k ≡ ((k mod N1) e1 + (k mod N2) e2) mod N.

Since Z/N1 × Z/N2 and Z/N have the same number of elements, this proves that φ is onto, completing the proof of the theorem.
From the above proof, we see that the inverse φ^(-1) of φ is given by

φ^(-1)(k) = (k mod N1, k mod N2), k ∈ Z/N.

This implies that every k ∈ Z/N can be written uniquely as

k ≡ k1e1 + k2e2 mod N,

where k1 ∈ Z/N1 and k2 ∈ Z/N2. This fact will be used later.
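Both φ and its inverse are one-liners once the idempotents are known; the sketch below (function names ours; Python 3.8+ for the modular inverse) checks the round trip for N = 15.

```python
def crt_map(a1, a2, N1, N2):
    """phi(a1, a2) = (a1*e1 + a2*e2) mod N, as in theorem 1.2."""
    N = N1 * N2
    e1 = (N2 * pow(N2, -1, N1)) % N      # e1 = 1 mod N1, 0 mod N2
    e2 = (1 - e1) % N                    # e2 = 0 mod N1, 1 mod N2
    return (a1 * e1 + a2 * e2) % N

def crt_inverse(k, N1, N2):
    """phi^(-1)(k) = (k mod N1, k mod N2)."""
    return (k % N1, k % N2)

# Round trip over all of Z/15 with N1 = 3, N2 = 5:
assert all(crt_map(*crt_inverse(k, 3, 5), 3, 5) == k for k in range(15))
```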
Example 1.4 Take N = 15 with N1 = 3 and N2 = 5. Then e1 = 10, e2 = 6 and φ is given in table 1.3.
Table 1.3 Isomorphic mapping between Z/3 × Z/5 and Z/15.

    Z/3 × Z/5   Z/15
    (0, 0)      0
    (0, 1)      6
    (0, 2)      12
    (0, 3)      3
    (0, 4)      9
    (1, 0)      10
    (1, 1)      1
    (1, 2)      7
    (1, 3)      13
    (1, 4)      4
    (2, 0)      5
    (2, 1)      11
    (2, 2)      2
    (2, 3)      8
    (2, 4)      14
Consider the direct product of unit groups

U(N1) × U(N2),

with componentwise multiplication. U(N1) × U(N2) is the unit group of the ring Z/N1 × Z/N2. In general, any ring-isomorphism maps the unit group isomorphically onto the unit group. Thus, we have the following theorem.

Theorem 1.3 The ring-isomorphism φ restricts to a group-isomorphism of U(N1) × U(N2) onto U(N).
The extension of Theorem 1.3 to the factorization

N = N1N2 ⋯ Nr,

where the factors N1, N2, ..., Nr are pairwise relatively prime, will be given in problems 8 to 12.
1.5 Unit Groups
Properties of unit groups play a major role in algorithm design. In this section, we will state, at times without proof, several important results that will be used repeatedly throughout the text.
Denote the number of elements in a set S by o(S). o(S) is called the order of S. In section 3, we proved that
o(U(p)) = p - 1, for a prime p. (1.18)
The same argument, using lemma 1.4, proves, for a prime p,

o(U(p^a)) = p^(a-1)(p - 1), a ≥ 1.
CRT, especially Theorem 1.3, can be used to extend these results to the general case. Suppose that

N = p1^a1 ⋯ pr^ar

is the prime factorization of N. Then

U(N) ≅ U(p1^a1) × ⋯ × U(pr^ar).

It follows that

o(U(N)) = p1^(a1-1) ⋯ pr^(ar-1) (p1 - 1) ⋯ (pr - 1).
The function o(U(N)), N > 1, is called the Euler totient function.

Table 1.4 Values of the Euler totient function.

    N        5   5^2   5^3   7   7^2   7^3   35
    o(U(N))  4   20    100   6   42    294   24
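The formula for o(U(N)) can be evaluated from the prime factorization by trial division; a sketch (the function name `order_of_unit_group` is ours):

```python
def order_of_unit_group(N):
    """o(U(N)) = product of p^(a-1) * (p - 1) over the prime factorization of N."""
    count, p = 1, 2
    while p * p <= N:
        if N % p == 0:
            a = 0
            while N % p == 0:
                N //= p
                a += 1
            count *= p ** (a - 1) * (p - 1)
        p += 1
    if N > 1:                      # remaining prime factor, exponent 1
        count *= N - 1
    return count
```

Here order_of_unit_group(125) returns 100 and order_of_unit_group(35) returns 24, matching table 1.4.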
We require the following results from general group theory.
Theorem 1.4 If G is a finite group of order m with composition law written multiplicatively, then, for all x ∈ G,

x^m = 1.

Applying this result to the unit group of the finite field Z/p, we have from (1.18)

x^(p-1) ≡ 1 mod p, x ∈ U(p).

Equivalently,

x^p ≡ x mod p, x ∈ Z/p.
Similar results hold in the unit groups U(p^a) and U(N). The next two results are deeper and will be presented without proof.
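Theorem 1.4 applied to U(N), and its special case x^(p-1) ≡ 1 mod p, can be verified numerically; a small check for N = 15 and p = 13 (the choices are arbitrary):

```python
from math import gcd

N = 15
U = [a for a in range(1, N) if gcd(a, N) == 1]
m = len(U)                                   # m = o(U(15)) = 8
assert all(pow(x, m, N) == 1 for x in U)     # x^m = 1 in U(N)

p = 13                                       # Fermat's little theorem
assert all(pow(x, p - 1, p) == 1 for x in range(1, p))
```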
Theorem 1.5 For an odd prime p and integer a ≥ 1, the unit group U(p^a) is a cyclic group.
This important result is proved in many number theory books. As a consequence of Theorem 1.5, an element z ∈ U(p^a), called a generator, can be found such that

U(p^a) = {z^k : 0 ≤ k < o(U(p^a))}.
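A generator of U(p^a) can be found by brute force for small p^a; a sketch (function name ours; the search is exhaustive, so it is only sensible for small sizes):

```python
def find_generator(p, a):
    """Smallest generator z of the cyclic group U(p^a), p an odd prime."""
    n = p ** a
    order = p ** (a - 1) * (p - 1)            # o(U(p^a))
    for z in range(2, n):
        if z % p == 0:
            continue                           # z must be a unit mod p^a
        if len({pow(z, k, n) for k in range(order)}) == order:
            return z                           # the powers of z exhaust U(p^a)
```

For example, find_generator(3, 2) returns 2, matching example 1.5 below.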
The corresponding result for p = 2 is slightly more complicated. The unit group

U(2^2) = {1, 3}

is cyclic, but U(2^a), a > 2, is never cyclic. The exact result follows.
Theorem 1.6 The group

U(2^a), a ≥ 3,

is the direct product of two cyclic groups, one of order 2 and the other of order 2^(a-2). In fact, if

G1 = {1, -1}, G2 = {5^k : 0 ≤ k < 2^(a-2)},

then

U(2^a) = G1 × G2.
Example 1.5 For p = 3 and a = 2,

U(3^2) = {2^k : 0 ≤ k < 6}.

Example 1.6 Take p = 2 and a = 3. Then

U(2^3) = {1, 3, 5, 7} = {1, 7} × {1, 5}.

Take a = 4. Then

U(2^4) = {1, 15} × {1, 5, 9, 13}.
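Theorem 1.6 can be checked directly for a = 4: every odd residue mod 16 arises uniquely as a product from {1, -1} and {5^k}. A short verification sketch:

```python
n = 16                                            # 2^4
U = {x for x in range(1, n) if x % 2 == 1}        # U(16): the odd residues mod 16
G1 = {1, n - 1}                                   # {1, -1 mod 16} = {1, 15}
G2 = {pow(5, k, n) for k in range(4)}             # {5^k : 0 <= k < 2^(4-2)}
products = {(g1 * g2) % n for g1 in G1 for g2 in G2}
assert G2 == {1, 5, 9, 13}
assert products == U                              # U(2^4) = G1 x G2
```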
1.6 Polynomial Rings
Consider a field F and denote by F[x] the ring of polynomials in the indeterminate x having coefficients in F. A typical element in F[x] is a formal expression

f(x) = Σ_{k=0}^{r} fk x^k, fk ∈ F. (1.19)

If fr ≠ 0 in (1.19), we say that the degree of f(x) is r and write

deg f(x) = r.
The elements of F can be viewed as polynomials over F. The nonzero elements in F can be identified with the polynomials over F of degree 0. The zero polynomial, denoted by 0, has by convention degree -∞. Then we have the important result

deg(f(x)g(x)) = deg f(x) + deg g(x).
The integer ring Z and polynomial rings over fields have many properties in common. The reason for this is that the following divisibility condition holds in F[x].
Divisibility Condition. If f(x) and g(x) ≠ 0 are polynomials in F[x], then there is a unique pair of polynomials q(x) and r(x) in F[x] satisfying

f(x) = q(x)g(x) + r(x) (1.20)

and

deg r(x) < deg g(x). (1.21)

The polynomial q(x) is called the quotient of the division of g(x) into f(x). In practice, we compute q(x) by long division of polynomials. The polynomial r(x) is called the remainder of the division of f(x) by g(x).
If r(x) = 0 in (1.20), we have

f(x) = q(x)g(x), q(x) ∈ F[x],

and we say that g(x) divides f(x) or f(x) is a multiple of g(x) over F, and we write

g(x) | f(x).
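The quotient and remainder in (1.20) are computed by long division; the sketch below (Python over float coefficients; representation and names are ours) encodes a polynomial as its coefficient list, lowest degree first, and assumes the divisor's leading coefficient is nonzero.

```python
def poly_divmod(f, g):
    """Divide f by g; polynomials are coefficient lists, lowest degree first
    (f[k] is the coefficient of x^k). Returns (q, r) with f = q*g + r and
    deg r < deg g, as in the divisibility condition."""
    q = [0.0] * max(len(f) - len(g) + 1, 1)
    r = list(map(float, f))
    while len(r) >= len(g) and any(r):
        d = len(r) - len(g)                 # degree gap
        c = r[-1] / g[-1]                   # leading coefficient of the quotient term
        q[d] = c
        for i, gc in enumerate(g):          # subtract c * x^d * g(x)
            r[i + d] -= c * gc
        while r and abs(r[-1]) < 1e-12:     # drop the (now zero) leading term
            r.pop()
    return q, r
```

For f(x) = x^2 - 1 and g(x) = x - 1, poly_divmod([-1, 0, 1], [-1, 1]) returns ([1.0, 1.0], []), i.e. q(x) = x + 1 and r(x) = 0.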
The elements of F viewed as polynomials are called constant polynomials. Nonzero constant polynomials divide every polynomial. A nonconstant polynomial p(x) in F[x] is said to be irreducible or prime over F if the only divisors of p(x) in F[x] are constants or a constant times p(x). Constant polynomials play the same role in F[x] that the integers 1 and -1 play in Z. To force uniqueness in the statements below, we require the notion of a monic polynomial, defined by the property that the leading coefficient fr in (1.19) equals 1.
The division relation satisfies the following properties:
(D1) If h(x) | g(x) and g(x) | f(x), then h(x) | f(x).
(D2) If h(x) | f(x) and h(x) | g(x), then h(x) | (a(x)f(x) + b(x)g(x)), for all polynomials a(x) and b(x).
(D3) If f(x) | g(x) and g(x) | f(x), then f(x) = ag(x), a ∈ F.
Consider polynomials f(x) and g(x) over F. A polynomial h(x) over F is called a common divisor of f(x) and g(x) over F if

h(x) | f(x) and h(x) | g(x).

We say that f(x) and g(x) are relatively prime over F if the only divisors of both f(x) and g(x) over F are the constant polynomials.
A subset J of F[x] is called an ideal if J satisfies the following two properties:
(I1) If f(x), g(x) ∈ J, then f(x) + g(x) ∈ J.

(I2) If f(x) ∈ J and a(x) ∈ F[x], then a(x)f(x) ∈ J.
Equivalently, J is an ideal if, for any two polynomials f(x) and g(x) in J,

a(x)f(x) + b(x)g(x) ∈ J,

for all polynomials a(x) and b(x) in F[x]. The set

(f(x)) = {a(x)f(x) : a(x) ∈ F[x]}
is an ideal of F[x]. The divisibility condition will be used to show that all ideals J in F[x] are of this form. The proof is the same as that in section 2, where we now use the divisibility condition for polynomials. First note that if an ideal J contains nonzero constants, then

J = (1) = F[x],

since, by (I2), if a ≠ 0 is in J, then

f(x) = (a^(-1)f(x))a

is in J for arbitrary f(x) in F[x].
Lemma 1.7 If J is an ideal in F[x] other than (0) or F[x], then
J = (d(x)),
where d(x) is uniquely determined as the monic polynomial of lowest positive degree in J.
Proof By (I2), J contains a monic polynomial of lowest positive degree, say d(x). Take any f(x) in J and write
f(x) = q(x)d(x) + r(x),
where deg r(x) < deg d(x). By (1.20),
r(x) = f(x) − q(x)d(x)
is in J. Since deg r(x) < deg d(x), we must have
deg r(x) = −∞ or 0.
But J contains no nonzero constants, implying that r(x) = 0. Since f(x) is arbitrary,
J = (d(x)),
and all polynomials in J are multiples of d(x). By (D3), d(x) is uniquely determined as the lowest positive degree monic polynomial in J, proving the lemma.
16 1. Review of Applied Algebra
Take any two polynomials f(x) and g(x) over F. The set
J = {a(x)f(x) + b(x)g(x) : a(x), b(x) ∈ F[x]}
is an ideal in F[x]. By lemma 1.7,
J = (d(x)),
where d(x) is the monic polynomial of lowest degree in J. In particular, d(x) is a common divisor of f (x) and g(x). Write
d(x) = a0(x)f(x) + b0(x)g(x),  a0(x), b0(x) ∈ F[x].
By (D2), every common divisor of f(x) and g(x) divides d(x). We have proved the following result.
Lemma 1.8 If f(x) and g(x) are polynomials over F, then there exists a unique monic polynomial d(x) over F satisfying:
I. d(x) is a common divisor of f(x) and g(x).
II. Every common divisor of f(x) and g(x) in F[x] divides d(x).
Equivalently, d(x) is the unique monic polynomial over F, which is a common divisor of f (x) and g(x) of maximal degree. We call d(x) the greatest common divisor of f(x) and g(x) over F and write
d(x) = (f (x), g(x)).
By the divisibility condition above,
(f(x), g(x)) = a(x)f(x) + b(x)g(x),
where a(x) and b(x) are polynomials over F. In particular, if f(x) and g(x) are relatively prime over F, then
1 = a0(x)f(x) + b0(x)g(x),   (1.22)
for some polynomials ao(x) and bo(x) over F. Arguing as in section 2, we have the following corresponding results.
Lemma 1.9 If f(x) | g(x)h(x) and (f(x), g(x)) = 1, then f(x) | h(x).
Theorem 1.7 (Unique Factorization) If f(x) is a polynomial over F, then f(x) can be written uniquely, up to an ordering of factors, as
f(x) = a p1(x)^a1 ⋯ pr(x)^ar,
where a ∈ F, p1(x), …, pr(x) are distinct monic irreducible polynomials over F, and a1 > 0, …, ar > 0 are integers.
Corollary 1.2 For polynomials over F, if f(x) I g(x), h(x) I g(x) and (f(x),h(x)) = 1, then
f(x)h(x) | g(x).
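The Bézout expression (1.22) is constructive: the extended Euclidean algorithm for polynomials, built from repeated applications of the division condition (1.20), produces both the greatest common divisor and the cofactors a0(x), b0(x). A minimal illustrative sketch in Python (the helper names are ours, not from the text; polynomials are lists of rational coefficients, lowest degree first):

```python
from fractions import Fraction

def trim(p):
    """Drop trailing zero coefficients; [] represents the zero polynomial."""
    while p and p[-1] == 0:
        p = p[:-1]
    return p

def add(p, q):
    n = max(len(p), len(q))
    p, q = p + [Fraction(0)] * (n - len(p)), q + [Fraction(0)] * (n - len(q))
    return trim([a + b for a, b in zip(p, q)])

def scale(p, c):
    return trim([c * a for a in p])

def mul(p, q):
    r = [Fraction(0)] * (len(p) + len(q) - 1) if p and q else []
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return trim(r)

def divmod_poly(f, g):
    """Division condition (1.20): f = q*g + r with deg r < deg g."""
    q, r = [], f[:]
    while r and len(r) >= len(g):
        c = r[-1] / g[-1]
        t = [Fraction(0)] * (len(r) - len(g)) + [c]
        q = add(q, t)
        r = trim(add(r, scale(mul(t, g), Fraction(-1))))
    return q, r

def ext_gcd(f, g):
    """Return (d, a, b) with d monic, d = gcd(f, g) and d = a*f + b*g."""
    r0, r1 = f[:], g[:]
    a0, a1 = [Fraction(1)], []
    b0, b1 = [], [Fraction(1)]
    while r1:
        q, r = divmod_poly(r0, r1)
        r0, r1 = r1, r
        a0, a1 = a1, add(a0, scale(mul(q, a1), Fraction(-1)))
        b0, b1 = b1, add(b0, scale(mul(q, b1), Fraction(-1)))
    c = Fraction(1) / r0[-1]          # normalize to a monic gcd
    return scale(r0, c), scale(a0, c), scale(b0, c)

# f(x) = x^2 - 1 and g(x) = x^2 + x have gcd x + 1
f = [Fraction(-1), Fraction(0), Fraction(1)]
g = [Fraction(0), Fraction(1), Fraction(1)]
d, a, b = ext_gcd(f, g)
```

Running `ext_gcd` on this pair returns the monic gcd x + 1 together with cofactors satisfying d(x) = a(x)f(x) + b(x)g(x), as in lemma 1.8.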
1.7 Field Extension
Suppose that K is a field and F is a subfield of K in the sense that F is a subset of K containing 1, and it is closed under the addition and multiplication in K . For example, the rational field Q is a subfield of the real field R, which is a subfield of the complex field C. We also say that K is an extension of F. Observe that
F[x] c K[x].
A polynomial p(x) in F[x] can be irreducible over F without being irreducible as a polynomial in K[x]. For example, the polynomial
x^2 + 1
is irreducible over the real field, but over the complex field
x^2 + 1 = (x + i)(x − i).
Thus, the field of definition must be specified when referring to irreducibility.
Consider now the greatest common divisor d(x) over F of two polynomials f(x) and g(x) over F. View f(x) and g(x) as polynomials over K and denote the greatest common divisor of f(x) and g(x) over K by e(x). By lemma 1.8,
d(x) | e(x),
meaning that
e(x) = q(x)d(x),  q(x) ∈ K[x].
Write
d(x) = a(x)f(x) + b(x)g(x),  a(x), b(x) ∈ F[x].
Since e(x) is a common divisor of f(x) and g(x), by (D2) of section 6,
e(x) | d(x).
Applying (D3) of section 6,
e(x) = d(x).
Thus, the greatest common divisor d(x) of two polynomials f (x) and g(x) over F does not change when we go to an extension K of F. In particular, f (x) and g(x) are relatively prime over F if and only if they are relatively prime over any extension K of F.
Consider a polynomial f(x) over F and suppose that K is an extension of F. The polynomial f(x) can be evaluated at any element a of the field K by replacing the indeterminate x by a. The result,
f(a) = f0 + f1 a + ⋯ + fn a^n,  n = deg f(x),
is an element in K. We say that a is a root or zero of f(x) if f(a) = 0. The main reason to consider extensions K of a field F is to find roots of polynomials f(x). For instance, x^2 + 1 has no roots in the real field, but in the complex field i and −i are roots.
Lemma 1.10 A nonzero polynomial f (x) over F has a root a in an extension field K if and only if
f (x) = (x — a)g(x),
for some polynomial g(x) over K . In any extension field K of F, a polynomial f(x) over F has at most n roots, n = deg f (x).
Proof Applying the divisibility condition in K[x],
f(x) = (x − a)g(x) + r(x),
where g(x), r(x) ∈ K[x] and deg r(x) < 1, so that r(x) is a constant. Since f(a) = 0, we have r(a) = 0, implying that r(x) = 0. Suppose that a1, …, am are distinct roots of f(x) in K. Then (x − aj) | f(x), 1 ≤ j ≤ m. The polynomials
x − a1, …, x − am
are relatively prime in pairs. By the extension of corollary 1.2 to several factors, we have
(x − a1) ⋯ (x − am) | f(x).
Thus
m ≤ deg f(x),
completing the proof.
1.8 The Ring F[x]/f(x)
Fix a polynomial f(x) over F of degree n. Set
F[x]/f(x)
equal to the set of all polynomials g(x) over F satisfying
deg g(x) < n.
Every polynomial g(x) in F[x]/f(x) can be written as
g(x) = g0 + g1 x + ⋯ + g_{n−1} x^{n−1},  gk ∈ F,
and we can regard F[x]/f(x) as an n-dimensional vector space over F having basis
1, x, …, x^{n−1}.
We place a ring-multiplication on F[x]/f(x) as follows. For any g(x) ∈ F[x], denote by
g(x) mod f(x)
the remainder of the division of g(x) by f(x). Then
g(x) mod f(x) ∈ F[x]/f(x).
Define multiplication in F[x]/f(x) by
(g(x)h(x)) mod f(x),  g(x), h(x) ∈ F[x]/f(x).   (1.23)
Direct computation shows that the vector space F[x]/f(x) becomes an algebra over F with the multiplication (1.23).
Two polynomials g(x) and h(x) over F are said to be congruent mod f(x), and we write
g(x) ≡ h(x) mod f(x)   (1.24)
if g(x) mod f(x) = h(x) mod f(x). Equivalently, (1.24) holds if
f(x) | (g(x) − h(x)).
Define the mapping η : F[x] → F[x]/f(x)
by the formula
η(g(x)) = g(x) mod f(x).
Straightforward computation shows that η is a ring-homomorphism of F[x] onto F[x]/f(x) whose kernel
{g(x) ∈ F[x] : η(g(x)) = 0}
is the ideal (f(x)).
In lemma 1.5, we gave a method of constructing a finite field of order p, for a prime p. We will now construct fields using the rings F[x]/f(x).
Lemma 1.11 The ring F[x]/f(x) is a field if and only if f(x) is irreducible over F.
Proof Suppose that f(x) is irreducible. Take any nonzero polynomial g(x) in F[x]/f(x). By (1.22),
1 = a0(x)g(x) + b0(x)f(x),
where a0(x) and b0(x) are polynomials over F. Then
1 ≡ a0(x)g(x) mod f(x),
so a0(x) mod f(x) is the multiplicative inverse of g(x) in F[x]/f(x). Since g(x) is an arbitrary nonzero polynomial in F[x]/f(x), the commutative ring F[x]/f(x) is a field.
Conversely, suppose that f(x) is not irreducible. Then
f(x) = f1(x)f2(x),
where
0 < deg fk(x) < deg f(x),  k = 1, 2.
Then f1(x) and f2(x) are in F[x]/f(x) and
0 = (f1(x)f2(x)) mod f(x).
If f1(x) had a multiplicative inverse, then
0 ≡ f2(x) mod f(x),
a contradiction, completing the proof of the converse of the lemma.
More generally, we have the next result, which we give without proof.
Lemma 1.12 The unit group U of F[x]/f(x), consisting of all polynomials g(x) in F[x]/f(x) having a multiplicative inverse, is
U = {h(x) ∈ F[x]/f(x) : (h(x), f(x)) = 1}.
Identifying F with the constant polynomials in F[x]/f(x), we have
F ⊂ F[x]/f(x).
If p(x) is an irreducible polynomial of degree n, then
K = F[x]/p(x)   (1.25)
is a field extension of F that also can be viewed as a vector space of dimension n over F. Suppose that
F = Z/p.
Then K is a finite field of order p^n. We state the next result without proof.
Lemma 1.13 If K is a finite field, then K has order p^n for some prime p and integer n ≥ 1. Two finite fields of the same order are isomorphic.
In addition, every finite field K can be constructed as in (1.25).
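As a concrete instance of lemma 1.11 and the construction (1.25), take F = Z/2 and the irreducible polynomial p(x) = x^2 + x + 1; then F[x]/p(x) is a field of order 2^2 = 4. The illustrative sketch below (our encoding: an element is a 2-bit integer whose bits are the polynomial coefficients over Z/2) checks that every nonzero element has a multiplicative inverse:

```python
P = 0b111  # p(x) = x^2 + x + 1 over Z/2, irreducible

def gf4_mul(a, b):
    """Multiply two elements of Z/2[x]/(x^2 + x + 1), encoded as 2-bit integers."""
    # carry-less (polynomial) multiplication over Z/2
    r = 0
    for i in range(2):
        if (b >> i) & 1:
            r ^= a << i
    # reduce mod p(x): x^2 is replaced by x + 1
    for i in (3, 2):
        if (r >> i) & 1:
            r ^= P << (i - 2)
    return r

# the ring is a field: each nonzero element has an inverse
inverses = {a: next(b for b in range(1, 4) if gf4_mul(a, b) == 1)
            for a in range(1, 4)}
```

For example, with x encoded as 2 and x + 1 as 3, the product x · x reduces to x + 1, exactly as replacing x^2 by x + 1 predicts.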
1.9 CRT for Polynomial Rings
Consider a polynomial f(x) over F, and suppose that
f(x) = f1(x)f2(x),  (f1(x), f2(x)) = 1.   (1.26)
We will define a ring-isomorphism
F[x]/f1(x) × F[x]/f2(x) → F[x]/f(x),   (1.27)
where the ring-direct product in (1.27) is taken with respect to componentwise addition and multiplication. First, following section 4, we define the idempotents. Since f1(x) and f2(x) are relatively prime, we can write
1 = a1(x)f1(x) + a2(x)f2(x),
with polynomials a1(x) and a2(x) over F. Set
e1(x) = (a2(x)f2(x)) mod f(x),
e2(x) = (a1(x)f1(x)) mod f(x).
Arguing as in section 4,
e1(x) ≡ 1 mod f1(x),  e1(x) ≡ 0 mod f2(x),   (1.28)
e2(x) ≡ 0 mod f1(x),  e2(x) ≡ 1 mod f2(x).   (1.29)
Conditions (1.28) and (1.29) uniquely determine the polynomials e1(x) and e2(x) in F[x]/f(x). The set
{e1(x), e2(x)}
is called the system of idempotents corresponding to factorization (1.26). Arguing as in lemma 1.6, we have the next result.
Lemma 1.14 The system of idempotents {e1(x), e2(x)} satisfies
ek(x)^2 ≡ ek(x) mod f(x),  k = 1, 2,
e1(x)e2(x) ≡ 0 mod f(x),
e1(x) + e2(x) ≡ 1 mod f(x).
Define
φ(g1(x), g2(x)) = (g1(x)e1(x) + g2(x)e2(x)) mod f(x),
gk(x) ∈ F[x]/fk(x), k = 1, 2. As in Theorem 1.2, the next result follows.
Theorem 1.8 φ is a ring-isomorphism of the ring-direct product F[x]/f1(x) × F[x]/f2(x) onto F[x]/f(x), having inverse φ^−1 given by the formula
φ^−1(g(x)) = (g(x) mod f1(x), g(x) mod f2(x)),
for g(x) ∈ F[x]/f(x).
In particular, every g(x) in F[x]/f(x) can be written uniquely as
g(x) ≡ (g1(x)e1(x) + g2(x)e2(x)) mod f(x),
where gk(x) ∈ F[x]/fk(x), k = 1, 2.
The extension of these results to factorizations of the form
f(x) = f1(x)f2(x) ⋯ fr(x),   (1.30)
where the factors fk(x), 1 ≤ k ≤ r, are pairwise relatively prime, is straightforward. To construct the system of idempotents,
{ek(x) : 1 ≤ k ≤ r},   (1.31)
corresponding to the factorization (1.30), we reason as follows. First,
(f1(x), f2(x) ⋯ fr(x)) = 1,
and we can apply the above discussion to find a unique polynomial e1(x) in F[x]/f(x) satisfying
e1(x) ≡ 1 mod f1(x),   (1.32)
e1(x) ≡ 0 mod f2(x) ⋯ fr(x).   (1.33)
Condition (1.33) implies that
e1(x) ≡ 0 mod fk(x),  1 ≤ k ≤ r, k ≠ 1.
Continuing in this way, we find polynomials ek(x) in F[x]/f(x), 1 ≤ k ≤ r, satisfying
ek(x) ≡ 1 mod fk(x),  1 ≤ k ≤ r,
ek(x) ≡ 0 mod fl(x),  1 ≤ k, l ≤ r, k ≠ l.
These conditions uniquely determine the set (1.31). As before, we have the next result.
Lemma 1.15 The system of idempotents (1.31) satisfies the properties
ek(x)^2 ≡ ek(x) mod f(x),  1 ≤ k ≤ r,
ek(x)el(x) ≡ 0 mod f(x),  1 ≤ k, l ≤ r, k ≠ l,
e1(x) + e2(x) + ⋯ + er(x) ≡ 1 mod f(x).
As before, we have the ring-isomorphism φ of the direct product
F[x]/f1(x) × F[x]/f2(x) × ⋯ × F[x]/fr(x)
onto F[x]/f(x) given by the formula
φ(g1(x), …, gr(x)) = (g1(x)e1(x) + ⋯ + gr(x)er(x)) mod f(x).
The inverse φ^−1 of φ is given by the formula
φ^−1(g(x)) = (g(x) mod f1(x), …, g(x) mod fr(x)),
and every g(x) in F[x]/f(x) can be written uniquely as
g(x) ≡ (g1(x)e1(x) + ⋯ + gr(x)er(x)) mod f(x),
where gk(x) is in F[x]/fk(x), 1 ≤ k ≤ r.
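Problem 21 below asks for exactly this computation in the simplest case. An illustrative sketch for f(x) = (x − 1)(x + 1) over Q (helper names are ours): the Bézout relation 1 = (−1/2)(x − 1) + (1/2)(x + 1) yields the idempotents e1(x) = (1/2)(x + 1) and e2(x) = (1/2)(1 − x), which reassemble g(x) from its residues g(1) and g(−1):

```python
from fractions import Fraction as Fr

# f(x) = (x - 1)(x + 1) = x^2 - 1 over Q.  From
# 1 = (-1/2)(x - 1) + (1/2)(x + 1), the idempotents are
#   e1(x) = (1/2)(x + 1)  and  e2(x) = (1/2)(1 - x).
e1 = [Fr(1, 2), Fr(1, 2)]    # coefficients, lowest degree first
e2 = [Fr(1, 2), Fr(-1, 2)]

def poly_mul(p, q):
    r = [Fr(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

def mod_f(p):
    """Reduce mod f(x) = x^2 - 1 using x^k ≡ x^(k mod 2)."""
    out = [Fr(0), Fr(0)]
    for k, c in enumerate(p):
        out[k % 2] += c
    return out

# CRT reconstruction of g(x) = 2 + 3x from its residues
# g mod (x - 1) = g(1) = 5 and g mod (x + 1) = g(-1) = -1:
recon = [5 * a + (-1) * b for a, b in zip(e1, e2)]
```

The reconstructed polynomial equals 2 + 3x, and the identities of lemma 1.14 (e1^2 ≡ e1, e1·e2 ≡ 0 mod f) can be checked with the same helpers.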
References
[1] Ireland, K. and Rosen, M. A Classical Introduction to Modern Number Theory, Springer-Verlag, 1980.
[2] Halmos, P. R. Finite-Dimensional Vector Spaces, Springer-Verlag, 1974.
[3] Herstein, I. N. Topics in Algebra, XEROX College Publishing, 1964.
Problems
1. Show that
(a + (b + c) mod N) mod N = ((a + b) mod N + c) mod N.
This is the associative law for addition mod N.
2. Show that
(a · ((b + c) mod N)) mod N
= ((a · b) mod N + (a · c) mod N) mod N.
This is the distributive law in the ring Z/N.
3. Describe the unit group U(N) of Z/N explicitly for N = 12, N = 21, N = 44 and N = 105.
4. Give the table for addition and multiplication in the field Z/11.
5. Give the table for addition and multiplication in the ring Z/21.
6. Find the system of idempotents corresponding to the factorizations N = 3 × 7, N = 4 × 5, N = 2 × 7 and N = 7 × 11.
7. Give the table for the ring-isomorphism φ of the CRT corresponding to the factorizations N = 4 × 5 and N = 2 × 7.
8. Suppose that N = N1N2 ⋯ Nr, where the factors N1, N2, …, Nr are relatively prime in pairs. Show that
(Nk, N/Nk) = 1,  1 ≤ k ≤ r.
9. Continuing the notation and using the result of problem 8, define integers e1, e2, …, er satisfying
0 ≤ ek < N,  1 ≤ k ≤ r,
ek ≡ 1 mod Nk,  1 ≤ k ≤ r,
ek ≡ 0 mod Nl,  1 ≤ k, l ≤ r, k ≠ l.
These integers are uniquely determined by the above conditions and form the system of idempotents corresponding to the factorization given in problem 8.
10. Continuing the notation of problems 8 and 9, prove the analog of lemma 1.6:
ek^2 ≡ ek mod N,  1 ≤ k ≤ r,
ek el ≡ 0 mod N,  1 ≤ k, l ≤ r, k ≠ l,
e1 + e2 + ⋯ + er ≡ 1 mod N.
11. Define the CRT ring-isomorphism φ of the direct product Z/N1 × Z/N2 × ⋯ × Z/Nr onto Z/N and describe its inverse φ^−1.
12. Extend Theorem 1.3 to the case of several factors given in the above problems.
13. Find a generator of the unit group U(N) of Z/N where N = 5, N = 25, N = 125.
14. Show that U(21) is not a cyclic group. Use Theorem 1.3 to find generators of U(21).
15. For N = p1p2 ⋯ pr, where the factors p1, p2, …, pr are distinct primes, show that the unit group U(N) of Z/N is group-isomorphic to the direct sum
Z/(p1 − 1) ⊕ Z/(p2 − 1) ⊕ ⋯ ⊕ Z/(pr − 1).
16. Prove that
deg(f(x)g(x)) = deg f(x) + deg g(x).
17. Write out the divisibility condition for the polynomials
g(x) = x^10 + 4x^8 + 2x^2 + 3,
f(x) = 4x^20 + 2x^10 + 1.
18. For any two polynomials f(x) and g(x) in Q[x], show that the following set is an ideal:
J = {a(x)f(x) + b(x)g(x) : a(x), b(x) ∈ Q[x]}.
19. Let F be a finite field and form the set
L = {1, 1 + 1, 1 + 1 + 1, …}.
Show that L has order p for some prime p and that L is a subfield of F isomorphic to the field Z/p. The prime p is called the characteristic of the finite field F.
20. Show that every finite field K has order p^n for some prime p and integer n ≥ 1.
21. For the polynomial over Q
f(x) = (x − 1)(x + 1),
find the idempotents corresponding to this factorization and describe the table giving the CRT ring-isomorphism.
22. Find the idempotents corresponding to the factorization
f(x) = (x − a1)(x − a2) ⋯ (x − ar),
where a1, …, ar are distinct elements in some field F. Describe the corresponding CRT ring-isomorphism φ and its inverse φ^−1.
2 Tensor Product and Stride Permutation
2.1 Introduction
Tensor product offers a natural language for expressing digital signal processing (DSP) algorithms in terms of matrix factorizations. In this chapter, we define the tensor product and derive several important tensor product identities.
Closely associated with tensor product is a class of permutations, the stride permutations. These permutations govern the addressing between the stages of tensor product decompositions of DSP algorithms. As we will see in the following chapters, these permutations distinguish the variants of the Cooley-Tukey FFT algorithms and other DSP algorithms.
Tensor product formulation of DSP algorithms also offers the convenience of modifying the algorithms to adapt to specific computer architectures. Tensor product identities can be used in the process of automating the implementation of the algorithms on these architectures. The formalism of tensor product notation can be used to keep track of the complicated index calculations needed in implementing FT algorithms. In [1], the implementation of tensor product actions on the CRAY X-MP was carried out in detail.
2.2 Tensor Product
In this section, we present some of the basic properties of tensor products which are encountered in the algorithms that we will describe in future chapters of this work. Tensor product algebra is an important tool for presenting mathematical formulations of DSP algorithms so that these algorithms may be studied and analyzed in a unified format. We first define the tensor product of vectors and present some of its properties. We then define the tensor product of matrices and describe additional properties. These properties will be very useful in manipulating the factorizations of discrete FT matrices.
Let C^M denote the M-dimensional vector space of M-tuples of complex numbers. A typical point a ∈ C^M is a column vector,
a = [a0, a1, …, a_{M−1}]^t.
We say that a has size M. If the size of a ∈ C^M is important, we write a as a_M.
The tensor product of two vectors a ∈ C^M and b ∈ C^L is the vector a ⊗ b ∈ C^N, N = ML, obtained by stacking the segments a0 b, a1 b, …, a_{M−1} b:
a ⊗ b = [a0 b, a1 b, …, a_{M−1} b]^t.
Example 2.1 For a of size 3 and b of size 2,
a ⊗ b = [a0b0, a0b1, a1b0, a1b1, a2b0, a2b1]^t.
The tensor product is bilinear in the following sense. For vectors a, b, c of appropriate sizes,
(a + b) ⊗ c = a ⊗ c + b ⊗ c,   (2.1)
a ⊗ (b + c) = a ⊗ b + a ⊗ c,   (2.2)
but it is not commutative. In general, a ⊗ b ≠ b ⊗ a.
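The stacking definition and the failure of commutativity are easy to make concrete in code. An illustrative one-function sketch in Python (the function name is ours):

```python
def tensor(a, b):
    """a ⊗ b: stack the segments a[0]*b, a[1]*b, ..., each of size len(b)."""
    return [am * bl for am in a for bl in b]

a = [1, 2, 3]      # size M = 3
b = [10, 20]       # size L = 2

ab = tensor(a, b)  # size N = M * L = 6
ba = tensor(b, a)
```

Here `ab` is [10, 20, 20, 40, 30, 60] while `ba` is [10, 20, 30, 20, 40, 60]: the same six products in a different order, exactly the transpose relationship discussed next.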
Tensor product constructions often involve the following relationship between linear arrays and multidimensional arrays. An M × L array
X = [x_{m,l}],  0 ≤ m < M, 0 ≤ l < L,
can be identified with the vector x of size N, N = ML, formed by running down the columns of X in consecutive order. Conversely, given a vector x of size N, we denote by Mat_{M×L}(x) the M × L array formed by first segmenting x into L vectors of size M,
X0 = [x0, …, x_{M−1}]^t,  X1 = [x_M, …, x_{2M−1}]^t,  …,  X_{L−1} = [x_{(L−1)M}, …, x_{ML−1}]^t,
and then placing these vectors in L consecutive columns,
Mat_{M×L}(x) = [X0 X1 ⋯ X_{L−1}].
Consider the tensor products a ⊗ b and b ⊗ a with a ∈ C^M, b ∈ C^L. Identify a ⊗ b with the L × M array
Mat_{L×M}(a ⊗ b) = [a0 b  a1 b  ⋯  a_{M−1} b],
and b ⊗ a with the M × L array
Mat_{M×L}(b ⊗ a) = [b0 a  b1 a  ⋯  b_{L−1} a].
We see that
Mat_{M×L}(b ⊗ a) = (Mat_{L×M}(a ⊗ b))^t.
Thus, interchanging order in the tensor product corresponds to matrix transpose. In example 2.1, a ⊗ b corresponds to the 2 × 3 array

a0b0  a1b0  a2b0
a0b1  a1b1  a2b1

while the vector b ⊗ a corresponds to the 3 × 2 array

b0a0  b1a0
b0a1  b1a1
b0a2  b1a2
In general, we can describe matrix transposition in terms of a permutation of an indexing set. Consider first the L × M array
Y = [y_{l,m}],  0 ≤ l < L, 0 ≤ m < M.
For any 0 ≤ r, s < N, we can write uniquely
r = l + mL,  0 ≤ l < L, 0 ≤ m < M,
s = m + lM,  0 ≤ m < M, 0 ≤ l < L.
The vector y formed from the array Y has components given by
y_r = y_{l,m},  r = l + mL,  0 ≤ m < M, 0 ≤ l < L,
while the vector z formed from Y^t has components given by
z_s = y_{l,m},  s = m + lM,  0 ≤ m < M, 0 ≤ l < L.
This corresponds to the permutation of the indexing set,
π(l + mL) = m + lM,  0 ≤ m < M, 0 ≤ l < L,   (2.3)
and we have
y_r = z_{π(r)},  0 ≤ r < N.   (2.4)
Example 2.2 Taking M = 2 and L = 3, we have
π = (0, 2, 4, 1, 3, 5)
and
y = [z0, z2, z4, z1, z3, z5]^t.
To form y from z, we 'stride' through z with length two.
In general, to form a ⊗ b from b ⊗ a, we first initialize at b0a0, the 0-th component of b ⊗ a, and then stride through b ⊗ a with length the size of a. After a pass through b ⊗ a, we reinitialize at b0a1, the first component of b ⊗ a, and then stride through b ⊗ a with the same length. This permutation of data continues until we form a ⊗ b. This procedure is an example of the important notion of a stride permutation. Stride permutations will be discussed in great detail beginning in the next section.
Denote by e_m^M, 0 ≤ m < M, the vector of size M with 1 in the m-th place and 0 elsewhere. The set of vectors
{e_m^M : 0 ≤ m < M}
is a basis of C^M called the standard basis. Set N = ML and form the tensor products e_m^M ⊗ e_l^L, 0 ≤ m < M, 0 ≤ l < L. Since
e_{l+mL}^N = e_m^M ⊗ e_l^L,  0 ≤ m < M, 0 ≤ l < L,
the set
{e_m^M ⊗ e_l^L : 0 ≤ m < M, 0 ≤ l < L}
is the standard basis of C^N.
In particular, as a runs over all vectors of size M and b runs over all vectors of size L, the tensor products a ⊗ b span the space C^N (see problems 4 and 5). To prove that the actions of two matrices on C^N are equal, we only need to prove that they are equal on tensor products of the form a_M ⊗ b_L.
The tensor product of an R × S matrix A with an M × L matrix B is the RM × SL matrix A ⊗ B given by the block structure

a_{0,0}B      a_{0,1}B      ⋯  a_{0,S−1}B
⋮                              ⋮
a_{R−1,0}B  a_{R−1,1}B  ⋯  a_{R−1,S−1}B

Setting C = A ⊗ B, the coefficients of C are given by
c_{m+rM, l+sL} = a_{r,s} b_{m,l}.
It is natural to view the tensor product A ⊗ B as being formed from blocks of scalar multiples of B. The relationship between tensor products of matrices and vectors is contained in the next result.
Theorem 2.1 If A is an R × S matrix and B is an M × L matrix, then
(A ⊗ B)(a ⊗ b) = Aa ⊗ Bb,
for any vectors a and b of sizes S and L.
Proof The vector
a ⊗ b = [a0 b, a1 b, …, a_{S−1} b]^t
can be viewed as consisting of consecutive segments,
a0 b, a1 b, …, a_{S−1} b,
each of size L. Since the M × SL matrix formed from the first M rows of A ⊗ B is
[a_{0,0}B  a_{0,1}B  ⋯  a_{0,S−1}B],
the vector of size M formed from the first M components of (A ⊗ B)(a ⊗ b) is
(a_{0,0}a0 + a_{0,1}a1 + ⋯ + a_{0,S−1}a_{S−1}) Bb,
which is the 0-th component of Aa times Bb. Continuing in this way proves the theorem.
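Theorem 2.1 is easy to check numerically. The illustrative sketch below (helper names are ours) builds A ⊗ B directly from the coefficient formula c_{m+rM, l+sL} = a_{r,s} b_{m,l} and compares (A ⊗ B)(a ⊗ b) with Aa ⊗ Bb:

```python
def kron(A, B):
    """Tensor product of matrices: block (r, s) of A ⊗ B is A[r][s] * B."""
    return [[A[r][s] * B[m][l] for s in range(len(A[0])) for l in range(len(B[0]))]
            for r in range(len(A)) for m in range(len(B))]

def matvec(A, x):
    return [sum(row[j] * x[j] for j in range(len(x))) for row in A]

def tensor(a, b):
    return [am * bl for am in a for bl in b]

A = [[1, 2], [3, 4]]           # R x S = 2 x 2
B = [[0, 1], [1, 1], [2, 5]]   # M x L = 3 x 2
a = [1, -1]                    # size S
b = [2, 3]                     # size L

lhs = matvec(kron(A, B), tensor(a, b))   # (A ⊗ B)(a ⊗ b)
rhs = tensor(matvec(A, a), matvec(B, b)) # Aa ⊗ Bb
```

Both sides evaluate to the same size-RM vector, as the theorem asserts.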
Theorem 2.2 If A and C are M × M matrices and B and D are L × L matrices, then
(A ⊗ B)(C ⊗ D) = AC ⊗ BD.
Proof Take vectors a and b of sizes M and L. By Theorem 2.1,
(A ⊗ B)(C ⊗ D)(a ⊗ b) = (A ⊗ B)(Ca ⊗ Db) = ACa ⊗ BDb,
proving the theorem, in light of the preceding discussion.
More generally, the tensor products
a ⊗ b ⊗ c = a ⊗ (b ⊗ c) = (a ⊗ b) ⊗ c,  a ∈ C^M, b ∈ C^L, c ∈ C^K,
span the space C^N, N = MLK, and the observation about matrix expressions can be extended.
An important special case of Theorem 2.2 is the following decomposition. Denote by I_L the L × L identity matrix. Then
A ⊗ B = (I_M ⊗ B)(A ⊗ I_L) = (A ⊗ I_L)(I_M ⊗ B),   (2.5)
where A is an M × M matrix and B is an L × L matrix. In order to better understand the computation (A ⊗ B)x, we need to examine the factors I_M ⊗ B and A ⊗ I_L.
I_M ⊗ B is the direct sum of M copies of B, the block diagonal matrix
I_M ⊗ B = diag(B, B, …, B),
and its action on x is the action of B on the M consecutive segments of x of size L. We call I_M ⊗ B a parallel operation. For a vector x ∈ C^N, N = ML, we have
Mat_{L×M}((I_M ⊗ B)x) = B Mat_{L×M}(x).
Write
Mat_{L×M}(x) = [X0  X1  ⋯  X_{M−1}],
where X_m is the m-th column of Mat_{L×M}(x). The computation (A ⊗ I_L)x can be interpreted as a vector operation of A on the vectors X0, X1, …, X_{M−1}: stacking the segments

a_{0,0}X0 + ⋯ + a_{0,M−1}X_{M−1},
⋮
a_{M−1,0}X0 + ⋯ + a_{M−1,M−1}X_{M−1}   (2.6)

gives (A ⊗ I_L)x.
Operations involved in (2.6) are scalar-vector multiplication and vector addition.
Factorization (2.5) decomposes the operation A ⊗ B into the parallel operation I_M ⊗ B followed by the vector operation A ⊗ I_L. For a vector x ∈ C^N,
Mat_{L×M}((A ⊗ I_L)x) = Mat_{L×M}(x) A^t
and
Mat_{L×M}((A ⊗ B)x) = B Mat_{L×M}(x) A^t.
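The identity Mat_{L×M}((A ⊗ B)x) = B Mat_{L×M}(x) A^t can also be verified directly. An illustrative sketch in pure Python (our helper names), comparing the tensor product action on a vector with the two-sided matrix product on its array form:

```python
def kron(A, B):
    return [[A[r][s] * B[m][l] for s in range(len(A[0])) for l in range(len(B[0]))]
            for r in range(len(A)) for m in range(len(B))]

def matvec(A, x):
    return [sum(row[j] * x[j] for j in range(len(x))) for row in A]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def mat(x, L, M):
    """Mat_{L x M}(x): segment x into M consecutive columns of size L."""
    return [[x[m * L + l] for m in range(M)] for l in range(L)]

M, L = 2, 3
A = [[1, 2], [3, 4]]                    # M x M
B = [[1, 0, 2], [0, 1, 1], [1, 1, 0]]   # L x L
x = [1, 2, 3, 4, 5, 6]                  # size N = M * L

lhs = mat(matvec(kron(A, B), x), L, M)               # Mat((A ⊗ B)x)
rhs = matmul(matmul(B, mat(x, L, M)), transpose(A))  # B Mat(x) A^t
```

Both sides produce the same L × M array, which is how tensor product actions are reduced to ordinary matrix multiplications in implementations.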
2.3 Stride Permutations
In this section, we discuss the stride permutations that govern the data flow required to parallelize or vectorize a tensor product computation. Stride permutations play a crucial role in the implementation of FT computations. On some machines, the action of a stride permutation can be implemented by loading the elements of the input vector from main memory into registers at the corresponding stride. For architectures where this is the case, considerable savings can be obtained by performing these permutations when loading the input into the registers.
Take N = 2M. The tensor products a ⊗ b, where a ∈ C^2 and b ∈ C^M, span C^N. We define the N-point stride M permutation matrix P(N, M) by the rule
P(N, M)(a ⊗ b) = b ⊗ a,  a ∈ C^2, b ∈ C^M.   (2.7)
More generally, if x is a vector of size N, then y = P(N, M)x satisfies
Mat_{2×M}(y) = (Mat_{M×2}(x))^t.
The matrix P(N, M) is usually called the perfect shuffle. It strides through x with stride of length M.
Example 2.3 Take N = 4. Then

P(4, 2) =
1 0 0 0
0 0 1 0
0 1 0 0
0 0 0 1

and P(4, 2)x = [x0, x2, x1, x3]^t.
Example 2.4 Take N = 8. Then

P(8, 4) =
1 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0
0 1 0 0 0 0 0 0
0 0 0 0 0 1 0 0
0 0 1 0 0 0 0 0
0 0 0 0 0 0 1 0
0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 1

and P(8, 4)x = [x0, x4, x1, x5, x2, x6, x3, x7]^t.
Example 2.5 Take N = 6. Then

P(6, 3) =
1 0 0 0 0 0
0 0 0 1 0 0
0 1 0 0 0 0
0 0 0 0 1 0
0 0 1 0 0 0
0 0 0 0 0 1

and P(6, 3)x = [x0, x3, x1, x4, x2, x5]^t.
Suppose now that N = ML. The N-point stride L permutation matrix P(N, L) is defined by
P(N, L)(a ⊗ b) = b ⊗ a,   (2.8)
where a and b are arbitrary vectors of sizes M and L. More generally, if x ∈ C^N, then y = P(N, L)x if and only if
Mat_{M×L}(y) = (Mat_{L×M}(x))^t.
To compute P(N, L)x, we stride through x with stride L. Formula (2.8) will be used repeatedly to derive matrix identities involving stride permutations.
The algebra of stride permutations has an important impact on the design of tensor product algorithms.
Theorem 2.3 If N = MLK, then
P(N, LK) = P(N, L)P(N, K).
Proof Take vectors a ∈ C^M, b ∈ C^L and c ∈ C^K. Then
P(N, LK)(a ⊗ b ⊗ c) = b ⊗ c ⊗ a,
and
P(N, L)P(N, K)(a ⊗ b ⊗ c) = P(N, L)(c ⊗ a ⊗ b) = b ⊗ c ⊗ a,
proving the theorem.
In particular, from Theorem 2.3,
P(NM, M)^−1 = P(NM, N).   (2.9)
Example 2.6 Take N = 4 × 2. Then
P(8, 2)x = [x0, x2, x4, x6, x1, x3, x5, x7]^t
and
P(8, 2)^2 x = [x0, x4, x1, x5, x2, x6, x3, x7]^t,
from which we see that
P(8, 2)^2 = P(8, 4),  P(8, 2)^3 = I8.
In general, an N × N permutation matrix can be given by a permutation of Z/N. Let π be a permutation of Z/N. We represent π using the following notation:
π = (π(0), π(1), …, π(N − 1)).
Define the N × N permutation matrix P(π) by the condition
P(π)x = y,
where y_j = x_{π(j)}, 0 ≤ j < N.
Example 2.7 Take N = 8 and
π = (0, 4, 1, 5, 2, 6, 3, 7).
Then
P(π)x = [x0, x4, x1, x5, x2, x6, x3, x7]^t,
and we see that P(π) = P(8, 4).
Example 2.8 Take N = 12 and
π = (0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11).
Then P(π) = P(12, 3).
Direct computation shows that the mapping
π ↦ P(π)
satisfies
P(π2 π1) = P(π1)P(π2),
P(π^−1) = P(π)^−1.
Example 2.9 Take N = 8, M = 2 and L = 4. Then
π = (0, 2, 4, 6, 1, 3, 5, 7)
and P(π) = P(8, 2).
Example 2.10 Take N = 12, L = 3 and M = 4. Then
π = (0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11)
and P(π) = P(12, 3).
In general, we have the next result.
Theorem 2.4 If N = ML and π is the permutation of Z/N defined by
π(a + bM) = b + aL,  0 ≤ a < M, 0 ≤ b < L,
then P(π) = P(N, L).
Consider the set of N × N permutation matrices
{P(N, L) : L | N}.   (2.10)
We will describe the permutation matrices in this set in terms of the unit group U(N − 1) of Z/(N − 1). The unit group U(N − 1) is given by
U(N − 1) = {0 < T < N − 1 : (T, N − 1) = 1}.
If T ∈ U(N − 1), then multiplication by T mod (N − 1) is a bijection of the set
{0, 1, …, N − 2}.
Define the permutation π_T of Z/N by the two rules
π_T(k) ≡ kT mod (N − 1),  0 ≤ k < N − 1,
π_T(N − 1) = N − 1.
Observe that, if T | N, then (T, N − 1) = 1 and we can define π_T.
Example 2.11 Take N = 12 and T = 4. Then
π_4 = (0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11).
We see that
π_4(a + 3b) = b + 4a,  0 ≤ a < 3, 0 ≤ b < 4,
and P(π_4) = P(12, 4).
Theorem 2.5 If N = ML, then
P(π_L) = P(N, L).
Proof We must show that
π_L(a + bM) = b + aL,  0 ≤ a < M, 0 ≤ b < L.
Since
N − 1 = ML − 1 ≡ 0 mod (N − 1),
we have
π_L(a + bM) ≡ (a + bM)L = aL + bML = aL + b + b(N − 1) ≡ aL + b mod (N − 1),
and the result follows, since 0 ≤ aL + b ≤ N − 1.
Consider the set of N × N permutation matrices
{P(π_T) : T ∈ U(N − 1)}.
This set is group-isomorphic to U(N − 1). In fact,
P(π_R)P(π_S) = P(π_U),  U ≡ RS mod (N − 1),
P(π_R)^−1 = P(π_{R^−1}),  R^−1 taken mod (N − 1).
Theorem 2.6 The set
{P(2^M, 2^m) : 0 ≤ m < M}
is the cyclic group generated by P(2^M, 2). In fact,
P(2^M, 2^m)P(2^M, 2^k) = P(2^M, 2^{m+k}),
where m + k is taken mod M.
Proof Consider integers 0 ≤ m, k < M. If m + k < M, then there is nothing to prove. Suppose that M ≤ m + k. Set l = m + k − M. We have 0 ≤ l < M and l ≡ m + k mod M. Since
2^{m+k} − 2^l = 2^{l+M} − 2^l = 2^l(2^M − 1) ≡ 0 mod (2^M − 1),
it follows that
2^{m+k} ≡ 2^l mod (2^M − 1),
proving the theorem.
More generally, we have the following result, which we state without proof.
Theorem 2.7 If p is a prime, then the set
{P(p^M, p^L) : 0 ≤ L < M}
is a cyclic group of order M generated by P(p^M, p).
It is sometimes useful to represent permutations and general computations by diagrams that give a picture of data flow.
Example 2.12 The permutation P(4, 2) can be represented by a diagram sending the input (x0, x1, x2, x3) to the output (x0, x2, x1, x3).
Example 2.13 The permutations I2 ⊗ P(4, 2) and P(4, 2) ⊗ I2 can be represented by diagrams sending the input (x0, …, x7) to the outputs
(x0, x2, x1, x3, x4, x6, x5, x7) for I2 ⊗ P(4, 2),
(x0, x1, x4, x5, x2, x3, x6, x7) for P(4, 2) ⊗ I2.
Example 2.14 The permutations P(8, 2) and P(8, 4) can be represented by diagrams sending the input (x0, …, x7) to
(x0, x2, x4, x6, x1, x3, x5, x7) for P(8, 2),
(x0, x4, x1, x5, x2, x6, x3, x7) for P(8, 4).
We see from the diagrams of example 2.13 that I2 ⊗ P(4, 2) consists of two parallel copies of P(4, 2). To compute the action of P(4, 2) ⊗ I2, we can first form the vectors
x(0) = [x0, x1]^t,  x(1) = [x2, x3]^t,  x(2) = [x4, x5]^t,  x(3) = [x6, x7]^t,
and then compute the vector operation of P(4, 2) on the four vectors x(0), x(1), x(2), x(3), producing them in the order x(0), x(2), x(1), x(3).
In the preceding section, we discussed how certain tensor product expressions could be viewed as vector operations, parallel operations, or as a combination of vector and parallel operations. An important tool for interchanging the operations in a given algorithm is the commutation theorem.
Theorem 2.8 If A is an M × M matrix and B is an L × L matrix, then
P(N, L)(A ⊗ B) = (B ⊗ A)P(N, L),  N = ML.
Proof Set z = x ⊗ y, where x has size M and y has size L. Then, by definition,
(A ⊗ B)(x ⊗ y) = Ax ⊗ By,
P(N, L)(A ⊗ B)z = By ⊗ Ax.
Arguing in the same way,
(B ⊗ A)P(N, L)z = By ⊗ Ax,
proving the theorem.
Corollary 2.1
P(N, L)(I_M ⊗ B) = (B ⊗ I_M)P(N, L),  N = ML.
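The commutation theorem can likewise be confirmed numerically. The illustrative sketch below (our helper names) checks P(N, L)(A ⊗ B)x = (B ⊗ A)P(N, L)x for a sample input:

```python
def kron(A, B):
    """A ⊗ B from the coefficient formula c_{m+rM, l+sL} = a_{r,s} b_{m,l}."""
    return [[A[r][s] * B[m][l] for s in range(len(A[0])) for l in range(len(B[0]))]
            for r in range(len(A)) for m in range(len(B))]

def matvec(A, x):
    return [sum(row[j] * x[j] for j in range(len(x))) for row in A]

def stride_perm(x, L):
    """Apply P(N, L): stride through x with stride L from offsets 0, ..., L-1."""
    N = len(x)
    return [x[j] for i in range(L) for j in range(i, N, L)]

M, L = 2, 3
A = [[1, 2], [3, 4]]                    # M x M
B = [[1, 0, 1], [0, 2, 0], [1, 1, 1]]   # L x L
x = [1, 0, 2, -1, 3, 1]                 # size N = M * L

lhs = stride_perm(matvec(kron(A, B), x), L)   # P(N, L)(A ⊗ B)x
rhs = matvec(kron(B, A), stride_perm(x, L))   # (B ⊗ A)P(N, L)x
```

Since both sides are linear in x and agree on all tensors a ⊗ b, they agree on every input, which is what the matrix identity asserts.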
As an important application of the commutation theorem, we observe that
A ⊗ B = (A ⊗ I_L)P(N, M)(B ⊗ I_M)P(N, M)^−1,   (2.11)
A ⊗ B = P(N, M)(I_L ⊗ A)P(N, M)^−1(I_M ⊗ B).   (2.12)
Factorization (2.11) decomposes A ⊗ B into a sequence of vector operations; the first operates on vectors of size M while the second operates on vectors of size L. The intervening stride permutations provide a mathematical language for describing the readdressing between stages of the computation. In the same way, we interpret (2.12) as a sequence of parallel operations.
2.4 Multidimensional Tensor Products
Tensor product identities will be used to obtain factorizations of multidimensional tensor products, which then can be applied to implementation problems. The rules of implementation established in this section will have important consequences in the rest of this text. The first application will be to the various Cooley-Tukey FFT algorithms in the next chapter. The stride permutations appearing in these factorizations will make explicit the readdressing needed to carry out computations. We begin with an example. Take positive integers N1, N2 and N3. Set N = N1N2N3. A_N denotes any N × N matrix. The product rule implies that
A_{N1} ⊗ A_{N2} ⊗ A_{N3}   (2.13)
= (A_{N1} ⊗ I_{N2N3})(I_{N1} ⊗ A_{N2} ⊗ I_{N3})(I_{N1N2} ⊗ A_{N3}).
The factor A_{N1} ⊗ I_{N2N3} is a vector operation, while the factor I_{N1N2} ⊗ A_{N3} is a parallel operation. The middle factor, I_{N1} ⊗ A_{N2} ⊗ I_{N3}, is of mixed type, involving N1 copies of the vector operation A_{N2} ⊗ I_{N3}. There are several ways of modifying these factors, using the commutation theorem. Since
A_{N1} ⊗ I_{N2N3} = P(N, N1)(I_{N2N3} ⊗ A_{N1})P(N, N2N3),   (2.14)
I_{N1} ⊗ A_{N2} ⊗ I_{N3} = P(N, N1N2)(I_{N1N3} ⊗ A_{N2})P(N, N3)   (2.15)
and
P(N, N2N3)P(N, N1N2) = P(N, N2),
we can rewrite (2.13) as
A_{N1} ⊗ A_{N2} ⊗ A_{N3} = P(N, N1)(I_{N2N3} ⊗ A_{N1})   (2.16)
× P(N, N2)(I_{N1N3} ⊗ A_{N2})P(N, N3)(I_{N1N2} ⊗ A_{N3}).
A second parallelization comes from replacing the middle factor, using
I_{N1} ⊗ A_{N2} ⊗ I_{N3} = (I_{N1} ⊗ Q)(I_{N1N3} ⊗ A_{N2})(I_{N1} ⊗ Q^−1),   (2.17)
where Q = P(N2N3, N2). These two parallel factorizations differ in data flow. In the first, the readdressing between the computational stages is given by P(N, N3), P(N, N2) and P(N, N1), while in the second the readdressing is given by I_{N1} ⊗ Q^−1, P(N, N2N3)(I_{N1} ⊗ Q) and P(N, N1). Each will have advantages and disadvantages that can be made explicit when implementing on a specific computer.
In general, the permutations that arise from commuting terms in a multidimensional tensor product are built up from products of terms of the form I ⊗ P ⊗ I, where I denotes an identity matrix and P denotes a stride permutation. In particular, I_{N1} ⊗ P(N2N3, N3) is N1 copies of the permutation P(N2N3, N3). As such, it performs a stride permutation on N1 segments of the input vector beginning at different offsets. It can be implemented as a loop of stride permutations, where the same permutation is performed, but the initial offset is incremented by N2N3 at each iteration. The second type of permutation can be thought of as permuting blocks of the input vector. Thus, P(N2N3, N3) ⊗ I_{N1} permutes segments of length N1 at stride N3. This can be implemented by loading blocks of N1 consecutive elements, beginning at offsets given by the permutation P(N2N3, N3).
If M = N1 = N2 = N3 and A, B and C are M x M matrices, the factorization becomes

A ⊗ B ⊗ C = P(I_{M^2} ⊗ A)P(I_{M^2} ⊗ B)P(I_{M^2} ⊗ C),

where P = P(M^3, M). In this case, the readdressing between each of the stages of the computation is the same and given by P.
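This uniform-readdressing factorization is easy to verify numerically; the sketch below (our own helper names, NumPy assumed) checks it for random 2 x 2 matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

def P(N, S):
    """Matrix of the stride permutation P(N, S): P @ x gathers x at stride S."""
    return np.eye(N, dtype=int)[np.arange(N).reshape(N // S, S).T.ravel()]

M = 2
A, B, C = (rng.integers(-3, 4, (M, M)) for _ in range(3))
I4 = np.eye(M * M, dtype=int)
Pm = P(M**3, M)

lhs = np.kron(np.kron(A, B), C)
rhs = Pm @ np.kron(I4, A) @ Pm @ np.kron(I4, B) @ Pm @ np.kron(I4, C)
assert np.array_equal(lhs, rhs)
```

Each stage is a parallel operation I_{M^2} ⊗ X followed by the same stride permutation P.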
Factorizations of the permutations occurring in computing terms in multidimensional tensor products offer programming options that can be used to match the algorithm computing the action of these multidimensional tensor products to the specific machine architecture. Depending on machine parameters such as maximal vector length, minimal vector length, number of processors and communication network, a full parallelization or vectorization may not be desired. The rules established can be modified to partially parallelize or vectorize. The next result describes a factorization that is especially useful.
Theorem 2.9 If N = N1N2N3, then

P(N, N3) = (P(N1N3, N3) ⊗ I_{N2})(I_{N1} ⊗ P(N2N3, N3)).
Proof Take a ∈ C^{N1}, b ∈ C^{N2} and c ∈ C^{N3}. Then

(P(N1N3, N3) ⊗ I_{N2})(I_{N1} ⊗ P(N2N3, N3))(a ⊗ b ⊗ c)
42 2. Tensor Product and Stride Permutation
= (P(N1N3, N3) ⊗ I_{N2})(a ⊗ c ⊗ b)
= c ⊗ a ⊗ b
= P(N, N3)(a ⊗ b ⊗ c),
proving the theorem.
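Theorem 2.9 can be checked numerically, with the two factors applied exactly as the implementation discussion suggests: a loop of local stride permutations followed by a global block permutation (helper names are ours):

```python
import numpy as np

def stride_perm(x, S):
    """P(N, S)x for N = len(x)."""
    N = len(x)
    return x.reshape(N // S, S).T.ravel()

N1, N2, N3 = 2, 3, 4
N = N1 * N2 * N3
x = np.arange(N)

lhs = stride_perm(x, N3)                      # P(N, N3)x

# I_{N1} (x) P(N2N3, N3): local stride permutations on N1 segments
y = np.concatenate([stride_perm(x[i*N2*N3:(i+1)*N2*N3], N3)
                    for i in range(N1)])

# P(N1N3, N3) (x) I_{N2}: a global permutation of length-N2 blocks
order = np.arange(N1 * N3).reshape(N1, N3).T.ravel()
rhs = y.reshape(N1 * N3, N2)[order].ravel()

assert np.array_equal(lhs, rhs)
```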
Let N = N1N2 ⋯ NK. For N x N matrices X1, ..., XK,

∏_{k=1}^{K} Xk = X1X2 ⋯ XK

is the sequence of matrix actions beginning with XK and ending with X1. Denote an arbitrary Nk x Nk matrix by A_{Nk}. Set N(k) = N1N2 ⋯ Nk and N(0) = 1. The product rule implies that
A_{N1} ⊗ ⋯ ⊗ A_{NK} = ∏_{k=1}^{K} I_{N(k-1)} ⊗ A_{Nk} ⊗ I_{N/N(k)}.   (2.18)
Set P_k = P(N, N(k)). Since

I_{N(k-1)} ⊗ A_{Nk} ⊗ I_{N/N(k)} = P_k(I_{N/Nk} ⊗ A_{Nk})P_k^{-1},   (2.19)
we can parallelize (2.18) by the factorization

A_{N1} ⊗ ⋯ ⊗ A_{NK} = ∏_{k=1}^{K} P_k(I_{N/Nk} ⊗ A_{Nk})P_k^{-1}.   (2.20)
The description of the intervening permutations can be simplified by combining permutations. Since

P_k^{-1} P_{k+1} = P(N, N/N(k))P(N, N(k+1)) = P(N, N_{k+1}),
we have the next result.
Theorem 2.10 (Parallel I)

A_{N1} ⊗ ⋯ ⊗ A_{NK} = ∏_{k=1}^{K} P(N, Nk)(I_{N/Nk} ⊗ A_{Nk}).
As in the above example, if N1 = N2 = ⋯ = NK, then the readdressing between computational stages is exactly the same. A second parallelization can be obtained from the identity
I_{N(k-1)} ⊗ A_{Nk} ⊗ I_{N/N(k)} = Q_k(I_{N/Nk} ⊗ A_{Nk})Q_k^{-1},   (2.21)

where Q_k = I_{N(k-1)} ⊗ P(N/N(k-1), Nk). This leads to the next result,

A_{N1} ⊗ ⋯ ⊗ A_{NK} = ∏_{k=1}^{K} Q_k(I_{N/Nk} ⊗ A_{Nk})Q_k^{-1},
where

Q_k = I_{N(k-1)} ⊗ P(N/N(k-1), Nk).
Combining permutations, we have the next result.
Theorem 2.11 (Parallel II)

A_{N1} ⊗ A_{N2} ⊗ ⋯ ⊗ A_{NK} = ∏_{k=1}^{K} Q_{k-1}^{-1}Q_k(I_{N/Nk} ⊗ A_{Nk}),

where Q0 = I.
The permutation Q_k^{-1}Q_{k+1} has the interpretation that it maps the multidimensional tensor product

a^{N1} ⊗ ⋯ ⊗ a^{Nk} ⊗ a^{Nk+1} ⊗ ⋯ ⊗ a^{NK}

into

a^{N1} ⊗ ⋯ ⊗ a^{Nk-1} ⊗ a^{Nk+1} ⊗ a^{Nk} ⊗ a^{Nk+2} ⊗ ⋯ ⊗ a^{NK},

interchanging the k-th and (k+1)-th positions. Since
I_{N/Nk} ⊗ A_{Nk} = P(N, N/Nk)(A_{Nk} ⊗ I_{N/Nk})P(N, N/Nk)^{-1},
we obtain the vector factorization analogs of the preceding two theorems.
Theorem 2.12 (Vector I)

A_{N1} ⊗ ⋯ ⊗ A_{NK} = ∏_{k=1}^{K} (A_{Nk} ⊗ I_{N/Nk})P(N, Nk).
Theorem 2.13 (Vector II)

A_{N1} ⊗ ⋯ ⊗ A_{NK} = ∏_{k=1}^{K} (A_{Nk} ⊗ I_{N/Nk})R_k,

where R_k = P(N, Nk)Q_k^{-1}Q_{k+1}P(N, N/N_{k+1}).
If A_{Nk}, 1 ≤ k ≤ K, is symmetric, then applying the transpose operation to both sides of the factorizations in the preceding theorems produces additional factorizations which, although similar in form, can have significant data flow differences. We will use
(A ⊗ B)^t = A^t ⊗ B^t,
P^t = P^{-1}, P a permutation matrix.
Theorem 2.14 (Parallel III) If A_{Nk}, 1 ≤ k ≤ K, is symmetric, then

A_{N1} ⊗ A_{N2} ⊗ ⋯ ⊗ A_{NK} =
(I_{N/NK} ⊗ A_{NK})P(N, N/NK) ⋯ (I_{N/N1} ⊗ A_{N1})P(N, N/N1).
Theorem 2.15 (Parallel IV) If A_{Nk}, 1 ≤ k ≤ K, is symmetric, then

A_{N1} ⊗ A_{N2} ⊗ ⋯ ⊗ A_{NK} =
(I_{N/NK} ⊗ A_{NK})Q_K^{-1}Q_{K-1} ⋯ (I_{N/N2} ⊗ A_{N2})Q_2^{-1}Q_1(I_{N/N1} ⊗ A_{N1})Q_1^{-1}.
In Parallel I, a typical stage

P(N, Nk)(I_{N/Nk} ⊗ A_{Nk})

can be implemented by the parallel operation

I_{N/Nk} ⊗ A_{Nk}

followed by the stride permutation P(N, Nk), while in Parallel III the parallel operation

I_{N/Nk} ⊗ A_{Nk}

follows the stride permutation P(N, N/Nk). If there are many small factors Nk, 1 ≤ k ≤ K, in the factorization of N, these two parallel forms can be distinguished by the small strides in the first as compared with the large strides in the second.
2.5 Vector Implementation
In this section, tensor product identities will be used to design algorithms computing tensor product operations on a sample vector processor. Our model of a vector processor includes the main memory, vector registers and a communication network between the main memory and vector registers, which will be described in detail as required. Vector operations are performed on vectors located in vector registers. Some standard vector operations are vector addition, subtraction, scalar-vector and vector multiplications. To take advantage of the high-speed computational rate offered by vector operations, it is essential to keep memory transfers to a minimum and to perform vector operations on vectors residing in vector registers as much as possible. Also, the transfer of data between the main memory and vector registers, on many processors, is especially suited to implementing the stride permutations arising from tensor product operations.
There are several key machine parameters that must be kept in mind when designing algorithms. First, vector registers have a maximum size,
which limits the size of vectors that can be used in vector instructions. Also, due to 'start-up costs', there is usually a lower bound on the size of vectors that can efficiently be operated on by vector operations. If a computation requires operations on larger vectors than allowed, then the computation must be segmented and several vector instructions combined to perform the computation. The language of tensor products is ideally suited to design algorithms that satisfy this key design parameter. Memory transfer can also be performed with vector operations. These vector operations correspond to stride permutations. A vector of elements in the main memory can be loaded into a vector register with the following instruction:

VI ← X, L

The vector of elements in memory having the initial address X is loaded into the vector register VI at stride L. A special register called the vector length register VL determines the number of elements that are loaded. For example, if
X = (x0, x1, x2, x3, x4, x5)^t, VL = 3,
then
V0 ← X, 2
loads the vector register V0 with the elements of X beginning at x0 with stride 2,

V0 = (x0, x2, x4)^t.
The result of the load instruction

V1 ← X + 1, 2

is

V1 = (x1, x3, x5)^t.
The memory transfer operation that takes a vector in memory of size MN and fills, at stride N, N vector registers with vectors of size M will be
denoted by L_N^{MN}. Thus,

L_2^6 ~ P(6, 2): the memory vector (x0, x1, x2, x3, x4, x5)^t fills
V0 = (x0, x2, x4)^t and V1 = (x1, x3, x5)^t.
The contents of a vector register can be stored into the main memory with the instruction

Y, L ← VK

The contents of the vector register VK are stored in memory having the initial address Y at stride L. For example, if
V0 = (x0, x1)^t,

then the result of the vector instruction

Y, 3 ← V0

is y0 = x0, y3 = x1.
If

V1 = (x2, x3)^t, V2 = (x4, x5)^t,

then the result of the sequence of store instructions

Y, 3 ← V0
Y + 1, 3 ← V1
Y + 2, 3 ← V2

is the sequence of stores

Y = (x0, ·, ·, x1, ·, ·)^t,
Y = (x0, x2, ·, x1, x3, ·)^t,
Y = (x0, x2, x4, x1, x3, x5)^t.
The memory transfer operation that takes the contents of N vector registers with vector size M and stores them in memory with stride N will be denoted by (L_N^{MN})^{-1}:
(L_3^6)^{-1} ~ P(6, 2): the register contents V0 = (x0, x1)^t, V1 = (x2, x3)^t, V2 = (x4, x5)^t are stored as the memory vector (x0, x2, x4, x1, x3, x5)^t.
A load-stride followed by a store-stride can carry out a stride permutation. The stride permutation P(6, 2) can be implemented with the following sequence of operations. Take VL = 2:

V0 ← X, 1
V1 ← X + 2, 1     (load at stride 1)
V2 ← X + 4, 1

Y, 3 ← V0
Y + 1, 3 ← V1     (store at stride 3)
Y + 2, 3 ← V2
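The load/store sequence above can be simulated with array slicing (NumPy arrays standing in for the vector registers; the variable names follow the text):

```python
import numpy as np

X = np.arange(6)
VL = 2
V0, V1, V2 = X[0:VL], X[2:2 + VL], X[4:4 + VL]   # loads at stride 1
Y = np.empty_like(X)
Y[0::3], Y[1::3], Y[2::3] = V0, V1, V2           # stores at stride 3
# Y is now P(6, 2)X = (x0, x2, x4, x1, x3, x5)
```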
Tensor product operations of the form A ⊗ I_N can be implemented directly with vector instructions as long as N is less than or equal to the maximum vector register length. For example, if

y = (A ⊗ I3)x,

where

A = [ 1  1 ]
    [ 1 -1 ],

then

y = (x0 + x3, x1 + x4, x2 + x5, x0 - x3, x1 - x4, x2 - x5)^t.
If

V0 = (x0, x1, x2)^t, V1 = (x3, x4, x5)^t,

then y is computed by the vector instructions

V2 ← V0 + V1
V3 ← V0 - V1
The first instruction is the vector addition of V0 and V1 placed in the vector register V2. The vector y in memory is obtained by storing V2 followed by V3 back in memory. If Y is the location of the output vector, we obtain this with the following instructions:

Y, 1 ← V2
Y + 3, 1 ← V3
The computation y = (A ⊗ I3)P(6, 2)x offers a more complicated example. The first step is to load x into two vector registers at stride 2,

V0 = (x0, x2, x4)^t, V1 = (x1, x3, x5)^t.

The next step is

V2 = V0 + V1 = (x0 + x1, x2 + x3, x4 + x5)^t,
V3 = V0 - V1 = (x0 - x1, x2 - x3, x4 - x5)^t.

Finally, the contents of V2 are stored at stride 1 beginning at Y and the contents of V3 are stored at stride 1 beginning at Y + 3. In effect, the stride permutation P(6, 2) is implemented for free since, in order to program A ⊗ I3 as a vector operation, we must load the input vectors and store the results even in the absence of an input permutation.
The operation y = P(6, 3)(A ⊗ I3)x can be performed in the same way.
Implementing a tensor product operation becomes significantly more difficult if segmentation is required, i.e., vector operations are required on vectors that do not fit inside vector registers. To be concrete, assume that the maximum size of vector registers is 64, and we would like to evaluate A ⊗ I_{128}, which acts naturally on vectors of size 128. Since the maximum size of the vector registers is 64, we would like to replace A ⊗ I_{128} by vector operations on vectors of size 64. Since

A ⊗ I_{128} = P(256, 128)(I2 ⊗ A ⊗ I_{64})P(256, 2),

the computation of A ⊗ I_{128} is equivalent to I2 ⊗ A ⊗ I_{64} up to input and output permutations. Two copies of a vector instruction on vectors of size 64 are required. But P(256, 2) naturally forms vectors of size 128,

(x0, x2, ..., x254)^t and (x1, x3, ..., x255)^t.

This problem can be solved by the factorization

P(256, 2) = (P(4, 2) ⊗ I_{64})(I2 ⊗ P(128, 2)).
The first factor, I2 ⊗ P(128, 2), decomposes the input vector of size 256 into two consecutive segments of size 128, and performs the stride permutation P(128, 2) on each of the segments:

V0 = (x0, x2, ..., x126)^t, V1 = (x1, x3, ..., x127)^t,
V2 = (x128, x130, ..., x254)^t, V3 = (x129, x131, ..., x255)^t.
Setting VL = 64, these vectors can be loaded by the instructions

V0 ← X, 2
V1 ← X + 1, 2
V2 ← X + 128, 2
V3 ← X + 129, 2
The second factor, P(4, 2) ⊗ I_{64}, can be thought of as a permutation of the segments, giving the order

V0, V2, V1, V3.
These two steps can be combined by carrying out the load-strides as before, and by changing the initial offsets of the load-strides to perform the permutation of the segments.
The vector operation A ⊗ I_{64} is performed on (V0, V2) and on (V1, V3):

V4 = V0 + V2
V5 = V0 - V2
V6 = V1 + V3
V7 = V1 - V3
To complete the computation, the vectors V4, V5, V6 and V7 must be stored back in the memory in the order given by P(256, 128). This can be done by first permuting the segments to the order

V4, V6, V5, V7

and then storing the results with the store instructions

Y, 2 ← V4
Y + 1, 2 ← V6
Y + 128, 2 ← V5
Y + 129, 2 ← V7
This corresponds to the factorization

P(256, 128) = (I2 ⊗ P(128, 64))(P(4, 2) ⊗ I_{64}).
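As a check on this data flow, the whole segmented computation can be simulated with NumPy slices standing in for the 64-element registers (register names follow the text; the final assertion compares against a direct computation of A ⊗ I_{128}):

```python
import numpy as np

A = np.array([[1, 1], [1, -1]])
x = np.arange(256)

# P(256, 2) via (P(4,2) (x) I64)(I2 (x) P(128, 2)): stride-2 loads per half
V0, V1 = x[0:128:2], x[1:128:2]
V2, V3 = x[128:256:2], x[129:256:2]

# A (x) I64 on the segment pairs (V0, V2) and (V1, V3)
V4, V5 = V0 + V2, V0 - V2
V6, V7 = V1 + V3, V1 - V3

# store back in the order given by P(256, 128): interleaved stride-2 stores
y = np.empty(256, dtype=x.dtype)
y[0:128:2], y[1:128:2] = V4, V6
y[128:256:2], y[129:256:2] = V5, V7

assert np.array_equal(y, np.kron(A, np.eye(128, dtype=int)) @ x)
```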
2.6 Parallel Implementation
Tensor product identities provide powerful tools for matching tensor product factor computations to specific machine characteristics such as locality and granularity. Consider the tensor product factor IN ⊗ A, where A is taken as in section 2.5. In the simplest case, N separate processors are available for the computation, and each processor has access to a shared memory containing the input vector X and the output vector Y. (In this section, we will use capital letters as variable names of the data used in codes.) Number the processors

0, 1, ..., N - 1.
Define the action A by
Y(0) = X(0) + X(1),
Y(1) = X(0) — X(1).
Assign to each processor the code
A(2, Y, X).
The n-th processor acts by this code on the components X(2n), X(2n + 1) of the input vector X and places the results in memory as the components Y(2n), Y(2n + 1) of the output vector Y. In the same fashion, A ⊗ IN is computed by having the n-th processor act on the components X(n), X(n + N). The results are placed in memory as the components Y(n), Y(n + N) of the output vector Y.
If the number M of processors is less than N, then the problem is more complicated. Suppose that N = ML. Using the identity

IN ⊗ A = IM ⊗ (IL ⊗ A),

we assign the code IL ⊗ A to each processor to perform the computation as above with M replacing N in the discussion. In the same way, the identity
A ⊗ IN = P(2N, 2L)(IM ⊗ (A ⊗ IL))P(2N, M)

suggests that each processor be assigned the code for A ⊗ IL with addressing determined by the input and output stride permutations. In particular, the m-th processor, 0 ≤ m < M, performs A ⊗ IL on the 2L components

X(m), X(m + M), ..., X(m + (2L - 1)M),

and places the result in memory as the 2L components

Y(m), Y(m + M), ..., Y(m + (2L - 1)M).
Alternatively, we can use the identity

A ⊗ IN = (P(2M, 2) ⊗ IL)(IM ⊗ A ⊗ IL)(P(2M, M) ⊗ IL).
Consider the factor IM ⊗ A ⊗ IN. As above, we can implement the action by M parallel computations of A ⊗ IN. If MN processors are available, we can use the identity

IM ⊗ A ⊗ IN = P(2MN, 2M)(I_{MN} ⊗ A)P(2MN, N)

or the identity

IM ⊗ A ⊗ IN = (IM ⊗ P(2N, 2))(I_{MN} ⊗ A)(IM ⊗ P(2N, N))

to compute IM ⊗ A ⊗ IN as MN parallel computations of A. In this way, we naturally control the granularity of the parallel computation and fit the computation to the number of available processors. The stride permutations give an automatic addressing to the processors.
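Both identities for IM ⊗ A ⊗ IN can be confirmed numerically; a small sketch (our helper P, NumPy assumed):

```python
import numpy as np

def P(N, S):
    """Matrix of the stride permutation P(N, S)."""
    return np.eye(N, dtype=int)[np.arange(N).reshape(N // S, S).T.ravel()]

A = np.array([[1, 1], [1, -1]])
M, N = 3, 4
I = lambda n: np.eye(n, dtype=int)

lhs = np.kron(I(M), np.kron(A, I(N)))
rhs1 = P(2*M*N, 2*M) @ np.kron(I(M*N), A) @ P(2*M*N, N)
rhs2 = np.kron(I(M), P(2*N, 2)) @ np.kron(I(M*N), A) @ np.kron(I(M), P(2*N, N))

assert np.array_equal(lhs, rhs1)
assert np.array_equal(lhs, rhs2)
```

The second form keeps all the readdressing local to each of the M processor groups, while the first uses two global stride permutations.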
These ideas can be used to compute the tensor product of, say, T factors of A in parallel. By the fundamental factorization,

A ⊗ ⋯ ⊗ A = ∏_{t=1}^{T} (I_{2^{T-t}} ⊗ A ⊗ I_{2^{t-1}}),

we decompose the computation into a sequence of computations which, at the t-th stage, is given by

I_{2^{T-t}} ⊗ A ⊗ I_{2^{t-1}}.
To carry out the computation in this way requires a barrier synchronization to guarantee that the input to the next stage is correct. The natural interpretation of each stage leads to a different degree of parallelism at each stage. The factorization must be modified by different addressing, and hence different programming at each stage is required to get a consistent degree of parallelism. We turn to the factorization,

A ⊗ ⋯ ⊗ A = ∏_{t=1}^{T} P(2^T, 2)(I_{2^{T-1}} ⊗ A),

given in section 2.4. The addressing is the same at each stage, and the natural interpretation has the maximal degree of parallelism at each stage. For example,
A ⊗ A ⊗ A = ∏_{t=1}^{3} P(8, 2)(I4 ⊗ A),

and at each stage of the computation we compute

y = P(8, 2)(I4 ⊗ A)x.   (2.22)
We compute this as

Y(0) = X(0) + X(1),   Y(4) = X(0) - X(1),
Y(1) = X(2) + X(3),   Y(5) = X(2) - X(3),
Y(2) = X(4) + X(5),   Y(6) = X(4) - X(5),
Y(3) = X(6) + X(7),   Y(7) = X(6) - X(7).   (2.23)
If four processors are available, the m-th processor, 0 ≤ m < 4, computes

Y(m) = X(2m) + X(2m + 1),
Y(m + 4) = X(2m) - X(2m + 1).
Suppose that we have two parallel processors. Then we rewrite (2.22) as

y = P(8, 2)(I2 ⊗ (I2 ⊗ A))x

and compute the first four lines of (2.23) on the 0-th processor and the second four lines on the first processor. Thus, on the m-th processor, 0 ≤ m < 2, we compute,

for n = 0, 1,
Y(2m + n) = X(4m + 2n) + X(4m + 2n + 1),
Y(2m + n + 4) = X(4m + 2n) - X(4m + 2n + 1).
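The processor-level loops can be simulated directly and checked against the matrix form of (2.22); here the two "processors" are simply the two iterations of the outer loop:

```python
import numpy as np

A = np.array([[1, 1], [1, -1]])
X = np.arange(8)
Y = np.empty(8, dtype=X.dtype)

# two "processors" m = 0, 1, each computing four lines of (2.23)
for m in range(2):
    for n in range(2):
        Y[2*m + n] = X[4*m + 2*n] + X[4*m + 2*n + 1]
        Y[2*m + n + 4] = X[4*m + 2*n] - X[4*m + 2*n + 1]

# agrees with y = P(8, 2)(I4 (x) A)x
z = np.kron(np.eye(4, dtype=int), A) @ X
assert np.array_equal(Y, z.reshape(4, 2).T.ravel())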
Suppose that we wish to compute

⊗_{t=1}^{10} A = ∏_{t=1}^{10} P(2^{10}, 2)(I_{2^9} ⊗ A)

with eight processors. Writing

P(2^{10}, 2)(I_{2^9} ⊗ A) = P(2^{10}, 2)(I_{2^3} ⊗ (I_{2^6} ⊗ A)),

at each stage the m-th processor, 0 ≤ m < 8, computes,

for n = 0, ..., 63,
Y(2^6 m + n) = X(2^7 m + 2n) + X(2^7 m + 2n + 1),
Y(2^6 m + n + 2^9) = X(2^7 m + 2n) - X(2^7 m + 2n + 1).
In this example, ten passes are required. After each computational stage, the results are stored back to the main (shared) memory. It may be advantageous to do more computations before doing the memory operation. We can do this by the factorization

⊗_{t=1}^{10} A = ∏_{t=1}^{5} P(2^{10}, 2^2)(I_{2^8} ⊗ A ⊗ A).

There are only five computational stages, which reduces transfers to the main memory. However, the granularity has been increased, since A has been replaced by A ⊗ A.
In section 2.4, we discussed the importance of stride permutation factorizations in implementation. For example,

P(2^{10}, 2) = (P(2^4, 2) ⊗ I_{2^6})(I_{2^3} ⊗ P(2^7, 2)).

In the case of eight processors, I_{2^3} ⊗ P(2^7, 2) is carried out by permuting elements in local memory for each of the processors by P(2^7, 2). The results then can be transferred to the main memory in segments of length 2^6 permuted by P(2^4, 2). In this way, the transfer to the main memory given by P(2^{10}, 2) is decomposed into a collection of local permutations followed by a global block permutation.
References
[1] Johnson, J., Johnson, R., Rodriguez, D. and Tolimieri, R. "A Methodology for Designing, Modifying, and Implementing Fourier Transform Algorithms on Various Architectures," IEEE Trans. Circuits Sys., 9(4), 1990.
[2] Hoffman, K. and Kunze, R. Linear Algebra, Second Ed., Prentice-Hall, 1971.
[3] Nering, E. D. Linear Algebra and Matrix Theory, John Wiley & Sons, 1970.
Problems
1. Show that the tensor product of vectors is bilinear.
2. Show that the tensor product of matrices is bilinear.
3. Compute (A ⊗ B)(a ⊗ b) for

A = [ 1 2 ]      B = [ 0 1 0 ]
    [ 0 1 ],         [ 2 1 1 ]
                     [ 1 0 1 ],

a = [ 1 ]        b = [ 2 ]
    [ 0 ],           [ 3 ]
                     [ 1 ].
4. For vectors a and b running over all vectors of sizes 2 and 3, respectively, show that the tensor products a 0 b span C6.
5. Show the general result: For vectors a and b running over all vectors of sizes L and M, respectively, the tensor products a 0 b span CLm.
6. The canonical basis of C^L is the set of vectors e_0^{(L)}, e_1^{(L)}, ..., e_{L-1}^{(L)}, where e_m^{(L)} is the vector of size L with 1 in the m-th position and 0 elsewhere. Show that the LM tensor products

e_r^{(L)} ⊗ e_s^{(M)}, 0 ≤ r < L, 0 ≤ s < M,

describe the canonical basis of C^{LM}. (Explicitly derive e_r^{(L)} ⊗ e_s^{(M)}.)
7. Describe P(27, 3) and P(27, 9), and show that

P(27, 3)P(27, 9) = I_27.
8. Compute the matrix product P(12, 2)P(12, 3).
9. Show that the set {P(81, 3^s) : 0 ≤ s < 4} is a cyclic group. List the generators.
10. Compute P(8, 2)(A ⊗ B)P(8, 4), where

A = [ 0 1 0 0 ]      B = [ 1  1 ]
    [ 0 0 1 0 ]          [ 1 -1 ],
    [ 1 0 0 0 ]
    [ 0 0 0 1 ],

and show that it is equal to B ⊗ A.
3 Cooley-Tukey FFT Algorithms
3.1 Introduction
In the following two chapters, we will concentrate on algorithms for computing the Fourier transform (FT) of a size that is a composite number N. The main idea is to use the additive structure of the indexing set Z/N to define mappings of input and output data vectors into two-dimensional arrays. Algorithms are then designed, transforming two-dimensional arrays which, when combined with these input/output mappings, compute the N-point FT. The stride permutations of chapter 2 play a major role.
The first additive fast Fourier transform (FFT) algorithm is described in the fundamental work of J. W. Cooley and J. W. Tukey [2] in 1965. Straightforward computation of the N-point FT requires a number of arithmetic operations proportional to N^2. In scientific and technological applications, the transform size N is commonly too large for direct digital computer implementation. The Cooley-Tukey FFT algorithm significantly reduces the computational cost for many transform sizes N to an operation count proportional to N log N. This result set the stage for widespread advances in digital hardware, and is one of the main reasons that digital computation has become the overwhelmingly preferred method for computing the FT in most scientific and technological applications.
The years following publication of the Cooley-Tukey FFT saw various implementations of the algorithm on sequential machines [1]. Recently, however, as vector and parallel computer architectures began to play increasingly important roles in scientific computations, the adaptation of the
Cooley-Tukey FFT and its variants to these new architectures has become a major research effort. The tensor product provides the key mathematical language in which to describe and analyze, in a unified format, similarities and differences among these algorithms. An account of these variants not using this language can be found in [8]. In 1968, M. Pease [5] utilized the language of tensor products to formulate a variant of the algorithm that is suitable for implementation on a special purpose parallel computer. In 1983, C. Temperton [9] provided tensor product formulations of the most commonly known variants.
One of the advantages of using tensor product language to describe FT algorithms is that this mathematical language may be used as an analytic tool for the study of algorithmic structures for machine hardware and software implementations as well as for the identification of new algorithms. For instance, an inherent part of the study of computer implementation of FT algorithms is the analysis of the data communication aspects of the algorithms that manifest themselves during implementation procedures. These data communication aspects can be best studied, in turn, through the analysis of the permutation matrices, the stride permutations, which appear in our tensor product formulation of the FT algorithms.
We present, in tensor product form, the description of FT algorithms with the following objective in mind: to provide the user of these algorithms with guidelines that will enable him to effectively study their implementation on either special purpose or general purpose computers. By "effectively studying their implementation," we mean being able to produce algorithms that best conform to the inherent constraints of any given machine hardware architecture.
In this chapter, we consider the Cooley-Tukey FFT algorithm corresponding to the decomposition of the transform size N into the product of two factors. The convention introduced in chapter 2 relating two-dimensional arrays to one-dimensional arrays will still be enforced: if X is an M x L matrix, then we associate to X the ML-tuple x formed by reading, in order, down the columns of X. In the sections that follow, algorithms will be designed by using the additive structure of the indexing set to associate a two-dimensional array to a one-dimensional array.
3.2 Basic Properties of FT Matrix
The FT matrix of order N, denoted by F(N), is defined as

F(N) = [w^{jk}], 0 ≤ j, k < N, w = exp(2πi/N), i = √-1.
The conjugate of w, denoted by w*, is

w* = exp(-2πi/N) = w^{N-1},
and (w^k)* = w^{-k mod N} = w^{N-k}.
Direct computation shows that
F(N)F(N)* = N I_N.
The inverse FT matrix is

F(N)^{-1} = N^{-1} F(N)*,
and F(N) is symmetric, i.e.,
F(N)^t = F(N).
3.3 An Example of an FT Algorithm
The eight-point FT is given by the formula

y_l = Σ_{k=0}^{7} w^{lk} x_k, 0 ≤ l < 8, w = exp(2πi/8).   (3.1)
Associate to the input vector x the 4 x 2 array

X = [ x0 x4 ]
    [ x1 x5 ]
    [ x2 x6 ]
    [ x3 x7 ]

and set

X1 = X^t = [ x0 x1 x2 x3 ]
           [ x4 x5 x6 x7 ].
The vector x1 corresponding to X1 can be obtained by

x1 = P(8, 4)x,

where P(8, 4) is the eight-point stride-4 permutation. Associate to the output vector y the 2 x 4 array

Y = [ y0 y2 y4 y6 ]
    [ y1 y3 y5 y7 ].
We will rewrite (3.1) in terms of the arrays X1 and Y. First,

X1(k1, k2) = x(k2 + 4k1), 0 ≤ k1 < 2, 0 ≤ k2 < 4,
Y(l1, l2) = y(l1 + 2l2), 0 ≤ l1 < 2, 0 ≤ l2 < 4.
Placing these formulas into (3.1), we have

Y(l1, l2) = Σ_{k2=0}^{3} [ Σ_{k1=0}^{1} X1(k1, k2) w^{(k2+4k1)(l1+2l2)} ].   (3.2)
Set v = w^2 = i and u = w^4 = -1. Then

w^{(k2+4k1)(l1+2l2)} = u^{k1 l1} v^{k2 l2} w^{k2 l1}.
We can rewrite (3.2) as

Y(l1, l2) = Σ_{k2=0}^{3} [ Σ_{k1=0}^{1} X1(k1, k2) u^{k1 l1} ] w^{k2 l1} v^{k2 l2}.   (3.3)
We can decompose (3.3) into a sequence of operations as follows. First we compute the inner sum

Y1(l1, k2) = Σ_{k1=0}^{1} X1(k1, k2) u^{k1 l1}, 0 ≤ l1 < 2, 0 ≤ k2 < 4.   (3.4)
We see from (3.4) that the array Y1 is computed by taking the two-point FT of each column of the array X1. In tensor product notation,

y1 = (I4 ⊗ F(2))x1,

where y1 is the vector corresponding to Y1. The next stage of the computation,
Y2(l1, k2) = Y1(l1, k2) w^{k2 l1},   (3.5)

introduces the twiddle factor. In matrix notation,

y2 = T y1,

where y2 is the vector corresponding to Y2 and T is the diagonal matrix

T = diag(1, 1, 1, w, 1, w^2, 1, w^3).
We complete the computation of Y from (3.5) by

Y(l1, l2) = Σ_{k2=0}^{3} Y2(l1, k2) v^{k2 l2}, 0 ≤ l1 < 2, 0 ≤ l2 < 4,

which is given by the four-point FT of each row of the array Y2. In tensor product notation, this is written

y = (F(4) ⊗ I2)y2.
Combining these formulas, we have

y = (F(4) ⊗ I2) T (I4 ⊗ F(2)) P(8, 4) x.
This leads to the factorization

F(8) = (F(4) ⊗ I2) T (I4 ⊗ F(2)) P(8, 4).
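This eight-point factorization can be verified numerically (the helpers F and P are ours; note the book's FT matrix uses w = exp(+2πi/N)):

```python
import numpy as np

def F(N):
    """FT matrix [w^{jk}] with w = exp(2*pi*i/N), as in section 3.2."""
    k = np.arange(N)
    return np.exp(2j * np.pi * np.outer(k, k) / N)

def P(N, S):
    """Matrix of the stride permutation P(N, S)."""
    return np.eye(N)[np.arange(N).reshape(N // S, S).T.ravel()]

w = np.exp(2j * np.pi / 8)
T = np.diag([1, 1, 1, w, 1, w**2, 1, w**3])
F8 = np.kron(F(4), np.eye(2)) @ T @ np.kron(np.eye(4), F(2)) @ P(8, 4)
assert np.allclose(F8, F(8))
```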
3.4 Cooley-Tukey FFT for N = 2M
The N-point FT is given by the formula

y_l = Σ_{k=0}^{N-1} w^{lk} x_k, 0 ≤ l < N, w = exp(2πi/N).   (3.6)
Associate to the N-point input data x the M x 2 array

X = [ x0      xM   ]
    [ x1      xM+1 ]
    [ ...     ...  ]
    [ xM-1    xN-1 ]

and set

X1 = X^t = [ x0  x1    ... xM-1 ]
           [ xM  xM+1  ... xN-1 ].
The corresponding N-tuple x1 is given by

x1 = P(N, M)x.
Associate to the N-point output data y the 2 x M array

Y = [ y0 y2 ... yN-2 ]
    [ y1 y3 ... yN-1 ].
We can rewrite (3.6) in terms of the two-dimensional arrays X1 and Y. First,

X1(k1, k2) = x(k2 + Mk1), 0 ≤ k1 < 2, 0 ≤ k2 < M,   (3.7)

and

Y(l1, l2) = y(l1 + 2l2), 0 ≤ l1 < 2, 0 ≤ l2 < M.   (3.8)
Using (3.7) and (3.8), we have

Y(l1, l2) = Σ_{k2=0}^{M-1} ( Σ_{k1=0}^{1} X1(k1, k2) u^{k1 l1} ) w^{k2 l1} v^{k2 l2},   (3.9)
where v = w^2, u = w^M = -1 and w^N = 1. The inner sum,

Y1(l1, k2) = Σ_{k1=0}^{1} X1(k1, k2) u^{k1 l1},

computes, for each 0 ≤ k2 < M, the two-point FT of the corresponding column of X1. Let x1 be the vector associated with the two-dimensional array X1. To compute (3.9), we partition x1 into M vectors each of length 2 given by the columns of X1, and compute the two-point FT of these vectors. The output is placed in the columns of Y1. In tensor product notation,

y1 = (I_M ⊗ F(2)) P(N, M) x.
There are two remaining steps in the computation. First, we compute

Y2(l1, k2) = Y1(l1, k2) w^{k2 l1},

which can be described by the diagonal matrix multiplication

y2 = T y1,

where T = diag(1, 1, 1, w, ..., 1, w^{M-1}).
The final computation,

Y(l1, l2) = Σ_{k2=0}^{M-1} Y2(l1, k2) v^{k2 l2},

computes the M-point FT of the rows of Y2, which can be written as

y = (F(M) ⊗ I2) y2.
This discussion leads to the following theorem.
Theorem 3.1 Let N = 2M. Then

F(N) = (F(M) ⊗ I2) T (I_M ⊗ F(2)) P(N, M),
T = diag(1, 1, 1, w, ..., 1, w^{M-1}).
The permutation P(N, M) naturally forms vectors of size 2 on which the action of I_M ⊗ F(2) can be computed in parallel. The twiddle factor T can be thought of as a block diagonal matrix consisting of M diagonal blocks of size 2. Each of the M blocks acts, in parallel, on the two-dimensional vector resulting from the action of I_M ⊗ F(2). The computation is completed by the vector FT, F(M) ⊗ I2, the vector FT F(M) acting on the set of M two-dimensional vectors.
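Applied recursively, Theorem 3.1 yields a complete radix-2 FFT; a minimal sketch (our own function, assuming N is a power of 2; because the book's F(N) uses exp(+2πi/N), it equals N times NumPy's inverse FFT):

```python
import numpy as np

def fft_ct(x):
    """One Cooley-Tukey step of Theorem 3.1, applied recursively:
    F(N) = (F(M) (x) I2) T (I_M (x) F(2)) P(N, M), N = 2M."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    M = N // 2
    w = np.exp(2j * np.pi / N)
    # P(N, M) pairs x_k with x_{M+k}; I_M (x) F(2) forms sums/differences;
    # T = T2(N) scales the differences by the twiddles w^k.
    t = x[:M] + x[M:]
    s = (x[:M] - x[M:]) * w ** np.arange(M)
    # F(M) (x) I2 is the M-point FT applied across the pairs.
    y = np.empty(N, dtype=complex)
    y[0::2] = fft_ct(t)
    y[1::2] = fft_ct(s)
    return y
```

The arithmetic at each level is 2M additions plus M twiddle multiplications, giving the N log N operation count cited in the introduction.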
3.5 Twiddle Factors
In this section, we consider the twiddle factors, the diagonal matrices appearing in Cooley-Tukey FFT algorithms.
For N > 1, set

D(N) = diag(1, w, ..., w^{N-1}), w = exp(2πi/N),
and for 1 ≤ M < N,

D_M(N) = diag(1, w, ..., w^{M-1}).
Example 3.1 D(2) = diag(1, -1).
Example 3.2

D(4) = diag(1, i, -1, -i), D2(4) = diag(1, i),

and we have D(4) = D(2) ⊗ D2(4).
Assume that N = 2M. With w = exp(2πi/N) and w^2 = exp(2πi/M),

D(N) = diag(D2(N), w^2 D2(N), ..., w^{2(M-1)} D2(N)).
By the definition of the tensor product,
D(N) = D(M) ⊗ D2(N).
The general result will be stated as a theorem.
Theorem 3.2 If N = ML, then

D(N) = D(L) ⊗ D_M(N).
Proof Since

w^M = exp(2πi/L),

we can write

D(N) = diag(D_M(N), w^M D_M(N), ..., w^{M(L-1)} D_M(N)),

which by the definition of the tensor product proves the theorem.
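Theorem 3.2 (and the small Example 3.2 above) can be checked directly; the helper D below is ours:

```python
import numpy as np

def D(n, N=None):
    """D_n(N) = diag(1, w, ..., w^{n-1}) with w = exp(2*pi*i/N);
    the full D(N) when n = N."""
    N = n if N is None else N
    return np.diag(np.exp(2j * np.pi * np.arange(n) / N))

N, M, L = 12, 3, 4                      # N = ML
assert np.allclose(D(N), np.kron(D(L), D(M, N)))
```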
Let N = ML. Define the matrix direct sum

T_M(N) = ⊕_{l=0}^{L-1} D_M^l(N) = diag(I_M, D_M(N), D_M^2(N), ..., D_M^{L-1}(N)).

The matrix T_M(N) can be viewed as a block diagonal matrix consisting of L diagonal blocks of size M. In this way, it naturally acts, in parallel, on L vectors of size M. The diagonal matrix T of Theorem 3.1 is T2(N).
Stride permutations act on these diagonal matrices as follows.
Theorem 3.3 If N = ML, then

P(N, M) T_M(N) P(N, L) = ⊕_{m=0}^{M-1} D_L^m(N) = T_L(N).
Proof The matrix on the left-hand side is the diagonal matrix whose diagonal is formed from the product of P(N, M) with the vector of diagonal components of T_M(N). Listing the diagonal elements of T_M(N) as a row,

1, ..., 1; 1, w, ..., w^{M-1}; ...; 1, w^{L-1}, ..., w^{(L-1)(M-1)},

and striding through the row with stride M, we have

1, ..., 1; 1, w, ..., w^{L-1}; ...; 1, w^{M-1}, ..., w^{(M-1)(L-1)},

proving the theorem.
As expected, the natural block-like structure of T_M(N) is transformed into M diagonal blocks of size L by conjugation by P(N, M). The judicious application of this result provides a means of keeping consistency throughout the computation.
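The conjugation rule of Theorem 3.3 can be confirmed numerically (helper names T and P are ours):

```python
import numpy as np

N, M, L = 12, 3, 4                     # N = ML
w = np.exp(2j * np.pi / N)

def T(blk, nblocks):
    """T_blk(N): direct sum of the powers D_blk(N)^l, l = 0, ..., nblocks-1."""
    return np.diag(np.concatenate([w ** (l * np.arange(blk))
                                   for l in range(nblocks)]))

def P(S):
    """Matrix of the stride permutation P(N, S)."""
    return np.eye(N)[np.arange(N).reshape(N // S, S).T.ravel()]

assert np.allclose(P(M) @ T(M, L) @ P(L), T(L, M))
```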
3.6 FT Factors
In general, we will need to compute the actions of

I_M ⊗ F(L), F(L) ⊗ I_M.
Factors of these types were studied in chapter 2. Although the arithmetic cost of both actions is the same, the efficiency of implementation can vary on different types of machine architecture. Diagrams can be helpful. The action of F(2) can be represented by the butterfly diagram

x0 .    . x0 + x1
x1 .    . x0 - x1
Example 3.3 The action of I2 ⊗ F(2) is represented by

x0 .    . x0 + x1
x1 .    . x0 - x1
x2 .    . x2 + x3
x3 .    . x2 - x3

which we see consists of two parallel two-point FTs.
In general, the action of I_M ⊗ F(L) can be computed by M parallel L-point FTs.
Example 3.4 The action of F(2) ⊗ I2 is represented by

x0 .    . x0 + x2
x1 .    . x1 + x3
x2 .    . x0 - x2
x3 .    . x1 - x3
In the previous chapter, we saw that this is a vector operation. Associate to x the equivalent two-dimensional array

X = [ x0 x2 ]
    [ x1 x3 ].

Form the two vectors of length 2 from the columns of X,

x(0) = (x0, x1)^t, x(1) = (x2, x3)^t.

We say that these vectors are formed with stride 1. The action of F(2) ⊗ I2 is given by the two-point vector FT,

[ y(0) ] = F(2) [ x(0) ]
[ y(1) ]        [ x(1) ].
We read out the vector y = (F(2) ⊗ I2)x in natural order. As in chapter 2, the commutation theorem can be used to interchange parallel and vector operations by the formula

P(N, L)(I_M ⊗ F(L))P(N, M) = F(L) ⊗ I_M, N = ML.

As an example, observe that the following diagram also computes the action F(2) ⊗ I2 (which is a vector operation) as a parallel operation using the commutation theorem:
Example 3.5 F(2) ⊗ I2 = P(4, 2)(I2 ⊗ F(2))P(4, 2):

x0 .    . x0    . x0 + x2    . x0 + x2
x1 .    . x2    . x0 - x2    . x1 + x3
x2 .    . x1    . x1 + x3    . x0 - x2
x3 .    . x3    . x1 - x3    . x1 - x3
Example 3.6 Let y = (F(2) ⊗ I3)x. Then

y0 = x0 + x3,   y3 = x0 - x3,
y1 = x1 + x4,   y4 = x1 - x4,
y2 = x2 + x5,   y5 = x2 - x5,

which can be represented by

x0 .    . x0 + x3
x1 .    . x1 + x4
x2 .    . x2 + x5
x3 .    . x0 - x3
x4 .    . x1 - x4
x5 .    . x2 - x5
This computation can be carried out in three stages, as indicated by the diagram

x0 .    . x0    . x0 + x3    . x0 + x3
x1 .    . x3    . x0 - x3    . x1 + x4
x2 .    . x1    . x1 + x4    . x2 + x5
x3 .    . x4    . x1 - x4    . x0 - x3
x4 .    . x2    . x2 + x5    . x1 - x4
x5 .    . x5    . x2 - x5    . x2 - x5
which corresponds to the factorization

F(2) ⊗ I3 = P(6, 2)(I3 ⊗ F(2))P(6, 3).
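This instance of the commutation theorem is easy to confirm with matrices (helper perm is ours):

```python
import numpy as np

F2 = np.array([[1, 1], [1, -1]])

def perm(N, S):
    """Matrix of the stride permutation P(N, S)."""
    return np.eye(N, dtype=int)[np.arange(N).reshape(N // S, S).T.ravel()]

lhs = np.kron(F2, np.eye(3, dtype=int))
rhs = perm(6, 2) @ np.kron(np.eye(3, dtype=int), F2) @ perm(6, 3)
assert np.array_equal(lhs, rhs)
```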
3.7 Variants of the Cooley-Tukey FFT
In the notation of the preceding section,

F(N) = (F(M) ⊗ I2) T2(N) (I_M ⊗ F(2)) P(N, M), N = 2M.   (3.10)
We will now derive other factorizations of F(N). These variants are distinguished by the flow of the data through the computation. They make available to the algorithm designer several possibilities for computing the N-point FT, all having the same arithmetic, but differing in the storage and gathering of data.
By the commutation theorem,

P(N, M)(I2 ⊗ F(M))P(N, 2) = F(M) ⊗ I2.
Using this formula in (3.10), we have

F(N) = P(N, M)(I2 ⊗ F(M))P(N, 2) T2(N) (I_M ⊗ F(2)) P(N, M).
Both of the FT factors are parallel factors; the first is M copies of F(2) and the second is two copies of F(M). The first part of the computation,

T2(N)(I_M ⊗ F(2))P(N, M),

is naturally thought of as a parallel action on vectors of size 2, while the second part,

P(N, M)(I2 ⊗ F(M))P(N, 2),

is a parallel action on vectors of size M, where the permutation matrices describe data readdressing.
Applying the commutation theorem in the form

P(N, 2)(I_M ⊗ F(2))P(N, M) = F(2) ⊗ I_M

leads to the factorization

F(N) = (F(M) ⊗ I2) T2(N) P(N, M) (F(2) ⊗ I_M).
The FT factors are now vector factors; the first acting on vectors of size M and the second acting on vectors of size 2.
A second technique for manipulating factorizations comes from the transpose. Taking the transpose on both sides of (3.10) and using the formulas
F(N)^t = F(N),  P^t = P^{-1},
(A ⊗ B)^t = A^t ⊗ B^t,

we have

F(N) = P(N,2)(I_M ⊗ F(2))T_2(N)(F(M) ⊗ I_2).
A permutation of the output data is now required. Applying the transpose and the commutation theorem, other factorizations can be derived. There are several features that distinguish between these
factorizations: input permutation, output permutation and internal permutation, the type of lower order FT factors and their placement in the computation. We single out the four factorizations derived in this section for future reference. Set P = P(N,M) and T = T_2(N), N = 2M. Using Theorem 3.1, we have
Case N = 2M:
(a1) F(N) = (F(M) ⊗ I_2)T(I_M ⊗ F(2))P,
(b1) F(N) = P^{-1}(I_M ⊗ F(2))T(F(M) ⊗ I_2),
(c1) F(N) = P(I_2 ⊗ F(M))P^{-1}T(I_M ⊗ F(2))P,
(d1) F(N) = (F(M) ⊗ I_2)TP(F(2) ⊗ I_M).
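Each of (a1)-(d1) can be checked numerically. The sketch below (our own helper names; w = exp(2πi/N) as in (3.11), and T_2(N) is the direct sum of the matrices diag(1, w^j)) verifies all four factorizations against the N-point Fourier matrix for N = 8.

```python
import numpy as np

def F(n):  # Fourier matrix with entries w^(kn), w = exp(2*pi*i/n)
    k = np.arange(n)
    return np.exp(2j * np.pi * np.outer(k, k) / n)

def P(N, S):  # stride permutation matrix P(N, S)
    return np.eye(N)[np.arange(N).reshape(N // S, S).T.ravel()]

def T2(N):  # twiddle factor T_2(N) = diag(1,1, 1,w, 1,w^2, ...), w = exp(2*pi*i/N)
    w = np.exp(2j * np.pi / N)
    return np.diag(np.array([w ** (j * l) for j in range(N // 2) for l in range(2)]))

N = 8
M = N // 2
Pm, Pinv, T = P(N, M), P(N, 2), T2(N)      # P(N,2) = P(N,M)^(-1)
I2, IM = np.eye(2), np.eye(M)

a1 = np.kron(F(M), I2) @ T @ np.kron(IM, F(2)) @ Pm
b1 = Pinv @ np.kron(IM, F(2)) @ T @ np.kron(F(M), I2)
c1 = Pm @ np.kron(I2, F(M)) @ Pinv @ T @ np.kron(IM, F(2)) @ Pm
d1 = np.kron(F(M), I2) @ T @ Pm @ np.kron(F(2), IM)
for A in (a1, b1, c1, d1):
    assert np.allclose(A, F(N))
```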
3.8 Cooley-Tukey FFT for N = ML
Let N = ML and consider the N-point FT
y_k = Σ_{n=0}^{N-1} w^{kn}x_n,  0 ≤ k < N,  w = exp(2πi/N). (3.11)
We will derive a Cooley-Tukey algorithm computing the N-point FT. Associate to the N-point input vector x the L × M array

X = [ x_0      x_L      ...  x_{(M-1)L}
      x_1      x_{L+1}  ...  x_{(M-1)L+1}
      ...
      x_{L-1}  x_{2L-1} ...  x_{N-1} ]
and set
X_1 = X^t = [ x_0         x_1           ...  x_{L-1}
              x_L         x_{L+1}       ...  x_{2L-1}
              ...
              x_{(M-1)L}  x_{(M-1)L+1}  ...  x_{N-1} ]
The corresponding N-tuple x_1 is given by applying the N-point stride-L permutation P(N,L) to x,

x_1 = P(N,L)x.
Associate to the output vector y the M × L array

Y = [ y_0      y_M       ...  y_{(L-1)M}
      y_1      y_{M+1}   ...  y_{(L-1)M+1}
      ...
      y_{M-1}  y_{2M-1}  ...  y_{N-1} ]
We can write
x(k1, k2) = x(k2 + k1·L),  0 ≤ k1 < M, 0 ≤ k2 < L,
y(l1, l2) = y(l1 + l2·M),  0 ≤ l1 < M, 0 ≤ l2 < L.
Formula (3.11) can be rewritten as
Y(l1, l2) = Σ_{k2=0}^{L-1} Σ_{k1=0}^{M-1} x(k1, k2)w^{(k2+k1·L)(l1+l2·M)}. (3.12)

Now, (k2 + k1·L)(l1 + l2·M) ≡ k2·l1 + k1·l1·L + k2·l2·M mod N.
Set u = w^L and v = w^M. Since w^N = 1, we can rewrite (3.12) as

Y(l1, l2) = Σ_{k2=0}^{L-1} Σ_{k1=0}^{M-1} x(k1, k2)u^{k1·l1}w^{k2·l1}v^{k2·l2}. (3.13)
The argument proceeds as in section 2. First observe that the inner sum,
Y_1(l1, k2) = Σ_{k1=0}^{M-1} x(k1, k2)u^{k1·l1},

computes, for each 0 ≤ k2 < L, the M-point FT of the k2-th column of X_1 and places the result in the k2-th column of Y_1. Let y_1 be the vector formed by reading, in order, down the columns of Y_1. Then

y_1 = (I_L ⊗ F(M))x_1 = (I_L ⊗ F(M))P(N,L)x.
The next stage of the computation,
Y_2(l1, k2) = Y_1(l1, k2)w^{k2·l1},
can be given by the diagonal matrix multiplication
y_2 = T_M(N)y_1.
We complete the computation by
Y(l1, l2) = Σ_{k2=0}^{L-1} Y_2(l1, k2)v^{k2·l2},

which computes the L-point FT of the rows of Y_2:

y = (F(L) ⊗ I_M)y_2.
Theorem 3.4 If N = ML, then

F(N) = (F(L) ⊗ I_M)T_M(N)(I_L ⊗ F(M))P(N,L).
The first part of the computation,
T_M(N)(I_L ⊗ F(M))P(N,L),

can be viewed as a parallel action on vectors of size M, while the second part, F(L) ⊗ I_M, is the vector FT on the resulting L vectors of size M.
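Theorem 3.4 can be verified directly for any splitting N = ML. In the sketch below (our own helper names), the twiddle factor T_M(N) is built from the derivation above: the diagonal entry at position k2·M + l1 is w^(k2·l1).

```python
import numpy as np

def F(n):  # Fourier matrix, w = exp(2*pi*i/n) as in (3.11)
    k = np.arange(n)
    return np.exp(2j * np.pi * np.outer(k, k) / n)

def P(N, S):  # stride permutation matrix P(N, S)
    return np.eye(N)[np.arange(N).reshape(N // S, S).T.ravel()]

def T(N, M):  # twiddle factor T_M(N), N = L*M: entry w^(k2*l1) at index k2*M + l1
    L, w = N // M, np.exp(2j * np.pi / N)
    return np.diag(np.array([w ** (k2 * l1) for k2 in range(L) for l1 in range(M)]))

N, L, M = 12, 4, 3
rhs = np.kron(F(L), np.eye(M)) @ T(N, M) @ np.kron(np.eye(L), F(M)) @ P(N, L)
assert np.allclose(rhs, F(N))    # Theorem 3.4 for N = 12, L = 4, M = 3
```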
The transpose and the commutation theorem can be used to derive other factorizations as in section 7. We single out the following list for future reference:
Case N = ML:

(a2) F(N) = (F(L) ⊗ I_M)T_M(N)(I_L ⊗ F(M))P(N,L).
(b2) F(N) = P(N,L)(I_M ⊗ F(L))P(N,M)T_M(N)(I_L ⊗ F(M))P(N,L).
(c2) F(N) = (F(L) ⊗ I_M)T_M(N)P(N,L)(F(M) ⊗ I_L).
(d2) F(N) = P(N,M)(I_L ⊗ F(M))T_M(N)(F(L) ⊗ I_M).
3.9 Arithmetic Cost
The number of arithmetic operations required to carry out a computation is an important part of the cost of the computation and has traditionally occupied the most attention. On modern machines, a large part of the computation time can be spent on data communication, but there is as yet little general theory measuring this aspect of the overall computation. We gave some general guidelines in the previous sections, but much more, especially on specific architectures, remains unanswered. Arithmetic cost is much easier to estimate.
In the class of algorithms listed in section 7, each algorithm has the same arithmetic cost if we ignore the underlying arithmetic involved in addressing. Consider factorization (c2). We require an input permutation at no arithmetic cost. Then L M-point FTs must be computed, followed by a diagonal matrix multiplication. In the last stage, since F(L) ⊗ I_M and I_M ⊗ F(L) are the same up to data permutation, the equivalent of M L-point FTs must be computed. If we have some algorithm computing the M-point FT with m(M) multiplications and a(M) additions, then the algorithms of section 8 compute the N-point FT in

Ma(L) + La(M) (3.14)

additions and

Mm(L) + N + Lm(M) (3.15)

multiplications. The N in (3.15) comes from the diagonal matrix multiplication. Since many of the diagonal entries are 1 in practice, we can reduce this cost.
If we take

a(M) = M(M − 1), (3.16)
m(M) = M^2, (3.17)

then (3.15) becomes

N(M + L + 1), (3.18)

which should be compared to m(N) = N^2.
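A quick numeric illustration of (3.14)-(3.18) (a sketch; the naive costs a(M) = M(M − 1) and m(M) = M^2 of (3.16)-(3.17) are assumed for the smaller transforms):

```python
def mults_split(M, L):
    # M*m(L) + N + L*m(M) of (3.15), with m(n) = n^2 as in (3.17)
    N = M * L
    return M * L**2 + N + L * M**2

for M, L in [(4, 4), (8, 8), (16, 16)]:
    N = M * L
    assert mults_split(M, L) == N * (M + L + 1)   # identity (3.18)
    print(N, mults_split(M, L), N * N)            # one split vs direct m(N) = N^2
```

For N = 256 this already gives 8448 multiplications against the 65536 of the direct computation, and recursive splitting reduces the count further.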
References
[1] Cochran, W. T., et al. "What is the Fast Fourier Transform?," IEEE Trans. Audio Electroacoust. 15 45-55, 1967.
[2] Cooley, J. W. and Tukey, J. W. "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. Comp. 19 297-301, 1965.
[3] Gentleman, W. M., and Sande, G. "Fast Fourier Transform for Fun and Profit," Proc. AFIPS, Joint Computer Conference 29 563-578, 1966.
[4] Korn, D. G., and Lambiotte, J. J. "Computing the Fast Fourier Transform on a Vector Computer," Math. Comp. 33 977-992, 1979.
[5] Pease, M. C. "An Adaptation of the Fast Fourier Transform for Parallel Processing," J. ACM 15 252-265, 1968.
[6] Burrus, C. S. "Bit Reverse Unscrambling for a Radix-2^m FFT," IEEE Trans. Acoust., Speech and Signal Proc. 36, July 1988.
[7] Singleton, R. C. "An Algorithm for Computing the Mixed-Radix Fast Fourier Transform," IEEE Trans. Audio Electroacoust. 17 93-103, 1969.
[8] Swartztrauber, P. N. "FFT algorithms for vector computers," Parallel Computing, 1 45-63, North Holland, 1984.
[9] Temperton, C. "Self-Sorting Mixed-Radix Fast Fourier Transforms," J. Comput. Phys. 52(1) 1-23, 1983.
[10] Burrus, C. S. and Parks, T. W. DFT/FFT and Convolution Algorithms, John Wiley and Sons, 1985.
[11] Oppenheim, A. V. and Schafer, R. W. Digital Signal Processing, Prentice-Hall, 1975.
[12] Nussbaumer, H. J. Fast Fourier Transform and Convolution Algorithms, Springer-Verlag, 1981.
Problems
1. Show that F(N)F(N)^* = N·I_N.
2. Compute T_4(16), T_4(64) and T_4(128).
3. Compute T_3(9), T_3(27) and T_3(81).
4. Show directly that
P(12,4)T_4(12)P(12,3) = T_3(12).
5. Diagram the computation of F(3) ⊗ I_4 using the identity

F(3) ⊗ I_4 = P(12,3)(I_4 ⊗ F(3))P(12,4).
6. Derive directly the factorization
F(27) = (F(9) ⊗ I_3)T_3(27)(I_9 ⊗ F(3))P(27,9).
7. From Theorem 3.3, use the transpose to derive the factorization
F(N) = P(N,S)(I_R ⊗ F(S))P(N,R)T_R(N)(I_S ⊗ F(R))P(N,S).
8. Prove that (A ⊗ B)^t = A^t ⊗ B^t.
9. Show directly, without using Theorem 3.4, that the factorization
F(N) = P(N,2)(I_M ⊗ F(2))T_2(N)(F(M) ⊗ I_2)

implies the factorization

F(N) = (F(2) ⊗ I_M)T_M(N)(I_2 ⊗ F(M))P(N,2).
10. Give an example showing that the arithmetic count for the computation y = F(12)x depends on the factorization taken for 12.
4 Variants of FT Algorithms and Implementations
4.1 Introduction
In chapter 3, additive FT algorithms were derived corresponding to the factorization of the transform size N into a product of two factors. Analogous algorithms will now be designed corresponding to transform sizes given as a product of three or more factors. In general, as the number of factors increases, the number of possible algorithms increases.
In this chapter, we derive the Cooley-Tukey [3] and Gentleman-Sande [4] FT algorithms. They are related by matrix transpose and distinguished by whether bit-reversal is applied at input or output. In both cases, FT factors of mixed type,

I_M ⊗ F(K) ⊗ I_L, (4.1)

appear in the factorization, as discussed in chapter 2. This factor can be viewed as M concurrent FTs on vectors of length L. Applying the commutation theorem, this factor can be replaced by the 'vector' factor

F(K) ⊗ I_{ML}, (4.2)

which can be viewed as the vector K-point FT on vectors of length ML. In theory, a vectorization of the Cooley-Tukey FFT algorithm is produced by systematically replacing all mixed-type FT factors by their corresponding vector factors. However, implementing a vector factor on a specific vector computer cannot, in general, be accomplished without breaking up the computation into pieces that can be fit into the vector registers. This partitioning of the computation introduces concurrency back into the factor, and is one of the main difficulties in matching algorithm to architecture. This problem was discussed in chapter 2.
Parallel algorithms and vector algorithms are easily related by the commutation theorem. The commutation theorem introduces explicit permutation matrices into the factorization. Not surprisingly, these permutation matrices are built from the stride permutations. Variants of the Cooley-Tukey FFT algorithms, to a large extent, depend on the permutation matrices that are used to bring about vectorization or parallelization. For example, for N = MLK, we have the two formulas

F(K) ⊗ I_{ML} = P(N, LK)(I_M ⊗ F(K) ⊗ I_L)P(N, LK)^{-1} (4.3)

and

F(K) ⊗ I_{ML} = (P(MK, K) ⊗ I_L)(I_M ⊗ F(K) ⊗ I_L)(P(MK, K)^{-1} ⊗ I_L). (4.4)
The variants derived by Pease [6], Korn and Lambiotte [5], and Agarwal and Cooley [1] depend on factorization (4.3), while the auto-sort variant derived by Stockham in [9] depends on factorization (4.4). The Korn-Lambiotte FFT algorithm is the vector analogue of the parallel FFT algorithm of Pease. Two features distinguish these algorithms. First, the main computational stages are the same in all of these algorithms and are given by the vector FT factor (4.2). In the Korn-Lambiotte FFT algorithm and the Agarwal-Cooley FFT algorithm, bit-reversal is required at input or output, which can be a time-consuming step on many vector computers. However, the internal permutations introduced by the commutation theorem, as seen in (4.3), have uniform structures throughout the different stages of the computation, and can be implemented by the stride load memory feature of many vector computers on vectors of maximal lengths. The auto-sort variant does not require bit-reversal at input or output. It accomplishes this savings by distributing bit-reversal throughout the computation. From (4.4), we see that the internal permutations are tensor products and can be viewed as vector stride permutations; but the vector lengths are not maximal and change throughout the computation.
In section 2, we derive the radix-2 Cooley-Tukey FFT algorithm and the radix-2 Gentleman-Sande FFT algorithm. Bit-reversal is defined. In section 3, the Pease FFT algorithm is derived, and its vector form due to Korn and Lambiotte is discussed. The auto-sort FFT algorithm is derived in section 4. In the final three sections, the mixed radix generalizations of these algorithms are given.
4.2 Radix-2 Cooley-Tukey FFT Algorithm
We derive a Cooley-Tukey FFT algorithm for transform size N = 2^k. The algorithm decomposes the computation of an N-point FT into a sequence of k operations each requiring two-point FTs followed by an output permutation. Our derivation is based on the factorization
F(N) = P(N)(I_2 ⊗ F(M))T(N)(F(2) ⊗ I_M),  M = N/2, (4.5)

where P(N) = P(N,M) and T(N) = T_M(N). We begin with two examples.

Example 4.1 F(4) = P(4)(I_2 ⊗ F(2))T(4)(F(2) ⊗ I_2).

Example 4.2 F(8) = P(8)(I_2 ⊗ F(4))T(8)(F(2) ⊗ I_4).
The operation I_2 ⊗ F(4) can be factored using example 4.1 and the tensor product identity,

I ⊗ (BC) = (I ⊗ B)(I ⊗ C), (4.6)

with the result

I_2 ⊗ F(4) = (I_2 ⊗ P(4))(I_4 ⊗ F(2))(I_2 ⊗ T(4))(I_2 ⊗ F(2) ⊗ I_2). (4.7)
Placing (4.7) into example 4.2, we have

F(8) = P(8)(I_2 ⊗ P(4))(I_4 ⊗ F(2))(I_2 ⊗ T(4))(I_2 ⊗ F(2) ⊗ I_2)T(8)(F(2) ⊗ I_4). (4.8)
We organize the computation into stages by setting

X_1 = I_4 ⊗ F(2),
X_2 = (I_2 ⊗ T(4))(I_2 ⊗ F(2) ⊗ I_2),
X_3 = T(8)(F(2) ⊗ I_4),  Q = P(8)(I_2 ⊗ P(4)),

and writing F(8) = QX_1X_2X_3. Each computational stage is carried out by using a two-point FT, but readdressing is necessary between stages. The loops that implement these stages were discussed in chapter 2.
Direct computation shows that Q satisfies the condition

Q(x_1 ⊗ x_2 ⊗ x_3) = x_3 ⊗ x_2 ⊗ x_1, (4.9)

where x_1, x_2, x_3 are two-dimensional vectors. Since these tensor products span C^8, condition (4.9) uniquely defines Q. We call Q the eight-point bit-reversal for the following reason. Each integer 0 ≤ n < 8 can be uniquely written as

n = a_0 + 2a_1 + 4a_2,  0 ≤ a_0, a_1, a_2 < 2, (4.10)
and we call the ordered triple,

(a_0, a_1, a_2),

the binary bit representation of n. Consider the permutation of the indexing set,

π(a_0, a_1, a_2) = (a_2, a_1, a_0),  0 ≤ a_0, a_1, a_2 < 2,

given by reversing the bits. The following table describes π.
Table 4.1 Bit-Reversal π:

000 → 000
001 → 100
010 → 010
011 → 110
100 → 001
101 → 101
110 → 011
111 → 111
Direct computation shows that the permutation matrix corresponding to π is Q.
More generally, if N = 2^k, the N-point bit-reversal is the permutation Q uniquely defined by the condition,

Q(x_1 ⊗ ... ⊗ x_k) = x_k ⊗ ... ⊗ x_1, (4.11)

where each x_j is a two-dimensional vector. Arguing as above, Q corresponds to the indexing set permutation π given by bit-reversal. Explicitly, each integer 0 ≤ n < N can be uniquely written as

n = a_0 + 2a_1 + ... + 2^{k-1}a_{k-1},  0 ≤ a_j < 2,

and we call the ordered k-tuple

(a_0, a_1, ..., a_{k-1}),

the binary bit representation of n. Define the permutation

π(a_0, ..., a_{k-1}) = (a_{k-1}, ..., a_1, a_0).

The corresponding permutation matrix satisfies (4.11) and is N-point bit-reversal.
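Bit-reversal is straightforward to realize in code. A small numpy sketch (our own function name), checking both the defining condition (4.11) and Table 4.1:

```python
import numpy as np

def bit_reversal(k):
    """Q(N), N = 2^k: (Qx)[m] = x[m with its k bits reversed]."""
    N = 1 << k
    rev = [int(format(m, '0{}b'.format(k))[::-1], 2) for m in range(N)]
    return np.eye(N)[rev]

Q = bit_reversal(3)
rng = np.random.default_rng(0)
x1, x2, x3 = rng.standard_normal((3, 2))

# condition (4.11): Q reverses the order of the tensor factors
assert np.allclose(Q @ np.kron(np.kron(x1, x2), x3),
                   np.kron(np.kron(x3, x2), x1))

# Table 4.1: e.g. 001 -> 100 and 011 -> 110
e = np.eye(8)
assert np.allclose(Q @ e[1], e[4])
assert np.allclose(Q @ e[3], e[6])
```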
Denote N-point bit-reversal by Q(N), N = 2^k. We will show that

Q(N) = P(N)(I_2 ⊗ P(N/2)) ... (I_{N/4} ⊗ P(4)) (4.12)
     = P(N)(I_2 ⊗ Q(N/2)).
Define the sequence of permutations

Q_1 = P(N),
Q_2 = I_2 ⊗ P(N/2),
...
Q_{k-1} = I_{N/4} ⊗ P(4) = I_2 ⊗ (I_{N/8} ⊗ P(4)).

By (4.6),

Q_2 ... Q_{k-1} = I_2 ⊗ Q',

where

Q' = P(N/2)(I_2 ⊗ P(N/4)) ... (I_{N/8} ⊗ P(4)),
which we see is the factorization on the right-hand side of (4.12) corresponding to N/2 = 2^{k-1}. By induction, we can assume that Q' = Q(N/2) is N/2-point bit-reversal. Definition (4.11) implies that

(Q_2 ... Q_{k-1})(x_1 ⊗ x_2 ⊗ ... ⊗ x_k) = x_1 ⊗ (x_k ⊗ ... ⊗ x_2).

We complete the induction step using

P(N)(x_1 ⊗ x) = x ⊗ x_1,

where x is an N/2-dimensional vector. In the same way, we can show that

Q(N) = (P(4) ⊗ I_{N/4}) ... (P(N/2) ⊗ I_2)P(N).
Throughout we will set T(2^l) = T_{2^{l-1}}(2^l). By (4.5),

F(2^{k+1}) = P(2^{k+1})(I_2 ⊗ F(2^k))T(2^{k+1})(F(2) ⊗ I_{2^k}).

Arguing as in example 4.2, which is an induction on transform size, i.e., on k, we have the following result of Gentleman and Sande [4].

Theorem 4.1

F(2^k) = Q(2^k) Π_{l=1}^{k} (I_{2^{k-l}} ⊗ T(2^l))(I_{2^{k-l}} ⊗ F(2) ⊗ I_{2^{l-1}}).
We continue the convention that, for matrices X_1, ..., X_k,

Π_{l=1}^{k} X_l = X_1 ... X_k.

Setting X_l = (I_{2^{k-l}} ⊗ T(2^l))(I_{2^{k-l}} ⊗ F(2) ⊗ I_{2^{l-1}}), we can write

F(2^k) = Q(2^k)X_1X_2 ... X_k.
Observe that the first stage computation

X_k = T(2^k)(F(2) ⊗ I_{2^{k-1}})

is a vector operation while the last stage computation

X_1 = I_{2^{k-1}} ⊗ F(2)

is a parallel operation. In general, the i-th stage computation computes 2^{i-1} copies of the two-point FT on the vectors of size 2^{k-i}, followed by 2^{i-1} copies of the twiddle factor T(2^{k-i+1}). It can be viewed as a parallel action on vectors of size 2^{k-i}. Vector length varies through the computation from 2^{k-1} to 1.
Taking the transpose yields the Cooley-Tukey radix-2 FFT algorithm [3].

Theorem 4.2

F(2^k) = [Π_{l=1}^{k} (I_{2^{l-1}} ⊗ F(2) ⊗ I_{2^{k-l}})(I_{2^{l-1}} ⊗ T(2^{k-l+1}))]Q(2^k).

Setting Y_l = X_{k-l+1}^t, we can write F(2^k) = Y_1Y_2 ... Y_kQ(2^k). The first stage computation

Y_k = I_{2^{k-1}} ⊗ F(2)

is now a parallel operation while the last stage computation

Y_1 = (F(2) ⊗ I_{2^{k-1}})T(2^k)
is a vector operation.

The Cooley-Tukey FFT has bit-reversal at input (decimation in time), while the Gentleman-Sande FFT has bit-reversal at output (decimation in frequency). As written, the Gentleman-Sande FFT performs an FT followed by a twiddle factor at every stage, but regrouping reverses the order. It is standard to combine these steps in code.
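The Gentleman-Sande factorization of Theorem 4.1 translates directly into a vectorized loop. The sketch below (our own code; it uses w = exp(2πi/N) as in (3.11), which is the conjugate of the usual numerical-library convention) applies the stages X_k, ..., X_1 and finishes with the output bit-reversal Q(2^k):

```python
import numpy as np

def gs_fft(x):
    """Gentleman-Sande radix-2 FFT, F(2^k) = Q(2^k) X_1 X_2 ... X_k."""
    x = np.asarray(x, dtype=complex).copy()
    N = len(x)
    k = N.bit_length() - 1
    # stage X_l = I_{2^(k-l)} (x) [T(2^l)(F(2) (x) I_{2^(l-1)})]; apply l = k, ..., 1
    for l in range(k, 0, -1):
        n, half = 1 << l, 1 << (l - 1)
        twid = np.exp(2j * np.pi * np.arange(half) / n)  # second half of T(2^l)
        y = x.reshape(N // n, n)
        a, b = y[:, :half].copy(), y[:, half:].copy()
        y[:, :half], y[:, half:] = a + b, (a - b) * twid
    rev = [int(format(m, '0{}b'.format(k))[::-1], 2) for m in range(N)]
    return x[rev]  # output bit-reversal Q(N)

# check against the N-point Fourier matrix of (3.11)
N = 16
x = np.random.default_rng(1).standard_normal(N)
idx = np.arange(N)
direct = np.exp(2j * np.pi * np.outer(idx, idx) / N) @ x
assert np.allclose(gs_fft(x), direct)
```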
4.3 Pease FFT Algorithm
In [6], Pease designed a variation of the Cooley-Tukey FFT which he asserts is 'better adapted to parallel processing in a special purpose machine'. A few examples will show what he had in mind.
Set P(N) = P(N,M) and T(N) = T_M(N) with N = 2^k and M = 2^{k-1}.
Example 4.3 Consider the factorization
F(4) = P(4)(I_2 ⊗ F(2))T(4)(F(2) ⊗ I_2).
In section 2, we described the data flow of the corresponding computation. One of the main features of the Pease FFT is that we have constant
data flow in all stages of the computation. To accomplish this, we use the commutation theorem in the form
F(2) ⊗ I_2 = P(4)(I_2 ⊗ F(2))P(4).

We have

F(4) = P(4)(I_2 ⊗ F(2))T(4)P(4)(I_2 ⊗ F(2))P(4).
The smaller size FT factors are all the same. This factor, as discussed in chapter 2, is especially suited for parallel processing. The data flow is now explicitly part of the factorization and is constant throughout the computation. As envisioned by Pease, a single hardwired device can implement the action of P(4).
Example 4.4 We will derive a variation of the factorization (4.8) in which each stage of the computation has the same data flow. Set

P = P(8, 2).

From chapter 2, P^2 = P(8,4) = P^{-1}, P^3 = I_8.
By the commutation theorem,

F(2) ⊗ I_4 = P(I_4 ⊗ F(2))P^{-1},
I_2 ⊗ F(2) ⊗ I_2 = P^2(I_4 ⊗ F(2))P^{-2}.

Placing these identities in the factorization in example 4.2, we have

F(8) = Q(8)(I_4 ⊗ F(2))(I_2 ⊗ T(4))P^2(I_4 ⊗ F(2))P^{-2}T(8)P(I_4 ⊗ F(2))P^{-1}. (4.13)
Diagonal matrices remain diagonal matrices upon conjugating by permutation matrices. This idea will be used repeatedly to change data flow at the cost of changing twiddle factors. Setting

T_2 = P(I_2 ⊗ T(4))P^{-1},
T_3 = P^{-1}T(8)P,

we can rewrite (4.13) as

F(8) = Q(8)(I_4 ⊗ F(2))P^{-1}T_2(I_4 ⊗ F(2))P^{-1}T_3(I_4 ⊗ F(2))P^{-1}.
Since P^{-1} = P(8,4), we have the Pease eight-point FT.

Theorem 4.3

F(8) = Q(8)(I_4 ⊗ F(2))P(8,4)T_2(I_4 ⊗ F(2))P(8,4)T_3(I_4 ⊗ F(2))P(8,4),

where T_3 = T_2(8) and T_2 = P(8,2)(I_2 ⊗ T(4))P(8,4).
78 4. Variants of FT Algorithms and Implementations
We distinguish three stages in the computation:

X_l = T_l(I_4 ⊗ F(2))P(8,4),  l = 1, 2, 3,  T_1 = I_8,

and we can write F(8) = Q(8)X_1X_2X_3. Each stage begins with the same readdressing P(8,4) followed by the same FT computation I_4 ⊗ F(2). Only the twiddle factors vary from stage to stage.
To derive the general case, we consider the factorization of Theorem 4.1. The goal is to design an algorithm that has the same data flow in each stage of the computation. Set

P = P(2^k, 2).

In chapter 2, we proved that

P^l = P(2^k, 2^l),  P^k = I_N.

As in the preceding section, set

X_l = (I_{2^{k-l}} ⊗ T(2^l))(I_{2^{k-l}} ⊗ F(2) ⊗ I_{2^{l-1}}).
By the commutation theorem,

P^{l-1}X_lP^{k-l+1} = T_l(I_{2^{k-1}} ⊗ F(2)),

where T_l is the diagonal matrix

T_l = P^{l-1}(I_{2^{k-l}} ⊗ T(2^l))P^{k-l+1}.

Observe that X_1 = I_{2^{k-1}} ⊗ F(2). From the Gentleman-Sande FT algorithm, we have
F(2^k) = Q(2^k)X_1P^{-1}PX_2P^{-2}P^2X_3P^{-3} ... P^{k-1}X_k
       = Q(2^k)X_1P^{-1}(PX_2P^{-1})P^{-1}(P^2X_3P^{-2})P^{-1} ... (P^{k-1}X_kP)P^{-1}
       = Q(2^k)X_1P^{-1}T_2X_1P^{-1}T_3X_1P^{-1} ... T_kX_1P^{-1},

proving the generalized Pease FFT.

Theorem 4.4

F(2^k) = Q(2^k) Π_{l=1}^{k} T_l(I_{2^{k-1}} ⊗ F(2))P(2^k, 2^{k-1}),

T_l = P^{l-1}(I_{2^{k-l}} ⊗ T(2^l))P^{k-l+1},  1 ≤ l ≤ k.
Each stage of the Pease FFT,

T_l(I_{2^{k-1}} ⊗ F(2))P(2^k, 2^{k-1}),

begins with the same readdressing P(2^k, 2^{k-1}) followed by the same parallel FT computation I_{2^{k-1}} ⊗ F(2). Only the twiddle factors vary through the stages.
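Theorem 4.4 can be checked by assembling the matrices explicitly. A sketch for k = 3 (our own helper names; T(2^l) = T_{2^{l-1}}(2^l) as defined above):

```python
import numpy as np

def P(N, S):  # stride permutation matrix P(N, S)
    return np.eye(N)[np.arange(N).reshape(N // S, S).T.ravel()]

def bit_reversal(k):  # Q(2^k)
    N = 1 << k
    return np.eye(N)[[int(format(m, '0{}b'.format(k))[::-1], 2) for m in range(N)]]

def T(l):  # T(2^l) = T_{2^(l-1)}(2^l)
    n, half = 1 << l, 1 << (l - 1)
    w = np.exp(2j * np.pi / n)
    return np.diag(np.array([w ** (k2 * l1) for k2 in range(2) for l1 in range(half)]))

k = 3
N = 1 << k
F2 = np.array([[1.0, 1.0], [1.0, -1.0]])
Pm = P(N, 2)
prod = np.eye(N, dtype=complex)
for l in range(1, k + 1):
    # T_l = P^(l-1) (I_{2^(k-l)} (x) T(2^l)) P^(k-l+1)
    Tl = (np.linalg.matrix_power(Pm, l - 1)
          @ np.kron(np.eye(1 << (k - l)), T(l))
          @ np.linalg.matrix_power(Pm, k - l + 1))
    prod = prod @ Tl @ np.kron(np.eye(1 << (k - 1)), F2) @ P(N, N // 2)

idx = np.arange(N)
assert np.allclose(bit_reversal(k) @ prod,
                   np.exp(2j * np.pi * np.outer(idx, idx) / N))
```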
A vector version of the Pease FFT was presented by Korn and Lambiotte [5] for implementation on the STAR 100. Setting

Z = F(2) ⊗ I_{2^{k-1}},

we have

Z = PX_1P^{-1},  P = P(2^k, 2),

and can rewrite the Pease factorization as

F(2^k) = Q(2^k)P^{-1}ZT_2P^{-1}Z ... T_kP^{-1}Z.

Setting

T_l' = PT_lP^{-1} = T(2^l) ⊗ I_{2^{k-l}},

we can write

F(2^k) = Q(2^k)P^{-1}Z(P^{-1}T_2'Z)(P^{-1}T_3'Z) ... (P^{-1}T_k'Z),

proving the Korn-Lambiotte FFT.
Theorem 4.5

F(2^k) = Q(2^k) Π_{l=1}^{k} P(2^k, 2^{k-1})(T(2^l) ⊗ I_{2^{k-l}})(F(2) ⊗ I_{2^{k-1}}).

The Korn-Lambiotte FFT has complete vectorization and constant data flow. Only the twiddle factor varies at different stages of the computation.
Factorizations in Theorems 4.4 and 4.5 are decimation-in-frequency since the output is bit-reversed. Taking the transpose results in decimation-in-time, since now the input is bit-reversed. This form is due to Singleton [7].
4.4 Auto-Sorting FT Algorithm
The cost of performing N-point bit-reversal on either output or input data can be an important part of the overall cost of an FT computation on many machines. In Cochran et al. [2], an FFT algorithm attributed to Stockham is designed which computes the FT in proper order without requiring permutation either after or before the computational stages. We call such an algorithm an auto-sort algorithm. Temperton examines, in detail, the implementation of the Stockham FFT and mixed radix generalizations on the CRAY-1 in a series of papers [10, 11, 12].
The main idea underlying the Stockham auto-sort FFT is to distribute the N-point bit-reversal throughout the different stages of the computation. At the same time, the FT computations are unchanged. However, there is a trade-off. First, while the data flow in each stage of the Pease FFT is the same, in the Stockham FFT the data flow varies from stage to stage. Also, the data permutation required in each of the computational stages of the Pease FFT can be effectively implemented using generally available features of vector machines. In particular, in the radix-2 case, the perfect shuffle

P^{-1} = P(N, N/2),  N = 2^k,

is matched to the vector instruction

stride by N/2

applied to the N-point data. In the Stockham FFT, the corresponding data permutations can also be implemented using the vector instruction 'stride', but it operates on data of varying sizes analogous to the changing data flow or vector lengths in the Cooley-Tukey FFT.
Example 4.5 Denote four-point bit-reversal by Q(4) and eight-point bit-reversal by Q(8). Then

Q(8)(I_4 ⊗ F(2))Q(8)^{-1} = F(2) ⊗ I_4,
(Q(4) ⊗ I_2)(I_2 ⊗ F(2) ⊗ I_2)(Q(4)^{-1} ⊗ I_2) = F(2) ⊗ I_4.
To derive the eight-point auto-sort FFT, first vectorize the eight-point factorization (4.8) using bit-reversals:

F(8) = (F(2) ⊗ I_4)Q(8)(I_2 ⊗ T(4))(Q(4)^{-1} ⊗ I_2)(F(2) ⊗ I_4)(Q(4) ⊗ I_2)T(8)(F(2) ⊗ I_4).
Again we conjugate the twiddle factors by permutations and obtain

F(8) = (F(2) ⊗ I_4)Q(8)(Q(4)^{-1} ⊗ I_2)T_2'(F(2) ⊗ I_4)(Q(4) ⊗ I_2)T(8)(F(2) ⊗ I_4),

where T_2' = (Q(4) ⊗ I_2)(I_2 ⊗ T(4))(Q(4)^{-1} ⊗ I_2). Since

Q(8)(Q(4)^{-1} ⊗ I_2) = P(8, 2),
Q(4) = P(4, 2),
we have the eight-point Stockham FFT.
Theorem 4.6

F(8) = (F(2) ⊗ I_4)P(8,2)T_2'(F(2) ⊗ I_4)(P(4,2) ⊗ I_2)T_1'(F(2) ⊗ I_4),

where T_1' = T_4(8) and T_2' = (P(4,2) ⊗ I_2)(I_2 ⊗ T_2(4))(P(4,2) ⊗ I_2).
The Stockham FFT is a complete vector FT in which bit-reversal has been distributed through the computation. Data flow is no longer constant between stages.

Denote the 2^l-point bit-reversal by Q(2^l) and set

Q_l = Q(2^l) ⊗ I_{2^{k-l}},  1 ≤ l ≤ k.

By the definition of bit-reversal,

Q_{k-l+1}(I_{2^{k-l}} ⊗ F(2) ⊗ I_{2^{l-1}}) = (F(2) ⊗ I_{2^{k-1}})Q_{k-l+1}.

Set Z = F(2) ⊗ I_{2^{k-1}}. Since

Q_{k-l+1}X_lQ_{k-l+1}^{-1} = T_l'Z

and

Q_{k-l+1}Q_{k-l}^{-1} = P(2^{k-l+1}, 2) ⊗ I_{2^{l-1}},

where

T_l' = Q_{k-l+1}(I_{2^{k-l}} ⊗ T(2^l))Q_{k-l+1}^{-1},
the Gentleman-Sande FFT can be rewritten as

F(2^k) = (Q_kX_1Q_k^{-1})Q_kQ_{k-1}^{-1}(Q_{k-1}X_2Q_{k-1}^{-1})Q_{k-1}Q_{k-2}^{-1} ... X_k
       = ZQ_kQ_{k-1}^{-1}T_2'ZQ_{k-1}Q_{k-2}^{-1} ... T_k'Z,

and we have the Stockham auto-sorting FFT.
Theorem 4.7

F(2^k) = Π_{l=1}^{k} T_l'(F(2) ⊗ I_{2^{k-1}})(P(2^{k-l+1}, 2) ⊗ I_{2^{l-1}}),

T_l' = Q_{k-l+1}(I_{2^{k-l}} ⊗ T(2^l))Q_{k-l+1}^{-1}.
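Theorem 4.7 can be verified in the same matrix style; a sketch for k = 3 (our own helper names):

```python
import numpy as np

def P(N, S):  # stride permutation matrix P(N, S)
    return np.eye(N)[np.arange(N).reshape(N // S, S).T.ravel()]

def bit_reversal(k):  # Q(2^k)
    N = 1 << k
    return np.eye(N)[[int(format(m, '0{}b'.format(k))[::-1], 2) for m in range(N)]]

def T(l):  # T(2^l) = T_{2^(l-1)}(2^l)
    n, half = 1 << l, 1 << (l - 1)
    w = np.exp(2j * np.pi / n)
    return np.diag(np.array([w ** (k2 * l1) for k2 in range(2) for l1 in range(half)]))

k = 3
N = 1 << k
F2 = np.array([[1.0, 1.0], [1.0, -1.0]])
Z = np.kron(F2, np.eye(1 << (k - 1)))          # F(2) (x) I_{2^(k-1)}
prod = np.eye(N, dtype=complex)
for l in range(1, k + 1):
    j = k - l + 1
    Qj = np.kron(bit_reversal(j), np.eye(1 << (k - j)))      # Q_{k-l+1}
    Tl = Qj @ np.kron(np.eye(1 << (k - l)), T(l)) @ Qj.T     # T_l'
    Sl = np.kron(P(1 << j, 2), np.eye(1 << (l - 1)))         # interstage stride
    prod = prod @ Tl @ Z @ Sl

idx = np.arange(N)
assert np.allclose(prod, np.exp(2j * np.pi * np.outer(idx, idx) / N))
```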
4.5 Mixed Radix Cooley-Tukey FFT
Radix-2 algorithms were the first to be designed and dominated on serial machines. On vector machines, where arithmetic processing is very fast, the cost of data transfer becomes a significantly more important part of the overall cost. Radix-4 algorithms reduce this data transfer cost without appreciably increasing the arithmetic cost. Agarwal and Cooley have designed radix-4 FFT algorithms for implementation on the IBM 3090 vector facility. Mixed radix FFTs offer additional tools for utilizing high-speed processing without being hampered by data transfer problems. The theory underlying mixed radix FFT algorithms has been developed in several papers and is completely analogous to the theory developed in the preceding sections. (See [8] for an early account.) In this section, we generalize the radix-2 FFT to the mixed radix case.
We begin with the factorization

F(N) = P(N,L)(I_M ⊗ F(L))T_L(N)(F(M) ⊗ I_L),  N = ML. (4.14)

Suppose that N = N1N2N3. With M = N2 and L = N3, we have

F(N2N3) = P(N2N3, N3)(I_{N2} ⊗ F(N3))T_{N3}(N2N3)(F(N2) ⊗ I_{N3}).

Taking M = N1 and L = N2N3, we have

F(N) = P(N, N2N3)(I_{N1} ⊗ F(N2N3))T_{N2N3}(N)(F(N1) ⊗ I_{N2N3}).
Using the general tensor product identity

I ⊗ (BC) = (I ⊗ B)(I ⊗ C),

we have the three-factor mixed radix FFT.
Theorem 4.8 If N = N1N2N3, then

F(N) = Q(I_{N1N2} ⊗ F(N3))T_2(I_{N1} ⊗ F(N2) ⊗ I_{N3})T_1(F(N1) ⊗ I_{N2N3}),

where Q is the permutation matrix

Q = P(N, N2N3)(I_{N1} ⊗ P(N2N3, N3)),

and T_1 and T_2 are the diagonal matrices

T_1 = T_{N2N3}(N),  T_2 = I_{N1} ⊗ T_{N3}(N2N3).
F(N1) ⊗ I_{N2N3} is the vector N1-point FT on vectors of length N2N3, I_{N1} ⊗ F(N2) ⊗ I_{N3} is N1 independent vector N2-point FTs on vectors of length N3, and I_{N1N2} ⊗ F(N3) is N1N2 independent N3-point FTs. In particular, the vector length varies as follows: N2N3, N3, 1.
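Theorem 4.8 can be checked numerically; a sketch for (N1, N2, N3) = (2, 3, 2) (our own helper names, with T_M(N) as constructed in section 3.8):

```python
import numpy as np

def F(n):  # Fourier matrix, w = exp(2*pi*i/n)
    k = np.arange(n)
    return np.exp(2j * np.pi * np.outer(k, k) / n)

def P(N, S):  # stride permutation matrix P(N, S)
    return np.eye(N)[np.arange(N).reshape(N // S, S).T.ravel()]

def T(N, M):  # T_M(N): entry w^(k2*l1) at index k2*M + l1, w = exp(2*pi*i/N)
    L, w = N // M, np.exp(2j * np.pi / N)
    return np.diag(np.array([w ** (k2 * l1) for k2 in range(L) for l1 in range(M)]))

N1, N2, N3 = 2, 3, 2
N = N1 * N2 * N3
Q = P(N, N2 * N3) @ np.kron(np.eye(N1), P(N2 * N3, N3))
T1 = T(N, N2 * N3)                        # T_{N2N3}(N)
T2 = np.kron(np.eye(N1), T(N2 * N3, N3))  # I_{N1} (x) T_{N3}(N2N3)
rhs = (Q @ np.kron(np.eye(N1 * N2), F(N3)) @ T2
         @ np.kron(np.kron(np.eye(N1), F(N2)), np.eye(N3)) @ T1
         @ np.kron(F(N1), np.eye(N2 * N3)))
assert np.allclose(rhs, F(N))
```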
The general mixed radix factorization is proved by induction on the number of factors. Suppose throughout that N = N1N2 ... NK. Set N(0) = 1, N'(K) = 1 and

N(k) = N1N2 ... Nk,
N'(k) = N/N(k) = N_{k+1} ... N_K.
Define the N × N permutation matrix Q(N) by

Q(N)(x_1 ⊗ ... ⊗ x_K) = x_K ⊗ ... ⊗ x_1,  x_k ∈ C^{N_k}.

Q(N) is the mixed radix analog of bit-reversal as defined in section 4.2 and is called the N-point data transposition corresponding to the ordered factorization N = N1 ... NK. A direct computation shows that

Q(N) = P(N, N'(1))(I_{N1} ⊗ Q(N'(1))).
Set

T(N'(k-1)) = T_{N'(k)}(N'(k-1)).

By (4.14),

F(N) = P(N, N'(1))(I_{N1} ⊗ F(N'(1)))T(N)(F(N1) ⊗ I_{N'(1)}).
Arguing by induction, we have the next result.

Theorem 4.9

F(N) = Q(N) Π_{k=1}^{K} (I_{N(K-k)} ⊗ T(N'(K-k)))(I_{N(K-k)} ⊗ F(N_{K-k+1}) ⊗ I_{N'(K-k+1)}).
Setting

X_k = (I_{N(K-k)} ⊗ T(N'(K-k)))(I_{N(K-k)} ⊗ F(N_{K-k+1}) ⊗ I_{N'(K-k+1)}),

we can write

F(N) = Q(N)X_1X_2 ... X_K.
The first computational stage,

X_K = T(N)(F(N1) ⊗ I_{N'(1)}),

is a vector operation and the last computational stage,

X_1 = I_{N(K-1)} ⊗ F(N_K),

is a parallel operation.
4.6 Mixed Radix Agarwal-Cooley FFT
A generalization of the radix-2 Pease FFT to mixed radix was designed by R. C. Agarwal and J. W. Cooley [1] for implementation on the IBM 3090 vector computer. The goal, as stated, is to produce a fully vectorized mixed radix FFT algorithm performing all of the loads/stores with only small strides.
Consider N = N1N2N3. By the transpose of the factorization given in Theorem 4.8, we have

F(N) = (F(N1) ⊗ I_{N2N3})T_1(I_{N1} ⊗ F(N2) ⊗ I_{N3})T_2(I_{N1N2} ⊗ F(N3))Q^{-1}, (4.15)

where T_1 = T_{N2N3}(N) and T_2 = I_{N1} ⊗ T_{N3}(N2N3). Set P_l = P(N, N_l), 1 ≤ l ≤ 3. By the commutation theorem, we can rewrite (4.15) as

F(N) = (F(N1) ⊗ I_{N2N3})T_1P_1(F(N2) ⊗ I_{N1N3})(P_1^{-1}T_2P_1)(P_1^{-1}P_3^{-1})(F(N3) ⊗ I_{N1N2})P_3Q^{-1}.
Applying the commutation theorem once more,

P_1^{-1}T_2P_1 = P_1^{-1}(I_{N1} ⊗ T_{N3}(N2N3))P_1 = T_{N3}(N2N3) ⊗ I_{N1}.

Since P_2 = P_1^{-1}P_3^{-1}, we have proved the three factor Agarwal-Cooley algorithm.
Theorem 4.10 If N = N1N2N3, then

F(N) = (F(N1) ⊗ I_{N2N3})T_1P_1(F(N2) ⊗ I_{N1N3})T_2P_2(F(N3) ⊗ I_{N1N2})P_3Q^{-1},

where T_1 = T_{N2N3}(N), T_2 = T_{N3}(N2N3) ⊗ I_{N1} and P_l = P(N, N_l), 1 ≤ l ≤ 3.
Setting

X_1 = (F(N1) ⊗ I_{N2N3})T_1P_1,
X_2 = (F(N2) ⊗ I_{N1N3})T_2P_2,
X_3 = (F(N3) ⊗ I_{N1N2})P_3,

we can write F(N) = X_1X_2X_3Q^{-1}.

The form of each factor X_1, X_2, X_3 is the same (T_3 = I_N), beginning with a stride permutation P(N, N_k), followed by a twiddle factor T_k, and completed by a vector FT F(N_k) ⊗ I_{N/N_k}.
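The three-factor Agarwal-Cooley factorization can be tested the same way; a sketch for (N1, N2, N3) = (2, 3, 2) (our own helper names):

```python
import numpy as np

def F(n):  # Fourier matrix, w = exp(2*pi*i/n)
    k = np.arange(n)
    return np.exp(2j * np.pi * np.outer(k, k) / n)

def P(N, S):  # stride permutation matrix P(N, S)
    return np.eye(N)[np.arange(N).reshape(N // S, S).T.ravel()]

def T(N, M):  # twiddle factor T_M(N)
    L, w = N // M, np.exp(2j * np.pi / N)
    return np.diag(np.array([w ** (k2 * l1) for k2 in range(L) for l1 in range(M)]))

N1, N2, N3 = 2, 3, 2
N = N1 * N2 * N3
Q = P(N, N2 * N3) @ np.kron(np.eye(N1), P(N2 * N3, N3))   # as in Theorem 4.8
T1 = T(N, N2 * N3)                         # T_{N2N3}(N)
T2 = np.kron(T(N2 * N3, N3), np.eye(N1))   # T_{N3}(N2N3) (x) I_{N1}
rhs = (np.kron(F(N1), np.eye(N2 * N3)) @ T1 @ P(N, N1)
       @ np.kron(F(N2), np.eye(N1 * N3)) @ T2 @ P(N, N2)
       @ np.kron(F(N3), np.eye(N1 * N2)) @ P(N, N3) @ Q.T)
assert np.allclose(rhs, F(N))
```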
This factorization offers complete vectorization and data flow given by small strides P(N, N_k), k = 1, 2, 3.

To prove the general case, N = N1N2 ... NK, we take the vector form of Theorem 4.9 and regroup the terms using the identity

P(N, N'(k))P(N, N(k+1)) = P(N, N_{k+1}).
We now have the general Agarwal-Cooley FFT.

Theorem 4.11 For N = N1 ... NK,

F(N) = [Π_{k=1}^{K} ((F(N_k) ⊗ I_{N/N_k})T_kP(N, N_k))]Q(N)^{-1},

where T_k is the diagonal matrix

T_k = T(N'(k-1)) ⊗ I_{N(k-1)}.
From

((F(N_k) ⊗ I_{N/N_k})T_k)P(N, N_k) = P(N, N_k)(I_{N/N_k} ⊗ F(N_k))T_k',

where T_k' is the diagonal matrix

T_k' = P(N, N'(k))(I_{N(k-1)} ⊗ T(N'(k-1)))P(N, N(k)),

we have the parallel version of the preceding theorem.
Theorem 4.12 For N = N1 ... NK,

F(N) = [Π_{k=1}^{K} P(N, N_k)(I_{N/N_k} ⊗ F(N_k))T_k']Q(N)^{-1},

T_k' = P(N, N'(k))(I_{N(k-1)} ⊗ T(N'(k-1)))P(N, N(k)).
Taking matrix transpose in the preceding two theorems, we arrive at two additional factorizations which, although similar in form, transfer between decimation-in-time and decimation-in-frequency and change the interstage data readdressing from P(N, N_k) to P(N, N/N_k). In this way we can vary the place where data transposition is taken and the sizes of the strides required in the data readdressing stages.
4.7 Mixed Radix Auto-Sorting FFT
In the auto-sorting FFT, data transposition is distributed throughout the computation, removing the need for data transposition at input or output. The trade-off is that interstage data readdressing becomes more complicated. Consider N = N1N2N3 and the factorization of Theorem 4.8:
F(N) = Q(I_{N1N2} ⊗ F(N3))T_2(I_{N1} ⊗ F(N2) ⊗ I_{N3})T_1(F(N1) ⊗ I_{N2N3}). (4.16)

Q is data transposition corresponding to the ordered factorization

N = N1N2N3.

It follows that

Q(I_{N1N2} ⊗ F(N3)) = (F(N3) ⊗ I_{N1N2})Q.
Let Q_2 denote data transposition corresponding to the ordered factorization

N/N3 = N1N2,

and set R_2 = Q_2 ⊗ I_{N3}. Observe that Q_2 = P(N/N3, N2). Since

R_2(I_{N1} ⊗ F(N2) ⊗ I_{N3})R_2^{-1} = F(N2) ⊗ I_{N1N3},
we can rewrite (4.16) as

F(N) = (F(N3) ⊗ I_{N1N2})QR_2^{-1}(T'(F(N2) ⊗ I_{N1N3})R_2)(T_1(F(N1) ⊗ I_{N2N3})),

where T' is the diagonal matrix

T' = R_2T_2R_2^{-1}.

Direct computation shows that

QR_2^{-1} = P(N, N3),
proving the three factor mixed radix auto-sorting FFT algorithm.

Theorem 4.13 If N = N1N2N3, then

F(N) = (F(N3) ⊗ I_{N1N2})P(N, N3)T'(F(N2) ⊗ I_{N1N3})(P(N1N2, N2) ⊗ I_{N3})T_1(F(N1) ⊗ I_{N2N3}),

where T_1 = T_{N2N3}(N) and

T' = (Q_2 ⊗ I_{N3})(I_{N1} ⊗ T_{N3}(N2N3))(Q_2^{-1} ⊗ I_{N3}).
The general mixed radix auto-sort FFT is derived using the same arguments. Define

Q_k(x_{N_1} ⊗ ... ⊗ x_{N_k}) = x_{N_k} ⊗ ... ⊗ x_{N_1},  x_{N_j} ∈ C^{N_j}.

Setting R_K = Q(N) and

R_k = Q_k ⊗ I_{N'(k)},  1 ≤ k < K,

we have

F(N_k) ⊗ I_{N/N_k} = R_k(I_{N(k-1)} ⊗ F(N_k) ⊗ I_{N'(k)})R_k^{-1}

and

R_kR_{k-1}^{-1} = P(N(k), N_k) ⊗ I_{N'(k)}.
Arguing as above, from Theorem 4.9, by regrouping interstage permutations, we have the general auto-sorting FFT.

Theorem 4.14

F(N) = Π_{k=1}^{K} T_{K-k+1}''(F(N_{K-k+1}) ⊗ I_{N/N_{K-k+1}})(P(N(K-k+1), N_{K-k+1}) ⊗ I_{N'(K-k+1)}),

where T_k'' is the diagonal matrix

T_k'' = R_k(I_{N(k-1)} ⊗ T(N'(k-1)))R_k^{-1}.
Three additional auto-sorting FFTs can be derived using transpose and the commutation theorem.
4.8 Summary
I. N = 2^k,  T(2^{l+1}) = T_{2^l}(2^{l+1}).

Q(2^k)(a_1 ⊗ ... ⊗ a_k) = a_k ⊗ a_{k-1} ⊗ ... ⊗ a_1,  a_1, ..., a_k ∈ C^2.
Gentleman-Sande
F(2^k) = Q(2^k) Π_{l=1}^{k} (I_{2^{k-l}} ⊗ T(2^l))(I_{2^{k-l}} ⊗ F(2) ⊗ I_{2^{l-1}}).
Cooley-Tukey
F(2^k) = [Π_{l=1}^{k} (I_{2^{l-1}} ⊗ F(2) ⊗ I_{2^{k-l}})(I_{2^{l-1}} ⊗ T(2^{k-l+1}))]Q(2^k).
Pease
F(2^k) = Q(2^k) Π_{l=1}^{k} T_l(I_{2^{k-1}} ⊗ F(2))P(2^k, 2^{k-1}),

where T_l, 1 ≤ l ≤ k, is the diagonal matrix

T_l = P(2^k, 2^{l-1})(I_{2^{k-l}} ⊗ T(2^l))P(2^k, 2^{l-1})^{-1}.
Korn-Lambiotte
F(2^k) = Q(2^k) Π_{l=1}^{k} P(2^k, 2^{k-1})(T(2^l) ⊗ I_{2^{k-l}})(F(2) ⊗ I_{2^{k-1}}).
Auto-Sort
F(2^k) = Π_{l=1}^{k} T_l'(F(2) ⊗ I_{2^{k-1}})(P(2^{k-l+1}, 2) ⊗ I_{2^{l-1}}),

where

T_l' = (Q(2^{k-l+1}) ⊗ I_{2^{l-1}})(I_{2^{k-l}} ⊗ T(2^l))(Q(2^{k-l+1}) ⊗ I_{2^{l-1}})^{-1}.
II. Mixed Radix
N = N1N2 ... NK.
N(k) = N1 ... Nk.
N'(k) = N/N(k).
T(N'(k-1)) = T_{N'(k)}(N'(k-1)).
Q_k(x_1 ⊗ ... ⊗ x_k) = x_k ⊗ ... ⊗ x_1.
R_k = Q_k ⊗ I_{N'(k)}.
Gentleman-Sande
F(N) = Q(N) Π_{k=1}^{K} (I_{N(K-k)} ⊗ T(N'(K-k)))(I_{N(K-k)} ⊗ F(N_{K-k+1}) ⊗ I_{N'(K-k+1)}).
Agarwal-Cooley
F(N) = [Π_{k=1}^{K} ((F(N_k) ⊗ I_{N/N_k})T_kP(N, N_k))]Q(N)^{-1},

where T_k is the diagonal matrix

T_k = T(N'(k-1)) ⊗ I_{N(k-1)}.
Auto-Sort
F(N) = Π_{k=1}^{K} T_{K-k+1}''(F(N_{K-k+1}) ⊗ I_{N/N_{K-k+1}})(P(N(K-k+1), N_{K-k+1}) ⊗ I_{N'(K-k+1)}),

where T_k'' is the diagonal matrix

T_k'' = R_k(I_{N(k-1)} ⊗ T(N'(k-1)))R_k^{-1}.
References
[1] Agarwal, R. C. and Cooley, J. W. "Vectorized Mixed Radix Discrete Fourier Transform Algorithms", IBM Report, March 1986.
[2] Cochran, W. T., et al. "What is the Fast Fourier Transform?", IEEE Trans. Audio Electroacoust. 15, 1967, 45-55.
[3] Cooley, J. W. and Tukey, J. W. "An Algorithm for the Machine Calculation of Complex Fourier Series", Math. Comp. 19, 1965, 297-301.
[4] Gentleman, W. M. and Sande, G. "Fast Fourier Transform for Fun and Profit", Proc. AFIPS, Joint Computer Conference 29, 1966, 563-578.
[5] Korn, D. G. and Lambiotte, J. J. "Computing the Fast Fourier Transform on a Vector Computer", Math. Comp. 33, 1979, 977-992.
[6] Pease, M. C. "An Adaptation of the Fast Fourier Transform for Parallel Processing", J. ACM 15, 1968, 252-265.
[7] Singleton, R. C. "On Computing the Fast Fourier Transform", Comm. ACM 10, 1967, 647-654.
[8] Singleton, R. C. "An Algorithm for Computing the Mixed Radix Fast Fourier Transform", IEEE Trans. Audio Electroacoust. 17, 1969, 93-103.
[9] Temperton, C. "Self-Sorting Mixed Radix Fast Fourier Transforms", J. Comput. Phys. 52, 1983, 1-23.
[10] Temperton, C. "Fast Fourier Transforms and Poisson-Solvers on Cray-1", Supercomputers, Infotech State of the Art Report, Jesshope C. R. and Hockney R. W. eds., Infotech International Ltd., 1979, 359-379.
[11] Temperton, C. "Implementation of a Self-Sorting In-Place Prime Factor FFT algorithm", J. Comput. Phys. 58(3), 1985, 283-299.
[12] Temperton, C. "A Note on a Prime Factor FFT", J. Comput. Phys. 52(1), 1983, 198-204.
[13] Heideman, M. T. and Burrus, C. S. "Multiply/add Trade-off in Length-2^n FFT Algorithms", ICASSP'85, 780-783.
[14] Duhamel, P. "Implementation of Split-Radix FFT Algorithms for Complex, Real, and Real-Symmetric Data", IEEE Trans. on Acoust., Speech and Signal Proc. 34(2), April 1986.
[15] Vetterli, M. and Duhamel, P. "Split-Radix Algorithms for Length-p^m DFT's", ICASSP'88, 1415-1418.
Problems
1. Write a code implementing each stage X_l, 1 ≤ l ≤ k, of the Gentleman-Sande algorithm.
2. Write a code implementing bit-reversal.
3. For N = 8, 16, 32 and 64, describe the twiddle factor T_l in the Pease algorithm.

4. Derive the general form of the twiddle factor in the Pease factorization.
5. From the Pease algorithm, design an algorithm reversing the order of permutations and the twiddle factors.
6. From the Pease algorithm, design an algorithm having bit-reversal at output.
7. Determine the general form of the twiddle factors in the Stockham factorization.
8. Describe the twiddle factors in the mixed radix Agarwal-Cooley FFT algorithm.
9. Describe the twiddle factors in the mixed radix auto-sort FFT algorithm.
5 Good-Thomas PFA
5.1 Introduction
The additive FFT algorithms of the preceding two chapters make no explicit use of the multiplicative structure of the indexing set. We will see how the multiplicative structure can be applied, in the case of transform size N = RS, where R and S are relatively prime, to design an FT algorithm that is similar in structure to these additive algorithms but no longer requires the twiddle factor multiplication. The idea is due to Good [2] in 1958 and Thomas [8] in 1963, and the resulting algorithm is called the Good-Thomas Prime Factor Algorithm (PFA).
If the transform size N = NiN2, then one form of an additive algorithm can be expressed by the factorization
F(N) = (F(N_1) ⊗ I_{N_2}) T (I_{N_1} ⊗ F(N_2)) P,   (5.1)
where P is a permutation matrix and T is a diagonal matrix or twiddle factor. Corresponding to a decomposition of the transform size N of the form N = RS, where R and S are relatively prime, one form of the Good-Thomas PFA is given by the factorization
F(N) = Q_1 (F(R) ⊗ I_S)(I_R ⊗ F(S)) Q_2,   (5.2)
where Q_1 and Q_2 are permutation matrices. We can rewrite (5.2) as

F(N) = Q_1 (F(R) ⊗ F(S)) Q_2.   (5.3)
An obvious advantage of (5.3) is that the multiplications required in the twiddle factor stage of (5.1) are no longer necessary. Burrus and Eschenbacher [1] and Temperton [4] point out that a variant of (5.3) can be implemented in such a way that it is simultaneously self-sorting and in-place. In the preceding chapter, these properties served to distinguish the data flow of the different additive FFT algorithms, but in no case were both present. We will discuss some of Temperton's ideas below.
5.2 Indexing by the CRT
The main tool in the indexing of input and output data for the Good-Thomas PFA is given by the CRT. Suppose that
N = RS,  (R, S) = 1.   (5.4)
The CRT asserts the existence of a ring isomorphism
φ : Z/R × Z/S → Z/N,

where Z/R × Z/S denotes the ring-direct product with componentwise addition and multiplication. We will take φ to be the specific ring-isomorphism given by the complete system of idempotents relative to the decomposition (5.4), as described in chapter 1. Explicitly, elements e_1 and e_2 in Z/N can be found such that
e_1 ≡ 1 mod R,  e_1 ≡ 0 mod S,   (5.5)

e_2 ≡ 0 mod R,  e_2 ≡ 1 mod S.   (5.6)

Then

e_1^2 ≡ e_1 mod N,  e_2^2 ≡ e_2 mod N,   (5.7)

e_1 e_2 ≡ 0 mod N,   (5.8)

e_1 + e_2 ≡ 1 mod N.   (5.9)
Using these properties, we can prove that the mapping
φ(a_1, a_2) ≡ a_1 e_1 + a_2 e_2 mod N,  0 ≤ a_1 < R, 0 ≤ a_2 < S   (5.10)
is a ring-isomorphism of Z/R × Z/S onto Z/N. We will use (5.10) to define a permutation π of Z/N. First, by (5.10), each a ∈ Z/N can be written uniquely as

a ≡ a_1 e_1 + a_2 e_2 mod N,  0 ≤ a_1 < R, 0 ≤ a_2 < S.

Since we also have that a ∈ Z/N can be written uniquely as

a = a_2 + a_1 S,  0 ≤ a_1 < R, 0 ≤ a_2 < S,
a permutation ir of Z/N can be defined by the formula
π(a_2 + a_1 S) ≡ a_1 e_1 + a_2 e_2 mod N,  0 ≤ a_1 < R, 0 ≤ a_2 < S.   (5.11)
Order the indexing set Z/N by π:

0, e_2, …, (S − 1)e_2,
e_1, e_1 + e_2, …, e_1 + (S − 1)e_2,
⋮
(R − 1)e_1, (R − 1)e_1 + e_2, …, (R − 1)e_1 + (S − 1)e_2,
and denote the corresponding permutation matrix by Q. Then the matrix
F_π = Q F(N) Q^{−1},   (5.12)

is given by

F_π = [w^{π(a)π(b)}]_{0≤a,b<N},  w = e(2πi/N).   (5.13)

We will now explicitly describe F_π. First an example will be considered.
5.3 An Example, N = 15
Take R = 3 and S = 5. The idempotents are

e_1 = 10,  e_2 = 6.

The permutation π of (5.11) is given as
π = {0, 6, 12, 3, 9; 10, 1, 7, 13, 4; 5, 11, 2, 8, 14}.
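The idempotents and the permutation π are easy to generate mechanically. The following Python sketch (our illustration of (5.5), (5.6) and (5.11), not from the book) reproduces this listing:

```python
def idempotents(R, S):
    # e1 = 1 mod R, e1 = 0 mod S; e2 = 0 mod R, e2 = 1 mod S  -- (5.5)-(5.6)
    N = R * S
    e1 = next(a for a in range(N) if a % R == 1 and a % S == 0)
    e2 = (1 - e1) % N        # by e1 + e2 = 1 mod N, (5.9)
    return e1, e2

def crt_perm(R, S):
    # pi(a2 + a1*S) = a1*e1 + a2*e2 mod N  -- (5.11)
    N = R * S
    e1, e2 = idempotents(R, S)
    return [(a1 * e1 + a2 * e2) % N for a1 in range(R) for a2 in range(S)]

# idempotents(3, 5) -> (10, 6)
# crt_perm(3, 5) -> [0, 6, 12, 3, 9, 10, 1, 7, 13, 4, 5, 11, 2, 8, 14]
```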
We distinguish three blocks by the following notation:
A = (0, 6, 12, 3, 9),
B = (10, 1, 7, 13, 4),
C = (5, 11, 2, 8, 14).
Each block begins with a different multiple of e_1. Corresponding to the nine Cartesian products of these blocks, we have nine submatrices of F_π. Consider first the submatrix corresponding to A × A:
[1  1     1     1     1
 1  w^6   w^12  w^3   w^9
 1  w^12  w^9   w^6   w^3      w = e(2πi/15).   (5.14)
 1  w^3   w^6   w^9   w^12
 1  w^9   w^3   w^12  w^6],
Setting u = e(2πi/5), we can rewrite (5.14) as

[1  1    1    1    1
 1  u^2  u^4  u    u^3
 1  u^4  u^3  u^2  u       u = e(2πi/5).   (5.15)
 1  u    u^2  u^3  u^4
 1  u^3  u    u^4  u^2],
We denote (5.15) by F_5 and, following Temperton [5], call it a rotated five-point FT. We can relate F_5 to the five-point FT matrix,

F(5) = [1  1    1    1    1
        1  u    u^2  u^3  u^4
        1  u^2  u^4  u    u^3     (5.16)
        1  u^3  u    u^4  u^2
        1  u^4  u^3  u^2  u],

in two ways. First, if in F(5) we replace u by w^{e_2} = u^2, F(5) becomes F_5.
An algorithm computing the action of F(5) can be modified to compute the action of F5 by determining the consequences of this replacement through the different stages of the algorithm. Second,
F_5 = P F(5),   (5.17)

where

P = [1 0 0 0 0
     0 0 1 0 0
     0 0 0 0 1
     0 1 0 0 0
     0 0 0 1 0].

Since both F_5 and F(5) are symmetric matrices and P^T = P^{−1}, taking the transpose on both sides of (5.17) gives

F_5 = F(5) P^{−1}.
Consider the submatrix corresponding to the Cartesian product B × B. Direct computation from (5.13) shows that this submatrix is

[w^10  w^10  w^10  w^10  w^10
 w^10  w     w^7   w^13  w^4
 w^10  w^7   w^4   w     w^13       (5.18)
 w^10  w^13  w     w^4   w^7
 w^10  w^4   w^13  w^7   w].

Factoring out w^10, we can rewrite (5.18) as

v^2 F_5,
where v = e(2πi/3) = w^5. Continuing in this way, we have

F_π = [F_5  F_5      F_5
       F_5  v^2 F_5  v F_5
       F_5  v F_5    v^2 F_5],

which we can rewrite as

F_π = F_3 ⊗ F_5,   (5.19)

where

F_3 = [1  1    1
       1  v^2  v
       1  v    v^2].
Since
F_3 = P' F(3),   (5.20)

where

P' = [1 0 0
      0 0 1
      0 1 0],

F_3 can be formed from F(3) by replacing v in F(3) by w^{e_1} = v^2. Putting (5.19) into (5.12), we have
F(15) = Q^{−1}(F_3 ⊗ F_5)Q,   (5.21)
where the permutation matrix Q is given by (5.11). Matrix Q, although not strictly a stride permutation, has a circular structure. We begin by stride-6 mod 15 from the index 0,
0, 6, 12, 3, 9,
but in the second stage, instead of beginning at the index 1, we begin at the index 10 and stride-6 mod 15,
10, 1, 7, 13, 4.
In the last sweep, we begin at 5 and stride-6 mod 15,
5, 11, 2, 8, 14.
In [5] and [6], Temperton discusses the direct implementation of rotated FTs.
We can use (5.17) and (5.20) to rewrite (5.21) as
F(15) = Q^{−1}(P' F(3) ⊗ P F(5))Q
      = Q^{−1}(P' ⊗ P)(F(3) ⊗ F(5))Q.
Setting

Q_1 = Q^{−1}(P' ⊗ P),

we have that

F(15) = Q_1 (F(3) ⊗ F(5))Q.   (5.22)
From (5.22), additional factorizations of F(15) can be provided. In all cases, after an initial input permutation, we compute the action of F(3) 0 F(5), then permute the output to obtain the natural order. Generally, input and output permutations are not the same and are more complicated than those discussed above for direct implementation of (5.21). We notice, however, that no twiddle factor appears in (5.22).
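Factorization (5.19) can also be verified numerically. The sketch below (ours; numpy assumed) builds F(15), conjugates it by the permutation π, and compares against the tensor product of the rotated FTs:

```python
import numpy as np

R, S = 3, 5
N = R * S
e1, e2 = 10, 6                                   # idempotents for 15 = 3 * 5
pi = [(a1 * e1 + a2 * e2) % N for a1 in range(R) for a2 in range(S)]

w = np.exp(2j * np.pi / N)
F15 = w ** np.outer(np.arange(N), np.arange(N))  # F(15)
F_pi = F15[np.ix_(pi, pi)]                       # Q F(15) Q^{-1}, as in (5.12)

F3 = (w ** e1) ** np.outer(np.arange(R), np.arange(R))  # rotated 3-point FT
F5 = (w ** e2) ** np.outer(np.arange(S), np.arange(S))  # rotated 5-point FT

assert np.allclose(F_pi, np.kron(F3, F5))        # F_pi = F_3 (x) F_5
```

The check succeeds because (je_1 + le_2)(ke_1 + me_2) ≡ jke_1 + lme_2 mod 15, exactly the block structure derived above.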
5.4 Good-Thomas PFA for the General Case
Returning to the permutation 7r of section 2, we set
A_0 = {0, e_2, …, (S − 1)e_2},   (5.23)

A_1 = {e_1, e_1 + e_2, …, e_1 + (S − 1)e_2},   (5.24)

⋮

A_{R−1} = {(R − 1)e_1, (R − 1)e_1 + e_2, …, (R − 1)e_1 + (S − 1)e_2}.   (5.25)

We now can write π = (A_0, A_1, …, A_{R−1}).
Consider the submatrix of F_π corresponding to the Cartesian product A_j × A_k.
A typical component of this submatrix is given by forming first the product mod N,

(je_1 + le_2)(ke_1 + me_2),  0 ≤ l, m < S,

which by (5.7)-(5.9) can be written as

jke_1 + lme_2,  0 ≤ l, m < S.

The (l, m) coefficient of this submatrix is

(w^{e_1})^{jk}(w^{e_2})^{lm},  0 ≤ l, m < S.
Set

F_S = [(w^{e_2})^{lm}]_{0≤l,m<S},  w = e(2πi/N).

The submatrix corresponding to A_j × A_k is

(w^{e_1})^{jk} F_S.
Continuing in this way,

F_π = F_R ⊗ F_S,   (5.26)

where F_R = [(w^{e_1})^{jk}]_{0≤j,k<R}.
The matrices F_R and F_S are called rotated FTs by Temperton. By (5.5) and (5.6), w^{e_1} is a primitive R-th root of unity and w^{e_2} is a primitive S-th root of unity. To see this, set

v = e(2πi/R),  u = e(2πi/S).

Since e_1 = f_1 S for some f_1 with (f_1, R) = 1, and

w^{e_1} = v^{f_1},

w^{e_1} is a primitive R-th root of unity. The corresponding result for w^{e_2} is proved in the same way. It follows that we can write
F_R = P_1 F(R),   (5.27)

where P_1 is an R × R permutation matrix, and

F_S = P_2 F(S),   (5.28)

where P_2 is an S × S permutation matrix. Placing (5.27) and (5.28) in (5.26), we have

F_π = (P_1 ⊗ P_2)(F(R) ⊗ F(S)),

and, by (5.12),

F(N) = Q^{−1}(P_1 ⊗ P_2)(F(R) ⊗ F(S))Q.   (5.29)
It follows that, to compute the action of F(N), we begin with the input permutation Q, compute the action of the tensor product F(R) ⊗ F(S), and complete the computation by arranging the output in its natural order by the permutation Q^{−1}(P_1 ⊗ P_2). Other factorizations can be obtained by taking the transpose on both sides of (5.27),

F_R = F(R) P_1^{−1},   (5.30)
and placing (5.30) rather than (5.27) in (5.26). If modules implementing the tensor product F_R ⊗ F_S exist, then the data flow of the computation of F(N) is given by the permutation Q. The permutation Q can be viewed as follows. We begin at the index point 0 and stride by e_2. This process continues, where at each new stage we begin at the index point given by a multiple of e_1. Since F(R) ⊗ F(S) can be viewed as S actions of F(R) followed by R actions of F(S), the arithmetic of (5.29) is given by the formulas

a(N) = S a(R) + R a(S),
m(N) = S m(R) + R m(S),

where the algorithms computing the R-point and S-point FTs require a(R) and a(S) additions and m(R) and m(S) multiplications, respectively. The multiplications required by the twiddle factor in the additive algorithms are no longer necessary.
Formula (5.29) can be generalized to several factors. Suppose that N = n_1 n_2 = m_1 m_2 n_2, where n_1 = m_1 m_2, (m_1, m_2) = 1. We can write

F(n_1) = R'(F(m_1) ⊗ F(m_2))R,   (5.31)

where R and R' are permutation matrices. Placing (5.31) into (5.29), we have

F(N) = R'''(F(m_1) ⊗ F(m_2) ⊗ F(n_2))R'',   (5.32)

where R'' and R''' are permutation matrices. In subsequent chapters, we will combine the Good-Thomas PFA with multiplicative FT algorithms to produce several FT algorithms having distinct data flow and arithmetic. (See [3].)
5.5 Self-Sorting PFA
Burrus and Eschenbacher [1] point out that the Good-Thomas PFA can be computed in-place and in-order. Temperton [4, 5, 6] discusses the implementation of PFAs on different computer architectures, especially on the CRAY. He shows that the indexing required for the PFA is actually simpler than that for the conventional Cooley-Tukey FFT algorithm. Temperton implemented the RS-point FT using F_R and F_S directly. The indexing for input and output data in this case is the same.
Consider the following example. For N = 42 = 6 · 7, the corresponding system of idempotents is {e_1, e_2} = {7, 36}. The mapping given in (5.23)-(5.25) can be described by the two-dimensional array

 0  7 14 21 28 35
36  1  8 15 22 29
30 37  2  9 16 23
24 31 38  3 10 17
18 25 32 39  4 11
12 19 26 33 40  5
 6 13 20 27 34 41
We can implement this by the simple code,

      INTEGER I(R)
      DATA I/0,7,14,21,28,35/
Updating the indexing for each subsequent transform is achieved by a simple auto-increment addressing mode,
      DO 100 L=1,S
      J=I(R)+1
      DO 200 K=R,2,-1
      I(K)=I(K-1)+1
 200  CONTINUE
      I(1)=J
 100  CONTINUE
The code requires no IF statements or address computation mod N. Temperton [5] describes in detail the minimum-add rotated discrete Fourier transform modules for sizes 2, 3, 4, 5, 7, 8, 9 and 16.
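A Python transcription of the same auto-increment update (ours, mirroring the Fortran above for N = 42, R = 6, S = 7) shows that the loop regenerates the whole indexing table:

```python
def pfa_index_rows(R, S, first_row):
    # Repeatedly shift the index vector: new I(1) = old I(R) + 1 and
    # new I(K) = old I(K-1) + 1 -- no IF tests and no mod-N arithmetic.
    rows = [list(first_row)]
    I = list(first_row)
    for _ in range(S - 1):
        j = I[R - 1] + 1
        for k in range(R - 1, 0, -1):
            I[k] = I[k - 1] + 1
        I[0] = j
        rows.append(list(I))
    return rows

rows = pfa_index_rows(6, 7, [0, 7, 14, 21, 28, 35])
# `rows` reproduces the 7 x 6 indexing array for N = 42 row by row.
```

Each row m equals (36m + 7j) mod 42 for j = 0, …, 5, yet no modular reduction is ever performed, which is the point of Temperton's scheme.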
References
[1] Burrus, C. S. and Eschenbacher, P. W. "An In-place In-order Prime Factor FFT Algorithm", IEEE Trans. Acoust., Speech and Signal Proc., 29, 1981, pp. 806-817.
[2] Good, I. J. "The Interaction Algorithm and Practical Fourier Analysis", J. Royal Statist. Soc., Ser. B20, 1958, pp. 361-375.
[3] Kolba, D. P. and Parks, T. W. "A Prime Factor FFT Algorithm Using High-Speed Convolution", IEEE Trans. Acoust., Speech and Signal Proc., 25, 1977.
[4] Temperton, C. "A Note on Prime Factor FFT Algorithms", J. Comput. Phys., 52, 1983, pp. 198-204.
[5] Temperton, C. "A New Set of Minimum-add Small-n Rotated DFT Modules", to appear in J. Comput. Phys.
[6] Temperton, C. "Implementation of A Prime Factor FFT Algorithm on CRAY-1", to appear in Parallel Computing.
[7] Temperton, C. "A Self-Sorting In-place Prime Factor Real/Half-complex FFT Algorithm", to appear in J. Comput. Phys.
[8] Thomas, L. H. "Using a Computer to Solve Problems in Physics", in Applications of Digital Computers, Ginn and Co., 1963.
[9] Chu, S. and Burrus, C. S. "A Prime Factor FFT Algorithm Using Dis-tributed Arithmetic", IEEE Trans. Acoust., Speech and Signal Proc., 30(2), April 1982, pp. 217-227.
Problems
1. Find the system of idempotents of N = 2 • 3, and define the permutation matrix Q as in section 2.
2. Find the system of idempotents of N = 4 • 5, and define the permutation matrix Q as in section 2.
3. Find F2 and F3 for a six-point Good-Thomas PFA based on the idempotents of problem 1.
4. Find F4 and F5 for a 20-point Good-Thomas PFA based on the idempotents of problem 2.
5. Give arithmetic counts for problems 3 and 4 by direct computation of F_2, F_3, F_4 and F_5.
6. Give arithmetic counts for 6-point and 20-point Cooley-Tukey FFT algorithms, where F(2), F(3), F(4) and F(5) are carried out by direct computation. Compare with those of problem 5.
7. Derive a Good-Thomas PFA for N = 75, and give F_3 and F_25.
8. Derive a Good-Thomas PFA for N = 100, and give F4 and F25 . Derive the Cooley-Tukey FFT algorithm with a factorization of 100 = 10-10. Compare the arithmetic counts of these two algorithms.
9. Give the self-sorting indexing table for N = 40 = 5 · 8 as in section 5.
6 Linear and Cyclic Convolutions
Linear convolution is one of the most frequent computations carried out in digital signal processing (DSP). The standard method for computing a linear convolution is to zero-pad, turning the linear convolution into a cyclic convolution, and to use the convolution theorem, which replaces the cyclic convolution by an FT of the corresponding size. In the last ten years, theoretically better convolution algorithms have been developed. The Winograd Small Convolution algorithm [1] is the most efficient as measured by the number of multiplications.
First, we derive the convolution theorem by two different methods. The second method is based on the CRT for polynomials. A special case of the CRT is then applied in a more general setting to derive the Cook-Toom [2] algorithm. The generalized (polynomial) CRT is then used to derive the Winograd Small Convolution algorithm. We emphasize the interplay between linear and cyclic convolution computations.
6.1 Definitions
Consider vectors h and g of sizes M and N. The linear convolution of h and g is the vector s of size L = M + N − 1 defined by

s_k = Σ_{n=0}^{k} h_{k−n} g_n,  0 ≤ k < L,

where we take h_m = 0 if m ≥ M and g_n = 0 if n ≥ N.
Example 6.1 The linear convolution s of a vector h of size 2 and a vector g of size 3 is given by
s_0 = h_0 g_0,

s_1 = h_1 g_0 + h_0 g_1,

s_2 = h_1 g_1 + h_0 g_2,

s_3 = h_1 g_2.
Associate the polynomial h(x) of degree M − 1 to the vector h of size M,

h(x) = h_0 + h_1 x + ⋯ + h_{M−1} x^{M−1}.

Direct computation shows that the defining formula for linear convolution is equivalent to the polynomial product

s(x) = h(x)g(x).
The representation of linear convolution by polynomial product permits the application of results in polynomial rings, especially the CRT.
Example 6.2 Consider the linear convolution s of a vector h of size 3 and a vector g of size 4. By definition,
s_0 = h_0 g_0,

s_1 = h_1 g_0 + h_0 g_1,

s_2 = h_2 g_0 + h_1 g_1 + h_0 g_2,

s_3 = h_2 g_1 + h_1 g_2 + h_0 g_3,

s_4 = h_2 g_2 + h_1 g_3,

s_5 = h_2 g_3.
The linear convolution can be described by matrix multiplication:

s = [h_0  0    0    0
     h_1  h_0  0    0
     h_2  h_1  h_0  0
     0    h_2  h_1  h_0
     0    0    h_2  h_1
     0    0    0    h_2] g.

In general, if s is the linear convolution of a vector h of size M and a vector g of size N, we can write

s = Hg,
where H is the L × N, L = M + N − 1, matrix

H = [h_0      0        ⋯  0
     h_1      h_0         ⋮
     ⋮        h_1      ⋱  0
     h_{M−1}  ⋮        ⋱  h_0
     0        h_{M−1}     h_1
     ⋮                 ⋱  ⋮
     0        0        ⋯  h_{M−1}].
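The definition and the matrix form above can be sketched in a few lines of Python (ours, for illustration only):

```python
def lin_conv(h, g):
    # s_k = sum_n h_{k-n} g_n, with out-of-range taps taken as zero
    M, N = len(h), len(g)
    s = [0] * (M + N - 1)
    for k in range(M + N - 1):
        for n in range(N):
            if 0 <= k - n < M:
                s[k] += h[k - n] * g[n]
    return s

def conv_matrix(h, N):
    # the L x N matrix H: column n is h shifted down by n places
    M = len(h)
    return [[h[k - n] if 0 <= k - n < M else 0 for n in range(N)]
            for k in range(M + N - 1)]
```

For h of size 3 and g of size 4, applying `conv_matrix(h, 4)` to g reproduces the six formulas of example 6.2.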
Consider two vectors a and b of size N. The cyclic convolution c of a and b, denoted by a ∗ b, is the vector of size N defined by the formula

c_k = Σ_{n=0}^{N−1} a_{k−n} b_n,  0 ≤ k < N.

The indices of the vectors are taken mod N.
Example 6.3 The cyclic convolution c of two vectors a and b of size 3 is given by
c_0 = a_0 b_0 + a_2 b_1 + a_1 b_2,

c_1 = a_1 b_0 + a_0 b_1 + a_2 b_2,

c_2 = a_2 b_0 + a_1 b_1 + a_0 b_2.

Observe that a_{−1} = a_2 and a_{−2} = a_1.
In chapter 1, we discussed the quotient polynomial ring

C[x]/(x^N − 1)   (6.1)

consisting of the set of all polynomials of degree less than N, where addition and multiplication are taken as polynomial addition and multiplication mod (x^N − 1).
Example 6.4 Consider two polynomials
a(x) = a_0 + a_1 x + a_2 x^2,

b(x) = b_0 + b_1 x + b_2 x^2.
The product is
a(x)b(x) = a_0 b_0 + (a_1 b_0 + a_0 b_1)x + (a_2 b_0 + a_1 b_1 + a_0 b_2)x^2 + (a_2 b_1 + a_1 b_2)x^3 + a_2 b_2 x^4.
This is the linear convolution. The product
c(x) ≡ a(x)b(x) mod (x^3 − 1)
is formed by setting x^3 = 1 in the expansion of the product a(x)b(x). We find that the coefficients of c(x) are given by

c_0 = a_0 b_0 + a_2 b_1 + a_1 b_2,

c_1 = a_1 b_0 + a_0 b_1 + a_2 b_2,

c_2 = a_2 b_0 + a_1 b_1 + a_0 b_2.
Thus, multiplication in the ring

C[x]/(x^3 − 1)
computes the 3 × 3 cyclic convolution. In general, multiplication in the ring (6.1) computes the N × N cyclic convolution. To see this, consider polynomials a(x) and b(x) of degree less than N, and compute the product a(x)b(x),
a(x)b(x) = Σ_{n=0}^{2N−2} ( Σ_{k=0}^{n} a_{n−k} b_k ) x^n,   (6.2)

where a_n = b_n = 0 whenever n ≥ N. Setting x^N = 1 in (6.2), we have

c(x) = Σ_{n=0}^{N−1} c_n x^n ≡ a(x)b(x) mod (x^N − 1),   (6.3)

where

c_n = Σ_{k=0}^{n} a_{n−k} b_k + Σ_{k=n+1}^{N−1} a_{n+N−k} b_k   (6.4)

    = Σ_{k=0}^{N−1} a_{n−k} b_k.
In (6.4), the indices are taken mod N. By definition, we see that (6.3) computes the N × N cyclic convolution. An important outcome of the discussion is that the N × N cyclic convolution can be computed by first computing the linear convolution as a polynomial product and then setting x^N = 1.
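The linear-then-reduce procedure can be sketched as follows (our Python illustration of (6.2)-(6.4)):

```python
def cyclic_conv(a, b):
    # Multiply a(x)b(x) as polynomials, then set x^N = 1 by folding the
    # coefficient of x^{n+N} onto that of x^n, as in (6.3)-(6.4).
    N = len(a)
    lin = [0] * (2 * N - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            lin[i + j] += ai * bj
    return [lin[n] + (lin[n + N] if n + N < 2 * N - 1 else 0)
            for n in range(N)]

def cyclic_conv_direct(a, b):
    # c_k = sum_n a_{(k-n) mod N} b_n, the definition
    N = len(a)
    return [sum(a[(k - n) % N] * b[n] for n in range(N)) for k in range(N)]
```

Both routines agree; for a = (1, 2, 3) and b = (4, 5, 6) they give (31, 31, 28).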
As with linear convolution, cyclic convolution also can be expressed by matrix multiplication.
Example 6.5 Returning to example 6.3, we can write
c = Cb,
where

C = [a_0  a_2  a_1
     a_1  a_0  a_2
     a_2  a_1  a_0].
The matrix C is an example of a circulant matrix, which we will define below. If S denotes the cyclic shift matrix

S = [0 0 1
     1 0 0
     0 1 0],

then

S^2 = [0 1 0
       0 0 1
       1 0 0],  S^3 = I_3.

We can write the matrix C in the form

C = a_0 I_3 + a_1 S + a_2 S^2.
The N × N cyclic shift matrix S is defined by the rule

Sx = [x_{N−1}
      x_0
      ⋮
      x_{N−2}].
Observe that S^N = I_N. By an N × N circulant matrix we mean any matrix of the form

C = Σ_{n=0}^{N−1} a_n S^n.   (6.5)

At times, we will denote the dependence of C on a by writing C(a):

C(a) = [a_0      a_{N−1}  ⋯  a_1
        a_1      a_0      ⋯  a_2
        ⋮                 ⋱  ⋮
        a_{N−1}  a_{N−2}  ⋯  a_0].
Example 6.6 The 4 × 4 cyclic shift matrix is

S = [0 0 0 1
     1 0 0 0
     0 1 0 0
     0 0 1 0].
Notice that

S^2 = [0 0 1 0
       0 0 0 1
       1 0 0 0
       0 1 0 0],  S^3 = [0 1 0 0
                         0 0 1 0
                         0 0 0 1
                         1 0 0 0],

and S^4 = I_4. The 4 × 4 circulant matrix,

C = a_0 I_4 + a_1 S + a_2 S^2 + a_3 S^3,

is

C = [a_0  a_3  a_2  a_1
     a_1  a_0  a_3  a_2
     a_2  a_1  a_0  a_3
     a_3  a_2  a_1  a_0].
As we read from left to right, the columns of C are cyclically shifted:

C = [a  Sa  S^2a  S^3a].
Example 6.7 Denote by e_n, 0 ≤ n < N, the vector of size N consisting of all zeros except for a 1 in the n-th place. Observe that

e_0 ∗ b = b,

e_1 ∗ b = Sb,

⋮

e_{N−1} ∗ b = S^{N−1} b.
Consider the N × N cyclic convolution c = a ∗ b. Writing

a = Σ_{n=0}^{N−1} a_n e_n,

we have

a ∗ b = Σ_{n=0}^{N−1} a_n (e_n ∗ b),

which by example 6.7 can be rewritten as

a ∗ b = Σ_{n=0}^{N−1} a_n S^n b.

By (6.5), a ∗ b = C(a)b.
The N × N cyclic convolution c = a ∗ b can be computed by multiplication in the quotient polynomial ring C[x]/(x^N − 1),

c(x) ≡ a(x)b(x) mod (x^N − 1),

or by circulant matrix multiplication,

c = C(a)b.   (6.6)
Direct computation from (6.6) shows that
C(a* b) = C(a)C(b).
More generally, we can prove that the set of all N × N circulant matrices is a ring under matrix addition and multiplication and is isomorphic to the quotient polynomial ring C[x]/(x^N − 1).
6.2 Convolution Theorem
The N × N cyclic convolution can be computed using N-point FTs. This is especially convenient when efficient algorithms for N-point FTs are available. The result that permits this interchange is the convolution theorem. We will give two proofs. The first depends on the representation of cyclic convolution as a matrix product by a circulant matrix. We will soon see that the FT matrix diagonalizes every circulant matrix. The second proof uses the representation of cyclic convolution as multiplication in the quotient polynomial ring C[x]/(x^N − 1).
Example 6.8 Set F = F(3). Denote by S the 3 × 3 cyclic shift matrix and by D the matrix

D = [1  0  0
     0  v  0
     0  0  v^2],  v = e(2πi/3).

Then

FS = [1    1    1
      v    v^2  1
      v^2  v    1] = DF,

which implies that

FSF^{−1} = D

and F diagonalizes S. In addition,

FS^2F^{−1} = (FSF^{−1})^2 = D^2,

and F diagonalizes S^2.
An arbitrary 3 × 3 circulant matrix is of the form

C(a) = a_0 I_3 + a_1 S + a_2 S^2.
Since F diagonalizes each term of this sum, it diagonalizes C(a),

FC(a)F^{−1} = a_0 I_3 + a_1 D + a_2 D^2.
Writing this out, we have

F [a_0  a_2  a_1
   a_1  a_0  a_2
   a_2  a_1  a_0] F^{−1} = [g_0  0    0
                            0    g_1  0
                            0    0    g_2],
where
g_0 = a_0 + a_1 + a_2,

g_1 = a_0 + v a_1 + v^2 a_2,

g_2 = a_0 + v^2 a_1 + v a_2.
We see that

FC(a)F^{−1} = diag(g),

where g = F(3)a.
We can extend this argument to prove that

F(N)SF(N)^{−1} = D,

where S is the N × N cyclic shift matrix and D is the diagonal matrix

D = diag(1, v, v^2, …, v^{N−1}),  v = e(2πi/N).
It follows that F(N)S^kF(N)^{−1} = D^k.

Thus,

F(N)C(a)F(N)^{−1} = diag(g),   (6.7)

where g = F(N)a. In words, the N-point FT matrix F(N) diagonalizes every N × N circulant matrix C(a).
Set
G(a) = diag(Fa),
where F = F(N). We can rewrite (6.7) as

C(a) = F^{−1}G(a)F.
Since a* b = C(a)b,
we have the following theorem.
Theorem 6.1 For vectors a and b of size N,
a ∗ b = F(N)^{−1}G(a)F(N)b,
where G(a) = diag(F(N)a).
Theorem 6.1 determines an algorithm computing the cyclic convolution a* b:
1. Compute F(N)b.
2. Compute F(N)a.
3. Compute the componentwise product (F(N)a)(F(N)b).
4. Compute F(N)^{−1}((F(N)a)(F(N)b)).
The componentwise product can be described by
(F(N)a)(F(N)b) = G(a)F(N)b,
where G(a) = diag(F(N)a). The nonsymmetric roles of a and b in this computation should be emphasized. In standard applications to digital filters, we fix the vector a (the elements of a linear system) and then compute the cyclic convolution a ∗ b for many input vectors b. As a consequence, the diagonal matrix G(a) can be precomputed and does not enter into the arithmetic cost of the process.
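With a library FFT standing in for F(N), steps 1-4 take one line each. A numpy sketch (ours; numpy's FFT uses the kernel of the opposite sign, which does not affect the theorem):

```python
import numpy as np

def cyclic_conv_fft(a, b):
    # Steps 1-4 of the algorithm above: transform both inputs,
    # multiply componentwise, inverse transform.
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
c = cyclic_conv_fft(a, b).real       # -> [31., 31., 28.]
```

When a is fixed, `np.fft.fft(a)` plays the role of the precomputed diagonal G(a).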
The second proof uses the CRT to 'diagonalize' multiplication in the ring C[x]/(x^N − 1). Consider the factorization

x^N − 1 = ∏_{n=0}^{N−1} (x − v^n),  v = e(2πi/N).   (6.8)
Applying the CRT to (6.8), we have that the mapping
a(x) → [a(1)
        a(v)
        ⋮
        a(v^{N−1})]
establishes a ring-isomorphism,
N-1
C [X1/(XN - 1) c,
where iiTT.N=-01 C denotes the ring-direct product of N copies of C with com-ponentwise addition and multiplication. In particular, a polynomial a(x) of degree < N is uniquely determined by the values a(vn), 0 < n < N.
Consider two polynomials a(x) and b(x) of degree < N, and set c(x) ≡ a(x)b(x) mod (x^N − 1). The polynomial c(x) is uniquely determined by the values c(v^n), 0 ≤ n < N. Since (v^n)^N = 1,

c(v^n) = a(v^n)b(v^n),  0 ≤ n < N.   (6.9)
We see that multiplication mod (x^N − 1) can be computed by N complex multiplications (6.9) along with some mechanism that translates between C[x]/(x^N − 1) and ∏_{n=0}^{N−1} C. In fact, this mechanism is the N-point FT, and (6.9) is a disguised form of the convolution theorem. To see this, observe that
a(1) = Σ_{n=0}^{N−1} a_n,

a(v) = Σ_{n=0}^{N−1} a_n v^n,

⋮

a(v^{N−1}) = Σ_{n=0}^{N−1} a_n v^{n(N−1)},

which implies that

[a(1), a(v), …, a(v^{N−1})]^T = F(N)a,   (6.10)

where a is the vector of length N associated to the polynomial a(x). Placing (6.10) into (6.9), we have
F(N)c = (F(N)a)(F(N)b),   (6.11)

where the right-hand side is a componentwise product.
6.3 Cook-Toom Algorithm
The derivation of the convolution theorem using the CRT admits important generalizations that can be used to design algorithms for computing linear and cyclic convolutions. The simplest is the Cook-Toom algorithm, which we discuss in this section.
Take N distinct complex numbers
{α_0, α_1, …, α_{N−1}},
and form a polynomial
m(x) = (x − α_0)(x − α_1) ⋯ (x − α_{N−1}).
We will begin by designing algorithms to compute polynomial multiplication mod m(x) or, equivalently, multiplication in the quotient polynomial ring

C[x]/m(x).   (6.12)
Applying the CRT as in the preceding section, the mapping

a(x) → [a(α_0)
        a(α_1)
        ⋮
        a(α_{N−1})]
establishes a ring-isomorphism,
C[x]/m(x) ≅ ∏_{n=0}^{N−1} C,
with the result that a polynomial a(x) of degree < N is uniquely determined by the values a(α_n), 0 ≤ n < N. To see how to recover a(x) from these values, write

a(α_0) = a_0 + a_1 α_0 + ⋯ + a_{N−1} α_0^{N−1},

⋮

a(α_{N−1}) = a_0 + a_1 α_{N−1} + ⋯ + a_{N−1} α_{N−1}^{N−1},

where a(x) = Σ_{n=0}^{N−1} a_n x^n. In matrix form, this becomes

[a(α_0)
 a(α_1)
 ⋮
 a(α_{N−1})] = Wa,
where a is the vector of components of a(x) and W is the Vandermonde matrix

W = [1  α_0      ⋯  α_0^{N−1}
     1  α_1      ⋯  α_1^{N−1}
     ⋮                ⋮
     1  α_{N−1}  ⋯  α_{N−1}^{N−1}].
Since the elements α_n, 0 ≤ n < N, are distinct, the matrix W is invertible, so that we can recover a(x) from

a = W^{−1} [a(α_0)
            a(α_1)
            ⋮
            a(α_{N−1})].
Consider two polynomials a(x) and b(x) in the quotient polynomial ring (6.12). Set

c(x) ≡ a(x)b(x) mod m(x).
By the CRT ring-isomorphism,

c(α_n) = a(α_n)b(α_n),  0 ≤ n < N,

and we have

c = W^{−1}((Wa)(Wb)),   (6.13)
where (Wa)(Wb) denotes componentwise multiplication. Equation (6.13) generalizes the convolution theorem.
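Equation (6.13) is directly implementable once the evaluation points are fixed. A numpy sketch (ours; `cook_toom_lin` and `points` are our names) that also handles inputs of different sizes via the submatrices W_M, W_N introduced below:

```python
import numpy as np

def cook_toom_lin(h, g, points):
    # s = W^{-1}((W_M h)(W_N g)) with W the Vandermonde matrix on `points`
    L = len(h) + len(g) - 1                  # size of the linear convolution
    assert len(points) == L                  # need L distinct evaluation points
    W = np.vander(np.array(points, dtype=float), L, increasing=True)
    WM = W[:, :len(h)]                       # evaluates h(x) at the points
    WN = W[:, :len(g)]                       # evaluates g(x) at the points
    return np.linalg.solve(W, (WM @ h) * (WN @ g))
```

With points 0, 1, −1 this computes the 2 × 2 linear convolution; for example, `cook_toom_lin([1, 2], [3, 4], [0, 1, -1])` returns the coefficients of (1 + 2x)(3 + 4x).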
Example 6.9 Take m(x) = x(x + 1). Then

W = W^{−1} = [1   0
              1  −1].
Example 6.10 Take m(x) = (x − 1)(x + 1). Then

W = F(2).
Example 6.11 Take m(x) = x(x − 1)(x + 1). Then

W = [1   0  0
     1   1  1
     1  −1  1].
Example 6.12 Take m(x) = x(x − 1)(x + 1)(x − 2). Then

W = [1   0  0   0
     1   1  1   1
     1  −1  1  −1
     1   2  4   8].
Equation (6.13) can be modified to design an algorithm for computing the linear convolution. Consider polynomials g(x) and h(x) of degrees N − 1 and M − 1, respectively. The linear convolution

s(x) = h(x)g(x)

has degree L − 1, where L = M + N − 1. Denote by g, h and s the vectors of sizes N, M and L corresponding to the polynomials g(x), h(x) and s(x). Take L distinct complex numbers α_l, 0 ≤ l < L, and form the polynomial of degree L

m(x) = ∏_{l=0}^{L−1} (x − α_l).

We call m(x) a reducing polynomial. The design of an efficient algorithm depends, to a large extent, on the choice of a 'good' reducing polynomial. Define submatrices of W by
W_M = [1  α_0      ⋯  α_0^{M−1}
       1  α_1      ⋯  α_1^{M−1}
       ⋮                ⋮
       1  α_{L−1}  ⋯  α_{L−1}^{M−1}]

and

W_N = [1  α_0      ⋯  α_0^{N−1}
       1  α_1      ⋯  α_1^{N−1}
       ⋮                ⋮
       1  α_{L−1}  ⋯  α_{L−1}^{N−1}].
Since

deg(s(x)) = L − 1 < deg(m(x)) = L,

we have

s(x) = s(x) mod m(x),

which means that we can compute the linear convolution s(x) by computing the product h(x)g(x) in C[x]/m(x). In fact, (6.13) can be applied. The vectors g and h can be identified with vectors in C^L by placing zeros in the last L − N and L − M coordinates. Since

W_N g = Wg,  W_M h = Wh,

we have

s = W^{−1}((W_M h)(W_N g)).
In the examples that follow, we assume that

α_l ∈ Z,  0 ≤ l < L,
and W, W^{−1} are rational matrices. Computing the actions of W and W^{−1} requires multiplications by rational numbers only. Complex number multiplications occur in forming the componentwise product

(W_M h)(W_N g),

which we can rewrite as a diagonal matrix multiplication

D(h)W_N g,

where D(h) = diag(W_M h).
In practice, the vector h is fixed over many computations, and the diagonal matrix D(h) is precomputed and is not counted in the arithmetic cost. In this case, computing the linear convolution can be carried out in the following three steps:

1. Compute W_N g. This part requires (N − 1)L additions and LN multiplications by rational numbers.

2. Compute D(h)W_N g. This part requires L multiplications.

3. Compute W^{−1}(D(h)W_N g). This part requires L(L − 1) additions and L^2 multiplications by rational numbers.
In summary, computing linear convolution with h fixed and W_M h precomputed requires

(L + N − 2)L additions,

(L + N)L multiplications by rationals,

L multiplications.
This should be compared with the straightforward computation of the M × N linear convolution, which requires

(N − 1)(M − 1) additions,

NM multiplications.
The arithmetic described here is for the general case. Significant reduction occurs if the numbers α_l, 0 ≤ l < L, are carefully chosen.
Example 6.13 Consider the 2 × 2 linear convolution, and take m(x) = x(x − 1)(x + 1). Then

s = [ 1  0    0
      0  1/2 −1/2
     −1  1/2  1/2] ( ( [1   0
                        1   1
                        1  −1] h ) ( [1   0
                                      1   1
                                      1  −1] g ) ).
We can write this out in the following sequence of steps:
0. Precompute

H_0 = h_0,  H_1 = h_0 + h_1,  H_2 = h_0 − h_1.

1. Compute

G_0 = g_0,  G_1 = g_0 + g_1,  G_2 = g_0 − g_1.

2. Compute

S_0 = H_0 G_0,  S_1 = H_1 G_1,  S_2 = H_2 G_2.

3. Compute

s_0 = S_0,  s_1 = (1/2)(S_1 − S_2),  s_2 = −S_0 + (1/2)(S_1 + S_2).
If we compute (1/2)H_1 and (1/2)H_2 in the precomputation stage, then the multiplications by 1/2 in step 3 can be eliminated. We see that five additions and three multiplications are required to carry out the computation, compared to one addition and four multiplications by straightforward methods. As is typical, multiplications are reduced at the expense of additions.
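Written out as code, the sequence of steps above reads (our sketch; exact rational arithmetic via the fractions module):

```python
from fractions import Fraction

def cook_toom_2x2(h, g):
    # 2 x 2 linear convolution via points 0, 1, -1: three multiplications.
    H0, H1, H2 = h[0], h[0] + h[1], h[0] - h[1]      # step 0 (precomputed)
    G0, G1, G2 = g[0], g[0] + g[1], g[0] - g[1]      # step 1
    S0, S1, S2 = H0 * G0, H1 * G1, H2 * G2           # step 2
    half = Fraction(1, 2)
    return [S0, half * (S1 - S2), -S0 + half * (S1 + S2)]  # step 3

# cook_toom_2x2([1, 2], [3, 4]) -> [3, 10, 8], i.e. (1 + 2x)(3 + 4x)
```

Folding the halves into H1, H2 at step 0, as the text suggests, removes the two rational multiplications from step 3.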
A better algorithm can be produced using the following modification. Consider again the linear convolution, s(x) = h(x)g(x), of polynomials h(x) and g(x) of degrees M − 1 and N − 1, respectively. Take L − 1 distinct numbers

α_l,  0 ≤ l < L − 1,

and form the polynomial

m'(x) = (x − α_0)(x − α_1) ⋯ (x − α_{L−2}).
Compute

c(x) ≡ h(x)g(x) mod m'(x).

The polynomial c(x) has degree at most L − 2. Since the difference between s(x) and

c(x) + h_{M−1} g_{N−1} m'(x)

is a polynomial of degree at most L − 2 having L − 1 roots, we can recover s(x) from c(x) by the formula

s(x) = c(x) + h_{M−1} g_{N−1} m'(x).   (6.14)
Now we compute s(x) in two stages:

1. Compute c(x) ≡ h(x)g(x) mod m'(x).

2. Compute s(x) by (6.14).
The modification above reduces the required additions without any change in the required multiplications. Computing s(x) by the above two stages is denoted by

s(x) ≡ h(x)g(x) mod m'(x)(x − ∞).
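The two-stage computation is easy to mechanize. The following numpy sketch (ours; `cook_toom_inf` is our name) evaluates at the L − 1 finite points and restores the leading coefficient by (6.14):

```python
import numpy as np

def cook_toom_inf(h, g, points):
    # Stage 1: c(x) = h(x)g(x) mod m'(x), m'(x) built from L-1 finite points.
    # Stage 2: s(x) = c(x) + h_{M-1} g_{N-1} m'(x), formula (6.14).
    M, N = len(h), len(g)
    L = M + N - 1
    assert len(points) == L - 1
    pts = np.array(points, dtype=float)
    W = np.vander(pts, L - 1, increasing=True)
    WM, WN = W[:, :M], W[:, :N]
    c = np.linalg.solve(W, (WM @ h) * (WN @ g))   # stage 1
    mp = np.poly(pts)[::-1]                       # m'(x), ascending coefficients
    return np.concatenate([c, [0.0]]) + h[-1] * g[-1] * mp
```

For example, `cook_toom_inf([1, 2], [3, 4], [0, -1])` recovers the 2 × 2 result of example 6.14 below, and `[0, 1, -1]` gives the 2 × 3 case of example 6.15.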
Example 6.14 Consider again the 2 × 2 linear convolution by taking m(x) = x(x + 1)(x − ∞) and computing

s(x) ≡ h(x)g(x) mod m(x).

First compute

c(x) ≡ h(x)g(x) mod x(x + 1)

by

c = [1   0
     1  −1] ( ( [1   0
                 1  −1] h ) ( [1   0
                               1  −1] g ) ).
Writing this out, we have the following sequence of steps:
0. H_0 = h_0,  H_1 = h_0 − h_1,

1. G_0 = g_0,  G_1 = g_0 − g_1,

2. S_0 = H_0 G_0,  S_1 = H_1 G_1,

3. c_0 = S_0,  c_1 = S_0 − S_1.
We assume that step 0 is precomputed. Steps 1-3 require two additions and two multiplications. We complete the computation by the following sequence of operations:
4. s2 = h1g1,
5. s0 = c0,  s1 = c1 + s2.
In the above algorithm, three additions and three multiplications are required, reducing the arithmetic by two additions as compared to example 6.10. The whole computation can be written in one matrix equation as follows:
s = [1 0 0; 1 -1 1; 0 0 1] (([1 0; 1 -1; 0 1] h)([1 0; 1 -1; 0 1] g)).
Example 6.15 Consider the 2 x 3 linear convolution and let
m(x) = x(x - 1)(x + 1)(x - ∞).
The computation of linear convolution
s(x) = h(x)g(x),
where deg(h(x)) = 1 and deg(g(x)) = 2, can be carried out by first computing
c(x) ≡ h(x)g(x) mod x(x - 1)(x + 1)
and then using the formula
s(x) = c(x) + h1g2 x(x - 1)(x + 1).
The first part is given by

c = [1 0 0; 0 1 -1; -1 1 1] (([1 0; 1/2 1/2; 1/2 -1/2] h)([1 0 0; 1 1 1; 1 -1 1] g)).
We carry out this computation as follows:
0. H0 = h0,  H1 = (h0 + h1)/2,  H2 = (h0 - h1)/2  (precomputed),
1. G0 = g0,  G1 = g0 + g1 + g2,  G2 = g0 - g1 + g2,
2. S0 = H0G0,  S1 = H1G1,  S2 = H2G2,
3. c0 = S0,  c1 = S1 - S2,  c2 = -S0 + S1 + S2.
This part requires six additions and three multiplications (as before, multiplications by 1/2 have been placed in the precomputation stage).
We complete the computation by
4. s3 = h1g2,
5. s0 = c0,  s1 = c1 - s3,  s2 = c2.
In one matrix equation,
s = [1 0 0 0; 0 1 -1 -1; -1 1 1 0; 0 0 0 1] (([1 0; 1/2 1/2; 1/2 -1/2; 0 1] h)([1 0 0; 1 1 1; 1 -1 1; 0 0 1] g)).

The small size linear convolutions described in the above examples can
be efficiently computed by the Cook-Toom algorithm. The factors of the reducing polynomials have roots 0, ±1, with the result that the matrices W_M and W_N have coefficients 0, ±1. The matrix W^{-1} is more complicated, but the rational multiplications can be carried out in the precomputation stage. This is a general result that will be discussed in section 5. As the size
of the problem grows, the roots of the reducing polynomials must contain large integers that appear along with their powers in the matrices Wm and WN. If the large integer multiplications are carried out by additions, then as the size of the problem grows, the number of required additions grows too large for practical implementation. In any case, the computation becomes less stable as the size grows [3]. In the next section, we will present efficient larger size algorithms using a generalization of the Cook-Toom algorithm.
The linear convolution can be used to compute multiplication in quotient polynomial rings. In section 4, we will use this approach to present cyclic convolution algorithms [5].
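This idea can be sketched directly (the helper names are ours): multiply by linear convolution, then reduce mod a monic m(x).

```python
# Multiplication in C[x]/m(x) via linear convolution followed by reduction.
# Polynomials are coefficient lists, lowest degree first.

def linear_convolution(h, g):
    s = [0] * (len(h) + len(g) - 1)
    for i, hi in enumerate(h):
        for j, gj in enumerate(g):
            s[i + j] += hi * gj
    return s

def poly_mod(s, m):
    # Reduce s(x) mod a monic m(x) by repeatedly cancelling the top term.
    s = list(s)
    while len(s) >= len(m):
        q = s[-1]  # m is monic, so the next quotient coefficient is s[-1]
        for k in range(len(m)):
            s[len(s) - len(m) + k] -= q * m[k]
        s.pop()
    return s

# (1 + 2x)(3 + 4x) mod x^2 + 1:
print(poly_mod(linear_convolution([1, 2], [3, 4]), [1, 0, 1]))  # [-5, 10]
```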
Example 6.16 We want to compute
c_i(x) ≡ h(x)g(x) mod m_i(x),

where

m1(x) = x^2,

m2(x) = x^2 + 1,

m3(x) = x^2 - 1,

m4(x) = x^2 + x + 1,

m5(x) = x^2 - x + 1.
Computing first the linear convolution s(x) = h(x)g(x) by the algorithm designed in example 6.11, we have

s = [1 0 0; 1 -1 1; 0 0 1] (([1 0; 1 -1; 0 1] h)([1 0; 1 -1; 0 1] g)).
The operation, mod m_i(x), can be viewed as matrix multiplication. Set

A = [1 0; 1 -1; 0 1].

We have

c1 = [1 0 0; 1 -1 1] ((Ah)(Ag)).
Continuing in this way, we have

c2 = [1 0 -1; 1 -1 1] ((Ah)(Ag)),

c3 = [1 0 1; 1 -1 1] ((Ah)(Ag)),
c4 = [1 0 -1; 1 -1 0] ((Ah)(Ag)),

c5 = [1 0 -1; 1 -1 2] ((Ah)(Ag)).
6.4 Winograd Small Convolution Algorithm
The Cook-Toom algorithm uses the CRT relative to a reducing polynomial m(x) constructed from linear factors. In the examples, the roots of these linear factors are restricted to be integers, since by doing so non-rational multiplications are kept to a minimum. However, additions grow rapidly as the size of the computation increases. A major part of these additions is needed to carry out the rational multiplications coming from the integer coefficients of the linear factors. Of major importance is the numerical stability of the computation [3].
By applying the CRT more generally, Winograd designed algorithms that could efficiently handle a larger collection of small size convolutions. The growth in the number of required additions is not as rapid as in the Cook-Toom algorithm, while the number of required multiplications increases modestly.
Consider a reducing polynomial
m(x) = m1(x)m2(x) ··· m_r(x),
where m_l(x), 1 <= l <= r, are relatively prime. We do not require that these polynomials be linear. This leads to the possibility of building reducing polynomials m(x) of higher degrees than before and still having factors with small integer coefficients. As we saw in the preceding section, the coefficients of the factors of the reducing polynomials become multipliers in the corresponding algorithm. If these coefficients are small integers, then these multiplications can be carried out by a small number of additions. As the size of these integers grows, the number of required additions grows.
Suppose that m(x) = m1(x)m2(x), where m1(x) and m2(x) are relatively prime. The extension to more factors follows easily. We want to compute multiplication in the polynomial ring
C[x]/m(x). (6.15)
By the CRT, we can carry out this computation as follows. Suppose that
deg(m(x)) = N,  deg(m_k(x)) = N_k,  k = 1, 2.
Take polynomials h(x) and g(x) of degree < N . We want to compute
c(x) ≡ h(x)g(x) mod m(x).
1. Compute the reduced polynomials
h^(k)(x) ≡ h(x) mod m_k(x), k = 1, 2,

g^(k)(x) ≡ g(x) mod m_k(x), k = 1, 2.
2. Compute the products
c^(k)(x) ≡ h^(k)(x) g^(k)(x) mod m_k(x), k = 1, 2.
The CRT guarantees that c(x) is uniquely determined by the polynomials c^(k)(x), k = 1, 2, and prescribes a method for its computation. The unique system of idempotents
{e1(x), e2(x)},

corresponding to the factorization m(x) = m1(x)m2(x), satisfies

e_k(x) ≡ 1 mod m_k(x), k = 1, 2,

e_l(x) ≡ 0 mod m_k(x), l, k = 1, 2, l ≠ k.
Then

1 ≡ e1(x) + e2(x) mod m(x),

and

c(x) ≡ c^(1)(x)e1(x) + c^(2)(x)e2(x) mod m(x). (6.16)
To complete the computation of c(x) by (6.16), we require the following steps:
3. Compute the products
c_k(x) ≡ c^(k)(x) e_k(x) mod m(x), k = 1, 2.
4. Compute the sum
c(x) = c1(x) + c2(x).
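Steps 1-4 can be sketched concretely for m(x) = x(x^2 + 1), whose idempotents e1(x) = x^2 + 1 and e2(x) = -x^2 appear in example 6.17 below (the helper names are ours):

```python
# CRT multiplication in C[x]/m(x) with m(x) = m1(x) m2(x), m1 = x, m2 = x^2 + 1.
# Polynomials are coefficient lists, lowest degree first.

def pmul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def pmod(a, m):
    # Reduce a(x) mod a monic m(x); result has len(m) - 1 coefficients.
    a = list(a) + [0] * max(0, len(m) - 1 - len(a))
    while len(a) >= len(m):
        q = a[-1]
        for k in range(len(m)):
            a[len(a) - len(m) + k] -= q * m[k]
        a.pop()
    return a

def padd(a, b):
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]

m1, m2, m = [0, 1], [1, 0, 1], [0, 1, 0, 1]   # x, x^2 + 1, x(x^2 + 1)
e1, e2 = [1, 0, 1], [0, 0, -1]                # idempotents x^2 + 1 and -x^2

def crt_multiply(h, g):
    # Steps 1 and 2: reduce mod each factor and multiply in the factor rings.
    c1 = pmod(pmul(pmod(h, m1), pmod(g, m1)), m1)
    c2 = pmod(pmul(pmod(h, m2), pmod(g, m2)), m2)
    # Steps 3 and 4: recombine with the idempotents, then reduce mod m(x).
    return pmod(padd(pmul(c1, e1), pmul(c2, e2)), m)

print(crt_multiply([1, 2, 3], [4, 5, 6]))  # [4, -14, 10]
```

The result agrees with direct reduction of the full product mod m(x).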
In the first stage of the algorithm, we compute multiplications in the polynomial rings
C[x]/m_k(x), k = 1, 2. (6.17)
In part, multiplications in the polynomial ring (6.15) have been replaced by multiplications in the polynomial rings (6.17). Efficient small size algorithms computing multiplications in (6.17) provide building blocks for computing multiplications in (6.15).

In the previous section, we designed algorithms for computing linear
convolution and multiplication in quotient polynomial rings in the form
c = C((Ah)(Bg)),
where h, g are input vectors, c is an output vector, and A, B, and C are matrices corresponding to W_M, W_N, and W^{-1}. Such algorithms are bilinear algorithms. We will now see how to piece together bilinear algorithms computing multiplication in the polynomial ring C[x]/m(x). This permits efficient small algorithms to be used as building blocks in larger size algorithms.
Suppose that bilinear algorithms compute the products,
c^(k)(x) ≡ h^(k)(x) g^(k)(x) mod m_k(x), k = 1, 2,

c^(k) = C_k((A_k h^(k))(B_k g^(k))). (6.18)
We assume that A_k and B_k are N x N_k matrices and that C_k is an N_k x N matrix. The operation mod m_k(x) can be computed by matrix multiplication,
h^(k) = M_k h,

g^(k) = M_k g,

where M_k is an N_k x N matrix having coefficients determined by m_k(x). Set
A = [A1 0; 0 A2],  B = [B1 0; 0 B2],  M = [M1; M2],

and set

A_M = AM = [A1M1; A2M2],  B_M = BM = [B1M1; B2M2],  C = [C1 0; 0 C2].
Then

[c^(1); c^(2)] = C((A_M h)(B_M g)).
The vectors c^(1) and c^(2) determine the polynomials c^(1)(x) and c^(2)(x), which now must be put together using the idempotents. Multiplication by e_k(x) mod m(x) can be described by an N x N_k matrix E_k, k = 1, 2. We have
c_k = E_k c^(k), k = 1, 2,

and

c = EC((A_M h)(B_M g)), (6.19)

where E = [E1 E2].
The efficiency of (6.19) depends on two factors: the efficiency of the small bilinear algorithms (6.18) and the efficiency of how these building blocks are put together. We assume throughout that the factors m1(x) and m2(x) contain only small integer coefficients. Then M has only small integer coefficients. Although the matrix E has rational coefficients, as we will see
in section 6, its action can frequently be computed in a precomputation stage.
Example 6.17 Take m(x) = x(x^2 + 1).
We will use (6.19) to compute
c(x) ≡ h(x)g(x) mod m(x),
where the bilinear algorithm computing multiplication mod (x^2 + 1) is taken from example 6.16. First, with

m1(x) = x,  m2(x) = x^2 + 1,

we have

M1 = [1 0 0],  M2 = [1 0 -1; 0 1 0].
We can see directly that
A1 = B1 = C1 = [1].
Then
C = [1 0 0 0; 0 1 0 -1; 0 1 -1 1].
The idempotents are given by
e1(x) = x^2 + 1,  e2(x) = -x^2,
from which it follows that
E1 = [1; 0; 1],  E2 = [0 0; 0 1; -1 0].
Direct computation shows that
EC = [1 0 0 0; 0 1 -1 1; 1 -1 0 1].
From example 6.16,
A2 = B2 = [1 0; 1 -1; 0 1],  C2 = [1 0 -1; 1 -1 1],

and

A_M = B_M = [1 0 0; 1 0 -1; 1 -1 -1; 0 1 0].
Thus

c = [1 0 0 0; 0 1 -1 1; 1 -1 0 1] (([1 0 0; 1 0 -1; 1 -1 -1; 0 1 0] h)([1 0 0; 1 0 -1; 1 -1 -1; 0 1 0] g)).
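The matrices of this example can be sanity-checked numerically (a sketch assuming numpy); since both sides are bilinear in h and g, agreement on arbitrary integer inputs is expected:

```python
# Check example 6.17: c = EC((A_M h)(B_M g)) equals h(x)g(x) mod x(x^2 + 1).
import numpy as np

EC = np.array([[1, 0, 0, 0],
               [0, 1, -1, 1],
               [1, -1, 0, 1]])
AM = np.array([[1, 0, 0],
               [1, 0, -1],
               [1, -1, -1],
               [0, 1, 0]])

rng = np.random.default_rng(0)
h = rng.integers(-5, 5, 3)
g = rng.integers(-5, 5, 3)

c = EC @ ((AM @ h) * (AM @ g))

# Direct reduction of s(x) = h(x)g(x) mod x^3 + x: x^3 = -x, x^4 = -x^2.
s = np.convolve(h, g)
direct = np.array([s[0], s[1] - s[3], s[2] - s[4]])

print(np.array_equal(c, direct))  # True
```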
Example 6.18 Take
m(x) = (x + 1)(x^2 + x + 1),

where

m1(x) = x + 1,  m2(x) = x^2 + x + 1.

Then

M1 = [1 -1 1],  M2 = [1 0 -1; 0 1 -1].
Directly, A1 = B1 = C1 = [1].
By example 6.13, we can take

A2 = B2 = [1 0; 1 -1; 0 1],  C2 = [1 0 -1; 1 -1 0].
The idempotents are given by
e1(x) = x^2 + x + 1,  e2(x) = -(x^2 + x),
implying that

E1 = [1; 1; 1],  E2 = [0 1; -1 2; -1 1].
Putting this all together in (6.19), we have

c = [1 1 -1 0; 1 1 -2 1; 1 0 -1 1] (([1 -1 1; 1 0 -1; 1 -1 0; 0 1 -1] h)([1 -1 1; 1 0 -1; 1 -1 0; 0 1 -1] g)).
Example 6.19 We design an algorithm computing multiplication mod m(x) where
m(x) = m1(x)m2(x),

m1(x) = x(x^2 + 1),  m2(x) = (x + 1)(x^2 + x + 1).
From the preceding two examples, we can compute multiplication mod m_k(x), k = 1, 2, by taking
A1 = B1 = [1 0 0; 1 0 -1; 1 -1 -1; 0 1 0],  C1 = [1 0 0 0; 0 1 -1 1; 1 -1 0 1],
A2 = B2 = [1 -1 1; 1 0 -1; 1 -1 0; 0 1 -1],  C2 = [1 1 -1 0; 1 1 -2 1; 1 0 -1 1].

Directly,

M1 = [1 0 0 0 0 0; 0 1 0 -1 0 1; 0 0 1 0 -1 0],  M2 = [1 0 0 -1 2 -2; 0 1 0 -2 3 -2; 0 0 1 -2 2 -1].
Then

A_M = B_M = [1 0 0 0 0 0; 1 0 -1 0 1 0; 1 -1 -1 1 1 -1; 0 1 0 -1 0 1; 1 -1 1 -1 1 -1; 1 0 -1 1 0 -1; 1 -1 0 1 -1 0; 0 1 -1 0 1 -1].
The idempotents are given by

e1(x) = (1/2)(3x^5 + 5x^4 + 6x^3 + 5x^2 + 3x + 2),

e2(x) = -(1/2)(3x^5 + 5x^4 + 6x^3 + 5x^2 + 3x),

and since

m(x) = x^6 + 2x^5 + 3x^4 + 3x^3 + 2x^2 + x,
we have that

E1 = (1/2) [2 0 0; 3 -1 1; 5 -3 1; 6 -4 0; 5 -3 -1; 3 -1 -1],

E2 = -(1/2) [0 0 0; 3 -3 1; 5 -3 -1; 6 -4 0; 5 -3 -1; 3 -1 -1].
Direct computation shows that C' = EC is given by
C' = (1/2) [2 0 0 0 0 0 0 0; 4 -2 1 0 -1 0 -2 2; 6 -4 3 -2 -1 -2 -2 4; 6 -4 4 -4 -2 -2 -2 4; 4 -2 3 -4 -1 -2 -2 4; 2 0 1 -2 -1 -2 0 2].
Then, by (6.19), c = C'((A_M h)(B_M g)).
6.5 Linear and Cyclic Convolutions
The methods of section 4 decompose the computation of polynomial multiplication mod m(x) into small size computations of polynomial multiplications mod m1(x) and m2(x), where m(x) = m1(x)m2(x), m1(x) and m2(x) relatively prime. We can apply these ideas to linear and cyclic convolutions to decompose a large size problem into several small size problems. As we will see, algorithms computing linear convolution, cyclic convolution and multiplication modulo a polynomial can be used as a part of other algorithms of the same type. This permits large size problems to be successively decomposed into smaller and smaller problems.
Example 6.20 Consider the 2 x 2 linear convolution. Using the algorithm given by (6.19) for the reducing polynomial m(x) = x(x^2 - 1) leads to the same algorithm as that designed in section 3.
Example 6.21 Consider the 2 x 3 linear convolution
s(x) = g(x)h(x),
where g(x) = g0 + g1x, h(x) = h0 + h1x + h2x^2.
First compute the product
c(x) ≡ g(x)h(x) mod x(x^2 + 1)
by the algorithm of example 6.17. Then we compute s(x) by
s(x) = c(x) + g1h2 x(x^2 + 1).
In the next few examples, two 3 x 3 linear convolution algorithms will be derived. The first is based on the Cook-Toom algorithm while the second follows from the methods of section 4.
Example 6.22 Consider the 3 x 3 linear convolution
s(x) = g(x)h(x)
of polynomials g(x) and h(x) of degree 2. Compute
c(x) ≡ g(x)h(x) mod x(x - 1)(x + 1)(x - 2)
by the Cook-Toom algorithm. Then
s(x) = c(x) + g2h2 x(x - 1)(x + 1)(x - 2).
Working this out, we have
s = (1/6) C((Ag)(Ah)),

where

A = [1 0 0; 1 1 1; 1 -1 1; 1 2 4; 0 0 1],

C = [6 0 0 0 0; -3 6 -2 -1 12; -6 3 3 0 -6; 3 -3 -1 1 -12; 0 0 0 0 6].
Before designing the second 3 x 3 linear convolution algorithm, we will design a four-point cyclic convolution algorithm that will then be used to design a 3 x 3 linear convolution algorithm having slightly more multiplications but significantly fewer additions. We also note that the convolution theorem can be used to efficiently compute a four-point cyclic convolution.

Example 6.23 Consider the four-point cyclic convolution

c(x) ≡ g(x)h(x) mod (x^4 - 1).

Using the factorization

x^4 - 1 = (x^2 + 1)(x^2 - 1)

in (6.19), we have

M1 = [1 0 -1 0; 0 1 0 -1],  M2 = [1 0 1 0; 0 1 0 1].

Compute multiplication mod (x^2 + 1) and multiplication mod (x^2 - 1) by example 6.16. Then

A1 = A2 = B1 = B2 = [1 0; 1 -1; 0 1],
C1 = [1 0 -1; 1 -1 1],  C2 = [1 0 1; 1 -1 1].

The idempotents are given by

e1(x) = -(1/2)(x^2 - 1),  e2(x) = (1/2)(x^2 + 1),

and we have

E1 = (1/2)[1 0; 0 1; -1 0; 0 -1],  E2 = (1/2)[1 0; 0 1; 1 0; 0 1].

Putting all of these together,

c = (1/2) C((Ag)(Ah)),

where

C = [1 0 -1 1 0 1; 1 -1 1 1 -1 1; -1 0 1 1 0 1; -1 1 -1 1 -1 1],

A = [1 0 -1 0; 1 -1 -1 1; 0 1 0 -1; 1 0 1 0; 1 -1 1 -1; 0 1 0 1].
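As remarked before this example, the convolution theorem offers another route to the four-point cyclic convolution; a minimal sketch assuming numpy (the test values are ours):

```python
# Four-point cyclic convolution by the convolution theorem:
# c = F^{-1}((Fg) . (Fh)) with F the four-point Fourier transform.
import numpy as np

g = np.array([1, 2, 3, 4])
h = np.array([5, 6, 7, 8])

c_fft = np.fft.ifft(np.fft.fft(g) * np.fft.fft(h)).real.round().astype(int)

# Direct check from the definition c(n) = sum_k g(k) h((n - k) mod 4).
c = np.array([sum(g[k] * h[(n - k) % 4] for k in range(4)) for n in range(4)])

print(c_fft, np.array_equal(c, c_fft))
```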
Example 6.24 Consider the 3 x 3 linear convolution
s(x) = g(x)h(x).
First compute the four-point cyclic convolution

c(x) ≡ g(x)h(x) mod (x^4 - 1)
by the algorithm designed in example 6.23. We note that since the degree of g(x) and h(x) is equal to two, we can rewrite example 6.23 as
c = (1/2) C((A'g)(A'h)),

where

A' = [1 0 -1; 1 -1 -1; 0 1 0; 1 0 1; 1 -1 1; 0 1 0].
We can now compute s(x) by
s(x) = c(x) + g2h2(x^4 - 1).
We see that now all of the coefficients are 0, ±1, while in example 6.22, 'large' integers appear in the matrices.
Example 6.25 Consider the 4 x 4 linear convolution
s(x) = g(x)h(x).
First compute

c(x) ≡ g(x)h(x) mod m(x),
where the reducing polynomial is
m(x) = x(x + 1)(x^2 + 1)(x^2 + x + 1).
Then

s(x) = c(x) + g3h3 m(x).
To compute c(x), we use the algorithm designed in example 6.19. Since
deg(g(x)) = deg(h(x)) = 3,
we have

c = C'((A'g)(A'h)),

where

A' = [1 0 0 0; 1 0 -1 0; 1 -1 -1 1; 0 1 0 -1; 1 -1 1 -1; 1 0 -1 1; 1 -1 0 1; 0 1 -1 0],

and C' is as given in example 6.19.

Consider the N-point cyclic convolution,
c(x) ≡ g(x)h(x) mod (x^N - 1).
If an efficient N-point FT is available, then the convolution theorem is usually the best approach for computing an N-point cyclic convolution. The algorithms of section 4 can also be called upon. For instance, take the factorization,
x^N - 1 = φ_0(x) φ_1(x) ··· φ_{K-1}(x), (6.20)
where the polynomials φ_k(x), 0 <= k < K, are the prime factors of x^N - 1 over the rational field Q. These polynomials are usually called cyclotomic polynomials. If
g^(k)(x) ≡ g(x) mod φ_k(x),

h^(k)(x) ≡ h(x) mod φ_k(x), 0 <= k < K,
then the cyclic convolution c(x) can be found from the products,
c^(k)(x) ≡ g^(k)(x) h^(k)(x) mod φ_k(x), 0 <= k < K, (6.21)
by the formula,

c(x) = Σ_{k=0}^{K-1} c^(k)(x) e_k(x),

where

{e_k(x) : 0 <= k < K}
is the unique system of idempotents corresponding to the factorization (6.20).
As discussed in section 4, choosing a factorization over the rational field Q implies that the only multiplications required to carry out the algorithm are those given in (6.21). We continue to assume that the factorization is over Q, but point out that factorization over other fields can lead to efficient algorithms. This will be the case if multiplication by elements from the field can be efficiently implemented. For example, the field Q(i) of Gaussian numbers consisting of all complex numbers a + ib, a and b rational, is frequently taken. The value of 'extending' the field of the factorization is that the prime factors are of smaller degrees.
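As a small check of (6.20) over Q, the cyclotomic factorization of x^6 - 1 can be verified by multiplying the factors back together (the helper pmul is ours):

```python
# Check the cyclotomic factorization x^6 - 1 = (x-1)(x+1)(x^2+x+1)(x^2-x+1)
# over Q. Polynomials are coefficient lists, lowest degree first.

def pmul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

factors = [[-1, 1],        # x - 1
           [1, 1],         # x + 1
           [1, 1, 1],      # x^2 + x + 1
           [1, -1, 1]]     # x^2 - x + 1

prod = [1]
for f in factors:
    prod = pmul(prod, f)

print(prod)  # [-1, 0, 0, 0, 0, 0, 1], i.e. x^6 - 1
```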
In the following examples, we will work out algorithms following the above approach. The multiplications in (6.21) will be computed by first passing through linear convolution.
Example 6.26 Consider the three-point cyclic convolution,
c(x) ≡ g(x)h(x) mod (x^3 - 1).
The factorization (6.20) is given by
x^3 - 1 = (x - 1)(x^2 + x + 1).
Example 6.16 provides the bilinear algorithm for computing multiplication mod x^2 + x + 1. We have

M1 = [1 1 1],  M2 = [1 0 -1; 0 1 -1],

A1 = B1 = C1 = [1],
A2 = B2 = [1 0; 1 -1; 0 1],  C2 = [1 0 -1; 1 -1 0].
The idempotents are given by

e1(x) = (1/3)(x^2 + x + 1),  e2(x) = -(1/3)(x^2 + x - 2),

and

E1 = (1/3)[1; 1; 1],  E2 = (1/3)[2 -1; -1 2; -1 -1].
By (6.19), we have the bilinear algorithm

c = (1/3) C'((A'g)(A'h)),

where

A' = [1 1 1; 1 0 -1; 1 -1 0; 0 1 -1],  C' = [1 1 1 -2; 1 1 -2 1; 1 -2 1 1].
Example 6.27 Consider the five-point cyclic convolution
c(x) ≡ g(x)h(x) mod (x^5 - 1).
Factorization (6.20) is
x^5 - 1 = (x - 1)(x^4 + x^3 + x^2 + x + 1).
Then

M1 = [1 1 1 1 1],  M2 = [1 0 0 0 -1; 0 1 0 0 -1; 0 0 1 0 -1; 0 0 0 1 -1].

Directly,

A1 = B1 = C1 = [1].

Multiplication mod (x^4 + x^3 + x^2 + x + 1) can be computed by first computing the 4 x 4 linear convolution by the algorithm designed in example 6.25. Using the notation of example 6.25,

A2 = B2 = A',  C2 = C'.
Direct computation shows that
A_M = B_M = [1 1 1 1 1; 1 0 0 0 -1; 1 0 -1 0 0; 1 -1 -1 1 0; 0 1 0 -1 0; 1 -1 1 -1 0; 1 0 -1 1 -1; 1 -1 0 1 -1; 0 1 -1 0 0].
To complete the ingredients needed for (6.19), we observe that the idempotents are

e1(x) = (1/5)(1 + x + x^2 + x^3 + x^4),

e2(x) = -(1/5)(-4 + x + x^2 + x^3 + x^4),

which can be used to compute

E1 = (1/5)[1; 1; 1; 1; 1],  E2 = -(1/5)[-4 1 1 1; 1 -4 1 1; 1 1 -4 1; 1 1 1 -4; 1 1 1 1].
6.6 Digital Filters
The bilinear algorithms computing convolution, developed from the CRT, have the form
s = C((Bh)(Ag)), (6.22)
where C is usually more complicated than A or B. For application to digital filtering, we typically have one of the inputs, say h, fixed, at least over many occurrences, while g varies. We will now discuss the concept of the transpose of (6.22), which permits s to be computed by the formula
s = B̃^t((C̃^t h)(Ag)), (6.23)
where B̃ is the matrix determined by reversing the columns of B and C̃ is the matrix determined by reversing the rows of C.
Since h can be viewed as fixed, the computation C̃^t h can be made once and for all. This precomputation stage, once made, does not enter into the overall efficiency of the algorithm, which now depends on the matrices A
and B. In the examples of section 5, the entries of A and B were always 0, 1, and -1, and that makes very obvious the advantage of precomputing C̃^t h.
In [3], the implications of this discussion for the stability of the computation were studied.
We turn now to a proof of (6.23). The result depends on the following observation about Toeplitz matrices.
Let T be a Toeplitz matrix that admits the factorization

T = CDB,

and let R denote the matrix of the same size as T given by

R = [0 0 ··· 0 1; 0 0 ··· 1 0; ···; 0 1 ··· 0 0; 1 0 ··· 0 0].

Then

T^t = RTR = (RC)D(BR) = C̃DB̃,

which proves that

T^t = C̃DB̃. (6.24)

Consider now

s(x) ≡ g(x)h(x) mod (x^N - 1).

We can write

s = C(h)g,
where C(h) is the circulant and, hence, the Toeplitz matrix with the vector h as its first column. Suppose that we have a bilinear algorithm
s = C((Ag)(Bh))
computing s. Let D be the diagonal matrix

D = diag(Ag).

Then we write

s = CDBh,

and

C(g) = CDB.

By (6.24),

C(g)^t = C̃DB̃,

from which it follows that

s = B̃^t((Ag)(C̃^t h)).
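The Toeplitz fact (6.24) rests on, T^t = RTR, is easy to check numerically (a sketch assuming numpy; the test values are ours):

```python
# If R is the exchange (reversal) matrix, then R T R = T^t for any
# Toeplitz matrix T, since (RTR)_{ij} = T_{N-1-i, N-1-j} = t_{j-i}.
import numpy as np

N = 5
t = np.arange(-(N - 1), N)          # diagonal values t_{-(N-1)} .. t_{N-1}
T = np.array([[t[(i - j) + N - 1] for j in range(N)] for i in range(N)])
R = np.fliplr(np.eye(N, dtype=int))  # the exchange matrix

print(np.array_equal(R @ T @ R, T.T))  # True
```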
We have additional results that can be proved from (6.24). For example, if m(x) = x^N - ε, where ε is a constant, then

s(x) ≡ g(x)h(x) mod m(x)

can be computed by

s = T_ε(h)g,

where T_ε(h) is the Toeplitz matrix

T_ε(h) = [h0 εh_{N-1} ··· εh1; h1 h0 ··· εh2; ···; h_{N-1} h_{N-2} ··· h0].
Arguing as above, if s = C((Ag)(Bh)) is a bilinear algorithm, we have

s = B̃^t((Ag)(C̃^t h)).

Since

(Ag)(C̃^t h)

is componentwise multiplication, the order can be changed, and

s = B̃^t((C̃^t h)(Ag)), (6.25)

where the vector h represents the system elements and g is the input vector.
References
[1] Winograd, S. "Some Bilinear Forms Whose Multiplicative Complexity Depends on the Field of Constants", Math. Syst. Theor., 10, 1977, pp. 169-180.
[2] Agarwal, R. C. and Cooley, J. W. "New Algorithms for Digital Convolution", IEEE Trans. Acoust. Speech and Signal Proc., 25, 1977, pp. 392-410.
[3] Auslander, L., Cooley, J. W. and Silberger, A. J. "Number Stability of Fast Convolution Algorithms for Digital Filtering", in VLSI Signal Proc., IEEE Press, 1984, pp. 172-213.
[4] Blahut, R. E. Fast Algorithms for Digital Signal Processing, Addison-Wesley, 1985, Chapters 3 and 7.
[5] Nussbaumer, H. J. Fast Fourier Transform and Convolution Algorithms, Second Edition, Springer-Verlag, 1981, Chapters 3 and 6.
[6] Burrus, C. S. and Parks, T. W . DFT/FFT and Convolution Algorithms, John Wiley and Sons, 1985.
[7] Oppenheim, A. V. and Schafer, R. W. Digital Signal Processing, Prentice-Hall, 1975.
Problems
1. For two vectors h = [2, 3, 4, 5] and g = [6, 7, 8, 1], compute their linear convolution by
a. Convolution summation.
b. Polynomial multiplication.
c. Matrix multiplication.
2. For two vectors a = [2, 3, 4, 5] and b = [6, 7, 8, 1], compute their cyclic convolution by
a. Convolution summation.
b. Polynomial multiplication.
c. Matrix multiplication.
3. Write the cyclic shift matrices S5 and S6. Prove that

S5^5 = I5,  S6^6 = I6.
4. For a = [1, 2, 3, 4, 5], write the circulant matrix C(a), and represent C(a) by the cyclic shift matrix S5 .
5. Compute the four-point cyclic convolution of problem 2 by the convolution theorem. Show that the results are the same as the direct computation.
6. Diagonalize the matrix

A = [4 2 1 8; 8 4 2 1; 1 8 4 2; 2 1 8 4].
7. Show that F(5)S5 = D5F(5), where D5 is a diagonal matrix, and give the diagonal matrix.
8. Let m(x) = x(x - 1)(x + 1)(x + 2)(x - 2)(x - ∞), and derive a Cook-Toom algorithm for a 3 x 4 linear convolution.

9. Find the arithmetic counts of problem 8.
7 Agarwal-Cooley Convolution Algorithm
The cyclic convolution algorithms of chapter 6 are efficient for special small block lengths, but as the size of the block length increases, other methods are required. First, as discussed in chapter 6, these algorithms keep the number of required multiplications small, but they can require many additions. Also, each size requires a different algorithm. There is no uniform structure that can be repeatedly called upon. In this chapter, a technique similar to the Good-Thomas PFA will be developed to decompose a large size cyclic convolution into several small size cyclic convolutions that in turn can be evaluated using the Winograd cyclic convolution algorithm. These ideas were introduced by Agarwal and Cooley [1] in 1977. As in the Good-Thomas PFA, the CRT is used to define an indexing of data. This indexing changes a one-dimensional cyclic convolution into a two-dimensional one. We will see how to compute a two-dimensional cyclic convolution by 'nesting' a fast algorithm for a one-dimensional case inside another fast algorithm for a one-dimensional cyclic convolution. There are several two-dimensional cyclic convolution algorithms that, although important, will not be discussed. These can be found in [2].
7.1 Two-Dimensional Cyclic Convolution.
Consider two M x N matrices
g = [g(m, n)], 0 <= m < M, 0 <= n < N,

h = [h(m, n)], 0 <= m < M, 0 <= n < N.
We will define the two-dimensional cyclic convolution
s=h*g.
Associate to g and h the polynomials in two variables,
G(x, y) = Σ_{m=0}^{M-1} Σ_{n=0}^{N-1} g(m, n) x^m y^n, (7.1)

H(x, y) = Σ_{k=0}^{M-1} Σ_{l=0}^{N-1} h(k, l) x^k y^l. (7.2)
Form the polynomial
S(x, y) ≡ H(x, y)G(x, y) mod (x^M - 1) mod (y^N - 1):

we first form the polynomial product
H (x , y)G(x , y)
and then reduce mod (x^M - 1) by setting x^M = 1; in the same way, we reduce mod (y^N - 1) by setting y^N = 1. We can write
S(x, y) = Σ_{m=0}^{M-1} Σ_{n=0}^{N-1} s(m, n) x^m y^n. (7.3)

We call the M x N matrix

s = [s(m, n)], 0 <= m < M, 0 <= n < N,
the cyclic convolution of h and g.

We can compute s by the following nesting procedure. First, by accumulating all the terms that have the same power of x, we can rewrite (7.1) as

G(x, y) = Σ_{m=0}^{M-1} g_m(y) x^m,

where

g_m(y) = Σ_{n=0}^{N-1} g(m, n) y^n, 0 <= m < M.
In the same way, we can rewrite (7.2) and (7.3) as

H(x, y) = Σ_{m=0}^{M-1} h_m(y) x^m,

S(x, y) = Σ_{m=0}^{M-1} s_m(y) x^m.
Then

s_l(y) ≡ Σ_{m=0}^{M-1} h_{l-m}(y) g_m(y) mod (y^N - 1), 0 <= l < M, (7.4)
which can be viewed as a cyclic convolution mod M where the data are no longer taken from the complex field but are taken from the ring C[y]/(y^N - 1).
Main Idea The cyclic convolution algorithms of chapter 6, designed for complex data, hold equally well for data taken from any ring containing the complex field. In particular, they can be used to compute (7.4). In this case, multiplication and addition mean multiplication and addition in C[y]/(y^N - 1). The multiplication is cyclic convolution mod N.
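A small numpy sketch (variable names are ours) of the two-dimensional cyclic convolution just defined, computed from the definition and, as a check, by pointwise multiplication of two-dimensional DFTs as in (7.17):

```python
# Two-dimensional M x N cyclic convolution s = h * g, two ways.
import numpy as np

M, N = 2, 3
rng = np.random.default_rng(1)
g = rng.integers(0, 5, (M, N))
h = rng.integers(0, 5, (M, N))

# Definition: s(m, n) = sum_{k, l} h((m - k) mod M, (n - l) mod N) g(k, l).
s = np.zeros((M, N), dtype=int)
for m in range(M):
    for n in range(N):
        for k in range(M):
            for l in range(N):
                s[m, n] += h[(m - k) % M, (n - l) % N] * g[k, l]

# Two-dimensional convolution theorem: pointwise multiply the 2-D DFTs.
s_fft = np.fft.ifft2(np.fft.fft2(h) * np.fft.fft2(g)).real.round().astype(int)

print(np.array_equal(s, s_fft))  # True
```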
Suppose that cyclic convolution mod M is computed by an algorithm using a(M) additions and m(M) multiplications, with similar notation for cyclic convolution mod N. Then the M polynomials of (7.4),
s_l(y), 0 <= l < M,
are computed using m(M) N-point cyclic convolutions and a(M) additions in C[y]/(yN - 1). Since each N-point cyclic convolution is computed using a(N) additions and m(N) multiplications, we have that
m(M)m(N)

multiplications and

N a(M) + m(M) a(N)
additions are needed to compute s. The order of the operations above can be interchanged by reversing the
roles of M and N. This has no effect on the number of multiplications but does affect additions.
We will now translate this discussion into matrix language. The M polynomial computations given in (7.4) can be rewritten as
[s_0(y); s_1(y); ···; s_{M-1}(y)] = [h_0(y) h_{M-1}(y) ··· h_1(y); h_1(y) h_0(y) ··· h_2(y); ···; h_{M-1}(y) h_{M-2}(y) ··· h_0(y)] [g_0(y); g_1(y); ···; g_{M-1}(y)] mod (y^N - 1). (7.5)
The matrix
H(y) = [h_0(y) h_{M-1}(y) ··· h_1(y); h_1(y) h_0(y) ··· h_2(y); ···; h_{M-1}(y) h_{M-2}(y) ··· h_0(y)]
is a circulant matrix having coefficients in C[y]/(y^N - 1). Set
g_m = [g(m, 0); g(m, 1); ···; g(m, N - 1)], 0 <= m < M,
and observe that g_m is the vector formed from the m-th row of the matrix g. In the same way, define h_l, 0 <= l < M, and s_k, 0 <= k < M. Let H_l denote the circulant matrix having h_l as the 0-th column. We can rewrite (7.5) as
[s_0; s_1; ···; s_{M-1}] = H [g_0; g_1; ···; g_{M-1}],
where H is the block circulant matrix with circulant blocks

H = [H_0 H_{M-1} ··· H_1; H_1 H_0 ··· H_2; ···; H_{M-1} H_{M-2} ··· H_0], (7.6)
and (7.6) is the matrix description of a two-dimensional cyclic convolution.
In chapter 6, bilinear cyclic convolution algorithms were designed as matrix factorizations of circulant matrices. We will extend these one-dimensional algorithms to the two-dimensional computation given by H. Matrices A and B define a bilinear N-point cyclic convolution algorithm if, for any N x N circulant matrix C, a diagonal matrix G can be found satisfying
C = BGA. (7.7)
A class of algorithms of this kind has been given in chapter 5. In the convolution theorem, we have A = B^{-1} = F(N). In the Winograd algorithms, the matrices A and B are matrices of small integers but are no longer square matrices.
First consider the special case
H = C ⊗ C', (7.8)
where C is an M x M circulant matrix and C' is an N x N circulant matrix. H has the form (7.6). Take bilinear algorithms computing M-point and N-point cyclic convolutions
C = BGA, (7.9)
C' = B'G'A', (7.10)
where G and G' are diagonal matrices. Placing (7.9) and (7.10) in (7.8), we can write
H = (B ⊗ B')(G ⊗ G')(A ⊗ A'),
where G ⊗ G' is a diagonal matrix. Consider again the matrix H given in (7.6). By (7.10), we can write
H_l = B'G'_l A', 0 <= l < M, (7.11)
where the diagonal matrix G'_l is determined by H_l. Placing (7.11) in (7.6), we can rewrite H as
H = (I_M ⊗ B') D' (I_M ⊗ A'), (7.12)
where D' is the block circulant matrix having diagonal matrix blocks

D' = [G'_0 G'_{M-1} ··· G'_1; G'_1 G'_0 ··· G'_2; ···; G'_{M-1} G'_{M-2} ··· G'_0].
Suppose that the size of each diagonal matrix G'_l, 0 <= l < M, is K. Then A' is a K x N matrix and B' is an N x K matrix.
The matrix

P(MK, K) D' P(MK, M) (7.13)

is a block diagonal matrix consisting of K M x M circulant blocks. By (7.7), we can write (7.13) as the matrix direct sum
⊕_{k=0}^{K-1} B G_k A, (7.14)
where G_k, 0 <= k < K, is a diagonal matrix. This implies that D' can be written as

(B ⊗ I_K) D (A ⊗ I_K) (7.15)
for some diagonal matrix D. Placing (7.15) in (7.12),
H = (B ⊗ B') D (A ⊗ A'). (7.16)
We see that the bilinear algorithms (7.9) and (7.10) can be used to compute the two-dimensional cyclic convolution given by H. In particular, the convolution theorem,
H = (F(M) ⊗ F(N))^{-1} D (F(M) ⊗ F(N)), (7.17)
is the two-dimensional convolution theorem.
7.2 Agarwal-Cooley Algorithm
The CRT will be used to turn a one-dimensional N-point cyclic convolution where
N = N1N2, (N1, N2) = 1,

into a two-dimensional N1 x N2 cyclic convolution. By the results of section 1, we then can carry out the computation by nesting an N1-point cyclic convolution algorithm inside an N2-point cyclic convolution. Formula (7.17) is the explicit form of this nesting.
Choose idempotents el and e2 satisfying
e1 ≡ 1 mod N1,  e1 ≡ 0 mod N2,

e2 ≡ 0 mod N1,  e2 ≡ 1 mod N2.
Each n, 0 < n < N, can be uniquely written as
n ≡ n1e1 + n2e2 mod N, 0 <= n1 < N1, 0 <= n2 < N2.
Consider the N-point cyclic convolution
s = h * g,
which we can rewrite in the form

s = Hg,

where H is the circulant matrix

H = [h(0) h(N-1) ··· h(1); h(1) h(0) ··· h(2); ···; h(N-1) h(N-2) ··· h(0)].
We will show that a permutation matrix P can be found such that
Ps = (PHP^{-1})Pg, (7.18)
where PHP^{-1} is a block circulant matrix with circulant blocks. As a consequence, formula (7.18) computes Ps as a two-dimensional cyclic convolution in the sense of formula (7.6).
Example 7.1 Take N = 6 with N1 = 2 and N2 = 3. The idempotents are
el = 3, e2 = 4.
Consider the six-point cyclic convolution
s = Hg,
where H is a 6 x 6 circulant matrix. Define the permutation ir of Z/6 by
π = (0, 4, 2; 3, 1, 5),
and denote by P the corresponding permutation matrix. Then

P = [1 0 0 0 0 0; 0 0 0 0 1 0; 0 0 1 0 0 0; 0 0 0 1 0 0; 0 1 0 0 0 0; 0 0 0 0 0 1].
Direct computation shows that

PHP^{-1} = [h(0) h(2) h(4) h(3) h(5) h(1); h(4) h(0) h(2) h(1) h(3) h(5); h(2) h(4) h(0) h(5) h(1) h(3); h(3) h(5) h(1) h(0) h(2) h(4); h(1) h(3) h(5) h(4) h(0) h(2); h(5) h(1) h(3) h(2) h(4) h(0)],
which is a 2 x 2 block circulant matrix having 3 x 3 circulant blocks. The input and output vectors are
Pg = [g(0); g(4); g(2); g(3); g(1); g(5)],  Ps = [s(0); s(4); s(2); s(3); s(1); s(5)],
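The reindexing of example 7.1 can be checked numerically (a sketch assuming numpy; the helper names are ours):

```python
# Check that the CRT reindexing of example 7.1 turns a six-point cyclic
# convolution into a 2 x 3 two-dimensional cyclic convolution.
import numpy as np

N1, N2, N = 2, 3, 6
e1, e2 = 3, 4                      # e1 = 1 mod 2, 0 mod 3; e2 = 0 mod 2, 1 mod 3

rng = np.random.default_rng(2)
h = rng.integers(0, 5, N)
g = rng.integers(0, 5, N)

# One-dimensional six-point cyclic convolution.
s = np.array([sum(h[(n - k) % N] * g[k] for k in range(N)) for n in range(N)])

# CRT reindexing: position (n1, n2) holds index n1*e1 + n2*e2 mod 6.
idx = np.array([[(n1 * e1 + n2 * e2) % N for n2 in range(N2)] for n1 in range(N1)])
H2, G2 = h[idx], g[idx]

# Two-dimensional 2 x 3 cyclic convolution of the reindexed arrays.
S2 = np.zeros((N1, N2), dtype=int)
for m1 in range(N1):
    for m2 in range(N2):
        for k1 in range(N1):
            for k2 in range(N2):
                S2[m1, m2] += H2[(m1 - k1) % N1, (m2 - k2) % N2] * G2[k1, k2]

print(np.array_equal(S2, s[idx]))  # True
```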
and

Ps = (PHP^{-1})Pg

is a two-dimensional 2 x 3 cyclic convolution.

Consider the general case N = N1N2, where N1 and N2 are relatively prime. As in the Good-Thomas PFA, we define the permutation π of Z/N by the formula

π(n2 + N2 n1) ≡ n1 e1 + n2 e2 mod N, 0 <= n1 < N1, 0 <= n2 < N2.

Denote the corresponding permutation matrix by P. Pg is formed by reading across the rows of the matrix

[g(0) g(e2) ··· g((N2 - 1)e2); g(e1) g(e1 + e2) ··· g(e1 + (N2 - 1)e2); ···; g((N1 - 1)e1) g((N1 - 1)e1 + e2) ··· g((N1 - 1)e1 + (N2 - 1)e2)].

Write

PHP^{-1} = [h(l, k)], 0 <= l, k < N.

Then

h(n2 + N2 n1, k2 + N2 k1) = h((n1 - k1)e1 + (n2 - k2)e2), (7.19)

where 0 <= n1, k1 < N1, 0 <= n2, k2 < N2. From (7.19), we have that PHP^{-1} is an N1 x N1 block circulant matrix having N2 x N2 circulant blocks:

PHP^{-1} = [H_0 H_{N1-1} ··· H_1; H_1 H_0 ··· H_2; ···; H_{N1-1} H_{N1-2} ··· H_0],

where H_l is the circulant matrix having 0-th column

[h(l e1); h(l e1 + e2); ···; h(l e1 + (N2 - 1)e2)].

As a result,

Ps = (PHP^{-1})Pg (7.20)

is a two-dimensional N1 x N2 cyclic convolution.
Fast algorithms computing two-dimensional cyclic convolution can now be applied to compute (7.20) and in this way the N-point cyclic convolution. If N_i-point cyclic convolution is computed using a(N_i) additions and m(N_i) multiplications, then

m(N1)m(N2)

multiplications are needed to compute the N-point cyclic convolution s by (7.20) and

N2 a(N1) + m(N1) a(N2)
additions are required.
References
[1] Agarwal, R. C. and Cooley, J. W. "New Algorithms for Digital Con-volution", IEEE Trans. Acoust., Speech and Signal Proc., 25, 1977, pp. 392-410.
[2] Blahut, R. E. Fast Algorithms for Digital Signal Processing, Addison-Wesley, 1985, Chapter 7.
[3] Nussbaumer, H. J. Fast Fourier Transform and Convolution Algorithms, Second Edition, Springer-Verlag, 1981, Chapter 6.
[4] Arambepola, B. and Rayner, P. J. "Efficient Transforms for Multidimensional Convolutions", Electron. Lett., 15, 1979, pp. 189-190.
Problems
1. Derive a three-point Winograd cyclic convolution algorithm.
2. Derive a five-point Winograd cyclic convolution algorithm.
3. Derive a 15-point Agarwal-Cooley convolution algorithm using the results of problems 1 and 2.
4. Derive a four-point Winograd cyclic convolution algorithm.
5. Derive a 12-point Agarwal-Cooley convolution algorithm using the results of problems 1 and 4.
6. Write a 2 x 2 cyclic convolution algorithm
S(x, y) = H(x, y)G(x, y)   (mod x^2 - 1)(mod y^2 - 1).
7. Write a 2 x 2 polynomial product
S(x, y) = H(x, y)G(x, y)   (mod x^2 + 1)(mod y^2 + 1).
8 Multiplicative Fourier Transform Algorithm
The Cooley-Tukey FFT algorithm and its variants depend upon the existence of nontrivial divisors of the transform size N. These algorithms are called additive algorithms since they rely on the subgroups of the additive group structure of the indexing set. A second approach to the design of FT algorithms depends on the multiplicative structure of the indexing set. We applied the multiplicative structure previously, in chapter 5, in the derivation of the Good-Thomas PFA.
In the following chapters, a more extensive application of multiplicative structure will be required. The first breakthrough was due to Rader [1] in 1968, who observed that a p-point Fourier transform could be computed by a (p - 1)-point cyclic convolution. Winograd [2, 3] generalized Rader's results to include the case of transform size N = p^m, p a prime. Combined with Winograd's cyclic convolution algorithms, these methods lead to the Winograd Small FT Algorithm, which we will derive in detail in chapter 9.
In tables 8.1 and 8.2, taken from Temperton [4], we compare the number of real additions and real multiplications required by conventional methods and by the Winograd methods. For the transform sizes included in tables 8.1 and 8.2, Winograd's algorithm requires substantially fewer multiplications at the cost of a few extra additions. However, as the transform size increases, although the Winograd algorithm continues to maintain its advantage in multiplications, the price in additions becomes higher. This is to be expected since these algorithms depend on cyclic convolution algorithms. Standing alone, the Winograd small FT algorithms (WSFTA) are practical only for a collection of small size transforms. In tandem with the Good-Thomas PFA, the Winograd algorithms can be effectively used to
handle medium and some large size transforms. The Winograd Large FT algorithm [5] is based on a method of nesting the WSFTAs in the Good-Thomas PFA. This results in algorithms that minimize multiplications. This nesting technique can be described using tensor product formulation. Suppose that N = RS, where R and S are relatively prime. By the Good-Thomas algorithm,

F(N) = P(F(R) ⊗ F(S))Q,   (8.1)
where P and Q are permutation matrices defined by the Good-Thomas method. A WSFTA for an R-point FT has the form

F(R) = C1 B1 A1.   (8.2)

In the cases under consideration, A1 and C1 are matrices of zeros and ones, and B1 is a diagonal matrix whose entries are either real or purely imaginary. In the same way, we can write

F(S) = C2 B2 A2.

The dimension of B1 in (8.2) is in general greater than R, and consequently A1 and C1 are not square matrices. Using the tensor product formula,
(A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD),   (8.3)
we can place (8.2) and (8.3) in (8.1) and write the N-point Fourier transform matrix as
F(N) = PCBAQ,   (8.4)

where C and A are matrices of zeros and ones:

C = C1 ⊗ C2,
A = A1 ⊗ A2;

B is a diagonal matrix with real or purely imaginary entries on its diagonal,

B = B1 ⊗ B2,
and P and Q are permutation matrices.
The number of multiplications required to compute an R-point FT by (8.2) is the dimension m(R) of the diagonal matrix B1. It follows that the number of multiplications required to compute an N-point FT is

m(N) = m(R) m(S),

the dimension of the matrix B.
If a(R) denotes the number of additions required to compute an R-point FT by (8.2), then the number of additions required to compute the N-point FT by (8.4) is
a(N) = R a(S) + m(S) a(R).
Kolba and Parks [6] implemented the Good-Thomas algorithm by direct computation of each small FT using the Winograd FFT without nesting. In this case, the numbers of multiplications and additions are given respectively by
m(N) = S m(R) + R m(S),
a(N) = S a(R) + R a(S).

In general, R < m(R) and S < m(S),
which imply that the Kolba-Parks approach has the advantage when it comes to additions. However, in most cases, Winograd's approach has the advantage when measured by multiplications. Tables 8.3-8.6, also taken from Temperton [4], compare the conventional approach, the Good-Thomas approach where conventional methods are used on the factors, the Kolba-Parks approach and the Winograd approach. As can be seen from tables 8.4-8.6, Winograd's technique offers substantial savings in multiplications relative to all other methods. However, it is the least efficient with respect to additions. In all cases, we see that additions dominate multiplications. Temperton [7] argues for the Good-Thomas approach with conventional computation on factors on computers such as CRAY, where additions and multiplications are performed simultaneously. On these computers multiplications are 'free' in the sense that they are carried out while the more numerous additions are being performed. Other implementation considerations are discussed in [8, 9].
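The trade-off between the two operation-count formulas can be made concrete. In the sketch below the per-factor counts m(R), a(R), m(S), a(S) are treated as given inputs; the numeric values are hypothetical placeholders, not entries from the tables:

```python
# Sketch of the operation-count bookkeeping.  The per-factor counts
# m(R), a(R), m(S), a(S) are inputs; the numbers below are
# hypothetical placeholders, not values from the tables in the text.

def winograd_nested(R, S, mR, aR, mS, aS):
    """Nested (Winograd large FT) counts:
    m(N) = m(R)m(S),  a(N) = R*a(S) + m(S)*a(R)."""
    return mR * mS, R * aS + mS * aR

def kolba_parks(R, S, mR, aR, mS, aS):
    """Good-Thomas with small Winograd FTs on the factors, no nesting:
    m(N) = S*m(R) + R*m(S),  a(N) = S*a(R) + R*a(S)."""
    return S * mR + R * mS, S * aR + R * aS

R, S = 7, 9                      # relatively prime factor sizes
mR, aR = 9, 36                   # hypothetical m(R), a(R)
mS, aS = 11, 44                  # hypothetical m(S), a(S)

m_nest, a_nest = winograd_nested(R, S, mR, aR, mS, aS)
m_kp, a_kp = kolba_parks(R, S, mR, aR, mS, aS)

# Since R < m(R) and S < m(S), nesting needs fewer multiplications
# while Kolba-Parks needs fewer additions.
assert m_nest < m_kp and a_kp < a_nest
```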
In the following chapters, we will present a class of algorithms [10-15] that combine features of all of these algorithms. Tensor product rules are used throughout. The fundamental factorization has the form

F(N) = PCAP^{-1},   (8.5)

where P is a permutation matrix, A is a preadditions matrix with all of its entries being 0, 1 or -1, and C is a block-diagonal matrix having skew-circulant blocks (rotated Winograd cores) and tensor products of such blocks.
We have implemented these algorithms and their variants on the Micro VAX II. For a large collection of transform sizes, these algorithms require roughly half the run-time of comparable programs in Digital's Scientific Library (LabStar Version 1.1). We will see this in tables 8.7 and 8.8.
Although the fundamental factorization and its variants are highly structured and uniform, a direct attack on the programming of matrix A is
complicated. There is no apparent looping structure. However, as discussed in [14], the programming effort can be greatly simplified and automated by the use of macros and production rules that take advantage of the 'local' structure of the preadditions.
For all of the transform sizes listed in tables 8.7 and 8.8, we used the Variant 1 form of the fundamental factorization, as described in the follow-ing chapters. The main programs are written in Fortran, and they call the small DFT subroutines written in assembly.
From tables 8.7 and 8.8, we can see that many transform sizes not included in LabStar have been programmed. Timings for most sizes are significantly better than those of the nearest size Cooley-Tukey algorithm.
Tables of Arithmetic Counts and Timing
R.A. — the number of real additions.
R.M. — the number of real multiplications.
Table 8.1 Conventional methods.
Sizes  R.A.  R.M.
2        4     0
3       12     4
4       16     0
5       32    12
7       60    36
8       52     4
9       80    40
16     144    24
Table 8.2 Winograd.
Sizes  R.A.  R.M.
2        4     0
3       12     4
4       16     0
5       34    10
7       72    16
8       52     4
9       88    20
16     148    20
Table 8.3 Conventional methods.

Sizes  R.A.   R.M.
105    2272   1492
108    2018   1012
112    2162   1188
120    2302   1116
126    2684   1672
128    2242    900
240    5322   2708
252    5954   2500
256    5122   2050
315    8492   5728
320    7202   3396
Table 8.4 Good-Thomas.

Sizes  R.A.  R.M.
105    1992   932
112    1968   744
120    2028   508
126    2452  1208
240    4656  1256
252    5408  2416
315    7516  3776
Table 8.5 Kolba-Parks.

Sizes  R.A.  R.M.
105    2214   590
112    2188   396
120    2076   460
126    2780   568
240    4812  1100
252    6064  1136
315    8462  2050
Table 8.6 Winograd.
Sizes  R.A.   R.M.
105     2418   322
112     2332   308
120     2076   276
126     3068   392
240     5016   632
252     6640   784
315    10406  1186
Table 8.7 Timing comparisons (pq and pqr cases).

Size  Factors    pq(pqr)     Dec LabStar
8     2^3           --        1.49 ms.
15    3 x 5       1.13 ms.      --
16    2^4           --        2.87 ms.
21    3 x 7       2.23 ms.      --
32    2^5           --        5.78 ms.
33    3 x 11      4.35 ms.      --
35    5 x 7       4.22 ms.      --
39    3 x 13      4.78 ms.      --
51    3 x 17      6.97 ms.      --
64    2^6           --       12.49 ms.
93    3 x 31     16.03 ms.      --
105   3 x 5 x 7  16.01 ms.      --
128   2^7           --       27.80 ms.

ms. = 10^-3 second.
Table 8.8 Timing comparisons (4p, 4pq and p^2 q cases).

Size  Factors     4p(4pq)     Dec LabStar
8     2^3            --        1.49 ms.
12    4 x 3       0.415 ms.      --
16    2^4            --        2.87 ms.
20    4 x 5       0.985 ms.      --
28    4 x 7        2.92 ms.      --
32    2^5            --        5.78 ms.
44    4 x 11       5.86 ms.      --
45    3^2 x 5      6.52 ms.      --
52    4 x 13       6.53 ms.      --
60    4 x 3 x 5    6.39 ms.      --
64    2^6            --       12.49 ms.
68    4 x 17       9.27 ms.      --
124   4 x 31      20.42 ms.      --
128   2^7            --       27.80 ms.

ms. = 10^-3 second.
References
[1] Rader, C. M. "Discrete Fourier Transforms When the Number of Data Samples Is Prime", Proc. IEEE, 56, 1968, pp. 1107-1108.
[2] Winograd, S. "On Computing the Discrete Fourier Transform", Proc. Nat. Acad. Sci. USA., 73(4), April 1976, pp. 1005-1006.
[3] Winograd, S. "On Computing the Discrete Fourier Transform", Math. of Computation, 32(141), Jan. 1978, pp. 175-199.
[4] Temperton, C. "A Note on Prime Factor FFT Algorithms", J. Comp. Phys., 52, 1983, pp. 198-204.
[5] Blahut, R. E. Fast Algorithms for Digital Signal Processing, Addison-Wesley, 1985, Chapter 8.
[6] Kolba, D. P. and Parks, T. W. "Prime Factor FFT Algorithm Using High Speed Convolution", IEEE Trans. Acoust., Speech and Signal Proc., 25, 1977, pp. 281-294.
[7] Temperton, C. "Implementation of Prime Factor FFT Algorithm on Cray-1", to be published.
[8] Agarwal, R. C. and Cooley, J. W. "Fourier Transform and Convolution Subroutines for the IBM 3090 Vector Facility", IBM J. Res. Devel., 30, Mar. 1986, pp. 145-162.
[9] Agarwal, R. C. and Cooley, J. W. "Vectorized Mixed Radix Discrete Fourier Transform Algorithms", IEEE Proc., 75(9), Sep. 1987.
[10] Heideman, M. T. Multiplicative Complexity, Convolution, and the DFT, Springer-Verlag, 1988.
[11] Lu, C. Fast Fourier Transform Algorithms For Special N's and The Implementations On VAX, Ph.D. Dissertation, The City University of New York, Jan. 1988.
[12] Tolimieri, R., Lu, C. and Johnson, W. R. "Modified Winograd FFT Algorithm and Its Variants for Transform Size N = p^k and Their Implementations", accepted for publication by Advances in Applied Mathematics.
[13] Lu, C. and Tolimieri, R. "Extension of Winograd Multiplicative Algo-rithm to Transform Size N = p2q, p2qr and Their Implementation", Proc. ICASSP 89, Scotland, May 22-26.
[14] Gertner, I. "A New Efficient Algorithm to Compute the Two-Dimensional Discrete Fourier Transform", IEEE Trans. Acoust., Speech and Signal Proc., 36(7), July 1988.
9 MFTA: The Prime Case
9.1 The Field Z/p
For transform size p, p a prime, Rader [1] developed an FT algorithm based on the multiplicative structure of the indexing set. The main idea is as follows. For a prime p, Z/p is a field and the unit group U(p) is cyclic. Reordering input and output data relative to a generator of U(p), the p-point FT becomes essentially a (p - 1) x (p - 1) skew-circulant matrix action. We require 2(p - 1) additions to make this change. Rader computes this skew-circulant action by the convolution theorem that returns the computation to an FT computation. Since the size (p - 1) is a composite number, the (p - 1)-point FT can be implemented by Cooley-Tukey FFT algorithms. The Winograd algorithm for small convolutions also can be applied to the skew-circulant action. (See problems 3, 4 and 5 for basic properties of skew-circulant matrices.)
Example 9.1 If p = 3, then U(3) has the unique generator 2.
Example 9.2 If p = 5, then U(5) has two generators, 2 and 3. We can order U(5) by consecutive powers of 2,
1, 2, 4, 3,
or by consecutive powers of 3,
1, 3, 4, 2.
In the following table, we give generators z corresponding to several odd primes.
Table 9.1 Generator of U(p).
Size       3  5  7  11  13  17
Generator  2  2  3   2   2   6
Choose a generator z of U(p). The order of U(p) is p - 1. We will reorder the indexing set according to successive powers of the generator z as follows:

0, 1, z, z^2, ..., z^(p-2),

with z^(p-1) ≡ 1 mod p. We call this ordering the exponential ordering based on z. The relationship between the canonical ordering and the exponential ordering is given by the indexing set permutation

π(0) = 0,   π(k) = z^(k-1) mod p,   1 ≤ k < p,

which we call the exponential permutation based on z.
Example 9.3 Relative to the generator 2 of U(5), the exponential permutation is
π = (0 1 2 4 3).
Example 9.4 Relative to the generator 3 of U(7), the exponential permutation is

π = (0 1 3 2 6 4 5).
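A sketch of the exponential permutation as just defined (the function name is ours); it reproduces examples 9.3 and 9.4:

```python
# Sketch: build the exponential permutation pi based on a generator z
# of U(p), as defined above: pi(0) = 0 and pi(k) = z^(k-1) mod p.

def exponential_permutation(p, z):
    return [0] + [pow(z, k - 1, p) for k in range(1, p)]

# Reproduces examples 9.3 and 9.4.
assert exponential_permutation(5, 2) == [0, 1, 2, 4, 3]
assert exponential_permutation(7, 3) == [0, 1, 3, 2, 6, 4, 5]
```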
A useful fact in building algorithms is

z^((p-1)/2) ≡ -1 mod p,

which implies that the exponential ordering based on z has the form

0, 1, z, ..., z^((p-3)/2), -1, -z, ..., -z^((p-3)/2).
In general, if π is a permutation of Z/N and P(π) is the corresponding N x N permutation matrix,

P(π)x = [x_π(0), x_π(1), ..., x_π(N-1)]^t,

then the matrix F_π satisfying F(N) = P^{-1}(π) F_π P(π) is given by

F_π = [w^(π(j)π(k))],   0 ≤ j, k < N,   w = e^(2πi/N).
9.2 The Fundamental Factorization
We assume throughout this chapter that a generator z of U(p) has been specified, and that π is the exponential permutation based on z. We will design algorithms computing the p-point FT based on reordering input and output data by the exponential ordering. In matrix formulation this amounts to permuting input and output data by the permutation matrix P corresponding to the exponential permutation π. Explicitly, P is the p x p permutation matrix defined by the formula

Px = y,

where y_0 = x_0 and

y_k = x_(z^(k-1)),   1 ≤ k < p.
Consider the matrix F_π defined by F(p) = P^{-1} F_π P. Set v = e^(2πi/p). Since

v^(π(l)π(k)) = 1,   l = 0 or k = 0,
v^(π(l)π(k)) = v^(z^(l+k-2)),   1 ≤ l, k < p,

we have

      [ 1  1  ...  1 ]
F_π = [ 1            ]
      [ .    C(p)    ]
      [ 1            ],

where C(p) is the (p - 1) x (p - 1) skew-circulant matrix

C(p) = [v^(z^(l+k))],   0 ≤ l, k < p - 1.

We call F_π the FT matrix and C(p) the Winograd core based on the generator z of U(p). Unless otherwise specified, we will assume throughout that a generator has been chosen and suppress the dependence of F_π and C(p) on the choice of generator.
Example 9.5 The Winograd core based on the generator 2 of U(3) is

C(3) = [ v    v^2 ]
       [ v^2  v   ],   v = e^(2πi/3).
Example 9.6 The Winograd core based on the generator 2 of U(5) is

C(5) = [ v    v^2  v^4  v^3 ]
       [ v^2  v^4  v^3  v   ]
       [ v^4  v^3  v    v^2 ]
       [ v^3  v    v^2  v^4 ],   v = e^(2πi/5).
For any vector x of size p, denote by x' the vector of size p - 1 formed by deleting the 0-th component of x. The action of F_π can be computed by the following two formulas. If y = F_π x, then

y_0 = sum_{m=0}^{p-1} x_m,

y' = x_0 1_{p-1} + C(p) x',

where 1_{p-1} is the vector of size p - 1 of all ones.
Example 9.7 Based on the generator 2 of U(5), we can compute y = F_π x by the formulas

y_0 = sum_{m=0}^{4} x_m,

[ y_1 ]   [ v    v^2  v^4  v^3 ] [ x_1 ]   [ x_0 ]
[ y_2 ] = [ v^2  v^4  v^3  v   ] [ x_2 ] + [ x_0 ]
[ y_3 ]   [ v^4  v^3  v    v^2 ] [ x_3 ]   [ x_0 ]
[ y_4 ]   [ v^3  v    v^2  v^4 ] [ x_4 ]   [ x_0 ],   v = e^(2πi/5).
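The two formulas can be checked against the direct action of F_π. The sketch below does this for p = 5 and generator z = 2, with an arbitrary test vector:

```python
# Sketch: check numerically, for p = 5 and generator z = 2, that
# y_0 = sum(x) and y' = x_0 * 1 + C(p) x' reproduce y = F_pi x,
# where (F_pi)_{jk} = v^(pi(j) pi(k)), v = e^(2 pi i / 5).

import cmath

p, z = 5, 2
v = cmath.exp(2j * cmath.pi / p)
pi = [0] + [pow(z, k - 1, p) for k in range(1, p)]

x = [complex(j + 1, -j) for j in range(p)]   # arbitrary test input

# Direct action of F_pi.
y = [sum(v ** (pi[j] * pi[k]) * x[k] for k in range(p)) for j in range(p)]

# Formulas: C(p)_{lk} = v^(z^(l+k)), 0 <= l, k < p - 1.
y0 = sum(x)
yprime = [x[0] + sum(v ** pow(z, l + k, p) * x[k + 1] for k in range(p - 1))
          for l in range(p - 1)]

assert abs(y[0] - y0) < 1e-9
assert all(abs(y[l + 1] - yprime[l]) < 1e-9 for l in range(p - 1))
```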
Up to the permutation of input and output data by the permutation matrix P, the action of F_π computes the action of F(p). Since the permutation matrix P has the form

P = [ 1  0 ]
    [ 0  Q ],

where Q is a permutation matrix, we have

F(p) = [ 1       1_{p'}^t        ]
       [ 1_{p'}  Q^{-1} C(p) Q ].
Set p' = p - 1 and define the p x p matrix

A(p) = [ 1        1_{p'}^t ]
       [ -1_{p'}  I_{p'}   ].

Observe that A(p) does not depend on the choice of the generator z.

Example 9.8

A(3) = [  1  1  1 ]
       [ -1  1  0 ]
       [ -1  0  1 ].
Example 9.9

A(5) = [  1  1  1  1  1 ]
       [ -1  1  0  0  0 ]
       [ -1  0  1  0  0 ]
       [ -1  0  0  1  0 ]
       [ -1  0  0  0  1 ].
Example 9.10 Consider the Winograd core C(5) of example 9.6. Since

1 + v + v^2 + v^3 + v^4 = 0,   v = e^(2πi/5),

we have

C(5) 1_4 = -1_4,

implying that

F_π = [ 1    0   ]
      [ 0  C(5) ] A(5).

Using the matrix direct sum notation, we can rewrite this as

F_π = (1 ⊕ C(5)) A(5),
which leads to the computation of y = F_π x by the following steps:

• Compute

a_0 = x_0 + x_1 + x_2 + x_3 + x_4,
a_1 = x_1 - x_0,
a_2 = x_2 - x_0,
a_3 = x_3 - x_0,
a_4 = x_4 - x_0.

• Compute

y_0 = a_0,

[ y_1 ]          [ a_1 ]
[ y_2 ] = C(5)   [ a_2 ]
[ y_3 ]          [ a_3 ]
[ y_4 ]          [ a_4 ].
The results of example 9.10 hold in general. The main fact we need is that, for v = e^(2πi/p), we have

sum_{k=0}^{p-2} v^(z^k) = 0,   (9.1)

which implies that

C(p) 1_{p'} = -1_{p'},   (9.2)

and the following theorem.
Theorem 9.1

F_π = (1 ⊕ C(p)) A(p),

where C(p) is the Winograd core and

A(p) = [ 1        1_{p'}^t ]
       [ -1_{p'}  I_{p'}   ].
The factorization given in the theorem is called the fundamental factorization. It computes the action of F_π in two stages:

• An additive stage described by the matrix A(p).
• A multiplicative stage described by the Winograd core C(p).
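A numerical check of the fundamental factorization for p = 5, generator z = 2 (the helper names are ours):

```python
# Sketch: verify the fundamental factorization F_pi = (1 (+) C(p)) A(p)
# numerically for p = 5, generator z = 2.

import cmath

p, z = 5, 2
pp = p - 1                     # p' = p - 1
v = cmath.exp(2j * cmath.pi / p)
pi = [0] + [pow(z, k - 1, p) for k in range(1, p)]

F_pi = [[v ** (pi[j] * pi[k]) for k in range(p)] for j in range(p)]
C = [[v ** pow(z, l + k, p) for k in range(pp)] for l in range(pp)]

def A(x):
    """A(p): first component is the sum; below, a_k = x_k - x_0."""
    return [sum(x)] + [x[k] - x[0] for k in range(1, p)]

def one_plus_C(a):
    """Direct sum 1 (+) C(p) acting on a."""
    return [a[0]] + [sum(C[l][k] * a[k + 1] for k in range(pp))
                     for l in range(pp)]

x = [complex(2 * j - 1, j) for j in range(p)]   # arbitrary test input
lhs = [sum(F_pi[j][k] * x[k] for k in range(p)) for j in range(p)]
rhs = one_plus_C(A(x))
assert all(abs(a - b) < 1e-9 for a, b in zip(lhs, rhs))
```

The check works because C(p) 1 = -1: the -x_0 entries introduced by A(p) are turned back into the +x_0 term of the formula for y'.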
Table 9.2 F_π = (1 ⊕ C(p)) A(p): direct method.

Factor  R.A.              R.M.
A(p)    4(p - 1)          0
C(p)    2(p - 1)(2p - 3)  4(p - 1)^2
F_π     2(p - 1)(2p - 1)  4(p - 1)^2
Table 9.3 Arithmetic count of C(p): direct method.

Size  R.A.  R.M.
5       56    64
7      132   144
11     380   400
13     552   576
17     992  1024

R.A. — the number of real additions. R.M. — the number of real multiplications.
The additive stage requires 2p' additions. In the next section, several implementations of the multiplicative stage will be given that have various arithmetic counts. We have called this stage the multiplicative stage since, by the convolution theorem, the skew-circulant matrix C(p) can be diagonalized. (See problems 4 and 5.)
Table 9.4 Arithmetic count of C(p): convolution theorem.

Size  R.A.  R.M.
5       38    12
7       82    36
11     202    76
13     214    76
17     326   100

R.A. — the number of real additions. R.M. — the number of real multiplications.
Since F_π and C(p) are symmetric matrices, taking the transpose on both sides of the fundamental factorization gives

F_π = A^t(p)(1 ⊕ C(p)).   (9.3)
The multiplicative stage now comes before the additive stage. A second algorithm computing

y = F_π x
will now be given. Set

E(p) = 1_p 1_p^t,

the p x p matrix of all ones. Form the matrix F_π - E(p), which can be written as

F_π - E(p) = 0 ⊕ W(p),

where W(p) is the p' x p' skew-circulant matrix given by

W(p) = C(p) - E(p').

The computation becomes

y = (F_π - E(p))x + E(p)x.
We see that

E(p)x = y_0 1_p

can be computed using p' additions. The computation is arranged in two stages:

• y_0 = sum_{j=0}^{p-1} x_j.
• y' = y_0 1_{p'} + W(p) x'.

As in the preceding approach, we require 2p' additions and the action of the p' x p' skew-circulant matrix W(p).
Example 9.11 Based on the generator 2 of U(5),

W(5) = [ v - 1    v^2 - 1  v^4 - 1  v^3 - 1 ]
       [ v^2 - 1  v^4 - 1  v^3 - 1  v - 1   ]
       [ v^4 - 1  v^3 - 1  v - 1    v^2 - 1 ]
       [ v^3 - 1  v - 1    v^2 - 1  v^4 - 1 ],   v = e^(2πi/5).

The computation of y = F_π x can be carried out by

y_0 = sum_{m=0}^{4} x_m,

[ y_1 - y_0 ]          [ x_1 ]
[ y_2 - y_0 ] = W(5)   [ x_2 ]
[ y_3 - y_0 ]          [ x_3 ]
[ y_4 - y_0 ]          [ x_4 ].
9.3 Rader's Algorithm

For a prime p, consider the fundamental factorization

F_π = (1 ⊕ C(p)) A(p).

Throughout this section, set p' = p - 1. Unless otherwise specified, additions and multiplications mean complex additions and complex multiplications. Every addition is equivalent to two real additions. There are several ways of computing multiplications. We will assume that direct methods are used, so that every multiplication requires two real additions and four real multiplications. In this section and the next, we will design variants of the fundamental factorization that reduce multiplications or change the balance between the multiplications and additions.

By the convolution theorem, C(p) can be diagonalized by

D(p) = F(p')^{-1} C(p) F(p')^{-1}.

Placing this result into the fundamental factorization, we have the following result.
Theorem 9.2 (Rader FFT I)

F_π = (1 ⊕ F(p'))(1 ⊕ D(p))(1 ⊕ F(p')) A(p).

Up to the preaddition stage, A(p), and the diagonal multiplication stage, 1 ⊕ D(p), we can compute F_π and F(p) by two p'-point FTs.
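The diagonalization underlying Rader FFT I can be verified directly. The sketch below forms D(p) = F(p')^{-1} C(p) F(p')^{-1} for p = 5 and checks that its off-diagonal entries vanish:

```python
# Sketch: check, for p = 5 and generator z = 2, that
# D(p) = F(p')^{-1} C(p) F(p')^{-1} is diagonal, so the skew-circulant
# stage reduces to two (p-1)-point FTs and a pointwise multiplication.

import cmath

p, z = 5, 2
pp = p - 1
v = cmath.exp(2j * cmath.pi / p)
w = cmath.exp(2j * cmath.pi / pp)          # (p-1)-st root of unity

C = [[v ** pow(z, l + k, p) for k in range(pp)] for l in range(pp)]
Finv = [[w ** (-j * k) / pp for k in range(pp)] for j in range(pp)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(pp)) for j in range(pp)]
            for i in range(pp)]

D = matmul(Finv, matmul(C, Finv))
off = max(abs(D[i][j]) for i in range(pp) for j in range(pp) if i != j)
assert off < 1e-9
```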
9.4 Reducing Additions
Diagonalizing the Winograd core C(p) reduces the number of multiplications needed to carry out the computation. This is an important, but not the only, consideration, even when small computers that have fast adders and slow multipliers are used. Data flow can have a great effect on the actual time cost of carrying out a computation. However, measuring the efficiency of a given data flow is extremely machine-dependent and beyond the scope of this work.
On some larger machines, the speed of additions is nearly equal to that of multiplications. In this case, algorithms that balance between the number of additions and the number of multiplications should be most efficient.
Throughout this section, set p' = p - 1 and denote by e the vector of size p' having 1 in the 0-th component and 0 in all other components. Define the p x p matrix B(p) by

B(p)(1 ⊕ F(p')) = (1 ⊕ F(p')) A(p).

Since

F(p') e = 1_{p'},

we have

B(p) = [ 1      e^t    ]
       [ -p'e   I_{p'} ].

Theorem 9.3 (Rader FFT II)

F_π = (1 ⊕ F(p'))(1 ⊕ D(p)) B(p) (1 ⊕ F(p')),

where

B(p) = [ 1      e^t    ]
       [ -p'e   I_{p'} ].
Rader FFT II is symmetric in the sense that 1 ⊕ F(p') initiates and completes the computation of F_π.
Example 9.12

B(5) = [  1  1  0  0  0 ]
       [ -4  1  0  0  0 ]
       [  0  0  1  0  0 ]
       [  0  0  0  1  0 ]
       [  0  0  0  0  1 ].
Computing the action of B(p) requires two additions and an integer multiplication by -p', which should be compared with the 2p' additions required for the action of A(p). The additions have been dramatically reduced, with one extra multiplication as the trade-off. Another important fact is that the arithmetic of B(p) is independent of p.
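The defining relation B(p)(1 ⊕ F(p')) = (1 ⊕ F(p')) A(p) can be checked numerically for p = 5, with B(5) as in example 9.12:

```python
# Sketch: check B(p)(1 (+) F(p')) = (1 (+) F(p')) A(p) for p = 5.

import cmath

p = 5
pp = p - 1
w = cmath.exp(2j * cmath.pi / pp)

F = [[w ** (j * k) for k in range(pp)] for j in range(pp)]

def direct_sum_1F(M):
    """1 (+) M as a p x p matrix."""
    out = [[0] * p for _ in range(p)]
    out[0][0] = 1
    for i in range(pp):
        for j in range(pp):
            out[i + 1][j + 1] = M[i][j]
    return out

A = [[1] * p] + [[-1 if j == 0 else (1 if j == i else 0)
                  for j in range(p)] for i in range(1, p)]
B = [[1, 1, 0, 0, 0],            # B(5) from example 9.12
     [-4, 1, 0, 0, 0],
     [0, 0, 1, 0, 0],
     [0, 0, 0, 1, 0],
     [0, 0, 0, 0, 1]]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(p)) for j in range(p)]
            for i in range(p)]

G = direct_sum_1F(F)
lhs, rhs = matmul(B, G), matmul(G, A)
assert all(abs(lhs[i][j] - rhs[i][j]) < 1e-9
           for i in range(p) for j in range(p))
```

The relation holds because e^t F(p') is the all-ones row and F(p') applied to the all-ones vector is p'e.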
The second approach for reducing additions comes from the special form of the Winograd core C(p).
Example 9.13 Take p = 7. The matrix C(7) has the form

C(7) = [ X(7)   X*(7) ]
       [ X*(7)  X(7)  ].

Based on the generator 3, we have

X(7) = [ v    v^3  v^2 ]
       [ v^3  v^2  v^6 ]
       [ v^2  v^6  v^4 ],   v = e^(2πi/7).

A straightforward computation of the action of C(7) requires 132 real additions and 144 real multiplications.
A partial diagonalization or block diagonalization method will be used to replace complex arithmetic by purely real arithmetic. Set

Y(7) = (F(2) ⊗ I_3)^{-1} C(7) (F(2) ⊗ I_3)^{-1}.

Y(7) is a block diagonal matrix consisting of two blocks, one with purely real coefficients and one with purely imaginary coefficients. A direct computation shows that

Y(7) = (1/2) [ X(7) + X*(7)       0           ]
             [      0        X(7) - X*(7) ].

Placing this result into the fundamental factorization, we have

F_π = (1 ⊕ (F(2) ⊗ I_3))(1 ⊕ Y(7))(1 ⊕ (F(2) ⊗ I_3)) A(7).
The matrix X(7) + X*(7) has only real entries. Multiplication of a real number and a complex number requires no real additions and two real multiplications. It follows that the action of X(7) + X*(7) requires 18 real multiplications and 12 real additions. The matrix X(7) - X*(7) has only purely imaginary entries. We assume that multiplication by i requires no addition or multiplication. The arithmetic of X(7) - X*(7) is equivalent to that of X(7) + X*(7). It follows that the action of X(7) - X*(7) requires 18 real multiplications and 12 real additions.

The computation of the action of F_π is decomposed into a preaddition stage given by A(7), a two-point FT stage given by 1 ⊕ (F(2) ⊗ I_3), an essentially real multiplication stage given by 1 ⊕ Y(7) and a final two-point FT stage given by 1 ⊕ (F(2) ⊗ I_3). Computing C(7) by this method should be compared to the p = 7 case of table 9.4.
Table 9.5 C(7) = (F(2) ⊗ I_3) Y(7) (F(2) ⊗ I_3).

Factor      R.A.  R.M.
X + X*        12    18
X - X*        12    18
Y             24    36
F(2) ⊗ I_3    12     0
C             48    36

R.A. — the number of real additions. R.M. — the number of real multiplications.
The general case follows in the same way. C(p) has the form

C(p) = [ X(p)   X*(p) ]
       [ X*(p)  X(p)  ].   (9.4)
The partial diagonalization method leads to the next result.
Theorem 9.4 (Rader FFT III)

F_π = (1 ⊕ (F(2) ⊗ I_{p'/2}))(1 ⊕ Y(p))(1 ⊕ (F(2) ⊗ I_{p'/2})) A(p),

where Y(p) is the block diagonal matrix

Y(p) = (1/2) [ X(p) + X*(p)       0           ]
             [      0        X(p) - X*(p) ].
As in the example, X(p) + X*(p) has only real entries and X(p) - X*(p) has only purely imaginary entries. The action of Y(p) requires only real arithmetic.
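A check of the partial diagonalization for p = 7, generator z = 3; it confirms the block structure of Y(7) claimed above:

```python
# Sketch: for p = 7, generator z = 3, form
# Y(7) = (F(2) (x) I3)^{-1} C(7) (F(2) (x) I3)^{-1} and check that it
# is block diagonal with a purely real upper block and a purely
# imaginary lower block.

import cmath

p, z = 7, 3
pp = p - 1
h = pp // 2
v = cmath.exp(2j * cmath.pi / p)

C = [[v ** pow(z, l + k, p) for k in range(pp)] for l in range(pp)]

# F(2) (x) I3 squares to 2I, so its inverse is half of itself.
T = [[0.0] * pp for _ in range(pp)]
for i in range(h):
    T[i][i] = T[i][i + h] = T[i + h][i] = 1.0
    T[i + h][i + h] = -1.0
Tinv = [[t / 2 for t in row] for row in T]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(pp)) for j in range(pp)]
            for i in range(pp)]

Y = matmul(Tinv, matmul(C, Tinv))
for i in range(pp):
    for j in range(pp):
        if (i < h) != (j < h):                 # off-diagonal blocks vanish
            assert abs(Y[i][j]) < 1e-9
        elif i < h:                            # upper block is real
            assert abs(Y[i][j].imag) < 1e-9
        else:                                  # lower block purely imaginary
            assert abs(Y[i][j].real) < 1e-9
```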
Denote by f the vector of size p' whose first p'/2 components are 1 and whose second p'/2 components are 0. Define the p x p matrix B1(p) by

B1(p)(1 ⊕ (F(2) ⊗ I_{p'/2})) = (1 ⊕ (F(2) ⊗ I_{p'/2})) A(p).

A direct computation shows that

B1(p) = [ 1     f^t    ]
        [ -2f   I_{p'} ].
In general, the action of B1(p) requires p' additions.
Table 9.6 Rader FFT III.

Factor           R.A.              R.M.
X(p) + X*(p)     (p - 1)(p - 3)/2  (p - 1)^2/2
X(p) - X*(p)     (p - 1)(p - 3)/2  (p - 1)^2/2
Y(p)             (p - 1)(p - 3)    (p - 1)^2
F(2) ⊗ I_{p'/2}  2(p - 1)          0
C(p)             (p - 1)(p + 1)    (p - 1)^2
F_π              (p - 1)(p + 5)    (p - 1)^2
Table 9.7 Rader FFT III.

Size  R.A.  R.M.
5       40    16
7       72    36
11     160   100
13     216   144
17     352   256
19     432   324

R.A. — the number of real additions. R.M. — the number of real multiplications.
Theorem 9.5 (Rader FFT IV)

F_π = (1 ⊕ (F(2) ⊗ I_{p'/2}))(1 ⊕ Y(p)) B1(p) (1 ⊕ (F(2) ⊗ I_{p'/2})),

where

B1(p) = [ 1     f^t    ]
        [ -2f   I_{p'} ].
As in Rader FFT II, the factorization is symmetric.
Table 9.8 Rader FFT IV.

Factor  R.A.            R.M.
B1(p)   2(p - 1)        p - 1
F_π     (p - 1)(p + 3)  p(p - 1)
Table 9.9 Rader FFT IV.

Size  R.A.  R.M.
5       32    20
7       60    42
11     140   110
13     192   156
17     320   272
19     396   342

R.A. — the number of real additions. R.M. — the number of real multiplications.
9.5 Winograd Small FT Algorithm
The action of the Winograd core C(p) in the fundamental factorization also can be computed by the Winograd small convolution algorithm. (See problems 4 and 5.) Recall that the Winograd factorization for a skew-circulant matrix C(p) has the form

C(p) = LGM,   (9.5)

where L and M are matrices of small integers and G is a diagonal matrix. The matrices L and M are generally not square matrices.

Example 9.14 Consider the five-point FT matrix factorization

F_π = [ 1    0   ]
      [ 0  C(5) ] A(5),
where

C(5) = [ v    v^2  v^4  v^3 ]
       [ v^2  v^4  v^3  v   ]
       [ v^4  v^3  v    v^2 ]
       [ v^3  v    v^2  v^4 ],   v = e^(2πi/5).
Several Winograd factorizations of C(5) have been derived in chapter 5. For example, we have

C(5) = LGM,

where

L = [  1   0  -1   1   0   1 ]
    [ -1   1  -1   1  -1   1 ]
    [ -1   0   1   1   0   1 ]
    [  1  -1   1   1  -1   1 ],

M = [ 1   0  -1   0 ]
    [ 1  -1  -1   1 ]
    [ 0   1   0  -1 ]
    [ 1   0   1   0 ]
    [ 1  -1   1  -1 ]
    [ 0   1   0   1 ],

and G = diag(g), where

g = (1/2) M [v, v^3, v^4, v^2]^t
  = (1/2) [ v - v^4,
            v - v^4 - (v^3 - v^2),
            v^3 - v^2,
            v + v^4,
            v + v^4 - (v^3 + v^2),
            v^3 + v^2 ]^t.
Then

F_π = [ 1  0 ] [ 1  0 ] [ 1  0 ] A(5).
      [ 0  L ] [ 0  G ] [ 0  M ]
In general, if (9.5) is the Winograd factorization of the Winograd core C(p), then we have the factorization

F_π = L'G'M',

where G' is the diagonal matrix

G' = 1 ⊕ G,

and L' and M' are matrices of small integers,

L' = 1 ⊕ L,
M' = (1 ⊕ M) A(p).
Since

F(p) = P^{-1} F_π P = P^{-1} L'G'M' P,

by setting

A = M'P,
B = G',
C = P^{-1}L',

we have

F(p) = CBA,   (9.6)

which has the same form as the Winograd small FFT algorithm. This form was given in chapter 8. The computation of (9.6) can be carried out in three stages: The first stage is the preaddition stage given by matrix A, the second stage is the multiplication stage given by the diagonal matrix B, and the last stage is the postaddition stage given by matrix C.
The Winograd algorithm increases the number of additions but greatly reduces the number of multiplications.
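A numerical check of a Winograd factorization C(5) = LGM of the five-point Winograd core (generator z = 2); the integer matrices L, M and the diagonal g below follow the chapter 5 construction:

```python
# Sketch: verify C(5) = L G M for the five-point Winograd core,
# generator z = 2, with integer L, M and G = diag(g).

import cmath

v = cmath.exp(2j * cmath.pi / 5)
C = [[v ** pow(2, l + k, 5) for k in range(4)] for l in range(4)]

L = [[ 1,  0, -1, 1,  0, 1],
     [-1,  1, -1, 1, -1, 1],
     [-1,  0,  1, 1,  0, 1],
     [ 1, -1,  1, 1, -1, 1]]
M = [[1,  0, -1,  0],
     [1, -1, -1,  1],
     [0,  1,  0, -1],
     [1,  0,  1,  0],
     [1, -1,  1, -1],
     [0,  1,  0,  1]]
# g = (1/2) M [v, v^3, v^4, v^2]^t
col = [v, v ** 3, v ** 4, v ** 2]
g = [sum(M[i][j] * col[j] for j in range(4)) / 2 for i in range(6)]

LGM = [[sum(L[i][k] * g[k] * M[k][j] for k in range(6)) for j in range(4)]
       for i in range(4)]
assert all(abs(LGM[i][j] - C[i][j]) < 1e-9
           for i in range(4) for j in range(4))
```

Only six genuine multiplications (the entries of g) are used to apply the 4 x 4 core; L and M cost additions only.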
9.6 Summary
For a generator z of U(p), define the matrix F_π by

F(p) = P^{-1} F_π P,

where π is the exponential permutation based on z and P is the permutation matrix corresponding to π. F_π describes the p-point FT on input and output data ordered exponentially by z.
Set p' = p - 1 and v = e^(2πi/p). Define the preaddition matrices

A(p) = [ 1        1_{p'}^t ]
       [ -1_{p'}  I_{p'}   ],

B(p) = [ 1      e^t    ]
       [ -p'e   I_{p'} ],

B1(p) = [ 1     f^t    ]
        [ -2f   I_{p'} ],
where e and f are vectors of size p', defined in section 9.4. Define the Winograd core C(p) as the skew-circulant matrix with 0-th row

v, v^z, v^(z^2), ..., v^(z^(p-2)),
and observe that C(p) has the form

C(p) = [ X(p)   X*(p) ]
       [ X*(p)  X(p)  ].
Define the diagonal and block diagonal matrices

D(p) = F(p')^{-1} C(p) F(p')^{-1},

Y(p) = (1/2) [ X(p) + X*(p)       0           ]
             [      0        X(p) - X*(p) ].
Fundamental factorization

F_π = (1 ⊕ C(p)) A(p).

Rader FFT I

F_π = (1 ⊕ F(p'))(1 ⊕ D(p))(1 ⊕ F(p')) A(p).

Rader FFT II

F_π = (1 ⊕ F(p'))(1 ⊕ D(p)) B(p) (1 ⊕ F(p')).

Rader FFT III

F_π = (1 ⊕ (F(2) ⊗ I_{p'/2}))(1 ⊕ Y(p))(1 ⊕ (F(2) ⊗ I_{p'/2})) A(p).

Rader FFT IV

F_π = (1 ⊕ (F(2) ⊗ I_{p'/2}))(1 ⊕ Y(p)) B1(p) (1 ⊕ (F(2) ⊗ I_{p'/2})).

Winograd Algorithm

By the Winograd small convolution algorithm, we have

C(p) = LGM,

where L and M are matrices of small integers and G is a diagonal matrix, and we have

F_π = (1 ⊕ L)(1 ⊕ G)(1 ⊕ M) A(p).
The implementation of these algorithms has been carried out on the Micro VAX II. A major issue, apart from run-time, is simplicity of code generation. In particular, for the Micro VAX, Rader FFTs I and II appear to have the best structure for programming. However, for implementation on computers such as the CRAY X-MP and IBM 3090, where multiplications are tied to additions, an arithmetically balanced algorithm is preferred. In this case, Rader FFT IV is the best.
References
[1] Rader, C. M. "Discrete Fourier Transforms When the Number of Data Samples Is Prime", Proc. IEEE, 56, 1968, pp. 1107-1108.
[2] Winograd, S. "On Computing the Discrete Fourier Transform", Proc. Nat. Acad. Sci. USA, 73(4), April 1976, pp. 1005-1006.
[3] Winograd, S. "On Computing the Discrete Fourier Transform", Math. Comput., 32, Jan. 1978, pp. 175-199.
[4] Blahut, R. Fast Algorithms for Digital Signal Processing, Addison-Wesley Pub. Co., 1985, Chapter 4.
[5] Heideman, M. T. Multiplicative Complexity, Convolution, and the DFT, Springer-Verlag, 1988, Chapter 5.
Problems
1. Show that 3, 2, 2 and 6 are generators of the unit group U(7), U(11), U(13) and U(17), respectively.
2. Find generators for the unit group U(7), U(11), U(13) and U(17) that are different from the ones given in problem 1.
An N x N matrix S is skew-circulant if S_{i,j} = S_{k,l} whenever i + j ≡ k + l mod N.
3. Show that for a skew-circulant matrix S, S^t = S.
4. Show that C is a circulant matrix if and only if S = CR is a skew-circulant matrix, where R is the time-reversal matrix.
5. Use problem 3 to show that FSF is a diagonal matrix whenever S is a skew-circulant matrix and F is the FT matrix.
6. Order Z/7, Z/11, Z/13 and Z/17 by the powers of the generators given in problem 1.
7. Write the Winograd core corresponding to the generator z = 3 for U(7).
8. Write the Winograd core corresponding to the generator z = 5 for U(7), and observe the difference with problem 7.
9. F(7) can be written as

F(7) = [ 1   1^t           ]
       [ 1   Q^{-1} C(7) Q ].

Determine the permutation matrix Q.
10. Verify table 9.4.
11. Derive the Rader algorithm for an 11-point Fourier transform.
12. Derive the Good-Thomas algorithm for F(10) in terms of F(2) and F(5). Using this derivation, compute the C(11) Winograd core.
13. If a given computer has a CPU time ratio of one real multiplication per ten real additions, what is the threshold size p for which we would choose Rader FFT II to compute F(p) instead of Rader FFT I?
14. Prove the formulas given in table 9.3.
15. What is the basic difference between the Winograd algorithm and the algorithms derived in sections 2, 3 and 4?
16. Give the arithmetic counts for the five-point Winograd algorithm.
17. Derive the Winograd algorithm for F(3).
10 MFTA: Product of Two Distinct Primes
10.1 Basic Algebra
The results of chapter 9 will now be extended to the case of a transform size that is a product of two distinct primes. As mentioned in the general introduction to multiplicative FT algorithms, several approaches exist for combining small size algorithms into medium or large size FT algorithms by the Good-Thomas FFT. The advantage of using the Good-Thomas FFT is that tensor product rules directly construct multiplicative FT algorithms for appropriate composite size cases. The method is completely algebraic and results in composite size algorithms whose factors contain tensor products of prime size factors. However, these results are not totally appealing since complex permutations appear. A related problem is that tensor products are taken over direct sum factors.
In the following two chapters, we will derive multiplicative composite size FT algorithms based directly on the CRT ring-isomorphism. This approach will necessarily repeat some of the constructions used in deriving the Good-Thomas FFT, but will result in multiplicative composite size FT algorithms that more naturally extend the prime size cases.
Our approach emphasizes and is motivated by the results of chapter 9. By employing tensor product rules, we derive the fundamental factorization
F, = CA,
where C is a block diagonal matrix having skew-circulant blocks (rotated Winograd cores) and tensor products of these skew-circulant blocks, and A is a matrix of preadditions. Variants will then be derived.
Take N = pq, where p and q are distinct primes, and consider the ring Z/N. Throughout this section, set p' = p - 1 and q' = q - 1. The unit group of Z/N,
U(N) = {a ∈ Z/N : (a, N) = 1},
is not a cyclic group. To determine the structure of U(N), we will use the CRT. Throughout we assume that generators z_1 and z_2 of U(p) and U(q) have been specified and suppress the dependence of the Winograd cores C(p) and C(q) on these generators.
Consider the complete system of idempotents {e_1, e_2} for the factorization N = pq. In chapter 5, in the derivation of the Good-Thomas PFA, we defined the ring-isomorphism
φ : Z/p × Z/q → Z/N
by the formula
φ(a_1, a_2) = a_1 e_1 + a_2 e_2 mod N, 0 ≤ a_1 < p, 0 ≤ a_2 < q.
The ring-direct product Z/p × Z/q is taken with respect to coordinatewise addition and multiplication. The ring-isomorphism φ restricts to a group-isomorphism
φ : U(p) × U(q) → U(N).
Since U(p) is a cyclic group, U(N) is the direct product of cyclic groups of order p' and q', and every u ∈ U(N) can be written uniquely as
u ≡ z_1^l e_1 + z_2^k e_2 mod N, 0 ≤ l < p', 0 ≤ k < q'.
We order U(N) by taking k to be the faster running parameter. An element a ∈ Z/N that is not a unit is called a zero divisor. The set
e_1 U(N) = {z_1^l e_1 : 0 ≤ l < p'}
consists of all elements in Z/N that are divisible by q but not p, and the set
e_2 U(N) = {z_2^k e_2 : 0 ≤ k < q'}
consists of all elements in Z/N that are divisible by p but not q. The ordering of U(N) induces an ordering of the sets e_1 U(N) and e_2 U(N). Order the set Z/N by the permutation
π = (0, e_1 U(N), e_2 U(N), U(N)).
We call π the exponential permutation based on the factorization N = pq.
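The construction above can be sketched in code. This is our own illustration, not from the text; the helper names (complete_idempotents, exponential_permutation) are hypothetical, and the idempotents are found by brute force rather than by the CRT formula of chapter 5.

```python
def complete_idempotents(p, q):
    """Return (e1, e2) with e1 = 1 mod p, 0 mod q and e2 = 0 mod p, 1 mod q."""
    N = p * q
    e1 = next(a for a in range(N) if a % p == 1 and a % q == 0)
    e2 = next(a for a in range(N) if a % p == 0 and a % q == 1)
    return e1, e2

def exponential_permutation(p, q, z1, z2):
    """Ordering (0; e1 U(N); e2 U(N); U(N)), with the second exponent k faster."""
    N = p * q
    e1, e2 = complete_idempotents(p, q)
    units = [(pow(z1, l, p) * e1 + pow(z2, k, q) * e2) % N
             for l in range(p - 1) for k in range(q - 1)]   # k runs faster
    return ([0]
            + [pow(z1, l, p) * e1 % N for l in range(p - 1)]
            + [pow(z2, k, q) * e2 % N for k in range(q - 1)]
            + units)
```

With p = 3, q = 5 and generators z_1 = z_2 = 2, exponential_permutation(3, 5, 2, 2) reproduces the ordering (0; 10, 5; 6, 12, 9, 3; 1, 7, 4, 13, 11, 2, 14, 8) derived in the next section.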
10.2 Transform Size: 15
Take p = 3, q = 5 and N = 15. In this case, the idempotents are
e_1 = 10, e_2 = 6,
and every element a ∈ Z/15 can be written uniquely as
a ≡ a_1 · 10 + a_2 · 6 mod 15, 0 ≤ a_1 < 3, 0 ≤ a_2 < 5.
Take generators 2 of U(3) and 2 of U(5). Every element u ∈ U(15) can be written as
u ≡ 2^l · 10 + 2^k · 6 mod 15, 0 ≤ l < 2, 0 ≤ k < 4.
U(15) is ordered as
1, 7, 4, 13, 11, 2, 14, 8,
and we have
10 U(15) = {10, 5}, 6 U(15) = {6, 12, 9, 3}.
The exponential permutation based on 15 = 3 × 5 is
π = (0; 10, 5; 6, 12, 9, 3; 1, 7, 4, 13, 11, 2, 14, 8).
We now proceed to describe the matrix
F_π = [w^{π(j)π(k)}], 0 ≤ j, k < 15, w = e^{2πi/15}.
Set
u = w^5 = e^{2πi/3}, v = w^3 = e^{2πi/5},
and set
C3 = [ u^2  u   ]     C5 = [ v^2  v^4  v^3  v   ]
     [ u    u^2 ],         [ v^4  v^3  v    v^2 ]
                           [ v^3  v    v^2  v^4 ]
                           [ v    v^2  v^4  v^3 ].
The matrices C3 and C5 are rotated Winograd cores. C3 is formed by replacing u by u^2 in C(3), and C5 is formed by replacing v by v^2 in C(5). We also can write
C3 = [ 0 1 ] C(3),     C5 = [ 0 1 0 0 ] C(5).
     [ 1 0 ]                [ 0 0 1 0 ]
                            [ 0 0 0 1 ]
                            [ 1 0 0 0 ]
Direct computation shows that the bottom right-hand 8 × 8 submatrix of F_π is the tensor product
C3 ⊗ C5 = [ u^2 C5  u C5   ]
          [ u C5    u^2 C5 ].
Denote by 1_m the vector of size m of all 1's and by E(m, n) = 1_m ⊗ 1_n^t the m × n matrix of all 1's. Set E(m) = E(m, m). We can rewrite F_π as
F_π = [ 1     1_2^t      1_4^t      1_8^t      ]
      [ 1_2   C3         E(2, 4)    C3 ⊗ 1_4^t ]
      [ 1_4   E(4, 2)    C5         1_2^t ⊗ C5 ]
      [ 1_8   C3 ⊗ 1_4   1_2 ⊗ C5   C3 ⊗ C5   ].
The highly structured form of F„, especially the repetition of C3 and C5 throughout the matrix, results from controlling data flow by the idempotents.
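The block structure just described is easy to confirm numerically. The following check is ours, not from the text; it builds F_π directly from the exponential permutation and compares the three diagonal blocks against the rotated Winograd cores.

```python
import numpy as np

w = np.exp(2j * np.pi / 15)
u, v = w**5, w**3                  # u = e^{2 pi i/3}, v = e^{2 pi i/5}
pi = [0, 10, 5, 6, 12, 9, 3, 1, 7, 4, 13, 11, 2, 14, 8]
F = w ** np.outer(pi, pi)          # F_pi = [w^{pi(j) pi(k)}]

C3 = np.array([[u**2, u], [u, u**2]])
C5 = np.array([[v**2, v**4, v**3, v],
               [v**4, v**3, v,    v**2],
               [v**3, v,    v**2, v**4],
               [v,    v**2, v**4, v**3]])
assert np.allclose(F[1:3, 1:3], C3)             # e1 U(15) x e1 U(15) block
assert np.allclose(F[3:7, 3:7], C5)             # e2 U(15) x e2 U(15) block
assert np.allclose(F[7:, 7:], np.kron(C3, C5))  # U(15) x U(15) block
```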
10.3 Fundamental Factorization: 15
We will now derive for F_π a factorization of the form
F_π = CA, (10.1)
where A is a matrix of additions and C is a block diagonal matrix having skew-circulant blocks. This will generalize the prime transform size factorization of the same form derived in the preceding chapter. As in the preceding chapter, factorization (10.1) will be a springboard for several algorithms distinguished by arithmetic and data flow.
First,
C3 1_2 = -1_2, (10.2)
C5 1_4 = -1_4. (10.3)
The tensor product formula
(A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)
implies that
C3(I_2 ⊗ 1_4^t) = C3 ⊗ 1_4^t,
C5(1_2^t ⊗ I_4) = 1_2^t ⊗ C5.
Using (10.2) and (10.3) along with
E(m, n) = 1_m ⊗ 1_n^t,
we have
C3 E(2, 4) = -E(2, 4),
C5 E(4, 2) = -E(4, 2),
(C3 ⊗ C5) 1_8 = 1_8,
and we can write F_π = CA,
where C is the block diagonal matrix
C = 1 ⊕ C3 ⊕ C5 ⊕ (C3 ⊗ C5),
and
A = [ 1     1_2^t       1_4^t       1_8^t     ]
    [ -1_2  I_2         -E(2, 4)    I_2 ⊗ 1_4^t ]
    [ -1_4  -E(4, 2)    I_4         1_2^t ⊗ I_4 ]
    [ 1_8   -I_2 ⊗ 1_4  -1_2 ⊗ I_4  I_8       ].
We can relate the matrix A to the matrix of additions in chapter 9. Recall that
A(p) = [ 1      1_{p'}^t ]
       [ -1_{p'}  I_{p'} ].
We can rewrite A as
A = [ A(3)         A(3) ⊗ 1_4^t ]
    [ -A(3) ⊗ 1_4  A(3) ⊗ I_4   ].
Now let Q_0 = P(12, 4), as in chapter 2, and Q = I_3 ⊕ Q_0. Straightforward computation shows that
Q_0(A(3) ⊗ I_4)Q_0^{-1} = I_4 ⊗ A(3),
(A(3) ⊗ 1_4^t)Q_0^{-1} = 1_4^t ⊗ A(3),
Q_0(A(3) ⊗ 1_4) = 1_4 ⊗ A(3),
and we have
QAQ^{-1} = A(5) ⊗ A(3),
proving the following result.
Theorem 10.1
F_π = (1 ⊕ C3 ⊕ C5 ⊕ (C3 ⊗ C5))Q^{-1}(A(5) ⊗ A(3))Q,
where Q = I_3 ⊕ P(12, 4), and C3 and C5 are the rotated Winograd cores
C3 = [ u^2  u   ]   u = e^{2πi/3},
     [ u    u^2 ],
C5 = [ v^2  v^4  v^3  v   ]
     [ v^4  v^3  v    v^2 ]   v = e^{2πi/5}.
     [ v^3  v    v^2  v^4 ]
     [ v    v^2  v^4  v^3 ],
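As a numerical sanity check of the fundamental factorization F_π = CA (our own, not from the text), the block form of A derived just before the theorem can be assembled explicitly, which avoids committing to a stride-permutation convention for P(12, 4):

```python
import numpy as np

w = np.exp(2j * np.pi / 15)
u, v = w**5, w**3
pi = [0, 10, 5, 6, 12, 9, 3, 1, 7, 4, 13, 11, 2, 14, 8]
F = w ** np.outer(pi, pi)

C3 = np.array([[u**2, u], [u, u**2]])
C5 = np.array([[v**2, v**4, v**3, v],
               [v**4, v**3, v,    v**2],
               [v**3, v,    v**2, v**4],
               [v,    v**2, v**4, v**3]])

E = lambda m, n: np.ones((m, n))
# A in the block form derived above.
A = np.block([
    [np.ones((1, 1)),  np.ones((1, 2)),  np.ones((1, 4)),  np.ones((1, 8))],
    [-np.ones((2, 1)), np.eye(2),        -E(2, 4),         np.kron(np.eye(2), np.ones((1, 4)))],
    [-np.ones((4, 1)), -E(4, 2),         np.eye(4),        np.kron(np.ones((1, 2)), np.eye(4))],
    [np.ones((8, 1)),  -np.kron(np.eye(2), np.ones((4, 1))),
                       -np.kron(np.ones((2, 1)), np.eye(4)), np.eye(8)],
])
# C = 1 (+) C3 (+) C5 (+) (C3 (x) C5), block diagonal.
C = np.zeros((15, 15), dtype=complex)
C[0, 0] = 1
C[1:3, 1:3] = C3
C[3:7, 3:7] = C5
C[7:, 7:] = np.kron(C3, C5)
assert np.allclose(C @ A, F)
```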
Table 10.1 F_π = CA, direct method (N = 15).
Factor   R.A.   R.M.
A        88     0
C        228    272
F_π      316    272
R.A. — the number of real additions. R.M. — the number of real multiplications.
10.4 Variants: 15
Variants of the above factorization will be designed in the spirit of chapter 9. First, by the convolution theorem,
D3 = F(2)^{-1} C3 F(2)^{-1},
D5 = F(4)^{-1} C5 F(4)^{-1}
are diagonal matrices. Setting
F = 1 ⊕ F(2) ⊕ F(4) ⊕ (F(2) ⊗ F(4)),
D = 1 ⊕ D3 ⊕ D5 ⊕ (D3 ⊗ D5),
we have that D is a diagonal matrix and we can write
C = FDF,
proving the next result.
Theorem 10.2
F_π = F(1 ⊕ D3 ⊕ D5 ⊕ (D3 ⊗ D5))FQ^{-1}(A(5) ⊗ A(3))Q,
where F is the FT factor
F = 1 ⊕ F(2) ⊕ F(4) ⊕ (F(2) ⊗ F(4))
and Q = I_3 ⊕ P(12, 4).
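That D3 and D5 really come out diagonal can be checked directly (our own verification; the rotated cores are left-circulant, which is exactly the structure diagonalized by pre- and post-multiplication with inverse DFT matrices):

```python
import numpy as np

def dft(n):
    """DFT matrix F(n) = [e^{2 pi i jk/n}]."""
    j = np.arange(n)
    return np.exp(2j * np.pi * np.outer(j, j) / n)

u, v = np.exp(2j * np.pi / 3), np.exp(2j * np.pi / 5)
C3 = np.array([[u**2, u], [u, u**2]])
C5 = np.array([[v**2, v**4, v**3, v],
               [v**4, v**3, v,    v**2],
               [v**3, v,    v**2, v**4],
               [v,    v**2, v**4, v**3]])
D3 = np.linalg.inv(dft(2)) @ C3 @ np.linalg.inv(dft(2))
D5 = np.linalg.inv(dft(4)) @ C5 @ np.linalg.inv(dft(4))
assert np.allclose(D3, np.diag(np.diag(D3)))   # off-diagonal entries vanish
assert np.allclose(D5, np.diag(np.diag(D5)))
```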
As in chapter 9, variants can be designed to reduce additions. First,
F_π = FDBF,
where B = FAF^{-1}.
By interchanging the order of computation, the action of A is replaced by the action of B, which, as we will now show, reduces the additions at the expense of a few rational multiplications. Define, as in chapter 9,
B(3) = [ 1   1  0 ]     B(5) = [ 1   1  0  0  0 ]
       [ -2  1  0 ],           [ -4  1  0  0  0 ]
       [ 0   0  1 ]            [ 0   0  1  0  0 ]
                               [ 0   0  0  1  0 ]
                               [ 0   0  0  0  1 ].
For the purpose of this discussion, set
X = 1 ⊕ F(2),
and observe that F = X ⊕ (X ⊗ F(4)).
A straightforward computation shows that
XA(3) = B(3)X,
which is what we need to prove that
B = [ B(3)          B(3) ⊗ e^t ]
    [ -4 B(3) ⊗ e   B(3) ⊗ I_4 ],
where e^t = [1 0 0 0].
We see that 40 real additions and 8 multiplications by small integers are needed to compute the action of B.
The action of B can also be computed, without changing the arithmetic, by the factorization
B = Q^{-1}(B(5) ⊗ B(3))Q,
where Q = I_3 ⊕ P(12, 4). We then have the next result.
Theorem 10.3
F_π = F(1 ⊕ D3 ⊕ D5 ⊕ (D3 ⊗ D5))Q^{-1}(B(5) ⊗ B(3))QF,
where Q = I_3 ⊕ P(12, 4) and F is the FT factor defined in the previous theorem.
C3 and C5 can be written in the form
C3 = [ X3   X3* ]     C5 = [ X5   X5* ]
     [ X3*  X3  ],         [ X5*  X5  ].
Setting
Y3 = (1/2)(X3 + X3*) ⊕ (1/2)(X3 - X3*),
Y5 = (1/2)(X5 + X5*) ⊕ (1/2)(X5 - X5*),
we have the next result by Rader FFT III.
Theorem 10.4
F_π = H(1 ⊕ Y3 ⊕ Y5 ⊕ (Y3 ⊗ Y5))HQ^{-1}(A(5) ⊗ A(3))Q,
where Q = I_3 ⊕ P(12, 4) and H is the FT factor
H = 1 ⊕ F(2) ⊕ (F(2) ⊗ I_2) ⊕ (F(2) ⊗ F(2) ⊗ I_2).
The factor Y contains all of the multiplications. Reasoning as in the previous chapter, these are all real multiplications. Computing the action of F_π in this way requires 200 real additions and 68 real multiplications.
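The splitting behind this variant can be checked numerically. Our own verification (not from the text): C5 = (F(2) ⊗ I_2) Y5 (F(2) ⊗ I_2), and the two diagonal blocks of Y5 are purely real and purely imaginary, which is why only real multiplications remain.

```python
import numpy as np

v = np.exp(2j * np.pi / 5)
C5 = np.array([[v**2, v**4, v**3, v],
               [v**4, v**3, v,    v**2],
               [v**3, v,    v**2, v**4],
               [v,    v**2, v**4, v**3]])
X5, X5c = C5[:2, :2], C5[:2, 2:]
assert np.allclose(X5c, X5.conj())          # C5 = [[X5, X5*], [X5*, X5]]

Y5 = np.zeros((4, 4), dtype=complex)
Y5[:2, :2] = (X5 + X5c) / 2                 # purely real block
Y5[2:, 2:] = (X5 - X5c) / 2                 # purely imaginary block
F2I = np.kron(np.array([[1, 1], [1, -1]]), np.eye(2))
assert np.allclose(F2I @ Y5 @ F2I, C5)
assert np.allclose(Y5[:2, :2].imag, 0) and np.allclose(Y5[2:, 2:].real, 0)
```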
Table 10.2 C = HYH (N = 15).
Factor   R.A.   R.M.
H        44     0
Y        24     68
C        112    68
The cost of additions can be reduced by computing F_π as
F_π = HYB_1H,
where B_1 = HAH^{-1}.
From chapter 9, recall the definitions
B_1(3) = [ 1   1  0 ]     B_1(5) = [ 1   1  1  0  0 ]
         [ -2  1  0 ],             [ -2  1  0  0  0 ]
         [ 0   0  1 ]              [ -2  0  1  0  0 ]
                                   [ 0   0  0  1  0 ]
                                   [ 0   0  0  0  1 ].
By the usual tensor product manipulations, we have
B_1 = Q^{-1}(B_1(5) ⊗ B_1(3))Q,
with Q = I_3 ⊕ P(12, 4).
Theorem 10.5
F_π = H(1 ⊕ Y3 ⊕ Y5 ⊕ (Y3 ⊗ Y5))Q^{-1}(B_1(5) ⊗ B_1(3))QH,
where Q = I_3 ⊕ P(12, 4) and H is the FT factor given in the previous theorem.
Table 10.3 F_π = HYB_1H (N = 15).
Factor   R.A.   R.M.
B_1      44     {11}
F_π      156    68 + {11}
Multiplication by integers has been placed in brackets.
10.5 Transform Size: pq
In this section, algorithms for N = 15 will be generalized to the case of N = pq, where p and q are distinct primes. Set U = U(N) and v = e^{2πi/N}. Throughout, z_1 and z_2 are generators of U(p) and U(q), and we suppress the dependence of the Winograd cores on these generators. Set p' = p - 1 and q' = q - 1.
Denote by {e_1, e_2} the complete system of idempotents for the factorization N = pq. Partition the indexing set Z/N by the sets
{0},
e_1 U = {z_1^k e_1 : 0 ≤ k < p'},
e_2 U = {z_2^l e_2 : 0 ≤ l < q'},
U = {z_1^k e_1 + z_2^l e_2 : 0 ≤ k < p', 0 ≤ l < q'}.
The permutation
π = (0; e_1 U; e_2 U; U)
is the exponential permutation based on the factorization N = pq. Although the definition of π contains the idempotents e_1 and e_2, since the idempotents are uniquely determined by the factorization, the definition of the exponential permutation is unambiguous once the generators z_1 and z_2 are specified.
Consider the submatrix corresponding to the Cartesian product
e_1 U × e_1 U.
Since
(e_1 z_1^k)(e_1 z_1^r) = e_1^2 z_1^{k+r} ≡ e_1 z_1^{k+r} mod N,
the submatrix of F_π corresponding to e_1 U × e_1 U is the skew-circulant matrix
C_p = [(v^{e_1})^{z_1^{k+r}}], 0 ≤ k, r < p'.
v^{e_1} is a primitive p-th root of unity. In general, if u is any primitive p-th root of unity, then the matrix C_p(u) formed by replacing e^{2πi/p} in C(p) by u is called the rotated Winograd core based on u. The matrix C_p is the rotated Winograd core based on v^{e_1}.
In the same way, the submatrix of F_π corresponding to the Cartesian product e_2 U × e_2 U is the rotated Winograd core based on the primitive q-th root of unity v^{e_2},
C_q = [(v^{e_2})^{z_2^{l+s}}], 0 ≤ l, s < q'.
Consider now the submatrix of F_π corresponding to the Cartesian product U × U. A typical entry in this submatrix is given by raising v to the power
(z_1^k e_1 + z_2^l e_2)(z_1^r e_1 + z_2^s e_2) ≡ z_1^{k+r} e_1 + z_2^{l+s} e_2 mod N,
and is therefore
(v^{e_1})^{z_1^{k+r}} (v^{e_2})^{z_2^{l+s}}.
Since l and s are faster running parameters, this submatrix can be decomposed into the submatrices
(v^{e_1})^{z_1^{k+r}} C_q, 0 ≤ k, r < p',
and the submatrix of F_π corresponding to U × U is the tensor product
C_p ⊗ C_q.
Similar arguments apply to the remaining submatrices. We summarize the results in the following description of F_π:
F_π = [ 1       1_{p'}^t      1_{q'}^t       1_{r'}^t       ]
      [ 1_{p'}  C_p           E(p', q')      C_p ⊗ 1_{q'}^t ]
      [ 1_{q'}  E(q', p')     C_q            1_{p'}^t ⊗ C_q ]
      [ 1_{r'}  C_p ⊗ 1_{q'}  1_{p'} ⊗ C_q   C_p ⊗ C_q      ],
where r' = p'q'. The highly structured form of the matrix is due in large part to the use of idempotents.
If we set
F_π(p) = [ 1       1_{p'}^t ]
         [ 1_{p'}  C_p      ],
then
F_π = [ F_π(p)           F_π(p) ⊗ 1_{q'}^t ]
      [ F_π(p) ⊗ 1_{q'}  F_π(p) ⊗ C_q      ],
leading to the next result.
Theorem 10.6 Suppose that π is the exponential permutation of Z/N based on the factorization N = pq. Then
F_π = [ F_π(p)           F_π(p) ⊗ 1_{q'}^t ]
      [ F_π(p) ⊗ 1_{q'}  F_π(p) ⊗ C_q      ],
where
F_π(p) = [ 1       1_{p'}^t ]
         [ 1_{p'}  C_p      ],
with C_p the rotated Winograd core based on v^{e_1} and C_q the rotated Winograd core based on v^{e_2}, v = e^{2πi/N}.
Tensor product manipulation shows that
F_π = (I_p ⊕ P(pq', p))(F_π(p) ⊗ F_π(q))(I_p ⊕ P(pq', p))^{-1}.
10.6 Fundamental Factorization: pq
The goal is to produce a factorization of the form
F_π = CA,
where A is a matrix of additions and C is a block diagonal matrix having skew-circulant blocks. The main ideas were given in the 15-point example. First, since the sum of the m-th roots of unity equals zero for any integer m > 1, we have
C_p 1_{p'} = -1_{p'},
C_q 1_{q'} = -1_{q'}.
As in chapter 9, we can write
F_π(p) = (1 ⊕ C_p)A(p),
F_π(q) = (1 ⊕ C_q)A(q).
Setting
C = 1 ⊕ C_p ⊕ C_q ⊕ (C_p ⊗ C_q)
and observing
C_q ⊕ (C_p ⊗ C_q) = (1 ⊕ C_p) ⊗ C_q,
we have
F_π(p) ⊗ 1_{q'}^t = (1 ⊕ C_p)(A(p) ⊗ 1_{q'}^t),
F_π(p) ⊗ 1_{q'} = -(C_q ⊕ (C_p ⊗ C_q))(A(p) ⊗ 1_{q'}),
F_π(p) ⊗ C_q = (C_q ⊕ (C_p ⊗ C_q))(A(p) ⊗ I_{q'}).
It follows that
F_π = (1 ⊕ C_p ⊕ C_q ⊕ (C_p ⊗ C_q)) [ A(p)            A(p) ⊗ 1_{q'}^t ]   (10.4)
                                    [ -A(p) ⊗ 1_{q'}  A(p) ⊗ I_{q'}  ].
The ring structure has naturally pointed the way to the highly structured form of the factorization of F_π. As discussed in section 2, tensor products of the corresponding p-point and q-point algorithms directly lead to an arithmetically equivalent algorithm that has a different data flow. This is a constant theme throughout this section and the next.
Denote the matrix on the right-hand side of (10.4) by A. We can implement A as a tensor product,
A = Q^{-1}(A(q) ⊗ A(p))Q,
where Q = I_p ⊕ P(pq', q'),
proving the next result.
Theorem 10.7 (Fundamental Factorization) If {e_1, e_2} is the complete system of idempotents for the factorization N = pq, then
F_π = (1 ⊕ C_p ⊕ C_q ⊕ (C_p ⊗ C_q))Q^{-1}(A(q) ⊗ A(p))Q,
where Q = I_p ⊕ P(pq', q'), and C_p and C_q are the rotated Winograd cores based on v^{e_1} and v^{e_2}, v = e^{2πi/N}.
We see that F_π can be built from the corresponding p-point and q-point algorithms designed in chapter 9 by tensor products. A general observation will be useful. If X is an m × m matrix, Y is an n × n matrix, and if an algorithm computes the actions of X and Y using A(x) and A(y) additions, respectively, then the action of the tensor product X ⊗ Y requires
nA(x) + mA(y)
additions.
Table 10.4 F_π = CA, direct method (N = pq).
Factor   R.A.                             R.M.
A        4(p'q + pq')                     0
C        2(p'(2p - 3)q + q'(2q - 3)p)     4(p'^2 q + q'^2 p)

Table 10.5 F_π = CA, direct method.
Size   R.A.    R.M.
15     316     272
21     608     544
35     1284    1168
55     2892    2704
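The formulas of Table 10.4 can be cross-checked against the tabulated sizes. This check is ours, not from the text:

```python
def direct_counts(p, q):
    """Real additions and multiplications of the direct method F_pi = CA."""
    pp, qq = p - 1, q - 1
    adds = 4 * (pp * q + p * qq) \
         + 2 * (pp * (2 * p - 3) * q + qq * (2 * q - 3) * p)
    mults = 4 * (pp**2 * q + qq**2 * p)
    return adds, mults

table_10_5 = {(3, 5): (316, 272), (3, 7): (608, 544),
              (5, 7): (1284, 1168), (5, 11): (2892, 2704)}
for pq, expected in table_10_5.items():
    assert direct_counts(*pq) == expected
```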
10.7 Variants
The methods of chapter 9 will be applied to design several variants of the factorization, providing options for arithmetic and data flow that can be matched to a variety of computer architectures.
Define the diagonal matrices D_p and D_q by
D_p = F(p')^{-1} C_p F(p')^{-1},
D_q = F(q')^{-1} C_q F(q')^{-1}.
Theorem 10.8
F_π = F(1 ⊕ D_p ⊕ D_q ⊕ (D_p ⊗ D_q))FQ^{-1}(A(q) ⊗ A(p))Q,
where F is the FT factor
F = 1 ⊕ F(p') ⊕ F(q') ⊕ (F(p') ⊗ F(q')).
The cost of additions in the previous theorem can be reduced by interchanging the order of the operations. Write
F_π = FD(FAF^{-1})F
and set B = FAF^{-1}.
Arithmetically, the action of A is replaced by the action of B, which we will see requires fewer additions. To see this, we need to describe B. Recall that e(m) is the vector of size m having the 0-th component 1 and all of the others 0, and
B(p) = [ 1       e^t    ]   e = e(p').
       [ -p'e    I_{p'} ],
Direct computation shows that
(1 ⊕ F(p'))A(p) = B(p)(1 ⊕ F(p')),
which is what we need to show that
B = [ B(p)           B(p) ⊗ e^t    ]   e = e(q').
    [ -q'B(p) ⊗ e    B(p) ⊗ I_{q'} ],
Arguing as in the preceding section, with
Q = I_p ⊕ P(pq', q'),
we have
B = Q^{-1}(B(q) ⊗ B(p))Q.
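The identity (1 ⊕ F(p'))A(p) = B(p)(1 ⊕ F(p')) that drives this rearrangement is easy to verify numerically. This is our own check (not from the text); it relies only on the facts that the first row of F(p') is all ones and that F(p') 1_{p'} = p' e.

```python
import numpy as np

def dft(n):
    j = np.arange(n)
    return np.exp(2j * np.pi * np.outer(j, j) / n)

def A(p):
    """Preaddition matrix A(p) of chapter 9: [[1, 1^t], [-1, I]]."""
    m = np.eye(p)
    m[0, :] = 1
    m[1:, 0] = -1
    return m

def B(p):
    """B(p) = [[1, e^t], [-(p - 1)e, I]], e = e(p - 1)."""
    m = np.eye(p)
    m[0, 1] = 1
    m[1, 0] = -(p - 1)
    return m

def identity_holds(p):
    X = np.eye(p, dtype=complex)
    X[1:, 1:] = dft(p - 1)        # X = 1 (+) F(p')
    return np.allclose(X @ A(p), B(p) @ X)

assert all(identity_holds(p) for p in (3, 5, 7))
```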
Theorem 10.9
F_π = F(1 ⊕ D_p ⊕ D_q ⊕ (D_p ⊗ D_q))Q^{-1}(B(q) ⊗ B(p))QF,
where Q = I_p ⊕ P(pq', q') and F is the FT factor
F = 1 ⊕ F(p') ⊕ F(q') ⊕ (F(p') ⊗ F(q')).
The arithmetic of B is given as follows:
4(p + q) R.A.
{p + q} R.M.
In brackets, we have placed the number of multiplications by integers.
Table 10.6 Real additions.
Size A B
15 88 32
21 128 40
35 232 48
55 376 64
To reduce the cost of additions required to perform the complex multiplications coming from the action of C, we note that C_p has the form
C_p = [ X_p   X_p* ]
      [ X_p*  X_p  ].
Direct computation shows that
C_p = (F(2) ⊗ I_{p'/2}) Y_p (F(2) ⊗ I_{p'/2}),
where
Y_p = (1/2)(X_p + X_p*) ⊕ (1/2)(X_p - X_p*).
Arguing as before, we have the next result.
Theorem 10.10
F_π = H(1 ⊕ Y_p ⊕ Y_q ⊕ (Y_p ⊗ Y_q))HQ^{-1}(A(q) ⊗ A(p))Q,
where Q = I_p ⊕ P(pq', q') and H is the FT factor
H = 1 ⊕ (F(2) ⊗ I_{p'/2}) ⊕ (F(2) ⊗ I_{q'/2}) ⊕ (F(2) ⊗ I_{p'/2} ⊗ F(2) ⊗ I_{q'/2}).
Table 10.7 uses the fact that (1/2)(X_p + X_p*) has only real entries and (1/2)(X_p - X_p*) has only imaginary entries.
Table 10.7 Arithmetic counts of H and Y.
Factor   R.A.                                  R.M.
H        2((p - 1)q + (q - 1)p)                0
Y        (p - 1)(p - 3)q + (q - 1)(q - 3)p     (p - 1)^2 q + (q - 1)^2 p
Setting
Y = 1 ⊕ Y_p ⊕ Y_q ⊕ (Y_p ⊗ Y_q),
the factorization in the preceding theorem can be rewritten as F_π = HYB_1H, where
B_1 = HQ^{-1}(A(q) ⊗ A(p))QH^{-1}.
Fewer additions are required to compute the action of B_1 as compared with computing the action of A. To see this, we must describe B_1.
For even n, define the vector f(n) of size n by
f(n) = [1 0]^t ⊗ 1_{n/2}.
Recall the definition
B_1(p) = [ 1     f^t    ]   f = f(p').
         [ -2f   I_{p'} ],
Then
(F(2) ⊗ I_{p'/2}) 1_{p'} = 2f(p').
This leads to the following description of B_1:
B_1 = [ B_1(p)          B_1(p) ⊗ f^t    ]   f = f(q').
      [ -2B_1(p) ⊗ f    B_1(p) ⊗ I_{q'} ],
The usual tensor product manipulations show that
B_1 = Q^{-1}(B_1(q) ⊗ B_1(p))Q,
where Q = I_p ⊕ P(pq', q').
Theorem 10.11
F_π = H(1 ⊕ Y_p ⊕ Y_q ⊕ (Y_p ⊗ Y_q))Q^{-1}(B_1(q) ⊗ B_1(p))QH,
where Q = I_p ⊕ P(pq', q') and
H = 1 ⊕ (F(2) ⊗ I_{p'/2}) ⊕ (F(2) ⊗ I_{q'/2}) ⊕ (F(2) ⊗ I_{p'/2} ⊗ F(2) ⊗ I_{q'/2}).
Table 10.8 Real additions for computing C.
Size Direct Method C = HY H
15 228 112
21 480 200
35 1052 408
55 2516 864
10.8 Summary
Suppose that π is the exponential permutation of Z/N corresponding to the factorization N = pq. Denote by {e_1, e_2} the complete system of idempotents for the factorization. Suppose that z_1 and z_2 are generators of U(p) and U(q). Denote by F_π the FT matrix corresponding to π. Set Q = I_p ⊕ P(pq', q').
In the following discussion we will suppress dependence on the choice of generators. Denote by C_p and C_q the rotated Winograd cores based on v^{e_1} and v^{e_2}, v = e^{2πi/N}. Define the multiplicative factor
C(p, q) = 1 ⊕ C_p ⊕ C_q ⊕ (C_p ⊗ C_q).
C_p and C_q have the form
C_p = [ X_p   X_p* ]
      [ X_p*  X_p  ],
with a similar formula for C_q. Define the diagonal factor
D(p, q) = 1 ⊕ D_p ⊕ D_q ⊕ (D_p ⊗ D_q),
where D_p = F(p')^{-1} C_p F(p')^{-1}, with a similar formula for D_q. Define the block diagonal factor
Y(p, q) = 1 ⊕ Y_p ⊕ Y_q ⊕ (Y_p ⊗ Y_q),
where
Y_p = (1/2) [ X_p + X_p*   0          ]
            [ 0            X_p - X_p* ],
with a similar formula for Y_q. Define the FT factors
F(p, q) = 1 ⊕ F(p') ⊕ F(q') ⊕ (F(p') ⊗ F(q'))
and
H(p, q) = 1 ⊕ (F(2) ⊗ I_{p'/2}) ⊕ (F(2) ⊗ I_{q'/2}) ⊕ (F(2) ⊗ I_{p'/2} ⊗ F(2) ⊗ I_{q'/2}).
Define the preaddition factors
A(p, q) = Q^{-1}(A(q) ⊗ A(p))Q,
B(p, q) = Q^{-1}(B(q) ⊗ B(p))Q,
B_1(p, q) = Q^{-1}(B_1(q) ⊗ B_1(p))Q,
where Q = I_p ⊕ P(pq', q').
• Fundamental Factorization
F_π = C(p, q)A(p, q).
• Rader FFT I
F_π = F(p, q)D(p, q)F(p, q)A(p, q).
• Rader FFT II
F_π = F(p, q)D(p, q)B(p, q)F(p, q).
• Rader FFT III
F_π = H(p, q)Y(p, q)H(p, q)A(p, q).
• Rader FFT IV
F_π = H(p, q)Y(p, q)B_1(p, q)H(p, q).
References
[1] Good, I. J. "The Interaction Algorithm and Practical Fourier Analysis", J. R. Statist. Soc. B, 20(2), 1958, pp. 361-372.
[2] Thomas, L. H. "Using a Computer to Solve Problems in Physics", Application of Digital Computers, Ginn and Co., 1963.
[3] Burrus, C. S. and Eschenbacher, P. W. "An In-place In-order Prime Factor FFT Algorithm", IEEE Trans. Acoust. Speech and Signal Proc., 29(4), Aug. 1981, pp. 806-817.
[4] Chu, S. and Burrus, C. S. "A Prime Factor FFT Algorithm Using Distributed Arithmetic", IEEE Trans. Acoust. Speech and Signal Proc., 30(2), April 1982, pp. 217-227.
[5] Kolba, D. P. and Parks, T. W. "A Prime Factor FFT Algorithm Using High-speed Convolution", IEEE Trans. Acoust. Speech and Signal Proc., 25(4), Aug. 1977, pp. 281-294.
[6] Blahut, R. E. Fast Algorithms for Digital Signal Processing, Addison-Wesley, 1985, Chapters 4 and 8.
[7] Nussbaumer, H. J. Fast Fourier Transform and Convolution Algorithms, Second Edition, Springer-Verlag, 1982, Chapter 7.
Problems
1. For p = 3 and q = 7, find the system of idempotents {e_1, e_2}.
2. Find the unit group U(21). List all of the elements by the ordering defined by the exponential ordering.
3. Take generator 2 for U(3) and 3 for U(5). Order the indexing set Z/15, and write a complete Fourier transform matrix corresponding to this ordering. Compare with the results in section 2.
4. Verify the tables in section 3, the arithmetic counts of F_π (N = 15).
5. Find the system of idempotents for Z/33, Z/35 and Z/39. Reorder the indexing set by the idempotents.
6. Write C_3 and C_11 in F_π(33).
7. Write C_5 and C_7 in F_π(35).
8. Write C_3 and C_13 in F_π(39).
9. Verify the arithmetic counts given in tables 10.4 and 10.7.
11 MFTA: Composite Size
11.1 Introduction
In this chapter, we extend the methods introduced in the preceding two chapters to include the case of a transform size that is a product of three or more distinct primes. In fact, we will give a procedure for designing algorithms for transform size N = Mr, r a prime not dividing M, whenever an algorithm for transform size M is given. We will also include FT algorithms for transform size 4M, where M is a product of distinct odd primes.
11.2 Main Theorem
Let N = Mr, where r is a prime not dividing M. Throughout this section, r' = r - 1, v = e^{2πi/M} and w = e^{2πi/N}. Fix a permutation π of Z/M and a generator z of U(r). Denote by F_π the FT matrix corresponding to π,
F_π = [v^{π(a)π(b)}], 0 ≤ a, b < M.
F_π computes the M-point FT on data reindexed by π. We will develop a procedure for constructing an N-point FT algorithm with F_π embedded in the computation. Any algorithm computing the action of F_π can be used to compute the N-point FT.
Consider the complete system of idempotents {e'_1, e'_2} for the factorization N = Mr. By the CRT, every a ∈ Z/N can be written uniquely as
a ≡ π(a_1)e'_1 + a_2 e'_2 mod N, 0 ≤ a_1 < M, 0 ≤ a_2 < r.
Partition Z/N into the sets
S = {π(a_1)e'_1 : 0 ≤ a_1 < M},
T = {π(a_1)e'_1 + z^k e'_2 : 0 ≤ a_1 < M, 0 ≤ k < r'}.
We order S by the parameter a_1 and T by the parameters a_1 and k, with k as the faster running parameter. Define the permutation ρ of Z/N by
ρ = (S, T),
and consider the FT matrix F_ρ corresponding to ρ,
F_ρ = [w^{ρ(j)ρ(k)}], 0 ≤ j, k < N.
F_ρ computes the N-point FT on the data reindexed by ρ. We decompose F_ρ into the four submatrices corresponding to the Cartesian products S × S, S × T, T × S and T × T. Consider first S × S. The corresponding submatrix is
F'_π = [(w^{e'_1})^{π(a_1)π(b_1)}], 0 ≤ a_1, b_1 < M.
Since w^{e'_1} is a primitive M-th root of unity, F'_π is formed by replacing e^{2πi/M} in F_π by w^{e'_1}, and we can write F'_π = PF_π for some permutation matrix P.
The submatrix corresponding to T × T,
[w^{π(a_1)π(b_1)e'_1 + z^{l+k}e'_2}], 0 ≤ a_1, b_1 < M, 0 ≤ l, k < r',
can be written as
[(w^{e'_1})^{π(a_1)π(b_1)} (w^{e'_2})^{z^{l+k}}], 0 ≤ a_1, b_1 < M, 0 ≤ l, k < r',
which, by the ordering on T, is
F'_π ⊗ C_r,
where C_r is the rotated Winograd core based on the primitive r-th root of unity w^{e'_2}. Continuing in this way,
F_ρ = [ F'_π           F'_π ⊗ 1_{r'}^t ]
      [ F'_π ⊗ 1_{r'}  F'_π ⊗ C_r      ].
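This block decomposition can be illustrated numerically on the smallest case N = 12 = Mr with M = 4, r = 3, π = (0, 2, 1, 3) and z = 2 a generator of U(3). The check below is ours, not from the text; the idempotents are e'_1 = 9, e'_2 = 4.

```python
import numpy as np

w = np.exp(2j * np.pi / 12)
pi = [0, 2, 1, 3]
e1, e2 = 9, 4                          # idempotents for 12 = 4 x 3
S = [a * e1 % 12 for a in pi]
T = [(a * e1 + pow(2, k, 3) * e2) % 12 for a in pi for k in range(2)]
rho = S + T
F = w ** np.outer(rho, rho)            # F_rho

Fp = (w**e1) ** np.outer(pi, pi)       # F_pi': e^{2 pi i/4} replaced by w^9 = -i
Cr = np.array([[(w**e2) ** pow(2, l + k, 3) for k in range(2)]
               for l in range(2)])     # rotated Winograd core based on w^4
top = np.hstack([Fp, np.kron(Fp, np.ones((1, 2)))])
bot = np.hstack([np.kron(Fp, np.ones((2, 1))), np.kron(Fp, Cr)])
assert np.allclose(F, np.vstack([top, bot]))
```

Note that the permutation ρ produced here is exactly the ordering (0, 6, 9, 3; 4, 8, 10, 2; 1, 5, 7, 11) that appears in the 12-point example of section 5.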
Since
C_r 1_{r'} = -1_{r'},
we have
F_ρ = (F'_π ⊕ (F'_π ⊗ C_r)) [ I_M            I_M ⊗ 1_{r'}^t ]
                            [ -I_M ⊗ 1_{r'}  I_{Mr'}        ].
Moreover,
P(Mr', M)(1_{r'} ⊗ I_M) = I_M ⊗ 1_{r'}
implies that
Q^{-1}(A(r) ⊗ I_M)Q = [ I_M            I_M ⊗ 1_{r'}^t ]
                      [ -I_M ⊗ 1_{r'}  I_{Mr'}        ],
where Q = I_M ⊕ P(Mr', r'), proving the next result.
Theorem 11.1 If π is a permutation of Z/M and {e'_1, e'_2} is the complete system of idempotents for the factorization N = Mr, r a prime not dividing M, then there exists a permutation ρ of Z/N such that
F_ρ = (F'_π ⊕ (F'_π ⊗ C_r))Q^{-1}(A(r) ⊗ I_M)Q
with Q = I_M ⊕ P(Mr', r'). F'_π is the matrix formed by replacing e^{2πi/M} in F_π by w^{e'_1} and C_r is the rotated Winograd core based on w^{e'_2}, w = e^{2πi/N}.
The permutation ρ has been explicitly described in the preceding discussion.
Every factorization of F'_π into a product of two M × M matrices F'_π = CA produces a factorization of F_ρ by the tensor product manipulations
F_ρ = (C ⊕ (C ⊗ C_r))(A ⊕ (A ⊗ I_{r'}))Q^{-1}(A(r) ⊗ I_M)Q
    = (C ⊕ (C ⊗ C_r))Q^{-1}(I_r ⊗ A)(A(r) ⊗ I_M)Q
    = (C ⊕ (C ⊗ C_r))Q^{-1}(A(r) ⊗ A)Q,
which is summarized in the next corollary.
Corollary 11.1 If F'_π = CA, then
F_ρ = (C ⊕ (C ⊗ C_r))Q^{-1}(A(r) ⊗ A)Q.
In many applications, A is a matrix whose coefficients are taken from {0, 1, -1} and C is the tensor product of rotated Winograd cores.
11.3 Product of Three Distinct Primes
We will now apply the results of the preceding section to design multiplicative FT algorithms for transform size N = pqr based on the multiplicative FT algorithm for M = pq developed in the preceding chapter, where p, q and r are distinct primes. Throughout we will use the notational conventions established in section 2. Set p' = p - 1, q' = q - 1, r' = r - 1, w = e^{2πi/N} and v = e^{2πi/M}. Consider the complete system of idempotents {f_1, f_2, f_3} for the factorization N = pqr:
f_1 ≡ 1 mod p, f_1 ≡ 0 mod q, f_1 ≡ 0 mod r,
f_2 ≡ 0 mod p, f_2 ≡ 1 mod q, f_2 ≡ 0 mod r,
f_3 ≡ 0 mod p, f_3 ≡ 0 mod q, f_3 ≡ 1 mod r.
The set {e_1, e_2} with e_1 ≡ f_1 mod M and e_2 ≡ f_2 mod M is the complete system of idempotents for the factorization M = pq, and the set {e'_1, e'_2} with e'_1 = f_1 + f_2 and e'_2 = f_3 is the complete system of idempotents for the factorization N = Mr. Throughout the discussion we suppress dependence on the choice of generators for U(p), U(q) and U(r).
Choose the exponential permutation π of Z/M corresponding to the factorization M = pq. By the fundamental factorization,
F_π = (1 ⊕ C_p ⊕ C_q ⊕ (C_p ⊗ C_q))Q^{-1}(A(q) ⊗ A(p))Q,
where C_p and C_q are rotated Winograd cores based on v^{e_1} and v^{e_2} and Q = I_p ⊕ P(pq', q'). Then
F'_π = (1 ⊕ C'_p ⊕ C'_q ⊕ (C'_p ⊗ C'_q))Q^{-1}(A(q) ⊗ A(p))Q,
where C'_p is the rotated Winograd core based on w^{e'_1 e_1} = w^{f_1} and C'_q is the rotated Winograd core based on w^{e'_1 e_2} = w^{f_2}. Since C_r is the rotated Winograd core based on w^{e'_2} = w^{f_3}, we have the next result by the corollary of section 2.
Theorem 11.2 Suppose that {f_1, f_2, f_3} is the complete system of idempotents for the factorization N = pqr for distinct primes p, q and r. Then there exists a permutation ρ of Z/N such that
F_ρ = C_ρ Q^{-1}(A(r) ⊗ A(q) ⊗ A(p))Q,
where
C_ρ = 1 ⊕ C_p ⊕ C_q ⊕ (C_p ⊗ C_q) ⊕ C_r ⊕ (C_p ⊗ C_r) ⊕ (C_q ⊗ C_r) ⊕ (C_p ⊗ C_q ⊗ C_r)
with rotated Winograd cores C_p, C_q and C_r based on w^{f_1}, w^{f_2} and w^{f_3}, w = e^{2πi/N}, and Q = (I_r ⊗ (I_p ⊕ P(pq', q')))(I_M ⊕ P(Mr', r')).
The permutation ρ is constructed from π and {e'_1, e'_2} as described in the preceding section.
Variants can be produced by the same tensor product manipulations described in the preceding chapters. We will state results without proofs.
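The idempotent bookkeeping above can be sketched in code. This is our own brute-force illustration (not from the text), shown for p = 3, q = 5, r = 7, i.e. N = 105:

```python
def idempotents(p, q, r):
    """Brute-force solution of the defining congruences for f1, f2, f3."""
    N = p * q * r
    f1 = next(a for a in range(N) if (a % p, a % q, a % r) == (1, 0, 0))
    f2 = next(a for a in range(N) if (a % p, a % q, a % r) == (0, 1, 0))
    f3 = next(a for a in range(N) if (a % p, a % q, a % r) == (0, 0, 1))
    return f1, f2, f3

f1, f2, f3 = idempotents(3, 5, 7)
N = 105
assert (f1, f2, f3) == (70, 21, 15)
assert all(f * f % N == f for f in (f1, f2, f3))   # idempotency
assert (f1 + f2 + f3) % N == 1                     # completeness
assert (f1 + f2) ** 2 % N == (f1 + f2)             # e1' = f1 + f2 for N = Mr
```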
11.4 Variants
Denote by {f_1, f_2, f_3} the complete system of idempotents for the factorization N = pqr for distinct primes p, q and r. Denote by C_p, C_q and C_r the rotated Winograd cores based on w^{f_1}, w^{f_2} and w^{f_3}, w = e^{2πi/N}.
Define the multiplicative factor
C(p, q, r) = C(p, q) ⊕ (C(p, q) ⊗ C_r),
where C(p, q) is based on C_p and C_q as defined in the summary of chapter 10.
Define the diagonal factor
D(p, q, r) = D(p, q) ⊕ (D(p, q) ⊗ D(r))
and the block diagonal factor
Y(p, q, r) = Y(p, q) ⊕ (Y(p, q) ⊗ Y(r)),
where D(p, q) and Y(p, q) are defined in the summary of chapter 10. Define the FT factors
F(p, q, r) = F(p, q) ⊕ (F(p, q) ⊗ F(r')),
H(p, q, r) = H(p, q) ⊕ (H(p, q) ⊗ (F(2) ⊗ I_{r'/2})),
where F(p, q) and H(p, q) are defined in the summary of chapter 10. Define the preaddition factors
A(p, q, r) = Q^{-1}(A(r) ⊗ A(q) ⊗ A(p))Q,
B(p, q, r) = Q^{-1}(B(r) ⊗ B(q) ⊗ B(p))Q,
B_1(p, q, r) = Q^{-1}(B_1(r) ⊗ B_1(q) ⊗ B_1(p))Q,
where A(p), B(p) and B_1(p) are defined in chapter 9 and Q = (I_r ⊗ (I_p ⊕ P(pq', q')))(I_M ⊕ P(Mr', r')).
Fundamental Factorization
F_ρ = C(p, q, r)A(p, q, r).
Rader FFT I
F_ρ = F(p, q, r)D(p, q, r)F(p, q, r)A(p, q, r).
Rader FFT II
F_ρ = F(p, q, r)D(p, q, r)B(p, q, r)F(p, q, r).
Rader FFT III
F_ρ = H(p, q, r)Y(p, q, r)H(p, q, r)A(p, q, r).
Rader FFT IV
F_ρ = H(p, q, r)Y(p, q, r)B_1(p, q, r)H(p, q, r).
These results easily generalize to an integer N equal to the product of an arbitrary number of distinct prime factors. The only change is that the multiplicative matrices are rotated Winograd cores based on raising w = e^{2πi/N} to powers given by the complete system of idempotents of the factorization of N into this product.
11.5 Transform Size: 12
Consider the permutation π of Z/4,
π = (0, 2, 1, 3).
F_π admits the factorization F_π = CA, where
C = [ 1 0 0 0  ]     A = [ 1  1  1  1 ]
    [ 0 1 0 0  ]         [ 1  1 -1 -1 ]
    [ 0 0 1 i  ],        [ 1 -1  0  0 ]
    [ 0 0 1 -i ]         [ 0  0  1 -1 ].
{9, 4} is the complete system of idempotents for the factorization 12 = 4 × 3. Set w = e^{2πi/12}. Since w^9 = -i and w^4 = e^{2πi/3}, we have, in the fundamental factorization for 12 = 4 × 3, F'_π = C*A and C_3 = C(3). By the corollary of section 2,
F_ρ = (C* ⊕ (C* ⊗ C(3)))Q^{-1}(A(3) ⊗ A)Q,
where Q = I_4 ⊕ P(8, 2). The permutation ρ of Z/12 is
ρ = (0, 6, 9, 3; 4, 8, 10, 2; 1, 5, 7, 11).
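Both the 4-point factorization F_π = CA and the conjugation step F'_π = C*A can be checked directly. Our own verification, not from the text:

```python
import numpy as np

pi = [0, 2, 1, 3]
F = 1j ** np.outer(pi, pi)          # F_pi = [i^{pi(a) pi(b)}]
C = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 1j],
              [0, 0, 1, -1j]])
A = np.array([[1, 1, 1, 1],
              [1, 1, -1, -1],
              [1, -1, 0, 0],
              [0, 0, 1, -1]])
assert np.allclose(F, C @ A)

# Replacing i by w^9 = -i conjugates F_pi, so F_pi' = C* A (A is real).
Fprime = (-1j) ** np.outer(pi, pi)
assert np.allclose(Fprime, C.conj() @ A)
```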
11.6 Transform Size: 4p, p odd prime
Choose π = (0, 2, 1, 3) and F_π = CA as in the preceding section. Denote by {e'_1, e'_2} the complete system of idempotents for the factorization N = 4p.
Set w = e^{2πi/N}. In the fundamental factorization for N = 4p, if w^{e'_1} = i, then F'_π = F_π, while if w^{e'_1} = -i, then F'_π = C*A. We have, by the corollary of section 2,
F_ρ = (C ⊕ (C ⊗ C_p))Q^{-1}(A(p) ⊗ A)Q,   w^{e'_1} = i,
F_ρ = (C* ⊕ (C* ⊗ C_p))Q^{-1}(A(p) ⊗ A)Q,   w^{e'_1} = -i,
where C_p is the rotated Winograd core based on w^{e'_2} and Q = I_4 ⊕ P(4p', p'). The permutation ρ = (S, T) is given by
S = {π(a_1)e'_1 : 0 ≤ a_1 < 4} = (0, 2e'_1, e'_1, 3e'_1),
T = (T_0, T_2, T_1, T_3)
with
T_j = {je'_1 + z^k e'_2 : 0 ≤ k < p'},
where z is a generator of U(p).
11.7 Transform Size: 60
Choose the permutation π of Z/12 given in section 5,
π = (0, 6, 9, 3, 4, 8, 10, 2, 1, 5, 7, 11).
As shown in section 5,
F_π = (C* ⊕ (C* ⊗ C(3)))Q^{-1}(A(3) ⊗ A)Q
with C, A and Q as defined in section 5.
{25, 36} is the complete system of idempotents for the factorization 60 = 12 × 5. Set w = e^{2πi/60}. Then
w^25 = e^{2πi·5/12}, w^36 = e^{2πi·3/5}.
F'_π is formed by replacing e^{2πi/12} in F_π by e^{2πi·5/12}. Since
(e^{2πi·5/12})^3 = i, (e^{2πi·5/12})^4 = e^{2πi·2/3} = (e^{2πi/3})*,
C* remains unchanged while C(3) is changed to C*(3). We have
F'_π = (C* ⊕ (C* ⊗ C*(3)))Q^{-1}(A(3) ⊗ A)Q.
Since C_5 is the rotated Winograd core based on e^{2πi·3/5},
C_5 = SC(5),
where
S = [ 0 0 0 1 ]
    [ 1 0 0 0 ]
    [ 0 1 0 0 ]
    [ 0 0 1 0 ].
By the corollary of section 2,
F_ρ = (C' ⊕ (C' ⊗ SC(5)))Q_1^{-1}(A(5) ⊗ A(3) ⊗ A)Q_1,
where C' = C* ⊕ (C* ⊗ C*(3)) and Q_1 = (I_5 ⊗ Q)(I_12 ⊕ P(48, 4)).
{45, 40, 36} is the complete system of idempotents for the factorization 60 = 4 × 3 × 5. Since
w^45 = -i, w^40 = e^{2πi·2/3}, w^36 = e^{2πi·3/5},
we have
C_4 = C*(4), C_3 = C*(3), C_5 = SC(5),
which, by the theorem of section 3, implies that
F_ρ = (C'' ⊕ (C'' ⊗ SC(5)))Q^{-1}(A(5) ⊗ A(3) ⊗ A(4))Q,
where C'' = 1 ⊕ C*(4) ⊕ C*(3) ⊕ (C*(4) ⊗ C*(3)) and Q = (I_5 ⊗ (I_4 ⊕ P(8, 2)))(I_12 ⊕ P(48, 4)).
Tables of Arithmetic Counts
Table 11.1 F_ρ = C_ρ A_ρ (N = 12).
Factor   R.A.   R.M.
C_ρ      60     64
A_ρ      68     0
F_ρ      128    64

Table 11.2 F_ρ = C_ρ A_ρ (N = 4p).
Factor   R.A.                    R.M.
C_ρ      4(p - 1)(4p - 5) + 4    16(p - 1)^2
A_ρ      4(7p - 4)               0
F_ρ      4(4p^2 - 2p + 2)        16(p - 1)^2

Table 11.3 F_ρ = C_ρ A_ρ.
Size   R.A.    R.M.
12     128     64
20     368     256
28     736     576
44     1856    1600

Table 11.4 C = HYH (N = 4p).
Factor   R.A.                R.M.
H        8(p - 1)            0
Y        4(p^2 - 4p + 6)     4(p - 1)^2
C        4(p^2 + 2)          4(p - 1)^2

Table 11.5 F_ρ = HYB_1H (N = 4p).
Factor   R.A.         R.M.
B_1      20p - 8      {4(p - 1)}
F_ρ      4p(p + 5)    4(p - 1)^2 + {4(p - 1)}

Table 11.6 F_ρ = HYB_1H.
Size   R.A.   R.M.
12     96     16 + {8}
20     200    64 + {16}
28     336    144 + {24}
44     704    400 + {40}

R.A. — the number of real additions. R.M. — the number of real multiplications. Multiplication by integers has been placed in brackets.
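The N = 4p counts can be cross-checked against the formulas: the factor counts of Table 11.2 must sum to the F_ρ row, and evaluating at p = 3, 5, 7, 11 should reproduce Table 11.3. This check is ours, not from the text:

```python
def counts_4p(p):
    """Total (R.A., R.M.) of F_rho = C_rho A_rho for N = 4p (Table 11.2)."""
    c_adds = 4 * (p - 1) * (4 * p - 5) + 4
    a_adds = 4 * (7 * p - 4)
    mults = 16 * (p - 1) ** 2
    return c_adds + a_adds, mults

rows = [counts_4p(p) for p in (3, 5, 7, 11)]
assert rows == [(128, 64), (368, 256), (736, 576), (1856, 1600)]
# The closed form 4(4p^2 - 2p + 2) agrees with the factor sum.
assert all(counts_4p(p)[0] == 4 * (4 * p * p - 2 * p + 2) for p in (3, 5, 7, 11))
```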
References
[1] Blahut, R. E. Fast Algorithms for Digital Signal Processing, Addison-Wesley, 1985, Chapters 6 and 8.
[2] Lu, C. Fast Fourier Transform Algorithms for Special N's and the Implementations on VAX, Ph.D. Dissertation, The City University of New York, Jan. 1988.
[3] Nussbaumer, H. J. Fast Fourier Transform and Convolution Algorithms, Second Edition, Springer-Verlag, 1982, Chapter 7.
Problems
1. For p = 3, q = 5 and r = 7, find the complete system of idempotents for N = pqr.
2. Find the ordering of Z/105 by the idempotents of problem 1.
3. Define the matrices C_3, C_5 and C_7 in F_ρ(105).
4. Derive the 4p algorithm for p = 5 in detail as in the example of N = 12 given in section 5.
5. Derive four variants of the N = 20 algorithm.
6. Find a prime p with the property that F'_π has to be written as F'_π = C*A.
7. Prove the formulas given in tables 11.2, 11.4 and 11.5.
12 MFTA: p2
12.1 Introduction
Multiplicative prime power FT algorithms will be derived. Although multiplicative indexing will play a major role as in the preceding chapters, the multiplicative structure of the underlying indexing ring is significantly more complex, and this increased complexity will be reflected in the resulting algorithms.
Two different algorithms are given for the case p^2 and examples are presented in detail. In section 2, we start with the example of 9. The general case p^2 will be given in detail in section 3. An extension to the case p^k is given in section 4 by the example of 27.
12.2 An Example: 9
Z/9 is not a field, but the unit group U(9) is a cyclic group of order 6 having generator 2:

U(9) = {2^k : 0 ≤ k < 6} = {1, 2, 4, 8, 7, 5}.
Order U(9) by powers of 2 as shown.
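A small sketch (not from the book) makes this ordering concrete: enumerate U(9) by successive powers of the generator 2 and form the blocks D_0, D_1, D_2 of the exponential permutation used below.

```python
# Sketch: order U(9) by powers of the generator 2 and build the
# partition (D2, D1, D0) of Z/9 used in this section.
U9, k = [], 1
while k not in U9:             # successive powers of 2 mod 9
    U9.append(k)
    k = (2 * k) % 9
D0 = U9                        # the unit group in exponential order
D1 = []
for u in U9:                   # 3·U(9) = {3, 6}
    v = (3 * u) % 9
    if v not in D1:
        D1.append(v)
D2 = [0]
print(D2 + D1 + D0)            # [0, 3, 6, 1, 2, 4, 8, 7, 5]
```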
Set w = e^{2πi/9} and

C(9) = [ w    w²   w⁴   w⁸   w⁷   w⁵
         w²   w⁴   w⁸   w⁷   w⁵   w
         w⁴   w⁸   w⁷   w⁵   w    w²
         w⁸   w⁷   w⁵   w    w²   w⁴
         w⁷   w⁵   w    w²   w⁴   w⁸
         w⁵   w    w²   w⁴   w⁸   w⁷ ].
C(9) is a 6 × 6 skew-circulant matrix having the form

C(9) = [ X(9)   X*(9)
         X*(9)  X(9) ],

where

X(9) = [ w    w²   w⁴
         w²   w⁴   w⁸
         w⁴   w⁸   w⁷ ].
We call C(9) the Winograd core based on the generator 2 of U(9). Consider the sets D_0, D_1 and D_2 defined by

D_0 = U(9),
D_1 = 3U(9) = {3, 6},
D_2 = {0},

ordered as shown. The collection {D_0, D_1, D_2} is a partition of Z/9. The permutation

π = (D_2, D_1, D_0) = (0; 3, 6; 1, 2, 4, 8, 7, 5)

is called the exponential permutation of Z/9 based on the generator 2 of U(9).
Denote by 1_m the vector of size m of all 1's, and by E(m, n) = 1_m ⊗ 1_n^t the m × n matrix of all 1's. Set E(m) = E(m, m). The FT matrix F_π is given by

F_π = [ 1     1_2^t        1_6^t
        1_2   E(2)         1_3^t ⊗ C(3)
        1_6   1_3 ⊗ C(3)   C(9) ],
where C(3) is the Winograd core based on the generator 2 of U(3). Computing the action of F_π requires computing four C(3) actions and one C(9) action. The goal is to reduce the number of multiplications by distributing these actions across a preaddition stage. However, direct computation shows that C(9)1_6 = 0_6, a result we will prove in general in the next section. Since C(9) cannot be factored across 1_6, we must handle C(9) separately.
Denote by 0_m the vector of size m of all 0's and by 0(m, n) = 0_m ⊗ 0_n^t the m × n matrix of all 0's. Set 0(m) = 0(m, m). Define

F_π^* = [ 1     1_2^t      1_6^t
          1_2   E(2)       1_3^t ⊗ C(3)
          0_6   0(6, 2)    C(9) ].
Algorithm I for computing F_π x:

• Compute u = F_π^* x.

• Compute

v = C(3) [ x_1
           x_2 ].

• Compute

F_π x = u + [ 0_3
              x_0 1_6 + 1_3 ⊗ v ].
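The block form of F_π exploited by Algorithm I can be checked numerically. A sketch (assuming the exponential ordering above): permuting the 9-point Fourier matrix by (0; 3, 6; 1, 2, 4, 8, 7, 5) places the Winograd core C(9) in the lower-right 6 × 6 block.

```python
import cmath

# Permute F(9) by the exponential permutation and compare its
# lower-right block with the Winograd core C(9).
w = cmath.exp(2j * cmath.pi / 9)
perm = [0, 3, 6, 1, 2, 4, 8, 7, 5]
F_pi = [[w ** (perm[j] * perm[k] % 9) for k in range(9)] for j in range(9)]
units = [1, 2, 4, 8, 7, 5]
C9 = [[w ** units[(j + k) % 6] for k in range(6)] for j in range(6)]
ok = all(abs(F_pi[3 + j][3 + k] - C9[j][k]) < 1e-12
         for j in range(6) for k in range(6))
print(ok)  # True
```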
Algorithms for computing F_π^* will be discussed below. The cost of the final two stages is 14 additions and 2 complex multiplications.
F_π^* admits a factorization into the product of a preaddition matrix and a multiplication matrix.
Theorem 12.1

F_π^* = (1 ⊕ C(3) ⊕ C(9)) A(9),

where

A(9) = [ 1     1_2^t     1_6^t
         −1_2  −E(2)     1_3^t ⊗ I_2
         0_6   0(6, 2)   I_6 ].
Variants of the factorization can be derived by the usual tensor product arguments. Define the diagonal matrices

D(3) = F(2)^{-1} C(3) F(2)^{-1},
D(9) = F(6)^{-1} C(9) F(6)^{-1}.

Since

1_6^t F(6)^{-1} = [ 1 0 0 0 0 0 ],

F(2)(1_3^t ⊗ I_2) F(6)^{-1} = [ 1 0 0 0 0 0
                                0 0 0 1 0 0 ],

we have the next result.
Theorem 12.2 Suppose that

F = 1 ⊕ F(2) ⊕ F(6).

Then

F_π^* = F(1 ⊕ D(3) ⊕ D(9)) B(9) F,

where

B(9) = [ 1     1  0      1 0 0 0 0 0
         −2    −2 0      1 0 0 0 0 0
         0     0  0      0 0 0 1 0 0
         0_6   0(6, 2)   I_6 ].
The block diagonalization method extends. Since

(1_3^t ⊗ F(2))(F(2)^{-1} ⊗ I_3) = [ 1 1 1 0  0 0
                                    0 0 0 1 −1 1 ],

we have the next result.
Theorem 12.3 Suppose that

H = 1 ⊕ F(2) ⊕ (F(2) ⊗ I_3).

Then

F_π^* = H^{-1}(1 ⊕ (−1) ⊕ α ⊕ (X(9) + X*(9)) ⊕ (X(9) − X*(9))) B₁(9) H,

where α = v − v², v = e^{2πi/3} and

B₁(9) = [ 1     1  0      1 1 1 0  0 0
          −2    −2 0      1 1 1 0  0 0
          0     0  0      0 0 0 1 −1 1
          0_6   0(6, 2)   I_6 ].
12.3 The General Case: p²
Fix an odd prime p. Set w = e^{2πi/p²} and p' = p − 1. Z/p² is no longer a field, but the unit group U(p²) is a cyclic group of order s = pp'. Choose a generator z of U(p²) throughout this section. Then

U(p²) = {z^k : 0 ≤ k < s}.
Order U(p²) by the parameter k. The s × s skew-circulant matrix

C(p²) = [ w^{z^{j+k}} ], 0 ≤ j, k < s,

is called the Winograd core based on the generator z of U(p²). Since z^{s/2} ≡ −1 mod p², C(p²) has the form

C(p²) = [ X(p²)   X*(p²)
          X*(p²)  X(p²) ].
Define the subsets D_0, D_1 and D_2 of Z/p² by

D_0 = U(p²),
D_1 = pU(p²) = {pz^k : 0 ≤ k < p'},
D_2 = {0}.
The collection {D_0, D_1, D_2} is a partition of Z/p². The permutation π of Z/p² given by

π = (D_2, D_1, D_0)

is called the exponential permutation of Z/p² based on the generator z of U(p²). The FT matrix F_π is given by

F_π = [ 1       1_{p'}^t      1_s^t
        1_{p'}  E(p')         1_p^t ⊗ C(p)
        1_s     1_p ⊗ C(p)    C(p²) ],   (12.1)
where C(p) is the Winograd core based on the generator z mod p of U(p). A direct computation of F_π requires p + 1 C(p) actions and one C(p²) action.
The goal is to reduce the number of multiplications by distributing Winograd cores across a preaddition stage. However, the following result shows that this cannot be completely accomplished.
Theorem 12.4 If w = e^{2πi/p²}, then

Σ_{k∈U(p²)} w^k = Σ_{k∈U(p²)} w^{k²} = 0.

Proof Since k ∈ U(p²) if and only if k + p ∈ U(p²), replacing k by k + p gives

Σ_{k∈U(p²)} w^k = w^p Σ_{k∈U(p²)} w^k,

and since w^p ≠ 1, the first sum vanishes. For the second sum, write k = a + bp with a ∈ U(p) and 0 ≤ b < p, so that k² ≡ a² + 2abp mod p². Then

Σ_{k∈U(p²)} w^{k²} = Σ_{a∈U(p)} w^{a²} Σ_{b=0}^{p−1} e^{2πi(2ab)/p} = 0,

since 2a ≢ 0 mod p makes each inner sum vanish.
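A numerical sketch (not in the text) of theorem 12.4 for p = 5:

```python
import cmath, math

# With w = exp(2πi/p²), both Σ w^k and Σ w^(k²) over k ∈ U(p²) vanish.
p = 5
w = cmath.exp(2j * cmath.pi / p ** 2)
units = [k for k in range(p ** 2) if math.gcd(k, p) == 1]
s1 = sum(w ** k for k in units)
s2 = sum(w ** (k * k) for k in units)
print(abs(s1) < 1e-9, abs(s2) < 1e-9)  # True True
```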
Corollary 12.1

C(p²)1_s = 0_s,   C(p²)(1_p ⊗ I_{p'}) = 0(s, p').
From the corollary, C(p²) cannot be factored across 1_s or 1_p ⊗ C(p) and must be handled separately. We will describe two methods.
Define

F_π^* = [ 1       1_{p'}^t    1_s^t
          1_{p'}  E(p')       1_p^t ⊗ C(p)
          0_s     0(s, p')    C(p²) ].
Algorithm II for computing F_π x:

• Compute u = F_π^* x.

• Compute

v = C(p) [ x_1
           ⋮
           x_{p'} ].

• Compute

F_π x = u + [ 0_p
              x_0 1_s + 1_p ⊗ v ].
The algorithm computes the action of F_π by computing the actions of C(p) and F_π^* plus 2s additions. Algorithms for computing F_π^* will be given below. The computation of v can be carried out by the methods of the previous chapter using the fact that C(p) is a skew-circulant matrix and a 2 × 2 block skew-circulant matrix.
Algorithms for computing F_π^* can be derived using the methods of the previous section. Since C(p)1_{p'} = −1_{p'}, we have the following result.
Theorem 12.5

F_π^* = (1 ⊕ C(p) ⊕ C(p²)) A(p²),

where

A(p²) = [ 1        1_{p'}^t   1_s^t
          −1_{p'}  −E(p')     1_p^t ⊗ I_{p'}
          0_s      0(s, p')   I_s ].
Set

F(p', s) = 1 ⊕ F(p') ⊕ F(s).

Since C(p) and C(p²) are skew-circulant matrices, we have

C(p) = F(p') D(p) F(p'),
C(p²) = F(s) D(p²) F(s),

where D(p) and D(p²) are diagonal matrices. For the discussion, set F = F(p', s). Then

F_π^* = F(1 ⊕ D(p) ⊕ D(p²)) B(p²) F,

where B(p²) = F A(p²) F^{-1}.
Denote by e(n) the vector of size n having the 0-th component equal to 1 and all other components equal to 0. Set e(m, n) = e(m)(e(n))^t and e(m) = e(m, m).
Theorem 12.6

F(p')(1_p^t ⊗ I_{p'}) F(s)^{-1} = I_{p'} ⊗ e^t,

where e = e(p).
Proof By the Cooley-Tukey factorization for s = pp',

F(s)^{-1} = (F(p)^{-1} ⊗ I_{p'}) T (I_p ⊗ F(p')^{-1}) P(s, p),

where T is a diagonal matrix whose first p' diagonal elements are all equal to 1. Then

F(p')(1_p^t ⊗ I_{p'}) F(s)^{-1} = (1_p^t ⊗ F(p')) F(s)^{-1}
= (e^t ⊗ F(p')) T (I_p ⊗ F(p')^{-1}) P(s, p)
= (e^t ⊗ I_{p'}) P(s, p)
= I_{p'} ⊗ e^t.
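A numerical sketch of theorem 12.6 for p = 3 (p' = 2, s = 6), not part of the text:

```python
import cmath

# Check F(2)(1_3^t ⊗ I_2)F(6)^{-1} = I_2 ⊗ e(3)^t numerically.
def F(n):
    w = cmath.exp(2j * cmath.pi / n)
    return [[w ** (j * k) for k in range(n)] for j in range(n)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

Finv6 = [[(1 / 6) * cmath.exp(-2j * cmath.pi * j * k / 6) for k in range(6)]
         for j in range(6)]
ones_kron = [[1 if c % 2 == r else 0 for c in range(6)]
             for r in range(2)]                      # 1_3^t ⊗ I_2
lhs = matmul(matmul(F(2), ones_kron), Finv6)
expected = [[1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]]  # I_2 ⊗ e(3)^t
ok = all(abs(lhs[r][c] - expected[r][c]) < 1e-12
         for r in range(2) for c in range(6))
print(ok)  # True
```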
From the preceding theorem, we have the next result.

Theorem 12.7

F_π^* = F(p', s)(1 ⊕ D(p) ⊕ D(p²)) B(p²) F(p', s),

where

B(p²) = [ 1        e_1^t       e_2^t
          −p' e_1  −p' e(p')   I_{p'} ⊗ e^t
          0_s      0(s, p')    I_s ],

where e_1 = e(p'), e_2 = e(s) and e = e(p).
The action of B(p²) requires three additions and one rational multiplication by −p'. Up to the computation of the two F(p', s) factors, the action of F_π^* requires three additions, one rational multiplication and p² − 1 complex multiplications.
A related approach begins by defining

F_π^{++} = [ 1       1_{p'}^t     1_s^t
             1_{p'}  E(p')        1_p^t ⊗ C(p)
             1_s     1_p ⊗ C(p)   0(s) ].
Algorithm III for computing F_π x:

• Compute u = F_π^{++} x.

• Compute

v = C(p²) [ x_p
            ⋮
            x_{p²−1} ].

• Compute

F_π x = u + [ 0_p
              v ].
The action of C(p²) has been separated out. Since C(p²) is skew-circulant and 2 × 2 block skew-circulant, its action can be computed by the diagonalization and block diagonalization methods of the preceding chapter.
The computation of the action of F_π^{++} is handled in much the same way as that of F_π^*.
Theorem 12.8

F_π^{++} = (1 ⊕ C(p) ⊕ (I_p ⊗ C(p))) A'(p²),

where

A'(p²) = [ 1        1_{p'}^t       1_s^t
           −1_{p'}  −E(p')         1_p^t ⊗ I_{p'}
           −1_s     1_p ⊗ I_{p'}   0(s) ].
Theorem 12.9 Set

F'(p', s) = 1 ⊕ F(p') ⊕ (I_p ⊗ F(p')).

Then

F_π^{++} = F'(p', s)(1 ⊕ D(p) ⊕ (I_p ⊗ D(p))) B'(p²) F'(p', s),
where

B'(p²) = [ 1                 e_1^t          1_p^t ⊗ e_1^t
           −p' e_1           −p' e(p')      1_p^t ⊗ I_{p'}
           −p' (1_p ⊗ e_1)   1_p ⊗ I_{p'}   0(s) ],

where e_1 = e(p').
The action of B'(p²) requires 3p additions. The block diagonalization method can also be applied. For n even, set

f(n) = [ 1_{n/2}
         0_{n/2} ]

and f̂(n) = f(n)(f(n))^t. Denote by ĩ_m the vector of size m whose r-th component is (−1)^r and set

f̃(n) = [ 0_{n/2}
          ĩ_{n/2} ].
Since

(F(2) ⊗ I_{p'/2})(1_p^t ⊗ I_{p'})(F(2) ⊗ I_{s/2})^{-1}
= ((1_p^t ⊗ F(2))(F(2)^{-1} ⊗ I_p)) ⊗ I_{p'/2}
= [ f^t
    f̃^t ] ⊗ I_{p'/2}

with f = f(2p) and f̃ = f̃(2p), we have the next result.
Theorem 12.10 Set

H(p', s) = 1 ⊕ (F(2) ⊗ I_{p'/2}) ⊕ (F(2) ⊗ I_{s/2}).

Then

F_π^* = H(p', s)^{-1} X(p², p) B₁(p²) H(p', s),

where

X(p², p) = 1 ⊕ (X(p) + X*(p)) ⊕ (X(p) − X*(p))
             ⊕ (X(p²) + X*(p²)) ⊕ (X(p²) − X*(p²))

and

B₁(p²) = [ 1       f_1^t      f_2^t
           −2f_1   −2f̂(p')    [ f^t
                                 f̃^t ] ⊗ I_{p'/2}
           0_s     0(s, p')   I_s ]

with f_1 = f(p'), f_2 = f(s), f = f(2p) and f̃ = f̃(2p).
12.4 An Example: 3³
Take 2 as a generator of the unit group U(27) of Z/27,

U(27) = {2^k : 0 ≤ k < 18}.
The indexing set Z/27 is partitioned by the sets

D_0 = U(27),
D_1 = 3U(27) = {3, 6, 12, 24, 21, 15},
D_2 = 9U(27) = {9, 18},
D_3 = {0}.
Order U(27) by the parameter k. The permutation π of Z/27 defined by

π = (D_3, D_2, D_1, D_0)

is the exponential permutation of Z/27 based on 2. The FT matrix F_π can be written as
F_π = [ 1      1_2^t        1_6^t         1_18^t
        1_2    E(2)         E(2, 6)       1_9^t ⊗ C(3)
        1_6    E(6, 2)      E(3) ⊗ C(3)   1_3^t ⊗ C(9)
        1_18   1_9 ⊗ C(3)   1_3 ⊗ C(9)    C(27) ],
where C(3) and C(9) are the Winograd cores based on the generators 2 mod 3 and 2 mod 9 of U(3) and U(9), and, setting w = e^{2πi/27},

C(27) = [ w^{2^{j+k}} ], 0 ≤ j, k < 18,

is called the Winograd core based on the generator 2 of U(27). Define
F_π^{++} = [ 1      1_2^t        1_6^t         1_18^t
             1_2    E(2)         E(2, 6)       1_9^t ⊗ C(3)
             1_6    E(6, 2)      E(3) ⊗ C(3)   0(6, 18)
             1_18   1_9 ⊗ C(3)   0(18, 6)      0(18) ].
Algorithm IV for computing F_π x:

• Compute u = F_π^{++} x.

• Compute

v = [ 0(6)         1_3^t ⊗ C(9) ] [ x_3 ]
    [ 1_3 ⊗ C(9)   C(27)        ] [  ⋮  ]
                                  [ x_26 ].
• Compute

F_π x = u + [ 0_3
              v ].
F_π^{++} admits the factorization

F_π^{++} = (1 ⊕ C(3) ⊕ (I_3 ⊗ C(3)) ⊕ (I_9 ⊗ C(3))) A,

where

A = [ 1      1_2^t        1_6^t        1_18^t
      −1_2   −E(2)        −E(2, 6)     1_9^t ⊗ I_2
      −1_6   −E(6, 2)     E(3) ⊗ I_2   0(6, 18)
      −1_18  1_9 ⊗ I_2    0(18, 6)     0(18) ].
The methods of the preceding section can be used to derive additional variants.

Tables of Arithmetic Counts
Table 12.1 Algorithm I F_π(9).

Factor     R.A.   R.M.
C          144    160
A          36     0
F_π^*(9)   180    160
F_π(9)     204    176
Table 12.2 Algorithm IV F_π(9).

Factor     R.A.   R.M.
H          16     0
T          0      8
F_π^*(9)   56     36 + 0
F_π(9)     88     44 + 18
Table 12.3 F_π = HDBH (N = 27).

Factor   R.A.   R.M.
H        52     0
D        0      26
B        450    414
F_π      554    440

R.A. — the number of real additions. R.M. — the number of real multiplications.
References
[1] Blahut, R. E. Fast Algorithms for Digital Signal Processing, Addison-Wesley, 1985, Chapters 4 and 8.
[2] Heideman, M. T. Multiplicative Complexity, Convolution, and the DFT, Springer-Verlag, 1988, Chapter 5.
[3] Lu, C. Fast Fourier Transform Algorithms for Special N's and the Implementations on VAX, Ph.D. Dissertation, The City University of New York, Jan. 1988.
[4] Lu, C. and Tolimieri, R. "Extension of Winograd Multiplicative Algorithm to Transform Size N = p²q, p²qr and Their Implementation", Proc. ICASSP 89, 19(D.3), Scotland.
[5] Nussbaumer, H. J. Fast Fourier Transform and Convolution Algorithms, Second Edition, Springer-Verlag, 1982.
[6] Tolimieri, R., Lu, C. and Johnson, W. R. "Modified Winograd FFT Algorithm and Its Variants for Transform Size N = p^n and Their Implementations," Advances in Applied Mathematics, 10, pp. 228-251, 1989.
[7] Winograd, S. "On Computing the Discrete Fourier Transform", Proc. Nat. Acad. Sci. USA, 73(4), April 1976, pp. 1005-1006.
[8] Winograd, S. "On Computing the Discrete Fourier Transform", Math. of Computation, 32, Jan. 1978, pp. 175-199.
Problems
1. List all of the elements of the unit group U(25) of Z/25.
2. Reindex Z/25 by its orbits D_0, D_1 and D_2.
3. Write the explicit matrix of F_π(25) and derive algorithm I and its variants.
4. Find the arithmetic counts for each of the variants derived in problem 3.
5. Derive algorithm II for F_π(25) and find its arithmetic counts.
6. Find a generator for the unit group U(125) = U(5³).
7. Find the system of idempotents {e_1, e_2} for Z/75 = Z/(5²·3).
8. Order the unit group U(75) using the idempotents found in problem 7.
9. Derive in detail the algorithm for F_π(75) following the procedures of section 4.
10. Derive the Good-Thomas algorithm for a 75-point Fourier transform.
11. Compare the arithmetic counts of the results of problems 9 and 10.
13 Periodization and Decimation
13.1 Introduction
The ring structure of Z/N provides important tools for gaining deep insights into algorithm design. The fundamental partition of the indexing set Z/p^m, a major step in the Rader-Winograd FT algorithm of the preceding chapter, was based on the unit group U(p^m).
We will now adopt the point of view that N-point data is a complex-valued function having the set Z/N as domain of definition. Denote by

L(Z/N)

the set of all complex-valued functions on Z/N and regard L(Z/N) as a complex vector space under the following rules of addition and scalar multiplication. For f, g ∈ L(Z/N) and a ∈ C,

(f + g)(j) = f(j) + g(j),
(af)(j) = a(f(j)), 0 ≤ j < N.
The ideas of this chapter are best described on the function theoretic level and constitute a part of abelian harmonic analysis. In this section, we will redefine the Fourier transform as a linear operator of the vector space L(Z/N) whose matrix relative to the standard basis is the N-point FT F(N).
In the next section, subspaces of L(Z/N) will be introduced corresponding to subgroups of Z/N. For a subgroup B, we define the subspace
of B-periodic functions and the subspace of B-decimated functions. The main theorem, proved in section 3, establishes an important duality between these subspaces determined by the FT. This duality plays a role in both the Cooley-Tukey algorithms and the Rader-Winograd algorithms, and provides the key to understanding the global structure of many one-dimensional and multidimensional FT algorithms.
Denote the linear span of a subset S of a vector space V by Ln(S). For simplicity, we introduce the following convention. Suppose that S is a subset of a vector space V and determines a basis of Ln(S). If Y is a linear transformation of V such that Ln(S) is Y-invariant, then we say that M is the matrix of Y with respect to S whenever M is the matrix of the restriction of Y to Ln(S) with respect to S.
The set of functions

{e_l : 0 ≤ l < N}

with

e_l(k) = 1 if k = l and 0 if k ≠ l, 0 ≤ k < N,
is a basis of L(Z/N) called the standard basis. If f ∈ L(Z/N), we can write

f = Σ_{l=0}^{N−1} f(l) e_l,   (13.1)

and we call the vector of size N

f = [ f(0)
      ⋮
      f(N − 1) ]
the standard representation of f. An inner product on L(Z/N) is defined by the formula

(f, g) = Σ_{j=0}^{N−1} f(j) g*(j),   (13.2)

where * denotes complex conjugation. The standard basis is an orthonormal basis relative to this inner product in the sense that

(e_j, e_k) = 1 if j = k and 0 if j ≠ k, 0 ≤ j, k < N.

The space L(Z/N) viewed as an inner product space with inner product (13.2) is denoted by L²(Z/N).
We will define another basis of L(Z/N) bearing some relationship to the ring structure of Z/N.
A function

χ : Z/N → C^×,

where C^× denotes the multiplicative group of nonzero complex numbers, is called an additive character if the following condition holds:

χ(l + k) = χ(l)χ(k), 0 ≤ l, k < N,

with the addition l + k taken mod N. In mathematical language, an additive character is a homomorphism from the additive group Z/N into the multiplicative group C^×.
The additive characters on Z/N can be described as follows. Denote the subgroup of C^× consisting of all N-th roots of unity by U_N.
Theorem 13.1 An additive character χ of Z/N is a homomorphism of the additive group Z/N into the multiplicative group U_N and is uniquely determined by χ(1) by the formula

χ(j) = χ(1)^j, 0 ≤ j < N.
The group U_N is a cyclic group that has the element

w = e^{2πi/N}

as a generator. For each k, 0 ≤ k < N, we define the mapping

χ_k : Z/N → U_N

by setting

χ_k(j) = w^{kj}, 0 ≤ j < N.   (13.3)

By theorem 13.1, the set

{χ_k : 0 ≤ k < N}   (13.4)

is the set of additive characters on Z/N.
Theorem 13.2 The set of additive characters on Z/N is an orthogonal basis of L²(Z/N), and for any additive character χ on Z/N we have

||χ||² = (χ, χ) = N.

Proof If v is an N-th root of unity, then

Σ_{l=0}^{N−1} v^l = 0 if v ≠ 1 and N if v = 1.

Take 0 ≤ l, k < N. By (13.2) and (13.3), we have that

(χ_l, χ_k) = Σ_{r=0}^{N−1} w^{r(l−k)} = 0 if l ≠ k and N if l = k.
This implies that the set (13.4) is an orthogonal subset of L(Z/N). Since there are N distinct elements in (13.4), the same as the dimension of L(Z/N), the set (13.4) is a basis of L(Z/N), completing the proof of the theorem.
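A numerical sketch (not in the text) of theorem 13.2 for N = 12:

```python
import cmath

# The additive characters χ_k(j) = w^(kj) on Z/12 are pairwise
# orthogonal with squared norm N under the inner product (13.2).
N = 12
w = cmath.exp(2j * cmath.pi / N)
chi = [[w ** (k * j) for j in range(N)] for k in range(N)]

def inner(f, g):
    return sum(a * b.conjugate() for a, b in zip(f, g))

ok = all(abs(inner(chi[l], chi[k]) - (N if l == k else 0)) < 1e-9
         for l in range(N) for k in range(N))
print(ok)  # True
```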
The FT of Z/N can now be defined as the linear operator F(N) of L(Z/N) satisfying the following condition:

F(N)(e_j) = χ_j, 0 ≤ j < N.

By (13.1), we have that

F(N)(f) = Σ_{j=0}^{N−1} f(j) χ_j, f ∈ L(Z/N).

Setting g = F(N)(f), we have

g = F(N) f,

which implies that the matrix of the FT of Z/N relative to the standard basis is F(N).
There are useful properties of the linear operator F(N) that will be needed throughout this chapter. They can be proved directly or by using the corresponding properties of the matrix F(N). We list them without proof:

F(N)⁴ = N² I,   (13.5)

where I is the identity operator on L(Z/N), and

F(N)²(e_j) = N e_{N−j}, 0 ≤ j < N,   (13.6)

(f, g) = (1/N)(F(N)(f), F(N)(g)), f, g ∈ L(Z/N).   (13.7)
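These operator identities are easy to confirm numerically; a sketch (not in the text) for N = 8:

```python
import cmath

# Check (13.5) and (13.6): F(N)^4 = N^2·I, and column j of F(N)^2
# is N·e_{(N-j) mod N}.
N = 8
w = cmath.exp(2j * cmath.pi / N)
F = [[w ** (j * k) for k in range(N)] for j in range(N)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

F2 = matmul(F, F)
F4 = matmul(F2, F2)
ok5 = all(abs(F4[i][j] - (N * N if i == j else 0)) < 1e-6
          for i in range(N) for j in range(N))
ok6 = all(abs(F2[i][j] - (N if i == (N - j) % N else 0)) < 1e-6
          for i in range(N) for j in range(N))
print(ok5, ok6)  # True True
```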
13.2 Periodic and Decimated Data
Set w = e^{2πi/N}. Since every subgroup B of Z/N has the form

B = rZ/N,

where r is a divisor of N:

B = {rk : 0 ≤ k < s}, N = rs,

we have B(Z/N) ⊂ B.
The ring structure of Z/N gives rise to the bilinear pairing

(l, k) = w^{lk}, 0 ≤ l, k < N.   (13.8)

The product lk can be taken either mod N or in Z since w^N = 1. Direct computation shows that the bilinear pairing (13.8) satisfies the following three properties. For 0 ≤ l, k, m < N,

• (l + k, m) = (l, m)(k, m),
• (lk, m) = (l, m)^k,
• (l, k) = (k, l).
The dual B^⊥ of a subgroup B of Z/N is defined by

B^⊥ = {l ∈ Z/N : (l, k) = 1 for all k ∈ B}.

B^⊥ is a subgroup of Z/N.
Example 13.1 Take N = 6 and B = 2Z/6. Then

B^⊥ = 3Z/6.

Example 13.2 Take N = 9 and B = 3Z/9. Then B^⊥ = B.
Theorem 13.3 If B = rZ/N, then B^⊥ = sZ/N, N = rs.

Proof Since N = rs and w^N = 1, we have that

sZ/N ⊂ B^⊥.

Conversely, if k ∈ B^⊥, then w^{kr} = 1, which implies that s divides k and

B^⊥ ⊂ sZ/N,

completing the proof of the theorem.

Corollary 13.1 (B^⊥)^⊥ = B.
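The dual can also be computed directly from the pairing; a sketch (not in the text) reproducing examples 13.1 and 13.2:

```python
import cmath

# Compute B^⊥ = {l : w^(lk) = 1 for all k in B} for B = rZ/N;
# theorem 13.3 predicts sZ/N with N = rs.
def dual(N, r):
    w = cmath.exp(2j * cmath.pi / N)
    B = [r * k % N for k in range(N // r)]
    return [l for l in range(N)
            if all(abs(w ** (l * k) - 1) < 1e-9 for k in B)]

print(dual(6, 2), dual(9, 3))  # [0, 3] [0, 3, 6]
```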
Suppose that B = rZ/N, N = rs throughout this section. A function f ∈ L(Z/N) is called B-periodic if the following condition is satisfied:

f(a + b) = f(a), a ∈ Z/N, b ∈ B.

A B-periodic function f is uniquely determined by the vector of values

g = [ f(0)
      ⋮
      f(r − 1) ]
by the formula

f(l + rk) = g(l), 0 ≤ l < r, 0 ≤ k < s.

In vector notation, f = 1_s ⊗ g.
Define

π : Z/N → Z/r

by setting π(x) = x', where x ≡ x' mod r and 0 ≤ x' < r. The mapping π is a ring-homomorphism of Z/N onto Z/r. For g ∈ L(Z/r), define

f = π*(g) ∈ L(Z/N)

by the formula

π*(g)(a) = g(π(a)), a ∈ Z/N.

Observe that f is B-periodic. Every B-periodic function is of this form, and we have the following result.
Theorem 13.4 The mapping

π* : L(Z/r) → L(Z/N)

is a linear isomorphism of L(Z/r) onto the space of B-periodic functions in L(Z/N), B = rZ/N.
A function f ∈ L(Z/N) is called B-decimated if we have, for a ∈ Z/N,

f(a) = 0, a ∉ B.

If f is B-decimated, then

f = [ f(0)
      f(r)
      ⋮
      f((s − 1)r) ] ⊗ e(r).
Take h ∈ L(Z/s) and define

f = σ*(h) ∈ L(Z/N)

by setting

f(kr) = h(k), 0 ≤ k < s,

and f equal to 0 otherwise.

Theorem 13.5 The mapping

σ* : L(Z/s) → L(Z/N)

is a linear isomorphism of L(Z/s) onto the space of B-decimated functions in L(Z/N).
13.3 FT of Periodic and Decimated Data
We come to the first major result of this chapter: the duality between spaces of periodic functions and spaces of decimated functions determined by the FT. This duality is the function theoretic analog of theorem 13.1. Throughout this section, we set w = e^{2πi/N} and

(l, k) = w^{lk}, 0 ≤ l, k < N.
Take B = rZ/N, N = rs. Suppose that f ∈ L(Z/N) is B-periodic. Since every k, 0 ≤ k < N, can be written uniquely in the form

k = k' + b, 0 ≤ k' < r, b ∈ B,

we have

F(N)f(l) = Σ_{k'=0}^{r−1} Σ_{b∈B} f(k' + b)(k', l)(b, l),

implying by B-periodicity that

F(N)f(l) = Σ_{k=0}^{r−1} f(k)(k, l) Σ_{b∈B} (b, l), 0 ≤ l < N.
We want to compute

γ(l) = Σ_{b∈B} (b, l), 0 ≤ l < N.

There are two cases to consider. If l ∈ B^⊥, then (b, l) = 1 for b ∈ B, and we have

γ(l) = s, l ∈ B^⊥,

proving that

F(N)f(l) = s Σ_{k=0}^{r−1} f(k)(k, l), l ∈ B^⊥.

If l ∉ B^⊥, then there exists c ∈ B such that (c, l) ≠ 1. Since

(c, l)γ(l) = Σ_{b∈B} (c + b, l) = Σ_{b∈B} (b, l) = γ(l),

we have γ(l) = 0, l ∉ B^⊥, proving that

F(N)f(l) = 0, l ∉ B^⊥.

By theorem 13.3, B^⊥ = sZ/N, which implies that

F(N)f(ls) = s Σ_{k=0}^{r−1} f(k)(k, l)^s, 0 ≤ l < r.
Since (k, l)^s = e^{2πilk/r}, we have

F(N)f(ls) = s Σ_{k=0}^{r−1} f(k) e^{2πilk/r},

proving the next result.
Theorem 13.6 Suppose that B = rZ/N, N = rs and f is B-periodic. Then F(N)f is B^⊥-decimated and, on B^⊥, is given by

[ F(N)f(0)         ]           [ f(0)     ]
[ F(N)f(s)         ] = s F(r)  [ f(1)     ]
[ ⋮                ]           [ ⋮        ]
[ F(N)f((r − 1)s)  ]           [ f(r − 1) ].

Observe that computing the N-point FT of B-periodic data can be carried out using one r-point FT.
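A numerical sketch (not in the text) of theorem 13.6 with N = 12, r = 3, s = 4:

```python
import cmath

# For B-periodic f (B = 3Z/12), F(N)f vanishes off B^⊥ = 4Z/12 and on
# B^⊥ equals s·F(3) applied to (f(0), f(1), f(2)).
N, r, s = 12, 3, 4
g = [1.0, -2.0, 0.5]                 # arbitrary values on Z/3
f = [g[j % r] for j in range(N)]     # B-periodic extension
w = cmath.exp(2j * cmath.pi / N)
Ff = [sum(f[k] * w ** (k * l) for k in range(N)) for l in range(N)]
off = all(abs(Ff[l]) < 1e-9 for l in range(N) if l % s != 0)
wr = cmath.exp(2j * cmath.pi / r)
on = all(abs(Ff[l * s] - s * sum(g[k] * wr ** (k * l) for k in range(r))) < 1e-9
         for l in range(r))
print(off, on)  # True True
```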
Since the space of B-periodic functions and the space of B^⊥-decimated functions have the same dimension r, theorem 13.6 implies the next result.

Corollary 13.2 The FT of Z/N maps the space of B-periodic functions isomorphically onto the space of B^⊥-decimated functions.
Consider a B-decimated function g ∈ L(Z/N). We are still assuming that B = rZ/N. By definition,

F(N)g(l) = Σ_{k∈B} g(k)(k, l), 0 ≤ l < N.

Since s ∈ B^⊥ implies that (k, s) = 1 for all k ∈ B, replacing l by l + s in the equation, we have

F(N)g(l + s) = F(N)g(l), 0 ≤ l < N,

implying that F(N)g is B^⊥-periodic and

F(N)g(l) = Σ_{k=0}^{s−1} g(rk) e^{2πilk/s}, 0 ≤ l < s,

proving the next result.
Theorem 13.7 Suppose that B = rZ/N, N = rs and g is B-decimated. Then F(N)g is B^⊥-periodic and is given by the matrix formula

[ F(N)g(0)      ]          [ g(0)         ]
[ F(N)g(1)      ] = F(s)   [ g(r)         ]
[ ⋮             ]          [ ⋮            ]
[ F(N)g(s − 1)  ]          [ g((s − 1)r)  ].
Corollary 13.3 The FT of Z/N maps the space of B-decimated functions isomorphically onto the space of B^⊥-periodic functions.
The preceding two theorems express the duality on the function spaces determined by the FT. They will serve to give the global structure of several 'multiplicative' FT algorithms. We take up this topic in detail in the following chapters.
These theorems can be viewed as the first step in the N-point Cooley-Tukey algorithm corresponding to the divisor r.
13.4 The Ring Z/p^m
The results of the preceding section will be applied to the special case of N = p^m, m > 1 and p an odd prime. The group Z/p^m has a unique maximal subgroup pZ/p^m, and every subgroup of Z/p^m has the form p^k Z/p^m, 0 ≤ k ≤ m. We have

(0) = p^m Z/p^m ⊂ p^{m−1} Z/p^m ⊂ ··· ⊂ pZ/p^m ⊂ Z/p^m.
For l ≤ k, denote by

L(p^k, p^l), 0 ≤ k, l ≤ m,

the subspace of all f ∈ L(Z/p^m) satisfying the following two conditions:

• f is p^k Z/p^m-decimated,
• f is p^l Z/p^m-periodic.

L(p, p^m) is the subspace of all pZ/p^m-decimated functions and L(1, p) is the subspace of all pZ/p^m-periodic functions.
By theorem 13.3, the dual of p^k Z/p^m is p^{m−k} Z/p^m. Applying theorems 13.6 and 13.7, we have the following result.

Theorem 13.8 The FT of Z/p^m maps the subspace L(p^k, p^l) onto the subspace L(p^{m−l}, p^{m−k}), 0 ≤ k ≤ l ≤ m.
The subspace L(p, p^{m−1}) is especially important. By theorem 13.8, it is invariant under the FT of Z/p^m in the sense that

F(N)L(p, p^{m−1}) = L(p, p^{m−1}).

N^{−1/2} F(N) is a unitary operator of L²(Z/p^m). Denote the orthogonal complement of L(p, p^{m−1}) in L²(Z/p^m) by W:

W = {f ∈ L²(Z/p^m) : (f, g) = 0 for all g ∈ L(p, p^{m−1})}.

W satisfies the following two properties:
• L²(Z/p^m) = W ⊕ L(p, p^{m−1}).
• F(N)W = W.

These properties imply that we can study the action of the FT of L(Z/p^m) by independently studying the action on L(p, p^{m−1}) and on W.
The set of functions in L(Z/p^m)

{E_k : 0 ≤ k < p^{m−1}},

defined by

[ E_0             ]                          [ e_0     ]
[ E_1             ] = (1_p^t ⊗ I_{p^{m−1}})  [ e_1     ]
[ ⋮               ]                          [ ⋮       ]
[ E_{p^{m−1}−1}   ]                          [ e_{N−1} ],

is a basis of L(1, p^{m−1}) and the subset

{E_{pk} : 0 ≤ k < p^{m−2}}

is a basis of L(p, p^{m−1}).
Theorem 13.9 If N = p^m, m ≥ 2 and p an odd prime, then the matrix of F(p^m) with respect to the basis

{E_{pk} : 0 ≤ k < p^{m−2}}

is p F(p^{m−2}).

Proof The function E_{pk}, 0 ≤ k < p^{m−2}, is equal to 1 on the set

S = {pk + r p^{m−1} : 0 ≤ r < p}

and vanishes otherwise. Set w = e^{2πi/p^m} and v = e^{2πi/p^{m−2}}. Since L(p, p^{m−1}) is F(p^m)-invariant, F(p^m)E_{pk} vanishes off of pZ/p^m. On pZ/p^m we have

F(p^m)E_{pk}(lp + s p^{m−1}) = F(p^m)E_{pk}(lp)
= Σ_{r=0}^{p−1} w^{lp(pk + r p^{m−1})} = p v^{lk}, 0 ≤ l, k < p^{m−2},

implying that

F(p^m)E_{pk} = p Σ_{l=0}^{p^{m−2}−1} v^{lk} E_{pl},

proving the theorem.
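A numerical sketch (not in the text) of theorem 13.9 for p = 3, m = 3 (N = 27):

```python
import cmath

# F(27) applied to the basis function E_{3k} of L(3, 9) equals
# 3·Σ_l v^(lk) E_{3l}, v = exp(2πi/3).
N = 27
w = cmath.exp(2j * cmath.pi / N)
v = cmath.exp(2j * cmath.pi / 3)

def E(k):                      # indicator of {k + 9r : 0 <= r < 3}
    return [1.0 if j % 9 == k else 0.0 for j in range(N)]

def FT(f):
    return [sum(f[a] * w ** (a * b) for a in range(N)) for b in range(N)]

ok = True
for k in range(3):
    lhs = FT(E(3 * k))
    rhs = [3 * sum(v ** (l * k) * E(3 * l)[j] for l in range(3))
           for j in range(N)]
    ok = ok and all(abs(lhs[j] - rhs[j]) < 1e-9 for j in range(N))
print(ok)  # True
```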
Problems
1. Prove that if χ is an additive character of Z/N, then χ(1) is an N-th root of unity (without relying on theorem 13.1).
2. Prove that the set of additive characters of Z/N is a group under the product rule:
(XXi)(k) = X(k)X1(k), k E Z/N,
where x and x' are additive characters of Z/N.
3. Prove formula (13.5).
4. Prove formula (13.6) and describe the matrix of F(N)² relative to the standard basis.
5. Prove that the dual of a subgroup is a subgroup.
6. Write down the collection of all B-periodic functions on Z/N, where N = 6 and B = 2ZI6.
7. Repeat problem 6 with N = 27 and B = 3Z/27.
8. Write down the collection of all B-decimated functions on Z/N, where N = 6 and B = 3Z/6.
9. Repeat problem 8 with N = 27 and B = 9Z/27.
10. Verify F(p³)L(p, p²) = L(p, p²).
14 Multiplicative Characters and the FT
14.1 Introduction
Fix an odd prime p throughout this chapter. For m > 1, consider the subspace

L(p, p^{m−1}) ⊂ L(Z/p^m).

In the preceding chapter, we proved that L(p, p^{m−1}) is F(p^m)-invariant and described the action of F(p^m) on L(p, p^{m−1}). Denoting the orthogonal complement of L(p, p^{m−1}) in L²(Z/p^m) by W, we have

L(Z/p^m) = W ⊕ L(p, p^{m−1}),

and W is F(p^m)-invariant. We will describe the action of F(p^m) on W.
Suppose that m ≥ 1 and set U(p^m) = U(Z/p^m). A multiplicative character χ of Z/p^m is a homomorphism

χ : U(p^m) → C^×

from the multiplicative group U(p^m) into the multiplicative group C^× of nonzero complex numbers:

χ(ab) = χ(a)χ(b), a, b ∈ U(p^m).

The product ab is taken in U(p^m) and is multiplication mod p^m.
Example 14.1 U(7) is a cyclic group of order 6. Set u = e^{2πi/6}. A multiplicative character χ of Z/7 is completely determined by its value on
the generator 3 of U(7) by the formula

χ(3^k) = χ(3)^k, 0 ≤ k < 6.

Since 3⁶ ≡ 1 mod 7, χ(3) is a 6-th root of unity. There exist exactly six multiplicative characters of Z/7 defined by the following table.
Table 14.1 Multiplicative characters of Z/7.

      1   3     2      6      4      5
χ_0   1   1     1      1      1      1
χ_1   1   u     u²     u³     u⁴     u⁵
χ_2   1   u²    u⁴     u⁶     u⁸     u¹⁰
χ_3   1   u³    u⁶     u⁹     u¹²    u¹⁵
χ_4   1   u⁴    u⁸     u¹²    u¹⁶    u²⁰
χ_5   1   u⁵    u¹⁰    u¹⁵    u²⁰    u²⁵
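A small sketch (not in the text) building the table's characters and checking that each is multiplicative on U(7):

```python
import cmath

# Characters of Z/7 from the generator 3; extended by chi(0) = 0.
u = cmath.exp(2j * cmath.pi / 6)
powers = [pow(3, k, 7) for k in range(6)]        # 1, 3, 2, 6, 4, 5
chi = []
for l in range(6):
    table = {powers[k]: u ** (l * k) for k in range(6)}
    table[0] = 0.0
    chi.append(table)
ok = all(abs(chi[l][a * b % 7] - chi[l][a] * chi[l][b]) < 1e-9
         for l in range(6) for a in range(1, 7) for b in range(1, 7))
print(ok)  # True
```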
In general, U(p) is a cyclic group of order t = p − 1. Set u = e^{2πi/t}. A multiplicative character χ of Z/p is completely determined by its value on a generator z of U(p) by the formula

χ(z^k) = χ(z)^k, 0 ≤ k < t.

Since z^t ≡ 1 mod p, χ(z) is a t-th root of unity. For 0 ≤ l < t, define the multiplicative character χ_l of Z/p by setting

χ_l(z) = u^l.

There exist exactly t multiplicative characters on Z/p given by the following table:
Table 14.2 Multiplicative characters of Z/p.

        1   z         z²           ···   z^{t−1}
χ_0     1   1         1            ···   1
χ_1     1   u         u²           ···   u^{t−1}
⋮
χ_{t−1} 1   u^{t−1}   u^{2(t−1)}   ···   u^{(t−1)(t−1)}
Denote by Û(p) the set of multiplicative characters of Z/p. Extend the domain of definition of a multiplicative character χ of Z/p to all of Z/p by setting χ(0) = 0. A group multiplication is placed on Û(p) by the rule

(χχ')(a) = χ(a)χ'(a), a ∈ Z/p, χ, χ' ∈ Û(p).

Since χ_k χ_l = χ_{k+l} with k + l taken mod t, each generator of U(p) determines a group-isomorphism from Û(p) onto U(p).
Û(p) is called the multiplicative character group of Z/p. The identity element of Û(p) is χ_0, which is usually called the principal multiplicative character of Z/p.
Denote by χ̃(p) and e_χ(p) the vectors of functions of size t given by

χ̃(p) = [ χ_0        ]        e_χ(p) = [ e_1         ]
        [ χ_1        ],                [ e_z         ]
        [ ⋮          ]                 [ ⋮           ]
        [ χ_{t−1}    ]                 [ e_{z^{t−1}} ].

By table 14.2,

χ̃(p) = F(t) e_χ(p),

which is the multiplicative analog of the result that F(N) maps the standard basis of L(Z/N) onto the basis of additive characters.
Suppose that m > 1. U(p^m) is a cyclic group of order t = p^{m−1}(p − 1). Set u = e^{2πi/t}. A multiplicative character χ of Z/p^m is completely determined by its value on a generator z of U(p^m) by

χ(z^k) = χ(z)^k, 0 ≤ k < t.

Since z^t ≡ 1 mod p^m, χ(z) is a t-th root of unity. There are exactly t multiplicative characters χ_l, 0 ≤ l < t, defined by

χ_l(z) = u^l.
Table 14.3 Multiplicative characters of Z/p^m.

        1   z         z²           ···   z^{t−1}
χ_0     1   1         1            ···   1
χ_1     1   u         u²           ···   u^{t−1}
⋮
χ_{t−1} 1   u^{t−1}   u^{2(t−1)}   ···   u^{(t−1)(t−1)}
Denote by Û(p^m) the set of multiplicative characters on Z/p^m. Extend the domain of definition of a multiplicative character χ of Z/p^m by setting χ(a) = 0 whenever a ∉ U(p^m). A group multiplication is placed on Û(p^m) by the rule

(χχ')(a) = χ(a)χ'(a), a ∈ Z/p^m, χ, χ' ∈ Û(p^m).

Each generator of U(p^m) determines a group-isomorphism between Û(p^m) and U(p^m). Û(p^m) is called the multiplicative character group of Z/p^m. The identity element of Û(p^m) is χ_0, which is usually called the principal multiplicative character of Z/p^m.
The definitions of χ̃(p) and e_χ(p) easily extend to definitions of χ̃(p^m) and e_χ(p^m), and we have

χ̃(p^m) = F(t) e_χ(p^m).
There are several important formulas involving multiplicative characters that will be used repeatedly throughout this work.
Theorem 14.1 For χ ∈ Û(p^m),

Σ_{k∈U(p^m)} χ(k) = t if χ = χ_0, and 0 otherwise.

Proof The case χ = χ_0 is trivial. Suppose that χ ≠ χ_0 and take k_0 in U(p^m) such that

χ(k_0) ≠ 1.

As k runs over U(p^m), k_0 k runs over U(p^m) and we have

Σ_{k∈U(p^m)} χ(k_0 k) = Σ_{k∈U(p^m)} χ(k).

Since χ(k_0 k) = χ(k_0)χ(k), we also have

Σ_{k∈U(p^m)} χ(k_0 k) = χ(k_0) Σ_{k∈U(p^m)} χ(k),

and the theorem follows.
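A numerical sketch (not in the text) of theorem 14.1 for p^m = 25, using the fact that 2 generates U(25) (2¹⁰ ≡ −1 mod 25), so the sum of χ_l over U(25) is a geometric sum in u = e^{2πi/t}:

```python
import cmath

# Σ over U(25) of χ_l equals t = 20 for l = 0 and 0 otherwise.
t = 20
u = cmath.exp(2j * cmath.pi / t)
sums = [sum(u ** (l * k) for k in range(t)) for l in range(t)]
ok = abs(sums[0] - t) < 1e-9 and all(abs(s) < 1e-9 for s in sums[1:])
print(ok)  # True
```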
Denote by (f, g) the inner product of two functions f, g ∈ L(Z/p^m),

(f, g) = Σ_{a∈Z/p^m} f(a) g*(a).

Corollary 14.1 For two distinct multiplicative characters χ, χ' of Z/p^m,

(χ, χ') = 0,

while

(χ, χ) = t.

Corollary 14.2 Û(p^m) is an orthogonal subset of L²(Z/p^m).
14.2 Periodicity
14.2.1 Periodic Multiplicative Characters
Subspaces of L(Z/p^m) will be constructed from sets of multiplicative characters satisfying certain periodicity conditions. Z/p has only trivial subgroups, so periodicity plays no role. However, we will distinguish two subsets of Û(p). Set

Ū(p) = {χ_0},
Ũ(p) = Û(p) − {χ_0} (set difference).

χ ∈ Ũ(p) is called a primitive multiplicative character of Z/p. Suppose that m > 1. For 1 ≤ k ≤ m, define

Û(p^m, p^k) = {χ ∈ Û(p^m) : χ is p^k Z/p^m-periodic}.

Û(p^m, p^k) is a subgroup of Û(p^m), and we have

Û(p^m, p) ⊂ Û(p^m, p²) ⊂ ··· ⊂ Û(p^m, p^m) = Û(p^m).
Form the set differences,

Ũ(p^m, p) = Û(p^m, p) − {χ_0},
Ũ(p^m, p^k) = Û(p^m, p^k) − Û(p^m, p^{k−1}), 2 ≤ k ≤ m.

Û(p^m) is the disjoint union

Û(p^m) = {χ_0} ∪ ⋃_{k=1}^{m} Ũ(p^m, p^k).

The set

Ũ(p^m) = Ũ(p^m, p^m)

is called the set of primitive multiplicative characters on Z/p^m. Ũ(p^m) is the set of multiplicative characters that are not periodic with respect to any nontrivial subgroup of Z/p^m.
Example 14.2 Order U(9) exponentially relative to the generator 2,

1, 2, 4, 8, 7, 5,

and form the 3 × 2 array

1 2
4 8
7 5.

Two elements of U(9) are equal mod 3 if and only if they lie in the same column of the array. A multiplicative character χ of Z/9 is in Û(9, 3) if and only if χ is constant on the columns of the array. This will be the case if and only if χ(4) = 1. Consider χ_l ∈ Û(9) relative to the generator 2. Since

χ_l(4) = e^{2πil/3},

χ_l is in Û(9, 3) if and only if 3 | l, and we have

Û(9, 3) = {χ_0, χ_3}.
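A small sketch (not in the text) checking the periodicity condition of example 14.2 directly:

```python
import cmath

# χ_l on U(9) (generator 2) is 3Z/9-periodic exactly when 3 | l.
u = cmath.exp(2j * cmath.pi / 6)
powers = [pow(2, k, 9) for k in range(6)]          # 1, 2, 4, 8, 7, 5
periodic = []
for l in range(6):
    chi = {powers[k]: u ** (l * k) for k in range(6)}
    # 3Z/9-periodicity: chi(a) == chi(a + 3) for every unit a
    if all(abs(chi[a] - chi[(a + 3) % 9]) < 1e-9 for a in chi):
        periodic.append(l)
print(periodic)  # [0, 3]
```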
Example 14.3 Order U(27) exponentially relative to the generator 2 and form the 9 × 2 array

 1  2
 4  8
16  5
10 20
13 26
25 23
19 11
22 17
 7 14.
Two elements in U(27) are equal mod 3 if and only if they lie in the same column of the array. χ is in Û(27, 3) if and only if χ is constant on the columns of the array. This will be the case if and only if χ(4) = 1. Consider χ_l ∈ Û(27) relative to the generator 2. Since

χ_l(4) = e^{2πil/9},

χ_l is in Û(27, 3) if and only if 9 | l, and we have

Û(27, 3) = {χ_0, χ_9}.
Form the 3 × 6 array,

 1  2  4  8 16  5
10 20 13 26 25 23
19 11 22 17  7 14.
Two elements in U(27) are equal mod 9 if and only if they lie in the same column of the array. χ is in Û(27, 9) if and only if χ is constant on the columns of the array. This will be the case if and only if χ(10) = 1. Consider χ_l ∈ Û(27) relative to the generator 2. Since

χ_l(10) = e^{2πil/3},

χ_l is in Û(27, 9) if and only if 3 | l, and we have

Û(27, 9) = {χ_0, χ_3, χ_6, χ_9, χ_12, χ_15}.
Suppose that m > 1. Throughout this section, set t = p^{m−1}(p − 1), the order of U(p^m). Fix a generator z of U(p^m). Denote by χ_l, 0 ≤ l < t, the multiplicative characters of Z/p^m based on z. We will describe Û(p^m, p^k).
Set t_k = p^{k−1}(p − 1), the order of U(p^k). Since z mod p^k generates U(p^k),

z^{t_k} ≡ 1 mod p^k,

with t_k the smallest positive power of z having this property.
Form the p^{m−k} × t_k array,

z^0         z^1          ···   z^{t_k − 1}
z^{t_k}     z^{t_k + 1}  ···   z^{2t_k − 1}
⋮
z^{r t_k}   ···                z^{t − 1}

with r = p^{m−k} − 1. Two elements in U(p^m) are equal mod p^k if and only if they lie in the same column of the array. χ is in Û(p^m, p^k) if and only if χ is constant on the columns of the array. This will be the case if and only if χ(z^{t_k}) = 1. Consider χ_l ∈ Û(p^m). Since

χ_l(z^{t_k}) = e^{2πil/p^{m−k}},

χ_l is in Û(p^m, p^k) if and only if p^{m−k} | l, proving the following result.
Theorem 14.2
1-1(pm,pk) = {xi :0 </<tandpm-k1/1.
Corollary 14.3
1'74m) =- {x/ : 0 < / < t and (1,p1n) =
cl(pm,pk) = {xi : 0 < / < t and (/,pm) =
Example 14.4 Relative to any generator of U(81),
Û(81, 3) = {χ_0, χ_27},
Û(81, 9) = {χ_0, χ_9, χ_18, χ_27, χ_36, χ_45},
Û(81, 27) = {χ_0, χ_3, χ_6, χ_9, χ_12, ..., χ_51},
and
V̂(81) = {χ_1, χ_2, χ_4, χ_5, ..., χ_52, χ_53},
V̂(81, 27) = {χ_3, χ_6, χ_12, ..., χ_48, χ_51},
V̂(81, 9) = {χ_9, χ_18, χ_36, χ_45},
V̂(81, 3) = {χ_27}.
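Theorem 14.2 can be checked by brute force on a small case. The following Python sketch (ours, not from the text; the helper names are hypothetical) tests, for p^m = 27 with generator z = 2, which characters χ_l are constant on units that agree mod 3 and mod 9:

```python
import cmath

# Brute-force check of Theorem 14.2 for p^m = 27: chi_l lies in U^(27, 3^k)
# -- i.e. is constant on classes of units that agree mod 3^k -- exactly
# when 3^(3-k) divides l.

z, N, t = 2, 27, 18          # 2 generates U(27), which has order 18

units = [pow(z, j, N) for j in range(t)]
log = {pow(z, j, N): j for j in range(t)}   # discrete log base 2

def chi(l, a):
    """Multiplicative character chi_l at a unit a (0 on non-units)."""
    if a % 3 == 0:
        return 0.0
    return cmath.exp(2j * cmath.pi * l * log[a % N] / t)

def constant_mod(l, pk):
    """Is chi_l constant on units that are equal mod pk?"""
    return all(abs(chi(l, a) - chi(l, b)) < 1e-9
               for a in units for b in units if (a - b) % pk == 0)

U_hat_27_3 = [l for l in range(t) if constant_mod(l, 3)]
U_hat_27_9 = [l for l in range(t) if constant_mod(l, 9)]
print(U_hat_27_3)   # -> [0, 9]            (9 | l)
print(U_hat_27_9)   # -> [0, 3, 6, 9, 12, 15]  (3 | l)
```

The printed index sets agree with Example 14.3.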
14.2.2 Periodization and Decimation
Consider the ring-homomorphism
π(p^m, p^k) : Z/p^m → Z/p^k
defined by π(p^m, p^k)a = a mod p^k, a ∈ Z/p^m. For f ∈ L(Z/p^k), define
π*(p^m, p^k)f ∈ L(Z/p^m)
by π*(p^m, p^k)f(a) = f(a mod p^k), a ∈ Z/p^m.
The periodization map
π*(p^m, p^k) : L(Z/p^k) → L(Z/p^m)
is a linear isomorphism of L(Z/p^k) onto the subspace of all p^k Z/p^m-periodic functions in L(Z/p^m). Since π(p^m, p^k) restricts to a group-homomorphism of U(p^m) onto U(p^k), we have the next result.
Theorem 14.3 The periodization map π*(p^m, p^k) restricts to a group isomorphism
π*(p^m, p^k) : Û(p^k) → Û(p^m, p^k)
and to a bijection from V̂(p^k) onto V̂(p^m, p^k).
Throughout we fix a generator z of U(p^m) and reference all dependent constructions with respect to z. Set u = e^{2πi/t} and u_k = e^{2πi/t_k}, where t = p^{m-1}(p − 1) and t_k = p^{k-1}(p − 1). Denote by χ_l^{(p^k)}, 0 ≤ l < t_k, the multiplicative characters of U(p^k) relative to the generator z mod p^k of U(p^k). Set χ_l = χ_l^{(p^m)}.
Example 14.5 Under periodization into L(Z/27),
π*(27, 3): χ_0^{(3)} ↦ χ_0, χ_1^{(3)} ↦ χ_9,
π*(27, 9): χ_0^{(9)} ↦ χ_0, χ_1^{(9)} ↦ χ_3, χ_2^{(9)} ↦ χ_6, χ_3^{(9)} ↦ χ_9, χ_4^{(9)} ↦ χ_12, χ_5^{(9)} ↦ χ_15.
Example 14.6 Under periodization into L(Z/81),
π*(81, 3): χ_0^{(3)} ↦ χ_0, χ_1^{(3)} ↦ χ_27,
π*(81, 9): χ_l^{(9)} ↦ χ_{9l}, mapping χ_0^{(9)}, ..., χ_5^{(9)} to χ_0, χ_9, χ_18, χ_27, χ_36, χ_45,
π*(81, 27): χ_l^{(27)} ↦ χ_{3l}, mapping χ_0^{(27)}, χ_1^{(27)}, ..., χ_17^{(27)} to χ_0, χ_3, χ_6, ..., χ_51.
Since
π*(p^m, p^k)χ_l^{(p^k)}(z) = χ_l^{(p^k)}(z mod p^k) = u_k^l = u^{l p^{m-k}},
we have the next result.
Theorem 14.4
π*(p^m, p^k)χ_l^{(p^k)} = χ_{l p^{m-k}}, 0 ≤ l < t_k.
Consider the group-homomorphism
σ(p^m, p^k) : Z/p^k → Z/p^m
given by σ(p^m, p^k)a = p^{m-k}a, a ∈ Z/p^k. For f ∈ L(Z/p^k), define
σ*(p^m, p^k)f ∈ L(Z/p^m)
by setting σ*(p^m, p^k)f = 0 off of p^{m-k}Z/p^m and
σ*(p^m, p^k)f(p^{m-k}a) = f(a), a ∈ Z/p^k.
The decimation map
σ*(p^m, p^k) : L(Z/p^k) → L(Z/p^m)
is a linear isomorphism of L(Z/p^k) onto the subspace of all p^{m-k}Z/p^m-decimated functions in L(Z/p^m). The decimation map will be used to describe the FT of multiplicative characters in the following sections.
For 1 ≤ k < m, the set
1 + p^k Z/p^m
is a subgroup of U(p^m). In fact,
1 + p^k Z/p^m = {a ∈ U(p^m) : π(p^m, p^k)a = 1}.
If χ ∈ Û(p^m, p^k), then χ(1 + p^k Z/p^m) = 1. The converse also holds.
Theorem 14.5 χ ∈ Û(p^m, p^k) if and only if χ(1 + p^k Z/p^m) = 1.
Proof Suppose that χ(1 + p^k Z/p^m) = 1. For a ∈ U(p^m) and b ∈ p^k Z/p^m,
χ(a + b) = χ(a(1 + a^{-1}b)) = χ(a)χ(1 + a^{-1}b) = χ(a),
proving the theorem.
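The periodization and decimation maps are easy to realize on vectors. The following Python sketch (ours, not from the text; the function names are hypothetical) implements both for p^m = 9, p^k = 3 and shows their effect on a function on Z/3:

```python
# Periodization and decimation, Section 14.2.2, as vector operations.

def periodize(f, pm, pk):          # (pi* f)(a) = f(a mod pk)
    return [f[a % pk] for a in range(pm)]

def decimate(f, pm, pk):           # (sigma* f)(p^{m-k} a) = f(a), else 0
    g = [0.0] * pm
    step = pm // pk
    for a in range(pk):
        g[step * a] = f[a]
    return g

f = [0.0, 1.0, -1.0]               # a function on Z/3

pf = periodize(f, 9, 3)            # a 3Z/9-periodic function on Z/9
df = decimate(f, 9, 3)             # supported on 3Z/9

print(pf)   # -> [0.0, 1.0, -1.0, 0.0, 1.0, -1.0, 0.0, 1.0, -1.0]
print(df)   # -> [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, -1.0, 0.0, 0.0]
```

Both maps are injective and linear, matching the isomorphism statements above.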
14.3 F(p) of Multiplicative Characters
Throughout this section, fix a generator z of U(p) and set v = e^{2πi/p} and u = e^{2πi/(p−1)}.
Theorem 14.6
F(p)χ_0(a) = { p − 1, a = 0; −1, otherwise. }
Proof By definition,
F(p)χ_0(a) = Σ_{k=1}^{p−1} v^{ak}.
If a = 0, then the sum on the right is p − 1. If 0 < a < p, then, since
Σ_{k=0}^{p−1} v^{ak} = 0,
F(p)χ_0(a) = −1, completing the proof.
Theorem 14.7 If χ ∈ V̂(p), then
F(p)χ = G_p(χ)χ*,
where
G_p(χ) = F(p)χ(1) = Σ_{k=1}^{p−1} χ(k)v^k.
Proof By definition,
F(p)χ(a) = Σ_{k=1}^{p−1} χ(k)v^{ak} = Σ_{k∈U(p)} χ(k)v^{ak}.
If a = 0, then
F(p)χ(0) = Σ_{k∈U(p)} χ(k) = 0.
Suppose now that 0 < a < p. Since a is invertible mod p, we have
Σ_{k∈U(p)} χ(k)v^{ak} = Σ_{k∈U(p)} χ(a^{-1}k)v^k = χ(a^{-1}) Σ_{k∈U(p)} χ(k)v^k = G_p(χ)χ*(a),
using χ*(a) = χ(a^{-1}).
G_p(χ) is called the Gauss sum of the multiplicative character χ of Z/p. Since p^{-1/2}F(p) is unitary, we have
(F(p)χ, F(p)χ) = p(χ, χ) = |G_p(χ)|^2 (χ*, χ*),
proving the next corollary.
Corollary 14.4 If χ ∈ V̂(p), then |G_p(χ)|^2 = p.
For 0 ≤ l ≤ p − 2,
G_p(χ_l) = Σ_{k=0}^{p−2} χ_l(z^k) v^{z^k} = Σ_{k=0}^{p−2} u^{lk} v^{z^k},
proving the next result.
Theorem 14.8
[G_p(χ_0); G_p(χ_1); ...; G_p(χ_{p−2})] = F(p − 1) [v; v^z; ...; v^{z^{p−2}}].
Recall the definition of the Winograd core C(p) as the (p − 1) x (p − 1) skew-circulant matrix having 0-th row
v, v^z, ..., v^{z^{p−2}}.
Since |G_p(χ)|^2 = p whenever χ ∈ V̂(p), we have the next result.
Corollary 14.5 C(p) is invertible and
F(p − 1) C(p) F(p − 1)^{-1} = diag(G_p(χ_0), G_p(χ_1), ..., G_p(χ_{p−2})).
Example 14.7 G_3(χ_1) = ω − ω^2, ω = e^{2πi/3}.
Example 14.8
G_5(χ_1) = ω − ω^4 + i(ω^2 − ω^3),
G_5(χ_2) = ω + ω^4 − (ω^2 + ω^3),
G_5(χ_3) = ω − ω^4 − i(ω^2 − ω^3), ω = e^{2πi/5}.
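Theorem 14.7 and Corollary 14.4 can be confirmed numerically. The sketch below (ours, not from the text) builds the characters of Z/5 from the generator 2, applies F(p), and checks both F(p)χ = G_p(χ)χ* and |G_p(χ)|^2 = p:

```python
import cmath

# Verify, for p = 5 and generator z = 2, that F(p)chi = G_p(chi) chi*
# and |G_p(chi)|^2 = p (Theorem 14.7 and Corollary 14.4).

p, z = 5, 2
t = p - 1
log = {pow(z, j, p): j for j in range(t)}

def chi(k):        # chi_k as a vector on Z/p
    return [0.0 if a % p == 0 else
            cmath.exp(2j * cmath.pi * k * log[a] / t) for a in range(p)]

def F(f):          # (F(p) f)(a) = sum_b f(b) e^{2 pi i a b / p}
    return [sum(f[b] * cmath.exp(2j * cmath.pi * a * b / p)
                for b in range(p)) for a in range(p)]

for k in range(1, t):
    x = chi(k)
    Fx = F(x)
    G = Fx[1]                        # Gauss sum G_p(chi_k) = F(p)chi_k(1)
    assert abs(abs(G) ** 2 - p) < 1e-9
    x_star = chi((t - k) % t)        # chi_k* = chi_{t-k}
    assert all(abs(Fx[a] - G * x_star[a]) < 1e-9 for a in range(p))
print("ok")
```

The same loop works for any odd prime p once a generator z is supplied.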
14.4 F(p^m) of Multiplicative Characters
14.4.1 Primitive Multiplicative Characters
Suppose that m > 1. Throughout this section we fix a generator z of U(p^m) and set t = p^{m-1}(p − 1), w = e^{2πi/p^m} and u = e^{2πi/t}. First we compute the FT of the primitive multiplicative characters of Z/p^m.
Theorem 14.9 If χ ∈ V̂(p^m), then
F(p^m)χ = G_{p^m}(χ)χ*,
where G_{p^m}(χ) = F(p^m)χ(1).
Proof Suppose that a ∈ U(p^m). Since a is invertible mod p^m, we have
F(p^m)χ(a) = Σ_{k∈U(p^m)} χ(k)w^{ak} = Σ_{k∈U(p^m)} χ(a^{-1}k)w^k = χ(a^{-1}) Σ_{k∈U(p^m)} χ(k)w^k = G_{p^m}(χ)χ*(a).
Suppose that a ∉ U(p^m) and write
a = pa′, 0 ≤ a′ < p^{m-1}.
Since χ ∉ Û(p^m, p^{m-1}), χ(c) ≠ 1 for some c ∈ 1 + p^{m-1}Z/p^m. From pc ≡ p mod p^m, we have w^{ab} = w^{pa′b} = w^{pca′b} = w^{abc}, b ∈ U(p^m), and
F(p^m)χ(a) = Σ_{b∈U(p^m)} χ(b)w^{abc} = Σ_{b∈U(p^m)} χ(c^{-1}b)w^{ab} = χ(c^{-1}) F(p^m)χ(a).
Since χ(c) ≠ 1, this implies that
F(p^m)χ(a) = 0,
completing the proof.
G_{p^m}(χ) is called the Gauss sum of the primitive multiplicative character χ of Z/p^m. Arguing as above, we have the following corollary.
Corollary 14.6 If χ ∈ V̂(p^m), then |G_{p^m}(χ)|^2 = p^m.
14.4.2 Nonprimitive Multiplicative Characters
Consider χ ∈ Û(p^m, p^k), 0 < k < m. By Theorem 14.3, the periodization map π*(p^m, p^k) defines a bijection from V̂(p^k) onto V̂(p^m, p^k), and we can write
χ = π*(p^m, p^k)y, y ∈ V̂(p^k).
The decimation map σ*(p^m, p^k) isomorphically maps L(Z/p^k) onto the space of p^{m-k}Z/p^m-decimated functions. Define g_χ ∈ L(Z/p^m) by
g_χ = σ*(p^m, p^k)y.
g_χ vanishes off of p^{m-k}U(p^m), and on p^{m-k}U(p^m),
g_χ(p^{m-k}a) = χ(a), a ∈ U(p^m).
Theorem 14.10 If χ ∈ Û(p^m, p^k), 0 < k < m, then
F(p^m)χ = p^{m-k} G_{p^k}(y) g_{χ*},
where χ = π*(p^m, p^k)y, y ∈ V̂(p^k).
Proof Since χ is p^k Z/p^m-periodic, F(p^m)χ is p^{m-k}Z/p^m-decimated. By Theorem 13.6,
F(p^m)χ(p^{m-k}l) = p^{m-k} F(p^k)y(l), 0 ≤ l < p^k.
From Theorem 14.9,
F(p^k)y(l) = G_{p^k}(y) y*(l), 0 ≤ l < p^k.
Since χ(l) = y(l), 0 ≤ l < p^k, we have
F(p^m)χ(p^{m-k}l) = p^{m-k} G_{p^k}(y) g_{χ*}(p^{m-k}l),
completing the proof.
In general, for χ ∈ Û(p^m), define
G_{p^m}(χ) = F(p^m)χ(1).
For 0 ≤ l < t,
G_{p^m}(χ_l) = Σ_{k=0}^{t−1} u^{lk} w^{z^k},
implying that
[G_{p^m}(χ_0); G_{p^m}(χ_1); ...; G_{p^m}(χ_{t−1})] = F(t) [w; w^z; ...; w^{z^{t−1}}].
By Theorem 14.10 and Corollary 14.6, we have
G_{p^m}(χ_l) ≠ 0, p ∤ l,
G_{p^m}(χ_l) = 0, p | l.
Since the Winograd core C(p^m) is the skew-circulant matrix having 0-th row
w, w^z, ..., w^{z^{t−1}},
we have
F(t) C(p^m) F(t)^{-1} = diag(G_{p^m}(χ_l))_{0≤l<t}.
Consider the principal multiplicative character χ_0 of Z/p^m. Since χ_0 is pZ/p^m-periodic, F(p^m)χ_0 is p^{m-1}Z/p^m-decimated. Define the p^{m-1}Z/p^m-decimated function g_0 by setting
g_0(l p^{m-1}) = { −(p − 1), l = 0; 1, 1 ≤ l < p. }
Theorem 14.11 F(p^m)χ_0 = −p^{m-1} g_0.
Proof Since
F(p^m)χ_0(l p^{m-1}) = Σ_{b∈U(p^m)} e^{2πilb/p},
the theorem follows.
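Theorem 14.11 can be checked directly on the smallest case p = 3, m = 2. The Python sketch below (ours, not from the text) computes F(9)χ_0 and compares it with −3g_0:

```python
import cmath

# Check Theorem 14.11 for p = 3, m = 2: F(9)chi_0 = -3 g_0, where g_0
# is supported on 3Z/9 with g_0(0) = -(p-1) = -2 and g_0(3) = g_0(6) = 1.

p, m, N = 3, 2, 9
chi0 = [0.0 if a % p == 0 else 1.0 for a in range(N)]   # principal character

F_chi0 = [sum(chi0[b] * cmath.exp(2j * cmath.pi * a * b / N)
              for b in range(N)) for a in range(N)]

g0 = [0.0] * N
g0[0], g0[3], g0[6] = -2.0, 1.0, 1.0

assert all(abs(F_chi0[a] + p ** (m - 1) * g0[a]) < 1e-9 for a in range(N))
print("ok")
```

In particular F(9)χ_0 vanishes off 3Z/9, as the decimation statement requires.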
14.5 Orthogonal Basis Diagonalizing F(p)
The formulas for the FT of the multiplicative characters will be used to decompose L^2(Z/p) into the orthogonal direct sum of two-dimensional F(p)-invariant subspaces on which the FT is simple to describe.
Fix a generator z of U(p). Set t = p − 1 and write
Û(p) = {χ_0, χ_1, ..., χ_{t−1}}
with respect to z. Since the function e_0 ∈ L(Z/p) taking the value 1 at the point 0 and vanishing on U(p) is orthogonal to Û(p), the set
{e_0} ∪ Û(p)
is orthogonal and by dimension is a basis of L^2(Z/p). Denote by W(0) the subspace spanned by
b(0) = {e_0, t^{-1/2}χ_0}.
Since
F(p)e_0 = e_0 + χ_0
and
F(p)χ_0 = t e_0 − χ_0,
W(0) is F(p)-invariant, and the matrix of F(p) with respect to the basis b(0) is given by
p^{1/2}M(0) = [1, t^{1/2}; t^{1/2}, −1].
For 1 ≤ k < t/2, denote by W(k) the subspace spanned by
b(k) = {t^{-1/2}χ_k, t^{-1/2}p^{-1/2}F(p)χ_k}.
Since F(p)χ_k is a constant multiple of χ_k* = χ_{t−k}, b(k) is orthogonal. In general,
F^2(p)f(a) = p f(−a), f ∈ L(Z/p), a ∈ Z/p,
implying that
F^2(p)χ_k = p χ_k(−1) χ_k = p(−1)^k χ_k.
The subspace W(k) is F(p)-invariant, and the matrix of F(p) with respect to b(k) is p^{1/2}M(k), where
M(k) = [0, (−1)^k; 1, 0].
Denote by W(t/2) the one-dimensional subspace spanned by
b(t/2) = {t^{-1/2}χ_{t/2}}.
Since χ_{t/2}* = χ_{t/2}, W(t/2) is F(p)-invariant and F(p) acts on W(t/2) by the scalar multiple G_p(χ_{t/2}).
χ_{t/2} is the Legendre symbol mod p, and its Gauss sum is given by
p^{1/2}M(t/2) = G_p(χ_{t/2}) = { p^{1/2}, p ≡ 1 mod 4; p^{1/2} i, p ≡ 3 mod 4, }
implying that
M(t/2) = { 1, p ≡ 1 mod 4; i, p ≡ 3 mod 4. }
Since {e_0, χ_0, ..., χ_{t−1}} is an orthogonal basis of L^2(Z/p), the set
b(0) ∪ b(1) ∪ ... ∪ b(t/2)
is an orthonormal basis of L^2(Z/p) and the spaces W(k), 0 ≤ k ≤ t/2, are pairwise orthogonal.
Theorem 14.12 L^2(Z/p) is the orthogonal direct sum
L^2(Z/p) = Σ⊕_{k=0}^{t/2} W(k)
of F(p)-invariant subspaces W(k). The matrix of F(p) relative to the orthonormal basis
b(0) ∪ b(1) ∪ ... ∪ b(t/2)
is the matrix direct sum
p^{1/2} Σ⊕_{k=0}^{t/2} M(k),
where
M(0) = p^{-1/2} [1, t^{1/2}; t^{1/2}, −1],
M(k) = [0, (−1)^k; 1, 0], 1 ≤ k < t/2,
M(t/2) = { 1, p ≡ 1 mod 4; i, p ≡ 3 mod 4. }
We will diagonalize the matrices M(k), 0 ≤ k ≤ t/2, by orthogonal matrices in the sense given by the formulas below. Set
O⁺ = 2^{-1/2} [1, 1; 1, −1],
O⁻ = 2^{-1/2} [1, i; i, 1].
O⁺ and O⁻ are orthogonal matrices. For 1 ≤ k < t/2, direct computation shows that
O⁺ M(k) (O⁺)^{-1} = [1, 0; 0, −1], k even,
O⁻ M(k) (O⁻)^{-1} = [i, 0; 0, −i], k odd.
The orthogonal matrix diagonalization of M(0) is more difficult to write. Set
a = p^{-1/2}, b = ((p − 1)/p)^{1/2}.
We can rewrite M(0) as
M(0) = [a, b; b, −a].
Since a^2 + b^2 = 1, write a = cos θ and b = sin θ, and set
O_0 = [cos(θ/2), sin(θ/2); −sin(θ/2), cos(θ/2)].
By direct computation,
O_0 M(0) O_0^{-1} = [1, 0; 0, −1].
Form the matrix direct sum
O = Σ⊕_{k=0}^{t/2} O_k,
where, for 1 ≤ k < t/2,
O_k = { O⁺, k even; O⁻, k odd, }
and O_{t/2} = [1].
By the preceding discussion, O is an orthogonal matrix diagonalizing the matrix given in theorem 14.12. Applying O to the basis
b(0) ∪ b(1) ∪ ... ∪ b(t/2),
we construct an orthonormal basis diagonalizing F(p).
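The two 2 x 2 diagonalizations used above are easy to confirm numerically. The sketch below (ours, not from the text) takes O⁺ = 2^{-1/2}[1, 1; 1, −1] and O⁻ = 2^{-1/2}[1, i; i, 1] and conjugates the even and odd blocks M(k):

```python
# Verify the 2x2 diagonalizations of Section 14.5.
s = 2 ** -0.5
Op = [[s, s], [s, -s]]            # O+ for even k: M(k) = [0,1;1,0]
Om = [[s, s * 1j], [s * 1j, s]]   # O- for odd k:  M(k) = [0,-1;1,0]

def mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inv2(A):
    d = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]

M_even = [[0, 1], [1, 0]]
M_odd = [[0, -1], [1, 0]]

D1 = mul(mul(Op, M_even), inv2(Op))   # expect diag(1, -1)
D2 = mul(mul(Om, M_odd), inv2(Om))    # expect diag(i, -i)
```

The off-diagonal entries of D1 and D2 vanish to machine precision.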
14.6 Orthogonal Basis Diagonalizing F(p^m)
14.6.1 Orthogonal Basis of W
Suppose that m > 1. Fix a generator z for U(p^m) throughout this section. Set t = p^{m-1}(p − 1) and u = e^{2πi/t}. Observe that Û(p^m) is the disjoint union
Û(p^m) = V̂(p^m) ∪ Û(p^m, p^{m-1}).
We require the following result.
Theorem 14.13 For χ ∈ V̂(p^m) and y ∈ Û(p^m, p^{m-1}), we have χ ≠ χ* and χ ≠ y*.
Proof Write χ = χ_l and y = χ_k relative to z. Since χ ∈ V̂(p^m) and y ∈ Û(p^m, p^{m-1}), we have p ∤ l and p | k. If χ = χ*, then u^{2l} = 1, implying p | l, a contradiction. If χ = y*, then u^{l+k} = 1 and p | (l + k), which, since p | k, implies that p | l, a contradiction.
Denote by W the orthogonal complement of L(p, p^{m-1}) in L(Z/p^m). Since Û(p^m) is supported on U(p^m) and L(p, p^{m-1}) is supported on pZ/p^m,
Û(p^m) ⊂ W.
The F(p^m)-invariance of W implies that
F(p^m)Û(p^m) = {F(p^m)χ : χ ∈ Û(p^m)} ⊂ W.
Theorem 14.14 Û(p^m) ∪ F(p^m)Û(p^m, p^{m-1}) is an orthogonal basis of W. The linear span of V̂(p^m) and the linear span of
Û(p^m, p^{m-1}) ∪ F(p^m)Û(p^m, p^{m-1})
are F(p^m)-invariant.
Proof Since Û(p^m, p^{m-1}) is orthogonal, F(p^m)Û(p^m, p^{m-1}) is orthogonal. Û(p^m) and F(p^m)Û(p^m, p^{m-1}) have disjoint supports, implying that
Û(p^m) ∪ F(p^m)Û(p^m, p^{m-1})
is orthogonal. Since the order of this set is
p^{m-1}(p − 1) + p^{m-2}(p − 1) = p^m − p^{m-2},
the same as the dimension of W, the first part of the theorem is proved. By Theorem 14.9, Ln(V̂(p^m)) is F(p^m)-invariant. Since
F(N)^2 f(a) = N f(−a)
and
χ(−a) = χ(−1)χ(a), χ ∈ Û(p^m),
we have that the subspace spanned by
Û(p^m, p^{m-1}) ∪ F(p^m)Û(p^m, p^{m-1})
is F(p^m)-invariant, completing the proof of the theorem.
14.6.2 Orthogonal Diagonalizing Basis
For χ ∈ Û(p^m), denote by W(χ) the linear span of the set
b(χ) = {χ, p^{-m/2}F(p^m)χ}.
Theorem 14.15 For χ ∈ Û(p^m), b(χ) is an orthogonal basis of the F(p^m)-invariant subspace W(χ) and the matrix of F(p^m) with respect to b(χ) is
M(χ) = p^{m/2} [0, χ(−1); 1, 0].
Proof Suppose that χ ∈ V̂(p^m). Since χ ≠ χ*, the orthogonality of Û(p^m) implies b(χ) is orthogonal.
Suppose that χ ∈ Û(p^m, p^{m-1}). The support of χ is disjoint from the support of F(p^m)χ and again b(χ) is orthogonal.
The proof follows from F^2(p^m)χ = p^m χ(−1)χ.
Theorem 14.16 For distinct χ, y ∈ Û(p^m), W(χ) and W(y) are orthogonal unless χ ∈ V̂(p^m) and y = χ*.
Proof Since χ and y are orthogonal, F(p^m)χ and F(p^m)y are orthogonal. Suppose that χ ∈ V̂(p^m). F(p^m)χ is a multiple of χ*, which implies that y is orthogonal to F(p^m)χ unless y = χ*. If y ∈ V̂(p^m), then, as just argued, χ is orthogonal to F(p^m)y unless y = χ*. If y ∈ Û(p^m, p^{m-1}), then χ is orthogonal to F(p^m)y since their supports are disjoint. The theorem holds for χ ∈ V̂(p^m).
Suppose that χ, y ∈ Û(p^m, p^{m-1}). Then χ is orthogonal to F(p^m)y and y is orthogonal to F(p^m)χ since these functions have disjoint supports, completing the proof.
Since χ ∈ V̂(p^m) implies that χ ≠ χ*, we can select a subset V̂⁺(p^m) of V̂(p^m) such that:
• χ ∈ V̂⁺(p^m) implies that χ* ∉ V̂⁺(p^m).
• χ ∈ V̂(p^m) implies that χ ∈ V̂⁺(p^m) or χ* ∈ V̂⁺(p^m).
Set
B⁺ = V̂⁺(p^m) ∪ Û(p^m, p^{m-1}).
Since the order of Û(p^m, p^{m-1}) is p^{m-2}(p − 1) and the order of V̂⁺(p^m) is p^{m-2}(p − 1)^2/2, we have that the order of B⁺ is (p^m − p^{m-2})/2. The dimension of the linear span of ∪_{χ∈B⁺} b(χ) is p^m − p^{m-2}, which is the same as the dimension of W, proving the next result.
Theorem 14.17 W is the orthogonal direct sum
W = Σ⊕_{χ∈B⁺} W(χ),
and the matrix of F(p^m) with respect to the basis ∪_{χ∈B⁺} b(χ) is the matrix direct sum
p^{m/2} Σ⊕_{χ∈B⁺} [0, χ(−1); 1, 0].
We can argue as before to find an orthonormal basis diagonalizing the restriction of F(p^m) to W. Since the restriction of F(p^m) to L(p, p^{m-1}) has a matrix representation pF(p^{m-2}), we can use an induction argument to find an orthonormal basis of L(Z/p^m) diagonalizing F(p^m) [5].
References
[1] Tolimieri, R. "Multiplicative Characters and the Discrete Fourier Transform", Adv. in Appl. Math., 7, 1986, pp. 344-380.
[2] Auslander, L., Feig, E. and Winograd, S. "The Multiplicative Complexity of the Discrete Fourier Transform", Adv. in Appl. Math., 5, 1984, pp. 31-55.
[3] Rader, C. "Discrete Fourier Transforms When the Number of Data Samples is Prime", Proc. IEEE, 56, 1968, pp. 1107-1108.
[4] Winograd, S. Arithmetic Complexity of Computations, CBMS Regional Conf. Ser. in Math., Vol. 33, Soc. Indus. Appl. Math., Philadelphia, 1980.
[5] Tolimieri, R. "The Construction of Orthogonal Basis Diagonalizing the Discrete Fourier Transform", Adv. in Appl. Math., 5, 1984, pp. 56-86.
Problems
1. Find a generator z for the unit group U of Z/11, and order U exponentially relative to z.
2. Define the ten multiplicative characters of Z/11.
3. Find a generator z for the unit group U of Z/13, and order U exponentially relative to z.
4. Define the 12 multiplicative characters of Z/13.
5. Find a generator z for the unit group U of Z/3^2, and order U exponentially relative to z. Define the six multiplicative characters of Z/3^2.
6. Find a generator z for the unit group U of Z/5^2, and order U exponentially relative to z. Define the 20 multiplicative characters of Z/5^2.
7. Determine the Gauss sums G_k = G_p(χ_k), k = 1, 2, 3, 4, 5, for p = 7.
15 Rationality
Multiplicative character theory provides a natural setting for developing the complexity results of Auslander, Feig and Winograd [1]. The first reason for this is the simplicity of the formulas describing the action of FT on multiplicative characters. We will now discuss a second important property of multiplicative characters. In a sense defined below, the spaces spanned by certain subsets of multiplicative characters are rational subspaces. As a consequence, we will be able to rationally manipulate the FT matrix F(pm) into block diagonal matrices where each block action corresponds to some polynomial multiplication modulo a rational polynomial of a special kind. This is the main result in the work of Auslander, Feig and Winograd. Details from the point of view of multiplicative character theory can be found in [2].
Although these results proceed in a straightforward fashion, the notation becomes complicated. After some preliminary general definitions we will derive some examples in detail. A function f ∈ L(Z/N) is called a rational vector if the standard representation of f is an N-tuple of rational numbers. A subspace X of L(Z/N) is called a rational subspace if X has a basis consisting solely of rational vectors. Such a basis is called a rational basis of X.
Throughout this chapter we will use the following notation. For a vector a of size N, skew-diag(a) denotes the N x N matrix having the components a_0, a_1, ..., a_{N−1} of a along the antidiagonal and zeros elsewhere:

skew-diag(a) =
[ 0    ...  0    a_{N−1} ]
[ ...       a_1  0       ]
[ a_0  0    ...  0       ]
15.1 An Example: 7
Set u = e^{2πi/6}. Relative to the generator 3 of U(7), the set of primitive multiplicative characters
V̂(7) = {χ_1, χ_2, χ_3, χ_4, χ_5}
is given by the following table:

      0  1  3     2     6     4     5
χ_1   0  1  u     u^2   u^3   u^4   u^5
χ_2   0  1  u^2   u^4   u^6   u^8   u^10
χ_3   0  1  u^3   u^6   u^9   u^12  u^15
χ_4   0  1  u^4   u^8   u^12  u^16  u^20
χ_5   0  1  u^5   u^10  u^15  u^20  u^25
A rational basis will be constructed for Ln(V̂(7)). Define the set of functions
R(7) = {r_k : 0 ≤ k < 5}
by
r_k = e_{3^k} − e_{3^5}.    (15.1)
These functions are given by the following table:

      0  1  3  2  6  4  5
r_0   0  1  0  0  0  0  −1
r_1   0  0  1  0  0  0  −1
r_2   0  0  0  1  0  0  −1
r_3   0  0  0  0  1  0  −1
r_4   0  0  0  0  0  1  −1

At the point 0, each of these functions takes on the value 0. We will now show that, for each χ ∈ V̂(7),
χ = Σ_{j=0}^{4} χ(3^j) r_j.    (15.2)
By definition, the formula holds at the points
1, 3, 2, 6, 4.
At the point 5 ≡ 3^5 mod 7, we have
Σ_{j=0}^{4} χ(3^j) r_j(5) = − Σ_{j=0}^{4} χ(3^j).
Since χ ≠ χ_0,
Σ_{j=0}^{5} χ(3^j) = 0,
implying that (15.2) holds at the point 5. Using (15.2) and χ_k(3^j) = u^{kj}, we have
V̂(7) = [χ_1; χ_2; χ_3; χ_4; χ_5] = X(7) [r_0; r_1; r_2; r_3; r_4] = X(7)R(7),
where

X(7) =
[ 1  u    u^2   u^3   u^4  ]
[ 1  u^2  u^4   u^6   u^8  ]
[ 1  u^3  u^6   u^9   u^12 ]
[ 1  u^4  u^8   u^12  u^16 ]
[ 1  u^5  u^10  u^15  u^20 ]

The matrix X(7) is a Vandermonde matrix and is nonsingular. It follows that R(7) is a rational basis of Ln(V̂(7)) and X(7) is the change of basis matrix.
Consider the restriction of F(7) to Ln(V̂(7)). Since
F(7)χ_j = G_7(χ_j)χ_j*, 1 ≤ j ≤ 5,
the matrix of F(7) with respect to V̂(7) is the skew-diagonal matrix
G(7) = skew-diag(G_7(χ_1), G_7(χ_2), G_7(χ_3), G_7(χ_4), G_7(χ_5)),
and the matrix of F(7) with respect to R(7) is
X(7)G(7)X(7)^{-1}.
Completing R(7) to the rational basis
e_0, e_0 + χ_0; R(7)
of L(Z/7), the matrix of F(7) relative to this rational basis is the matrix direct sum
[0, 7; 1, 0] ⊕ [X(7)G(7)X(7)^{-1}].    (15.3)
Two matrices A and B are called rationally related if
A = Q_1 B Q_2,
where Q_1 and Q_2 are rational nonsingular matrices. In classical complexity theory, rational multiplications are free and rationally related matrices have the same multiplicative complexity. In particular, F(7) is rationally related to (15.3) and the multiplicative complexity of F(7) is equal to the multiplicative complexity of X(7)G(7)X(7)^{-1}.
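The expansion (15.2) underlying this rational basis can be verified numerically. The Python sketch below (ours, not from the text) builds the characters of Z/7 from the generator 3 and checks χ_k = Σ_j χ_k(3^j) r_j:

```python
import cmath

# Check the expansion (15.2) for p = 7, z = 3: each primitive character
# chi_k equals sum_j chi_k(3^j) r_j, with r_j = e_{3^j} - e_{3^5}.

p, z = 7, 3
t = p - 1
u = cmath.exp(2j * cmath.pi / t)
log = {pow(z, j, p): j for j in range(t)}

def chi(k):
    return [0.0 if a == 0 else u ** (k * log[a]) for a in range(p)]

def r(j):                              # rational basis function r_j
    v = [0.0] * p
    v[pow(z, j, p)] += 1.0
    v[pow(z, t - 1, p)] -= 1.0
    return v

for k in range(1, t):
    x = chi(k)
    rhs = [sum(u ** (k * j) * r(j)[a] for j in range(t - 1))
           for a in range(p)]
    assert all(abs(x[a] - rhs[a]) < 1e-9 for a in range(p))
print("ok")
```

Note that each r_j is a rational (in fact integer) vector, while the coefficients χ_k(3^j) = u^{kj} fill out the Vandermonde matrix X(7).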
15.2 Prime Case
The methods used in the example extend in a straightforward manner to any prime p. Take a generator z of U(p). Set t = p − 1 and u = e^{2πi/t}. Consider the set of primitive multiplicative characters
V̂(p) = {χ_k : 1 ≤ k < t}.
Define
R(p) = {r_k : 0 ≤ k < t − 1}
by setting
r_k = e_{z^k} − e_{z^{t−1}}, 0 ≤ k < t − 1.
Theorem 15.1 R(p) is a rational basis of Ln(V̂(p)).
Proof We claim that, for χ ∈ V̂(p),
χ = Σ_{l=0}^{t−2} χ(z^l) r_l.
The expansion holds at z^k, 0 ≤ k < t − 1, by definition. We must show that
χ(z^{t−1}) = Σ_{l=0}^{t−2} χ(z^l) r_l(z^{t−1}) = − Σ_{l=0}^{t−2} χ(z^l).
Since
Σ_{l=0}^{t−1} χ(z^l) = 0,
the claim is proved. Placing χ = χ_k, 1 ≤ k < t, in the expansion, we have
χ_k = Σ_{l=0}^{t−2} u^{kl} r_l,
which in matrix form can be written as
V̂(p) = X(p)R(p),
where

X(p) =
[ 1  u        ...  u^{t−2}        ]
[ 1  u^2      ...  u^{2(t−2)}     ]
[ ...                             ]
[ 1  u^{t−1}  ...  u^{(t−1)(t−2)} ]

Since the Vandermonde matrix X(p) is nonsingular, R(p) is a rational basis of Ln(V̂(p)).
The matrix X(p) is the change of basis matrix between the rational basis R(p) and V̂(p). Since
G(p) = skew-diag[G_p(χ_k)]_{1≤k<t}
is the matrix of F(p) with respect to V̂(p), we have the next result.
Theorem 15.2 The matrix of F(p) with respect to R(p) is X(p)G(p)X(p)^{-1}.
Completing R(p) to the rational basis
R = {e_0, e_0 + χ_0, R(p)},
we have, from the results of section 14.3, the next result.
Theorem 15.3 The matrix of F(p) relative to the rational basis R is the matrix direct sum
[0, p; 1, 0] ⊕ [X(p)G(p)X(p)^{-1}].
It follows that F(p) is rationally related to the matrix in Theorem 15.3 and has multiplicative complexity equal to the multiplicative complexity of X(p)G(p)X(p)^{-1}.
A linear isomorphism P of L(Z/N) is called a permutation if, relative to the standard basis, P is a permutation matrix. The skew-diagonal matrix G(p) can be replaced by a diagonal matrix by introducing the permutation transformation P defined by
P(e_0) = e_0,
P(e_{z^k}) = e_{z^{−k}}, 0 ≤ k < t = p − 1.
Since
P(χ_k) = χ_k*, 0 ≤ k < t,
we have the next result.
Theorem 15.4 The matrix of PF(p) relative to the rational basis R is the matrix direct sum
[0, p; 1, 0] ⊕ [X(p)D(p)X(p)^{-1}],
where D(p) = diag[G_p(χ_k)]_{1≤k<t}.
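The diagonalizing effect of the reversal permutation can be checked directly. The Python sketch below (ours, not from the text) verifies PF(p)χ_k = G_p(χ_k)χ_k for p = 5:

```python
import cmath

# Check that the reversal permutation P(e_{z^k}) = e_{z^{-k}} turns the
# skew-diagonal action of F(p) into a diagonal one: PF(p)chi_k = G_p(chi_k)chi_k.

p, z = 5, 2
t = p - 1
log = {pow(z, j, p): j for j in range(t)}

def chi(k):
    return [0.0 if a == 0 else
            cmath.exp(2j * cmath.pi * k * log[a] / t) for a in range(p)]

def F(f):
    return [sum(f[b] * cmath.exp(2j * cmath.pi * a * b / p)
                for b in range(p)) for a in range(p)]

def P(f):                     # (P f)(a) = f(a^{-1}) on units, fixes 0
    g = [f[0]] + [0.0] * (p - 1)
    for a in range(1, p):
        g[pow(a, -1, p)] = f[a]
    return g

for k in range(1, t):
    x = chi(k)
    G = F(x)[1]               # Gauss sum
    PFx = P(F(x))
    assert all(abs(PFx[a] - G * x[a]) < 1e-9 for a in range(p))
print("ok")
```

This uses Python's modular inverse `pow(a, -1, p)` (available from Python 3.8) to realize z^k ↦ z^{−k}.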
15.3 An Example: 3^2
Set u = e^{2πi/6}. Relative to the generator 2 of U(9), consider the set of primitive multiplicative characters
V̂(9) = {χ_1, χ_2, χ_4, χ_5}.
Define the set of functions
R(9) = {r_0, r_1, r_2, r_3}
by
r_0 = e_1 − e_7,
r_1 = e_2 − e_5,
r_2 = e_4 − e_7,
r_3 = e_8 − e_5.
We claim that
χ_k = Σ_{l=0}^{3} u^{lk} r_l, k = 1, 2, 4, 5.    (15.4)
By definition, (15.4) holds at the points
0, 3, 6, 1, 2, 4, 8.
At the point 7 ≡ 2^4 mod 9,
χ_k(7) = u^{4k},
Σ_{l=0}^{3} u^{lk} r_l(7) = −1 − u^{2k}, k = 1, 2, 4, 5.
Since u^{2k} ≠ 1,
1 + u^{2k} + u^{4k} = 0, k = 1, 2, 4, 5,
implying that (15.4) holds at the point 7. The same argument shows that (15.4) holds at the point 5, completing the proof of the claim.
Set

X(9) =
[ 1  u    u^2   u^3  ]
[ 1  u^2  u^4   u^6  ]
[ 1  u^4  u^8   u^12 ]
[ 1  u^5  u^10  u^15 ]

Then
V̂(9) = [χ_1; χ_2; χ_4; χ_5] = X(9) [r_0; r_1; r_2; r_3] = X(9)R(9).
Since X(9) is nonsingular, R(9) is a rational basis of Ln(V̂(9)). Complete R(9) to the rational basis of L(Z/9)
R = {f, χ_0, g_{χ_0}, χ_3, g_{χ_3}, R(9)},
where f = e_0 + e_3 + e_6. These functions are given by the following table:

          0   3   6   1   2   4   8   7   5
f         1   1   1   0   0   0   0   0   0
χ_0       0   0   0   1   1   1   1   1   1
g_{χ_0}  −2   1   1   0   0   0   0   0   0
χ_3       0   0   0   1  −1   1  −1   1  −1
g_{χ_3}   0   1  −1   0   0   0   0   0   0
Since the matrix of F(9) with respect to V̂(9) is the skew-diagonal matrix
G(9) = skew-diag(G_9(χ_1), G_9(χ_2), G_9(χ_4), G_9(χ_5)),
the matrix of F(9) with respect to R(9) is X(9)G(9)X(9)^{-1}. Since F(9)f = 3f, the results of section 14.4 prove that the matrix of F(9) relative to the rational basis R is the matrix direct sum
[3] ⊕ (−3)[0, 1; 1, 0] ⊕ [0, G_3(χ_1^{(3)}); 3G_3(χ_1^{(3)}), 0] ⊕ [X(9)G(9)X(9)^{-1}].
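The claim (15.4) and the relation V̂(9) = X(9)R(9) can be checked numerically. The Python sketch below (ours, not from the text) reconstructs each primitive character of Z/9 from the rational functions r_l:

```python
import cmath

# Check (15.4) for Z/9: chi_k = sum_l u^{lk} r_l for k = 1, 2, 4, 5,
# with r_0 = e_1 - e_7, r_1 = e_2 - e_5, r_2 = e_4 - e_7, r_3 = e_8 - e_5
# and u = e^{2 pi i / 6}, generator z = 2.

u = cmath.exp(2j * cmath.pi / 6)
log = {pow(2, j, 9): j for j in range(6)}    # discrete log base 2 in U(9)

def chi(k):
    return [0.0 if a % 3 == 0 else u ** (k * log[a]) for a in range(9)]

def e(a):
    v = [0.0] * 9
    v[a] = 1.0
    return v

def sub(f, g):
    return [x - y for x, y in zip(f, g)]

R = [sub(e(1), e(7)), sub(e(2), e(5)), sub(e(4), e(7)), sub(e(8), e(5))]

for k in (1, 2, 4, 5):
    rhs = [sum(u ** (l * k) * R[l][a] for l in range(4)) for a in range(9)]
    assert all(abs(chi(k)[a] - rhs[a]) < 1e-9 for a in range(9))
print("ok")
```

The four functions in R are integer vectors, so R(9) is indeed a rational basis of the span of V̂(9).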
15.4 Transform Size: p^2
Set t = p(p − 1) and u = e^{2πi/t}. Fix a generator z of U(p^2). Throughout this section, all constructions will be based on the generator z and on the generator z mod p of U(p).
Denote the multiplicative characters on Z/p^2 by χ_k, 0 ≤ k < t, and the multiplicative characters on Z/p by y_k, 0 ≤ k < p − 1.
Consider the set of primitive multiplicative characters
V̂(p^2) = {χ_k : 0 ≤ k < t and (p, k) = 1}.
The dimension of Ln(V̂(p^2)) is s = (p − 1)^2. Define the collection of functions
R(p^2) = {r_k^{(2)} : 0 ≤ k < s}
by setting, for 0 ≤ l < s,
r_l^{(2)} = e_{z^l} − e_{z^{s+l′}}, 0 ≤ l′ < p − 1, l′ ≡ l mod (p − 1).
Theorem 15.5 R(p^2) is a rational basis for Ln(V̂(p^2)).
Proof We claim that, for all χ ∈ V̂(p^2),
χ = Σ_{l=0}^{s−1} χ(z^l) r_l^{(2)}.
Since both sides vanish off of U(p^2), we must show that the expansion holds at all points of U(p^2). The expansion holds at z^l, 0 ≤ l < s, by definition. At z^s we must show that
χ(z^s) = Σ_{l=0}^{s−1} χ(z^l) r_l^{(2)}(z^s) = − Σ_{j=0}^{p−2} χ(z^{j(p−1)}),
which is the case since χ(z^{p−1}) is a nontrivial p-th root of unity. The same argument shows that the expansion holds at all points in U(p^2), proving the claim.
Placing χ = χ_k, 0 ≤ k < t and (p, k) = 1, in the expansion, we have
χ_k = Σ_{l=0}^{s−1} u^{kl} r_l^{(2)},
which in matrix form can be written as
V̂(p^2) = X(p^2)R(p^2),
where
X(p^2) = [u^{kl}]_{0≤k<t, (p,k)=1, 0≤l<s}.
Since the Vandermonde matrix X(p^2) is nonsingular, the proof is complete.
The matrix X(p^2) is the change of basis matrix between the rational basis R(p^2) and V̂(p^2). Since
G(p^2) = skew-diag[G_{p^2}(χ_k)]_{0≤k<t, (p,k)=1}
is the matrix of F(p^2) with respect to the basis V̂(p^2), we have the next result.
Theorem 15.6 The matrix of F(p^2) with respect to R(p^2) is
X(p^2)G(p^2)X(p^2)^{-1}.
Consider
V̂(p^2, p) = {χ_{pk} : 1 ≤ k < p − 1}.
The periodization map π*(p^2, p) bijectively maps V̂(p) onto V̂(p^2, p). Consider the rational basis R(p) of Ln(V̂(p)) and define
R(p^2, p) = π*(p^2, p)R(p).
Setting R(p) = {r_k : 0 ≤ k < p − 2} as defined in (15.1), we have
R(p^2, p) = {r_k^{(1)} : 0 ≤ k < p − 2},
where r_k^{(1)} = π*(p^2, p)r_k. The functions r_k^{(1)}, 0 ≤ k < p − 2, vanish off of U(p^2), are pZ/p^2-periodic, and are defined by the values given in the following table:

               1  z  ...  z^{p−3}  z^{p−2}
r_0^{(1)}      1  0  ...  0        −1
r_1^{(1)}      0  1  ...  0        −1
...
r_{p−3}^{(1)}  0  0  ...  1        −1
Since V̂(p) = X(p)R(p), we have
V̂(p^2, p) = X(p)R(p^2, p),
proving the next result.
Theorem 15.7 R(p^2, p) is a rational basis of Ln(V̂(p^2, p)) with X(p) the change of basis matrix between R(p^2, p) and V̂(p^2, p).
The decimation map σ*(p^2, p) maps L(Z/p) isomorphically onto the space of pZ/p^2-decimated functions in L(Z/p^2). Define
FV(p^2, p) = σ*(p^2, p)V̂(p)
and
S(p^2, p) = σ*(p^2, p)R(p).
Then
FV(p^2, p) = {g_{pk} : 1 ≤ k < p − 1},
where g_{pk} is pZ/p^2-decimated,
g_{pk}(pa) = χ_{pk}(a), a ∈ U(p^2),
and
S(p^2, p) = {s_k^{(1)} : 0 ≤ k < p − 2},
where
s_k^{(1)} = e_{pz^k} − e_{pz^{p−2}}, 0 ≤ k < p − 2.
Since V̂(p) = X(p)R(p), we have
FV(p^2, p) = X(p)S(p^2, p),
proving the next result.
Theorem 15.8 S(p^2, p) is a rational basis of Ln(FV(p^2, p)) with X(p) the change of basis matrix between S(p^2, p) and FV(p^2, p).
The set
W(p^2, p) = V̂(p^2, p) ∪ FV(p^2, p)
is a basis of Ln(W(p^2, p)). Since
F(p^2)χ_{pk} = p G_p(y_k) g_{pk}*,
F(p^2)g_{pk} = G_p(y_k) χ_{pk}*,
where χ_{pk}* = π*(p^2, p)y_k* and g_{pk}* = σ*(p^2, p)y_k*, we have that Ln(W(p^2, p)) is F(p^2)-invariant and the matrix of F(p^2) with respect to W(p^2, p) is
G(p^2, p) = [0, G(p); pG(p), 0].
Theorem 15.9 R(p^2, p) ∪ S(p^2, p) is a rational basis of the F(p^2)-invariant subspace Ln(W(p^2, p)), and the matrix of F(p^2) with respect to this rational basis is
X(p^2, p) G(p^2, p) X(p^2, p)^{-1},
where X(p^2, p) = X(p) ⊕ X(p).
Completing R(p^2, p) ∪ S(p^2, p) ∪ R(p^2) to the rational basis
R = {f, χ_0, g_{χ_0}, R(p^2, p), S(p^2, p), R(p^2)}
of L(Z/p^2), where f = Σ_{l=0}^{p−1} e_{lp}, we have the next result.
Theorem 15.10 The matrix of F(p^2) with respect to the rational basis R is the matrix direct sum
[p] ⊕ (−p)[0, 1; 1, 0] ⊕ [X(p^2, p)G(p^2, p)X(p^2, p)^{-1}] ⊕ [X(p^2)G(p^2)X(p^2)^{-1}],
where
X(p^2, p) = X(p) ⊕ X(p).
Define the permutation matrix P by the formulas
P(e_0) = e_0,
P(e_{z^k}) = e_{z^{−k}}, 0 ≤ k < t,
P(e_{pz^k}) = e_{pz^{−k}}, 0 ≤ k < p − 1.
For 0 ≤ k < t and (p, k) = 1, we have
PF(p^2)χ_k = G_{p^2}(χ_k)χ_k.
For 1 ≤ k < p − 1,
PF(p^2)χ_{pk} = p G_p(y_k) g_{pk},
PF(p^2)g_{pk} = G_p(y_k) χ_{pk}.
Also, P(χ_0) = χ_0, P(g_{χ_0}) = g_{χ_0}, and P(f) = f, proving the next result.
Theorem 15.11 The matrix of PF(p^2) relative to the rational basis R is the matrix direct sum
[p] ⊕ (−p)[0, 1; 1, 0] ⊕ [X(p^2, p)D(p^2, p)X^{-1}(p^2, p)] ⊕ [X(p^2)D(p^2)X(p^2)^{-1}],
where
D(p^2, p) = [0, D(p); pD(p), 0],
D(p^2) = diag(G_{p^2}(χ_k))_{0≤k<t, (p,k)=1}.
15.5 Exponential Basis
The results of this chapter can be referred to the exponential basis. First take z as a generator of U(p) and set
Ex(p) = {e_{z^k} : 0 ≤ k < p − 1}
and
L(p) = [I_{p−2}  −1_{p−2}],
where 1_{p−2} denotes the column vector of p − 2 ones. Then
R(p) = L(p)Ex(p).
For a generator z of U(p^2), set
Ex(p^2) = {e_{z^k} : 0 ≤ k < t}, t = p(p − 1),
and
L(p^2) = [I_s  −1_{p−1} ⊗ I_{p−1}], s = (p − 1)^2.
Then
R(p^2) = L(p^2)Ex(p^2)
and
R(p^2, p) = L(p) π*(p^2, p)Ex(p).
15.6 Polynomial Product Modulo a Polynomial
Choose distinct complex numbers u_k, 1 ≤ k ≤ n, and form the polynomial
g(x) = Π_{k=1}^{n} (x − u_k).
The quotient polynomial ring C[x]/g(x) is an n-dimensional vector space with basis
{x^l : 0 ≤ l < n}.
For each φ ∈ C[x]/g(x), define the linear transformation τ(φ) of C[x]/g(x) by
τ(φ)ψ = φψ, ψ ∈ C[x]/g(x).
By the polynomial CRT there exists a basis of idempotents {E_k : 1 ≤ k ≤ n} for C[x]/g(x) satisfying the following two properties:
• For φ ∈ C[x]/g(x), we can write
φ = Σ_{k=1}^{n} φ(u_k)E_k.
• For φ, ψ ∈ C[x]/g(x), we have
φψ = Σ_{k=1}^{n} φ(u_k)ψ(u_k)E_k.
These two properties imply the next result.
Theorem 15.12 For φ ∈ C[x]/g(x), the matrix of τ(φ) relative to the idempotent basis is the diagonal matrix
diag(φ(u_k))_{1≤k≤n}.
Since x^l = Σ_{k=1}^{n} u_k^l E_k, we have
[1; x; ...; x^{n−1}] = Z [E_1; E_2; ...; E_n],
with

Z =
[ 1          1          ...  1          ]
[ u_1        u_2        ...  u_n        ]
[ ...                                   ]
[ u_1^{n−1}  u_2^{n−1}  ...  u_n^{n−1}  ]

the change of basis matrix.
Theorem 15.13 For φ ∈ C[x]/g(x), the matrix of τ(φ) relative to the basis {x^l : 0 ≤ l < n} is
Z^{-1} diag(φ(u_k))_{1≤k≤n} Z.
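The idempotents E_k are just the Lagrange interpolation polynomials at the roots u_k. The Python sketch below (ours, not from the text) constructs them for a concrete g(x) and checks the defining property E_k(u_j) = δ_{jk}:

```python
import cmath

# Idempotent basis of Theorem 15.12 for g(x) = prod (x - u_k) with
# u_k = e^{2 pi i k/5}, k = 1..4 (so g = 1 + x + x^2 + x^3 + x^4).
# E_k is the Lagrange polynomial taking 1 at u_k and 0 at the other roots.

roots = [cmath.exp(2j * cmath.pi * k / 5) for k in range(1, 5)]

def polymul(a, b):
    out = [0j] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def evalp(c, x):
    return sum(ck * x ** k for k, ck in enumerate(c))

def idempotent(k):
    c = [1 + 0j]
    for j, r in enumerate(roots):
        if j != k:
            c = polymul(c, [-r, 1 + 0j])   # multiply in the factor (x - u_j)
    d = evalp(c, roots[k])
    return [ck / d for ck in c]            # normalize so E_k(u_k) = 1

E = [idempotent(k) for k in range(4)]
for j in range(4):
    for k in range(4):
        v = evalp(E[k], roots[j])
        assert abs(v - (1.0 if j == k else 0.0)) < 1e-9
print("ok")
```

Since a residue mod g is determined by its values at the roots, this property yields both bullet points above, and hence the diagonal matrix of Theorem 15.12.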
The preceding theorems will be used to interpret X(p)D(p)X(p)^{-1} and X(p^2)D(p^2)X(p^2)^{-1} as polynomial multiplications in a quotient polynomial ring.
Fix a generator z of U(p) and denote by χ_l, 0 ≤ l < p − 1, the multiplicative characters of Z/p with respect to z. Set u = e^{2πi/(p−1)}. Take u_k = u^k, 1 ≤ k ≤ p − 2, in the discussion above. Then
g_1(x) = Π_{k=1}^{p−2}(x − u^k) = 1 + x + ... + x^{p−2}
and
Z = X(p)^t.
Define the Gauss polynomial φ_p by
φ_p(x) = Σ_{k=0}^{p−2} w^{z^k} x^k, w = e^{2πi/p}.
Since φ_p(u^k) = G_p(χ_k), 1 ≤ k ≤ p − 2, we have the next result.
Theorem 15.14 The matrix of the linear transformation τ(φ_p) of C[x]/g_1(x) relative to the basis {x^l : 0 ≤ l < p − 2} is
(X(p)D(p)X(p)^{-1})^t.
Fix a generator z of U(p^2) and denote by χ_l, 0 ≤ l < t = p(p − 1), the multiplicative characters of Z/p^2 with respect to z. Set u = e^{2πi/t} and take
g_2(x) = Π_{0≤k<t, (p,k)=1} (x − u^k)
in the discussion above. Then
Z = X(p^2)^t.
Define the Gauss polynomial
φ_{p^2}(x) = Σ_{k=0}^{t−1} w^{z^k} x^k, w = e^{2πi/p^2}.
Since φ_{p^2}(u^k) = G_{p^2}(χ_k), 0 ≤ k < t, (p, k) = 1, we have the next result.
Theorem 15.15 The matrix of the linear transformation τ(φ_{p^2}) of C[x]/g_2(x) relative to the basis {x^l : 0 ≤ l < s}, s = (p − 1)^2, is
(X(p^2)D(p^2)X(p^2)^{-1})^t.
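The interpolation property behind Theorems 15.14 and 15.15 — that the Gauss polynomial takes the Gauss sums as its values at the points u^k — can be confirmed numerically. The Python sketch below (ours, not from the text) checks it for p = 5:

```python
import cmath

# Check that the Gauss polynomial phi_p interpolates the Gauss sums:
# phi_p(u^k) = G_p(chi_k) = F(p)chi_k(1), for p = 5, z = 2.

p, z = 5, 2
t = p - 1
u = cmath.exp(2j * cmath.pi / t)
w = cmath.exp(2j * cmath.pi / p)
log = {pow(z, j, p): j for j in range(t)}

def gauss_sum(k):                 # G_p(chi_k) = sum_{a in U(p)} chi_k(a) w^a
    return sum(u ** (k * log[a]) * w ** a for a in range(1, p))

def phi(x):                       # phi_p(x) = sum_j w^{z^j} x^j
    return sum(w ** pow(z, j, p) * x ** j for j in range(t))

for k in range(1, t):
    assert abs(phi(u ** k) - gauss_sum(k)) < 1e-9
print("ok")
```

Reindexing the Gauss sum by a = z^j makes the agreement visible term by term, which is what allows multiplication by the Gauss polynomial mod g_1 to represent the diagonal D(p).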
15.7 An Example: 3^3
Set u = e^{2πi/18}. Take 2 as a generator of U(27). Consider the set of primitive multiplicative characters
V̂(27) = {χ_k : 0 ≤ k < 18, (3, k) = 1}.
Set
Ex(27) = {e_{2^k} : 0 ≤ k < 18}.
A rational basis
R(27) = {r_l : 0 ≤ l < 12}
for Ln(V̂(27)) is defined by the matrix formula
R(27) = L(27)Ex(27),
where
L(27) = [I_12  −1_2 ⊗ I_6].
Direct computation shows that
V̂(27) = X(27)R(27),
where
X(27) = [u^{kl}]_{0≤k<18, (3,k)=1, 0≤l<12}.
Consider the sets
V̂(27, 9) = {χ_3, χ_6, χ_12, χ_15},
FV(27, 9) = {g_{χ_3}, g_{χ_6}, g_{χ_12}, g_{χ_15}}.
To be explicit, we repeat the tables describing V̂(27, 9) and FV(27, 9), writing v = e^{2πi/6}:

          1   2     4     8     16    5
χ_3       1   v     v^2   v^3   v^4   v^5
χ_6       1   v^2   v^4   v^6   v^8   v^10
χ_12      1   v^4   v^8   v^12  v^16  v^20
χ_15      1   v^5   v^10  v^15  v^20  v^25

          3   6     12    24    21    15
g_{χ_3}   1   v     v^2   v^3   v^4   v^5
g_{χ_6}   1   v^2   v^4   v^6   v^8   v^10
g_{χ_12}  1   v^4   v^8   v^12  v^16  v^20
g_{χ_15}  1   v^5   v^10  v^15  v^20  v^25

We see from the first table that V̂(27, 9) is constructed by periodizing V̂(9) mod 9 in Z/27, and from the second table that FV(27, 9) is constructed by decimating V̂(9) to 3U(27) in Z/27. As a result, the set R(27, 9), formed by periodizing R(9) mod 9, is a rational basis of Ln(V̂(27, 9)). The set S(27, 9), formed by decimating R(9) to 3U(27), is a rational basis of Ln(FV(27, 9)). Then
V̂(27, 9) = X(9)R(27, 9),
FV(27, 9) = X(9)S(27, 9).
Form the rational basis of L(Z/27),
R = {f_0, f_1, f_2, χ_0, g_{χ_0}, χ_9, g_{χ_9}, R(27, 9), S(27, 9), R(27)},
where
f_0 = e_0 + e_9 + e_18,
f_1 = e_3 + e_12 + e_21,
f_2 = e_6 + e_15 + e_24.
Theorem 15.16 The matrix of F(27) relative to the rational basis R is given by
3F(3) ⊕ (−3)[0, 1; 3, 0] ⊕ [0, G_3(χ_1^{(3)}); 9G_3(χ_1^{(3)}), 0] ⊕ [X(27, 9)G(27, 9)X(27, 9)^{-1}] ⊕ [X(27)G(27)X(27)^{-1}],
where
X(27, 9) = X(9) ⊕ X(9),
G(27, 9) = [0, G(9); 3G(9), 0].
References
[1] Auslander, L., Feig, E. and Winograd, S. "The Multiplicative Complexity of the Discrete Fourier Transform", Adv. in Appl. Math., 5, 1984, pp. 31-55.
[2] Tolimieri, R. "Multiplicative Characters and the Discrete Fourier Transform", Adv. in Appl. Math., 7, 1986, pp. 344-380.
[3] Rader, C. "Discrete Fourier Transforms When the Number of Data Samples is Prime", Proc. IEEE, 56, 1968, pp. 1107-1108.
[4] Winograd, S. Arithmetic Complexity of Computations, CBMS Regional Conf. Ser. in Math., Vol. 33, Soc. Indus. Appl. Math., Philadelphia, 1980.
[5] Tolimieri, R. "The Construction of Orthogonal Basis Diagonalizing the Discrete Fourier Transform", Adv. in Appl. Math., 5, 1984, pp. 56-86.
Index
Additive algorithms, 147 character, 219 stage, 160
Agarwal-Cooley algorithm, 84 algebra, 19 auto-sort
algorithm, 80, 86
Basis
  canonical, 54
  of idempotents, 260
  orthonormal, 218
  rational, 249
  standard, 30, 218
bilinear, 28
  algorithms, 121
binary bit representation, 74
bit-reversal, 73, 74
block
  diagonal, 170
  diagonalization method, 164
  factor, 197
  factors, 189
Canonical basis, 54
character
  additive, 219
  multiplicative, 229
    group, 231
characteristic
  field, 25
Chinese remainder theorem, 1
circulant matrix, 105
common divisor, 2
  greatest, 2, 16
  polynomial, 14
commutation theorem, 1, 39
complete systems of idempotents, 1
congruent, 6
conjugate, 56
constant polynomial, 14
convolution
  cyclic, 103
  linear, 101
  theorem, 101, 107
  two-dimensional, 142
Cooley-Tukey radix-2 FFT algorithm, 76
core
  Winograd, 204, 207, 212
cyclic convolution, 103, 138, 139
  two-dimensional, 138
  two-dimensional, matrix description, 140
cyclic shift matrix, 105
cyclotomic polynomials, 129
Data transposition, 83
decimated, 222
decimation
  in frequency, 76
  in time, 76
  map, 237
degree, 13
diagonal
  block, 170
  factor, 197
  factors, 189
diagonalization method
  block, 164
  partial, 164
direct product
  group, 11
  ring, 7, 21
divide, 2, 14
divisibility condition, 2
divisor
  common, 2
  zero, 174
dual, 221
Euler quotient function, 12
exponential
  ordering, 156
  permutation, 156, 174, 182, 204, 207, 212
extension field, 17
Field characteristic, 25
FT factor, 189, 197
fundamental factorization, 149, 160
Gauss
  polynomial, 262
  sum, 238
    of primitive multiplicative character, 240
generator, 12
Good-Thomas Prime Factor algorithm, 91
greatest common divisor, 2
group
  direct product, 11
Homomorphism, ring, 6
Ideal, 2
idempotent, 8, 21
  basis of, 260
  system of, 8, 21
inner product, 218
inverse matrix, 57
irreducible, 14
Legendre symbol, 243
linear convolution, 101
Matrix
  inverse, 57
  symmetric, 57
mixed radix
  auto-sorting FFT algorithm, 86
  factorization, 82
  FFT, 82
monic polynomial, 14
multiple, 2, 14
multiplication, ring, 19
multiplicative
  character, 229, 231
    group, 231
    primitive, 233
    principal, 231
  characters, set of, 233
  factor, 197
  factors, 189
  stage, 160
Order, 11
  of matrix, 56
ordering, exponential, 156
orthonormal basis, 218
Parallel operation, 32
partial diagonalization method, 164
Pease FFT, 78
Pease FT, 77
perfect shuffle, 33
periodic, 221
periodization map, 236
permutation, 253
  exponential, 156, 174, 182, 204, 207, 212
  stride, 33
preaddition
  factor, 189, 197
  matrix, 149, 169
prime
  factorization theorem, 4
  polynomial, 14
prime number, 2
primitive multiplicative character, 233
  Gauss sum of, 240
principal multiplicative character, 231
Quotient, 2, 14
Rational basis, 249
reducing polynomial, 113
register, vector length, 45
relatively prime, 2
  polynomial, 14
remainder, 2, 14
ring
  direct product, 7, 21
  homomorphism, 6
  multiplication, 19
rotated
  core, 149
  FT, 94, 95, 97
  Winograd core, 175, 182
Segmenting, 29
skew-circulant, 155
skew-diagonal matrix, 251
standard
  basis, 30, 218
  representation, 218
Stockham FFT, 80
stride permutation, 33
subfield, 17
symmetric matrix, 57
system of idempotents, 8, 21
Tensor product, 28
  of matrices, 31
twiddle factor, 58
Unit group, 6
Vandermonde matrix, 112, 251
vector
  length register, 45
  operation, 32
Winograd
  core, 157, 169, 204, 207, 212
    rotated, 149, 175, 182
  large FT algorithm, 148
  small FT algorithm, 147
Zero divisor, 174