Execution Time of Symmetric Eigensolvers
by
Kendall Swenson Stanley
B.S. (Purdue University) 1978
A dissertation submitted in partial satisfaction of the
requirements for the degree of
Doctor of Philosophy
in
Computer Science
in the
GRADUATE DIVISION
of the
UNIVERSITY of CALIFORNIA at BERKELEY
Committee in charge:
Professor James Demmel, Chair
Professor William Kahan
Professor Phil Colella
Fall 1997
The dissertation of Kendall Swenson Stanley is approved:
Chair Date
Date
Date
University of California at Berkeley
Fall 1997
Execution Time of Symmetric Eigensolvers
Copyright Fall 1997
by
Kendall Swenson Stanley
Abstract
Execution Time of Symmetric Eigensolvers
by
Kendall Swenson Stanley
Doctor of Philosophy in Computer Science
University of California at Berkeley
Professor James Demmel, Chair
The execution time of a symmetric eigendecomposition depends upon the application, the
algorithm, the implementation, and the computer. Symmetric eigensolvers are used in a
variety of applications, and the requirements of the eigensolver vary from application to
application. Many different algorithms can be used to perform a symmetric eigendecomposition,
each with differing computational properties. Different implementations of the
same algorithm may also have greatly differing computational properties. The computer on
which the eigensolver is run not only affects execution time but may favor certain algorithms
and implementations over others.
This thesis explains the performance of the ScaLAPACK symmetric eigensolver,
the algorithms that it uses, and other important algorithms for solving the symmetric eigenproblem
on today's fastest computers. We offer advice on how to pick the best eigensolver
for particular situations and propose a design for the next ScaLAPACK symmetric eigensolver,
which will offer greater flexibility and 50% better performance.
Professor James Demmel
Dissertation Committee Chair
To the memory of my father. My most ambitious goal is to be as good a father as
he was to me.
Contents
List of Figures viii
List of Tables x
I First Part  1

1 Summary - Interesting Observations  2
  1.1 Algorithms  6
  1.2 Software overhead and load imbalance costs are significant  8
  1.3 Effect of machine performance characteristics on PDSYEVX  10
  1.4 Prioritizing techniques for improving performance  11
  1.5 Reducing the execution time of symmetric eigensolvers  12
  1.6 Jacobi  14
  1.7 Where to obtain this thesis  14

2 Overview of the design space  15
  2.1 Motivation  15
  2.2 Algorithms  15
  2.3 Implementations  17
    2.3.1 Parallel abstraction and languages  17
    2.3.2 Algorithmic blocking  17
    2.3.3 Internal Data Layout  18
    2.3.4 Libraries  20
    2.3.5 Compilers  20
    2.3.6 Operating Systems  21
  2.4 Hardware  21
    2.4.1 Processor  21
    2.4.2 Memory  22
    2.4.3 Parallel computer configuration  24
  2.5 Applications  26
    2.5.1 Input matrix  27
    2.5.2 User request  28
    2.5.3 Accuracy and Orthogonality requirements  29
    2.5.4 Input and Output Data layout  29
  2.6 Machine Load  29
  2.7 Historical notes  30
    2.7.1 Reduction to tridiagonal form and back transformation  30
    2.7.2 Tridiagonal eigendecomposition  32
    2.7.3 Matrix-matrix multiply based methods  39
    2.7.4 Orthogonality  40

3 Basic Linear Algebra Subroutines  43
  3.1 BLAS design and implementation  43
  3.2 BLAS execution time  44
  3.3 Timing methodology  48
  3.4 The cost of code and data cache misses in DGEMV  50
  3.5 Miscellaneous timing details  50

4 Details of the execution time of PDSYEVX  52
  4.1 High level overview of PDSYEVX algorithm  52
  4.2 Reduction to tridiagonal form  53
    4.2.1 Householder's algorithm  53
    4.2.2 PDSYTRD implementation (Figure 4.4)  57
    4.2.3 PDSYTRD execution time summary  71
  4.3 Eigendecomposition of the tridiagonal  72
    4.3.1 Bisection  72
    4.3.2 Inverse iteration  72
    4.3.3 Load imbalance in bisection and inverse iteration  73
    4.3.4 Execution time model for tridiagonal eigendecomposition in PDSYEVX  74
    4.3.5 Redistribution  74
  4.4 Back Transformation  75

5 Execution time of the ScaLAPACK symmetric eigensolver, PDSYEVX on efficient data layouts on the Paragon  81
  5.1 Deriving the PDSYEVX execution time on the Intel Paragon (common case)  83
  5.2 Simplifying assumptions allow the full model to be expressed as a six term model  83
  5.3 Deriving the computation time during matrix transformations in PDSYEVX on the Intel Paragon  84
  5.4 Deriving the computation time during eigendecomposition of the tridiagonal matrix in PDSYEVX on the Intel Paragon  85
  5.5 Deriving the message initiation time in PDSYEVX on the Intel Paragon  86
  5.6 Deriving the inverse bandwidth time in PDSYEVX on the Intel Paragon  86
  5.7 Deriving the PDSYEVX order n imbalance and overhead term on the Intel Paragon  86
  5.8 Deriving the PDSYEVX order n²/√p imbalance and overhead term on the Intel Paragon  87

6 Performance on distributed memory computers  88
  6.1 Performance requirements of distributed memory computers for running PDSYEVX efficiently  88
    6.1.1 Bandwidth rule of thumb  89
    6.1.2 Memory size rule of thumb  89
    6.1.3 Performance requirements for minimum execution time  92
    6.1.4 Gang scheduling  94
  6.2 sec:gang  94
    6.2.1 Consistent performance on all nodes  94
  6.3 Performance characteristics of distributed memory computers  95
    6.3.1 PDSYEVX execution time (predicted and actual)  95

7 Execution time of other dense symmetric eigensolvers  98
  7.1 Implementations based on reduction to tridiagonal form  98
    7.1.1 PeIGs  98
    7.1.2 HJS  99
    7.1.3 Comparing the execution time of HJS to PDSYEVX  101
    7.1.4 PDSYEV  106
  7.2 Other techniques  106
    7.2.1 One dimensional data layouts  106
    7.2.2 Unblocked reduction to tridiagonal form  108
    7.2.3 Reduction to banded form  109
    7.2.4 One-sided reduction to tridiagonal form  110
    7.2.5 Strassen's matrix multiply  111
  7.3 Jacobi  112
    7.3.1 Jacobi versus Tridiagonal eigensolvers  112
    7.3.2 Overview of Jacobi Methods  113
    7.3.3 Jacobi Methods  114
    7.3.4 Computation costs  114
    7.3.5 Communication costs  121
    7.3.6 Blocking  124
    7.3.7 Symmetry  125
    7.3.8 Storing diagonal blocks in one-sided Jacobi  126
    7.3.9 Partial Eigensolver  126
    7.3.10 Threshold  128
    7.3.11 Pairing  129
    7.3.12 Pre-conditioners  131
    7.3.13 Communication overlap  132
    7.3.14 Recursive Jacobi  132
    7.3.15 Accuracy  133
    7.3.16 Recommendation  133
  7.4 ISDA  134
  7.5 Banded ISDA  135
  7.6 FFT  136

8 Improving the ScaLAPACK symmetric eigensolver  137
  8.1 The next ScaLAPACK symmetric eigensolver  137
  8.2 Reduction to tridiagonal form in the next ScaLAPACK symmetric eigensolver  138
  8.3 Making the ScaLAPACK symmetric eigensolver easier to use  141
  8.4 Details in reducing the execution time of the ScaLAPACK symmetric eigensolver  141
    8.4.1 Avoiding overflow and underflow during computation of the Householder vector without added messages  142
    8.4.2 Reducing communications costs  143
    8.4.3 Reducing load imbalance costs  144
    8.4.4 Reducing software overhead costs  145
  8.5 Separating internal and external data layout without increasing memory usage  146

9 Advice to symmetric eigensolver users  148

II Second Part  150

Bibliography  151

A Variables and abbreviations  169

B Further details  172
  B.1 Updating v during reduction to tridiagonal form  172
    B.1.1 Notation  173
    B.1.2 Updating v without added communication  173
    B.1.3 Updating w with minimal computation cost  174
    B.1.4 Updating w with minimal total cost  177
    B.1.5 Notes to figure B.4  178
    B.1.6 Overlap communication and computation as a last resort  179
  B.2 Matlab codes  180
    B.2.1 Jacobi  180

C Miscellaneous matlab codes  181
  C.1 Reduction to tridiagonal form  181
List of Figures
1.1 9 by 9 matrix distributed over a 2 by 3 processor grid with mb = nb = 2  4
1.2 Processor point of view for 9 by 9 matrix distributed over a 2 by 3 processor grid with mb = nb = 2  5

3.1 Performance of DGEMV on the Intel PARAGON  46
3.2 Additional execution time required for DGEMV when the code cache is flushed between each call. The y-axis shows the difference between the time required for a run which consists of one loop executing 16,384 no-ops after each call to DGEMV and the time required for a run which includes two loops, one executing DGEMV and one executing 16,384 no-ops.  48
3.3 Additional execution time required for DGEMV when the code cache is flushed between each call, as a percentage of the time required when the code is cached. See Figure 3.2.  49

4.1 PDSYEVX algorithm  53
4.2 Classical unblocked, serial reduction to tridiagonal form, i.e. EISPACK's TRED1 (The line numbers are consistent with figures 4.3, 4.4 and 4.5.)  55
4.3 Blocked, serial reduction to tridiagonal form, i.e. DSYEVX (See Figure 4.2 for unblocked serial code)  56
4.4 PDSYEVX reduction to tridiagonal form (See Figure 4.3 for further details)  58
4.5 Execution time model for PDSYEVX reduction to tridiagonal form (See Figure 4.4 for details about the algorithm and indices.)  59
4.6 Flops in the critical path during the matrix vector multiply  67

6.1 Relative cost of message volume as a function of the ratio between peak floating point execution rate in Megaflops, mfs, and the product of main memory size in Megabytes, M, and network bisection bandwidth in Megabytes/sec, mbs.  90
6.2 Relative cost of message latency as a function of the ratio between peak floating point execution rate in Megaflops, mfs, and main memory size in Megabytes, M.  91

7.1 HJS notation  100
7.2 Execution time model for HJS reduction to tridiagonal form. Line numbers match Figure 4.5 (PDSYEVX execution time)  105
7.3 Matlab code for two-sided cyclic Jacobi  115
7.4 Matlab code for two-sided blocked Jacobi  116
7.5 Matlab code for one-sided blocked Jacobi  117
7.6 Matlab code for an inefficient partial eigendecomposition routine  118
7.7 Pseudo code for one-sided parallel Jacobi with a 2D data layout with communication highlighted  119
7.8 Pseudo code for two-sided parallel Jacobi with a 2D data layout, as described by Schreiber[150], with communication highlighted  121

8.1 Data redistribution in the next ScaLAPACK symmetric eigensolver  138
8.2 Choosing the data layout for reduction to tridiagonal form  139
8.3 Execution time model for the new PDSYTRD. Line numbers match Figure 4.5 (PDSYTRD execution time) where possible.  140

B.1 Avoiding communication in computing W V^T v  174
B.2 Computing W V^T v without added communication  175
B.3 Computing W V^T v with minimal computation  176
B.4 Computing W V^T v on a four dimensional processor grid  178
List of Tables
3.1 BLAS execution time (Time = θi + number of flops × γi, in microseconds)  45

4.1 The cost of updating the current column of A in PDLATRD (Lines 1.1 and 1.2 in Figure 4.5)  62
4.2 The cost of computing the reflector (PDLARFG) (Line 2.1 in Figure 4.5)  63
4.3 The cost of all calls to PDSYMV from PDSYTRD  66
4.4 The cost of updating the matrix vector product in PDLATRD (Line 4.1 in Figure 4.5)  68
4.5 The cost of computing the companion update vector in PDLATRD (Line 5.1 in Figure 4.5)  69
4.6 The cost of performing the rank-2k update (PDSYR2K) (Lines 6.1 through 6.3 in Figure 4.5)  70
4.7 Computation cost in PDSYEVX  77
4.8 Computation cost (tridiagonal eigendecomposition) in PDSYEVX  78
4.9 Communication cost in PDSYEVX  79
4.10 The cost of back transformation (PDORMTR)  80

5.1 Six term model for PDSYEVX on the Paragon  82
5.2 Computation time in PDSYEVX  85
5.3 Execution time during tridiagonal eigendecomposition  85
5.4 Message initiations in PDSYEVX  86
5.5 Message transmission in PDSYEVX  86
5.6 Order n load imbalance cost on the PARAGON  87
5.7 Order n²/√p load imbalance and overhead term on the PARAGON  87

6.1 Performance  95
6.2 Hardware and software characteristics of the PARAGON and the IBM SP2  96
6.3 Predicted and actual execution times of PDSYEVX on xps5, an Intel PARAGON. Problem sizes which resulted in execution times more than 15% greater than predicted are marked with an asterisk. Many of these problem sizes were repeated to show that the unusually large execution times are aberrant.  97

7.1 Comparison between the cost of HJS reduction to tridiagonal form and PDSYTRD for n = 4000, p = 64, nb = 32. Values differing from the previous column are shaded.  107
7.2 Fastest eigendecomposition method  112
7.3 Performance model for my recommended Jacobi method  118
7.4 Estimated execution time per sweep for my recommended Jacobi on the PARAGON for n = 1000, p = 64  120
7.5 Performance models (flop counts) for one-sided Jacobi variants. Entries which differ from the previous column are shaded.  122
7.6 Performance models (flop counts) for two-sided Jacobi variants  123
7.7 Communication cost for Jacobi methods (per sweep)  124

A.1 Variable names and their uses  170
A.2 Variable names and their uses (continued)  171
A.3 Abbreviations  171
A.4 Model costs  171
Acknowledgements
I thank those that I have worked with during my wonderful years at Berkeley. Doug
Ghormley taught me all that I know about emacs, X, and tcsh. Susan Blackford, Clint
Whaley and Antoine Petitet patiently answered my stupid questions about ScaLAPACK. I
thank Bruce Hendrickson for numerous insights. Mark Sears and Greg Henry gave me the
opportunity to test out some of my ideas on a real application. Peter Strazdins' study of
software overhead convinced me to take a hard look at code cache misses. Ross Moore gave
me numerous typesetting hints and suggestions. Beresford Parlett helped me with the
section on Jacobi. Oliver Sharp helped convince me to ask Jim Demmel to be my advisor
and gave some early help with technical writing. I am indebted to the members of the
ScaLAPACK team whose effort made ScaLAPACK, and hence this thesis, possible.
My graduate studies would not have been possible were it not for my friends and
family who encouraged me to resume my education and continued to support me in that
decision, especially my wife (Marta Laskowski), Greg Lee, and Marta's parents Michael and
Joan. I also thank Chris Ranken for his friendship; my parents for bringing me into a loving
world and teaching me to love mathematics; and Howard and Nani Ranken who proved, by
example, that the two-body problem can be solved and inspired Marta and me to pursue the
dream of two academic careers in one household.
I thank the members of my committee for their help and advice. I thank my
advisor for allowing me the luxury of doing research without worrying about funding1 or
machine access at UC Berkeley2 and the University of Tennessee at Knoxville3. I thank Prof.
Kahan for his sage advice, not just on the technical aspects, but also on the non-technical
aspects of a research career and on life itself. I thank Phil Colella for his interest in my
work and for reading my thesis on extremely short notice.
Most importantly, I thank my wife for her love and never ending support and I
thank my daughter for making me smile.
1This work was supported primarily by the Defense Advanced Research Projects Agency of the Department of Defense under contracts DAAL03-91-C-0047 and DAAH04-95-1-0077, and with additional support provided by the Department of Energy grant DE-FG03-94ER25206. The information presented here does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.
2National Science Foundation Infrastructure grant Nos. CDA-9401156 and CDA-8722788.
3The University of Tennessee, Knoxville, acquired the IBM SP2 through an IBM Shared University Research Grant. Access to the machine and technical support was provided by the University of Tennessee / Oak Ridge National Laboratory Joint Institute for Computational Science.
Part I
First Part
Chapter 1
Summary - Interesting
Observations
The symmetric eigendecomposition of a real symmetric matrix is A = Q D Q^T,
where D is diagonal and Q is orthonormal, i.e. Q^T Q = I. Tridiagonal based methods
reduce A to a tridiagonal matrix through an orthonormal similarity transformation, i.e.
A = Z T Z^T, compute the eigendecomposition of the tridiagonal matrix T = U D U^T and,
if necessary, transform the eigenvectors of the tridiagonal matrix back into eigenvectors of
the original matrix A, i.e. Q = Z U. Non-tridiagonal based methods operate directly on
the original matrix A.
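The three stages of a tridiagonal based method can be sketched in a short NumPy program. This is an illustrative sketch only, not the ScaLAPACK code: the reduction below is the classical unblocked Householder algorithm, and numpy.linalg.eigh stands in for a dedicated tridiagonal eigensolver.

```python
import numpy as np

def tridiagonalize(A):
    """Classical unblocked Householder reduction A = Z T Z^T,
    where T is tridiagonal and Z is orthogonal."""
    n = A.shape[0]
    T = A.copy()
    Z = np.eye(n)
    for k in range(n - 2):
        x = T[k + 1:, k]
        alpha = -np.copysign(np.linalg.norm(x), x[0])
        v = x.copy()
        v[0] -= alpha                 # v = x - alpha*e1
        if np.linalg.norm(v) == 0.0:
            continue                  # column already in tridiagonal form
        v /= np.linalg.norm(v)
        H = np.eye(n)
        H[k + 1:, k + 1:] -= 2.0 * np.outer(v, v)   # Householder reflector
        T = H @ T @ H                 # orthonormal similarity transformation
        Z = Z @ H                     # accumulate Z
    return T, Z

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
A = (A + A.T) / 2                     # random symmetric test matrix

T, Z = tridiagonalize(A)              # step 1: A = Z T Z^T
w, U = np.linalg.eigh(T)              # step 2: T = U D U^T
Q = Z @ U                             # step 3: back transformation, Q = Z U

assert np.allclose(Q @ np.diag(w) @ Q.T, A)
```

The final assertion checks that Q D Q^T reproduces A, i.e. that the back-transformed vectors are eigenvectors of the original matrix.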
I am interested in understanding and minimizing the execution time of dense sym-
metric eigensolvers, as used in real applications, on distributed memory parallel computers.
I have modeled the performance of symmetric eigensolvers as a function of the algorithm,
the application, the implementation and the computer. Some applications require only a
partial eigendecomposition, i.e. only a few eigenvalues or eigenvectors. Different implementations
may require different communication or computation patterns, and they may use
different libraries and/or compilers. This thesis concentrates on the O(n^3) cost of reduction
to tridiagonal form and of transforming the eigenvectors back to the original space.
I have modeled the execution time of the ScaLAPACK[31] symmetric eigensolver,
PDSYEVX, in detail and validated this model against actual performance on a number of
distributed memory parallel computers. PDSYEVX, like most ScaLAPACK codes, uses calls
to the PBLAS[41, 140] to perform basic linear algebra operations such as matrix-matrix
multiply and matrix-vector multiply in parallel. PDSYEVX and the PBLAS use calls to the
Basic Linear Algebra Subroutines, BLAS[63, 62], to perform basic linear algebra operations
such as matrix-matrix multiply and matrix-vector multiply on data local to each processor,
and calls to the Basic Linear Algebra Communications Subroutines, BLACS[169, 69], to move
data between the processors. The level one BLAS involve only vectors and perform O(n)
flops on O(n) data, where n is the length of the vector. The level two BLAS involve one
matrix and one or two vectors and perform O(n^2) flops on O(n^2) data, where the matrix is
of size n x n. The level three BLAS involve only matrices and perform O(n^3) flops on O(n^2)
data, and offer the best opportunities to obtain peak floating point performance through
data re-use.
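As a rough back-of-the-envelope illustration (not a measurement), the flop-to-data ratios behind this hierarchy can be computed directly for representative operations:

```python
def blas_reuse(n):
    """Flops performed per matrix/vector element touched, for a
    representative operation at each BLAS level on problem size n."""
    level1 = (2 * n) / (2 * n)              # dot product: 2n flops, 2n words
    level2 = (2 * n * n) / (n * n + 2 * n)  # matrix-vector: 2n^2 flops, ~n^2 words
    level3 = (2 * n**3) / (3 * n * n)       # matrix-matrix: 2n^3 flops, 3n^2 words
    return level1, level2, level3

l1, l2, l3 = blas_reuse(1000)
# Levels 1 and 2 re-use each datum O(1) times, so they are memory bound;
# level 3 re-uses each datum O(n) times, which is what permits blocked
# implementations to approach peak floating point performance.
```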
PDSYEVX uses a 2D block cyclic data layout for all input, output and internal
matrices. 2D block cyclic data layouts have been shown to support scalable high performance
parallel dense linear algebra codes[32, 30, 124] and hence have been selected as the
primary data layout for HPF[110], ScaLAPACK[68] and other parallel dense linear algebra
libraries[98, 164]. A 2D block cyclic data layout is defined by the processor grid (pr by pc),
the local block size (mb by nb) and the location of the (1,1) element of the matrix. In
this thesis, we will assume that the (1,1) element of matrix A, i.e. A(1,1), is mapped to
the (1,1) element of the local matrix on processor (0,0). Hence, A(i,j) is stored in element
( floor((i-1)/(mb*pr))*mb + mod(i-1, mb) + 1, floor((j-1)/(nb*pc))*nb + mod(j-1, nb) + 1 )
of the local array on processor ( mod(floor((i-1)/mb), pr), mod(floor((j-1)/nb), pc) ).
Figures 1.1 and 1.2, reprinted from the ScaLAPACK
User's Guide[31], show how a 9 by 9 matrix would be distributed over a 2 by 3 processor
grid with mb = nb = 2. In general, we will assume that square blocks are used, since this is
best for the symmetric eigenproblem, and we will use nb to refer to both the row block size
and the column block size.
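This mapping is mechanical enough to express as a few lines of code. The helper below is hypothetical (it is not a ScaLAPACK routine; ScaLAPACK's TOOLS directory provides similar index conversions), and exists only to make the formula above concrete:

```python
def block_cyclic(i, j, mb, nb, pr, pc):
    """Map global entry A(i, j) (1-based) of a 2D block cyclic layout to
    ((processor row, processor col), (local row, local col)), assuming
    A(1, 1) lives in local element (1, 1) of processor (0, 0)."""
    proc = ((i - 1) // mb % pr, (j - 1) // nb % pc)
    local = ((i - 1) // (mb * pr) * mb + (i - 1) % mb + 1,
             (j - 1) // (nb * pc) * nb + (j - 1) % nb + 1)
    return proc, local

# The 9 by 9 example of Figure 1.1: a 2 by 3 grid with mb = nb = 2.
print(block_cyclic(1, 1, 2, 2, 2, 3))  # ((0, 0), (1, 1))
print(block_cyclic(3, 3, 2, 2, 2, 3))  # ((1, 1), (1, 1))
print(block_cyclic(9, 9, 2, 2, 2, 3))  # ((0, 1), (5, 3))
```

Note that the processor coordinates depend only on the block index modulo the grid shape, which is what gives the layout its cyclic load-balancing property.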
All ScaLAPACK codes, including PDSYEVX in version 1.5, use the data layout block
size as the algorithmic blocking factor. Hence, except as noted, we use nb to refer to
the algorithmic blocking factor as well as the data layout block size. Data layouts and
algorithmic blocking factors are discussed in Section 2.3.3.
PDSYEVX calls the following routines:
PDSYTRD Performs Householder reduction to tridiagonal form.
PDSTEBZ Computes the eigenvalues of a tridiagonal matrix using bisection.
PDSTEIN Computes the eigenvectors of the tridiagonal matrix using inverse iteration and
Figure 1.1: 9 by 9 matrix distributed over a 2 by 3 processor grid with mb = nb = 2
[Figure: the 9 x 9 matrix partitioned into 2 x 2 blocks; MB and NB label the block dimensions, M and N the matrix dimensions.]
Gram-Schmidt reorthogonalization.
PDORMTR Transforms the eigenvectors of the tridiagonal matrix back into eigenvectors of
the original matrix.
My performance models explain performance in terms of the following application param-
eters:
n The matrix size.
m The number of eigenvectors required.
e The number of eigenvalues required (e >= m).

the following machine parameters:

p The number of processors (arranged in a pr by pc grid as described below).

α The communication latency (secs/message).

β The inverse communication bandwidth (secs/double precision word). This means that
sending a message of k double precision words costs: α + kβ.
Figure 1.2: Processor point of view for 9 by 9 matrix distributed over a 2 by 3 processor grid with mb = nb = 2
[Figure: the same 9 x 9 matrix from the 2 x 3 process grid point of view, showing the local array owned by each process.]
γ1, γ2, γ3 Time per flop for BLAS1, BLAS2 and BLAS3 routines respectively.

θ1, θ2, θ3, θ4 Software overhead for BLAS1, BLAS2, BLAS3 and PBLAS routines respectively.
This means that a call to DGEMM (a BLAS3 routine) requiring c flops costs: θ3 + cγ3. See
Chapter 3 for details on the cost of the BLAS. The cost of the PBLAS routine PDSYMV
is shown in Table 4.3.
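These terms compose additively. As a hedged sketch of how the model is evaluated (the parameter values below are placeholders for illustration, not measured Paragon constants):

```python
def message_time(k, alpha, beta):
    """Time to send k double precision words: alpha + k*beta."""
    return alpha + k * beta

def blas_call_time(flops, theta, gamma):
    """Time for one BLAS/PBLAS call performing `flops` flops:
    theta + flops*gamma."""
    return theta + flops * gamma

# Hypothetical parameter values, for illustration only.
alpha, beta = 100e-6, 0.05e-6      # 100 us latency, 0.05 us per word
theta3, gamma3 = 30e-6, 0.025e-6   # BLAS3 overhead and time per flop

n = 200
t = blas_call_time(2 * n**3, theta3, gamma3) + message_time(n * n, alpha, beta)
# Small problems are dominated by the fixed costs theta and alpha;
# large problems by the per-flop and per-word costs gamma and beta.
```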
My model also uses the following algorithmic and data layout parameters:
pr The number of processor rows in the processor grid.
pc The number of processor columns in the processor grid.
nb The data layout block size and algorithmic blocking factor.
These and all other variables used in this thesis are listed in Table A.1 in Appendix A.
The rest of this chapter presents the most interesting results from my study of the
execution time of symmetric eigensolvers on distributed memory computers. Section 1.1
describes the algorithms commonly used for dense symmetric eigendecomposition on distributed
memory parallel computers. Section 1.2 describes how software overhead and load
imbalance costs are significant. Section 1.3 explains the two rules of thumb for ensuring
that a distributed memory parallel computer can achieve good performance on a dense linear
algebra code such as ScaLAPACK's symmetric eigensolver. Section 1.4 explains that it
is important to identify which techniques offer the greatest potential for improving performance
across a wide range of applications, problem sizes and distributed memory
parallel computers. Section 1.5 gives a synopsis of how the execution time of the ScaLAPACK
symmetric eigensolver could be reduced. Section 1.6 explains the types of applications on
which Jacobi can be expected to be as fast as, or faster than, tridiagonal based methods.
The rest of my thesis is organized as follows. Chapter 2 provides an introduction
and a historical perspective. Chapter 3 explains the performance of the Basic Linear Algebra
Subroutines (BLAS). Chapter 4 contains my complete execution time model for ScaLAPACK's
symmetric eigensolver, PDSYEVX. Chapter 5 simplifies the execution time model by concentrating
on a particular application on a particular distributed memory parallel computer,
the Intel Paragon. Chapter 6 explains the performance requirements of distributed memory
parallel computers and discusses the execution time of PDSYEVX. Chapter 7 explains the
performance of other dense symmetric eigensolvers. Chapter 8 provides a blueprint for reducing
the execution time of PDSYEVX. Chapter 9 offers concise advice to users of symmetric
eigensolvers.
1.1 Algorithms
There are many widely disparate symmetric eigendecomposition algorithms. Tridi-
agonal reduction based algorithms for the symmetric eigendecomposition require asymptot-
ically the fewest ops and have been historically the fastest and most popular[83, 79, 129,
153, 86, 145, 134, 50].
Iterative eigensolvers, e.g. Lanczos and conjugate gradient methods, are clearly superior if the input matrix is sparse and only a limited portion of the spectrum is needed[49, 119]. Iterative eigensolvers are outside the scope of this thesis.
Even for tridiagonal matrices, there are several algorithms worthy of attention for the tridiagonal eigendecomposition. The ideal method would require at most O(n²) floating point operations, O(n) message volume and O(p) messages. The recent work of Parlett and Dhillon[136, 139] renews hope that such a method will be available in the near future. Should this effort hit unexpected snags, other better known methods, such as QR[79, 86, 93], QD[135], bisection and inverse iteration[83, 102] and Cuppen's divide and conquer algorithm[50, 66, 147, 88], will remain common. Parallel codes have been written for QR[39, 8, 76, 125], bisection and inverse iteration[15, 75, 54, 81] and Cuppen's algorithm[82, 80, 141]. ScaLAPACK offers parallel QR and parallel bisection and inverse iteration codes, and Cuppen's algorithm[50, 66, 88], which has recently replaced QR as the fastest serial method[147], has been coded for inclusion in ScaLAPACK by Françoise Tisseur. Algorithms for the tridiagonal eigenproblem are discussed in Section 2.2, and parallel tridiagonal eigensolvers are discussed in Section 7.1.
A detailed comparison of tridiagonal eigensolvers would be premature until Parlett and Dhillon complete their prototype.
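From a present-day vantage point, these alternatives can be compared directly through LAPACK's tridiagonal drivers as exposed by scipy. The sketch below is illustrative only; the driver names ('stev' for the QR-style driver, 'stemr' for the algorithm that grew out of Parlett and Dhillon's work, 'stebz' for bisection) are scipy/LAPACK conventions, not part of this thesis:

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

rng = np.random.default_rng(0)
n = 100
d = rng.standard_normal(n)        # diagonal of the tridiagonal matrix
e = rng.standard_normal(n - 1)    # off-diagonal

# QR-style driver (xSTEV) vs. the Parlett/Dhillon-derived xSTEMR
w_qr, v_qr = eigh_tridiagonal(d, e, lapack_driver='stev')
w_mr, v_mr = eigh_tridiagonal(d, e, lapack_driver='stemr')
assert np.allclose(w_qr, w_mr)    # both compute the same spectrum

# Bisection (xSTEBZ) computes a subset of the spectrum cheaply
w_lo = eigh_tridiagonal(d, e, eigvals_only=True, select='i',
                        select_range=(0, 4), lapack_driver='stebz')
assert np.allclose(w_lo, w_qr[:5])
```

The select='i' call shows why bisection remains attractive when only a few eigenvalues are wanted.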
This thesis concentrates on the O(n³) cost of reduction to tridiagonal form and of transforming the eigenvectors back to the original space. Hendrickson, Jessup and Smith[91] showed that reduction to tridiagonal form can be performed 50% faster than ScaLAPACK does it. Lang's successive band reduction[116], SBR, is interesting, at least if only eigenvalues are to be computed, but the complexity of SBR has made it difficult to realize its theoretical advantages in practice. A performance model for PDSYEVX, ScaLAPACK's symmetric eigensolver, is given in Chapter 4. By restricting our attention to a single computer, and to the most common applications, the model is further simplified and discussed in Chapter 5.
Jacobi requires 4-20 times as many floating point operations as tridiagonal based methods, hence the set of problems on which Jacobi will be faster will always be limited. Jacobi is faster than tridiagonal based methods[125, 2] on small spectrally diagonally dominant matrices¹, despite requiring 4 times as many flops, because it has less overhead. However, on large problems tridiagonal based methods can achieve at least 25% efficiency and will hence be faster than any method requiring 4 times as many flops. And on matrices that are not spectrally diagonally dominant, Jacobi requires 20 or more times as many flops as tridiagonal based methods, a handicap that is simply too large to overcome. Jacobi's method is discussed in Section 7.3.
Methods that require multiple n by n matrix-matrix multiplies, such as the Invariant Subspace Decomposition Approach[97] (ISDA), and Yau and Lu's FFT based method[174] require roughly 30 times as many floating point operations as tridiagonal based methods and hence may never be faster than tridiagonal based methods. The ISDA for solving symmetric eigenproblems is discussed in Section 7.4.

¹ Spectrally diagonally dominant means that the eigenvector matrix, or a permutation thereof, is diagonally dominant.
Banded ISDA[26], an improvement on ISDA that begins with an initial bandwidth reduction, is nearly a tridiagonal method and offers performance that is nearly as good, at least if only eigenvalues are sought. However, since a banded ISDA code requires multiple bandwidth reductions, each of which requires a back transformation, if even a few eigenvectors are required a banded ISDA code must either store the back transformations in compact form or perform an additional O(n³) flops. No code available today stores and applies these back transformations in compact form. At present, the fastest banded ISDA code starts by reducing the matrix to tridiagonal form and is neither the fastest tridiagonal eigensolver nor the easiest to parallelize. Banded ISDA is discussed in Section 7.5.
In conclusion, reduction to tridiagonal form combined with Parlett and Dhillon's
tridiagonal eigensolver is likely to be the preferred method for eigensolution of dense matrices
for most applications.
In the meantime, until Parlett and Dhillon's code is available, we believe that PDSYEVX is the best general purpose symmetric eigensolver for dense matrices. It is available on any machine to which ScaLAPACK has been ported², it achieves 50% efficiency even when the flops in the tridiagonal eigensolution are not counted³, and it scales well, running efficiently on machines with thousands of nodes. It is faster than ISDA, and faster than Jacobi on large matrices and on matrices that are not spectrally diagonally dominant.
1.2 Software overhead and load imbalance costs are significant
In PDSYEVX, it is somewhat surprising but true that software overhead and load imbalance costs are larger than communications costs. In its broadest definition, software overhead is the difference between the actual execution time and the cost of communication and computation. Software overhead includes saving and restoring registers, parameter passing, error and special case checking, as well as those tasks which prevent calls to the BLAS involving few flops from being as efficient as calls to the BLAS involving many flops: loop overhead, border cases and data movement between memory hierarchies that gets amortized over all the operations in a given call to the BLAS. The cost of any operation which is performed by only a few of the processors (while the other processors are idle) is a load imbalance cost.

² Intel Paragon, Cray T3D, Cray T3E, IBM SP2, and any machine supporting the BLACS, MPI or PVM.
³ Our definition of efficiency is a demanding one: total time divided by the time required by reduction to tridiagonal form and back transformation assuming that these are performed at the peak floating point execution rate of the machine, i.e. time / ((10/3)(n³/p) × the peak time per flop).
Because software overhead is as significant as communication latency, the three term performance model introduced by Choi et al.[40] and used in my earlier work[57], which counts only flops, number of messages and words communicated, does not adequately model the performance of PDSYEVX. In addition to these three terms, a fourth term representing software overhead costs is required.
Software overhead is more difficult to measure, study, model and reason about than the other components of execution time. Measuring the execution time of a subroutine call requiring little or no work measures only subroutine call overhead, parameter passing and error checking. For the performance models in this thesis, we measure the execution time of each routine across a range of problem sizes (with code cached and data not cached) and use curve fitting to estimate the software overhead of an individual routine. Because we perform these timings with code cached but data not cached, this gives an estimate of all software overhead costs except code cache misses.
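The curve-fitting procedure can be sketched in a few lines. The routine timed, the sizes, and the least-squares model t = α + γ·flops below are illustrative assumptions, not the thesis's actual measurement harness:

```python
import time
import numpy as np

def median_time(fn, reps=50):
    """Median wall-clock time of fn() over several repetitions
    (code is warm in cache after the first call, as in the text)."""
    ts = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        ts.append(time.perf_counter() - t0)
    return float(np.median(ts))

# Time a BLAS2-style matrix-vector product across problem sizes.
sizes = [32, 64, 128, 256, 512]
times, flops = [], []
for n in sizes:
    A, x = np.ones((n, n)), np.ones(n)
    times.append(median_time(lambda: A.dot(x)))
    flops.append(2.0 * n * n)          # flop count of one matvec

# Fit t = alpha + gamma * flops: alpha estimates the fixed per-call
# software overhead, gamma the incremental time per flop.
X = np.column_stack([np.ones(len(sizes)), flops])
alpha, gamma = np.linalg.lstsq(X, np.array(times), rcond=None)[0]
```

The intercept α lumps together everything that does not scale with the flop count, which is exactly the quantity the plain "time an empty call" approach underestimates.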
We use times with the code cached but data not cached for our performance models because, for most problem sizes, the matrix is too large to fit in cache but it is less clear whether the code fits in cache or not. It is easy to compute the amount of data which must be cached, but there is no portable, automatic way to measure the amount of code which must be cached. Furthermore, the data cache needs for typical problem sizes are much larger than the code cache needs; hence, while it is usually clear that the data is not cached, the code cache needs and the code cache size are much closer.
A full study of software overhead costs is outside the scope of this thesis and remains a topic for future research. The overhead and load imbalance terms in the performance model for PDSYEVX on the Paragon are explained in Sections 5.7 and 5.8.
1.3 Effect of machine performance characteristics on PDSYEVX

The most important machine performance characteristic is the peak floating point rate. Bisection bandwidth essentially defines which machines ScaLAPACK can perform well on. Message latency and software overhead, being O(n) terms, are important primarily for small and medium matrices.
Most collections of computers fall into one of two groups: those connected by a switched network whose bisection bandwidth increases linearly (or nearly so) with the number of processors, and those connected by a network that allows only one processor to send at a time. All current distributed memory parallel computers that I am aware of have adequate bisection bandwidth⁴ to support good efficiency on PDSYEVX. On the other hand, no network that allows only one processor to send at a time can provide scalable performance, and none that I am aware of allows good performance with as many as 16 processors. As long as the bandwidth rule of thumb (explained in detail in Section 6.1.1) holds, bandwidth will not be the limiting factor in the performance of PDSYEVX.
Bandwidth rule of thumb: Bisection bandwidth per processor⁵ times the square root of memory size per processor should exceed floating point performance per processor:

    (Megabytes/sec per processor) × √(Megabytes per processor) > (Megaflops/sec per processor)

assures that bandwidth will not limit performance.
Assuming that the bandwidth is adequate, we consider next the problem size per processor. If the problem is large enough, i.e. n²/p > 2 × (Megaflops/sec per processor), then PDSYEVX should execute reasonably efficiently. This rule (explained in detail in Section 6.1.2) can be restated as:

Memory size rule of thumb: memory size should match floating point performance:
⁴ Few distributed memory parallel computers offer bandwidth that scales linearly with the number of processors, but most still have adequate bisection bandwidth.
⁵ Bisection bandwidth per processor is the total bisection bandwidth of the network divided by the number of processors.
    (Megabytes per processor) > (Megaflops/sec per processor)

assures that PDSYEVX will be efficient on large problems.
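Both rules of thumb reduce to one-line predicates. A sketch; the node parameters in the example are invented for illustration:

```python
import math

def bandwidth_rule_ok(mbytes_per_sec, mbytes, mflops):
    """Bisection bandwidth per processor (MB/s) times the square root
    of memory per processor (MB) should exceed Mflops/sec per processor."""
    return mbytes_per_sec * math.sqrt(mbytes) > mflops

def memory_rule_ok(mbytes, mflops):
    """Memory size per processor should exceed flop rate per processor."""
    return mbytes > mflops

# A hypothetical node: 40 MB/s bisection bandwidth per processor,
# 32 MB of memory, 100 Mflops/sec peak.
print(bandwidth_rule_ok(40.0, 32.0, 100.0))   # 40 * sqrt(32) ~ 226 > 100
print(memory_rule_ok(32.0, 100.0))            # 32 < 100: too little memory
```

Such a node would satisfy the bandwidth rule but fail the memory rule, so it could be expected to run PDSYEVX efficiently only if lower order terms are small.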
If the problem is not large enough, lower order terms, as explained in Chapter 4, will be significant. Unlike the peak flop rate, which can be substantially independent of main memory performance, the lower order terms (communication latency, communication bandwidth, software overhead and load imbalance) are strongly linked to main memory performance.
PDSYEVX can work well on machines with large, slow main memory (on large problems) and/or machines with small, fast main memory (on small problems). Most distributed memory parallel computers have sufficient memory size and network bisection bandwidth to allow PDSYEVX to achieve high efficiency on large problem sizes. The Cray T3E is one of the few machines that has sufficient main memory performance to allow PDSYEVX to achieve high performance on small problem sizes. The effect of machine performance characteristics on PDSYEVX is discussed in Chapter 6.
1.4 Prioritizing techniques for improving performance

One of the most important uses of performance modeling is to identify which techniques offer the most promise for performance improvement, because there are too many performance improvement techniques to allow one to try them all. One technique that appeared to be important early in my work, optimizing global communications, now appears less important in light of the discovery that software overhead and load imbalance are more significant than earlier thought. Here we talk about general conclusions; details are summarized in Section 1.5 and elaborated in Chapters 7 and 8.
Overlapping communication and computation, though it undeniably increases performance, should be implemented only after every effort has been made to reduce both communications and computations costs as much as possible. Overlapping communication and computation has proven to be more attractive in theory than in practice because not all communication costs overlap well and communication costs are not the only impediment to good parallel performance.
Although Strassen's matrix multiplication has been proven to offer performance better than can be achieved through traditional methods, it will be a long time before a Strassen's matrix multiply is shown to be twice as fast as a traditional method. A typical single processor computer would require 2-4 Gigabytes of main memory to achieve an effective flop rate of twice the machine's peak flop rate⁶ and 2-4 Terabytes of main memory to achieve 4 times the peak flop rate. Strassen's matrix multiplication will get increasing use in the coming years, because achieving 20% above "peak" performance is nothing to sneeze at, but Strassen's matrix multiply will not soon make matrix multiply based eigendecomposition such as ISDA faster than tridiagonal based eigendecomposition.
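For reference, a minimal recursive Strassen multiply, which trades one of the eight half-size products for extra additions at each recursion level (a textbook sketch, not a tuned implementation; the cutoff below is arbitrary):

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's multiply: 7 half-size products instead of 8 per level,
    falling back to the ordinary product below `cutoff`."""
    n = A.shape[0]
    if n <= cutoff or n % 2 != 0:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

Each level saves only 1/8 of the multiplications while adding O(n²) additions, so many recursion levels, and hence very large matrices, are needed before the effective flop rate doubles, which is the point made above.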
1.5 Reducing the execution time of symmetric eigensolvers
PDSYEVX can be improved. It does not work well on matrices with large clusters of eigenvalues. And it is not as efficient as it could be[91], achieving only 50% of peak efficiency on the Paragon, Cray T3D and Berkeley NOW even on large matrices. On small matrices it performs worse. Parlett and Dhillon's new tridiagonal eigensolver promises to solve the clustered eigenvalue problem, so we concentrate on improving the performance of reduction to tridiagonal form and back transformation.
Input and output data layout need not affect the execution time of a parallel symmetric eigensolver because data redistribution is cheap. Data redistribution requires only O(p) messages and O(n²/p) message volume per processor. This is modest compared to the O(n log(p)) messages and O(n²/√p) message volume per processor required by reduction to tridiagonal form and back transformation.
Separating internal and external data layout actually decreases the minimum execution time over all data layouts. Separating internal and external data layouts allows reduction to tridiagonal form and back transformation to use different data layouts. It also allows codes to concentrate only on the best data layout, reducing software overhead and allowing improvements which would be prohibitively complicated to implement if they had to work on all two-dimensional block cyclic data layouts.
Separating internal and external data layouts increases the minimum workspace requirement⁷ from 2.5n² to 3n². However, with minor improvements in the existing code, and without any changes to the interface, internal and external data layout can be separated without increasing the workspace requirement. See Section 8.5.

⁶ A dual processor computer would require twice as much memory.
⁷ Assuming that data redistribution is not performed in place. It is difficult to redistribute data in place
Lichtenstein and Johnson[124] point out that data layout is irrelevant to many linear algebra problems because one can solve a permuted problem instead of the original. This works for symmetric problems provided that the input data is distributed over a square processor grid and the row block size is equal to the column block size.
Hendrickson, Jessup and Smith[91] demonstrated that the performance of PDSYEVX can be improved substantially by reducing load imbalance, software overhead and communications costs. Most of the inefficiency in PDSYEVX is in reduction to tridiagonal form. Software overhead and load imbalance are responsible for more of the inefficiency than the cost of communications; hence, it is those areas that need to be sped up the most. Preliminary results[91] indicate that by abandoning the PBLAS interface, using BLAS and BLACS calls directly, and concentrating on the most efficient data layout, software overhead, load imbalance and communications costs can be cut in half. Strazdins has investigated reducing software overheads in the PBLAS[161], but it remains to be seen whether software overheads in the PBLAS can be reduced sufficiently to allow PDSYEVX to be as efficient as it could be. PDSYEVX performance can be improved further if the compiler can produce efficient code for simple doubly nested loops implementing merged BLAS Level 2 operations (like DSYMV and DSYR2).
For small matrices, software overhead dominates all costs, and hence one should minimize software overhead even at the expense of increasing the cost per flop. An unblocked code has the potential to do just that.
Although back transformation is more efficient than reduction to tridiagonal form, it can be improved. Whereas software overhead is the largest source of inefficiency in reduction to tridiagonal form, communications cost and load imbalance are the largest sources of inefficiency in back transformation. Load imbalance is hard to eliminate in a blocked data layout in reduction to tridiagonal form because the size of the matrix being updated is constantly changing (getting smaller), but in back transformation all eigenvectors are constantly updated, so statically balancing the number of eigenvalues assigned to each processor works well. Therefore the best data layout for back transformation is a two-dimensional rectangular block-cyclic data layout. The number of processor columns, pc, should exceed the number of processor rows by a factor of approximately 8. The optimal data layout column block size is ⌈n/(pc·k)⌉ for some small integer k. The row block size is less important in back transformation, and 32 is a reasonable choice, although setting it to the same value as the column block size will also work well if the BLAS are efficient on that block size and pr < pc. Many techniques used to improve performance in LU decomposition, such as overlapping communication and computation, pipelining communication and asynchronous message passing, can also be used to improve the performance of back transformation. Of these techniques, only asynchronous message passing (which eliminates all local memory movement) requires modification to the BLACS interface. The modification to the BLACS needed to support asynchronous message passing would allow forward and backward compatibility.

⁷ (cont.) between two arbitrary parallel data layouts. If efficient in-place data redistribution were feasible, separating internal and external data layout would require only a trivial increase in workspace.
All of these methods are discussed in Chapter 8.
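The grid-shape and block-size heuristics for back transformation can be collected into a small helper. The divisor search below is one plausible way to realize "pc ≈ 8·pr" and is my own illustrative choice, not code from the thesis:

```python
import math

def back_transform_grid(p, n, k=2, row_block=32):
    """Suggest a pr x pc processor grid and block sizes for back
    transformation: pc about 8x pr, column block size ceil(n/(pc*k))
    for a small integer k, and a fixed row block size."""
    # largest divisor pr of p such that pc = p/pr is still >= 8*pr
    candidates = [d for d in range(1, int(math.isqrt(p)) + 1)
                  if p % d == 0 and p // d >= 8 * d]
    pr = max(candidates) if candidates else 1
    pc = p // pr
    return pr, pc, row_block, math.ceil(n / (pc * k))

print(back_transform_grid(128, 4000))   # -> (4, 32, 32, 63)
```

For 128 processors this yields a 4 × 32 grid, matching the 1:8 aspect ratio recommended above.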
1.6 Jacobi
A one-sided Jacobi method with a two-dimensional data layout will beat tridiagonal based eigensolvers on small spectrally diagonally dominant matrices. The simpler one-dimensional data layout is sufficient for modest numbers of processors, perhaps as many as a few hundred, but does not scale well. Tridiagonal based methods, because they require fewer flops, will beat Jacobi methods on random matrices regardless of their size, and on large (n > 200√p) matrices even if they are spectrally diagonally dominant. Jacobi also remains of interest in some cases when high accuracy is desired[58]. Jacobi's method is discussed in Section 7.3.
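For intuition, here is a tiny serial two-sided cyclic Jacobi eigensolver. The parallel one-sided variant discussed in Section 7.3 organizes the same plane rotations differently; this sketch makes no attempt at efficiency:

```python
import numpy as np

def jacobi_eig(A, sweeps=10, tol=1e-20):
    """Cyclic two-sided Jacobi: sweep over all (p, q) pairs, zeroing
    A[p, q] with a plane rotation each time. O(n^3) work per rotation
    here because J is applied as a dense matrix (for clarity only)."""
    A = A.copy().astype(float)
    n = A.shape[0]
    V = np.eye(n)
    for _ in range(sweeps):
        off = np.sum(A**2) - np.sum(np.diag(A)**2)   # off-diagonal mass
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                # rotation angle that annihilates A[p, q]
                theta = 0.5 * np.arctan2(2.0 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J                       # two-sided update
                V = V @ J                             # accumulate vectors
    w = np.diag(A)
    order = np.argsort(w)
    return w[order], V[:, order]
```

Every rotation touches the whole matrix, which is where the extra flops relative to tridiagonal based methods come from.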
1.7 Where to obtain this thesis
This thesis is available at: http://www.cs.berkeley.edu/stanley/thesis
Chapter 2
Overview of the design space
2.1 Motivation
The execution time of any computational solution to a problem is a single-valued function (time) on a multi-dimensional and non-uniform domain. This domain includes the problem being solved, the algorithm, the implementation of the algorithm and the underlying hardware and software (sometimes referred to collectively as the computer). By studying one problem, the symmetric eigenproblem, in detail we gain insight into how each of these factors affects execution time.
Section 2.2 discusses the most important algorithms for dense symmetric eigendecomposition on distributed memory parallel computers. Section 2.3 discusses the effect that the implementation can have on execution time. Section 2.4 discusses the effect of various hardware characteristics on execution time. Section 2.5 lists several applications that use symmetric eigendecomposition and their differing needs. Section 2.6 discusses the direct and indirect effects of machine load on the execution time of a parallel code. Section 2.7 outlines the most important historical developments in parallel symmetric eigendecomposition.
2.2 Algorithms
The most common symmetric eigensolvers which compute the entire eigendecomposition use Householder reduction to tridiagonal form, form the eigendecomposition of the tridiagonal matrix, and transform the eigenvectors back to the original basis. Algorithms that do not begin with reduction to tridiagonal form require more floating point operations. Except for small spectrally diagonally dominant matrices, on which Jacobi will likely be faster than tridiagonal based methods, and scaled diagonally dominant matrices, on which Jacobi is more accurate[58], tridiagonal based codes will be best for the eigensolution of dense symmetric matrices. See Section 7.3 for details.
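The three phases can be sketched serially with scipy standing in for the LAPACK routines; for a symmetric matrix, Householder reduction to Hessenberg form yields a tridiagonal matrix. This is an illustration of the structure, not the ScaLAPACK code:

```python
import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

rng = np.random.default_rng(1)
n = 50
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                      # dense symmetric matrix

# Phase 1: Householder reduction. For symmetric A the Hessenberg
# form is tridiagonal: A = Q T Q^T.
T, Q = hessenberg(A, calc_q=True)
d, e = np.diag(T), np.diag(T, -1)

# Phase 2: eigendecomposition of the tridiagonal matrix.
w, S = eigh_tridiagonal(d, e)

# Phase 3: back transformation of the eigenvectors.
V = Q @ S
assert np.allclose(A @ V, V * w, atol=1e-8)
```

Phases 1 and 3 each cost O(n³) flops, phase 2 only O(n²) or so per eigenvalue, which is why the matrix transformations dominate for large n.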
The recent work of Parlett and Dhillon offers the promise of computing the tridiagonal eigendecomposition with O(n²) flops and O(p) messages. Should some unexpected hitch prevent this from being satisfactory on some matrix types, there are several other algorithms from which to choose. Experience with existing implementations shows that for most matrices of size 2000 by 2000 or larger, the tridiagonal eigendecomposition is a modest component of total execution time.
Reduction to tridiagonal form and back transformation are the most time consuming steps in the symmetric eigendecomposition of dense matrices. These two steps require more flops (O(n³) vs. O(n²)), more message volume (O(n²√p) vs. O(n²)) and more messages (O(n log(p)) vs. O(p)) than the eigendecomposition of the tridiagonal matrix. Since the cost of the matrix transformations (reduction to tridiagonal form and back transformation) grows faster than the cost of the tridiagonal eigendecomposition, the matrix transformations are the dominant cost for larger matrices.
Reduction to tridiagonal form and back transformation require different communication patterns. Reduction to tridiagonal form is a two-sided transformation requiring multiplication by Householder reflectors from both the left and the right side. Two-sided reductions require that every element in the trailing matrix be read for each column eliminated, hence half of the flops are BLAS2 matrix-vector flops and O(n log(p)) messages are required.

Equally importantly, two-sided reductions require significant calculations within the inner loop, which translates into large software overhead. Indeed, on the computers that we considered, software overhead appears to be a larger factor in limiting the efficiency of reduction to tridiagonal form than communication.
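An unblocked serial sketch makes the structure concrete: for each eliminated column, a symmetric matrix-vector product (a DSYMV) reads the entire trailing matrix, followed by a symmetric rank-2 update (a DSYR2). This is illustrative code, not the ScaLAPACK implementation:

```python
import numpy as np

def tridiagonalize(A):
    """Unblocked Householder reduction to tridiagonal form.
    Returns (d, e, Q) with A = Q * tridiag(d, e) * Q^T."""
    A = A.copy().astype(float)
    n = A.shape[0]
    Q = np.eye(n)
    for k in range(n - 2):
        x = A[k + 1:, k]
        norm_x = np.linalg.norm(x)
        if norm_x == 0.0:
            continue                               # column already zero
        alpha = -norm_x if x[0] >= 0 else norm_x   # avoid cancellation
        v = x.copy()
        v[0] -= alpha
        v /= np.linalg.norm(v)                     # unit Householder vector
        B = A[k + 1:, k + 1:]                      # trailing matrix (a view)
        # Symmetric matrix-vector product (DSYMV): reads the whole
        # trailing matrix -- half the flops of the reduction live here.
        p = 2.0 * (B @ v)
        w = p - (p @ v) * v
        # Symmetric rank-2 update (DSYR2): B <- B - v w^T - w v^T
        B -= np.outer(v, w) + np.outer(w, v)
        A[k + 1, k] = A[k, k + 1] = alpha
        A[k + 2:, k] = A[k, k + 2:] = 0.0
        Q[:, k + 1:] -= 2.0 * np.outer(Q[:, k + 1:] @ v, v)
    return np.diag(A).copy(), np.diag(A, -1).copy(), Q
```

The DSYMV inside the loop is exactly the inner-loop work that cannot be deferred into a blocked BLAS3 call, which is the source of both the BLAS2 flops and the software overhead discussed above.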
Back transformation is a one-sided transformation with updates that can be formed any time prior to their application. Hence back transformation requires O(n/nb) messages (where nb is the data layout block size) and far less software overhead than reduction to tridiagonal form.

Chapters 4 and 5 discuss the execution time of reduction to tridiagonal form and back transformation, as implemented in ScaLAPACK, in detail.
2.3 Implementations
2.3.1 Parallel abstraction and languages
There are three common ways of expressing parallelism in linear algebra codes: message passing, shared memory and calls to the BLAS. Message passing programs tend to keep communication to a minimum, in part because the communication is specified directly. Shared memory codes can outperform message passing codes when load imbalance costs outweigh communication costs[118]. All calls to the BLAS offer potential parallelism, though the potential for speedup varies. ScaLAPACK uses message passing, while LAPACK exposes parallelism through calls to the BLAS.
In some cases, recent compilers are able to identify the parallelism in codes that may not have been written specifically for parallel execution[172, 171]. However, experience has shown that programs designed for sequential machines rarely exhibit the properties necessary for efficient parallel execution; hence some research into parallelizing compilers has switched its emphasis to parallelizing codes written in languages such as HPF[94, 110] which allow the programmer to express parallelism and allow some control over data layout.
Codes written in any standard sequential language, such as C, C++ or Fortran, can achieve high performance, especially if the majority of the operations are performed within calls to the BLAS. If the flops are performed within codes written in the language itself, the execution time will depend upon the code and the compiler more than on the language used. If pointers are used carelessly in C, the compiler may not be able to determine the data dependencies exactly and may have to forgo certain optimizations[172]. On the other hand, carefully crafted C codes, tuned for individual architectures and compiled with modern optimizing compilers, can achieve performance that rivals that of carefully tuned assembly codes[23, 168].
2.3.2 Algorithmic blocking
A blocked code is one that has been recast to allow some of the flops to be performed as efficient BLAS3 matrix-matrix multiply flops[6, 4]. Typically a block of columns is reduced using an unblocked code, followed by a matrix-matrix update of the trailing matrix. The algorithmic blocking factor is the number of columns (or rows) in the block column. In serial codes, data layout blocking does not exist and hence the algorithmic blocking factor is referred to simply as the blocking factor. In ScaLAPACK version 1.5, the algorithmic blocking factor is set to match the data layout blocking factor.
2.3.3 Internal Data Layout
Most of the flops in blocked dense linear algebra codes involve a rank-k update, i.e. A′ = A + B·C where A ∈ R^(m×n), B ∈ R^(m×k), C ∈ R^(k×n), m and n are O(n), and k is the algorithmic blocking factor (a tuning parameter typically much smaller than n or m). A may be triangular, and B and/or C may be transposed or conjugate transposed. Hence the internal data layout must support good performance on such rank-k updates.

A is typically updated in place, i.e. the node which owns element A_{i,j} computes and stores A′_{i,j}. This is called the owner computes rule and is motivated by the high cost of data movement relative to the cost of floating point computation. If k is large enough, a 3D data layout is more efficient[1, 12], and performance can be improved further by using Strassen's matrix multiply[157, 96, 70]. Some dense linear algebra codes, including LU, can be recursively partitioned[165], resulting in large values of k for the majority of the flops. Nonetheless, though a 3D data layout might be best for a recursively partitioned LU, reduction to tridiagonal form is most efficient with a modest algorithmic blocking factor, and hence it is more efficient to update A in place; we will make that assumption for the rest of this discussion.
If A is to be updated in place, a 2D layout minimizes the total communication requirement for rank-k updates. The elements of B and C which must be sent to each node are determined by the elements of A owned by that node. The node that owns element A_{i,j} must obtain a copy of row i of B and column j of C. The number of elements of matrices B and C that a given node must obtain is k times the number of rows and columns of A for which the node owns at least one element. If a node must own r² elements, the number of elements of B and C which must be obtained is minimized if the node owns a square submatrix of A corresponding to r rows and r columns. In a 2D layout, the processors are arranged in a rectangular grid. Each row of the matrix is assigned to a row of the processor grid. Each column is assigned to a column of the processor grid.
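The counting argument above is easy to check numerically. The function below compares a 1D row layout with a square 2D layout (an illustrative sketch):

```python
import math

def recv_volume(n, p, k, layout):
    """Elements of B and C one node must obtain for the rank-k update
    A <- A + B*C under the owner-computes rule: k per owned row of A
    (from B) plus k per owned column of A (from C)."""
    if layout == "1d":                 # a node owns n/p complete rows of A
        rows, cols = n / p, n
    elif layout == "2d":               # square grid: an (n/sqrt(p))^2 block
        rows = cols = n / math.sqrt(p)
    else:
        raise ValueError(layout)
    return k * (rows + cols)

print(recv_volume(1024, 64, 32, "1d"))   # 32 * (16 + 1024)
print(recv_volume(1024, 64, 32, "2d"))   # 32 * (128 + 128)
```

With 64 processors the 1D layout forces each node to fetch all of C, roughly four times the volume of the 2D layout in this example.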
The common ways of assigning the rows and columns to the processor grid in a 2D layout are: block, cyclic and block-cyclic. For the following descriptions, we will assume that we are distributing n rows of A over pr processor rows. In a cyclic layout, row i is assigned to processor row (i − 1) mod pr. In a block layout, row i is assigned to processor row ⌊(i − 1)/⌈n/pr⌉⌋. In a block-cyclic data layout, row i is assigned to processor row ⌊(i − 1)/nb⌋ mod pr, where nb is the data layout block size. The block-cyclic data layout includes the other two as special cases.
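The three assignments, and the claim that block-cyclic subsumes the other two, can be verified directly (processor rows numbered from 0, matrix rows from 1):

```python
import math

def cyclic(i, pr):
    return (i - 1) % pr

def block(i, n, pr):
    return (i - 1) // math.ceil(n / pr)

def block_cyclic(i, nb, pr):
    return ((i - 1) // nb) % pr

n, pr = 10, 3
rows = range(1, n + 1)

# nb = 1 reduces block-cyclic to the cyclic layout ...
assert [block_cyclic(i, 1, pr) for i in rows] == [cyclic(i, pr) for i in rows]
# ... and nb = ceil(n/pr) reduces it to the block layout.
nb = math.ceil(n / pr)
assert [block_cyclic(i, nb, pr) for i in rows] == [block(i, n, pr) for i in rows]
```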
Block-cyclic data layouts simplify algorithmic blocking and are used in most parallel dense linear algebra libraries[68, 98, 164]. However, by separating algorithmic blocking from data blocking it is usually¹ possible to achieve high performance from a cyclic data layout[91, 140, 44, 158].

One-dimensional data layouts require O(n²) data movement per node (compared to O(n²/√p) for 2D data layouts) and are generally less efficient. However, there are certain situations in which 1D data layouts are preferred. If the communication pattern is strictly one-dimensional (i.e. only along rows or columns), a 1D data layout requires no communication. Furthermore, some applications, such as LU, require much more communication in one direction than in the other². Hence, for modest numbers of processors it may be better to use a 1D data layout.
A square processor grid can greatly simplify symmetric reductions, allowing lower overhead codes. Furthermore, I believe that pipelining and lookahead (see Section 2.4.2) can only be used effectively on symmetric reductions (such as Cholesky and reduction from generalized to standard form) when a square processor grid is used³.

All existing parallel dense linear algebra libraries use the same input data layout as the internal data layout. In Chapter 8 I will demonstrate that this is not necessary to achieve high performance, and that in fact performance can be improved by using a different data layout internally than the input and output data layout.
¹ Block-cyclic data layouts still maintain an advantage over cyclic data layouts on machines with high communication latency, especially in those algorithms, such as Cholesky and back transformation, that require only O(n/nb) messages, where nb is the data layout block size.
² LU with partial pivoting requires O(n log(p)) messages within the processor columns but only O(n/nb) messages within the processor rows[31, 40, 30]. The total volume of communication, however, is similar in both directions.
³ Pipelining and lookahead cannot be used in reduction to tridiagonal form because of its synchronous nature.
2.3.4 Libraries
Software libraries can improve portability, robustness, performance and software re-use. ScaLAPACK is built on top of the BLAS and BLACS and hence will run on any system on which a copy of the BLAS[63, 62] and BLACS[169, 69] can be obtained.

Libraries, and their interfaces, have both a positive and a negative effect on performance. The existence of a standard interface to the BLAS means that by improving the performance of a limited set of routines, i.e. the BLAS, one can improve the performance of the entire LAPACK and ScaLAPACK library and other codes as well. Hence, many manufacturers have written optimized BLAS for their machines. In addition, Bilmes et al.[23, 168] have written a portable high performance matrix-matrix multiply, and two other research groups have written high performance BLAS that depend only on the existence of a high performance matrix-matrix multiply[51, 103, 104]. Portable high performance BLAS offer the promise of high performance on LAPACK and ScaLAPACK codes without the expense of hand coded BLAS.
However, adhering to a particular library interface necessarily rules out some possibilities. The BLACS do not support asynchronous receives, a costly limitation on the Paragon. The BLAS do not meet all computational needs[108], especially in parallel codes[91]; hence the programmer is faced with the choice of reformulating code to use what the BLAS offer or avoiding the BLAS and trusting the compiler to produce high performance code. Furthermore, the interface itself implies some overhead, at the very least a subroutine call but typically much more than that[161]. Strazdins[161] showed that software overhead in ScaLAPACK accounts for 15-20% of total execution time even for the largest problems that fit in memory on a Fujitsu VP1000.
2.3.5 Compilers
Compiler code generation is relatively unimportant to LAPACK and ScaLAPACK
performance, because these codes are written so that most of the work is done in the calls
to the BLAS. By contrast, EISPACK is written in Fortran without calls to the BLAS and hence
its performance is dependent on the quality of the code generated by the Fortran compiler.
Lehoucq and Carr[35] argue that compilers now have the capability to perform many of the optimizations that the LAPACK project performed by hand. Although no compilers existing today can produce code as efficient as LAPACK from simple three line loops, the compiler technology exists[149, 115, 148].
Today, most compilers are able to produce good code for single loops, reducing the
performance advantage of the BLAS1 routines. Soon compilers will be able to produce good
code for BLAS2 and even BLAS3 routines. This will require us to rethink certain decisions,
especially where the precise functionality that we would like is lacking. There will be an
awkward period, probably lasting decades, during which some but not all compilers will be
able to perform comparably to the BLAS.
2.3.6 Operating Systems
Operating systems are largely irrelevant to serial codes such as LAPACK, but they can have a significant impact on parallel codes. Consider, for example, the broadcast capability inherent in Ethernet hardware: it is unavailable to applications because the TCP/IP protocol does not expose it. Furthermore, at least 90% of message latency cost is attributable to software, and the operating system often makes it difficult to reduce that cost. Part of the NOW[3] project involves finding ways to reduce the large message latency inherent in Unix operating systems by using user-level to user-level communication, avoiding the operating system entirely.
2.4 Hardware
2.4.1 Processor
The processor, or more specifically the floating point unit, is the fundamental source of processing power or the ultimate limit on performance, depending on your point of view. The combined speed of all of the floating point units is the peak performance, or speed of light, for that computer. For many dense linear algebra codes, the number of floating point operations cannot be reduced substantially, and hence the goal is to perform the necessary flops as fast (i.e. as close to peak performance) as possible.
Floating point arithmetic
The increasing adherence to the IEEE standard 754 for binary floating point arithmetic[7] benefits performance in two ways: it reduces the effort needed to make codes work across multiple platforms, and it allows one to take advantage of details of the underlying arithmetic in a portable code. The developers of LAPACK had to expend considerable effort to make their codes work on machines with non-IEEE arithmetic, notably older Cray machines. By contrast, the developers of ScaLAPACK chose to concentrate on machines conforming to IEEE standard 754, allowing them not only to avoid the hassles of old Cray arithmetic, but also to check the sign bit directly when using bisection[54] to compute the eigenvalues of a tridiagonal matrix.
Consistent floating point arithmetic is also important for execution on heterogeneous machines. Demmel et al.[54] discuss ways to achieve correct results in bisection on a heterogeneous machine. I have proposed having each process compute a subset of eigenvalues, chosen by index, sharing those eigenvalues among all processes and then having each process independently sort the eigenvalues[55].
Ironically, the one place where the IEEE standard 754 allows some flexibility has caused problems for heterogeneous machines. The IEEE standard 754 allows several options for handling sub-normalized numbers, i.e. numbers that are too small to be represented as normalized numbers. During ScaLAPACK testing it was discovered that a sub-normalized number could be produced on a machine that adheres to the IEEE standard 754 completely, and that when this number is passed to the DEC Alpha 21064 processor, the DEC Alpha 21064 does not recognize it as a legitimate number and aborts. Fixing this would have required xdr to be smart enough to recognize this unusual situation4 or making one of the processors work in a manner different from its default5.
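The hazard is easy to reproduce in any language whose floats are IEEE doubles. The sketch below (Python, illustrative only) produces a sub-normalized number by gradual underflow and detects it by comparison against the smallest normalized double; on a processor that, like the DEC Alpha 21064 in its default mode, traps on sub-normalized operands, merely consuming such a value aborts the program.

```python
import sys

# Smallest positive normalized IEEE double, 2^-1022.
smallest_normal = sys.float_info.min

# Ordinary arithmetic produces a sub-normalized (denormal) result by
# gradual underflow; a processor that traps on such operands aborts here.
subnormal = smallest_normal / 2.0

def is_subnormal(x):
    """True for nonzero doubles smaller in magnitude than the smallest normal."""
    return x != 0.0 and abs(x) < sys.float_info.min
```

Note that gradual underflow preserves ordering (subnormal is still strictly greater than zero), which is precisely why a conforming machine can produce such values in the course of a correct computation.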
2.4.2 Memory
The slower speed of main memory (as compared to cache or registers) affects performance in three ways. It reduces the performance of matrix-matrix multiply slightly and greatly complicates the task of coding an efficient matrix-matrix multiply. It bounds from below the algorithmic blocking factor needed to achieve high performance in matrix-matrix multiply. And, it limits the performance of BLAS1 and BLAS2 codes.
The last two factors listed above combine in an unfortunate manner: slow main memory increases the number of BLAS1 and BLAS2 flops and reduces the rate at which they are executed. The number of BLAS1 and BLAS2 flops is typically O(n^2 nb), where nb is the algorithmic blocking factor, which, as stated above, must be larger when main memory is slow. The ratio of peak floating point performance to main memory speed is large enough on some machines that the O(n^2 nb) cost of the BLAS1 and BLAS2 flops can no longer be ignored.
4. This would slow down xdr, possibly significantly.
5. This too would result in slower execution.
Improving the load balance of the O(n^2 nb) BLAS1 and BLAS2 flops
In a blocked dense linear algebra transformation, such as LU decomposition, Cholesky or QR, there are O(n^2 nb) BLAS1 and BLAS2 flops[30, 53]. PDSYEVX includes two blocked dense linear algebra transformations: reduction to tridiagonal form, PDSYTRD, is described in Section 4.2, and back transformation, PDORMTR, is described in Section 4.4.
In ScaLAPACK version 1.5, the O(n^2 nb) BLAS1 and BLAS2 flops are performed by just one row or column of processors. This leads to load imbalance and causes these flops to account for O(n^2 nb / sqrt(p)) execution time. If these flops can be performed on all p processors, instead of just one row or column, they will account for only O(n^2 nb / p) execution time.
There are two ways to spread the cost of the O(n^2 nb) BLAS1 and BLAS2 flops over all the processors: take them out of the critical path or distribute them over all processors. Transformations such as LU and back transformation (applying a series of Householder vectors) can be pipelined, allowing each processor column (or row) to execute asynchronously. Pipelining in turn allows lookahead, a process by which the active column performs only those computations in the critical path before sending that data on to the next column[32].
Distributing the BLAS1 and BLAS2 flops over all of the processors, as discussed in the last paragraph, requires a different data distribution, a different broadcast and a significant change to the code. The difference is best illustrated by considering LU. In a 2D blocked LU, LU is first performed on a block of columns, and the resulting LU decomposition is broadcast, or spread, across all processor columns. One way to broadcast k elements to p processors is to combine a Reduce scatter (which takes k elements and sends k/p to each processor) with an Allgather (which takes k/p elements from each processor and spreads them out to all processors, giving each processor a copy of all k elements). There are three ways to perform LU on this column block of data: 1) Before the column block is broadcast to all processors (as ScaLAPACK does), in which case only the current column of processors is involved in performing the column LU, and the Reduce scatter and Allgather combine to broadcast the block LU decomposition. 2) After the broadcast, in which case the Reduce scatter and Allgather combine to broadcast the block column prior to the LU decomposition; all processor columns would have a copy of the block column, and each processor column could perform the column block LU redundantly. 3) After the Reduce scatter but before the Allgather. In this case, the Reduce scatter operates on the column block prior to the LU decomposition but the Allgather operates on the block column after the LU decomposition. All processors can be involved in the LU decomposition.
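The equivalence underlying all three options, that a Reduce scatter followed by an Allgather amounts to an all-reduce, and hence to a broadcast when only one process contributes, can be checked with a toy model. The Python below simulates p processes as lists; reduce_scatter and allgather are simplified stand-ins for the MPI collectives MPI_Reduce_scatter and MPI_Allgather:

```python
def reduce_scatter(local_data, p):
    """local_data[rank] is each process's length-k vector; returns, for each
    process, its k/p chunk of the elementwise sum across all processes."""
    k = len(local_data[0])
    chunk = k // p
    total = [sum(col) for col in zip(*local_data)]
    return [total[i * chunk:(i + 1) * chunk] for i in range(p)]

def allgather(chunks):
    """Each process contributes its chunk; every process receives the
    concatenation of all chunks."""
    full = [x for c in chunks for x in c]
    return [full[:] for _ in range(len(chunks))]

p, k = 4, 8
# Broadcast as a special case of all-reduce: only process 0 contributes.
data = [[float(i) for i in range(k)] if rank == 0 else [0.0] * k
        for rank in range(p)]
result = allgather(reduce_scatter(data, p))
# every process now holds process 0's k elements
```

Option 3 exploits exactly the intermediate state of this pipeline: after the Reduce scatter, each process owns a distinct k/p chunk, so all p processes can work on their chunks before the Allgather reassembles the full result everywhere.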
In HJS, Hendrickson, Jessup and Smith's symmetric eigensolver[91, 154] discussed in Section 7.1.2, the BLAS1 and BLAS2 flops are analogously distributed over all of the processors.
Lookahead does not improve performance unless the execution of the code is
pipelined, i.e. proceeds in a wave pattern over the processes. Two-sided reductions, like
tridiagonal reduction, do not allow pipelining. And, pipelining may be limited on reductions
of symmetric or Hermitian matrices (such as Cholesky)6.
Memory size
The amount of main memory limits the size of the problem that can be executed efficiently, while the amount of virtual memory limits the size of the problem that can be run at all. ScaLAPACK's symmetric eigensolvers, PDSYEVX and PDSYEV, require roughly 4n^2 and 2n^2 double precision words of virtual memory respectively. However, both can be run efficiently provided that physical memory can contain7 the n^2/2 elements of the triangular matrix A. Ed D'Azevedo[52] has written an out-of-core symmetric eigensolver for ScaLAPACK and studied the performance of PDSYEV and PDSYEVX on large problem sizes.
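A hypothetical back-of-the-envelope helper (not part of ScaLAPACK) makes these workspace figures concrete:

```python
def eigensolver_memory_gb(n, words_factor):
    """Approximate virtual-memory footprint in GB: words_factor * n^2
    double precision (8-byte) words. Per the text, words_factor is
    roughly 4 for PDSYEVX and 2 for PDSYEV."""
    return words_factor * n * n * 8 / 2**30

# A 10000 x 10000 PDSYEVX run needs roughly 3 GB of virtual memory.
approx = eigensolver_memory_gb(10000, 4)
```
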
2.4.3 Parallel computer configuration
I will discuss primarily distributed memory computers with one processor per node, discussing shared memory computers (SMPs), clusters of workstations and clusters of shared memory computers only briefly.
Four machine characteristics are important for distributed memory computers: peak floating point performance, software overhead, communication latency and communication (bisection) bandwidth. Software overhead and communication latency are the dominant costs for small problems8. Peak floating point performance is the dominant cost for large problems9.
6. I believe that pipelining can be used in Cholesky if a square processor grid is used. Work in progress.
7. Depending on the page size, keeping an n by n triangular matrix in memory may require as few as n^2/2 memory locations (if the page size is 1) or as many as n^2 (if the page size is >= n).
Interconnection network
Bisection bandwidth and communication latency are the two important measures of an interconnection network. Networks which allow only one pair of nodes to communicate at a time do not offer adequate bisection bandwidth, and hence parallel dense linear algebra (with the possible exception of huge matrix-matrix multiplies) will not perform well on such a network.
As long as the bisection bandwidth is adequate, the topology of the interconnection network has not proven to be an important factor in the performance of parallel dense linear algebra.
Shared Memory Multiprocessing
Users of dense linear algebra codes have two choices on shared memory multiprocessors. They can use a serial code, such as LAPACK, that has been coded in terms of the BLAS and, provided that the manufacturer has supplied an optimized BLAS, they will achieve good performance. Or, provided that the manufacturer provides MPI[65], PVM[19] or the BLACS, they can use ScaLAPACK.
LeBlanc and Markatos[118] argue that shared memory codes typically get better load balance while message passing codes typically incur lower communication costs. However, the real difference could well come down to how efficient the underlying libraries are.
Clusters of workstations
Some clusters of workstations, notably the NOW project[3] at Berkeley, offer communication performance comparable to that of distributed memory computers. However, the vast majority of networks of workstations in present use are still connected by Ethernet or FDDI rings and hence do not have the low latency and high bisection bandwidth required to perform dense linear algebra reductions efficiently in parallel.
8. On current architectures, n < 100*sqrt(p) is small for our purposes.
9. On current architectures, n > 1000*sqrt(p) is large for our purposes.
Cluster of SMPs (CLUMPS)
Dense linear algebra codes have two choices on clusters of SMPs: they can assign one process to each processor or one process to each multi-processor node. The tradeoff will be similar to the shared-memory versus message-passing question on shared memory computers.
If each processor is assigned a separate process, the details of how the processes are assigned to what is essentially a two-level grid of processors will be important. For a modest cluster of SMPs (say 4 nodes, each with 4 processors) it might make sense to assign one dimension within the node and the other across the nodes. However, this will not scale well: adding nodes will require increasing the bandwidth per node, else all dense linear algebra transformations will become bandwidth limited as the number of nodes increases. A layout that is 2 dimensional within the nodes and 2 dimensional among the nodes allows both the number of processors per node and the number of nodes to increase, provided only that bisection bandwidth grows with the number of processors and that internal bisection bandwidth (i.e. main memory bandwidth) grows with the number of processors per node.
On the first CLUMPS, how well each of the libraries is implemented is likely to outweigh theoretical considerations. Shared memory BLAS are not trivial, nor will be communication systems that properly handle two levels of processor hierarchy, i.e. communication within a node and communication between nodes.
On most distributed memory systems, the logical to physical processor grid mapping is of secondary importance. I suspect that this will not be the case for clusters of SMPs. It will be important to have the processes assigned to the processors of a particular node be nearby in the logical process grid as well.
2.5 Applications
Large symmetric eigenproblems arise in a variety of applications, including: real-time signal processing[156, 34], modeling of acoustic and electro-magnetic waveguides[114], quantum chemistry[74, 22, 175], numerical simulations of disordered electronic systems[95], vibration mode superposition analysis[18], statistical mechanics[132], molecular dynamics[152], quantum Hall systems[112, 106], material science[166], and biophysics[143, 144].
The needs of these applications differ considerably. Many require considerable execution time to build the matrix, and hence the eigensolution remains a modest part of the total execution time. However, building the matrix often parallelizes easily and its cost grows much more slowly than the O(n^3) cost of eigensolution. Hence, for these applications, the eigensolver becomes the bottleneck as larger problems are solved in parallel. Few applications require the entire spectrum, but most of those listed above require at least 10% of the spectrum and hence are best solved by dense techniques. Some have large clusters of eigenvalues[74], while others do not.
2.5.1 Input matrix
Three features of the input matrix affect the execution time of symmetric eigensolvers: sparsity, eigenvalue clustering and spectral diagonal dominance.
Sparsity
Some algorithms and codes are specifically designed for sparse input matrices. Lanczos[49] has traditionally been used to find a few eigenvalues and eigenvectors at the ends of the spectrum. Recently, ARPACK[119] and PARPACK[130] have been developed based on Lanczos with full re-orthogonalization. They can therefore compute as much of the spectrum as the user chooses.
The Invariant Subspace Decomposition Approach and algorithms based on reduction to tridiagonal form can both be run from either a dense or a banded matrix. In this dissertation, I discuss only dense matrices.
Spectrum
Some algorithms are more dependent on the spectrum than others. Most are dependent in some manner, but that dependence differs from one algorithm to another.
It is difficult to maintain orthogonality of the eigenvectors when computing the eigendecomposition of matrices with tight clusters of eigenvalues. Such matrices require special techniques in divide and conquer and in inverse iteration (see Section 2.7.4). On the other hand, divide and conquer experiences the most deflation, and hence the greatest efficiency, on matrices with clustered eigenvalues.
The Invariant Subspace Decomposition Approach maintains orthogonality on matrices with clustered eigenvalues. However, it may have difficulty picking a good split point if the clustering causes the eigenvalues to be unevenly distributed.
Spectral diagonal dominance
Spectral diagonal dominance10 speeds convergence of the Jacobi algorithm. Indeed, if the input matrix is sufficiently diagonally dominant, Jacobi may converge in as few as two steps (versus 10 to 20 for non diagonally dominant matrices). But, spectral diagonal dominance has little effect on any of the other algorithms.
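The behavior described above is easy to observe in miniature. The following pure-Python sketch (a textbook cyclic Jacobi sweep, not any production code) applies one sweep of plane rotations to a strongly diagonally dominant 3 by 3 matrix; a single sweep collapses the off-diagonal mass:

```python
import math

def jacobi_sweep(A):
    """One cyclic Jacobi sweep: for each off-diagonal pair (p, q), apply the
    orthogonal plane rotation that zeroes A[p][q]. A is a symmetric
    list-of-lists, modified in place."""
    n = len(A)
    for p in range(n - 1):
        for q in range(p + 1, n):
            if A[p][q] == 0.0:
                continue
            # rotation angle from the 2x2 symmetric eigenproblem in plane (p, q)
            theta = 0.5 * math.atan2(2.0 * A[p][q], A[q][q] - A[p][p])
            c, s = math.cos(theta), math.sin(theta)
            for k in range(n):      # apply rotation to columns p and q
                akp, akq = A[k][p], A[k][q]
                A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
            for k in range(n):      # apply rotation to rows p and q
                apk, aqk = A[p][k], A[q][k]
                A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk

def off(A):
    """Frobenius norm of the off-diagonal part of A."""
    return math.sqrt(sum(A[i][j] ** 2 for i in range(len(A))
                         for j in range(len(A)) if i != j))

# Diagonally dominant test matrix: large, well-separated diagonal entries.
A = [[10.0, 0.1, 0.2],
     [0.1, 20.0, 0.3],
     [0.2, 0.3, 30.0]]
before = off(A)
jacobi_sweep(A)
after = off(A)   # off-diagonal mass drops sharply after one sweep
```

Because the rotations are orthogonal similarity transformations, the trace (and the spectrum) is preserved while the matrix is driven toward diagonal form; on a matrix this dominant, the rotation angles are tiny and fill-in is second order, which is the mechanism behind the fast convergence noted above.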
2.5.2 User request
The portion of the spectrum that the user needs, i.e. the number of eigenvalues and/or eigenvectors, affects the execution time of some, but not all, eigensolvers.
Two step band reduction (to tridiagonal form) is most attractive when only eigenvalues are requested, because the back transformation task is expensive in two step band reduction.
The cost of bisection and inverse iteration depends upon the number of eigenvalues and eigenvectors requested. These costs are O(n^2) and generally not significant for large problem sizes. However, back transformation requires 2n^2 m flops, where m is the number of eigenvectors required.
Iterative methods, such as Lanczos[49] and implicitly restarted Lanczos[119], are clearly superior if only a few eigenvectors are required.
10. Spectrally diagonally dominant means that the eigenvector matrix, or a permutation thereof, is diagonally dominant. Most, but not all, diagonally dominant matrices are spectrally diagonally dominant. For example, if you take a dense matrix with elements randomly chosen from [-1, 1] and scale the diagonal elements by 1e3, the resulting diagonally dominant matrix will be spectrally diagonally dominant. However, if you take that same matrix and add 1e3 to each diagonal element, the eigenvector matrix is unchanged even though the matrix is clearly diagonally dominant.
2.5.3 Accuracy and orthogonality requirements
Demmel and Veselić[58] prove that on scaled diagonally dominant matrices11, Jacobi can compute small eigenvalues with high relative accuracy while tridiagonal based methods can fail to do so.
At present, ScaLAPACK offers two symmetric eigensolvers: PDSYEVX and PDSYEV. PDSYEVX, which is based on bisection and inverse iteration (DSTEBZ and DSTEIN from LAPACK), is faster and scales better but does not guarantee orthogonality among eigenvectors associated with clustered eigenvalues. PDSYEV, which is based on QR iteration (DSTEQR from LAPACK), is slower and does not scale as well but does guarantee orthogonal eigenvectors.
2.5.4 Input and Output Data layout
At present, the execution time of the ScaLAPACK symmetric eigensolver is strongly dependent on the data layout chosen by the user for the input and output matrices. 1D data layouts are not scalable and lead to both high communication costs and poor load balance. Suboptimal block sizes can likewise affect performance significantly. In particular, a block size of 1, i.e. a cyclic data layout, causes ScaLAPACK to send a large number of small messages, resulting in unacceptable message latency costs and a huge number of calls to the BLAS. If the block size is too large, load balance suffers.
There are a couple of ways to reduce this dependence on the data layout chosen by the user. If algorithmic blocking is separated from data layout blocking[140, 91, 159], small data layout block-sizes can be handled much more efficiently. However, small block-sizes (especially cyclic layouts) still require more messages than larger block-sizes. And, large block sizes still lead to load imbalance.
In Chapter 8 I will show that redistributing the data to an internal format that is near optimal for the particular machine and algorithm involved improves performance and makes that performance independent of the input and output data layout.
2.6 Machine Load
The load on the machine, in addition to the direct effect of offering your program only a portion of the total cycles, can have several indirect effects. If each processor is individually scheduled, performance can be arbitrarily poor, because significant progress is only possible when all processes are scheduled concurrently. A loaded machine may also cause your data to be swapped out to disk, which can greatly reduce performance. Finally, it is the most heavily loaded processor which controls execution time. If your code is running on 9 unloaded processors and one processor with a load factor of 5, you will get no more than a factor of 10/5 speedup. A ScaLAPACK user has reported performance degradation and speedup less than 1 (i.e. more processors taking longer to complete the same sized eigendecomposition) on the IBM SP2. I have also witnessed this behavior on the IBM SP2 at the University of Tennessee at Knoxville, and I have reason to suspect that the IBM SP2 is not gang scheduled and that this fact accounts for a large part of the poor performance of PDSYEVX that the user and I have witnessed on the IBM SP2.
11. A matrix A is scaled diagonally dominant if and only if DAD, with D = |diag(A)|^(-1/2), is diagonally dominant.
Space sharing, i.e. allocating disjoint subsets of the processors to each job, solves all of these problems, but has problems of its own. On some machines, jobs running on different partitions share the same communication paths, and hence if one job saturates the network, all jobs may suffer.
2.7 Historical notes
2.7.1 Reduction to tridiagonal form and back transformation
Householder reduction to tridiagonal form is a two-sided reduction, which requires multiplication by Householder reflectors from both the left and the right side. Martin et al. implemented reduction to tridiagonal form in Algol[129]. TRED1 and TRED2 perform reduction to tridiagonal form in EISPACK[153]. Dongarra, Hammarling and Sorensen[64] showed that Householder reduction to tridiagonal form can be performed using half matrix-vector and half matrix-matrix multiply flops. This has been implemented as DSYTRD in LAPACK[5, 67] for scalar and shared memory multiprocessors and as PDSYTRD for distributed memory computers in ScaLAPACK[42]. Chang et al. implemented one of the first parallel codes for reduction to tridiagonal form, first using a 1D cyclic data layout[37] and then a 2D cyclic data layout[38].
Smith, Hendrickson and Jessup[91] show that data layout blocking is not required for efficient algorithmic blocking and that PDSYTRD pays a substantial execution time penalty for its generality (accepting any processor layout) and portability (being built on top of the PBLAS, BLACS and BLAS). By restricting their attention to square processor layouts on the PARAGON, they were able to dramatically reduce the overhead incurred in reduction to tridiagonal form in HJS. HJS does not have the redundant communication found in PDSYEVX; it makes many fewer BLAS calls, avoids the overhead of the PBLAS calls, and spreads the work more evenly among all the processors (improving load balance). Furthermore, HJS, by using communication primitives better suited to the task, reduces both the number of messages sent and the total volume of communication substantially. Some, but not all, of these advantages require that the processor layout be square. HJS is discussed in Section 7.1.2.
Other ways to reduce the execution time of reduction to tridiagonal form do not require that the processor layout be square. Bischof and Sun[25] and Lang[116] showed that in a two step band reduction to tridiagonal form, asymptotically all of the flops can be performed in matrix multiply routines. Karp, Sahay, Santos and Schauser[107] showed that subset broadcasts and reductions can be performed optimally. Van de Geijn and others[16] are working to implement improved subset broadcast and reduction primitives.
Hegland et al.[90] argue that the fastest way to reduce a symmetric matrix A to tridiagonal form on the VPP500 (a multiprocessor vector supercomputer by Fujitsu) is to compute L_1 D L_1^T = A and then compute a series of L_i using orthonormal transformations such that L_{n+p-1} D L_{n+p-1}^T is tridiagonal. Their technique is, in essence, a two step band reduction in which the two steps are performed within the same loop. Let L_i[:, own(α)] represent the columns of L_i owned by processor α, and let αQ_i denote the portion of Q_i which processor α owns.
The code is:
  L_1 D L_1^T = A
  For i = 1 to n-1 do:
    Each processor independently performs:
      αQ_i = House(L_i[:, own(α)] D_i[own(α), own(α)] L_i[:, own(α)]^T)
      L_{i+1}[:, own(α)] = αQ_i L_i[:, own(α)]
    The processors together perform:
      Allgather(L_{i+1}[:, i+1 : i+p])
    Each processor performs redundantly:
      Q'_i = House(L_{i+1}[:, i+1 : i+p] D[i+1 : i+p, i+1 : i+p] L_{i+1}[:, i+1 : i+p]^T)
      L_{i+1}[:, i+1 : i+p] = Q'_i L_{i+1}[:, i+1 : i+p]
In Allgather(L_{i+1}[:, i+1 : i+p]), each processor contributes the column of L_{i+1}[:, i+1 : i+p] which it owns, and all processors end up with identical copies of L_{i+1}[:, i+1 : i+p].
The loop invariants are as follows. Let T_i = (L_i) D (L_i)^T. Then:
  T_i(j, k) = 0 for all j < i and k > j + p    (Line 1)
  T_i(1 : i-p, 1 : i-p) is tridiagonal    (Line 2)
For p = 1, the serial case, both of these conditions are identical, and meeting them requires computing the first column of (L_i) D (L_i)^T, computing the Householder vector and applying it to L_i to yield L_{i+1}.
For p > 1, the parallel case, the first loop invariant is maintained by each processor independently computing the first column of (L_i) D (L_i)^T, using only the local columns12 of L_i. A Householder vector is computed from this and applied to the local columns of L_i. The second loop invariant is maintained redundantly on all processors. All processors obtain copies of columns i to i+p-1 of L_i and compute A(1:p, 1) = L_i(i : i+p-1, i : i+p-1) D(i : i+p-1, i : i+p-1) L(i : i+p-1, i)^T. A Householder vector is computed from A(1:p, 1) and applied to L_i(i : i+p-1, :), redundantly on all processors, maintaining the second loop invariant.
This one-sided transformation requires fewer messages than Hessenberg reduction to tridiagonal form and, for small p, less message volume, but requires twice as many flops.
2.7.2 Tridiagonal eigendecomposition
Sequential symmetric QL and QR algorithms
The implicit QL and QR algorithms have been the most commonly used methods for solving the symmetric eigenproblem for the last couple of decades. Francis[79] wrote the first implementation of the QR algorithm based on Rutishauser's LR transformation. The QL algorithm is the basis of the EISPACK routine IMTQL1, while the LAPACK routine DSTEQR uses either implicit QR or implicit QL depending on the top and bottom diagonal elements[86]. Henry[93] shows that if, between each sweep of QR (or QL) in which the eigenvectors are updated, an additional sweep is performed in which the eigenvectors are not updated, better shifts can be used, reducing the total number of flops from roughly 6n^3 to 4n^3.
Reinsch[145] wrote EISPACK's TQLRAT, which computes eigenvalues without square roots. LAPACK's DSTERF improves on TQLRAT using a root free variant developed by Pal, Walker and Kahan[134]. Like DSTEQR, DSTERF uses either implicit QR or implicit QL depending on the top and bottom diagonal elements.
12. Their implementation uses a column cyclic data distribution.
Parallel symmetric QL and QR algorithms
QR requires O(n^2) effort to compute the eigenvalues and O(n^3) effort to compute the eigenvectors. No one has found a good, stable way to parallelize the O(n^2) cost of computing the eigenvalues and reflectors. Sameh and Kuck[113] use parallel prefix to parallelize QR for eigenvalue extraction. They obtain O(1/log(p)) speedup, but they do not show how their method can be used to generate reflectors and hence eigenvectors.
However, parallelizing the O(n^3) effort of computing the eigenvectors is straightforward, as shown by Chinchalkar and Coleman[39] and Arbenz et al.[8], and implemented for ScaLAPACK by Fellers[76].
Symmetric QR parallelizes nicely in a MIMD programming style, but efforts to parallelize it on a shared memory machine in which the parallelism is strictly within the calls to the BLAS have produced only modest speedups. Bai and Demmel[13] first suggested using multiple shifts in non-symmetric QR. Arbenz and Oettli[10] showed that blocking and multiple shifts could be used to obtain modest improvements in the speed (roughly a factor of 2 on 8 processors) of QR for eigenvalues and eigenvectors on the ALLIANT FX/80. Kaufman[109] showed that multi-shift QR could be used to speed eigenvalue extraction by a factor of 3 on a 2-processor Cray YMP despite tripling the number of flops performed.
Sturm sequence methods
Givens[83] used bisection to compute the eigenvalues of a tridiagonal matrix based on Wilkinson's original idea. Kahan[105] showed that bisection can compute small eigenvalues with tiny componentwise relative backward error, and sometimes high relative accuracy. High relative accuracy is required for inverse iteration on a few matrices. Barlow and Evans were the first to use bisection in a parallel code[15].
Computing the eigenvalues of a tridiagonal matrix can be split into three phases: isolation, separation and extraction. The isolation phase identifies, for each eigenvalue, an interval which contains that eigenvalue and no other. The separation phase improves the eigenvalue estimate. And the extraction phase computes the eigenvalue to within some tolerance. Bisection can be used for all three phases.
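All three phases rest on one kernel: a Sturm sequence count of the eigenvalues of T lying below a shift x, i.e. the negative inertia of T - xI. The sketch below is a serial Python toy, not PDSTEBZ; count_less_than and bisect_eigenvalue are illustrative names:

```python
def count_less_than(d, e, x):
    """Number of eigenvalues of the symmetric tridiagonal matrix with
    diagonal d and off-diagonal e that are less than x, via the standard
    Sturm sequence recurrence (the LDL^T pivots of T - xI); a zero pivot
    is perturbed to avoid division by zero."""
    count, t = 0, 1.0
    for i in range(len(d)):
        b2 = e[i - 1] ** 2 if i > 0 else 0.0
        t = (d[i] - x) - b2 / t
        if t == 0.0:
            t = 1e-300
        if t < 0.0:
            count += 1
    return count

def bisect_eigenvalue(d, e, k, lo, hi, tol=1e-12):
    """Extraction phase: find the k-th smallest eigenvalue (k = 1, 2, ...)
    by bisection, assuming [lo, hi] brackets the whole spectrum."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if count_less_than(d, e, mid) < k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# T = tridiag(-1, 2, -1) of order 4; smallest eigenvalue is 2 - 2cos(pi/5).
d, e = [2.0] * 4, [-1.0] * 3
lam1 = bisect_eigenvalue(d, e, 1, 0.0, 4.0)
```

Isolation uses the same count (an interval [a, b) contains exactly count_less_than(b) - count_less_than(a) eigenvalues), which is why one routine can drive all three phases.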
Neither the existing codes nor the literature explicitly distinguish between these three phases, but they have very different computational aspects. Isolation, at least to the point of identifying p intervals so that each processor is responsible for one interval, is difficult to parallelize, whereas the other phases are fairly straightforward. The separation phase is typically the challenge for most root finders, and the area where they distinguish themselves from other codes. Divide and conquer techniques, which use the eigenvalues of perturbed matrices as estimates of the eigenvalues of the original matrix, isolate and may separate the roots.
Techniques for eigenvalue isolation include: multi-section[126, 14], assigning different parts of the spectrum to different processors[95, 20], divide and conquer, and using multiple processors to compute the inertia of a tridiagonal matrix[123]. In multi-section, each processor computes the inertia at a single point, splitting an interval into p+1 intervals. Although multi-section requires communication, Crivelli and Jessup[48] show that the communication cost is often a modest part of the total cost. Divide and conquer splits the matrix by perturbing or ignoring a couple of elements, typically near the center of the matrix, to separate the matrix into two tridiagonal matrices whose eigenvalues can be computed separately. If a rank 1 perturbation is chosen, the merged set of eigenvalues provides a set of intervals in each of which exactly one eigenvalue lies.
There are a number of ways to use multiple processors to compute the inertia of a tridiagonal matrix. Lu and Qiao [127] discuss using parallel prefix to compute the Sturm sequence as the sub-products of a series of 2 by 2 matrices; Mathias [131] did an error analysis and showed that this approach is unstable. Ren [146] tried unsuccessfully to repair parallel prefix. Conroy and Podrazik [46] perform LU on a block arrowhead matrix; each block is tridiagonal and the arrow has width equal to the number of blocks. Swarztrauber [162] and Krishnakumar and Morf [111] discuss ways of computing the determinants of 4 matrices of size roughly n by n from the determinants of 8 matrices of size roughly n/2 by n/2. Each of these methods performs 2 to 4 times more floating point operations than a serial Sturm sequence count would and requires O(log(p)) messages. Except for Conroy and Podrazik's method, they all use multiplies instead of divides. Multiplies are faster than divides, but require special checks to avoid overflow.
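The serial Sturm sequence count at the heart of these methods is short enough to sketch. The following is an illustrative Python version (the function name and the pivot guard constant are my own; the production codes are written in Fortran and C with additional safeguards):

```python
def sturm_count(d, e, sigma, pivmin=1e-300):
    """Count eigenvalues of the symmetric tridiagonal matrix T
    (diagonal d, off-diagonal e) that are less than sigma.

    Uses the LDL^T pivot recurrence: by Sylvester's law of inertia,
    the number of negative pivots of T - sigma*I equals the number of
    eigenvalues below the shift.  Note the divide and the comparison
    in the inner loop, which dominate the cost on most machines."""
    count = 0
    t = d[0] - sigma
    if t < 0:
        count += 1
    for i in range(1, len(d)):
        if abs(t) < pivmin:          # guard against a zero pivot
            t = -pivmin
        t = (d[i] - sigma) - e[i - 1] ** 2 / t
        if t < 0:
            count += 1
    return count
```

The sign-bit trick mentioned below replaces the `t < 0` comparison; the divide is the other inner-loop cost the proposed optimizations target.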
The computation of the inertia is slowed by the existence of a divide and a comparison in the inner loop. A couple of tricks can potentially be used to speed the computation of the inertia, either by reducing the number of divides and comparisons or by making them faster. ScaLAPACK's PDSYEVX uses signed zeroes and the C language's ability to extract the sign bit of a floating point number to avoid a comparison in the inner loop [54]. I have proposed perturbing tiny entries in the tridiagonal matrix to guarantee that negative zero will never occur, thus allowing a standard C or Fortran comparison against zero; such a comparison would allow compilers to produce more efficient code. I have also proposed reducing the number of divides in the inner loop by taking advantage of the fixed exponent and mantissa sizes in IEEE double precision numbers. I have not implemented either of these ideas. Some machines have two types of divide: a fast hardware divide that may be incorrect in the last couple of bits, and a slower but correct software divide.
Demmel, Dhillon and Ren [54] give a proof of correctness for PDSTEBZ, ScaLAPACK's bisection code for computing the eigenvalues of a tridiagonal matrix, in the face of heterogeneity and non-monotonic arithmetic (such as sloppy divides). This shows that bisection can be robust even in the face of incorrect divides.
Many techniques have been used to accelerate eigenvalue extraction, including: the secant method [33], Laguerre's iteration [138], Rayleigh quotient iteration [163], secular equation root finding [50] and homotopy continuation [120, 45]. Bassermann and Weidner use a Newton-like root finder called the Pegasus method [17]. These acceleration techniques converge super-linearly as long as the eigenvalues are separated.
Li and Ren [121] accelerate eigenvalue separation in their Laguerre-based root finder by detecting linear convergence and estimating the effect of the next several steps. Brent [33] discusses ways of separating eigenvalues when the secant method is used. Li and Zeng use an estimate of the multiplicity in their root finder based on Laguerre iteration [122]. Szyld [163] uses inverse iteration, with the shift set to the middle of the interval known to contain only one eigenvalue, to separate eigenvalues before switching to Rayleigh quotient iteration. Cuppen's method takes advantage of multiple eigenvalues through deflation.
Eigenvalue extraction can be performed in parallel with no communication, or with a small constant amount of communication. However, eigenvalue extraction can exhibit poor load balance, especially if acceleration techniques are used. Ma and Szyld [128] use a task queue to improve load balance. Li and Ren [121] minimize load imbalance by concentrating on worst case performance.
ScaLAPACK chose bisection and inverse iteration for its first tridiagonal eigensolver, PDSYEVX, because they are fast, well known, robust, simple and parallelize easily. ScaLAPACK has since added a QR-based tridiagonal eigensolver for those applications needing guarantees on orthogonality among eigenvectors corresponding to large clusters of eigenvalues. See Section 4.3 for details.
Divide and Conquer
Cuppen [50] showed that by making a small perturbation to a tridiagonal matrix it could be split into two separate tridiagonal matrices, each of which could be solved independently, and that the eigendecomposition of the original tridiagonal matrix could then be constructed from the eigendecompositions of the two independent tridiagonal matrices and the perturbation.
There are many ways to perturb a tridiagonal matrix such that the result is two separate tridiagonal matrices. The following four have been implemented. Cuppen's algorithm [50] subtracts beta * u * u^T from the tridiagonal matrix, where u = e_{n/2} + e_{n/2+1} and beta = T_{n/2,n/2+1}. Gu and Eisenstat [89] set all elements in row and column i to zero. Gates and Arbenz [82] call this a rank-one extension and refer to it as permuting row and column n/2 to the last row and column (as opposed to setting all elements in row and column i to zero). Gates [80] uses a rank-two perturbation: T_{n/2,n/2+1} (e_{n/2} e_{n/2+1}^T + e_{n/2+1} e_{n/2}^T) is subtracted from the original tridiagonal.
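The rank-one tearing can be checked numerically. The sketch below (my own illustration in NumPy, not code from any of the cited packages) splits a tridiagonal matrix in Cuppen's manner and verifies that the original matrix is recovered as the block-diagonal pair plus beta * u * u^T:

```python
import numpy as np

def cuppen_split(d, e):
    """Split the symmetric tridiagonal T (diagonal d, off-diagonal e)
    into two independent tridiagonal halves plus a rank-one correction,
    T = diag(T1, T2) + beta * outer(u, u), with u = e_m + e_{m+1}
    (unit vectors, m = n/2) and beta the torn-out coupling entry."""
    n = len(d)
    m = n // 2
    beta = e[m - 1]
    d1, e1 = d[:m].copy(), e[:m - 1].copy()
    d2, e2 = d[m:].copy(), e[m:].copy()
    # Cuppen subtracts beta from the two diagonal entries adjacent to
    # the torn coupling so that the remainder is exactly rank one.
    d1[-1] -= beta
    d2[0] -= beta
    u = np.zeros(n)
    u[m - 1] = u[m] = 1.0
    return (d1, e1), (d2, e2), beta, u

def tridiag(d, e):
    """Assemble a dense symmetric tridiagonal matrix for checking."""
    return np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
```

The merged eigenvalues of the two halves then provide the intervals, each containing exactly one eigenvalue of T, mentioned earlier.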
Cuppen's original divide and conquer method can result in a loss of orthogonality among the eigenvectors. Three methods of maintaining orthogonality have been implemented. Sorensen and Tang [155] calculate the roots to double precision. Gu and Eisenstat [89] compute the eigenvectors of a slightly perturbed problem. Gates [81] showed that inverse iteration and Gram-Schmidt re-orthogonalization could be used in divide and conquer codes to compute orthogonal eigenvectors.
Several divide and conquer codes are available today. The first publicly available divide and conquer code, TREEQL, was written by Dongarra and Sorensen [66]. The fastest reliable serial code currently available for computing the full eigendecomposition of a tridiagonal matrix is LAPACK's DSTEDC [147]. It is based on Cuppen's divide and conquer [50] and uses Gu and Eisenstat's [88] method to maintain orthogonality.
There has long been interest in parallelizing divide and conquer codes because of the obvious parallelism in the early stages. There are three reasons why this technique has proven difficult to parallelize. The first is that the majority of the flops are performed at the root of the divide and conquer tree, and hence the parallelism at the leaves is less valuable [36]. The second is that deflation, the property that makes DSTEDC the fastest serial code, leads to dynamic load imbalance in parallel codes. The third is the complexity of the serial code itself.
Dongarra and Sorensen's parallel code [66], SESUPD, was written for a shared memory machine. The first parallel divide and conquer codes written for distributed memory computers used a 1D data layout (thus limiting their scalability) [99, 81]. Potter [141] has written a parallel divide and conquer code for small matrices (it requires a full copy of the matrix on each node). Françoise Tisseur has written a parallel divide and conquer code for inclusion in ScaLAPACK.
Inverse Iteration
Inverse iteration with eigenvalue shifts is typically used to compute the eigenvectors once the eigenvalues are known [170]. Jessup and Ipsen [102] explain the use of Gram-Schmidt re-orthogonalization to ensure that the eigenvectors are orthogonal. Fann and Littlefield [75] found that inverse iteration and Gram-Schmidt can be performed in parallel, greatly improving their efficiency. Parlett and Dhillon [139, 59] are working on a method, based on work by Fernando, Parlett and Dhillon [77], that may avoid, or greatly reduce, the need for re-orthogonalization.
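The basic iteration is easy to sketch. The following illustrative NumPy version (not DSTEIN; a dense solve stands in for the tridiagonal factorization, and the function name is my own) computes the eigenvector nearest a given shift:

```python
import numpy as np

def inverse_iteration(T, sigma, iters=4, seed=0):
    """Compute an eigenvector of the symmetric matrix T whose
    eigenvalue is closest to the shift sigma, by repeatedly solving
    (T - sigma*I) v_new = v and normalizing.  Each solve amplifies the
    component along the nearby eigenvector by roughly 1/(lambda - sigma),
    so convergence is fast when sigma is an accurate eigenvalue
    estimate."""
    n = T.shape[0]
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)          # random starting vector
    M = T - sigma * np.eye(n)
    for _ in range(iters):
        v = np.linalg.solve(M, v)
        v /= np.linalg.norm(v)
    lam = v @ T @ v                     # Rayleigh quotient estimate
    return lam, v
```

When two shifts are nearly equal, the two computed vectors converge toward the same direction, which is why the re-orthogonalization discussed here becomes necessary.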
The Jacobi method
The Jacobi method for the symmetric eigenproblem consists of applying a series of rotations, each of which forces a single off-diagonal element to zero. Each such rotation reduces the square of the Frobenius norm of the off-diagonal elements by the square of the element which was eliminated. Hence, as long as the off-diagonal elements to be eliminated are reasonably chosen, the norm of the off-diagonal converges to zero [167].
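A single cyclic sweep can be sketched as follows (an illustrative NumPy version; real implementations update only the two affected rows and columns rather than forming the full rotation matrix, and the function names are my own):

```python
import numpy as np

def jacobi_sweep(A):
    """One cyclic Jacobi sweep: for each off-diagonal position (p, q),
    apply a rotation chosen to zero A[p, q].  Each rotation reduces
    the squared Frobenius norm of the off-diagonal part by
    2*A[p, q]**2.  A is overwritten and returned."""
    n = A.shape[0]
    for p in range(n - 1):
        for q in range(p + 1, n):
            if A[p, q] == 0.0:
                continue
            # Rotation angle that annihilates A[p, q] (the standard
            # textbook choice): tan(2*theta) = 2 A[p,q] / (A[q,q] - A[p,p]).
            theta = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
            c, s = np.cos(theta), np.sin(theta)
            J = np.eye(n)
            J[p, p] = J[q, q] = c
            J[p, q], J[q, p] = s, -s
            A[:] = J.T @ A @ J
    return A

def offdiag_norm(A):
    """Frobenius norm of the off-diagonal part of A."""
    return np.sqrt(np.sum(A ** 2) - np.sum(np.diag(A) ** 2))
```

After a few sweeps the diagonal converges to the eigenvalues, and the accumulated rotations (not kept here) form the eigenvectors.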
There are several variations of the Jacobi method. Classical Jacobi [100] selects the largest off-diagonal element as the element to eliminate at each step, and hence requires the fewest steps. However, O(n^2) comparisons are required at each step to select the largest element, requiring O(n^4) comparisons per sweep, rendering it unattractive. Cyclic Jacobi annihilates every element once per sweep in some specified order. Threshold Jacobi differs from cyclic Jacobi in that only those elements larger than a given threshold are annihilated. Block Jacobi annihilates an entire block of elements at each step.
Cyclic, threshold and block variants of Jacobi each have their advantages. Cyclic Jacobi is the simplest to implement. Block Jacobi requires fewer flops (and, if done in parallel, fewer messages) per element annihilated. Threshold Jacobi requires fewer steps and converges more surely than cyclic Jacobi; however, a parallel threshold Jacobi requires more communication. Scott et al. showed that a block threshold Jacobi method [151] is the best Jacobi method for distributed memory machines; however, it would also be the most complex to implement. Littlefield and Maschhoff [125] found that, for large numbers of processors, a parallel block Jacobi beat the tridiagonal-based methods available at that time.
One-sided Jacobi methods apply rotations to only one side of the matrix and force the columns of the matrix to be orthogonal; the columns hence represent scaled eigenvectors. One-sided Jacobi methods require fewer flops and may parallelize better [10, 21].
Existing parallel implementations of the Jacobi algorithm are based on a 1D data layout. Arbenz and Oettli [10] implemented a blocked one-sided Jacobi. Pourzandi and Tourancheau [142] show that overlapping communication and computation is effective in a Jacobi implementation on the i860-based NCUBE. Although a 1D data layout is not scalable, the huge computation to communication ratio in the Jacobi algorithm hides this on all machines available today.
There are two publicly available parallel Jacobi codes. Fernando wrote a parallel Jacobi code for NAG [87]. O'Neal and Reddy [133] wrote a parallel Jacobi code, PJAC, for the Pittsburgh Supercomputing Center.
Demmel and Veselić [58] prove that, on scaled diagonally dominant matrices, Jacobi can compute small eigenvalues with high relative accuracy while tridiagonal-based methods cannot. Demmel et al. [56] give a comprehensive discussion of the situations in which Jacobi is more accurate than other available algorithms.

The Jacobi method is discussed further in Section 7.3.
2.7.3 Matrix-matrix multiply based methods
There are several methods for solving the symmetric eigenproblem which can be made to use only matrix-matrix multiply.

Matrix-multiply-based methods are attractive because they can be performed efficiently on all computers, and they scale well. However, they require many more flops (typically 6 to 60 times more) than reduction to tridiagonal form, tridiagonal eigensolution and back transformation. Hence, these methods only make sense if tridiagonal-based methods cannot be performed efficiently or do not yield answers that are sufficiently accurate.
Invariant Subspace Decomposition Algorithm
The Invariant Subspace Decomposition Algorithm [97], ISDA, for solving the symmetric eigenproblem involves recursively decoupling the matrix A into two smaller matrices. Each decoupling is achieved by applying an orthogonal similarity transformation, Q^T A Q, such that the first columns of Q span an invariant subspace of A. Such a Q is found by computing a polynomial function of A, p(A), which maps all the eigenvalues of A nearly to 0 or 1, and then taking the QR decomposition of p(A). One such polynomial can be computed by first shifting and scaling A to obtain A_0, all of whose eigenvalues are known to lie between 0 and 1 (by Gershgorin's theorem), and then repeatedly computing the beta function, A_{i+1} = 3A_i^2 - 2A_i^3, until all of the eigenvalues of A_i are effectively either 0 or 1. (All of the eigenvalues of A_0 that are less than 0.5 are mapped to 0; all the eigenvalues of A_0 that are greater than 0.5 are mapped to 1.)
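One decoupling step can be sketched as follows (an illustrative NumPy version, not the PRISM code; the fixed iteration count and the use of an unpivoted QR are simplifications of my own):

```python
import numpy as np

def isda_split(A, iters=50):
    """One ISDA decoupling step (illustrative sketch).  A is shifted
    and scaled via Gershgorin bounds so its eigenvalues lie in [0, 1],
    then the beta function B <- 3B^2 - 2B^3 is iterated, driving
    eigenvalues below 0.5 to 0 and those above 0.5 to 1.  The QR
    factorization of the (nearly) idempotent result yields a Q whose
    leading k columns span an invariant subspace of A."""
    n = A.shape[0]
    radii = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))
    lo = np.min(np.diag(A) - radii)
    hi = np.max(np.diag(A) + radii)
    B = (A - lo * np.eye(n)) / (hi - lo)   # eigenvalues now in [0, 1]
    for _ in range(iters):
        B2 = B @ B
        B = 3 * B2 - 2 * (B2 @ B)          # the beta function
    k = int(round(np.trace(B)))            # eigenvalues mapped to 1
    Q, _ = np.linalg.qr(B)
    return Q, k
```

After the step, Q^T A Q is (nearly) block diagonal, and the algorithm recurses on the two blocks.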
The ISDA parallelizes well because each of the tasks involved performs well in parallel [97]. Unfortunately, the ISDA requires far more floating point operations (roughly 100 n^3) than eigensolvers that are based on reducing the matrix first to tridiagonal form (which require 8n^3 + O(n^2) or fewer flops).
Applying the ISDA to banded matrices greatly reduces the flop count [26]. Furthermore, the banded matrix multiplications can still be performed efficiently, and the bandwidth does not triple with each application of A_{i+1} = 3A_i^2 - 2A_i^3 as one would expect with random banded matrices. Nonetheless, the bandwidth does grow enough to necessitate several band reductions, each of which requires a corresponding back transformation step.

A publicly available code based on the ISDA is available from the PRISM group [28].
The ISDA applied directly to the full matrix requires roughly 100n^3 flops, or 30 times as many as tridiagonal-reduction-based methods, and hence will never be as fast. Banded ISDA is almost a tridiagonal-based method, but is not likely to be the fastest method: the quickest way to compute eigenvalues from a banded matrix is to reduce the matrix first to tridiagonal form, and, if eigenvectors are required, banded ISDA will require at least twice and probably three times as many flops in back transformation.
FFT based invariant subspace decomposition
Yau and Lu [174] implemented an FFT-based invariant subspace decomposition method. This method requires O(log(n)) matrix multiplications. Tisseur and Domas [60] have written a parallel implementation of the Yau and Lu method.

FFT-based invariant subspace decomposition, like ISDA applied to dense matrices, requires roughly 100n^3 flops. Hence it, like ISDA, will never be as fast as tridiagonal-reduction-based methods.
Strassen's matrix multiply
Strassen's matrix-matrix multiply [157] can decrease the execution time for very large matrix-matrix multiplies by up to 20%, but will not make ISDA competitive. Several implementations of Strassen's matrix multiply have been able to demonstrate performance superior to conventional matrix-matrix multiply [96, 43]. However, Strassen's method is only useful when performing matrix-matrix multiplies in which all three matrices are very large, and Strassen's flop count advantage grows very slowly as the matrix size grows. In order to double Strassen's flop count advantage, the matrices being multiplied must be sixteen times as large, and hence memory usage must increase a thousand fold.
2.7.4 Orthogonality
Some methods, notably inverse iteration, require extra care to ensure that the eigenvectors are orthogonal. In exact arithmetic, if two eigenvalues differ, their corresponding eigenvectors will be orthogonal. However, if the input matrix has, say, a double eigenvalue, the eigenvectors corresponding to this double eigenvalue span a two-dimensional subspace, and hence there is no guarantee that two eigenvectors chosen at random from this space will be orthogonal. In floating point arithmetic, inverse iteration without re-orthogonalization may not produce orthogonal eigenvectors when two or more eigenvalues are nearly identical. In DSTEIN, LAPACK's inverse iteration code, when computing the eigenvectors for a cluster of eigenvalues, modified Gram-Schmidt re-orthogonalization is employed after each iteration to re-orthogonalize the iterate against all of the other eigenvectors in the cluster [102]. Modified Gram-Schmidt re-orthogonalization parallelizes poorly because it is a series of dot products and DAXPYs, each of which depends upon the result of the immediately preceding operation. PeIGs [74] and PDSYEVX [68] have chosen different responses to the fact that the re-orthogonalization in DSYEVX parallelizes poorly.
PeIGs alternates inverse iteration and re-orthogonalization in a different manner than DSYEVX. Instead of computing one eigenvector at a time, all of the eigenvectors within a cluster are computed simultaneously. For each cluster, PeIGs first performs a round of inverse iteration without re-orthogonalization using random starting vectors. Then, PeIGs performs modified Gram-Schmidt re-orthogonalization twice to orthogonalize the eigenvectors. PeIGs next performs a second round of inverse iteration without re-orthogonalization, using the output from the previous step as the starting vectors, repeating until sufficient accuracy is obtained for each eigenvector. Finally, PeIGs performs modified Gram-Schmidt re-orthogonalization one last time. The authors have shown that this method works on application matrices with large clusters of eigenvalues.
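Modified Gram-Schmidt itself is short; the sketch below (illustrative NumPy, not the DSTEIN or PeIGs code) shows the chain of dot products and axpy updates that creates the sequential dependence discussed above:

```python
import numpy as np

def modified_gram_schmidt(V):
    """Orthonormalize the columns of V in place by modified
    Gram-Schmidt.  Each column is orthogonalized against every
    previously finished column via a dot product followed by an
    axpy; each update depends on the result of the previous one,
    which is what makes this step hard to parallelize across the
    vectors of one cluster."""
    n, k = V.shape
    for j in range(k):
        for i in range(j):
            V[:, j] -= (V[:, i] @ V[:, j]) * V[:, i]
        V[:, j] /= np.linalg.norm(V[:, j])
    return V
```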
PDSYEVX attempts to assign the computation of all eigenvectors associated with each cluster of eigenvalues to a single processor. When enough space is available to accomplish this, PDSYEVX produces exactly the same results as DSYEVX. When the user does not provide enough local workspace, PDSYEVX relaxes the definition of a cluster repeatedly until it can assign the computation of all eigenvectors associated with each cluster of eigenvalues to a single processor.
When the input matrix contains one or more very large clusters of eigenvalues, PDSYEVX performs poorly: if enough workspace is available, PDSYEVX gives the same results as DSYEVX, but runs very slowly; if insufficient workspace is available, PDSYEVX does not guarantee orthogonality. Dhillon explains the fundamental problems in inverse iteration [59].

Recently, Parlett and Dhillon have identified new techniques for computing the eigenvectors of a symmetric tridiagonal matrix [136, 139]. These new results raise the hope that we will soon have an O(n^2) method for computing the eigenvectors of a symmetric tridiagonal matrix which parallelizes well and avoids the problems with computing the eigenvectors associated with clustered eigenvalues. ScaLAPACK looks forward to applying these new techniques in a future release.
Chapter 3
Basic Linear Algebra Subroutines
3.1 BLAS design and implementation
The BLAS [117, 63, 62], Basic Linear Algebra Subroutines, were designed to allow portable codes, most of whose operations are matrix-matrix multiplications, matrix-vector multiplications, and related linear algebra operations, to achieve high performance, provided that the BLAS achieve high performance. In LAPACK [4], the BLAS were used to re-express the linear algebra algorithms in the previous libraries LINPACK [61] and EISPACK [153], thereby achieving performance portability.
The BLAS routines are split into three sets. BLAS Level 1 routines involve only vectors, require O(n) flops (on input vectors of length n), and perform two or three memory operations for every two flops. BLAS Level 2 routines involve one n by n matrix, O(n^2) flops, and one or two memory operations for every two flops (rectangular matrices are also supported). BLAS Level 3 routines involve only matrices, O(n^3) flops and O(n^2) memory operations. BLAS Level 1 routines, because they involve only O(n) operations per invocation, have the least flexibility in how the operations are ordered, and require the most memory operations per flop. Hence, BLAS Level 1 routines have the lowest peak floating point operation rate. They also have the lowest software overhead, an important consideration because they perform few operations. BLAS Level 3 routines have the most flexibility in how the operations are ordered and require the fewest memory operations per flop, and hence achieve the highest performance on large tasks. BLAS Level 1 and 2 routines are typically limited by the speed of memory. BLAS Level 3 routines typically execute very near the peak speed of the floating point unit.
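The flop-to-memory-operation ratios behind these statements can be made concrete. The sketch below uses my own simple accounting (each matrix and vector element read or written is counted once, and cache reuse of the vectors is ignored):

```python
def blas_ratios(n):
    """Flops per memory operation for representative Level 1, 2, 3
    BLAS operations on problems of size n, under a simple counting
    model.  Level 1 is fixed at 2/3, Level 2 approaches 2, and
    Level 3 grows like n/2 -- which is why only Level 3 routines can
    run near the peak speed of the floating point unit."""
    # Level 1, DAXPY  y <- a*x + y : 2n flops; read x and y, write y.
    axpy = (2 * n) / (3 * n)
    # Level 2, DGEMV  y <- A*x + y : 2n^2 flops; read A (n^2 elements)
    # plus the vector traffic.
    gemv = (2 * n * n) / (n * n + 3 * n)
    # Level 3, DGEMM  C <- A*B + C : 2n^3 flops; read A, B, C, write C.
    gemm = (2 * n ** 3) / (4 * n * n)
    return axpy, gemv, gemm
```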
Typical hardware architectures make it possible, but not easy, to achieve high floating point execution rates for matrix-matrix multiply. Floating point units can initiate floating point operations every 2 to 5 nanoseconds, though floating point operations take 10 to 30 nanoseconds to complete, and main memory requires 20 to 60 nanoseconds per random data fetch. Floating point units achieve high throughput through concurrency, allowing multiple operations to be performed simultaneously, and pipelining, starting operations before the previous operation is complete. Register files are made large enough to provide source and target registers for as many operations as can be active at one time. Main memory throughput can be enhanced by interleaving memory banks and by fetching several words simultaneously (or nearly so) from main memory. Memory performance is further enhanced by the use of caches. Two levels of cache are now typical, and systems are now being designed with three levels.
High performance BLAS routines typically incur significant software overhead, because to achieve near the floating point unit's peak performance, BLAS routines need an inner loop that can keep the floating point units busy, surrounded by one or more levels of blocking to keep the memory accesses in the fastest memory possible. Managing concurrency and/or pipelining requires a long inner loop which operates on several vectors at once. Each level of blocking requires additional control code and separate loops to handle portions of the matrix that are not exact multiples of the block size. For example, DGEMV(1) (double precision matrix-vector multiplication) on the PARAGON has an average software overhead of 23 microseconds (over 1000 cycles at 50 MHz) and includes 200 instructions of error checking and case selection, 750 instructions for the transpose case and 500 for the non-transpose case(2).
3.2 BLAS execution time
The execution time for each call to a BLAS routine depends upon the hardware, the BLAS implementation, the operation requested and the state of the machine, especially the contents of the caches, at the time of the call. The time per DGEMV, or BLAS Level 2, flop is limited by the speed of the memory hierarchy level at which the matrix resides. The time per DGEMM(3), or BLAS Level 3, flop is typically limited primarily by the rate at which the floating point unit can issue and complete instructions. We will concentrate on DGEMM and DGEMV because they perform most of the flops in PDSYEVX.

Table 3.1: BLAS execution time (Time = alpha_i + number of flops * gamma_i, in microseconds)

                               BLAS Level 3           BLAS Level 2           BLAS Level 1
                peak flop rate overhead  time/flop    overhead  time/flop    overhead  time/flop
                (Mflops/sec)   alpha_3   gamma_3      alpha_2   gamma_2      alpha_1   gamma_1
  PARAGON
  Basic Math Library
  Software (Release 5.0)  50     300     .024  (41)      87     .026  (38)      3      .10  (10)
  IBM SP2
  ESSL 2.2.2.2           480       0     .0037 (270)      5     .0055 (180)    1.2     .01 (100)

(Times per flop are in microseconds; the equivalent Mflops/sec rates appear in parentheses.)

(1) DGEMV performs y = alpha*A*x + beta*y or y = alpha*A^T*x + beta*y, where A is a matrix, x and y are vectors, and alpha and beta are scalars.
(2) These instruction counts include all instructions routinely executed during the main loop in reduction to tridiagonal form. Not all are executed during each call to DGEMV.
Table 3.1 shows the software overhead and time per flop for the BLAS routines. These times are based on independent timings, with code cached but not data cached, using invocations that are typical for PDSYEVX. Recall that these parameters are used in a linear model of performance:

    Time = alpha + (number of flops) * gamma        (Line 1)
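Given measured times, the parameters of this linear model can be recovered by a least-squares fit. The sketch below is illustrative only, using synthetic "measurements" manufactured from the model (with the PARAGON Level 2 parameters of Table 3.1 as the true values) rather than real timings:

```python
import numpy as np

def fit_linear_model(flops, times):
    """Least-squares fit of the model  time = alpha + flops * gamma.
    Returns (alpha, gamma): the software overhead and the time per
    flop."""
    A = np.column_stack([np.ones_like(flops, dtype=float), flops])
    (alpha, gamma), *_ = np.linalg.lstsq(A, times, rcond=None)
    return alpha, gamma
```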
In PDSYEVX we are most concerned with the time per flop for Level 3 routines and secondarily concerned with the time per flop and software overhead for Level 2 routines. For n = 3840 and p = 64 on the PARAGON, the three largest components attributable to the items in Table 3.1 are: 28% of the PDSYEVX execution time is attributable to BLAS Level 3 floating point execution (not including software overhead), 8% is attributable to BLAS Level 2 floating point execution, and 5% is attributable to BLAS Level 2 software overhead. (See Chapter 5 for details.) The fact that the BLAS 3 software overhead for the IBM SP2 is listed as 0 stems from the fact that matrix-matrix multiply is faster for small problem sizes because they fit in cache(4).

(3) DGEMM performs C = alpha*A*B + beta*C or C = alpha*A^T*B + beta*C, where A, B and C are matrices, and alpha and beta are scalars.
[Figure 3.1: Performance of DGEMV on the Intel PARAGON. Each point plots the ratio of expected to actual execution time against the number of flops (values greater than 1 mean the actual call was faster than predicted); separate points are shown for data cached and data not cached.]

Figure 3.1 shows how actual DGEMV performance differs from the performance predicted by Line 1 on the PARAGON. Each point represents the time required for a call to DGEMV, with parameters that are typical of calls to DGEMV made in PDSYEVX, divided by the time predicted by our performance model. The timings are made by an independent timer as described in Section 3.3. The model matches quite well on most calls to DGEMV. It also shows a modest, but noticeable, difference between the cost when data is cached versus when it is not. If the software overhead term were removed (i.e., using number of flops * gamma_2 as the model), the model would underestimate execution time by a factor of two hundred or more on small problem sizes.

(4) We did not pursue this because BLAS 3 software overhead has little impact on PDSYEVX execution time.
Some calls to DGEMV require much less time than expected, as little as 1/9 of the predicted time, indicating that the software overhead is not independent of the type of call made. In particular, calls which involve very few flops can vary widely in their execution time relative to the predicted time. However, not many calls differ widely in their execution time, and those that do require few flops (hence little execution time). The fact that they do not match well does not significantly affect the accuracy of my performance model for PDSYEVX (given in Chapter 4), and hence I did not study them further.
Figure 3.2 shows that DGEMV on the PARAGON requires 10 to 50 microseconds longer if the code is not cached at the time it is called. The additional time required is estimated by subtracting the cost of running DGEMV alone from the cost of running DGEMV followed by 16,384 no-ops(5), while accounting for the execution time of the 16,384 no-ops themselves. The extra time required increases as the number of flops increases, and the extra time is greater when the data is not cached than when it is cached(6). It is not surprising that the extra time required when the code is not cached increases as the number of flops increases, because when few flops are involved, the code does not execute as many loops. However, it is surprising that the code cache miss cost in the "Data not cached" case appears to increase almost linearly with the number of flops; I would expect to see something closer to a step function. This deserves further study if it is determined that code cache misses substantially affect execution time.

Figure 3.3 shows that the extra time required by DGEMV ranges from 1.5% (when DGEMV performs many flops) to over 10% (when DGEMV performs few flops). Only calls made to DGEMV with parameters that are typical of the calls commonly made by PDSYEVX are shown. The extra time required when code is not cached can be up to 80% on calls made to DGEMV requiring very few flops, but these are rare in PDSYEVX.

(5) The code cache holds 8,192 no-ops. Hence, executing 16,384 no-ops guarantees that the no-ops are not in cache, making their execution time independent of what is in the code cache at the time they are executed.
(6) I compare the execution time when neither code nor data is cached to the execution time when code is cached but data is not when estimating the extra time required when data is not cached.
[Figure 3.2: Additional execution time required for DGEMV when the code cache is flushed between each call. The y-axis shows the difference between the time required for a run which executes 16,384 no-ops after each call to DGEMV and the time required for a run which executes the DGEMV calls and the 16,384 no-ops in two separate loops; the x-axis shows the number of flops. Curves are shown for data cached and data not cached.]
3.3 Timing methodology
Each routine is timed with several sets of input parameters. To time a routine with a given set of input parameters, the routine is run three times and the time from the third run is used. Each run consists of calling the routine to be timed repeatedly within a loop. The first run, in which the loop is run only once, ensures that the code is paged in. The second run, in which the loop is run just long enough to exceed the timer resolution, provides an estimate that is used to determine how many times to run the loop in the third run. The third run, in which the loop is run for approximately one second, is the only one whose execution time is recorded. We record both CPU time and wall clock time; the plots shown here are based on CPU time.

[Figure 3.3: Additional execution time required for DGEMV when the code cache is flushed between each call, as a percentage of the time required when the code is cached. See Figure 3.2. The x-axis shows the number of flops and the y-axis the percentage overhead; curves are shown for data cached and data not cached.]
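The three-run scheme can be sketched as follows (an illustrative Python harness of my own; the thesis timings use the BLACS timing routines described in Section 3.5):

```python
import time

def time_routine(routine, resolution=1e-3, target=1.0):
    """Three-run timing harness.  Run 1 pages the code in; run 2
    doubles the iteration count until the elapsed time exceeds the
    timer resolution; run 3 runs for about `target` seconds and is
    the only run whose time is reported (per call)."""
    routine()                                   # run 1: page code in
    iters = 1
    while True:                                 # run 2: beat the timer resolution
        t0 = time.perf_counter()
        for _ in range(iters):
            routine()
        elapsed = time.perf_counter() - t0
        if elapsed > resolution:
            break
        iters *= 2
    iters = max(1, int(iters * target / elapsed))
    t0 = time.perf_counter()                    # run 3: the recorded run
    for _ in range(iters):
        routine()
    return (time.perf_counter() - t0) / iters
```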
The input parameters for each run are randomly selected such that they match the input parameters of a typical call to DGEMV from PDSYEVX. Randomly selecting the input parameters provides advantages over a systematic choice. A systematic choice of input parameters might include, for example, only even values of k, whereas odd values of k might require significantly longer. Random selection means that the likelihood of identifying anomalous behavior is directly related to how often that behavior occurs in calls within PDSYEVX. Random selection also scales well: it is easy to increase or decrease the number of timings and/or the number of processors used.
3.4 The cost of code and data cache misses in DGEMV
Each set of input parameters is timed under four different cache situations:

  - Code and data cached
  - Code cached but data not cached
  - Code not cached but data cached
  - Neither code nor data cached

Data can be allowed to remain in cache (to the extent that it fits in cache) by using the same arrays in each call within the timing loop. Likewise, data can be prevented from remaining in cache by using different arrays for each call within the timing loop.
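The same-array versus different-arrays trick can be sketched as follows (my own illustration; the pool size calculation simply ensures the working set is much larger than the cache):

```python
import itertools
import numpy as np

def timing_arrays(n, cache_bytes, data_cached):
    """Yield the argument array for each successive timed call.  If
    data_cached, the same array is reused so it stays resident in
    cache; otherwise we cycle through enough distinct arrays that
    each one has been evicted before it is used again."""
    if data_cached:
        a = np.zeros((n, n))
        while True:
            yield a
    else:
        # Enough copies that the pool is about twice the cache size
        # (8 bytes per double-precision element).
        copies = max(2, (2 * cache_bytes) // (n * n * 8) + 1)
        pool = [np.zeros((n, n)) for _ in range(copies)]
        yield from itertools.cycle(pool)
```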
Allowing data to reside in cache reduces execution time in two ways: it reduces the cost of accessing the data in the arrays being operated on, and it reduces the software overhead cost, because software overhead also involves reading and writing data, notably while saving and restoring registers.

Code and data cache misses are more important in DGEMV than in DGEMM, because DGEMV is called more often than DGEMM, and because the ratio of flops to data movement is higher for DGEMM than for DGEMV, which reduces the cost of data cache misses in DGEMM.
3.5 Miscellaneous timing details
We make sure that timings are not affected by conditions which are not likely to be encountered in a typical run of PDSYEVX. Exceptional numbers (subnormal numbers and infinities) will occur only rarely in PDSYEVX(7). Hence, we make sure that exceptional numbers do not appear during our timing runs.

We do not time PDSYEVX on problem sizes that do not fit in physical memory. Hence, when timing the individual BLAS routines, we make sure that the arrays fit in physical memory. Ed D'Azevedo has written an out-of-core symmetric eigensolver and studied the effect of paging on PDSYEVX [52].

(7) The matrix is scaled before reduction to tridiagonal form to avoid being close to the overflow or underflow threshold. Although this does not prevent underflows (or subnormal numbers), it causes them to be rare. NaNs will never appear in PDSYEVX unless NaNs appear in the input.
We measure and report both wall clock time and CPU time. Wall clock time may
differ from CPU time for several reasons, including: time spent waiting for communication,
time spent on other processes, and time spent on paging and other operating system services.
When timing the BLAS, we are primarily interested in CPU time because there is no
communication and we are not interested in measuring the time spent waiting on other
processes. However, we measure and report wall clock time as well because for all other timings
we must rely on wall clock timings8. When the wall clock time differs substantially from
the CPU time on calls to the BLAS on time-shared systems (such as the IBM SP2), we use
the ratio of wall clock time to CPU time as a crude measure of the load on the system.
We use the timing routines included in the BLACS routines developed at the University
of Tennessee at Knoxville [169, 69] (which are not a part of the BLACS specification).
Many modern computers have cycle time counters which would allow much more detailed
measurement of execution time and often of other machine characteristics. These detailed
timing routines are not portable, and I chose to stick to portable timing techniques. Alternatively,
Krste Asanovic has developed a portable interface for taking performance-related
statistics over an "interval" of a code's execution [11].
8 CPU time is often meaningless when communication is involved.
Chapter 4
Details of the execution time of
PDSYEVX
4.1 High level overview of PDSYEVX algorithm
Figure 4.1 shows how PDSYEVX reduces the original (dense) matrix to tridiagonal
form (Line 1), uses bisection and inverse iteration to solve the tridiagonal eigenproblem
(Line 2), and then transforms the eigenvectors of the tridiagonal matrix back into the
eigenvectors of the original dense matrix (Line 3). PDSYEVX uses a two-dimensional block cyclic
data layout with an algorithmic block size equal to the data layout block size in both
Householder reduction to tridiagonal form and back transformation. When using bisection
to compute the eigenvalues, it assigns each process an essentially equal number of
eigenvalues to compute. For inverse iteration, PDSYEVX attempts to assign roughly equal
numbers of eigenvectors to each process while assigning all eigenvectors corresponding to
a given cluster of eigenvalues to the same process. Gram-Schmidt re-orthogonalization is
performed locally within each process, and hence orthogonality is not guaranteed for
eigenvectors corresponding to eigenvalues within a cluster that is too large to fit on a single
process.
We assume that only the lower triangle of the square symmetric matrix A contains
valid data on input and the algorithms only read and write this lower triangle. The general
conclusions of this thesis apply to the upper triangular case as well.
Please refer to Table A.1, Table A.2, and Table A in Appendix A for the list of
Figure 4.1: PDSYEVX algorithm

(Line 1)  A = Q T Qᵀ
          A ∈ Rⁿˣⁿ is the matrix whose eigendecomposition we seek. T is tridiagonal. Q is orthogonal.
(Line 2)  T = U Λ Uᵀ
          Λ = diag(λ₁, …, λₙ) is the diagonal matrix of eigenvalues. The columns of U = [u₁ … uₙ] are the eigenvectors of T: T uᵢ = λᵢ uᵢ.
(Line 3)  V = Q U
          The columns of V = [v₁ … vₙ] are the eigenvectors of A: A vᵢ = λᵢ vᵢ.
notation used in this chapter.
Section 4.2 describes and models reduction to tridiagonal form as performed by
PDSYTRD. Section 4.3 describes and models the tridiagonal eigensolution as performed by
PDSTEBZ (bisection) and PDSTEIN (inverse iteration). Section 4.4 describes and models back
transformation as performed by PDORMTR.
4.2 Reduction to tridiagonal form
4.2.1 Householder's algorithm
Figure 4.4 shows Householder's reduction to tridiagonal form, and Figure 4.5 shows a
model for the runtime of ScaLAPACK's reduction to tridiagonal form code, PDSYTRD. The
rest of this section explains the computation and communication pattern in PDSYTRD. We
begin by describing the classical (serial and unblocked) algorithm (essentially the EISPACK
algorithm TRED1 and also LAPACK's DSYTD2), then the blocked (but still serial) algorithm
(essentially the LAPACK algorithm DSYTRD), and finally the parallel blocked ScaLAPACK
algorithm PDSYTRD.
Classical (serial and unblocked) Householder reduction (Figure 4.2)
Figure 4.2 shows the algorithm for the classical (serial and unblocked) Householder
reduction to tridiagonal form (essentially the algorithm used in LAPACK's DSYTD2).
The first iteration through the loop performs an orthogonal similarity transformation
of the form A ← (I − βvvᵀ)A(I − βvvᵀ), where β = 2/‖v‖₂², such that only the first
two elements in the first column (and hence the first two elements in the first row) of A
are non-zero. Each iteration through the loop repeats these steps on the trailing submatrix
A(2:n, 2:n) to reduce A to tridiagonal form by a series of similarity transformations.
Compute an appropriate reflector (Line 2.1 in Figure 4.2)
We seek a reflector of the form I − βvvᵀ, with β = 2/(vᵀv), such that the first row and
column of (I − βvvᵀ)A(I − βvvᵀ) has zeroes in all entries except the first two.
Let z be the column vector A(2:n, 1). In exact arithmetic, any vector v = c[z₁ ± ‖z‖₂, z₂, …, zₙ] for any scalar c will suffice, and determines what value β must take. LAPACK
and ScaLAPACK choose the sign (±‖z‖₂) to match the sign of z₁ to minimize roundoff
errors, and choose c such that v(1) = 1.0. c can also be chosen to be 1, avoiding the
need to multiply z by c, at some small risk of over/underflow.
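The reflector computation just described can be sketched in Python/NumPy as follows (an illustrative serial version, not the LAPACK or ScaLAPACK code; the function name `house` follows the figures, c is chosen so that v(1) = 1.0, and the sign matches z₁):

```python
import numpy as np

def house(z):
    """Compute (beta, v) with v[0] = 1 such that (I - beta*v*v^T) z is
    zero except for its first entry.  Assumes z is a nonzero vector.
    The sign of the shift matches sign(z[0]) to avoid cancellation,
    as LAPACK and ScaLAPACK do."""
    z = np.asarray(z, dtype=float)
    v = z.copy()
    v[0] = z[0] + np.copysign(np.linalg.norm(z), z[0])
    v /= v[0]                      # scale so that v[0] = 1.0
    beta = 2.0 / np.dot(v, v)      # beta = 2 / (v^T v)
    return beta, v
```

Applying H = I − βvvᵀ to z then leaves only the first entry nonzero, with magnitude ‖z‖₂.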
Form the matrix vector product y = Av (Line 3.3 in Figure 4.2)
This is a matrix vector multiply (Basic Linear Algebra Subroutines Level 2) requiring
2(n−i)² flops, which when summed from i = 1 to n−1 totals (2/3)n³ flops.
Compute the companion update vector w = βy − (β²/2)(yᵀv)v (Line 5.1 in Figure 4.2)
The vector w (which is computed here with a dot product and a DAXPY) has the
property that (I − βvvᵀ)A(I − βvvᵀ) = A − vwᵀ − wvᵀ.
Update the matrix (Line 6.3 in Figure 4.2)
Compute A = A − vwᵀ − wvᵀ, a BLAS Level 2 rank-2 update. A rank-2 update requires
4 flops per element updated; only the lower triangular portion of A is updated, so this
requires 2(n−i)² flops, which summed over i = 1 to n−1 is (2/3)n³ flops.
Figure 4.2: Classical unblocked, serial reduction to tridiagonal form, i.e. EISPACK's TRED1 (the line numbers are consistent with Figures 4.3, 4.4 and 4.5)

do i = 1, n
  Compute reflector
  2.1  [β, v] = house(A(i+1:n, i))
       [v ∈ Rⁿ⁻ⁱ; β is a scalar. house computes a Householder vector such that
        (I − βvvᵀ)A(i+1:n, i) is zero except for the top element.]
  Perform matrix-vector multiply
  3.3  w = tril(A(i+1:n, i+1:n))·v + tril(A(i+1:n, i+1:n), −1)ᵀ·v
       [w ∈ Rⁿ⁻ⁱ; tril() is MATLAB notation for the lower triangular portion of
        a matrix (including the diagonal). tril(·, −1) refers to the portion of the
        matrix below the diagonal.]
  Compute companion update vector
  5.1  c = wᵀv;  w = β(w − (cβ/2)v)
  Perform rank-2 update
  6.3  A(i+1:n, i+1:n) = tril(A(i+1:n, i+1:n) − wvᵀ − vwᵀ)
       [Here we use tril to indicate that only the lower triangular portion of A
        need be updated.]
end do i = 1, n
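The loop of Figure 4.2 can be written out serially in NumPy (a didactic model of the classical algorithm, not the EISPACK or LAPACK code; for clarity the full symmetric matrix is updated rather than only its lower triangle):

```python
import numpy as np

def tridiagonalize(A):
    """Reduce symmetric A to tridiagonal form by Householder similarity
    transformations, following the loop of Figure 4.2."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for i in range(n - 2):
        z = A[i+1:, i].copy()
        if np.linalg.norm(z[1:]) == 0.0:
            continue                      # column already tridiagonal
        norm_z = np.linalg.norm(z)
        # Compute the reflector (Line 2.1): H = I - beta*v*v^T
        v = z.copy()
        v[0] += np.copysign(norm_z, z[0])
        beta = 2.0 / np.dot(v, v)
        # Matrix-vector product with the trailing submatrix (Line 3.3)
        y = A[i+1:, i+1:] @ v
        # Companion update vector (Line 5.1): H A H = A - v*w^T - w*v^T
        w = beta * (y - (beta * np.dot(y, v) / 2.0) * v)
        # Rank-2 update of the trailing submatrix (Line 6.3)
        A[i+1:, i+1:] -= np.outer(v, w) + np.outer(w, v)
        # The transformed column i is a multiple of e_1
        alpha = -np.copysign(norm_z, z[0])
        A[i+1, i] = A[i, i+1] = alpha
        A[i+2:, i] = 0.0
        A[i, i+2:] = 0.0
    return A
```

Since each step is an orthogonal similarity transformation, the result has the same eigenvalues as the input.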
Blocked Householder reduction to tridiagonal form (Figure 4.3)
In the above algorithm, nearly all the flops are performed in the product y = Av
or the rank-2 update A − vwᵀ − wvᵀ, both of which are BLAS Level 2 operations. Through
blocking, half of the flops can be executed as BLAS 3 flops, because k matrix updates
can be performed as one rank-2k update instead of k rank-2 updates. This is done in
Line 6.3 in Figure 4.3. The cost of blocking is significant in PDSYTRD, but so is the gain;
see Section 7.2.2. Blocking allows the matrix update to be considerably more efficient, but it
complicates the computation of the reflector and the computation of the companion update
vector, because PDSYTRD must work with an out-of-date matrix. Starting with A₀, the
computation of the first reflector v₀, the matrix vector product, and w₀ are unchanged, but
as soon as PDSYTRD attempts to compute the second reflector, v₁, it has to deal with the fact
that A₁ is known only in factored form, i.e. A₁ = A₀ − v₀w₀ᵀ − w₀v₀ᵀ. This does not greatly
complicate computing the reflector because the reflector needs only the first column of A₁.
Figure 4.3: Blocked, serial reduction to tridiagonal form, i.e. DSYTRD (see Figure 4.2 for unblocked serial code)

do ii = 1, n, nb
  mxi = min(ii + nb, n)
  do i = ii, mxi
    Update current (ith) column of A
    1.2  A(:, i) = A(:, i) − W(:, ii:i−1)·V(i, ii:i−1)ᵀ − V(:, ii:i−1)·W(i, ii:i−1)ᵀ
    Compute reflector
    2.1  [β, v] = house(A(i+1:n, i))        [v ∈ Rⁿ⁻ⁱ; β is a scalar]
    Perform matrix-vector multiply
    3.3  w = tril(A(i+1:n, i+1:n))·v + tril(A(i+1:n, i+1:n), −1)ᵀ·v        [w ∈ Rⁿ⁻ⁱ]
    Update the matrix-vector product
    4.1  w = w − W(:, ii:i−1)·(V(:, ii:i−1)ᵀ·v) − V(:, ii:i−1)·(W(:, ii:i−1)ᵀ·v)
    Compute companion update vector
    5.1  c = wᵀv;  w = β(w − (cβ/2)v)
         W(i+1:n, i) = w;  V(i+1:n, i) = v
  end do i = ii, mxi
  Perform rank-2k update
  6.3  A(mxi+1:n, mxi+1:n) = tril(A(mxi+1:n, mxi+1:n)
         − W(mxi+1:n, ii:mxi)·V(mxi+1:n, ii:mxi)ᵀ
         − V(mxi+1:n, ii:mxi)·W(mxi+1:n, ii:mxi)ᵀ)
end do ii = 1, n, nb
However, computing w₁ requires the computation of A₁v, hence we must either update the
entire matrix A₁, returning to an unblocked code, or compute y = (A₀ − v₀w₀ᵀ − w₀v₀ᵀ)v.
Computing the reflectors and the companion update vectors now requires that the current
column be updated (Line 1.2 in Figure 4.3) and that the matrix vector product be updated
(Line 4.1 in Figure 4.3).
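The equivalence that blocking exploits, namely that k accumulated rank-2 updates can be applied as one rank-2k update (two matrix-matrix products), can be seen in a few lines of NumPy (illustrative only; V and W hold the k reflectors and companion update vectors as columns):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 3
A = rng.standard_normal((n, n))
V = rng.standard_normal((n, k))   # k Householder vectors, one per column
W = rng.standard_normal((n, k))   # k companion update vectors

# k separate rank-2 updates (BLAS 2 style)
A1 = A.copy()
for j in range(k):
    A1 -= np.outer(V[:, j], W[:, j]) + np.outer(W[:, j], V[:, j])

# one rank-2k update (BLAS 3 style): two matrix-matrix products
A2 = A - V @ W.T - W @ V.T

assert np.allclose(A1, A2)
```

The BLAS 3 form touches each matrix element once per block rather than once per column, which is where the efficiency gain comes from.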
4.2.2 PDSYTRD implementation (Figure 4.4)
Figure 4.5 shows Householder's reduction to tridiagonal form along with a model
for the runtime of each step in ScaLAPACK's reduction to tridiagonal form code, PDSYTRD.
The rest of this section explains the computation and communication pattern in PDSYTRD,
and hence the inefficiencies.
Figure 4.4: PDSYEVX reduction to tridiagonal form (see Figure 4.3 for further details)

do ii = 1, n, nb
  mxi = min(ii + nb, n)
  do i = ii, mxi
    Update current (ith) column of A (Table 4.1)
    1.1  spread V(i, ii:i−1)ᵀ and W(i, ii:i−1)ᵀ down
         [the processor owning V(i, ii:i−1) and W(i, ii:i−1) broadcasts to all other
          processors in its processor column]
    1.2  A(:, i) = A(:, i) − W(:, ii:i−1)·V(i, ii:i−1)ᵀ − V(:, ii:i−1)·W(i, ii:i−1)ᵀ
         [V and W are used as they are stored; no data movement required]
    Compute reflector (Table 4.2)
    2.1  [β, v] = house(A(i+1:n, i))        [v ∈ Rⁿ⁻ⁱ; β is a scalar]
    Perform matrix-vector multiply (Table 4.3)
    3.1  spread v across
    3.2  transpose v, spread down
    3.3  w1 = tril(A(i+1:n, i+1:n))·v           [w1 is distributed like row A(i, :)]
         w2 = tril(A(i+1:n, i+1:n), −1)ᵀ·v      [w2 is distributed like column A(:, i)]
    3.4  sum w1 row-wise
    3.5  sum w2 column-wise
    3.6  w = w1 + w2
         [w is distributed like column A(:, i); hence w1 must be transposed]
    Update the matrix-vector product (Table 4.4)
    4.1  w = w − W(:, ii:i−1)·(V(:, ii:i−1)ᵀ·v) − V(:, ii:i−1)·(W(:, ii:i−1)ᵀ·v)
    Compute companion update vector (Table 4.5)
    5.1  c = wᵀv;  w = β(w − (cβ/2)v)
         W(i+1:n, i) = w;  V(i+1:n, i) = v
  end do i = ii, mxi
  Perform rank-2k update (Table 4.6)
  6.1  spread V(mxi+1:n, ii:mxi), W(mxi+1:n, ii:mxi) across
       [processors in the current column of processors broadcast to processors in
        other processor columns]
  6.2  transpose V(mxi+1:n, ii:mxi), W(mxi+1:n, ii:mxi), spread down
  6.3  A(mxi+1:n, mxi+1:n) = tril(A(mxi+1:n, mxi+1:n)
         − W(mxi+1:n, ii:mxi)·V(mxi+1:n, ii:mxi)ᵀ
         − V(mxi+1:n, ii:mxi)·W(mxi+1:n, ii:mxi)ᵀ)
end do ii = 1, n, nb
Figure 4.5: Execution time model for PDSYEVX reduction to tridiagonal form (see Figure 4.4 for details about the algorithm and indices; α = message latency, β = per-word transfer cost, γᵢ = per-flop costs; costs are totals over the whole reduction, assuming the standard data layout pr = pc = √p)

do ii = 1, n, nb;  mxi = min(ii + nb, n);  do i = ii, mxi
  Update current (ith) column of A
    1.1  spread Vᵀ and Wᵀ down:        2n·lg(√p)·α
    1.2  A = A − W·Vᵀ − V·Wᵀ:          2n·γ₄ + (n²·nb/√p)·γ₂ + 2n·lg(√p)·β
  Compute reflector
    2.1  v = house(A):                 n·γ₄ + 3n·lg(√p)·α
  Perform matrix-vector multiply
    3.1  spread v across:              n·lg(√p)·α + ½·(n²·lg(√p)/√p)·β
    3.2  transpose v, spread down:     (n²/√p)·γ₁ + n·lg(√p)·α + ½·(n²·lg(√p)/√p)·β
    3.3  w = tril(A)·v, wᵀ = vᵀ·tril(A, −1):
                                       n·γ₄ + (n²/(nb·√p))·γ₂ + (2/3)·(n³/p)·γ₂ + 3·(n²·nb/√p)·γ₂
    3.4  sum w row-wise:               n·lg(√p)·α + ½·(n²·lg(√p)/√p)·β
    3.5  sum wᵀ column-wise:           n·lg(√p)·α + ½·(n²·lg(√p)/√p)·β
    3.6  w = w + transpose(wᵀ)
  Update the matrix-vector product
    4.1  w = w − W·Vᵀ·v − V·Wᵀ·v:      4n·γ₄ + 2·(n²·nb/√p)·γ₂ + 6n·lg(√p)·α + (n²·lg(√p)/√p)·β
  Compute companion update vector
    5.1  c = wᵀv;  w = β(w − (cβ/2)v): n·γ₄ + 2n·lg(√p)·α
end do i = ii, mxi
  Perform rank-2k update
    6.1  spread V, W across:           (n²·lg(√p)/√p)·β
    6.2  transpose V, W, spread down:  (n²·lg(√p)/√p)·β
    6.3  A = A − W·Vᵀ − V·Wᵀ:          2·(n²/(nb²·√p))·γ₃ + (2/3)·(n³/p)·γ₃ + 3·(n²·nb/√p)·γ₃
end do ii = 1, n, nb
Distribution of data and computation in PDSYTRD
In PDSYEVX, the matrix being reduced, A, is distributed across a two-dimensional grid
of processors. The computation is distributed in a like manner, i.e. computations involving
matrix element A(i, j) are performed by the processor which owns matrix element A(i, j).
Vectors are distributed across the processors within a given column of processors. At the
ith step, i.e. when reducing A(i :n; i :n) to A(i+1:n; i+1:n), the vectors are distributed
amongst the processors which own some portion of the vector A(i :n; i). Within calls to
the PBLAS, these vectors are sometimes replicated across all processor columns, or even
transposed and replicated across all processor rows. However, between PBLAS calls, each
vector element is owned by just one processor.
Critical path in PDSYTRD
For steps 1.1, 1.2, 2.1, 4.1, 5.1, 6.1, 6.2, 6.3 in Figure 4.5, i.e. all steps except
"forming the matrix vector product", the processor owning the most rows in the current
column of the remaining matrix has the most work to do and hence it is on the critical path.
When the matrix vector product is being formed (steps 3.1 through 3.6), the processor
which owns the most rows and the most columns in the remaining matrix has the most
work (both communication and computation) and hence is on the critical path.
Load imbalance
Load imbalance occurs when some processor(s) take longer to perform certain
operations1, requiring other processors to wait. Each processor is responsible for computations
on the portion of the matrix and/or vectors that it owns. Some processors own a larger
portion of the matrix and/or vectors. Since PDSYTRD has regular synchronization points2,
the processor which takes the longest to complete any given step determines the execution
time for that step.
If row j is the first row in a data layout block, the processor which owns A(j, j) will
own the most rows in A(j:n, j:n): ⌊(n−j+1)/(pr·nb)⌋·nb + min(n−j+1 − ⌊(n−j)/(pr·nb)⌋·nb·pr, nb). However,
if row j is not the first row in a data layout block, even this formula is too simplistic.
1 Load imbalance also occurs during communication, but for PDSYTRD on the machines that we studied the communication load imbalance was negligible.
2 Computing the reflector (Line 2.1) and computing the companion update vector (Line 5.1) require all the processors in the processor column owning column i of the matrix and are hence synchronization points.
Fortunately, (n−j+1)/pr + nb/2 is an excellent approximation, on average, for the maximum number
of rows of A(j:n, j:n) owned by any processor. (n−j+1)/pr + (nb/2)·(pr−1)/pr is more accurate, but the
difference is too small to be useful.
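The approximation can be checked against an exact count. The helper below (an illustration with hypothetical parameter names, not ScaLAPACK code) deals m remaining rows out block-cyclically over pr processors and compares the maximum to m/pr + nb/2:

```python
def rows_owned(m, nb, pr):
    """Rows owned by each of pr processors when m rows are dealt out
    block-cyclically in blocks of nb, starting at processor 0."""
    counts = [0] * pr
    for first in range(0, m, nb):
        block = min(nb, m - first)          # last block may be partial
        counts[(first // nb) % pr] += block
    return counts

m, nb, pr = 1000, 16, 7
exact_max = max(rows_owned(m, nb, pr))
approx = m / pr + nb / 2
# the approximation is good to within one block
assert abs(exact_max - approx) <= nb
```

Averaged over many values of m, the discrepancy is far smaller than the one-block worst case.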
The second source of load imbalance is that many of the computations are performed
only by the processors which own the current column of the matrix.
Updating the current column of A
As shown in Table 4.1, PDSYTRD updates the current column of A through two calls
to PDGEMV, one at line 350 of pdlatrd.f and one at line 355 of pdlatrd.f. Each of these calls
to PDGEMV requires that the first few elements of a column vector (W or V) be transposed
and replicated among all the processors in that column. The transposition is fast because
these elements are entirely contained within one processor, but the replication requires a
spread down (column-wise broadcast) of nb or fewer items.
Standard data layout model
By making a few assumptions, we can significantly simplify the model. By assuming
that pr = pc = √p, many of the terms coalesce. We also assume that the panel blocking
factor4, pbf, = 2, as it is in ScaLAPACK 1.5.
This standard data layout is also assumed in Figure 4.5 and in Chapter 5. The
models used in Figure 4.5 and in Chapter 5 are subsets, including only the most important
terms, of the "standard data layout" models shown in Tables 4.2 through 4.10.
Computing the reflector (Line 2.1 in Figure 4.5)
PDLARFG computes the reflector as shown in Table 4.2. First, it broadcasts α = A(j+1, j)
to all processes that own column A(:, j). Then, it computes the norm σ = ‖A(j+1:n, j)‖₂, leaving the result replicated across all processors that own column A(:, j).
The rest of the computation is entirely local and requires only 2n²/√p + O(n) flops,
hence does not contribute significantly to total execution time.
4 The matrix vector multiplies are each performed in panels of size pbf·nb. See Section 4.2.2.
Table 4.1: The cost of updating the current column of A in PDLATRD (Lines 1.1 and 1.2 in Figure 4.5). For each task: the file:line number or subroutine, the execution time contribution from columns j = 1 to n shown explicitly, and the simplified execution time.

Broadcast W(j, 1:j′−1)ᵀ within current column3
  pdlatrd.f:350, pdgemv_.c, pbdgemv.f:560, dgebs2d
  Σ_{j=1}^{n} ( ⌈log₂ pr⌉α + γ₄ + j′⌈log₂ pr⌉β )
  = n⌈log₂ pr⌉α + n·γ₄ + 0.5·n·nb⌈log₂ pr⌉β

Compute local portion of A(j:n, j) = A(j:n, j) − V(j:n, 1:j′−1)·W(j, 1:j′−1)ᵀ
  pdlatrd.f:350, pdgemv_.c, pbdgemv.f:580, dgemv
  Σ_{j=1}^{n} ( γ₂ + 2(n−j)(j′/pr)γ₂ )
  = n·γ₂ + 0.5(n²·nb/pr)γ₂

Broadcast V(j, 1:j′−1)ᵀ within current column
  pdlatrd.f:355, pdgemv_.c, pbdgemv.f:560, dgebs2d
  Σ_{j=1}^{n} ( ⌈log₂ pr⌉α + γ₄ + j′⌈log₂ pr⌉β )
  = n⌈log₂ pr⌉α + n·γ₄ + 0.5·n·nb⌈log₂ pr⌉β

Compute local portion of A(j:n, j) = A(j:n, j) − W(j:n, 1:j′−1)·V(j, 1:j′−1)ᵀ
  pdlatrd.f:355, pdgemv_.c, pbdgemv.f:580, dgemv
  Σ_{j=1}^{n} ( γ₂ + 2(n−j)(j′/pr)γ₂ )
  = n·γ₂ + 0.5(n²·nb/pr)γ₂

Total: 2n⌈log₂ pr⌉α + n·nb⌈log₂ pr⌉β + 2n·γ₂ + (n²·nb/pr)γ₂ + 2n·γ₄
Standard data layout (see Section 4.2.2): 2n⌈log₂ √p⌉α + n·nb⌈log₂ √p⌉β + 2n·γ₂ + (n²·nb/√p)γ₂ + 2n·γ₄
Table 4.2: The cost of computing the reflector (PDLARFG) (Line 2.1 in Figure 4.5). For each task: the file:line number or subroutine, the execution time contribution from columns j = 1 to n shown explicitly, and the simplified execution time.

Broadcast α = A(j+1, j)
  pdlatrd.f:364, pdlarfg.f:213, dgebs2d
  Σ_{j=1}^{n} ⌈log₂ pr⌉α  =  n⌈log₂ pr⌉α

xnorm = ‖A(j+1:n, j)‖₂
  pdlatrd.f:364, pdlarfg.f:229, pdnrm2
  Σ_{j=1}^{n} ( 2⌈log₂ pr⌉α + ½γ₄ )  =  2n⌈log₂ pr⌉α + (n/2)γ₄

τ = (β + α)/β
  pdlatrd.f:364, pdlarfg.f:271
  negligible

A(j+2:n, j) = A(j+2:n, j)/(α + β)
  pdlatrd.f:364, pdlarfg.f:272, pdscal
  Σ_{j=1}^{n} ½γ₄  =  (n/2)γ₄

E(j) = A(j+1, j) = β
  pdlatrd.f:364, pdlarfg.f:273
  negligible

Total: 3n⌈log₂ pr⌉α + n·γ₄
Standard data layout (see Section 4.2.2): 3n⌈log₂ √p⌉α + n·γ₄
Forming the matrix vector product using PDSYMV (Lines 3.1 through 3.6 in Figure 4.5)
The matrix A is laid out in a block cyclic manner as described in Section 2.5.4.
Computing the matrix vector product y = Av requires that v be copied to all processes
that own a part of A that needs to be multiplied by v. The vector v must be transposed5:
each processor in the processor column that owns v sends each processor in the processor
row exactly the elements that it needs, and vᵀ is then spread down; because only half of A
is stored, v must also be spread across. Then, the matrix vector multiplies6, w1 = tril(A, 0)v
and w2 = tril(A, −1)ᵀv, are performed locally. w1 is summed within columns, transposed,
and added to w2 = tril(A, −1)ᵀv, which is summed to the active column of processors. The
algorithm used by PDSYMV is:
Algorithm 4.1 PDSYMV as used to compute Av
1 Broadcast v within each row of processors (Line 3.1 in Figure 4.4)
2 Transpose v within each column of processors (Line 3.2 in Figure 4.4)
3 Broadcast vT within each column of processors (Line 3.2 in Figure 4.4)
4 Form diagonal portion of A (Line 3.3 in Figure 4.4)
5 w1 = locally available portion of tril(A; 0)v (Line 3.3 in Figure 4.4)
6 w2 = vT tril(A;�1) (Line 3.3 in Figure 4.4)
7 Sum w1 within each column of processors (Line 3.4 in Figure 4.4)
8 Sum w2 within each row of processors (Line 3.5 in Figure 4.4)
9 Transpose w1 and add to w2 (Line 3.6 in Figure 4.4)
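The serial identity underlying steps 5 and 6, that Av can be assembled from the stored lower triangle alone, is easy to check in NumPy (an illustration of the kernel, not the PBLAS code):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 9
A = rng.standard_normal((n, n))
A = A + A.T                      # symmetric; only tril(A) would be stored
v = rng.standard_normal(n)

w1 = np.tril(A) @ v              # lower triangle, including the diagonal
w2 = np.tril(A, -1).T @ v        # strictly lower triangle, transposed
assert np.allclose(w1 + w2, A @ v)
```

For a symmetric A, tril(A, 0) + tril(A, −1)ᵀ = A, so the two partial products sum to the full product.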
The two transpose operations, steps {2, 3} and step 9 in Algorithm 4.1, though both
are performed by PBDTRNV, use different communication patterns. The transpose performed
in steps 2 and 3 is an all-to-all. It takes v replicated across the processor columns and
distributed across the processor rows and produces vᵀ replicated across the processor rows
and distributed across the processor columns. The transpose performed in step 9 is a one-to-one
transpose. It takes y_uᵀ, distributed across the processor columns within one processor
row, and produces y_u, distributed across the processor rows within the current processor
column.
5 The non-transposed v is distributed like column A(:, i); the transposed v is distributed like row A(i, :).
6 tril() is MATLAB notation for the lower triangular portion of a matrix (including the diagonal). tril(·, −1)
refers to the portion of the matrix below the diagonal.
The all-to-all transposition is performed in two steps (steps 2 and 3 in Algorithm 4.1).
Since each column of processors contains a complete copy of the vector v,
each acts independently, first collecting the portion of vᵀ that belongs to this processor
column onto one processor7 and then broadcasting it to all processor columns. The operation of
collecting the portion of vᵀ that belongs to this processor column onto one processor is done as
a tree-based reduction, requiring ⌈log₂(lcm(pr, pc))⌉ messages and a total of
((lcm(pr, pc) − 1)/lcm(pr, pc))·(j/pc) words, which I model as j/pc words. The broadcast
which completes the transpose (step 3) requires ⌈log₂ pc⌉ messages and ⌈log₂ pc⌉·(j/pc) words.
The one-to-one transpose (step 9) is accomplished as a single set of direct messages.
Every word in y_uᵀ is owned by exactly one processor, and every word in y_u should be sent to one
processor. Every word in y_uᵀ is sent from the processor that owns it to the processor that
needs the corresponding word in y_u. All words being sent between the same two processors
are sent in a single message. Each processor that owns a part
of y_uᵀ sends every word that it owns, i.e. j/pc words, in lcm(pr, pc)/pc messages. Every processor that
needs a part of y_u receives the number of words that it needs, j/pr, in lcm(pr, pc) messages.
The two matrix vector multiplies are each performed in panels of size pbf·nb.
pbf, the panel blocking factor, is set to max(mullen, lcm(pr, pc)/pc), where mullen is a tuning
parameter set at compile time to 2 in ScaLAPACK 1.5.
The cost of the matrix vector multiply is detailed in Table 4.3.
The number of flops in the matrix vector multiply which any given processor must
perform is controlled by the size and shape of the local portion of the trailing matrix.
The processor holding the largest portion of the trailing matrix holds a matrix of size
approximately8 ⌈(n−j)/(mb·pr)⌉·mb by ⌈(n−j)/(nb·pc)⌉·nb. Because we update only the lower triangular portion
of the matrix, each element in the lower triangular portion of the matrix is used in two matrix
vector multiplies. And, because the shape of the local portion of the matrix is irregular
(a column block stair step with some diagonal steps), the matrix vector computation is
performed by column blocks. The irregular pattern repeats every (lcm(pr, pc)/pc)·nb columns, so pbf, the
panel blocking factor, is chosen to be max(mullen, lcm(pr, pc)/pc), where mullen is a compile time
7 If pr = lcm(pc, pr), the portion of data that belongs to this processor column is already on one processor and hence this "collection" is a null operation.
8 The largest local matrix size differs from this only when mod(n−j, nb·pr) < nb or mod(n−j, nb·pc) < nb.
Table 4.3: The cost of all calls to PDSYMV from PDSYTRD. For each task: the file:line number or subroutine, the execution time contribution from columns j = 1 to n shown explicitly, and the simplified execution time.

Broadcast v within each processor row (Line 3.1)
  pdlatrd.f:370, pdsymv_.c, pbdsymv.f:406, dgebr2d
  Σ_{j=1}^{n} ( ⌈log₂ pc⌉α + ⌈log₂ pc⌉((n−j)/pr + nb/2)β )
  = n⌈log₂ pc⌉α + 0.5(n²⌈log₂ pc⌉/pr)β + 0.5·n·nb⌈log₂ pc⌉β

Transpose v (Line 3.2)
  pdlatrd.f:370, pdsymv_.c, pbdsymv.f:421, pbdtrnv.f:385, pbdtrget
  Σ_{j=1}^{n} ( ⌈log₂(lcm(pr, pc)/pc)⌉α + ((n−j)/pc)β )
  = n⌈log₂ lcm(pr, pc)⌉α + 0.5(n²/pc)β

Broadcast vᵀ down within each processor column (Line 3.2)
  pdlatrd.f:370, pdsymv_.c, pbdsymv.f:421, pbdtrnv.f:400, dgebs2d
  Σ_{j=1}^{n} ( ⌈log₂ pr⌉α + ⌈log₂ pr⌉((n−j)/pc + nb/2)β )
  = n⌈log₂ pr⌉α + 0.5(n²⌈log₂ pr⌉/pc)β + 0.5·n·nb⌈log₂ pr⌉β

Form diagonal portion of matrix, padded with zeroes (Line 3.3)
  pdlatrd.f:370, pdsymv_.c, pbdsymv.f:685, pbdlacp1
  Σ_{j=1}^{n} ( ((n−j)/pc)γ₁ + ((n−j)/pc)γ₁ )
  = 0.5(n²/pc)γ₁ + 0.5(n²/pc)γ₁

w = tril(A, 0)v, wᵀ = vᵀ·tril(A, −1), local computation (Line 3.3)
  pdlatrd.f:370, pdsymv_.c, pbdsymv.f:702,704,757,759, dgemv
  2 Σ_{j=1}^{n} ( 2(n−j)/(pbf·nb·pc)·γ₂ + ⌈(n−j)/(nb·pr)⌉nb·⌈(n−j)/(nb·pc)⌉nb·γ₂ + j·pbf·nb/pr·γ₂ )
  = 2(n²/(pbf·nb·pc))γ₂ + (2/3)(n³/p)γ₂ + 0.5(n²·nb/pr)γ₂ + 0.5(n²·nb/pc)γ₂ + (n²·nb·pbf/pr)γ₂

Sum w row-wise (Line 3.4)
  pdlatrd.f:364, pdsymv_.c, pbdsymv.f:801, dgsum2d
  Σ_{j=1}^{n} ( ⌈log₂ pc⌉α + ⌈log₂ pc⌉((n−j)/pr + nb/2)β )
  = n⌈log₂ pc⌉α + 0.5(n²⌈log₂ pc⌉/pr)β + 0.5·n·nb⌈log₂ pc⌉β

Sum wᵀ column-wise (Line 3.5)
  pdlatrd.f:364, pdsymv_.c, pbdsymv.f:809, dgsum2d
  Σ_{j=1}^{n} ( ⌈log₂ pr⌉α + ⌈log₂ pr⌉((n−j)/pc + nb/2)β )
  = n⌈log₂ pr⌉α + 0.5(n²⌈log₂ pr⌉/pc)β + 0.5·n·nb⌈log₂ pr⌉β

Transpose wᵀ and sum into w (Line 3.6)
  pdlatrd.f:370, pdsymv_.c, pbdsymv.f:811, pbdtrnv
  Σ_{j=1}^{n} ( (lcm(pr, pc)/pr)α + (lcm(pr, pc)/pc)α + ((n−j)/pc)β + ((n−j)/pr)β + ((n−j)/(nb·pc))γ₁ + ((n−j)/(nb·pr))γ₁ )
  = n(lcm(pr, pc)/pr)α + n(lcm(pr, pc)/pc)α + 0.5(n²/pc)β + 0.5(n²/pr)β + 0.5(n²/(nb·pc))γ₁ + 0.5(n²/(nb·pr))γ₁

Total:
2n⌈log₂ pc⌉α + 2n⌈log₂ pr⌉α + n(lcm(pr, pc)/pr)α + n(lcm(pr, pc)/pc)α + n⌈log₂ lcm(pr, pc)⌉α + (n²⌈log₂ pc⌉/pr)β + (n²⌈log₂ pr⌉/pc)β + (n²/pc)β + 0.5(n²/pr)β + n·nb⌈log₂ pc⌉β + n·nb⌈log₂ pr⌉β + 2(n²/(pbf·nb·pc))γ₂ + (2/3)(n³/p)γ₂ + 0.5(n²·nb/pr)γ₂ + 0.5(n²·nb/pc)γ₂ + (n²·nb·pbf/pr)γ₂ + (n²/pc)γ₁ + n·γ₄

Standard data layout (see Section 4.2.2):
4n⌈log₂ √p⌉α + 2nα + 2(n²⌈log₂ √p⌉/√p)β + 1.5(n²/√p)β + 2n·nb⌈log₂ √p⌉β + (n²/(nb·√p))γ₂ + (2/3)(n³/p)γ₂ + 3(n²·nb/√p)γ₂ + (n²/√p)γ₁ + n·γ₄
parameter, set to 2 in the standard PBLAS release. The column panels are filled out with
zeroes to make the matrix vector multiply efficient. Even the act of filling the diagonal
blocks with zeroes, because it is done inefficiently, is noticeable on modest problem sizes.
The number of flops required for a global (n−j) × (n−j) matrix vector multiply
is approximately:

2 × 2 × ( ½·⌈(n−j)/(nb·pr)⌉·nb · ⌈(n−j)/(nb·pc)⌉·nb + (n−j)·pbf·nb/(2·pr) ).
The first 2 is because multiplies and adds are counted separately. Each element in the lower
triangular portion of the matrix is involved twice, hence the second 2. The first term stems
directly from the size of the local matrix. The second term stems from the odd shape of
the local matrix and is primarily the result of the unnecessary flops (zero matrix elements)
added to reduce the number of dgemv calls.
We use the following equality, dropping the O(n) term:

Σ_{i=1}^{n} ⌈i/a⌉·⌈i/b⌉ = n³/(3ab) + n²/(4a) + n²/(4b) + O(n).
flops = 2 × 2 × Σ_{j=1}^{n} ( ½·⌈(n−j)/(nb·pr)⌉·nb · ⌈(n−j)/(nb·pc)⌉·nb + j·pbf·nb/(2·pr) )
      = 2 × 2 × ½ × ( n³/(3·pr·pc) + ¼·(n²/(nb·pc))·nb² + ¼·(n²/(nb·pr))·nb² + (n²/2)·(pbf·nb/pr) )
      = (2/3)·n³/p + n²·nb/(2·pr) + n²·nb/(2·pc) + n²·nb·pbf/pr

Figure 4.6: Flops in the critical path during the matrix vector multiply
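Both the summation equality above and the closed form of Figure 4.6 can be sanity-checked numerically (a quick check with illustrative parameter values, not a timing model):

```python
import math

def ceil_sum(n, a, b):
    """Left-hand side of the equality: sum of ceil(i/a)*ceil(i/b)."""
    return sum(math.ceil(i / a) * math.ceil(i / b) for i in range(1, n + 1))

def flops_exact(n, nb, pr, pc, pbf):
    """Critical-path flops in the matrix vector multiply, summed directly."""
    return 2 * 2 * sum(
        0.5 * math.ceil((n - j) / (nb * pr)) * nb
            * math.ceil((n - j) / (nb * pc)) * nb
        + j * pbf * nb / (2 * pr)
        for j in range(1, n + 1))

def flops_model(n, nb, pr, pc, pbf):
    """Closed form from Figure 4.6."""
    p = pr * pc
    return (2 / 3) * n**3 / p + n**2 * nb / (2 * pr) \
        + n**2 * nb / (2 * pc) + n**2 * nb * pbf / pr

# the equality holds up to an O(n) error term
n, a, b = 2000, 16, 24
assert abs(ceil_sum(n, a, b)
           - (n**3 / (3 * a * b) + n**2 / (4 * a) + n**2 / (4 * b))) < 10 * n

# the closed form tracks the direct sum to within a few percent
n, nb, pr, pc, pbf = 1500, 16, 4, 4, 2
assert abs(flops_exact(n, nb, pr, pc, pbf) - flops_model(n, nb, pr, pc, pbf)) \
    < 0.05 * flops_model(n, nb, pr, pc, pbf)
```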
Updating the matrix vector product
Updating the matrix vector product, y = y − V·Wᵀ·v − W·Vᵀ·v, requires four matrix vector
multiplies. temp = Wᵀv and temp = Vᵀv each multiply an (n−j) × j′ matrix by a vector of
length n−j, where j′ = j mod nb. Both the matrix and the vector are stored in the current
process column. No data movement is required to perform the computation; however, the
result, a vector of length j′−1, is the sum of the matrix vector multiplies performed on each
of the processes in the process column.
Table 4.4: The cost of updating the matrix vector product in PDLATRD (Line 4.1 in Figure 4.5). For each task: the file:line number or subroutine, the execution time contribution from columns j = 1 to n shown explicitly, and the simplified execution time.

Broadcast Wᵀ unnecessarily for temp = Wᵀv
  pdlatrd.f:373, pbdgemv.f:826, dgebs2d
  Σ_{j=1}^{n} ( γ₄ + ⌈log₂ pc⌉α + ⌈log₂ pc⌉((n−j)/pr)β + ⌈log₂ pc⌉(nb/2)β )
  = n·γ₄ + n⌈log₂ pc⌉α + 0.5(n²⌈log₂ pc⌉/pr)β + 0.5·n·nb⌈log₂ pc⌉β

Local computation of temp = Wᵀv
  pdlatrd.f:373, pbdgemv.f:846, dgemv
  Σ_{j=1}^{n} ( γ₂ + ((n−j)·nb/pr)γ₂ )
  = n·γ₂ + 0.5(n²·nb/pr)γ₂

Sum the contribution of temp from all processes in the column
  pdlatrd.f:373, pbdgemv.f:858, dgsum2d
  Σ_{j=1}^{n} ( ⌈log₂ pr⌉α + ⌈log₂ pr⌉(nb/2)β )
  = n⌈log₂ pr⌉α + 0.5·n·nb⌈log₂ pr⌉β

Broadcast temp (row-wise) to all processes in this column
  pdlatrd.f:376, pbdgemv.f:579, dgebs2d
  Σ_{j=1}^{n} ( γ₄ + ⌈log₂ pr⌉α + ⌈log₂ pr⌉(nb/2)β )
  = n·γ₄ + n⌈log₂ pr⌉α + 0.5·n·nb⌈log₂ pr⌉β

Local computation of y = V·temp
  pdlatrd.f:376, pbdgemv.f:600, dgemv
  Σ_{j=1}^{n} ( γ₂ + ((n−j)·nb/pr)γ₂ )
  = n·γ₂ + 0.5(n²·nb/pr)γ₂

y = y + W·Vᵀ·v is identical to y = y + V·Wᵀ·v
  pdlatrd.f:379, pdlatrd.f:382
  = n⌈log₂ pc⌉α + 2n⌈log₂ pr⌉α + 0.5(n²⌈log₂ pc⌉/pr)β + 0.5·n·nb⌈log₂ pc⌉β + n·nb⌈log₂ pr⌉β + 2n·γ₂ + (n²·nb/pr)γ₂ + 2n·γ₄

Total:
2n⌈log₂ pc⌉α + 4n⌈log₂ pr⌉α + (n²⌈log₂ pc⌉/pr)β + n·nb⌈log₂ pc⌉β + 2n·nb⌈log₂ pr⌉β + 4n·γ₂ + 2(n²·nb/pr)γ₂ + 4n·γ₄

Standard data layout (see Section 4.2.2):
6n⌈log₂ √p⌉α + (n²⌈log₂ √p⌉/√p)β + 3n·nb⌈log₂ √p⌉β + 4n·γ₂ + 2(n²·nb/√p)γ₂ + 4n·γ₄
The other two matrix vector multiplies, y = V·temp and y = W·temp, each multiply an
(n−j) × (j′−1) matrix by a vector of length j′−1. Again, the computation is performed
entirely within the current process column. The 1 × (j′−1) vector temp must be spread
down, i.e. broadcast column-wise, to all processes in this process column; however, no further
communication is necessary in order to update y, as y is perfectly aligned with V.
Details are given in Table 4.4.
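Serially, the update amounts to two short matrix vector products that produce temp followed by two more that apply it, avoiding any n × n intermediate (a NumPy illustration, not the parallel code):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 10, 4                      # k = number of columns accumulated so far
V = rng.standard_normal((n, k))
W = rng.standard_normal((n, k))
v = rng.standard_normal(n)
y = rng.standard_normal(n)

# two short matvecs produce the small intermediates ...
temp_w = W.T @ v                  # in PDSYTRD: local products, then a column sum
temp_v = V.T @ v                  # ... then temp is broadcast down the column
# ... and two more apply them
y_updated = y - V @ temp_w - W @ temp_v

# same result as forming the (much larger) n-by-n products explicitly
assert np.allclose(y_updated, y - (V @ W.T) @ v - (W @ V.T) @ v)
```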
Computing the companion update vector
The details involved in computing the companion update vector are shown in table 4.5.
Table 4.5: The cost of computing the companion update vector in PDLATRD (Line 5.1 in Figure 4.5). For each task: the file:line number or subroutine, the execution time contribution from columns j = 1 to n shown explicitly, and the simplified execution time.

Compute y = β·y
  pdlatrd.f:385, pdscal
  Σ_{j=1}^{n} (1/3)γ₄  =  (1/3)n·γ₄

Compute δ = −0.5·β·yᵀv
  pdlatrd.f:386, pddot
  Σ_{j=1}^{n} ( ⌈log₂ pr⌉α + (1/3)γ₄ )  =  n⌈log₂ pr⌉α + (1/3)n·γ₄

Compute w = y + δ·v
  pdlatrd.f:390, pdaxpy
  Σ_{j=1}^{n} ( ⌈log₂ pr⌉α + (1/3)γ₄ )  =  n⌈log₂ pr⌉α + (1/3)n·γ₄

Total: 2n⌈log₂ pr⌉α + n·γ₄
Standard data layout (see Section 4.2.2): 2n⌈log₂ √p⌉α + n·γ₄
Performing the rank-2k update
The rank-2k update is performed once per block column (i.e. n/nb times):

A = A − V·Wᵀ − W·Vᵀ.

PDSYTRD broadcasts V and W along processor rows, transposes them and then
broadcasts them along processor columns. I ignore the α (latency) cost of the transpose
here, because it is less significant (by a factor of nb) than the similar cost for the transpose in
the matrix-vector multiply and because it is only relevant when lcm(pr, pc)/pc is very large. The
third β term in the transpose and broadcast operation should be multiplied by
(lcm(pr, pc)/pc − 1)/(lcm(pr, pc)/pc), but the added complexity is not justified for a small term.
The number of flops performed during the rank two update of A(j:n, j:n) is
modeled as:

2 × 2 × nb × ( ½·((n−j)/pr + nb/2)·((n−j)/pc + nb/2) + n·nb·pbf/(2·pr) ).

The number of flops performed per matrix element involved in the rank-2 update is 2 × 2 × nb.
The number of elements in the lower triangular matrix is given by the sum of the terms
within the parentheses.
The total number of flops for all rank two updates is modeled as the sum of this
quantity as j ranges from nb to n by nb.
Table 4.6: The cost of performing the rank-2k update (PDSYR2K) (Lines 6.1 through 6.3 in Figure 4.5). For each task: the file:line number or subroutine, the execution time contribution (summed over j = nb, 2nb, …, n) shown explicitly, and the simplified execution time.

Broadcast V and W within process rows (Line 6.1)
  pdsytrd.f:354, pdsyr2k_.c, pdsyr2k.f:454,477, dgebs2d
  Σ_{j=nb,2nb,…,n} ( 2⌈log₂ pc⌉α + 2((n−j)/pr + nb/2)⌈log₂ pc⌉β )
  = 2(n/nb)⌈log₂ pc⌉α + (n²/pr)⌈log₂ pc⌉β − n·nb⌈log₂ pc⌉β

Transpose and broadcast V and W within process columns (Line 6.2)
  pdsytrd.f:354, pdsyr2k_.c, pdsyr2k.f:491,847, pbdtran
  Σ_{j=nb,2nb,…,n} ( 2⌈log₂ pr⌉α + 2((n−j)/pc + nb/2)·nb⌈log₂ pr⌉β + ((n−j)/pc)β )
  = 2(n/nb)⌈log₂ pr⌉α + (n²/pc)⌈log₂ pr⌉β − n·nb⌈log₂ pr⌉β + (n²/pc)β

tril(A, 0) = tril(A, 0) + V·Wᵀ + W·Vᵀ (Line 6.3)
  pdsytrd.f:354, pdsyr2k_.c, pdsyr2k.f:655-660, 1052-1057, pdgemm
  Σ_{j=nb,2nb,…,n} ( 4(n−j)/(nb·pc·pbf)·γ₃ + 4·nb·( ½((n−j)/pr + nb/2)((n−j)/pc + nb/2) + n·nb·pbf/(2·pr) )·γ₃ )
  = 2(n²/(nb²·pc·pbf))γ₃ + (2/3)(n³/p)γ₃ + 0.5(n²·nb/pr)γ₃ + 0.5(n²·nb/pc)γ₃ + (n²·nb·pbf/pr)γ₃

Total:
2(n/nb)⌈log₂ pc⌉α + 2(n/nb)⌈log₂ pr⌉α + (n²/pr)⌈log₂ pc⌉β + (n²/pc)⌈log₂ pr⌉β + (n²/pc)β − n·nb⌈log₂ pc⌉β − n·nb⌈log₂ pr⌉β + 2(n²/(nb²·pc·pbf))γ₃ + (2/3)(n³/p)γ₃ + 0.5(n²·nb/pr)γ₃ + 0.5(n²·nb/pc)γ₃ + (n²·nb·pbf/pr)γ₃

Standard data layout (see Section 4.2.2):
4(n/nb)⌈log₂ √p⌉α + 2(n²/√p)⌈log₂ √p⌉β + (n²/√p)β − 2n·nb⌈log₂ √p⌉β + (n²/(nb²·√p))γ₃ + (2/3)(n³/p)γ₃ + 3(n²·nb/√p)γ₃
The negative term (−2(n²·nb/p)γ₃), which results from the fact that j starts at nb, is
ignored because it is O(n²·nb/p) and hence too small.
Details are given in table 4.6.
4.2.3 PDSYTRD execution time summary
Table 4.7 shows that the computation cost in PDSYTRD is:

(2/3)(n³/p)γ₃ + (2/3)(n³/p)γ₂ + (n²·nb·pbf/pr)γ₂ + (7/2)(n²·nb/pr)γ₂ + (1/2)(n²·nb/pc)γ₂ + (n²·nb·pbf/pr)γ₃ + (1/2)(n²·nb/pr)γ₃ + (1/2)(n²·nb/pc)γ₃ + 2(n²/(nb²·pc·pbf))γ₃ + 6n·γ₂ + 2(n²/(pr·pbf·nb))γ₂ + (n²/pc)γ₁ + 9n·γ₄.
The most important terms in the computation cost are the O(n³/p) flops. The
relative importance of the other (o(n³)) terms depends on the computer. On the PARAGON
none stands out above the rest. Indeed, on the PARAGON none of the o(n³) terms accounts
for more than 3% of the total execution time of PDSYEVX when n = 3480 and p = 64.
However, all of the o(n³) terms combined account for 21% of the total execution time on that
same problem.
Table 4.8 shows that the computation cost in the tridiagonal eigendecomposition
in PDSYEVX is:

53(n·e/p)δ + 3(n·m/p)δ + 112n·δ + 265(n·e/p)γ1 + 45(n·m/p)γ1 + 620n·γ1 + 6n·c²·γ1.
The execution time of tridiagonal eigendecomposition is dominated by the cost
of divides and by the size of the largest cluster, c. The load imbalance terms (112n·δ and
620n·γ1) are negligible.
Table 4.9 shows that the communication cost in PDSYTRD is:

4n⌈log2(pc)⌉α + 13n⌈log2(pr)⌉α + n(lcm(pr, pc)/pr)β + n⌈log2(lcm(pr, pc))⌉α + 3(n²/pr)⌈log2(pc)⌉β + 2(n²/pc)⌈log2(pr)⌉β + (1/2)(n²/pr)β + 2(n²/pc)β.
Most of the messages are in broadcasts and reductions (i.e. the O(n log(p))
terms), and most of the broadcasts and reductions (13n) are within processor rows, versus
only 4n broadcasts and reductions within processor columns. By contrast, the message
volume is fairly evenly split between broadcasts and reductions within processor rows
(3(n²/pr)⌈log2(pc)⌉β) and broadcasts and reductions within processor columns
(2(n²/pc)⌈log2(pr)⌉β).

The lcm terms are negligible unless p is very large, in which case it is important
to make sure that lcm(pr, pc) is reasonable (say < 10·max(pr, pc)).
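As a quick illustration (grid shapes chosen arbitrarily), power-of-two grids keep lcm(pr, pc) small, while nearly square grids with coprime dimensions violate the guideline:

```python
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

def grid_lcm_ok(pr, pc):
    """Rule of thumb from the text: lcm(pr, pc) should stay below 10*max(pr, pc)."""
    return lcm(pr, pc) < 10 * max(pr, pc)

print(grid_lcm_ok(16, 32))   # lcm = 32: fine
print(grid_lcm_ok(31, 33))   # lcm = 1023: unreasonable
```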
4.3 Eigendecomposition of the tridiagonal
The execution time of tridiagonal eigendecomposition is dominated by two factors:
the size of the largest cluster of eigenvalues and the speed of the divide.
4.3.1 Bisection
During bisection, in DSTEBZ, each Sturm count requires n divisions and 5n other
flops to produce one additional bit of accuracy. Hence, it takes roughly 53n divisions and
53·5n flops⁹ for each eigenvalue, and 53·n·e total divisions for all eigenvalues in IEEE
double precision, where e is the number of eigenvalues to be computed. The exact number
of divisions and flops depends on the actual eigenvalues, the parallelization strategy and
other factors. However, this simple model suffices for our purposes.
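A serial sketch of a Sturm count and the bisection loop just described (the zero-pivot perturbation mimics the spirit of DSTEBZ's pivot safeguard; the bracketing interval [lo, hi] is assumed to contain the spectrum):

```python
def sturm_count(d, e, x):
    """Count eigenvalues of the symmetric tridiagonal matrix (diagonal d,
    off-diagonal e) that are less than x, by counting negative pivots of
    T - x*I.  One divide plus a few other flops per row, as in the model."""
    count = 0
    q = 1.0
    for i in range(len(d)):
        ei2 = e[i - 1] ** 2 if i > 0 else 0.0
        q = d[i] - x - ei2 / q      # the divide that dominates on a slow-divide machine
        if q == 0.0:
            q = -1e-300             # perturb exact-zero pivots (pivmin-style safeguard)
        if q < 0.0:
            count += 1
    return count

def bisect_eig(d, e, k, lo, hi, iters=53):
    """Bisect for the k-th smallest eigenvalue; ~53 halvings suffice in IEEE double."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sturm_count(d, e, mid) <= k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For the 3-by-3 matrix with diagonal 2 and off-diagonal 1, bisect_eig recovers the smallest eigenvalue 2 − √2.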
4.3.2 Inverse iteration
Inverse iteration typically requires 3n divides and 45n flops per eigenvalue, plus
the cost of re-orthogonalization.
In PDSYEVX the number of flops performed by any particular processor, p_i, during
re-orthogonalization is:

Σ_{C ∈ {clusters assigned to p_i}} Σ_{i=1}^{size(C)} 4·n_iter(i)·n·(i − 1),

where n_iter(i) is the number of inverse iterations performed for eigenvalue i (typically 3).
If the size of the largest cluster is greater than n/p, the processor which is responsible for
this cluster will not be responsible for any eigenvalues outside of this cluster.
Hence, if the size of the largest cluster is greater than n/p, the number of flops
performed by the processor to which this cluster is assigned is (on average):

4·n_iter·n·(1/2)c² = 6n·c²,
⁹Although these are not all BLAS Level 1 flops, they have the same ratio of memory operations to flops that is typical of BLAS Level 1 operations.
where c = max_{C ∈ {clusters}} size(C), i.e. the number of eigenvalues in the largest cluster, and
n_iter = 3 is the average number of inverse iterations performed for each eigenvalue.
As the problem size and number of processors grow, the largest cluster that
PDSYEVX is able to reorthogonalize properly gets smaller (relative to n). As a consequence,
reorthogonalization will not require large execution time.¹⁰ Specifically, if the largest cluster
has fewer than n/p eigenvalues (i.e. fits easily on one processor), the number of eigenvalues
that will be assigned to any one processor, and hence the total number of flops it must
perform, is limited. The worst case is where there are p + 1 clusters, each of size n/(p+1).
In this case, one processor must be assigned 2 clusters of size n/(p+1), requiring (on average)
2·6n·(n/(p+1))², or roughly 12n³/p².¹¹
Our model for the execution time of Gram-Schmidt re-orthogonalization
(Σ_{i=1}^{c} 4n·i = 2n·c²·γ1, where c is the size of the largest cluster) assumes that the
processor to which the largest cluster is assigned is not assigned any other clusters. This is
true if the largest cluster has more than n/p eigenvalues in it. If the largest cluster of
eigenvalues contains fewer than n/p eigenvalues, reorthogonalization is relatively unimportant.
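The 2n·c²·γ1 cost corresponds to a Gram-Schmidt pass over the c iterates of one cluster; a serial sketch in modified Gram-Schmidt form (cluster size and data are illustrative):

```python
import numpy as np

def reorthogonalize(Z):
    """Modified Gram-Schmidt on the c eigenvector iterates of one cluster
    (columns of Z).  Column i is orthogonalized against the i-1 previous
    columns at about 4*n*(i-1) flops, so the cluster costs roughly 2*n*c**2."""
    Q = Z.copy()
    n, c = Q.shape
    for i in range(c):
        for j in range(i):
            Q[:, i] -= (Q[:, j] @ Q[:, i]) * Q[:, j]   # about 4n flops
        Q[:, i] /= np.linalg.norm(Q[:, i])
    return Q

rng = np.random.default_rng(0)
Q = reorthogonalize(rng.standard_normal((100, 5)))
print(np.allclose(Q.T @ Q, np.eye(5)))   # columns are now orthonormal
```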
Inderjit Dhillon, Beresford Parlett and Vince Fernando's recent work [139, 77] on
the tridiagonal eigenproblem substantially reduces the motivation to model the existing
ScaLAPACK tridiagonal eigensolution code in great detail, since we expect their algorithm to
replace the current code with something that costs O(n²/p) flops, O(n²/p) message volume
and O(p) messages, which is negligible compared to tridiagonal reduction.
4.3.3 Load imbalance in bisection and inverse iteration

Load imbalance during the tridiagonal eigendecomposition is caused in part by
the fact that not all processes will be assigned the same number of eigenvalues and eigenvectors
and in part by the fact that different eigenvalues and eigenvectors will require
slightly different amounts of computation. Our experience indicates that the load imbalance
corresponds roughly to the cost of finding two eigenvalues (2·(53n·δ + 53·5n·γ1))
and two eigenvectors (2·(3n·δ + 45n·γ1)) on one processor. Hence, our execution time
model for the load imbalance during tridiagonal eigendecomposition is:
(2·53 + 2·3)n·δ + (2·53·5 + 2·45)n·γ1 = 112n·δ + 620n·γ1.

¹⁰This is not to suggest that reorthogonalization in PDSYEVX gets better as n and p increase (indeed, PDSYEVX may fail to reorthogonalize large clusters for large n and p). It just means that reorthogonalization in PDSYEVX will not take a long time for large n and large p.
¹¹The appearance of p² in the denominator stems from the restriction c ≤ n/p, meaning that as p increases the largest cluster size that PDSYEVX can handle efficiently decreases.
In evaluating the cost of load imbalance in tridiagonal eigendecomposition, one
must include load imbalance in Gram-Schmidt reorthogonalization. Indeed, if the input
matrix has one cluster of eigenvalues that is substantially larger than all others (yet small
enough to fit on one processor so that PDSYEVX can reorthogonalize it), Gram-Schmidt
reorthogonalization is very poorly load balanced and could be treated almost entirely as a
load imbalance cost.
We do not separate the load imbalance cost of Gram-Schmidt from what the execution
time for Gram-Schmidt would be if the load were balanced, because doing so would
complicate the model without making it match actual execution time any better.
4.3.4 Execution time model for tridiagonal eigendecomposition in PDSYEVX

The cost of tridiagonal eigendecomposition in PDSYEVX is the sum of the cost of
bisection, inverse iteration and reorthogonalization. Hence:

53(n·e/p)δ + 3(n·m/p)δ + 112n·δ + 265(n·e/p)γ1 + 45(n·m/p)γ1 + 620n·γ1 + 2n·c²·γ1.

The load imbalance terms 112n·δ and 620n·γ1 stem partly from the fact that
some processors will typically be assigned at least one more eigenvalue and/or eigenvector
than other processors and partly from the fact that both bisection and inverse iteration are
iterative procedures requiring more time on some eigenvalues than on others.
4.3.5 Redistribution
Inverse iteration typically leaves the data distributed in a manner in which it would
be awkward and inefficient to perform back transformation. If each eigenvector is computed
entirely within one processor, as PDSTEIN does, inverse iteration requires no communication,
provided that all processors have a copy of the tridiagonal matrix and the eigenvalues. This,
however, leaves the eigenvector matrix distributed in a one-dimensional manner in which
back transformation would be inefficient. Furthermore, since different processors may have
been assigned to compute a different number of eigenvectors (to improve orthogonality
among the eigenvectors), the eigenvector matrix will typically not be distributed in a block
cyclic manner. Since PDORMTR (and all ScaLAPACK matrix transformations) requires that the
data be in a 2D block cyclic distribution, the eigenvectors must, at least, be redistributed to
a block cyclic distribution. For convenience and potential efficiency¹², PDSTEIN redistributes
the eigenvector matrix.
The simplest method of data redistribution is to have each processor send one
message to each of the other processors. That message contains the data owned by the
sender and needed by the receiver. Redistributing the data in this manner requires that
each processor send every element that it owns to other processors¹³ and receive what
it needs from other processors. Since each processor owns¹⁴ roughly (n·m)/p elements
and needs roughly (n·m)/p elements, the total data sent and received by each processor
is roughly 2(n·m)/p. In our experience, data redistribution is slightly less efficient than
other broadcasts and reductions, and hence we use 4(n·m)/p·β as our model for the data
redistribution cost.
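A sketch of this cost model (the β value is the Paragon word-transfer time used in Table 5.1; n, m, p are illustrative):

```python
def redistribution_time_us(n, m, p, beta):
    """PDSTEIN redistribution model: each processor sends and receives about
    2*n*m/p words; the factor 4 (not 2) reflects the observation that
    redistribution runs at roughly half the efficiency of other collectives."""
    return 4 * (n * m) / p * beta

# n = m = 3840, p = 64, beta = 0.146 microseconds/word
print(redistribution_time_us(3840, 3840, 64, 0.146) / 1e6, "seconds")
```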
4.4 Back Transformation
Transforming the eigenvectors of the tridiagonal matrix back to the eigenvectors of
the original matrix requires applying a series of Householder transformations. The Householder
updates can be applied in a blocked manner, with each update taking the form (I + V·T·Vᵀ),
where V ∈ R^(n′×nb) is the matrix of Householder vectors and T is an nb × nb triangular
matrix [27].
The following steps compute Z′ = (I + V·T·Vᵀ)Z. These are performed for each
block Householder update. The major contributors to the cost are noted below.
Compute T
Computing the nb × nb triangular matrix T requires nb calls to DGEMV, a summation
of nb²/2 elements within the current processor column, and nb calls to DTRMV. The
computation of T need not be in the critical path: there are n/nb different matrices
T that need to be computed, and they could be computed in advance in parallel.
Compute W = VᵀZ
Spread V across. Compute VᵀZ locally. Sum W within each processor column.
¹²The actual efficiency depends upon the data distribution chosen by the user for the input and output matrices.
¹³Although some data will not have to be sent because it is owned and needed by the same processor, this will typically be a minor savings.
¹⁴In the absence of large clusters of eigenvalues assigned to a single processor.
The spread across of V is performed on a ring topology because the processor columns
need not be synchronized. Each processor column must receive V and send V; hence
the cost for each processor column is (2n′·nb)/pr·β.
The local computation of VᵀZ is a call to DGEMM involving 2(m/pc + vnb)(n′/pr +
nb/2)·nb flops. Ignoring the lower-order vnb·nb² term, this is:

2(n′·m·nb)/p + 2(n′·vnb·nb)/pr + (m·nb²)/pc.
Compute W = TW
Local.
Compute Z = Z − V·W
Spread W down. Local computation. (Note: V has already been spread across.)
The local computation of Z − V·W, like the computation of VᵀZ, involves a call to
DGEMM performing 2(m/pc + vnb)(n′/pr + nb/2)·nb flops.
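The steps above can be sketched serially in NumPy, ignoring the parallel distribution; the data and the triangular factor T here are random placeholders (not a true compact-WY factor), which is enough to check the update algebra. Note that sign conventions vary: LAPACK's DLARFB applies I − V·T·Vᵀ, while the text's convention is I + V·T·Vᵀ.

```python
import numpy as np

rng = np.random.default_rng(1)
n, nb, m = 60, 4, 7
V = rng.standard_normal((n, nb))
T = np.triu(rng.standard_normal((nb, nb)))  # nb-by-nb triangular factor (placeholder)
Z = rng.standard_normal((n, m))

# The steps of one block update Z' = (I + V T V^T) Z:
W = V.T @ Z       # Compute W = V^T Z (big DGEMM; summed within a process column in parallel)
W = T @ W         # Compute W = T W   (local triangular multiply)
Znew = Z + V @ W  # Compute Z' = Z + V W (second big DGEMM)

# Check against forming the update explicitly
print(np.allclose(Znew, (np.eye(n) + V @ T @ V.T) @ Z))
```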
Back transformation differs from reduction to tridiagonal form in many ways. It
requires many fewer messages: O(n/nb) versus O(n). Because the back transformation of
each eigenvector is independent, the Householder updates can be applied in a pipelined
manner, allowing V to be broadcast in a ring instead of a tree topology. PDLARFB does
not use the PBLAS, allowing V to be broadcast once but used twice. Since the number of
eigenvectors does not change during the update, half of the load imbalance depends on
mod(n, nb·pc) and can be reduced significantly if mod(n, nb·pc) = 0. In the following
table, vnb is the imbalance in the 2D block-cyclic distribution of eigenvectors¹⁵.
The cost of back transformation, shown in Table 4.10, is asymmetric: the O(n²/pr)
cost is smaller than the O(n²/pc) cost. Furthermore, the O(n²/pr) cost can be reduced
further by computing T in parallel and choosing a data layout which will minimize vnb.
Reducing the O(n²/pr) cost would allow pr < pc, reducing the O(n²/pc) costs. This is
discussed further in Chapter 8.
¹⁵vnb is computed as follows: extravecsonproc1 − extravecs/pr, where extravecs = mod(n, nb·pc) and extravecsonproc1 = min(nb, extravecs).
Table 4.7: Computation cost in PDSYEVX. Each row gives a scale factor and its total
coefficient, summed over the phases: update current column (Table 4.1), compute reflector
(Table 4.2), matrix-vector product (Table 4.4), update vector product (Table 4.5), compute
update vector, rank-2k update (Table 4.6), tridiagonal eigendecomposition (Section 4.3),
and back transformation (Table 4.10).

  scale factor              total coefficient
  (n³/p)γ3                  2/3
  (n²·m/p)γ3                2
  (n³/p)γ2                  2/3
  (n²·nb·pbf/pr)γ2          1
  (n²·nb/pr)γ2              4
  (n²·nb/pc)γ2              1/2
  (n²·nb·pbf/pr)γ3          1
  (n²·nb/pr)γ3              1/2
  (n²·vnb/pr)γ3             2
  (n²·nb/pc)γ3              1/2
  (n·m·nb/pc)γ3             3
  (n/nb)α3                  3
  (n²/(nb²·pc))·pbf·α3      2
  n·α2                      8
  (n²/(pr·pbf·nb))α2        2
  (n²/pc)γ1                 1
  n·α4                      9
Table 4.8: Computation cost (tridiagonal eigendecomposition) in PDSYEVX. All
contributions come from the tridiagonal eigendecomposition (Section 4.3).

  scale factor     total coefficient
  (n·e/p)δ         53
  (n·m/p)δ         3
  n·δ              112
  (n·e/p)γ1        265
  (n·m/p)γ1        45
  n·γ1             620
  n·c²·γ1          6
Table 4.9: Communication cost in PDSYEVX. Each row gives a scale factor and its total
coefficient, summed over the same phases as in Table 4.7.

  scale factor                  total coefficient
  n⌈log2(pc)⌉α                  4
  n⌈log2(pr)⌉α                  13   (contributions 2 + 3 + 2 + 4 + 2)
  n(lcm(pr, pc)/pr)β            1
  n(lcm(pr, pc)/pc)β            1
  n⌈log2(lcm(pr, pc))⌉α         1
  (n²/pr)⌈log2(pc)⌉β            3
  (n²/pc)⌈log2(pr)⌉β            2
  (n²/pr)β                      1/2
  (n²/pc)β                      2
  n·nb·⌈log2(pc)⌉β              1    (contributions 1 + 1 − 1)
  n·nb·⌈log2(pr)⌉β              3    (contributions 1 + 1 + 2 − 1)
Table 4.10: The cost of back transformation (PDORMTR)

Compute T
(pdsyevx.f:855, pdormtr.f:408, pdormqr.f:394, pdlarft.f)
  Contribution from columns n′ = nb to n:
    Σ_{n′=nb,2nb,...}^{n} [ ⌈log2(pr)⌉(nb²/2)β + 2nb·α2 + (2n′·nb²/(2pr))γ2 ]
  Simplified:
    2n·α2 + (1/2)(n²·nb/pr)γ2

Compute W = VᵀZ
(pdsyevx.f:855, pdormtr.f:408, pdormqr.f:412, pdlarfb.f:322,398,405)
  Contribution:
    Σ [ (2n′·nb/pr)β + ⌈log2(pr)⌉(m·nb/pc)β + α3 + (2m·nb²/(2pc))γ3
        + (2vnb·n′·nb/pr)γ3 + (2m·n′·nb/p)γ3 ]
  Simplified:
    (n²/pr)β + (n·m/pc)⌈log2(pr)⌉β + (n/nb)α3 + (n·m·nb/pc)γ3 + (n²·vnb/pr)γ3 + (n²·m/p)γ3

Compute W = TW
(pdsyevx.f:855, pdormtr.f:408, pdormqr.f:412, pdlarfb.f:412)
  Contribution:
    Σ [ α3 + (2m·nb²/(2pc))γ3 ]
  Simplified:
    (n/nb)α3 + (n·m·nb/pc)γ3

Compute Z = Z − VW
(pdsyevx.f:855, pdormtr.f:408, pdormqr.f:412, pdlarfb.f:415,425)
  Contribution:
    Σ [ ⌈log2(pr)⌉(m·nb/pc)β + α3 + (2m·nb²/(2pc))γ3 + (2vnb·n′·nb/pr)γ3 + (2m·n′·nb/p)γ3 ]
  Simplified:
    (n·m/pc)⌈log2(pr)⌉β + (n/nb)α3 + (n·m·nb/pc)γ3 + (n²·vnb/pr)γ3 + (n²·m/p)γ3

Total:
  (n²/pr)β + 2(n·m/pc)⌈log2(pr)⌉β + 2n·α2 + (1/2)(n²·nb/pr)γ2 + 3(n/nb)α3
  + 3(n·m·nb/pc)γ3 + 2(n²·vnb/pr)γ3 + 2(n²·m/p)γ3

Standard data layout:
  (n²/√p)β + 2(n·m/√p)⌈log2(√p)⌉β + 2n·α2 + (1/2)(n²·nb/√p)γ2 + 3(n/nb)α3
  + 3(n·m·nb/√p)γ3 + 2(n²·vnb/√p)γ3 + 2(n²·m/p)γ3
Chapter 5
Execution time of the ScaLAPACK
symmetric eigensolver, PDSYEVX, on
efficient data layouts on the
Paragon
The detailed execution time model gives us confidence that we understand the
execution time of PDSYEVX. It explains performance on a wide range of problem sizes, data
layouts, input matrices, computers and user requests. However, the same complexity that
allows the detailed model to explain performance over such a large domain makes it difficult
to grasp, understand and interpret. The simple six-term model shown in this chapter is
designed to explain the performance of the common, efficient case on a well known computer.

PDSYEVX takes 205 seconds to compute the eigendecomposition of a 3840 by 3840
symmetric random matrix on a 64 node Paragon in double precision. Counting only the
(10/3)n³ flops, PDSYEVX achieves 920 Megaflops, which equals 14 Megaflops per node.

For large, well behaved¹ matrices, PDSYEVX is efficient, as detailed in Table 5.1.
For well behaved 3840 × 3840 matrices, PDSYEVX spends 63% = (28+35)% of its time on
necessary computation and only 35% of its time on communication, load imbalance and

¹For PDSYEVX's purpose, a well behaved matrix is one which does not have any large clusters of eigenvalues whose associated eigenvectors must be computed orthogonally.
overhead required for execution in parallel.

Table 5.1: Six term model for PDSYEVX on the Paragon

  Component                                        Model                                  % time (n = 3840, p = 64)
  matrix transformation computation (Section 5.3)  (10/3)(n³/p)γ3   (γ3 = 0.0215)         35
  tridiagonal eigendecomposition
    computation (Section 5.4)                      239(n²/p)                              28
  message initiation (Section 5.5)                 17n·log2(√p)·α   (α = 65.9)            10
  message transmission (Section 5.6)               7(n²/√p)·log2(√p)·β   (β = 0.146)       4
  order n overhead & imbalance (Section 5.7)       2780n                                   7
  order n² overhead & imbalance (Section 5.8)      14.0(n²/√p)                            14

  n    Matrix size
  p    Number of processors
  γ3   Matrix-matrix multiply time (= 0.0215 microseconds/flop)
  α    Message latency time (= 65.9 microseconds/message)
  β    Message throughput time (= 0.146 microseconds/word)
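The six terms can be evaluated directly (constants from Table 5.1); doing so reproduces the relative ordering of the components, with the model total coming out somewhat below the measured 205 seconds:

```python
from math import log2, sqrt

GAMMA3, ALPHA, BETA = 0.0215, 65.9, 0.146   # microseconds per flop / message / word

def six_term_model(n, p):
    """Seconds spent in each component of the six-term PDSYEVX model."""
    sp = sqrt(p)
    terms = {
        "matrix transformations":    (10 / 3) * n**3 / p * GAMMA3,
        "tridiagonal eigensolution": 239 * n**2 / p,
        "message initiation":        17 * n * log2(sp) * ALPHA,
        "message transmission":      7 * n**2 / sp * log2(sp) * BETA,
        "order n overhead":          2780 * n,
        "order n^2 overhead":        14.0 * n**2 / sp,
    }
    return {task: us / 1e6 for task, us in terms.items()}

for task, seconds in six_term_model(3840, 64).items():
    print(f"{task:26s} {seconds:6.1f} s")
```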
Although PDSYEVX is efficient on the PARAGON², Table 5.1 shows us that there is
room for improvement. Ignoring the execution time required for solution of the tridiagonal
eigenproblem for the moment, we note that the matrix transformations reach only about
50% of peak performance (35% vs. 35+10+4+7+14 = 70%) for this problem size (roughly
the largest that will fit on this PARAGON). Furthermore, efficiency will be lower for smaller
problem sizes.

Unfortunately, there is no single culprit that accounts for the inefficiency. Communication
accounts for a bit less than half of the inefficiency, while software overhead
accounts for a bit more than half of the inefficiency.

²Details about the hardware and software used for this timing run are given in table 6.3
One could argue that while n = 3840 on 64 nodes is the largest problem that
PDSYEVX can run on this particular computer, it is still a relatively small problem. However,
there are several reasons not to ignore this result. First, while it is true that newer
machines have more memory, they also have much faster floating point units, steeper memory
hierarchies, and few offer communication to computation ratios as high as the PARAGON.
Furthermore, we should strive to achieve high efficiency across a range of problem sizes,
not just for the largest problems that can fit on the computer. Achieving high efficiency on
small problem sizes means that users can efficiently use more processors and hence reduce
execution time.

In summary, PDSYEVX is a good starting point, but leaves room for improvement.
However, significantly improving performance will require attacking more than one source
of inefficiency.

The fact that PDSYEVX spends 28% of its total time in solving the tridiagonal eigenproblem
is a result of the slow divide on the PARAGON. The PARAGON offers two divides: a fast
divide and a slow divide that meets the IEEE 754 spec [7]. Although ScaLAPACK's bisection
and inverse iteration codes are designed to work with an inaccurate divide, ScaLAPACK
uses the slow correct divide by default.
5.1 Deriving the PDSYEVX execution time on the Intel Paragon (common case)

This six term model is based on the detailed model described in Chapter 4, which
has been validated on a number of distributed memory computers and a wide range of data
layouts and problem sizes.

5.2 Simplifying assumptions allow the full model to be expressed as a six term model

I assume that a reasonably efficient data layout is chosen. I set the data layout
parameters as follows:

nb = 32. The optimal block size on the Paragon is about 10; however, the reduction in
execution time obtained by using nb = 10 rather than nb = 32 is less than 10%, so
we stick to our standard suggested value of nb.

pr = pc = √p. PDSYEVX achieves the best performance³ when pc ≥ pr ≥ pc/4. Assuming that
pr = pc = √p allows the pr and pc terms to be coalesced into a single √p term.

pbf = 2. The panel blocking factor⁴, pbf = max(2, lcm(pr, pc)/pc) in ScaLAPACK version 1.5.

vnb = 0. vnb is the imbalance in the number of rows in the original matrix as distributed
amongst the processors. I assume that the matrix is initially balanced perfectly
amongst all processors, i.e. n is a multiple of pr·nb.

γ2 = γ3. We assume for the simplified model that all flops are performed at the peak flop
rate. This introduces an error equal to (2/3)(n³/p)(γ2 − γ3), which is typically no more
than 2-5% of the total time on the PARAGON.

m = e = n. Assume that a full eigendecomposition is required, i.e. all eigenvalues are
required (e = n) and all eigenvectors are required (m = n).

c = 1. Assume that the input matrix has no clusters of eigenvalues.

In addition, we set all of the machine parameters to constants measured or estimated
on the Intel Paragon, as shown in table 6.3, in order to coalesce the overhead, load
imbalance, and tridiagonal eigendecomposition terms into just three terms.
5.3 Deriving the computation time during matrix transformations in PDSYEVX on the Intel Paragon

Table 5.2 shows that PDSYTRD performs (4/3)(n³/p) + O(n²) flops per process. Of these,
(2/3)(n³/p) + O(n²) are matrix-vector multiply flops and (2/3)(n³/p) + O(n²) are matrix-matrix
multiply flops. PDSYTRD performs the same floating point operations that the LAPACK routine
DSYTRD does. And (4/3)n³ is the textbook [84] number of flops for reduction to tridiagonal form.

³Performance of PDSYEVX is not overly sensitive to the data layout, provided that nb is sufficiently large to allow good DGEMM performance, that the processor grid is reasonably close to square, and that lcm(pr, pc) is not outrageous compared to pc and pr. (The latter factor is only relevant when one is dealing with thousands of processors.) I have not performed a detailed study of when using fewer processors results in lower execution time. However, if you drop processors only when necessary to make pc ≥ pr ≥ pc/16 and lcm(pr, pc) ≤ 10pc, the processor grid chosen will allow performance within 10% of the optimal processor grid.
⁴The matrix-vector multiplies are each performed in panels of size pbf·nb. See Section 4.2.2.
Table 5.2: Computation time in PDSYEVX

  Task                                              Full model                       Six term model
  computation time during reduction to
    tridiagonal form (See section 4.2)              (2/3)(n³/p)γ2 + (2/3)(n³/p)γ3    (4/3)(n³/p)γ3
  computation time during back
    transformation (See table 4.10)                 2(n²·m/p)γ3                      2(n³/p)γ3
  Total                                                                              (10/3)(n³/p)γ3

Table 5.3: Execution time during tridiagonal eigendecomposition

  Task: computation time during tridiagonal eigendecomposition (See section 4.3)
  Full model:    265(n·e/p)γ1 + 45(n·m/p)γ1 + 53(n·e/p)δ + 3(n·m/p)δ + 2n·c²·γ1
  Paragon model: (310·0.074 + 56·3.85 + 0)(n²/p)
  Paragon time (Total): 239(n²/p)
PDORMTR performs 2(n³/p) + O(n²) flops per process. Again this is the same as the
LAPACK routine.

5.4 Deriving the computation time during eigendecomposition of the tridiagonal matrix in PDSYEVX on the Intel Paragon

The computation time during tridiagonal eigendecomposition, in the absence of
clusters of eigenvalues, is O(n²) and hence for large n becomes less important.

The simplified model for the execution time of the tridiagonal eigensolution on the
PARAGON in table 5.3 is obtained from the detailed model by replacing γ1 and δ with their
values on the PARAGON and by assuming that all clusters of eigenvalues are of modest size.

Load imbalance during the tridiagonal eigendecomposition is caused in part by the
fact that not all processes will be assigned the same number of eigenvalues and eigenvectors
and in part by the fact that different eigenvalues and eigenvectors will require slightly
different amounts of computation. Our experience indicates that the load imbalance corresponds
roughly to the cost of finding two eigenvalues and two eigenvectors.
Table 5.4: Message initiations in PDSYEVX

  Task: message initiation during reduction to tridiagonal form (See table 4.9)
  Full model:     (13⌈log2(pr)⌉ + 4⌈log2(pc)⌉)n·α
  Six term model: 17n·log2(√p)·α
  Total:          17n·log2(√p)·α

Table 5.5: Message transmission in PDSYEVX

  Task: message transmission time during reduction to tridiagonal form (See table 4.9)
  Full model:     (3⌈log2(pc)⌉(n²/pr) + 2⌈log2(pr)⌉(n²/pc))β
  Six term model: 5(n²/√p)·log2(√p)·β

  Task: message transmission time during back transformation (See table 4.10)
  Full model:     2⌈log2(pr)⌉(n·m/pc)β
  Six term model: 2(n²/√p)·log2(√p)·β

  Total:          7(n²/√p)·log2(√p)·β
5.5 Deriving the message initiation time in PDSYEVX on the Intel Paragon

Table 5.4 shows that PDSYEVX requires 17n·log2(√p) message initiations.

5.6 Deriving the inverse bandwidth time in PDSYEVX on the Intel Paragon

Table 5.5 shows that PDSYEVX transmits 7(n²/√p)·log2(√p) words per node.

5.7 Deriving the PDSYEVX order n imbalance and overhead term on the Intel Paragon

Table 5.6 shows the origin of the Θ(n) load imbalance cost on the Intel Paragon.
Table 5.6: Θ(n) load imbalance cost on the PARAGON

  Task: load imbalance during eigendecomposition (See section 4.3)
  Full model: 620γ1 + 112δ;   Paragon model: 620·0.0740 + 112·3.85;   Paragon time: 477n

  Task: order n overhead term in reduction to tridiagonal form (See table 4.7)
  Full model: 9α4 + 6α2;   Paragon model: 9·235 + 6·23.5;   Paragon time: 2256n

  Task: order n overhead term in back transformation (See table 4.10)
  Full model: 2α2;   Paragon model: 2·23.5;   Paragon time: 47n

  Total: 2780n

Table 5.7: Order n²/√p load imbalance and overhead term on the PARAGON

  Task: order n²/√p overhead term in reduction to tridiagonal form (See table 4.7)
  Full model:    2(n²/(nb·pbf·pc))α2 + 2(n²/(nb²·pbf·pc))α3 + (n²/pc)γ1
  Paragon model: (2·23.5/(32·2) + 2·103/(32·32·2) + 3.97)(n²/√p)
  Paragon time:  4.70(n²/√p)

  Task: order n²/√p load imbalance term in reduction to tridiagonal form (See table 4.7)
  Full model:    (7/2)(n²·nb/pr)γ2 + (1/2)(n²·nb/pc)γ2 + (n²·nb·pbf/pr)γ2
                 + (1/2)(n²·nb/pr)γ3 + (1/2)(n²·nb/pc)γ3 + (n²·nb·pbf/pr)γ3
  Paragon model: (((7/2)·32 + (1/2)·32 + 32·2)·0.0247 + ((1/2)·32 + (1/2)·32 + 2·32)·0.0215)(n²/√p)
  Paragon time:  6.81(n²/√p)

  Task: order n²/√p load imbalance term in back transformation (See table 4.10)
  Full model:    0.5(n²·nb/pr)γ2 + 3(n·m·nb/pc)γ3 + 2(n²·vnb/pc)γ3
  Paragon model: (0.5·32·0.0247 + 3·32·0.0215 + 2·0.0215·0)(n²/√p)
  Paragon time:  2.46(n²/√p)

  Total: 14.0(n²/√p)
5.8 Deriving the PDSYEVX order n²/√p imbalance and overhead term on the Intel Paragon

The order n²/√p load imbalance and overhead term on the Intel Paragon, 14.0(n²/√p), is
shown in table 5.7.

See section 5.2 for details on the assumptions made to simplify the full model to
the six term model. Note that vnb is assumed to be zero and that pbf is assumed to be 2.
Chapter 6
Performance on distributed memory
computers
6.1 Performance requirements of distributed memory computers for running PDSYEVX efficiently

The most important feature of a parallel computer is its peak flop rate. Indeed,
everything else is measured against the peak flop rate. The second most important feature
is main memory, but which feature of main memory is most important depends on whether
you want peak efficiency (i.e. using as few processors as possible) or minimum execution
time (i.e. using more processors). If you plan to use only as many processors as necessary,
filling each processor's memory completely, then main memory size is the most important
factor controlling efficiency. If you plan to use more processors, main memory random
access time becomes the most important factor.

Network performance of today's distributed memory computers is good enough to
keep communication cost from being the limiting factor on performance. Furthermore, if
the network performance (either latency or bandwidth) were the limiting factor, there are
ways that we could reduce the communication cost by as much as log(√p) [107]. Still, if one
has a network of workstations connected by a single ethernet or FDDI ring, the very low
bisection bandwidth will always keep efficiency low. See section 8.4.2 for details.
6.1.1 Bandwidth rule of thumb

Bandwidth rule of thumb: Bisection bandwidth per processor¹ times the square root
of memory size per processor should exceed floating point performance per processor.

(Megabytes/sec per processor) × √(Megabytes per processor) > (Megaflops/sec per processor)

assures that bandwidth will not limit performance.
The bandwidth rule of thumb shows that if memory size grows as fast as peak
floating point execution rate, the network bisection bandwidth need only grow as the square
root of the peak floating point execution rate. This is very encouraging for the future of
parallel computing. This rule also shows that the bandwidth requirement grows as the
problem size decreases. This rule does not make as wide a claim as the memory rule of
thumb: it does not promise that PDSYEVX will be efficient, only that bandwidth will not be
the limiting factor.

Provided the bandwidth rule of thumb holds, execution time attributable to message
volume will not exceed 40% of the time devoted to floating point execution in PDSYEVX
on problems that nearly fill memory.
6.1.2 Memory size rule of thumb

Memory size rule of thumb: memory size should match floating point performance.

(Megabytes per processor) > (Megaflops/sec per processor)

assures that PDSYEVX will be efficient on large problems.

This rule is sufficient because it holds even if message latency and software overhead
hold constant as peak performance increases and network bisection bandwidth and
BLAS2 performance increase as slowly as the square root of the increase in the peak flop rate.

¹Bisection bandwidth per processor is the total bisection bandwidth of the network divided by the number of processors.
message transmission time / floating point execution time
  = [7.5(n²/√p)⌈log2(√p)⌉β] / [(10/3)(n³/p)γ3]             (Table 5.1)
  = [7.5⌈log2(√p)⌉β] / [(10/3)(n/√p)γ3]                    (cancel n²/√p)
  = [7.5⌈log2(√p)⌉β] / [(10/3)√(M·10⁶/(6·8))γ3]            (n/√p = √(M·10⁶/(6·8)); PDSYEVX uses 6n²/p DP words)
  = [7.5·3·8·10⁻⁶/mbs] / [(10/3)√(M·10⁶/48)·10⁻⁶/mfs]      (β = 8·10⁻⁶/mbs, ⌈log2(√p)⌉ = 3, γ3 = 10⁻⁶/mfs)
  = 0.374·mfs/(√M·mbs)                                     (simplify: 0.374 = 7.5·3·8·√(6·8)/((10/3)·10³))

Figure 6.1: Relative cost of message volume as a function of the ratio between peak floating
point execution rate in Megaflops, mfs, and the product of main memory size in Megabytes,
M, and network bisection bandwidth in Megabytes/sec, mbs.
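The constant 0.374 in Figure 6.1 can be checked numerically:

```python
from math import sqrt

# 7.5 * ceil(log2(sqrt(p))) * 8 * sqrt(6*8) / ((10/3) * 10**3), with
# ceil(log2(sqrt(p))) = 3; the 6*8 converts M megabytes into the 6*n^2/p
# double-precision (8-byte) words that PDSYEVX stores per processor.
c = 7.5 * 3 * 8 * sqrt(6 * 8) / ((10 / 3) * 10**3)
print(round(c, 3))
```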
Message latency and software overhead are limited by main memory access time, which
decreases slowly, but bisection bandwidth and BLAS2 performance (which is limited by main
memory bandwidth) continue to improve, though not as rapidly as peak performance.

When the number of megabytes of main memory equals the peak floating point
rate (in megaflops/sec), message latency will typically account for ten times less execution
time than the time devoted to floating point execution in PDSYEVX on problems that nearly
fill memory. The arithmetic in figure 6.2 justifies this statement provided that message
latency does not exceed 100 microseconds.

The memory rule of thumb is too simple to capture all aspects of any computer;
nonetheless we have found it to be useful. The derivation in figure 6.2 makes two main
assumptions: latency is around 100 microseconds and ⌈log2(√p)⌉ = 3. Seldom will either
be exactly correct, but in our experience neither will tend to be off by more than a factor
of 2 (i.e. p ≤ 4096). The memory rule of thumb also depends on sufficient bandwidth and
on reasonable BLAS2 and software overhead costs. As we will show next, network bandwidth
capacity and BLAS2 performance need not grow rapidly to support this rule, and software
overhead costs need only remain constant.

The memory rule of thumb holds for all computers marketed as distributed memory
message latency time / floating point execution time
  = [17n⌈log2(√p)⌉α] / [(10/3)(n³/p)γ3]                    (Table 5.1)
  = [17⌈log2(√p)⌉α] / [(10/3)(n²/p)γ3]                     (cancel n)
  = [17⌈log2(√p)⌉α] / [(10/3)(M·10⁶/48)γ3]                 (n²/p = M·10⁶/48; PDSYEVX uses 6n²/p DP words)
  = [17·3·100·10⁻⁶] / [(10/3)(M·10⁶/48)(10⁻⁶/mfs)]         (α = 100·10⁻⁶, ⌈log2(√p)⌉ = 3, γ3 = 10⁻⁶/mfs)
  = 0.073·mfs/M                                            (0.073 = 17·3·100·10⁻⁶ / ((10/3)/48))

Figure 6.2: Relative cost of message latency as a function of the ratio between peak floating
point execution rate in Megaflops, mfs, and main memory size in Megabytes, M.
computers, but does not hold for non-scalable or extremely low bandwidth networks. One
could design a distributed memory computer for which this rule does not hold, but the
features that are necessary for this rule to hold are also important for a range of other
applications, and hence we expect this rule to hold for essentially all distributed memory
computers.
The memory rule of thumb, while sufficient, is not necessary. It is possible to achieve
efficiency in PDSYEVX on computers whose memory is smaller than that suggested by this
rule2. In section 6.1.3 I discuss what properties a computer must have to allow efficient
execution on smaller problem sizes.
Though meeting the memory rule of thumb is not necessary to achieve high perfor-
mance, there are reasons to believe that it will be useful for several years. Software latencies
are not decreasing rapidly. Software overhead, since it is tied to main memory latency, is
not decreasing rapidly either. Bisection bandwidth and BLAS2 performance are increasing,
but not as fast as peak floating point performance.
On the other hand, improvements to PDSYEVX will make it possible to achieve high
performance with less memory and may someday obsolete the memory rule of thumb.
2The PARAGON is an example.
6.1.3 Performance requirements for minimum execution time
If you intend to use as many processors as possible to minimize execution time,
the second most important machine characteristic (after peak floating point rate) is main
memory speed. Main memory speed affects three of the four sources of inefficiency in
PDSYEVX: message initiation, load imbalance and software overhead. Message initiation and
software overhead costs are controlled by how long it takes to execute a stream of code
with little data or code locality. Since the communication software initiation code offers
little code or data locality, its execution time is largely dependent on main memory latency.
Load imbalance consists mainly of BLAS2 row and column operations. The BLAS2 flop rate
is controlled by main memory bandwidth. Smaller main memory bandwidth also requires a
larger blocking factor in order to achieve peak floating point performance in matrix-matrix
multiply. Larger blocking factors mean more BLAS2 row and column operations. Hence
reduced main memory speed has a double effect on the cost of row and column operations:
it increases their number while increasing the cost per operation.
Caches can be used to improve memory performance; however, the value of caches
is reduced by several factors. The inner loop in reduction to tridiagonal form, the source
of most of the inefficiency in PDSYEVX, is substantial and includes many subroutine calls.
ScaLAPACK is a layered library which includes the PBLAS, BLAS, BLACS and the underlying
communication software. The inner loop in reduction to tridiagonal form touches every
element in the unreduced (trailing) part of the matrix. The second level cache is typically
shared between code and data. Even the way that BLAS routines are typically coded impacts
the value of caches in PDSYEVX. The fact that the inner loop in reduction to tridiagonal form
includes many subroutine calls, combined with ScaLAPACK's layered approach, means that
this inner loop typically involves many code cache misses. Indeed, even the much simpler
inner loop in LU involves many code cache misses in ScaLAPACK[160]. Since this same inner
loop touches every element in the matrix, the secondary cache, typically shared by both
code and data, will be completely flushed each time through the loop, meaning that code
cache misses will have to be satisfied by main memory.
The way that BLAS routines are typically optimized leads to a high code cache miss
rate. BLAS routines are typically coded and optimized by timing them on a representative
set of requests[92]. Each request, however, is typically run many times and the times are
averaged. Each run may involve different data to ensure that the times represent the cost
of moving the data from main memory. However, no effort is made3 to account for the
cost of moving the code from main memory. Hence, the code cache is a resource to which
no cost is assigned during optimization. Loop unrolling can vastly expand the code cache
requirements, but it can also improve performance, at least if the code is in cache. Hence
it is likely that in optimizing BLAS codes, some loops get unrolled to the point where they
use half or more of the code cache. If two such codes are called in the same loop, code
cache misses are inevitable. The unfortunate aspect of this is that the hardware designer is
powerless to prevent it. Increasing the size of the code cache might lead to even more loop
unrolling and even worse performance.
There are two ways that hardware manufacturers could make caches more useful.
One would be to improve the way that BLAS codes are optimized to ensure that the code
cache is a recognized resource (either by measuring code cache use in each call or by having
the codes optimized on a system with smaller cache sizes than those offered to the public).
The second would be to allow a path from main memory to the register file that bypasses
the cache. In the inner loop of reduction to tridiagonal form, every element of the matrix
is touched, but there is no temporal locality and no point in moving these elements up the
cache hierarchy. If these calls to the BLAS matrix-vector multiply routine, DGEMV, could be
made to bypass the caches, these caches would remain useful in the other portions of the
code: i.e. software overhead and communication latency. Even row and column operations
would benefit. These operations involve data locality across loop iterations; that locality
is currently made worthless by the fact that the loop touches every element in the matrix
each time through, but it could be useful if certain DGEMV calls could be made to bypass the
caches. This would require a coordinated software and hardware effort.
Secondary caches are of little importance in determining PDSYEVX execution time
because the inner loop traverses the entire matrix without any temporal data locality within
the loop. Secondary caches are important to achieving peak matrix-matrix multiply perfor-
mance, but that is their only use in PDSYEVX. In principle, if the secondary cache were
large enough and the problem small enough, the secondary cache could hold the entire
matrix and hence act as fast main memory. Unfortunately, secondary caches are never
large enough to support an efficient problem size.
I would hope that, if there are other applications like PDSYEVX that could make
3It is difficult to account for the cost of moving the code from main memory.
efficient use of smaller, faster memories, some vendor or vendors will build machines
with smaller, faster main memory. I suspect that more applications need large slow mem-
ory than small fast memory. Indeed, PDSYEVX can work well either way. But, especially
with improvements to PDSYEVX that will allow it to achieve high performance on smaller
problem sizes, PDSYEVX could achieve impressive results on a distributed memory machine
with half the main memory now typical of distributed memory parallel computers if that
smaller main memory could be made modestly, say 20%, faster. With the out-of-core sym-
metric eigensolver being developed by Ed D'Azevedo (based on my suggestion to reduce
main memory requirements from 4n² to ½n² by using symmetric packed storage during
the reduction to tridiagonal form and two passes through back transformation), the main
memory requirements of PDSYEVX will drop by a factor of 6 to 12, furthering the argument
for smaller, faster main memory.
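The memory arithmetic behind that suggestion can be sketched as follows (a hypothetical helper of my own; it assumes 8-byte double-precision words and the 4n² and ½n² word counts quoted above):

```python
def workspace_mbytes(n, packed=False):
    # PDSYEVX currently needs about 4*n^2 double-precision words; symmetric
    # packed storage during the reduction drops this to about n^2/2 words.
    words = 0.5 * n**2 if packed else 4 * n**2
    return words * 8 / 1e6  # 8 bytes per DP word

full = workspace_mbytes(4000)                 # 512 MB
packed = workspace_mbytes(4000, packed=True)  # 64 MB
```

The factor of 8 between the two is consistent with the "factor of 6 to 12" quoted above.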
As ScaLAPACK improves, it will be able to achieve high efficiency on smaller problem
sizes. This will mean that the best machines for ScaLAPACK will have less memory than
that suggested by the memory rule of thumb at the top of this chapter.
6.1.4 Gang scheduling
A code which involves frequent synchronizations, such as reduction to tridiagonal
form, requires either dedicated use of the nodes upon which it runs or gang scheduling.
If even one node is not participating in the computation, the computation will stall at the
next synchronization point.
6.2.1 Consistent performance on all nodes
A statically load balanced code, such as PDSYEVX, will execute only as fast as
the slowest node on which it is run. This, like the need for gang scheduling, is obvious.
Yet occasionally nodes which have identical specifications perform differently. Kathy Yelick
noticed that some nodes of the CM5 at Berkeley were slower than others. And I have reason
to believe that at least two of the nodes on the PARAGON at the University of Tennessee at
Knoxville are slower than the others (see Table 6.3).
Table 6.1: Performance characteristics (microseconds; Mflops or Mbytes/s in parentheses).

                                                 IBM SP2       PARAGON
    message latency (α)                          54            66
    transmission cost per word (β)               0.12 (67)     0.14 (57)
    matrix-vector multiply flop rate (γ₂)        .0037 (270)   0.0235 (42)
    BLAS1 flop rate (γ₁)                         .25 (4)       3.8 (.26)
    matrix-vector multiply software overhead (α₂) 5            80
    matrix-matrix multiply flop rate (γ₃)        ??            ??
    divide (δ)                                   ??            ??

The people who design and maintain distributed memory parallel computers should
make sure that slow nodes are identified and marked as such or taken off-line.
6.3 Performance characteristics of distributed memory computers
6.3.1 PDSYEVX execution time (predicted and actual)
Table 6.3 compares predicted and actual performance on the Intel PARAGON. Actual
PDSYEVX performance never exceeds the performance predicted by our model and usually
is within 15% of the predicted performance. Every run whose actual execution time is
more than 15% greater than expected execution time is marked with an asterisk.
I would be satisfied with a performance model that is within 20% to 25%, and would not
expect this performance model to match to within 15% on other machines. I have checked
several of these runs and have noticed that in them one or two processors have noticeably
slower performance on DGEMV than the other processors. I have also rerun many of these
aberrant timings, and for each that I have rerun, at least one of the runs completed within
15% of predicted performance. Nonetheless, this aberrant behavior deserves further study.
                            PARAGON MP                                    IBM SP2
    Processor               50 MHz i860 XP                                120 MHz POWER2 SC
    Location                xps5.ccs.ornl.gov                             chowder.ccs.utk.edu
    Data cache              16 Kbytes, 4-way set-associative,             128 Kbytes
                            write-back, 32-byte lines
    Code cache              16 Kbytes, 4-way set-associative,             32 Kbytes
                            32-byte blocks
    Second level cache      None                                          None
    Processors per node     1                                             1
    Memory per node         32 Mbytes                                     256 Mbytes
    Operating system        Paragon OSF/1 1.0.4 R1.4.5                    AIX
    ScaLAPACK               1.5                                           1.5
    BLAS                    -lkmath                                       -lesslp2
    BLACS                   NX BLACS                                      MPL BLACS
    Communication software  NX                                            MPI
    Precision               Double (64 bits)                              Double (64 bits)

Table 6.2: Hardware and software characteristics of the PARAGON and the IBM SP2.
Table 6.3: Predicted and actual execution times of PDSYEVX on xps5, an Intel PARAGON.
Problem sizes which resulted in execution time more than 15% greater than predicted
are marked with an asterisk. Many of these problem sizes were repeated to show that the
unusually large execution times are aberrant.

    n     nprow  npcol  nb   Actual time (s)   Estimated time (s)   Estimated/Actual
375 2 4 32 8.51 8.24 0.97
375 4 8 32 6.34 4.65 0.73*
750 2 4 32 31.2 30.1 0.96
750 2 4 32 31.3 30.1 0.96
750 2 4 32 31.5 30.1 0.96
750 2 4 32 41.2 30.1 0.73*
750 2 4 32 43.3 30.1 0.7*
750 4 4 32 20.3 18.9 0.93
750 4 6 32 16.5 15.3 0.93
750 4 6 32 22.3 15.3 0.69*
750 4 6 32 23.1 15.3 0.66*
750 4 8 32 14.1 13.2 0.93
1000 2 4 32 55.8 53.8 0.96
1000 2 4 8 52.9 54.4 1
1000 4 2 32 56.5 54.9 0.97
1000 4 2 8 56.2 59.3 1.1
1125 2 4 32 72.2 68.8 0.95
1125 4 8 32 38.2 26.7 0.7*
1500 2 4 32 133 127 0.95
1500 2 4 32 134 127 0.95
1500 2 4 32 134 127 0.95
1500 2 4 32 176 127 0.73*
1500 2 4 32 183 127 0.7*
1500 4 4 32 77.2 72.9 0.94
1500 4 6 32 77 55 0.71*
1500 4 6 32 59.3 55 0.93
1500 4 6 32 80.9 55 0.68*
1500 4 8 32 48.6 45.2 0.93
1875 4 8 32 99.7 70.9 0.71*
2250 4 4 32 186 175 0.94
2250 4 6 32 138 127 0.92
2250 4 6 32 179 127 0.71*
2250 4 6 32 182 127 0.7*
2250 4 8 32 112 102 0.91
2625 4 8 32 203 144 0.71*
3000 4 8 32 214 191 0.89
Chapter 7

Execution time of other dense symmetric eigensolvers
In this chapter, I present models for performance of other symmetric eigensolvers. These
models have not been fully validated, although some have been partly validated.
7.1 Implementations based on reduction to tridiagonal form
7.1.1 PeIGs
PeIGs[74], like PDSYEVX, uses reduction to tridiagonal form, bisection, inverse iteration and
back transformation to perform the parallel eigendecomposition of a dense symmetric matrix.
The execution time of PeIGs differs from that of PDSYEVX for two significant reasons: PeIGs
is coded differently (using a different language and different libraries) than PDSYEVX, and
it uses a different re-orthogonalization strategy. I am more interested in the difference
resulting from the different re-orthogonalization strategy.
In PDSYEVX the number of flops performed by any particular processor, p_i, during
re-orthogonalization is:

    Σ_{C ∈ clusters assigned to p_i} Σ_{i=1}^{size(C)} 4 · n_iter(i) · n · (i − 1)

where n_iter(i) is the number of inverse iterations performed for eigenvalue i (typically 3).
If the size of the largest cluster is greater than n/p, the processor which is responsible for
this cluster will not be responsible for any eigenvalues outside of this cluster.
Hence, if the size of the largest cluster is greater than n/p, the number of flops
performed by the processor to which this cluster is assigned is (on average):

    4 · n_iter · n · ½c² = 6n·c²

where c = max_{C ∈ clusters} size(C), i.e. the number of eigenvalues in the largest cluster,
and n_iter = 3 is the average number of inverse iterations performed for each eigenvalue.
If the largest cluster has fewer than n/p eigenvalues, the number of eigenvalues that
will be assigned to any one processor, and hence the total number of flops it must perform,
is limited. The worst case is where there are p + 1 clusters, each of size n/(p+1). In this case,
one processor must be assigned 2 clusters of size n/(p+1), requiring (on average)
2 · 6n·(n/(p+1))², or roughly 12n³/p² flops.
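The worst case can be checked with a few lines (the function name is mine):

```python
def reorth_flops_worst_case(n, p, n_iter=3):
    # Worst case from the text: p + 1 clusters, each of size n/(p+1); one
    # processor is assigned two of them.  A cluster of size c costs about
    # 4 * n_iter * n * c^2 / 2 = 6*n*c^2 flops, so the loaded processor
    # performs 2 * 6*n*(n/(p+1))^2, roughly 12*n^3/p^2 flops.
    c = n / (p + 1)
    return 2 * 6 * n * c**2
```

For n = 1000 and p = 9 this gives 1.2e8 flops, close to the 12n³/p² approximation.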
In contrast, PeIGs uses multiple processors and simultaneous iteration to maintain
orthogonality among eigenvectors associated with clustered eigenvalues. Traditional inverse
iteration[102] computes one eigenvector at a time, re-orthogonalizing against all previous
eigenvectors associated with eigenvalues in the same cluster after each iteration. PeIGs,
in what they refer to as simultaneous iteration, performs one step of inverse iteration on
all eigenvectors associated with a cluster of eigenvalues and then reorthogonalizes all the
eigenvectors. This allows the re-orthogonalization to be performed efficiently in parallel.
PeIGs is more accurate but slower than PDSYEVX if the input matrix has large
clusters of eigenvalues1. The cost of re-orthogonalization in PeIGs is O(n²c/p) flops versus
O(nc²) flops in PDSYEVX.
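A rough sketch of the simultaneous-iteration idea follows. This is an illustration in the spirit of PeIGs, not the PeIGs implementation: the function name, the dense solve, and the QR-based block re-orthogonalization are my simplifications.

```python
import numpy as np

def simultaneous_inverse_iteration(T, cluster_eigs, iters=3):
    # Iterate every eigenvector of a cluster together, then re-orthogonalize
    # the whole block at once (the block step is what parallelizes well).
    n = T.shape[0]
    k = len(cluster_eigs)
    X = np.random.default_rng(0).standard_normal((n, k))
    for _ in range(iters):
        for j, lam in enumerate(cluster_eigs):
            # one inverse-iteration step per eigenvector (shift slightly off
            # the eigenvalue so the shifted matrix is nonsingular)
            X[:, j] = np.linalg.solve(T - lam * np.eye(n), X[:, j])
        # block re-orthogonalization of all cluster eigenvectors at once
        X, _ = np.linalg.qr(X)
    return X

# tiny example with a cluster of two eigenvalues near 1
T = np.diag([1.0, 1.001, 4.0, 7.0])
X = simultaneous_inverse_iteration(T, [1.0 + 1e-6, 1.001 + 1e-6])
```

Traditional inverse iteration would instead orthogonalize each new vector against all previously computed vectors in the cluster, which serializes the work.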
7.1.2 HJS
Hendrickson, Jessup and Smith[91] wrote a symmetric eigensolver, HJS, for the PARAGON
which is significantly faster than PDSYEVX, but which has never been released and only
works on the Intel PARAGON.
HJS requires that the data layout block size be 1 (i.e. a cyclic distribution), that
the processor grid be square (i.e. pr = pc), and that intermediate matrices be replicated
across processor columns and distributed across processor rows. The requirement that the
processor grid be square limits efficiency when used on a non-square processor grid. They
show that the algorithmic block size need not be tied to the data layout block size. At the
time that PDSYTRD was written, the PBLAS could not efficiently use a cyclic distribution and
did not support matrices replicated in one processor dimension and distributed across the
other.
1PDSYEVX can maintain orthogonality among eigenvectors associated with clusters of up to n/p eigenvalues easily and efficiently.
HJS has several advantages over PDSYEVX. It uses a more efficient transpose oper-
ation, eliminates redundant communication, reduces the number of messages by combining
some, and reduces the number of words transmitted per process by using recursive halving
and doubling. HJS also reduces the load imbalance by a factor of √p by using a cyclic data
layout and using all processors in all calculations2. ScaLAPACK will incorporate several of
these ideas into the next version of PDSYEVX.
HJS notation
HJS also differs in a couple of other rather minor aspects. They compute the norm
of v in a manner which could overflow, and they represent the reflector in a manner which
could likewise overflow. These choices reduce execution time and program complexity slightly.
Their manner of counting the cost of messages in their performance model also
differs from ours. They count the cost of a message swap (sending a message to and simul-
taneously receiving a message from another processor) as equal to the cost of sending a single
message. This reflects reality on the PARAGON and many, but not all, distributed memory ma-
chines. Using their method would not significantly change the model for PDSYEVX because
PDSYEVX does not use message swap operations.
In their paper[91], they use different variable names for the result of each compu-
tation, and show all indices explicitly. Figure 7.1 relates their notation to ours.

Figure 7.1: HJS notation

    HJS   our equivalent    details
    L     tril(A)
    x     w                 w = tril(A) v
    y     wT                wT = tril(A,-1) vT
    p     w                 w = w + transpose(wT)
    c                       not mathematically identical

2PDSYTRD uses only pr processors in many computations
7.1.3 Comparing the execution time of HJS to PDSYEVX
The HJS implementation of parallel blocked Householder tridiagonalization performs essen-
tially the same computation as PDSYEVX. The difference is in the communication, load
balance and overhead costs. However, the operations are not performed in the same order,
and hence the steps don't match exactly. Some of the costs, particularly communication
costs, could easily have been assigned to a different operation than the one that I assigned
them to. Hence, the execution time models for each of the individual tasks should not be
taken in isolation but understood as an aid in understanding the total.
Updating the current column of A (Line 1.1 in Figure 7.2)
As shown in table 4.1, the cost of updating the current column of A in PDSYTRD is:

    2n⌈log2(√p)⌉α + (n/nb)⌈log2(√p)⌉α + 2n·α₂ + (n²·nb/p)γ₂ + 2n·α₄

In Figure 6[91], steps Y2, 10.1, 10.2 and 10.3 of HJS are involved in updating the current
column of A, and the cost of these steps is:

    n·α + ½(n²/√p)β + 2n·α₂ + (n²·nb/p)γ₂
In PDSYEVX, a small part of vT and wT must be broadcast within the current column
of processors. In HJS, there is no need to broadcast vT because it is already replicated across
all processor rows. Instead of broadcasting the piece of wT that is necessary for this update,
HJS transposes all of wT (cost: n·α + ½(n²/√p)β), anticipating the need for this in the rank
2k update.
The number of DGEMV flops performed does not change, but they are distributed
across all of the processors instead of being shared only by one column of processors. In
order to allow these flops to be distributed across all the processors, this update is performed
in a right-looking manner, i.e. the entire block column of the remaining matrix is updated
with the Householder reflector. In PDSYEVX, this update is performed in a left-looking
manner: only the current column is updated (with a matrix-vector multiply). In PDSYEVX,
the right-looking variant does not spread the work any better, and hence the left-looking
variant is preferred because it involves a matrix-vector multiply, DGEMV, rather than a rank-
one update, DGER. Matrix-vector multiply requires only that every matrix element be read.
A rank-one update requires that every matrix element be read and then re-written.
The α₄ term does not exist for HJS because they do not use the PBLAS, avoiding
the error checking and overhead associated with the PBLAS.
Computing the reflector (Line 2.1 in Figure 7.2)
As shown in table 4.2, the cost in PDSYTRD is:

    3n⌈log2(pr)⌉α + n·α₄

In Figure 6[91], steps 2, 3, 4, 5, 6 and X of HJS are involved in computing the reflector,
and the cost of these steps is:

    n⌈log2(p)⌉α

a little less than the cost in PDSYTRD.
Step 1 in HJS is also used in the computation of the reflector; however, step 1 isn't
necessary to compute the reflector, and it is necessary for the matrix-vector multiply,
hence I assign the cost of Step 1 to the matrix-vector multiply.
Both routines perform essentially the same operations. HJS appends the broadcast
of A(J+1, J) to the computation of xnorm (though HJS actually computes xnorm²), which
HJS performs as a sum-to-all. On the other hand, they involve all processors rather than
just one column of processors, hence the sum costs ⌈log2(p)⌉ rather than ⌈log2(pr)⌉.
The difference in performance would appear more dramatic if I included the cost
of the BLAS1 operations in my PDSYEVX model. I do not because they account for an
insignificant O((n²/pr)γ₁) execution time. HJS performs fewer BLAS1 flops (because they do
not go to the extremes that PDSYTRD does to avoid overflow), and the flops that they perform
are distributed over all processors instead of over only one column of processors.
The cost of the matrix-vector multiply (Lines 3.1-3.6 in Figure 7.2)
As shown in table 4.3, the cost of the matrix-vector multiply in PDSYTRD is:

    4n⌈log2(√p)⌉α + 2n·α + 2(n²⌈log2(√p)⌉/√p)β + 1.5(n²/√p)β + 2(n/nb)⌈log2(√p)⌉α
      + (n²/(nb·√p))α₂ + (2/3)(n³/p)γ₂ + 3(n²·nb/√p)γ₂ + (n²/√p)γ₁ + n·α₄

In Figure 6[91], steps 1, Y1, 7.1, 7.2, and 7.3 are involved in the matrix-vector multiply,
and the cost of these steps is:

    2n⌈log2(√p)⌉α + 2n·α + ½(n²/√p)⌈log2(√p)⌉β + (3/2)(n²/√p)β + 2n·α₂ + (2/3)(n³/p)γ₂

The model for HJS is much simpler because 1) the local portion of the matrix-vector multiply
requires just a single call to DGEMV and 2) the load imbalance in HJS is negligible (O(n²/p)
versus O(n²/√p) in PDSYEVX).
The communication performed in HJS during the matrix-vector multiply includes:

                                  Figure 6[91]     Execution time model
    Broadcast v within a row      Step 1           n⌈log2(√p)⌉α + ½(n²⌈log2(√p)⌉/√p)β
    Transpose v and y             Steps Y1, 7.3    2n·α + (n²/√p)β
    Recursive halve w             Step 7.3         n⌈log2(√p)⌉α + ½(n²/√p)β

The transpose operations take advantage of the fact that pr = pc. Each processor
(a, b) simply sends its local portion of the vector to processor (b, a) while receiving the
transpose from that same processor.
The recursive halving operation is a distributed sum in which each of the pc pro-
cessors in the row starts with k values and ends up with k/pc of the sums.
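The recursive halving pattern can be simulated sequentially (a sketch of my own; real implementations exchange messages, here each "processor" is just a list):

```python
def recursive_halving_sum(chunks):
    # Simulated recursive halving among p processors (p a power of two).
    # Each "processor" starts with k values; after log2(p) exchange rounds,
    # processor r holds the r-th k/p-element slice of the element-wise sum,
    # having sent only about k words in total rather than k*log2(p).
    p = len(chunks)
    data = [list(c) for c in chunks]
    group = p
    while group > 1:
        half = group // 2
        for base in range(0, p, group):
            for i in range(half):
                lo, hi = base + i, base + half + i
                m = len(data[lo]) // 2
                low = [a + b for a, b in zip(data[lo][:m], data[hi][:m])]
                high = [a + b for a, b in zip(data[lo][m:], data[hi][m:])]
                # partners exchange halves: lo keeps the low half, hi the high
                data[lo], data[hi] = low, high
        group = half
    return data

pieces = recursive_halving_sum([[10 * r + j for j in range(8)] for r in range(4)])
```

Each round halves the message size, so the total volume per processor is about k(1 - 1/p) words.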
Updating the matrix-vector product (Line 4.1 in Figure 7.2)
As shown in table 4.4, the cost of updating the matrix-vector product in PDSYTRD is:

    6n⌈log2(√p)⌉α + (n²⌈log2(√p)⌉/pr)β + 3(n/nb)⌈log2(√p)⌉α + 4n·α₂ + 2(n²·nb/√p)γ₂ + 4n·α₄

In Figure 6[91], step 7.4 updates the matrix-vector product, and the cost of this step is:

    2n·α₂ + (n²·nb/p)γ₂
Computing the companion update vector, w (Line 5.1 in Figure 7.2)
As shown in table 4.5, the cost of computing the companion update vector in PDSYTRD is:

    2n⌈log2(√p)⌉α + n·α₄

In Figure 6[91], steps 8 and 9 compute the companion update vector, and the cost of these
steps is:

    5n⌈log2(√p)⌉α + (n²/√p)β

Just as in the computation of the reflector, the O(n²) cost of the BLAS1 operations
is insignificant. HJS performs these more efficiently than PDSYEVX because it uses all the
processors in these computations.
Performing the rank 2k update (Line 6.3 in Figure 7.2)
As shown in table 4.6, the cost of the rank 2k update in PDSYTRD is:

    (4n/nb)⌈log2(√p)⌉α + 2(n²/√p)⌈log2(√p)⌉β + (n²/√p)β + (2n/nb)⌈log2(√p)⌉β
      + 4(n²/(nb²·√p))·pbf·α₃ + (2/3)(n³/p)γ₃ + 3(n²·nb/√p)γ₃

In Figure 6[91], step 10.4 performs the rank 2k update, and the cost of this step is:

    2(n²/(nb²·√p))α₃ + (2/3)(n³/p)γ₃

HJS does not require any communication here because W and V are already
replicated across the processor rows, while WT and VT are already replicated across all
the processor columns.
Both HJS and PDSYEVX must perform the rank 2k update as a series of panel
updates using DGEMM. Both PDSYTRD and HJS use a panel width of twice the algorithmic
blocking factor.
Figure 7.2 summarizes the main sources of inefficiency in HJS reduction to tridi-
agonal form.
Table 7.1 compares the execution time of PDSYEVX and HJS reduction to tridiagonal
form. Each row represents a particular operation; the first column is that operation's scale
factor. The second column is the time (in seconds) associated with the given operation in
PDSYEVX. The third column shows the count of the given operation performed in PDSYEVX.
The product of the third column with the first column, after substituting the costs given
in section 5.2 and n = 4000 and p = 64, is the second column. For example, the cost of
matrix-matrix multiply flops in PDSYTRD on the PARAGON is:
2/3 · (n = 4000)³/(p = 64) · (γ₃ = 0.0215·10⁻⁶) = 14.3 seconds. Likewise, the second-to-last
column (the count of the given operation performed in reduction to tridiagonal form in HJS)
times the first column equals the last column (the time associated with the given operation
in reduction to tridiagonal form in HJS.)
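That worked example can be reproduced directly (γ₃ = 0.0215 microseconds per flop is the PARAGON DGEMM cost from section 5.2):

```python
# One entry of Table 7.1: the matrix-matrix multiply time in PDSYTRD.
n, p = 4000, 64
gamma3 = 0.0215e-6               # seconds per flop for DGEMM on the PARAGON
t_dgemm = (2.0 / 3.0) * n**3 / p * gamma3   # about 14.3 seconds
```

The other rows of the table are built the same way: count times scale factor, with the machine parameters substituted in.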
Columns 4 through 10 represent unimplemented intermediate variations on reduction to
tridiagonal form. Column 4, labeled "minus PBLAS inefficiencies", assumes that a
couple of inefficiencies of the PBLAS are removed (a bug in the PBLAS causing unnecessary
communication, and the PBLAS overhead). Column 5, labeled "be less paranoid", assumes
that in addition PDSYTRD computes reflectors in the slightly faster, slightly riskier manner
Figure 7.2: Execution time model for HJS reduction to tridiagonal form. Line numbers
match Figure 4.5 (PDSYEVX execution time).

    do ii = 1, n, nb
      mxi = min(ii + nb, n)
      do i = ii, mxi
        Update current (ith) column of A
          1.1 transpose w                  latency: n·lg(√p)α    bandwidth: ½(n²/√p)β
          1.2 A = A - W·VT - V·WT
        Compute reflector
          2.1 v = house(A)                 latency: 2n·lg(√p)α
        Perform matrix-vector multiply
          3.1 spread v across              latency: n·lg(√p)α    bandwidth: ½(n²·lg(√p)/√p)β
          3.2 transpose v                  bandwidth: ½(n²/√p)β
          3.3 w = tril(A)·v;               computation: (2/3)(n³/p)γ₂
              wT = tril(A,-1)·vT
          3.5 recursive halve w            bandwidth: ½(n²/√p)β
          3.6 w = w + transpose(wT)        bandwidth: ½(n²/√p)β
        Update the matrix-vector product
          4.1 w = w - W·VT·v - V·WT·v      latency: 3n·lg(√p)α   bandwidth: (n²/√p)β
        Compute companion update vector
          5.1 c = wT·v;                    latency: 2n·lg(√p)α
              w = τ·w - (c·τ/2)·v
      end do i = ii, mxi
      Perform rank 2k update
        6.3 A = A - W·VT - V·WT            overhead: 2(n²/(nb²·√p))α₃   computation: (2/3)(n³/p)γ₃
    end do ii = 1, n, nb
that HJS does. Column 6 assumes direct transpose operations. Column 7 assumes that
certain messages are combined, reducing the message latency cost. Column 8 assumes that
sum-to-all is used instead of sum-to-one followed by a broadcast, reducing the latency cost.
Column 9 assumes that V, W, VT and WT are stored replicated across processor columns;
this eliminates all communication in the rank 2k update. Storing the data replicated also
allows all processors to be involved in all computations, but this is not assumed until
column 11. Column 10 assumes a cyclic data layout, eliminating some load imbalance.
Column 11 assumes that all processors are involved in all computations, eliminating the
load imbalance which was not eliminated by using a cyclic data layout.
7.1.4 PDSYEV
PDSYEV uses the QR algorithm to solve the tridiagonal eigenproblem. Each eigenvector is
spread evenly among all the processors. Each processor redundantly computes the rotations
and updates the portion of each eigenvector which it owns. Computing the rotations requires
O(n²) flops, whereas updating the eigenvectors requires O(n³) flops. Hence PDSYEV scales
reasonably well as long as all the eigenvectors are required.
Each rotation requires 2 divides, 1 square root and approximately 20 flops to compute,
and 6 flops to apply.
The cost of the QR-based tridiagonal eigensolution in PDSYEV is:

    Σ_{j=1}^{n} sweeps(j) · (n - j) · (2δ + δ_sqrt + 20γ₁ + (6n/p)γ₁)

(where δ is the cost of a divide and δ_sqrt the cost of a square root). On average, it takes
two sweeps per eigenvalue, so we set sweeps(j) = 2 and simplify:

    2n²δ + n²δ_sqrt + 20n²γ₁ + (6n³/p)γ₁
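The per-rotation costs can be illustrated with a scalar Givens-style sketch (the function names are mine; math.hypot hides the square root, and production QR implementations guard against overflow more carefully):

```python
import math

def compute_rotation(a, b):
    # Form c, s with c*a + s*b = r and -s*a + c*b = 0: one square root
    # (inside hypot) and two divides, plus a handful of other flops,
    # roughly the "20 flops" in the model.
    r = math.hypot(a, b)
    return a / r, b / r

def apply_rotation(c, s, x, y):
    # Applying the rotation to one pair of eigenvector entries costs
    # 4 multiplies and 2 adds: the "6 flops to apply".
    return c * x + s * y, -s * x + c * y

c, s = compute_rotation(3.0, 4.0)
r, z = apply_rotation(c, s, 3.0, 4.0)   # rotates (3, 4) onto (5, 0)
```

Since applying rotations is pure flops with no communication, distributing the eigenvector rows across p processors divides only the apply cost, which is why the (6n³/p)γ₁ term carries the 1/p factor.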
7.2 Other techniques
7.2.1 One dimensional data layouts
One dimensional data layouts can improve the performance of dense linear algebra codes
on modest numbers of processors, especially on one-sided reductions like LU and QR
decomposition. In general, one dimensional data layouts require fewer communication calls
in the inner loop but more words transmitted per process. One-sided reductions typically
require fewer messages within rows than within columns, sometimes by a factor as high as
nb; other times the advantage is a more modest log(√p). One-sided reductions often require
fewer words to be transmitted between columns than between rows of processors, usually
by a factor of nb.
One dimensional data layouts also offer less overhead. Often an entire block column
can be computed by a call to the corresponding LAPACK code rather than the ScaLAPACK
code, saving significant overhead costs.
Table 7.1: Comparison between the cost of HJS reduction to tridiagonal form and PDSYTRD
on n = 4000, p = 64, nb = 32. Values differing from the previous column are shaded (in
the original). Column key: (1) scale factor; (2) PDSYTRD estimated time; (3) PDSYTRD
counts; (4) minus PBLAS inefficiency; (5) be less paranoid; (6) direct transpose; (7) merge
operations; (8) use sum-to-all; (9) store V, W, VT, WT replicated; (10) no data blocking;
(11) all processors compute (i.e. HJS); (12) HJS estimated time.

    (1) scale factor        (2)   (3)  (4)  (5)  (6)  (7)  (8)  (9)  (10) (11)  (12)
    n³/p·γ₃                 14.3  2/3  2/3  2/3  2/3  2/3  2/3  2/3  2/3  2/3   14.3
    n³/p·γ₂                 16.4  2/3  2/3  2/3  2/3  2/3  2/3  2/3  2/3  2/3   16.4
    n²/(√p·nb²)·pbf·α₃      2.1   2    2    2    2    1/2  1/2  1/2  1/2  1/2   0.5
    n·α₂                    0.6   6    6    6    6    5    5    5    4    4     0.4
    n²/(√p·nb)·pbf·α₂       4.7   2    2    2    2    1    1    1    0    0     0.0
    n²/√p·γ₁                4.0   1/2  0    0    0    0    0    0    0    0     0.0
    n²/√p·γ₁                1.7   1/2  0    0    0    0    0    0    0    0     0.0
    n·α₄                    8.9   9.5  0    0    0    0    0    0    0    0     0.0
    n²·nb·pbf/√p·γ₂         0.9   1    1    1    1    1    1    1    0    0     0.0
    n²·nb/√p·γ₂             1.7   4    4    4    4    4    4    4    3    0     0.0
    n²·nb·pbf/√p·γ₃         1.0   1    1    1    1    1    1    1    1    1     1.0
    n²·nb/√p·γ₃             0.5   1    1    1    1    1    1    1    0    0     0.0
    n⌈log2(√p)⌉α            13.5  17   15   14   12   9    6    6    6    9     7.1
    n·α                     0.5   2    2    2    4    4    4    3    3    3     0.8
    (n/nb)⌈log2(√p)⌉α       0.3   4    4    4    2    1    1    0    0    0     0.0
    n²/√p·⌈log2(√p)⌉β       4.4   5    4    4    2    2    2    1.5  1.5  0.5   0.8
    n²/√p·β                 0.6   2    2    2    2    2    2    1.5  1.5  2.5   0.7
    n·nb/√p·β               0.02  8    7    7    5    5    5    2.5  0    0     0

    Total est. time         76         59   58   55   51   49   48   41    42
    Actual time             93                                               61
Both LU decomposition and back transformation would benefit considerably from
one-dimensional data layouts when p is small, although the advantage would be most pro-
nounced in LU. One-sided reductions require O(n) reductions across processor rows but
only O(n/nb) reductions across processor columns. On a high latency system, such as a net-
work of workstations, the performance improvement from using a one-dimensional data
layout could be substantial, since LU requires O(nb) fewer messages on a one-dimensional
data layout.
ScaLAPACK does not take full advantage of one-dimensional data layouts because
it calls the ScaLAPACK code even when the LAPACK code would do the job faster.
Two-sided reductions, such as reduction to tridiagonal form, do not benefit from
one dimensional data layouts. Two-sided reductions require O(n) reductions across pro-
cessor rows and O(n) reductions across processor columns; hence eliminating the reductions
across processor rows (by using a 1D data decomposition) will not substantially reduce the
number of messages in two-sided reductions.
7.2.2 Unblocked reduction to tridiagonal form
Unblocked reduction to tridiagonal form can outperform blocked reduction for small and
modest sized problems, especially if a good compiler is available for the inner kernel. Un-
blocked reduction to tridiagonal form must perform all of its flops as BLAS2 flops, whereas
blocked reduction to tridiagonal form performs half of its flops as BLAS3 flops. However,
unblocked reduction to tridiagonal form requires much less overhead. Blocked reduction
to tridiagonal form requires at least 6n calls to DGEMV; unblocked reduction to tridiagonal
form requires only n calls to DSYMV and n calls to DGER.
If a compiler is available that will efficiently compile the following kernel, unblocked
reduction to tridiagonal form could require only n BLAS2 calls and still attain near peak
performance on large problem sizes, especially for Hermitian eigenproblems3. The kernel
shown below requires only that each element of A be read once and written once, while per-
forming 8 flops. This ratio, 1 memory read and 1 memory write per 8 flops, is one that many
modern computers can handle at near peak speed, even from main memory, in part because
the accesses are essentially all stride 1.
3Complex arithmetic requires only half as much memory traffic per flop
for i = 1, n {
  for j = 1, i {
    A(i,j) = A(i,j) - v(i) * wt(j) - w(i) * vt(j);
    nwt(i) = nwt(i) + A(i,j) * nv(j);
    nw(j)  = nw(j)  + A(i,j) * nvt(i);
  }
}
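As a sanity check, the fused loop above can be compared against an equivalent formulation built from whole-array operations (a rank-2 update of the lower triangle followed by two matrix-vector products). The following Python/NumPy sketch is an illustration only and is not part of the original text:

```python
import numpy as np

def fused_kernel(A, v, wt, w, vt, nv, nvt):
    # Lower-triangular fused update: rank-2 update of A plus the
    # accumulation of two matrix-vector products in a single pass.
    n = A.shape[0]
    nwt, nw = np.zeros(n), np.zeros(n)
    for i in range(n):
        for j in range(i + 1):
            A[i, j] -= v[i] * wt[j] + w[i] * vt[j]
            nwt[i] += A[i, j] * nv[j]
            nw[j] += A[i, j] * nvt[i]
    return nwt, nw

# Reference computation with whole-array operations on the lower triangle.
rng = np.random.default_rng(0)
n = 6
A = np.tril(rng.standard_normal((n, n)))
v, wt, w, vt, nv, nvt = (rng.standard_normal(n) for _ in range(6))
B = A - np.tril(np.outer(v, wt) + np.outer(w, vt))
ref_nwt = B @ nv        # row-wise products with the updated triangle
ref_nw = B.T @ nvt      # column-wise products with the updated triangle
nwt, nw = fused_kernel(A.copy(), v, wt, w, vt, nv, nvt)
```

The point of the fused form is that A is streamed through memory exactly once for all three operations.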
7.2.3 Reduction to banded form
Reducing a dense matrix to banded form can be more efficient than reduction to tridiagonal
form[24, 25, 116]; however, it is not clear that this can be made fast enough to overcome
the added costs to the rest of the code. Reduction to banded form requires less execution
time than reduction to tridiagonal form because it requires fewer messages (O(n/nb) instead
of O(n)) and because asymptotically all of the flops can be performed as BLAS3 flops rather
than half as BLAS2 flops.
An efficient eigensolver based on reduction to banded form could be designed as follows:
1. Reduce to banded form
2. Reduce from banded form to tridiagonal form (do not save rotations)
3. Compute eigenvalues using bisection on tridiagonal form
4. Perform inverse iteration on banded form
5. Back transform the eigenvectors
This would be even simpler if only eigenvalues were required, as that eliminates the inverse
iteration and back transformation steps.
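Bisection on the tridiagonal form rests on the Sturm sequence property: the number of negative pivots in the LDL' factorization of T - xI counts the eigenvalues below x. The following Python sketch is an illustration of the idea only (the routine names are invented, not code from this thesis):

```python
import math

def sturm_count(d, e, x):
    """Number of eigenvalues of the symmetric tridiagonal matrix
    (diagonal d, off-diagonal e) that are strictly less than x."""
    count, q = 0, 1.0
    for i in range(len(d)):
        q = d[i] - x - (e[i - 1] ** 2 / q if i > 0 else 0.0)
        if q == 0.0:              # guard against an exact zero pivot
            q = -1e-300
        if q < 0.0:
            count += 1
    return count

def kth_eigenvalue(d, e, k, tol=1e-12):
    """k-th smallest eigenvalue (k = 1, 2, ...) by bisection."""
    # Gershgorin bounds enclose the whole spectrum.
    r = [abs(e[i - 1]) if i > 0 else 0.0 for i in range(len(d))]
    for i in range(len(d) - 1):
        r[i] += abs(e[i])
    lo = min(di - ri for di, ri in zip(d, r))
    hi = max(di + ri for di, ri in zip(d, r))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sturm_count(d, e, mid) >= k:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# The (-1, 2, -1) tridiagonal has known eigenvalues 2 - 2 cos(k*pi/(n+1)).
n = 8
d, e = [2.0] * n, [-1.0] * (n - 1)
lam1 = kth_eigenvalue(d, e, 1)
exact = 2 - 2 * math.cos(math.pi / (n + 1))
```

Each bisection step costs O(n) flops per eigenvalue, which is why bisection parallelizes trivially across eigenvalues.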
If only a few eigenvectors are required, one could reduce from banded form to
tridiagonal form, saving the rotations. This would allow the eigenvectors to be computed
on the tridiagonal using inverse iteration (or the new Parlett/Dhillon work). Then the
rotations could be applied as necessary and finally the eigenvectors would be transformed
back. This would result in a complex code.
If two-step band reduction to tridiagonal form were performed as above and the
eigenvectors were computed on the tridiagonal matrix, the cost of transforming them back to
the original problem would be at least 4n^3 flops, adding 60% more O(n^3) flops to a full
tridiagonal eigendecomposition. This could be done in two steps, applying first the rotations
accrued during reduction from banded to tridiagonal form and then transforming the
eigenvectors of the banded form back to the original problem. A cleaner, though more costly, solution
would be to form the back transformation matrix after (or during) reduction to banded
form, update it during reduction from banded to tridiagonal form and then use it to transform the
eigenvectors of the tridiagonal back to the original problem.
Using reduction to banded form in an eigensolver requires, at a minimum, that
two-step band reduction to tridiagonal form be faster than direct reduction to tridiagonal
form. If eigenvectors are required, it must be significantly faster in order to overcome the
additional 2n^3 flop cost of back transformation.
So far, no one has demonstrated that two-step reduction to tridiagonal form can
be performed faster than direct reduction on distributed memory computers. Alpatov,
Bischof and van de Geijn's two-step reduction to tridiagonal form[173] is not faster than
PDSYTRD. They assert that it can be optimized, but that is also true of PDSYTRD. So it
is not yet clear whether two-step reduction to tridiagonal form will be significantly faster
than direct reduction to tridiagonal form on any important subset of distributed memory
parallel computers.
I believe that software overhead plays a significant role in limiting the performance
of two-step reduction to banded form.
7.2.4 One-sided reduction to tridiagonal form
Hegland et al.[90] show that one can reduce the Cholesky factor (of a shifted input
matrix) to bidiagonal form updating from only one side. The result, in their implementation,
is a code which requires (10/3)(n^3/p) + n^2·√p flops per processor, n^2·√p words communicated
per processor and n·√p messages per processor.
They argue that this technique, despite requiring 2.5 times as many flops, yields
better performance on their target machine than conventional methods for reduction to
tridiagonal form. They use a 1D processor grid, an unblocked algorithm and a non-scalable
communication and computation pattern, and they ignore symmetry. By ignoring much of the
conventional wisdom they have achieved a simple, high performance code for their target
(vector) machine.
7.2.5 Strassen's matrix multiply
The number of flops in Strassen's matrix-matrix multiply is:

2mnk · (min(m,n,k) / s_{1/2})^(log2(7) − 3)

where s_{1/2} is the break-even point for a particular Strassen implementation, i.e.
the point at which one additional Strassen divide-and-conquer step neither increases nor
decreases execution time. Three factors combine to prevent the use of Strassen's in
reduction to tridiagonal form and back transformation:

1. s_{1/2} is still too large. Lederman et al.[96] have reduced s_{1/2} to the range 100 to 500.

2. k is modest (where k is the block size). We can increase the block size, but only at the
cost of additional load imbalance.

3. n^(log2(7)−3) = n^(−0.193) shrinks slowly. Increasing n by enough to improve the ratio of
"Strassen flops" to standard matrix multiply flops by 50% requires a thousand-fold increase
in the amount of memory required. (32^(−0.193) ≈ 0.5, hence n must increase by a factor of 32
to improve the ratio of Strassen flops to standard matrix multiply flops.) Improving the
ratio of "Strassen flops" to standard matrix multiply flops by increasing the number of
processors involved is even more difficult. Although Chou et al.[43] have shown that 7^k
processors can be used to do the work of 8^k, it takes 7^5 = 16807 processors to get a factor
of two advantage this way. (7^5/8^5 ≈ 0.51)
It is this last point that prevents Strassen's from rescuing ISDA (which is described below).
Because 32^(−0.193) ≈ 0.5, the problem size must be 32·s_{1/2} in order to halve the number of flops
required in ISDA. Halving the number of flops again would require that n be increased by
another factor of 30, increasing memory by another factor of 900 and the total number
of flops, even after the factor of two savings, by (1/2)·30^3 = 13,500. I have not yet seen a
Strassen's matrix-matrix multiply that achieves twice the performance of a regular matrix-
matrix multiply.
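The arithmetic behind these factors is easy to check. The Python sketch below (an illustration, not part of the original text) evaluates the ratio of Strassen flops to standard matrix-multiply flops:

```python
import math

def strassen_ratio(n, s_half=1.0):
    """Ratio of Strassen flops to standard flops for an n x n multiply,
    measured relative to the break-even size s_half."""
    return (n / s_half) ** (math.log2(7) - 3)

# Growing n by 32x roughly halves the flop ratio ...
r32 = strassen_ratio(32)
# ... at a cost of 32^2 = 1024x the memory (the "thousand-fold" above).
mem_growth = 32 ** 2
# Using 7^5 processors in place of 8^5 also gives about a factor of two.
r_proc = (7 / 8) ** 5
```

The exponent log2(7) − 3 ≈ −0.193 is so close to zero that both routes to a factor-of-two saving are prohibitively expensive.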
Table 7.2: Fastest eigendecomposition method

                                            n > 500√p                        n < 500√p
  Random matrices                           Tridiagonal (> 4 times faster)   Tridiagonal
  Spectrally diagonally dominant matrices   Tridiagonal                      Jacobi
7.3 Jacobi
7.3.1 Jacobi versus Tridiagonal eigensolvers
This section is based on models that have only been informally validated. I have compared
my models to those used by Arbenz and Slapnicar[9] and Littlefield and Maschhoff[125] as
well as against the execution times reported in these papers but have not performed any
independent validation. Hence, the opinions that I express in this section should be taken
as conjectures.
Large matrices4 can be solved faster by a tridiagonal based eigensolver than by a
Jacobi eigensolver, but it is likely that Jacobi will outperform tridiagonal based eigensolvers
on small spectrally diagonally dominant matrices5. Since tridiagonal based methods require,
asymptotically, no more than a quarter as many flops as blocked Jacobi methods, even on
spectrally diagonally dominant matrices, I expect that tridiagonal based methods will
win on large matrices, even spectrally diagonally dominant ones, because tridiagonal based
methods can achieve 25% of peak performance on large matrices, as shown in Chapter 5. I
also expect that tridiagonal based eigensolvers will beat Jacobi eigensolvers on random ma-
trices regardless of their size because on random matrices tridiagonal eigensolvers perform
roughly 16 times fewer flops6, and I don't think that Jacobi methods will be 16 times faster
per flop regardless of the input size. Table 7.2 summarizes which eigensolution method I
expect to be faster as a function of these input matrix characteristics.
4On current machines, (n > 500√p) is sufficiently large to allow a tridiagonal eigensolver to outperform Jacobi.
5Spectrally diagonally dominant means that the eigenvector matrix, or a permutation thereof, is diagonally
dominant. Most, but not all, diagonally dominant matrices are spectrally diagonally dominant. For example,
if you take a dense matrix with elements randomly chosen from [−1, 1] and scale the diagonal elements by 1e3,
the resulting diagonally dominant matrix will generally be spectrally diagonally dominant. However, if you
take that same matrix and add 1e3 to each diagonal element, the eigenvector matrix is unchanged even though
the matrix is clearly diagonally dominant.
6Assuming Jacobi converges in an optimistic 8 sweeps.
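The distinction drawn in footnote 5 can be verified numerically: adding a multiple of the identity shifts every eigenvalue but leaves the eigenvector matrix unchanged. The following Python/NumPy sketch (an illustration, not part of the original text) checks this:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
A = rng.uniform(-1, 1, (n, n))
A = (A + A.T) / 2                        # symmetrize

# A + c*I is strongly diagonally dominant, yet it has exactly the same
# eigenvectors as A, so it is no "easier" for a threshold Jacobi method.
c = 1e3
w1, V1 = np.linalg.eigh(A)
w2, V2 = np.linalg.eigh(A + c * np.eye(n))
```

Scaling the diagonal by 1e3, by contrast, genuinely changes the eigenvector matrix, which is why that construction does tend to produce a spectrally diagonally dominant matrix.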
7.3.2 Overview of Jacobi Methods
Despite Jacobi's simplicity there are several possible variants, especially for a par-
allel code, each of which has advantages. In section 7.3.16 I describe the code that I would
write if I were going to write a parallel code. I recommend a 2D data layout if one wishes
to be able to run efficiently on large numbers of processors (say 48 or more). However, a
1D data layout is considerably simpler to implement, and a simpler implementation translates
into less software overhead. On some computers, Jacobi with a 1D data layout might be
efficient for hundreds of processors. I recommend using a one-sided, blocked, non-threshold
Jacobi[9] with a caterpillar track pairing[150] and distinct communication and computation
phases, but other methods cannot be entirely rejected. For a spectrally diagonally dominant
matrix the fastest serial Jacobi algorithm is a threshold Jacobi, hence threshold methods cannot be
ignored. A threshold method would almost certainly have to be two-sided, use a different
pairing strategy and either a non-blocked code or some unconventional blocking strategy.
Non-blocked codes may make sense for small matrices and large numbers of processors
as well as for machines, such as vector architectures, which offer comparable BLAS1 and
BLAS3 performance. Overlapping communication and computation will save time, but my
experience indicates that the savings is limited.
My recommendation is weighted toward small matrices that are modestly spec-
trally diagonally dominant, but not so dominant that certain matrix entries can be com-
pletely ignored. If the input matrix is sparse and so strongly spectrally diagonally dominant
that the matrix never fills in, one would have to consider threshold methods and methods
that don't update parts of the matrix that remain zero. On the other hand, if the matrix
is quite large, performance could be further improved by using a different data layout from
the one that I recommend.
There are many implementation options available to anyone writing a Jacobi code.
I will discuss many of these implementation options in the following sections. Section 7.3.3
explains the basic variants and data layout options. Section 7.3.4 explains the computa-
tion requirements of each of the basic variants. Section 7.3.5 explains the communication
requirements of each of the basic variants. Section 7.3.6 discusses blocking (both commu-
nication and computation). Section 7.3.7 discusses the importance of exploiting symmetry.
Section 7.3.8 explains that one-sided methods need not recompute diagonal blocks of A'A.
Section 7.3.9 discusses options for the partial eigendecomposition required by a blocked
Jacobi method. Section 7.3.10 discusses threshold strategies. Section 7.3.12 discusses pre-
conditioners. Section 7.3.13 discusses overlapping communication and computation.
7.3.3 Jacobi Methods
The matlab code for the classical, two-sided, Jacobi method shown in figure 7.3 differs from
textbook descriptions only in that the rotation is computed by calling parteig and the
off-diagonals are compared to the diagonals (in the threshold test) in an unusual manner.
Figure 7.6 gives inefficient matlab code for parteig which calls matlab's eig() routine and
sorts the eigenvalues to guarantee convergence. In a real implementation, parteig would
be one or two sweeps of two-sided Jacobi.
A two-sided blocked Jacobi matlab code is given in figure 7.4. Because the code in
figure 7.3 uses parteig to compute the rotations and the norm in the threshold test, the only
difference between the blocked and unblocked versions is the definition of I and J. parteig
is typically not a full eigendecomposition; more often it is a single sweep of Jacobi.
The one-sided Jacobi variants can operate on any matrix whose left singular vectors
are the same as, or related to, the eigenvectors of the input matrix. This allows many choices
for preconditioning the input matrix, several of which are discussed in section 7.3.12.
The one-sided Jacobi methods lose symmetry, but still require fewer flops than the
two-sided Jacobi methods because they do not have to update the eigenvectors separately7.
Furthermore, the one-sided Jacobi methods always access the matrix in one direction (by
column for Fortran). A typical one-sided Jacobi method is shown in figure 7.5.
Parallel Jacobi methods require two forms of communication. The columns and/or
rows of the matrix must be exchanged in order to compute the rotations, and the rotations
must be broadcast. The basic communication for one-sided Jacobi is shown in figure 7.7
while the communication pattern for two-sided Jacobi is given in figure 7.8.
7.3.4 Computation costs
The computation and communication costs for the Jacobi method which I recommend for
non-vector distributed memory computers with many nodes, a one-sided blocked Jacobi on
a 2-dimensional (pr × pc) processor grid, are shown in table 7.3. Definitions for all symbols
used here can be found in Appendix A.
7They also avoid applying rotations from both sides, but this advantage is negated by the fact that they
must perform dot products to form the square submatrices to be diagonalized.
Figure 7.3: Matlab code for two-sided cyclic Jacobi
function [Q,D] = jac2(A)
%
% Classical two-sided threshold Jacobi
%
thresh = 1e-15;
maxiter = 25;
n = size(A,2)
iter = 0
mods = 1
Q = eye(n);
while (iter < maxiter & mods > 0 )
mods = 0;
for I = 1:n
for J = 1:I-1
blkA = A([J,I],[J,I]) ;
if ( norm(blkA-diag(diag(blkA))) > ( norm(blkA)*thresh))
mods = mods + 1;
[R,D] = parteig(A([J,I],[J,I]));
A([J,I],:) = R' * A([J,I],:) ;
A(:,[J,I]) = A(:,[J,I]) * R ;
Q(:,[J,I]) = Q(:,[J,I]) * R ;
end % if
end % for J
end % for I
iter = iter + 1
end % while
D = diag(diag(A)) ;
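As a companion to figure 7.3, the following Python/NumPy sketch (an illustration, not part of the original text) implements the classical two-sided cyclic method, using the closed-form 2-by-2 symmetric Schur rotation in place of parteig:

```python
import numpy as np

def jacobi2(A, tol=1e-12, maxiter=25):
    """Two-sided cyclic Jacobi; returns eigenvector matrix Q and eigenvalues."""
    A = A.copy()
    n = A.shape[0]
    Q = np.eye(n)
    for _ in range(maxiter):
        mods = 0
        for i in range(1, n):
            for j in range(i):
                if abs(A[i, j]) > tol * np.hypot(A[i, i], A[j, j]):
                    mods += 1
                    # closed-form 2x2 rotation annihilating A[i, j]
                    tau = (A[i, i] - A[j, j]) / (2 * A[i, j])
                    t = np.sign(tau) / (abs(tau) + np.hypot(1, tau)) if tau != 0 else 1.0
                    c = 1 / np.hypot(1, t)
                    s = t * c
                    R = np.array([[c, s], [-s, c]])
                    idx = [j, i]
                    A[idx, :] = R.T @ A[idx, :]   # rotate from the left
                    A[:, idx] = A[:, idx] @ R     # rotate from the right
                    Q[:, idx] = Q[:, idx] @ R     # accumulate eigenvectors
        if mods == 0:
            break
    return Q, np.diag(A)

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 8))
A = (A + A.T) / 2
Q, w = jacobi2(A)
```

The invariant Q A_k Q' = A_0 holds at every step, so on convergence the diagonal of A_k holds the eigenvalues and Q the eigenvectors.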
Figure 7.4: Matlab code for two-sided blocked Jacobi
function [Q,D] = bjac2( A )
%
% Two sided blocked threshold Jacobi
%
maxiter = 25 ;
thresh = 1e-15;
nb = 1;
n = size(A,2)
iter = 0;
mods = 1;
Q = eye(n);
while (iter < maxiter & mods > 0 )
A = ( A + A' ) / 2; % restore symmetry
mods = 0;
for i = 1:nb:n
maxi = min(i+nb-1,n);
I = i:maxi;
for j = 1:nb:i-1
maxj = min(j+nb-1,n);
J = j:maxj;
blkA = A([J,I],[J,I]) ;
if ( norm(blkA-diag(diag(blkA))) > ( norm(blkA)*sqrt(nb)*thresh))
mods = mods + 1 ;
[R,D] = parteig(A([J,I],[J,I])) ;
A([J,I],:) = R' * A([J,I],:) ;
A(:,[J,I]) = A(:,[J,I]) * R ;
Q(:,[J,I]) = Q(:,[J,I]) * R ;
end % if
end % for j
end % for i
iter = iter + 1
end % while
D = diag(diag(A)) ;
Figure 7.5: Matlab code for one-sided blocked Jacobi
function [ Q, D ] = bjac1( A )
%
% One sided blocked Jacobi
%
thresh = 1e-15 ;
nb = 2 ;
maxiter = 25;
n = size(A,2)
B = A;
iter = 0 ;
mods = 1 ;
while (iter < maxiter & mods > 0)
mods = 0 ;
for i = 1:nb:n
maxi = min(i+nb-1,n);
I = i:maxi;
for j = 1:nb:i-1
maxj = min(j+nb-1,n);
J = j:maxj;
blkA = A(:,[J,I])' * A(:,[J,I]) ;
if (norm(blkA-diag(diag(blkA))) > norm(blkA)*sqrt(nb)*thresh)
mods = mods + 1 ;
[R,D] = parteig(blkA) ;
A(:,[J,I]) = A(:,[J,I]) * R ;
end % if
end % for j
end % for i
iter = iter + 1
end % while
D = A' * A;
Q = A * diag(1./sqrt(diag(D))) ;
D = Q' * B * Q ;
D = diag(diag(D)) ;
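A corresponding Python/NumPy sketch of an unblocked one-sided method follows (again an illustration, not part of the original text; it is restricted here to positive definite inputs so that the left singular vectors coincide with the eigenvectors without sign fixes):

```python
import numpy as np

def jacobi1(A, tol=1e-12, maxiter=25):
    """One-sided Jacobi: orthogonalize the columns of A by plane rotations.
    For symmetric positive definite A this yields its eigensystem."""
    B = A.copy()
    n = B.shape[0]
    for _ in range(maxiter):
        mods = 0
        for i in range(1, n):
            for j in range(i):
                # 2x2 block of B'B, computed from the two columns only
                app = B[:, j] @ B[:, j]
                aqq = B[:, i] @ B[:, i]
                apq = B[:, j] @ B[:, i]
                if abs(apq) > tol * np.sqrt(app * aqq):
                    mods += 1
                    tau = (aqq - app) / (2 * apq)
                    t = np.sign(tau) / (abs(tau) + np.hypot(1, tau)) if tau != 0 else 1.0
                    c = 1 / np.hypot(1, t)
                    s = t * c
                    bj, bi = B[:, j].copy(), B[:, i].copy()
                    B[:, j] = c * bj - s * bi     # apply the rotation from
                    B[:, i] = s * bj + c * bi     # the right only
        if mods == 0:
            break
    norms = np.sqrt(np.sum(B * B, axis=0))
    Q = B / norms                      # normalized columns = eigenvectors
    w = np.diag(Q.T @ A @ Q)           # Rayleigh quotients = eigenvalues
    return Q, w

rng = np.random.default_rng(3)
M = rng.standard_normal((8, 8))
A = M.T @ M + np.eye(8)                # symmetric positive definite test matrix
Q, w = jacobi1(A)
```

Note that each rotation touches only two columns, which is what makes the one-sided form access the matrix in a single direction.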
Figure 7.6: Matlab code for an ine�cient partial eigendecomposition routine
%
% parteig - eigendecomposition with eigenvalues sorted
%
function [ Q, D ] = parteig( A )
[QQ,DD ] = eig(A) ;
[tmp,Index] = sort(- diag(DD));
D = DD(Index,Index) ;
Q = QQ(:,Index) ;
Table 7.3: Performance model for my recommended Jacobi method

Columns: (1) cost per parallel pairing; (2) cost per sweep, i.e. (n/nb)^2/(2pc) parallel
pairings; (3) cost for the recommended data layout (nb = n/(2pc), pc = 16pr = 4√p).

Move column for this pairing(a):
  (1) 2(2·nb·pc/n)(α + (n·nb/pr)β)
  (2) (n/nb)α + (n^2/pr)β
  (3) 8√p·α + 4(n^2/√p)β

diag = A([I,J],:)' * A([I,J],:)(b):
  (1) α_3 + (2n·nb^2/pr)γ_3
  (2) ((n/nb)^2/(2pc))α_3 + (n^3/p)γ_3
  (3) 8√p·α_3 + (n^3/p)γ_3

Sum diag within each processor column:
  (1) lg(pr)(2·nb·pc/n)α + lg(pr)·nb^2·β
  (2) (n/nb)lg(pr)α + (n^2/(2pc))lg(pr)β
  (3) 4√p(lg(p)−4)α + (n^2/(16√p))(lg(p)−4)β

[Q,D] = parteig(diag)(c):
  (1) 2nb^2(2γ_div + γ_sqrt) + 6(2nb)^3·γ_1
  (2) 2(n^2/pc)γ_div + (n^2/pc)γ_sqrt + 24(n^2·nb/pc)γ_1
  (3) (1/2)(n^2/√p)γ_div + (1/4)(n^2/√p)γ_sqrt + (3/4)(n^3/p)γ_1 (see note d)

Broadcast Q within each processor column:
  (1) lg(pr)(2·nb·pc/n)α + lg(pr)·nb^2·β
  (2) (n/nb)lg(pr)α + (n^2/(2pc))lg(pr)β
  (3) 4√p(lg(p)−4)α + (n^2/(16√p))(lg(p)−4)β

A = Q*A:
  (1) α_3 + 2(n/pr)(2nb)^2·γ_3
  (2) ((n/nb)^2/(2pc))α_3 + 4(n^3/p)γ_3
  (3) 8√p·α_3 + 4(n^3/p)γ_3

Total:
  (2) (n/nb)α + 2(n/nb)lg(pr)α + (n^2/pr)β + (n^2/pc)lg(pr)β + 2(n^2/pc)γ_div
      + (n^2/pc)γ_sqrt + 24(n^2·nb/pc)γ_1 + ((n/nb)^2/pc)α_3 + 5(n^3/p)γ_3
  (3) 8√p(lg(p)−3)α + (7/2)(n^2/√p)β + (n^2/(8√p))lg(p)β + (1/2)(n^2/√p)γ_div
      + (1/4)(n^2/√p)γ_sqrt + (3/4)(n^3/p)γ_1 + 8√p·α_3 + 5(n^3/p)γ_3

aMy models assume that sends and receives do not overlap, hence the factor of 2. The factor of
(2·nb·pc/n) represents the number of parallel pairings that can be performed on the data local to
one processor column.
bOnly A(I,:)' * A(J,:) need be computed. See section 7.3.8.
cPartial eigendecomposition of the (2nb) × (2nb) matrix performed with one pass of an unblocked two-
sided Jacobi method exploiting symmetry; see the column labeled "exploit symmetry" in table 7.6.
dWith nb = n/(2pc) and pc = 4√p: 24(n^2·nb/pc) = 24·n^3/(2pc^2) = (24/32)(n^3/p) = (3/4)(n^3/p).
Figure 7.7: Pseudo code for one-sided parallel Jacobi with a 2D data layout, with communication highlighted
Until convergence do:
Foreach pairing do:
Move column data (A) to adjacent columns of processors
Compute ATA locally (i.e. blkA = A(:,[I,J])' * A(:,[I,J]))
Combine ATA within each column of processors
Partial eigendecomposition of diagonal block (i.e. [R;D] = eig(ATA))
Broadcast R within each row of processors
Compute A R locally
End Foreach
End Until
Table 7.4 shows the estimated execution time for one sweep of my recommended
Jacobi on a matrix of size 1000 by 1000 on a 64-node PARAGON. As this model has not
been validated, these estimates must be viewed with caution. Actual performance will be
different, but the model gives some idea of how important the various aspects may be.
This model is given in matlab form in section B.2.1. Table 7.4 suggests that Jacobi is
indeed efficient (1.68/2.69 = 62%) even on such small problems. It also suggests that the
optimal data layout may be even taller and thinner than my recommended data layout:
pc = 32, pr = 2. A taller and thinner layout (specifically pc = 64, pr = 1) would double the
cost of message transmission between columns but would decrease the cost of the partial
eigensolver. The cost of the divides and square roots in the partial eigensolver would
decrease by a factor of 64/32 = 2 because all 64 processors would participate in the partial
eigensolver. And the cost of accumulating the rotations within the partial eigensolver would
decrease by 2 × 2 = 4. The first factor of 2 stems from the fact that all processors would share
in the work, while the second factor of 2 stems from the fact that the block size would be
smaller by a factor of 2 and the cost of accumulating rotations grows as O(n^2·nb).
Table 7.5 gives computation cost models for 6 one-sided Jacobi variants. These models are
not complete (they overlook many overhead and load imbalance costs), nor have they been
validated. This table is designed mainly to put the various variants in perspective and not
Table 7.4: Estimated execution time per sweep for my recommended Jacobi on the PARAGON,
n = 1000, p = 64

Task                         Performance model        Operation cost(a)    Estimated time (s)
Message latency              8√p(lg(p)−3)α            α = 65.9e−6          0.01
Message transmission
  between columns            (7/2)(n^2/√p)β           β = .146e−6          0.06
Message transmission
  within columns             (1/8)(n^2/√p)lg(p)β      β = .146e−6          0.01
Computing rotations          (1/2)(n^2/√p)γ_div       γ_div = 3.85e−6      0.24
Computing rotations          (1/4)(n^2/√p)γ_sqrt      γ_sqrt = 7.7e−6      0.24
Accumulating rotations
  in partial eigensolver     (3/8)(n^3/p)γ_1          γ_1 = .074e−6        0.43
Software overhead            8√p·α_3                  α_3 = 103e−6         0.01
A = Q*A                      5(n^3/p)γ_3              γ_3 = .0215e−6       1.68
Total (per sweep)                                                          2.68

aSee section 6.1.
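The per-term arithmetic of table 7.4 can be reproduced directly. The Python sketch below (a transcription for illustration; section B.2.1 gives the matlab version) evaluates each term for n = 1000, p = 64:

```python
import math

# Machine parameters for the PARAGON (from table 7.4), in seconds.
alpha, beta = 65.9e-6, 0.146e-6               # message latency, per-word transfer
g_div, g_sqrt = 3.85e-6, 7.7e-6               # divide, square root
g1, g3, alpha3 = 0.074e-6, 0.0215e-6, 103e-6  # BLAS1 flop, BLAS3 flop, BLAS3 call

def sweep_time(n, p):
    """Estimated seconds per sweep of the recommended one-sided blocked Jacobi."""
    sp = math.sqrt(p)
    terms = {
        "latency":      8 * sp * (math.log2(p) - 3) * alpha,
        "between cols": 3.5 * n**2 / sp * beta,
        "within cols":  0.125 * n**2 / sp * math.log2(p) * beta,
        "divides":      0.5 * n**2 / sp * g_div,
        "square roots": 0.25 * n**2 / sp * g_sqrt,
        "accumulate":   0.375 * n**3 / p * g1,
        "overhead":     8 * sp * alpha3,
        "A = Q*A":      5 * n**3 / p * g3,
    }
    return terms, sum(terms.values())

terms, total = sweep_time(1000, 64)
```

The matrix-multiply term dominates at about 1.68 s of a roughly 2.7 s sweep, which is the 62% efficiency figure quoted in the text.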
Figure 7.8: Pseudo code for two-sided parallel Jacobi with a 2D data layout, as described by Schreiber[150], with communication highlighted
Until convergence do:
Foreach pairing do:
Move row and column data (A) to diagonally adjacent processors
Compute partial eigendecomposition of diagonal block
Broadcast R within each row of processors
Broadcast R' within each column of processors
Compute R A R' locally
Compute Q R locally
End Foreach
End Until
to establish which is best. Communication costs are considered in section 7.3.5.
I have attempted to list the variants that have been implemented as well as the
most promising suggestions. For each variant I have, where appropriate, followed my rec-
ommendations for implementing a Jacobi code made in section 7.3.16.
Table 7.6 gives performance models for 5 commonly mentioned two-sided Jacobi variants.
Like the performance models for one-sided Jacobi variants, these models are incomplete and
have not been validated.
7.3.5 Communication costs
Table 7.7 summarizes the communication costs for parallel Jacobi methods. I assume that
the communication block size is chosen to be as large as possible.
A performance model for Jacobi could be created by selecting the appropriate
computation costs from table 7.5 or table 7.6 and the appropriate communication cost from
table 7.7. Not all load imbalance and overhead costs are covered in either of these tables,
and the models have not been validated.
122
Table 7.5: Performance models (flop counts) for one-sided Jacobi variants

Unblocked variants:

Littlefield/Maschhoff(b):
  A'A:                          (3/2)n^2·α_1 + 3n^3·γ_1
  parteig(A'A), one sweep(h):   (1/2)n^2(γ_div + γ_sqrt)
  A*Q:                          2n^2·α_1 + 3n^3·γ_1
  V*Q:                          2n^2·α_1 + 3n^3·γ_1
  Total (per sweep):            (7/2)n^2·α_1 + (1/2)n^2(γ_div + γ_sqrt) + 9n^3·γ_1
  With nb = n/(2pc), pc = 16pr(i):
                                (7/8)(n^2/√p)α_1 + (1/8)(n^2/√p)(γ_div + γ_sqrt) + 9(n^3/p)γ_1

Exploit symmetry(c):
  A'A:                          (3/2)n^2·α_1 + 3n^3·γ_1
  parteig(A'A):                 (1/2)n^2(γ_div + γ_sqrt)
  A*Q:                          2n^2·α_1 + 3n^3·γ_1
  V*Q:                          0
  Total (per sweep):            (7/2)n^2·α_1 + (1/2)n^2(γ_div + γ_sqrt) + 6n^3·γ_1
  With recommended layout:      (7/8)(n^2/√p)α_1 + (1/8)(n^2/√p)(γ_div + γ_sqrt) + 6(n^3/p)γ_1

Store diagonals(d):
  A'A:                          (1/2)n^2·α_1 + n^3·γ_1
  parteig(A'A):                 (1/2)n^2(γ_div + γ_sqrt)
  A*Q:                          2n^2·α_1 + 3n^3·γ_1
  V*Q:                          0
  Total (per sweep):            (5/2)n^2·α_1 + (1/2)n^2(γ_div + γ_sqrt) + 4n^3·γ_1
  With recommended layout:      (5/8)(n^2/√p)α_1 + (1/8)(n^2/√p)(γ_div + γ_sqrt) + 4(n^3/p)γ_1

Fast givens(e):
  A'A:                          (1/2)n^2·α_1 + n^3·γ_1
  parteig(A'A):                 (1/2)n^2(γ_div + γ_sqrt)
  A*Q:                          n^2·α_1 + 2n^3·γ_1
  V*Q:                          0
  Total (per sweep):            (3/2)n^2·α_1 + (1/2)n^2(γ_div + γ_sqrt) + 3n^3·γ_1
  With recommended layout:      (3/8)(n^2/√p)α_1 + (1/8)(n^2/√p)(γ_div + γ_sqrt) + 3(n^3/p)γ_1

Blocked variants(a):

Exploit symmetry(f):
  A'A:                          2pc^2·α_3 + 2n^3·γ_3
  parteig, one sweep(h):        8n^2·α_1 + n^2(γ_div + γ_sqrt) + 24n^2·nb·γ_1
  A*Q:                          2pc^2·α_3 + 4n^3·γ_3
  V*Q:                          0
  Total (per sweep):            4pc^2·α_3 + 8n^2·α_1 + n^2(γ_div + γ_sqrt) + 24n^2·nb·γ_1 + 6n^3·γ_3
  With recommended layout:      64p·α_3 + 2(n^2/√p)α_1 + (1/4)(n^2/√p)(γ_div + γ_sqrt)
                                + (3/8)(n^3/p)γ_1 + 6(n^3/p)γ_3

Store diagonals(g):
  A'A:                          2pc^2·α_3 + n^3·γ_3
  parteig, one sweep(h):        8n^2·α_1 + n^2(γ_div + γ_sqrt) + 24n^2·nb·γ_1
  A*Q:                          2pc^2·α_3 + 4n^3·γ_3
  V*Q:                          0
  Total (per sweep):            4pc^2·α_3 + 8n^2·α_1 + n^2(γ_div + γ_sqrt) + 24n^2·nb·γ_1 + 5n^3·γ_3
  With recommended layout:      64p·α_3 + 2(n^2/√p)α_1 + (1/4)(n^2/√p)(γ_div + γ_sqrt)
                                + (3/8)(n^3/p)γ_1 + 5(n^3/p)γ_3

aFor parallel codes we assume that the block size is chosen to be as large as possible, i.e. nb = n/(2pc)
where pc is the number of processor columns. For a serial code pc = n/(2·nb) can be chosen arbitrarily.
bThis is the one-sided method used by Littlefield and Maschhoff[125].
cThis is the method shown in figure 7.5.
dThis is the method used by Arbenz and Oettli[10].
eUsing fast Givens is often mentioned, but rarely implemented. Perhaps the benefit is not as good as this
model would suggest.
fThis is the method shown in figure 7.4.
gThis is the method used by Arbenz and Slapnicar[9].
hOne sweep of Jacobi on a matrix of size 2nb by 2nb.
iI also assume that only one processor in each processor column is involved in each partial eigendecom-
position.
Table 7.6: Performance models (flop counts) for two-sided Jacobi variants

Unblocked variants:

Ignore symmetry(b):
  parteig(A([I,J],[I,J])), one sweep(e):  (1/2)n^2(γ_div + γ_sqrt)
  Q'*A*Q (rotate from both sides):        4n^2·α_1 + 6n^3·γ_1
  Q*Z (update eigenvectors):              2n^2·α_1 + 3n^3·γ_1
  Total (per sweep):                      (1/2)n^2(γ_div + γ_sqrt) + 6n^2·α_1 + 9n^3·γ_1
  With nb = n/(2pc), pc = 16pr(f):        (1/8)(n^2/√p)(γ_div + γ_sqrt) + (3/2)(n^2/√p)α_1 + 9(n^3/p)γ_1

Exploit symmetry:
  parteig, one sweep(e):                  (1/2)n^2(γ_div + γ_sqrt)
  Q'*A*Q:                                 2n^2·α_1 + 3n^3·γ_1
  Q*Z:                                    2n^2·α_1 + 3n^3·γ_1
  Total (per sweep):                      (1/2)n^2(γ_div + γ_sqrt) + 4n^2·α_1 + 6n^3·γ_1
  With recommended layout:                (1/8)(n^2/√p)(γ_div + γ_sqrt) + (n^2/√p)α_1 + 6(n^3/p)γ_1

Fast givens(c):
  parteig, one sweep(e):                  (1/2)n^2(γ_div + γ_sqrt)
  Q'*A*Q:                                 n^2·α_1 + 2n^3·γ_1
  Q*Z:                                    n^2·α_1 + 2n^3·γ_1
  Total (per sweep):                      (1/2)n^2(γ_div + γ_sqrt) + 2n^2·α_1 + 4n^3·γ_1
  With recommended layout:                (1/8)(n^2/√p)(γ_div + γ_sqrt) + (1/2)(n^2/√p)α_1 + 4(n^3/p)γ_1

Blocked variants(a):

Ignore symmetry(d):
  parteig, one sweep(e):                  8n^2·α_1 + n^2(γ_div + γ_sqrt) + 24n^2·nb·γ_1
  Q'*A*Q:                                 4pc^2·α_3 + 8n^3·γ_3
  Q*Z:                                    2pc^2·α_3 + 4n^3·γ_3
  Total (per sweep):                      6pc^2·α_3 + 8n^2·α_1 + n^2(γ_div + γ_sqrt) + 24n^2·nb·γ_1 + 12n^3·γ_3
  With recommended layout:                96p·α_3 + 2(n^2/√p)α_1 + (1/4)(n^2/√p)(γ_div + γ_sqrt)
                                          + (3/8)(n^3/p)γ_1 + 12(n^3/p)γ_3

Exploit symmetry:
  parteig, one sweep(e):                  8n^2·α_1 + n^2(γ_div + γ_sqrt) + 24n^2·nb·γ_1
  Q'*A*Q:                                 4pc^2·α_3 + 4n^3·γ_3
  Q*Z:                                    2pc^2·α_3 + 4n^3·γ_3
  Total (per sweep):                      6pc^2·α_3 + 8n^2·α_1 + n^2(γ_div + γ_sqrt) + 24n^2·nb·γ_1 + 8n^3·γ_3
  With recommended layout:                96p·α_3 + 2(n^2/√p)α_1 + (1/4)(n^2/√p)(γ_div + γ_sqrt)
                                          + (3/8)(n^3/p)γ_1 + 8(n^3/p)γ_3

aFor parallel codes we assume that the block size is chosen to be as large as possible, i.e. nb = n/(2pc)
where pc is the number of processor columns. For a serial code pc = n/(2·nb) can be chosen arbitrarily.
bThis is the method used by Pourzandi and Tourancheau[142], by Schreiber[150], and the method described
in figure 7.3.
cUsing fast Givens is often mentioned, but rarely implemented. Perhaps the benefit is not as good as this
model would suggest.
dThis is the method shown in figure 7.4.
eOne sweep of Jacobi on a matrix of size 2nb by 2nb.
fI also assume that only one processor in each processor column is involved in each partial eigendecom-
position.
Table 7.7: Communication cost for Jacobi methods (per sweep)

One-sided, 1-D data layout(a):
  Exchange column vectors:   4p·α + 2n^2·β
  Reduce A'A:                0
  Broadcast rotations(f):    0

One-sided, 2-D data layout(b):
  Exchange column vectors:   4pc·α + 2(n^2/pr)β
  Reduce A'A:                2pc·lg(pr)α + (n^2/(2pc))lg(pr)β
  Broadcast rotations(f):    2pc·lg(pr)α + (n^2/(2pc))lg(pr)β

Two-sided, 1-D data layout(c):
  Exchange column vectors:   4p·α + 2n^2·β
  Reduce A'A:                0
  Broadcast rotations(f):    2p·lg(p)α + (n^2/p)lg(p)β

Two-sided, 2-D data layout(d):
  Exchange row and column vectors:  6pc·log(√p)α + (3/2)(n^2/pr)log(√p)β
  Reduce A'A:                0
  Broadcast rotations(f):    4pc·log(pr)α + 4pc·log(pc)α + (n^2/(2pr))log(pc)β + (n^2/(2pc))log(pr)β

Two-sided, 2-D data layout(e):
  Exchange row and column vectors:  12√p·α + 3(n^2/√p)β
  Reduce A'A:                0
  Broadcast rotations(f):    8√p·α + 2(n^2/√p)β

aThis is the method used by Arbenz and Slapnicar[9].
bThis is the method used by Littlefield and Maschhoff[125].
cThis is the method used by Pourzandi and Tourancheau[142].
dThis is the 2D method most likely to be used today.
eThis is the method used by Schreiber[150].
fOn the unblocked methods we assume that communication is blocked even though the computation is
not. We also assume that each rotation is sent as a single floating point number. This is natural if you are
using fast Givens but requires extra divides and square roots if fast Givens are not used.
7.3.6 Blocking
Classical Jacobi methods annihilate individual entries whereas blocked Jacobi
methods use a partial eigendecomposition on blocks. Cyclic Jacobi methods use fewer
flops, especially if fast Givens rotations are used. But almost all of the floating point oper-
ations in blocked Jacobi methods are performed in matrix-matrix multiply operations, the
most efficient operation.
Both cyclic and blocked Jacobi methods can be blocked for communication. The
communication block size need only be an integer multiple of the computation block size.
Blocking for communication may be more important than blocking for computation because
it reduces the number of messages by a factor equal to the communication block size.
Blocking allows greater possibilities for the partial eigendecomposition. A better
partial eigendecomposition will lead to faster convergence. For example, performing two
Jacobi sweeps in the partial eigendecomposition would result in fewer sweeps through the
entire matrix. However, initial experiments indicate that on random matrices the best that
one can hope for is a reduction of lg(nb) in the number of full sweeps even if one uses a
complete eigendecomposition as the "partial eigendecomposition".
Using a block size that is smaller than the maximum allowed (i.e. nb < n/(2pc))
offers various possibilities. It allows communication to be pipelined to some extent. Alter-
natively, it allows more than pc processors to be involved in computing the partial eigende-
compositions.
The per sweep cost of the partial eigensolutions grows as the square of the block
size because larger block sizes mean that fewer processors are involved in the partial eigen-
decomposition8.
I recommend keeping the code simple by keeping the communication and compu-
tation block sizes equal and setting nb = n/(2pc) so that each parallel pairing involves one
partial eigendecomposition per processor column. Using a rectangular process grid such that
(16pr ≤ pc ≤ 32pr) requires a lower nb and hence allows the code to keep communication
and computation block sizes equal while holding the cost of the partial eigendecomposition
to between (3/8)(n^3/p)γ_1 and (3/4)(n^3/p)γ_1. On most machines this will be no more than half
the 5(n^3/p)γ_3 cost, in part because the partial eigendecomposition will fit in the highest
level data cache.
A larger computational block size increases the cost of partial eigendecomposition
and decreases the cost of the BLAS3 operations. A larger communication block size decreases
message latency cost but leaves less opportunity for overlapping communication with com-
putation. A larger ratio of pc to pr increases message latency but reduces the partial
eigendecomposition cost9. See section 7.3.9 for details on the partial eigendecomposition
cost.
7.3.7 Symmetry
Exploiting symmetry in two-sided Jacobi methods is important because it reduces
the number of flops per sweep from 12n^3 to 8n^3. However, exploiting symmetry while
maintaining load balance is difficult. If, in a blocked Jacobi method, the block size were set
to the largest value possible, i.e. n/(2pc), and a standard rectangular grid of processors were
used, half of the processors (either those above or below the diagonal) would be idle all
the time. Using a smaller block size would allow better load balance but gives up some
of the benefits of blocking. Alternatives, such as using a different processor layout for the
eigenvector update, are feasible, but their complexity makes them unattractive.
8This does not hold for nb < n/(2p).
9Assuming only one processor per processor column is involved in computing partial eigendecompositions.
In one-sided Jacobi methods, A'A is symmetric and only half of it need be com-
puted. In fact, only a quarter of it must be computed, as shown in the following section.
7.3.8 Storing diagonal blocks in one-sided Jacobi
One-sided Jacobi methods must compute diagonal blocks of A'A. This is shown
in the matlab code given in figure 7.5 as: blkA = A(:,[J,I])' * A(:,[J,I]). This is inefficient
because not only does it compute both halves of a symmetric (or Hermitian) matrix, but
A(:,I)'*A(:,I) and A(:,J)'*A(:,J) are already known. They are the diagonal blocks returned
by parteig on the most recent previous pairing which involved I and J respectively. Storing
these blocks for future use avoids the need to recompute them, although they may need to
be refreshed from time to time for accuracy reasons.
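The bookkeeping can be sketched as follows in Python/NumPy (an illustration only; the helper name and cache layout are invented): cache the diagonal Gram blocks and assemble blkA from the cache plus one fresh off-diagonal product.

```python
import numpy as np

def gram_block(A, I, J, cache):
    """Assemble blkA = A[:, I+J]' A[:, I+J], recomputing only the
    off-diagonal block; diagonal blocks come from a per-block cache."""
    for K in (I, J):
        key = tuple(K)
        if key not in cache:                 # refresh on first use
            cache[key] = A[:, K].T @ A[:, K]
    off = A[:, I].T @ A[:, J]                # the only new product
    top = np.hstack([cache[tuple(I)], off])
    bot = np.hstack([off.T, cache[tuple(J)]])
    return np.vstack([top, bot])

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 4))
I, J = [0, 1], [2, 3]
blk = gram_block(A, I, J, {})
```

In a real code the cached entries would be overwritten with the diagonal blocks returned by parteig after each pairing, and refreshed occasionally for accuracy.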
7.3.9 Partial Eigensolver
My performance models suggest that execution time is likely to be minimized
when the partial eigendecomposition consists of either one or two sweeps of Jacobi. The
per sweep cost of the partial eigensolver grows as O(n^2·nb/√p). In my recommended Jacobi
method, the partial eigensolver consists of one sweep of Jacobi, and based on the data
layout which I recommend, it costs (3/8)(n^3/p)γ_1 + O(n^2/√p), or roughly 10% to 30% of the total
cost of the sweep. Preliminary experiments indicate that with a block size of 32, using a
full eigendecomposition instead of a partial eigendecomposition may reduce the number of
sweeps by as much as 20%. Assuming that a full eigendecomposition of a 32 by 32 matrix
costs 6 times what a single sweep of Jacobi would cost, this analysis suggests that the added
cost of a full eigendecomposition will not reduce the number of sweeps sufficiently to result
in a net decrease in execution time, especially if DGEMM performs efficiently on a smaller
block size10. On the other hand, since most of the advantage of a full eigendecomposition
will come from the second sweep, using two sweeps of Jacobi in the partial eigensolver
may result in a net decrease in execution time. This analysis depends on a great many
assumptions and should be taken as a guide, not a prediction. Schreiber[150] reached a
similar conclusion.
In a non-blocked code, the "partial eigendecomposition" should consist of a rota-
tion, i.e. a full eigendecomposition. In a non-blocked code, the cost of the partial eigensolver,
10A smaller block size reduces the cost of the partial eigensolver.
though still O(n2nb), is lower because nb = 1 and for a 2 by 2 matrix, a single sweep of
Jacobi is a full eigendecomposition. Except for very small n, say n < 100, partial eigendecompositions, such as those suggested by Götze[85], are not likely to result in lower total
execution time.
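To make the 2 by 2 case concrete, here is a small Python sketch (illustrative, not the recommended implementation): a single exact Jacobi rotation diagonalizes a symmetric 2 by 2 matrix, i.e. one rotation is a full eigendecomposition.

```python
import math

def jacobi_2x2(a, b, d):
    """One exact two-sided Jacobi rotation on the symmetric 2x2 matrix
    [[a, b], [b, d]].  Returns ((e1, e2), (c, s)): for a 2x2 matrix a
    single rotation IS a full eigendecomposition, so e1, e2 are the
    eigenvalues."""
    if b == 0.0:
        return (a, d), (1.0, 0.0)
    theta = 0.5 * math.atan2(2.0 * b, d - a)
    c, s = math.cos(theta), math.sin(theta)
    e1 = c * c * a - 2.0 * c * s * b + s * s * d
    e2 = s * s * a + 2.0 * c * s * b + c * c * d
    # the rotated off-diagonal entry, which should vanish
    off = c * s * (a - d) + b * (c * c - s * s)
    assert abs(off) < 1e-12 * (abs(a) + abs(b) + abs(d))
    return (e1, e2), (c, s)

# [[4, 1], [1, 3]] has eigenvalues (7 +/- sqrt(5)) / 2
eigs, rot = jacobi_2x2(4.0, 1.0, 3.0)
assert abs(sum(eigs) - 7.0) < 1e-12  # trace is preserved
```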
In a blocked eigensolver, one must compute a partial eigendecomposition for each
pairing. Most commonly, a single sweep of two-sided Jacobi is used as the partial eigendecomposition. Since the elements in the diagonal blocks A[I,I] and A[J,J] are involved in
more pairings than the elements in the off-diagonal block A[I,J], they need not be annihilated in every
pairing.
The number of partial eigenproblems that can be performed simultaneously is
n/(2nb). If this is less than p, either the partial eigenproblems must themselves be performed
in parallel or some processors will be idle. Unless nb is quite large, say nb ≥ 64, it is
likely to be faster to compute them each on a single processor, especially since the partial
eigendecomposition is a two-sided, not one-sided, sweep.
If n/(2nb) = pc, it is natural to assign one processor within each processor column
to perform the partial eigendecomposition. If n/(2nb) > pc, each parallel pairing will have
more partial eigenproblems than processor columns, hence the code could assign different
partial eigenproblems to different processors within each processor column. The other alternative is to increase pc (decreasing pr). Hence, assigning different partial eigenproblems to
different processors within a column only makes sense if bandwidth cost makes increasing pc
unattractive. On the other hand, the only disadvantage to assigning different partial eigenproblems to different processors within a column (as opposed to increasing pc) is increased
code complexity.
If the cost of the divisions and square roots needed to form the rotations (O(n²)
per sweep, spread over the pc processor columns) is significant, one should
consider inexact rotations in the partial eigensolver. Götze points out that one need not
perform exact rotations and suggests a number of approximate rotations which avoid divides
and square roots[85]. It would be counterproductive to use inexact rotations (saving O(n²)
flops at the expense of increasing the number of sweeps and the accompanying O(n³) flops)
in a parallel cyclic Jacobi method. Likewise, I would be hesitant to use inexact rotations in
the partial eigensolver unless doing so makes it feasible to perform two sweeps in the partial
eigensolver. However, it is entirely possible that more sweeps with inexact rotations might
be better than fewer sweeps using exact rotations in the partial eigensolver.
Using a classical threshold scheme in the partial eigensolver is likely to save little
time, but using thresholds to perform more important rotations might improve performance.
A classical threshold scheme is not attractive because the processors performing fewer rotations would simply sit idle. However, having each processor compute the same number of
rotations, while using thresholds to skip some rotations, might allow the rotations performed
to be more productive.
7.3.10 Threshold
For serial cyclic codes, thresholds can significantly reduce the total number of
floating point operations performed, especially on spectrally diagonally dominant matrices.
Since Jacobi methods are most likely to be attractive on spectrally diagonally dominant
matrices, thresholds cannot be rejected as unimportant. However, in a blocked parallel program, an entire block can only be skipped if the whole block requires no rotations. As an
example, consider a blocked parallel Jacobi eigensolution of a 1024 by 1024 matrix on a
1024 node computer using a block size of 16. This would involve 63 (or 64) steps, each of
which would consist of 32 pairings performed in parallel. Each pairing involves a partial
eigendecomposition of a 2×16 by 2×16 matrix. If any of the off-diagonal elements in any
of the 32 pairings requires annihilation, no savings is achieved in that step. Hence, in the
worst case, if just 63 of the 499,001 off-diagonal elements (one per step) require annihilation,
the threshold algorithm realizes no benefit.
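The counting behind this worst case can be checked directly (a sketch; a round-robin schedule with blocks−1 parallel steps per sweep is assumed, matching the figures in the text):

```python
# Worst-case counting for the blocked-threshold example above:
# n = 1024 on 1024 nodes with block size nb = 16.
n, nb = 1024, 16
blocks = n // nb                   # 64 block columns
steps_per_sweep = blocks - 1       # 63 parallel steps per sweep
pairings_per_step = blocks // 2    # 32 pairings performed in parallel
pairing_dim = 2 * nb               # each pairing solves a 32 x 32 problem
assert (steps_per_sweep, pairings_per_step, pairing_dim) == (63, 32, 32)
# one stray off-diagonal element per step (63 in all) is enough to
# deny the threshold scheme any savings
```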
Corbato[47] devised a method for implementing a classical Jacobi method in O(n³)
time. His method involves keeping track of the largest off-diagonal element in each column.
The cost of maintaining this data structure would more than double the cost of each rotation
and may not lead to reduced execution time even in serial codes. However, Beresford
Parlett[137] pointed out to me that one need not keep track of the true largest element;
since each rotation must maintain the sum of the squares of the elements, allowing
the list of "largest" off-diagonal elements to be out-of-date would not seriously undermine the
advantage and would significantly reduce the overhead. This deserves further study.
Untested Threshold methods
One could design a code that used variable block sizes and/or switched from a
one-sided non-threshold Jacobi to a two-sided threshold Jacobi. A code could even scan
the matrix, identify the elements that need to be eliminated, and select pairings and block
sizes that would eliminate those elements as efficiently as possible. In our worst case example
given in the preceding paragraph, it might be that those 63 off-diagonal elements could be
annihilated in just two parallel steps, each requiring only a two element rotation.
Scanning all off-diagonal elements and choosing the largest n non-interfering elements might be an attractive compromise between the classical Jacobi method, which examines all off-diagonal elements and annihilates the largest, and the cyclic Jacobi method,
which annihilates all elements without regard to size. If software overhead could be kept
modest, such a method might pay off on small spectrally diagonally dominant matrices,
precisely the matrices that are best suited to Jacobi methods.
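A hypothetical sketch of such a scan (the greedy selection and the function name are mine, not the dissertation's): sort the off-diagonal entries by magnitude and keep the largest set whose index pairs are pairwise disjoint, so the corresponding rotations do not interfere and can be applied in parallel.

```python
def largest_noninterfering(a):
    """Greedily pick the largest off-diagonal entries of the symmetric
    matrix `a` (list of lists) whose index pairs are pairwise disjoint,
    so the corresponding rotations can be performed simultaneously."""
    n = len(a)
    candidates = sorted(
        ((abs(a[i][j]), i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    used, chosen = set(), []
    for mag, i, j in candidates:
        if mag > 0.0 and i not in used and j not in used:
            chosen.append((i, j))
            used.update((i, j))
    return chosen

pairs = largest_noninterfering([[0, 5, 1, 0],
                                [5, 0, 2, 9],
                                [1, 2, 0, 3],
                                [0, 9, 3, 0]])
assert pairs == [(1, 3), (0, 2)]  # (1,3) is largest; (0,2) next that fits
```

The greedy pass is O(n² log n) per step, far below the O(n⁴) per sweep of a strict classical search.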
Jacobi methods that attempt to annihilate larger elements, i.e. threshold methods,
work best with two-sided Jacobi methods. This is unfortunate because it appears that
one-sided Jacobi is otherwise preferred.
As mentioned in section 7.3.9, thresholds might be useful in the partial eigendecomposition.
7.3.11 Pairing
The order in which the off-diagonal elements are annihilated is referred to as the
pairing strategy. Eliminating off-diagonal element A(i,j) in a two-sided Jacobi requires that
rows i and j of A and columns i and j of A be rotated. Hence, rows i and j of A must be
distributed similarly, i.e. A(i,k) and A(j,k) must both reside on the same processor. Likewise,
columns i and j of A must be distributed similarly. Orthogonalizing vectors i and j in
a one-sided Jacobi also requires that the two vectors be distributed similarly. In order to
annihilate multiple off-diagonal elements simultaneously, they must reside on different sets
of processors.
The pairing strategy affects execution time through communication cost, the number of pairings per sweep, and the number of sweeps required for convergence. Different pairing strategies require different communication patterns and hence different communication
costs. Some pairing strategies require slightly more pairings than others. Mantharam and
Eberlein argue that some pairings lead to faster convergence than others[72].
In this section, we illustrate two pairing strategies, showing how each would pair
8 elements in 4 sets at a time. The elements might be individual indices (in a non-blocked
Jacobi) or blocks of indices. The sets might correspond to individual processors (in a 1D
data layout) or columns of processors (in a one-sided Jacobi on a 2D layout) or rows and
columns of processors (in a two-sided Jacobi on a 2D layout). Furthermore, several sets
might be assigned to the same processor or column of processors.
The classic round robin pairing strategy[84] leaves one element stationary and
rotates the other elements. As the following diagram shows, in 7 pairings, each element is
paired exactly once with each of the other elements. Elements 3 through 8 follow elements
2 through 7 respectively, while element 2 follows element 8.
    1 2 3 4    1 3 4 5    1 4 5 6    1 5 6 7
    8 7 6 5    2 8 7 6    3 2 8 7    4 3 2 8

    1 6 7 8    1 7 8 2    1 8 2 3
    5 4 3 2    6 5 4 3    7 6 5 4
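The rotation rule above can be written as a short generator (a sketch; the function name is mine):

```python
def round_robin(n):
    """Round-robin pairing schedule for an even number n of elements:
    element 1 stays put while the others shift, so that (as in the
    diagram above) elements 3..n follow 2..n-1 and element 2 follows
    element n.  The n-1 steps pair every element with every other
    element exactly once."""
    assert n % 2 == 0
    others = list(range(2, n + 1))
    schedule = []
    for _ in range(n - 1):
        line = [1] + others
        top, bottom = line[: n // 2], line[n // 2:][::-1]
        schedule.append(list(zip(top, bottom)))
        others = others[1:] + others[:1]  # shift the ring
    return schedule

steps = round_robin(8)
assert steps[0] == [(1, 8), (2, 7), (3, 6), (4, 5)]
all_pairs = {frozenset(p) for step in steps for p in step}
assert len(all_pairs) == 8 * 7 // 2   # every pair occurs exactly once
```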
A slight variation, called the caterpillar pairing method[72, 73, 150], cuts the
communication cost in half at the expense of increasing the number of pairings from n−1
to n. The caterpillar method, modified so that communication is always performed in the
same direction, is shown below. Only the elements in the top line rotate, and they always
rotate to the left. The elements in the bottom line get swapped into the top
line one at a time. In this pairing method, it takes 8 pairings in order for each element to
be paired with every other element. The swapped elements need not perform any work,
but must exchange the blocks assigned to them prior to the next communication step. This
pairing strategy requires 16 (in general 2n) pairings to come back to the original pairing,
but the second n pairings duplicate the first n.
    1 2 3 4    2 3 4 8    3 4 8 2    4 8 7 3    8 7 3 4
    5 6 7 8    5 6 7 1    5 6 7 1    5 6 2 1    5 6 2 1

    7 6 4 8    6 4 8 7    5 8 7 6    8 7 6 5
    5 3 2 1    5 3 2 1    4 3 2 1    4 3 2 1
Mantharam and Eberlein[72] suggest that some pairing strategies may lead to
convergence in fewer steps than others.
7.3.12 Pre-conditioners
One-sided Jacobi methods compute eigenvectors by orthogonalizing a matrix which
has the same or related left singular vectors as the original matrix. Some options include:

[U,D,V] = svd(A): U contains the eigenvectors of A; D is the absolute value of the
eigenvalues of A. This method is used by Berry and Sameh[21].

[U,D,V] = svd(chol(A)): U contains the eigenvectors of A; D is the square root of the
eigenvalues of A. This is used by Arbenz and Slapničar[9] and is mathematically
equivalent to classical Jacobi.

[Q,R] = qr(A); [U,D,V] = svd(R): Q·U contains the eigenvectors of A; D contains the
absolute value of the eigenvalues of A.
In addition, there are pivoting counterparts to both Cholesky and QR, indeed many flavors
of QR with pivoting, which would improve these pre-conditioners. If A is spectrally
diagonally dominant, permuting A so that the diagonal elements are non-increasing might
provide most of the benefit that Cholesky with pivoting does, and at considerably lower
cost.
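The permutation is cheap to sketch (illustrative Python; here P A Pᵀ denotes the symmetric permutation, and the function name is mine):

```python
def sort_by_diagonal(a):
    """Symmetrically permute the symmetric matrix `a` (list of lists)
    so that its diagonal entries are non-increasing.  Returns the
    permutation and P A P' -- a cheap stand-in for Cholesky with
    pivoting on spectrally diagonally dominant matrices."""
    n = len(a)
    perm = sorted(range(n), key=lambda i: -a[i][i])
    pap = [[a[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
    return perm, pap

perm, b = sort_by_diagonal([[1, 2, 3], [2, 5, 4], [3, 4, 2]])
assert [b[i][i] for i in range(3)] == [5, 2, 1]     # diagonal now sorted
assert all(b[i][j] == b[j][i] for i in range(3) for j in range(3))
```

The permutation costs O(n log n) comparisons plus a data movement pass, versus the O(n³)/3 flops of a pivoted Cholesky.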
7.3.13 Communication overlap
Overlapping communication and computation is attractive because in theory it reduces the total cost from the sum of the computation and communication costs to their maximum. Arbenz and Slapničar demonstrated that overlapping communication and computation is straightforward in a one-sided Jacobi method with a one-dimensional data layout[10].
But overlapping communication and computation when using a two-dimensional data layout is not as straightforward. Furthermore, actual experience with communication and
computation overlap has been disappointing; see section B.1.6.
7.3.14 Recursive Jacobi
The partial eigendecomposition could be a recursive call to a Jacobi eigensolver. A
recursive Jacobi could offer all the benefits shown by Toledo on LU[165], notably excellent
use of the memory hierarchy. Unfortunately, each level of recursion requires 6 calls, tripling
the software overhead. Therefore, the number of subroutine calls, and hence the software
overhead, grows at an unacceptably high O(n^lg(6)).
Increasing software overhead in order to reduce the number of sweeps will make
sense for large matrices but not for small matrices. Since Jacobi is unlikely to be faster than
tridiagonal based methods for large matrices, I feel that it is more important to concentrate
on making Jacobi fast on smaller matrices. Hence, I do not include recursion as a part of my
recommended Jacobi method. Nonetheless, it may be that one step of recursion (tripling
the software overhead) and conceivably two steps of recursion (increasing software overhead
by a factor of 9) may reduce total execution time, but I would not expect the improvement
to be significant.
7.3.15 Accuracy
Demmel and Veselić[58] prove that on scaled diagonally dominant matrices, Jacobi
can compute small eigenvalues with high relative accuracy while tridiagonal based methods
cannot. Drmač and Veselić[71] show that Jacobi methods can be used to refine an eigensolution, thereby providing high relative accuracy on scaled diagonally dominant matrices
at lower total cost than a full Jacobi. Demmel et al.[56] give a comprehensive discussion of
the situations in which Jacobi is more accurate than other available algorithms.
7.3.16 Recommendation
If I were asked to write one Jacobi method for all non-vector distributed memory
computers, it would be a one-sided blocked Jacobi method. It would use a one-dimensional
data layout on computers with fewer than 48 nodes and a two-dimensional data layout on
computers with 48 or more nodes. It would use 16-32 times as many processor columns as
rows in a two-dimensional data layout.¹¹ It would use a computational and communication
block size equal to¹² max(n/(2pc), 8), leaving processors idle if n/(2pc) < 8. It would
compute the partial eigendecompositions on just one processor in each processor column.
It would avoid recomputing diagonal entries unnecessarily, use a one-directional caterpillar
track pairing and one sweep of Jacobi for the partial eigendecomposition. It would use the
largest block size possible for both computation and communication.
If I had time to experiment, I would investigate different partial eigendecompositions, pre-conditioners, and pairing strategies, in that order. Overlapping communication
and computation appears to offer greater performance improvements in theory than in practice. I would use thresholds as a part of the stopping criteria, but wouldn't count on them
to avoid unnecessary flops. I would check to make sure that my suggested data layout (1D
for p < 48; 16pr < pc < 32pr for p ≥ 48; and nb = max(n/(2pc), 8)) was reasonable on
several computers, but unless there was a substantial benefit to tuning the data layout to
each machine I would hesitate to do so.
For vector machines I recommend an unblocked code with fast Givens rotations
if the cost of BLAS1 operations is no more than twice that of BLAS3 operations. If the
BLAS1 operations cost just twice what BLAS3 operations cost, the flop cost in an unblocked
¹¹The ratio pc/pr can be made to fall in the 16-32 range for any number of processors except 1 to 15, 32 to 63, and 128 to 144. No more than 2.1% of the processors are left idle following these rules.
¹²Definitions for all symbols used here can be found in Appendix A.
code would be 6/5 that of the blocked code (because unblocked codes using fast Givens
require 3/5 as many flops). Savings on other aspects can be expected to make up for this
difference on all but the largest matrices. Communication should still be blocked, however.
A one-dimensional data layout can be used for more nodes if a cyclic code is used, perhaps
as many as a hundred nodes, since block size is not an issue. As long as n ≥ 2p, a one-dimensional data layout is limited only by communication costs.
Combining elements of classical and cyclic Jacobi is an interesting long shot. Classical Jacobi always annihilates the largest off-diagonal element but requires O(n⁴) comparisons per sweep.¹³ Annihilating the n largest off-diagonal elements each time would roughly
match the number of comparisons performed to the number of flops performed. To parallelize this idea, one would have to choose the n largest non-interfering elements.
7.4 ISDA
The total execution time for the ISDA[97] for solving the symmetric eigenproblem¹⁴ will be no less than 100n³ on typical matrices. The execution time depends largely
on how many decouplings are required to make each of the smaller matrices no larger than
half the size of the original matrix. It also depends on the cost of each decoupling, but this
will not vary that much.
The ISDA achieves high floating point execution rates, but in order to beat tridiagonal methods it must achieve 100/(10/3) = 30 times higher floating point rates, which
it does not. The PRISM implementation of ISDA takes 36 minutes = 2160 seconds to
compute the eigendecomposition of a matrix of size 4800 by 4800 on the 100 node SP2 at
Argonne[29]; ScaLAPACK's PDSYEVX takes 397 seconds to compute the eigendecomposition
of a matrix of size 5000 by 5000 on a 64 node SP2[31]. ISDA should not require as large
a granularity, n/√p, as PDSYEVX because of its heavy reliance on matrix-matrix multiply.
However, at present, the PRISM implementation is still at least three times slower than
PDSYEVX even on small matrices. Solving a matrix of size 800 by 800 on 64 nodes takes 60
seconds using the PRISM ISDA code, whereas PDSYEVX can solve a matrix of size 1000 by
1000 on 64 nodes of an SP2 in 16 seconds.
¹³Or increased overhead if Corbato's method[47] is used.
¹⁴See section 2.7.3 for a brief description of the ISDA.

The cost of each decoupling depends upon how close the split happens to come to
an eigenvalue of the matrix being split. The number of beta function evaluations required
for a given decoupling is roughly −log(min_i |split − λ_i|), where split is the split point
selected for this decoupling. The distance between split and the nearest eigenvalue cannot be
computed in advance, but the number of evaluations is likely to fall in the range
(log(n)/log(1.5) + 2, log(n)/log(1.5) + 8). This is consistent with empirical results. For our
purposes we will say that the number of beta function evaluations is log(1500)/log(1.5) + 2 ≈ 20.
The cost per beta function evaluation is 2 matrix-matrix multiplies at 2(n′)³/p γ₃
each, where n′ is the size of the matrix being decoupled. Hence the cost for the first
decoupling is 2 × 2 × 20 n³/p γ₃ = 80n³/p γ₃.
If each decoupling splits the matrix exactly in half, round i of decouplings involves
2^i decouplings, each involving a matrix of size n/2^i, at a total cost of 2^i × 80(n/2^i)³/p γ₃ =
80n³/(4^i p) γ₃. The sum over all rounds would then be Σ_{i=0}^∞ 80n³/(4^i p) γ₃ = 80 × 4/3 n³/p γ₃ ≈ 107n³/p γ₃.
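The arithmetic above can be bundled into a toy cost model (units of n³/p, with the γ₃ time-per-flop factor dropped; the finite round cutoff is an assumption standing in for the infinite sum, and the function names are mine):

```python
import math

def beta_evaluations(n, slack=2):
    """The text's estimate of beta-function evaluations per decoupling:
    log(n)/log(1.5) + slack, with slack assumed between 2 and 8."""
    return math.log(n) / math.log(1.5) + slack

assert round(beta_evaluations(1500)) == 20   # the figure used in the text

def isda_cost(evals=20, rounds=60):
    """Total flop count of the ISDA in units of n^3/p: if every
    decoupling splits its matrix in half, round i performs 2**i
    decouplings on matrices of size n/2**i, each costing
    2 multiplies * evals * 2*(n')**3 flops."""
    return sum(2 ** i * 2 * evals * 2 * (0.5 ** i) ** 3
               for i in range(rounds))

# the geometric series sums to 80 * 4/3 ~ 107 (in units of n^3/p)
assert abs(isda_cost() - 320 / 3) < 1e-9
```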
The ISDA for symmetric eigendecomposition may require substantially longer on
some matrices with a single cluster of eigenvalues containing more than half of the eigenvalues and on matrices with most of the eigenvalues at one end of the spectrum.¹⁵ It is
unlikely that the first split point chosen for decoupling will lie in the middle of a cluster.
Hence, if the matrix contains one large cluster, that cluster will likely remain completely
in one of the two submatrices, making the decoupling less even and hence less successful.
Likewise, if most of the eigenvalues are at one end of the spectrum, the submatrix on that
end of the spectrum will likely be much larger than the other after the first decoupling. If
each decoupling splits off only 20% of the spectrum, the total time will be twice what it
would be if each decoupling split the spectrum exactly in half.
One could check to make sure that a reasonable split point has been chosen by
performing an LDLᵀ decomposition on the shifted matrix and counting the number of
positive or negative values in D. An LDLᵀ decomposition costs n³/3 flops, or about 0.5%
of the flops required to perform the full decoupling.
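A minimal sketch of that check (unpivoted LDLᵀ in pure Python; a real code would need pivoting or a perturbed shift when a pivot comes out exactly zero):

```python
def ldlt_inertia(a, shift=0.0):
    """Unpivoted LDL' factorization of a - shift*I (the symmetric
    matrix `a` given as a list of lists); returns the number of
    negative entries of D.  By Sylvester's law of inertia this equals
    the number of eigenvalues below the shift, so it verifies how a
    proposed ISDA split point divides the spectrum.  Assumes no pivot
    comes out exactly zero."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    d = [0.0] * n
    for j in range(n):
        d[j] = a[j][j] - shift - sum(L[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, n):
            L[i][j] = (a[i][j] - sum(L[i][k] * L[j][k] * d[k]
                                     for k in range(j))) / d[j]
    return sum(1 for x in d if x < 0.0)

# [[4, 1], [1, 3]] has eigenvalues (7 +/- sqrt(5))/2 ~ 2.38 and 4.62
assert ldlt_inertia([[4.0, 1.0], [1.0, 3.0]], shift=3.0) == 1
```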
7.5 Banded ISDA
Banded ISDA is very nearly a tridiagonal based method and hence offers performance that is nearly as good as tridiagonal based methods. PRISM's single processor
implementation of banded ISDA is two to three times slower than bisection (DSTEBZ)[26].

¹⁵Fann et al.[75] present a couple of examples of real applications that fit this description.
Computing eigenvectors using banded ISDA will not only be more difficult to code, it will
require about twice as many flops as inverse iteration. Banded ISDA requires additional
bandwidth reductions, each of which requires up to 2n³ additional flops during back transformation.¹⁶

Banded ISDA could make sense if reduction to banded form were twice as fast as
reduction to tridiagonal form, although even then one has to question whether it makes sense
to use banded ISDA instead of a banded solver.

Banded ISDA should perform a few shifted LDLᵀ decompositions to make sure
that the selected shift will leave at least 1/3 of the matrix in each of the two submatrices.
7.6 FFT
Yau and Lu[174] have implemented an FFT based invariant subspace decomposition method. It, like ISDA, uses efficient matrix-matrix multiply flops, but since it requires
100n³ flops, the same analysis which shows that ISDA will not be faster applies to it as well.
Domas and Tisseur have implemented a parallel version of the Yau and Lu method[60].

¹⁶The first bandwidth reduction essentially always requires the full 2n³ flops during back transformation, though later ones typically require less than that. However, taking advantage of the opportunity to perform fewer flops either means a complex data structure or that the update matrix Q be formed and then applied, adding another 4/3 n³ flops.
Chapter 8
Improving the ScaLAPACK
symmetric eigensolver
8.1 The next ScaLAPACK symmetric eigensolver
The next ScaLAPACK symmetric eigensolver will be 50% faster than the ScaLAPACK
symmetric eigensolver in version 1.5 and provide performance that is independent of the
user's data layout. Separating internal and external data layout will not only make the code
easier to use, because the user need not modify their storage scheme, it will also improve
performance. The next ScaLAPACK symmetric eigensolver will select the fastest of four
methods for reduction to tridiagonal form¹ and use Parlett and Dhillon's new tridiagonal
eigensolver[139].
Separating internal and external data layout allows execution time to be reduced
for several reasons. It allows reduction to tridiagonal form and back transformation to use
different data layouts. It allows reduction to tridiagonal form to use a square processor grid,
significantly reducing message latency and software overhead. It allows the code to support
any input and output data layout without all the layers of software required to support
any data layout. Last but not least, by concentrating our coding efforts on the simple but
efficient square cyclic data layout, we can implement several reduction to tridiagonal codes
and incorporate ideas that would be prohibitively complicated in a code that had to support
multiple data layouts.
¹On machines where timers are not available, a heuristic will be used which may not always pick the fastest.
The rest of this section concentrates on improving execution time in reduction to
tridiagonal form. Back transformation is already very efficient and hence leaves less room
for improvement. We leave the tridiagonal eigensolver to others[139]. Figure 8.1 gives a
top-level description of the next ScaLAPACK symmetric eigensolver.
Figure 8.1: Data redistribution in the next ScaLAPACK symmetric eigensolver

    Choose a data layout for reduction to tridiagonal form (see figure 8.2)
    Redistribute A to the reduction to tridiagonal form data layout
    Reduce to tridiagonal form
    Replicate the diagonal, (D), and sub-diagonal, (E), to all processors
    Use Parlett and Dhillon's tridiagonal eigendecomposition scheme
    Choose a data layout for back transformation:
        BCK-pr = ⌈√p/15⌉ ; BCK-pc = ⌊p/pr⌋ ; BCK-nb = ⌈n/(k pc)⌉
    If space is limited, redistribute A back to the original data layout
    Redistribute the eigenvectors, Z, to the back transformation data layout
    Redistribute A to the back transformation data layout
    Perform back transformation
    Redistribute the eigenvectors to the user's format
8.2 Reduction to tridiagonal form in the next ScaLAPACK symmetric eigensolver
Figure 8.2 shows how the data layout for reduction to tridiagonal form will be
chosen. The data layout and the code used for reduction to tridiagonal form must be
chosen in tandem.
Although the new PDSYTRD has three variants, they all share the same pattern of
communication and computation shown in figure 8.3.
Message initiations are reduced by using techniques first used in HJS, and several
new ones. HJS stores V and W in a row-distributed/column-replicated manner, which avoids
the need to broadcast them repeatedly. HJS also keeps the number of messages small by
combining messages wherever possible.
Our communication pattern has three advantages over HJS: it requires fewer messages, does not risk over/underflow, and uses only the BLACS communication primitives.²
The manner in which we compute the Householder vector requires the same number of
message initiations as HJS, but avoids the risk of over/underflow in the computation of
the norm. We use fewer messages than HJS because we update w in a novel manner (see
the discussion of Line 4.1 below) and we delay the spread of w (which HJS naturally performs
at the bottom of the loop) to the top of the loop so that it can be spread in the same
message that spreads v.

²Whether the right communication primitives were chosen for the BLACS may be debatable, but they are what is available for use within ScaLAPACK.

Figure 8.2: Choosing the data layout for reduction to tridiagonal form

    If timers (or an environmental inquiry routine) are available
        Time select operations
        Determine the best data layout for each of the four reduction to tridiagonal form codes
        Estimate the execution time for each of the four reduction to tridiagonal form codes
        Select the fastest code and the corresponding data layout
    elseif p/⌊√p⌋² ≥ 1.5 (i.e. if p = 2, 3, 6, 7, 14, 15)
        TRD-pr = ⌊p/7.5⌋ + 1
        TRD-pc = p/pr
        TRD-nb = 32
        Use old PDSYTRD
    else
        TRD-pc = ⌊√p⌋
        TRD-pr = TRD-pc
        TRD-nb = 1
        if the compiler is good
            if (n > 200√p)
                Use new PDSYTRD with compiled kernel
            else
                Use unblocked reduction to tridiagonal form (no BLAS)
            endif
        else
            if (n > 100√p)
                Use new PDSYTRD with DGEMV
            else
                Use unblocked reduction to tridiagonal form (no BLAS)
            endif
        endif
    endif
Our communication pattern has one disadvantage compared to HJS: it requires redundant
computation in the update of w. The discussion of Line 4.1 below explains that we can
choose to eliminate this redundant computation by increasing the number of messages.
Line 2.1 in Figure 8.3: In Section 8.4.1 we show how to avoid overflow while using just
2n lg(√p) messages.
Lines 3.2 and 3.6 in Figure 8.3: Only 2 messages are required to transpose a matrix
when a square processor layout is used. Each processor (a, b) must send a message
to, and receive a message from, its transpose processor (b, a). The required time is:

    Σ_{n′=1}^{n} 2(α + 2n′β/√p) = 2nα + 2n²β/√p
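A quick numerical check of this closed form (the α, β and p values are made up for illustration; the exact partial sum carries an (n+1) factor that the leading-order expression drops):

```python
import math

# alpha = latency, beta = per-element transfer time, p = processors
alpha, beta, p, n = 1e-5, 1e-8, 64, 1000
rp = math.sqrt(p)
summed = sum(2 * (alpha + 2 * k * beta / rp) for k in range(1, n + 1))
closed = 2 * n * alpha + 2 * n * (n + 1) * beta / rp
assert abs(summed - closed) < 1e-9 * closed
# dropping the lower-order term gives the 2n*alpha + 2n^2*beta/sqrt(p)
# quoted in the text
```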
Figure 8.3: Execution time model for the new PDSYTRD. Line numbers match Figure 4.5 (PDSYTRD execution time) where possible.

    do ii = 1, n, nb
        mxi = min(ii+nb, n)
        do i = ii, mxi
            Update current (ith) column of A:
            1.2       A = A − W Vᵀ − V Wᵀ
            Compute reflector:
            2.1       v = house(A)             [latency: 2n lg(√p) α]
            Perform matrix-vector multiply:
            1.1, 3.1  spread v, w              [latency: n lg(√p) α; bandwidth: n² lg(√p)/√p β]
            3.2       transpose v, w           [bandwidth: 2n²/√p β]
            3.3       w = tril(A) v,           [computation: (2/3)n³/p γ₂]
                      wᵀ = tril(A,−1) vᵀ       [bandwidth: 2n²/√p β]
            Update the matrix-vector product:
            4.1       w = w − W Vᵀv − V Wᵀv    [imbalance: n²/√p γ]
            3.6       w = w + transpose(wᵀ)
            Compute companion update vector:
            5.1       c = w · vᵀ,              [latency: 2n lg(√p) α]
                      w = τ w − (c τ/2) v
        end do
        Perform rank 2k update:
        6.3       A = A − W Vᵀ − V Wᵀ          [computation: (2/3)n³/p γ₃]
    end do

Line 4.1 in Figure 8.3: w = w − W Vᵀv − V Wᵀv can be computed in a number of
ways. W, V and v are distributed across processor rows and replicated across processor
columns. Wᵀ, Vᵀ and vᵀ are distributed across processor columns and replicated
across processor rows. Furthermore, since only the partial sums contributing to w are
known, the updates to w can be made on any processor column, and even spread across
various processor columns. Appendix B.1 shows how this update is performed without communication and shows that there is a range of options which trade off communication
and load imbalance.
Line 1.1 in Figure 8.3 updates the current block column. This can be implemented in
several ways. LAPACK's DSYTRD uses a right looking update³ because a matrix-matrix
multiply is more efficient than an outer product update. HJS uses a left looking update
because, on their cyclic data layout, the left looking update allows all processors to be
involved, reducing load imbalance.
Line 5.1 in Figure 8.3: Computing c = w vᵀ requires summing c within a processor column. In order to compute w in Line 5.1, c must be known throughout a processor
column. To allow w and v to be broadcast in the same message (Line 3.1), c is summed
and broadcast in the column that owns column i+1 of the matrix.
Line 6.3 in Figure 8.3: No communication is required here. W, Vᵀ and Wᵀ are already
replicated as necessary.
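To make the structure of Figure 8.3 concrete, here is a serial, unblocked Python sketch of the same Householder reduction (no W/V blocking and no communication; the function names are mine, and the plain sum-of-squares norm in `house` is exactly the computation that section 8.4.1 replaces with an overflow-safe version in parallel):

```python
import math

def house(x):
    """Householder vector: returns (v, tau, y) such that
    (I - tau v v') x = y e1, with the plain sum-of-squares norm."""
    sigma = math.sqrt(sum(t * t for t in x))
    v = list(x)
    if sigma == 0.0:
        return v, 0.0, 0.0
    v[0] += math.copysign(sigma, x[0])
    tau = 2.0 / sum(t * t for t in v)
    return v, tau, -math.copysign(sigma, x[0])

def tridiagonalize(a):
    """Serial, unblocked Householder reduction of the symmetric matrix
    `a` (list of lists) to tridiagonal form -- the computation that
    Figure 8.3 organizes into blocked, parallel form."""
    n = len(a)
    a = [row[:] for row in a]
    for i in range(n - 2):
        m = n - i - 1
        v, tau, sub = house([a[k][i] for k in range(i + 1, n)])
        a[i + 1][i] = a[i][i + 1] = sub
        for k in range(i + 2, n):
            a[k][i] = a[i][k] = 0.0
        if tau == 0.0:
            continue
        # p = tau * A22 v, then the symmetric rank-2 update
        p = [tau * sum(a[i + 1 + r][i + 1 + c] * v[c] for c in range(m))
             for r in range(m)]
        gamma = 0.5 * tau * sum(p[k] * v[k] for k in range(m))
        w = [p[k] - gamma * v[k] for k in range(m)]
        for r in range(m):
            for c in range(m):
                a[i + 1 + r][i + 1 + c] -= v[r] * w[c] + w[r] * v[c]
    return a

t = tridiagonalize([[4.0, 1.0, 2.0, 0.0],
                    [1.0, 3.0, 1.0, 1.0],
                    [2.0, 1.0, 5.0, 2.0],
                    [0.0, 1.0, 2.0, 4.0]])
# a similarity transform preserves the trace, and everything beyond
# the first off-diagonal is driven (exactly) to zero
assert abs(sum(t[i][i] for i in range(4)) - 16.0) < 1e-9
assert abs(t[0][2]) + abs(t[0][3]) + abs(t[1][3]) < 1e-12
```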
8.3 Making the ScaLAPACK symmetric eigensolver easier to
use
The next ScaLAPACK symmetric eigensolver will separate internal data layout from
external data layout while executing 50% faster than PDSYEVX on a large range of problem
sizes on most distributed memory parallel computers, and while requiring less memory. Separating
internal and external data layout allows the user to choose whatever data layout is most
appropriate for the rest of their code and to use that data layout regardless of the problem
size and computer they are using. Separating internal and external data layouts also makes
it easy for the ScaLAPACK symmetric eigensolver to add support for additional data layouts.
However, while these ease-of-use issues are the most important advantages of separating
internal and external data layout, we will focus further discussion on how this separation improves
performance.
8.4 Details in reducing the execution time of the ScaLAPACK
symmetric eigensolver
Separating internal and external data layout will improve the performance of
PDSYEVX by allowing PDSYEVX to use different data layouts for different tasks, and by allowing PDSYEVX to concentrate only on the most efficient data layout for each task. A reduction
to tridiagonal form which only works on a cyclic data layout on a square processor grid will
not only have lower overhead and load imbalance than the present reduction to tridiagonal
form, but will be able to incorporate techniques that would be prohibitively complicated if
they were implemented in a code that must support all data layouts.

³A right looking update updates the current column with a matrix-matrix multiply. A left looking update updates every column in the block column with an outer product update.
Significant reduction of the execution time in PDSYEVX, the ScaLAPACK symmetric
eigensolver, requires that all four sources of inefficiency (message latency, message transmission, software overhead and load imbalance) be reduced. Fortunately, as Hendrickson,
Jessup and Smith[91] have shown, all of these can be reduced. PDSYEVX sends 3 times as
many messages as necessary⁴ and requires 3 times as much message volume as well.⁵ Overhead and load imbalance costs are harder to quantify. Load imbalance costs will be reduced
by using data layouts appropriate to each task.⁶ If necessary, load imbalance costs can be
further reduced at the expense of increasing the number of messages sent. Overhead will
be reduced by eliminating the PBLAS, reducing the number of calls to the BLAS and, where
a sufficiently good compiler is available, eliminating the calls to the BLAS entirely.
8.4.1 Avoiding over ow and under ow during computation of the House-
holder vector without added messages
Over ow and under ow can be avoided during the computation of the Householder
vector without added messages by using the pdnrm2 routine to broadcast values. The eas-
iest way to compute the norm of a vector in parallel is to sum the squares of the elements.
However, this will lead to over ow if the square of one of the elements or one of the inter-
mediate values are greater than the over ow threshold (likewise under ow occurs if one or
more of the squares of the elements or the intermediate vallues is less than the under ow
threshold). The ScaLAPACK routine pdnrm2 avoids under ow and over ow during reduc-
tions by computing the norm directly leaving the result on all processors in the processor
column. The requires 2 lg(pr)� execution time. In PDSYTRD, � = A(i+1; i) is broadcast
4 PDSYEVX uses 17 n log(sqrt(p)) messages and HJS uses 9 n log(sqrt(p)); we will show that this can be reduced to 5 n log(sqrt(p)), but do not claim that this is minimal.
5 PDSYEVX sends (5 log(sqrt(p)) + 2) n^2/sqrt(p) elements per processor and HJS reduces this to (1/2 log(sqrt(p)) + 5/2) n^2/sqrt(p) elements per processor. The design I suggest requires (3/2 log(sqrt(p)) + 5/2) n^2/sqrt(p) elements per processor but requires fewer messages.
6 Statically balancing the number of eigenvectors assigned to each processor column will reduce load imbalance in back transformation. Using a smaller block size will reduce load imbalance in reduction to tridiagonal form.
to all processors in the processor column; this broadcast also requires 2 lg(pr) α execution
time. In HJS, they sum the squares of the elements and broadcast β = A(i+1, i) at the
same time by summing an additional value in the reduction. All processors except the
processor that owns A(i+1, i) contribute 0 to the sum, while the processor owning A(i+1, i)
contributes A(i+1, i).
In the new PDSYEVX, we will employ this trick to broadcast β at the same time
as the norm is computed. It is slightly more complicated because norm computations do
not preserve negative numbers. Hence, we compute two norms: max(0, β) and max(0, −β),
from which β is easily recovered. Ideally, we need a new PBLAS or BLACS routine which
would simultaneously compute a norm and broadcast both it and other values.
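The combined reduction can be sketched as follows. This is a Python illustration of the idea only, assuming a dlassq-style (scale, sumsq) representation for the scaled sum of squares; the four-tuple format and function names are mine, not pdnrm2's actual interface. The two extra slots must be nonnegative because they ride along in the same reduction using max:

```python
import math

def combine(a, b):
    """Merge two (scale, sumsq, pos, neg) contributions. (scale, sumsq)
    follow the LAPACK dlassq convention, so the running sum of squares
    never overflows or underflows; pos and neg are combined with max."""
    (sa, qa, pa, na), (sb, qb, pb, nb) = a, b
    if sa < sb:
        sa, qa, sb, qb = sb, qb, sa, qa
    if sa > 0.0:
        qa += qb * (sb / sa) ** 2
    return (sa, qa, max(pa, pb), max(na, nb))

def local_part(x_local, beta_if_owner=None):
    """One processor's contribution: scaled sum of squares of its local
    elements, plus max(0, beta) and max(0, -beta) (zero on non-owners)."""
    scale, sumsq = 0.0, 1.0
    for xi in x_local:
        if xi != 0.0:
            ax = abs(xi)
            if scale < ax:
                sumsq = 1.0 + sumsq * (scale / ax) ** 2
                scale = ax
            else:
                sumsq += (ax / scale) ** 2
    b = beta_if_owner if beta_if_owner is not None else 0.0
    return (scale, sumsq, max(0.0, b), max(0.0, -b))

# Simulate the reduction over four "processors"; only one owns beta.
beta = -3.5
parts = [local_part([1.0, 2.0]), local_part([2.0], beta_if_owner=beta),
         local_part([4.0]), local_part([])]
acc = parts[0]
for part in parts[1:]:
    acc = combine(acc, part)
scale, sumsq, pos, neg = acc
norm = scale * math.sqrt(sumsq)      # ||(1, 2, 2, 4)|| = 5
recovered_beta = pos - neg           # -3.5, despite the max-based combine
```

Because combine is associative, the same result is obtained regardless of the reduction tree, which is what lets the trick piggyback on the existing norm reduction.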
8.4.2 Reducing communications costs
Communications costs can be reduced in both reduction to tridiagonal form and
back transformation, but by vastly different methods. PDSYTRD, ScaLAPACK's reduction to
tridiagonal form code, will use a cyclic data layout on a square processor grid to simplify
the code, allowing PDSYEVX to use the techniques demonstrated by Hendrickson, Jessup and
Smith[91]: direct transpose, a column-replicated/row-distributed data layout for intermediate
matrices, and combining messages. In addition, PDSYTRD will delay the last operation
in the loop to combine it with the first, reducing the number of messages per loop iteration
from 6 to 5.
Communication costs will be reduced in back transformation by using a rectangular
grid and a relatively large block size. Most of the communication in back transformation
is within processor columns, and the communication within processor columns cannot be
pipelined (meaning that it grows as log(pr)), hence setting pc to be substantially larger
(roughly 4-8 times larger) than pr will cut message volume nearly in half compared to the
message volume required for a square processor grid.
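As a rough illustration of why a tall-narrow grid helps, the following toy model (my own simplification, not a cost formula from this thesis) charges each processor column n^2/pc words of broadcast traffic, multiplied by log2(pr) because the column broadcasts cannot be pipelined:

```python
import math

def column_comm(n, pr, pc):
    """Toy estimate of column-broadcast volume in back transformation:
    roughly n^2/pc words travel down each processor column, and an
    unpipelined tree broadcast over pr rows multiplies that by log2(pr).
    The constants are illustrative assumptions only."""
    return (n * n / pc) * max(1.0, math.log2(pr))

n = 4096
square = column_comm(n, 8, 8)    # square grid, p = 64
rect = column_comm(n, 4, 16)     # pc = 4 * pr, same p = 64
assert rect < square             # the rectangular grid communicates less
```

Under this particular model the 4-to-1 rectangular grid carries one third of the square grid's column-broadcast volume; the real savings depend on the machine and the share of row communication.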
Communications cost could be reduced further on select computers by writing machine-specific
BLACS implementations7, but I don't think that the benefit will justify the
7 Karp et al.[107] proved that a broadcast or reduction of k elements on px processors can be executed in log(px) α + k β time. Equally importantly, the latency term can be reduced significantly by machine-specific code because latency is primarily a software cost; the actual hardware latency is typically less than one tenth of the total observed latency. I believe that by coding broadcasts and reductions in a machine-specific manner, I could reduce the latency to α_software + log(px) α_hardware. It might be possible to achieve a similar result using active messages. Machine-specific optimization of the BLACS broadcast and reduction codes is attractive because it would benefit all of the ScaLAPACK matrix transformation codes. However,
cost. In PDSYEVX as shipped in version 1.5 of ScaLAPACK, software overhead and load imbalance
are roughly twice as high as communications cost on the PARAGON. The new PDSYEVX
should reduce communications by at least a factor of 2, and though I hope it will reduce
software overhead and load imbalance by close to a factor of 4, overhead and load imbalance
will probably remain larger than communications cost. The fact that communications cost
is not the dominant factor limiting efficiency limits the improvement that one can expect
from machine-specific BLACS implementations.
Communications cost in back transformation could be reduced further by overlapping
communication and computation and/or using an all-to-all broadcast pattern instead
of a series of broadcasts. Back transformation enjoys the luxury of being able to compute
the majority of what it needs to communicate in advance. This allows many possibilities
for reducing the communications bandwidth cost. The fact that message latency, load
imbalance and software overhead costs are modest in back transformation means that a
reduction in the communications bandwidth cost ought to result in significant performance
improvement in back transformation. However, overlapping communication and computation
has historically offered less benefit in practice than in theory (see section B.1.6),
so I approach this with caution and will not pursue it without first convincing myself that
the benefit is significant on several platforms.
8.4.3 Reducing load imbalance costs
Load imbalance can be reduced in both reduction to tridiagonal form and back
transformation by careful selection of the block size. The number of messages in reduction
to tridiagonal form is not dependent on the data layout block size, hence a cyclic data
layout (i.e. block size of 1) will be used, reducing load imbalance. The fact that only half
of the flops in reduction to tridiagonal form are BLAS3 flops and the large number of load-imbalanced
row operations combine to make the optimal algorithmic block size for reduction
to tridiagonal form small.
Load imbalance is minimized in back transformation by choosing a block size
which assigns a nearly equal number of eigenvectors to each column of processors (nb =
ceil(n/(k pc)) for some small integer k). A block cyclic data layout reduces execution time
in back transformation by reducing the number of messages sent, hence we must look for
purely from the point of view of improving the performance of the ScaLAPACK symmetric eigensolver, this effort probably would not be worthwhile.
other ways to reduce load imbalance. Fortunately, all eigenvectors must be updated at each
step, hence a good static load balance of eigenvectors across processor columns eliminates
most of the load imbalance in back transformation. The load imbalance within each column
of processors is less important because the number of processor rows will be small. The
computation of T can be performed simultaneously on all processor columns, eliminating
the load imbalance in that step.
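To make the block-size rule concrete, here is a small sketch (my own illustration; the values of n, pc and k are arbitrary) counting how many eigenvector columns each processor column owns under a block-cyclic layout:

```python
import math

def vectors_per_column(n, nb, pc):
    """Eigenvector columns owned by each of pc processor columns under
    a block-cyclic layout with block size nb."""
    counts = [0] * pc
    for j in range(n):
        counts[(j // nb) % pc] += 1
    return counts

n, pc, k = 1000, 8, 2
nb = math.ceil(n / (k * pc))            # the nb = ceil(n/(k pc)) rule above
balanced = vectors_per_column(n, nb, pc)
naive = vectors_per_column(n, 200, pc)  # an ill-chosen large block size
spread = max(balanced) - min(balanced)      # small: good static balance
bad_spread = max(naive) - min(naive)        # here 3 of 8 columns own nothing
```

With nb chosen by the rule every processor column owns nearly the same number of eigenvectors (the spread is at most one block), while the ill-chosen block size leaves some columns completely idle during back transformation.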
8.4.4 Reducing software overhead costs
There are many ways to reduce software overhead, but software overhead is poorly
understood and hence it is hard to predict which method will be best. Hendrickson, Jessup
and Smith[91] showed that using a cyclic data layout and a square processor grid reduces the
number of DTRMV calls from O(n^2/nb) to O(n) because each local matrix is triangular. Using
lightweight (no error checking, minimal overhead) BLAS would reduce software overhead, but
these are still in the planning stages. If the compiler produces efficient code for a simple
doubly nested loop, software overhead can be further reduced by using compiled code
instead of calls to the BLAS. Peter Strazdins has shown that software overhead within the
PBLAS can be reduced by up to 50%[161, 160]. Alternatively, eliminating the PBLAS entirely
would eliminate the overhead associated with the PBLAS. I would prefer to reduce the PBLAS
overhead and continue to use the PBLAS, but that is likely to be much harder than simply
abandoning the PBLAS.
When PDSYTRD, ScaLAPACK's reduction to tridiagonal form, was written, the PBLAS
did not support column-replicated/row-distributed matrices or algorithmic blocking. Hence,
many of the ideas mentioned here for improving the performance of PDSYTRD were not
available to a PBLAS-based code. PBLAS version 2 now offers these capabilities.
Software overhead cannot be measured separately from other costs and is hence
difficult to measure, understand and reason about. It varies widely from machine to machine
and can change just by changing the order in which subroutines are linked. We do not,
for example, know how much can be attributed to subroutine calls, how much is caused
by error checking, how much is caused by loop unrolling and how much is caused by code
cache misses.
A good compiler should be able to compute the local portion of Av faster than
two calls to DTRMV because a simple doubly nested loop could access each element in the
local portion of A only once, whereas two calls to DTRMV would require that each element
in A be read twice. The result is that the ratio of flops to main memory reads is 4-to-1
in the doubly nested loop versus 2-to-1 in DTRMV8. Furthermore, a compiled kernel would
avoid the BLAS overhead and might involve less loop unrolling, reducing overhead directly
and reducing code cache pressure as well. However, compiler technology is uneven, so we
would make using compiled code instead of the BLAS optional.
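The fused loop can be sketched as follows (a plain Python illustration of the access pattern, not the proposed compiled kernel itself); each stored off-diagonal element is loaded once and used in two multiply-adds, which is where the 4-to-1 flop-to-read ratio comes from:

```python
def sym_matvec_fused(A_lower, v):
    """y = A*v for symmetric A stored as its lower triangle (a list of
    rows, row i holding A[i][0..i]).  Each stored off-diagonal element
    is read once and used twice: once for row i and once, by symmetry,
    for row j.  Two triangular matvecs would read each element twice."""
    n = len(v)
    y = [0.0] * n
    for i in range(n):
        row = A_lower[i]
        for j in range(i):
            y[i] += row[j] * v[j]   # contribution of A[i][j]
            y[j] += row[j] * v[i]   # contribution of A[j][i] = A[i][j]
        y[i] += row[i] * v[i]       # diagonal term
    return y

# A = [[2, 1], [1, 3]] stored as its lower triangle; A * [1, 1] = [3, 4]
assert sym_matvec_fused([[2.0], [1.0, 3.0]], [1.0, 1.0]) == [3.0, 4.0]
```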
Unblocked reduction to tridiagonal form will likely be faster than blocked reduction
to tridiagonal form on problem sizes where software overhead is the dominant cost.
Unblocked reduction to tridiagonal form on a cyclic data layout eliminates load imbalance
and requires a minimum of communication and software overhead. The only disadvantage is
that all of the 4/3 n^3 flops are BLAS2 flops. However, with a good compiler, these BLAS2
flops can perform well on most computers. The kernel in an unblocked reduction to tridiagonal
form involves 8 flops per read-modify-write memory access9. Most computers have
adequate main memory bandwidth to handle this at full speed. However, not all compilers
are good enough yet.
8.5 Separating internal and external data layout without increasing memory usage
Separating internal and external data layout will require memory-intensive data redistribution,
but making the data redistribution codes more space-efficient will save enough
memory space to offset the memory needs of separating internal and external data layout.
Data redistributions between two data layouts with different values of pr, pc or nb use
messages of O(n^2/p^(3/2) + nb^2) data elements. However, degenerate data redistributions
between two data layouts with the same values of pr, pc or nb use messages of roughly
n^2/p elements. In order to avoid treating degenerate data redistributions separately, the
current redistribution codes require n^2/p buffer space for all redistributions. Splitting one
large message into several smaller ones is not conceptually difficult but will require that the
code be rewritten and that the testing be augmented to properly exercise the new
paths. However, the execution time will not be significantly affected. Both PDLARED2D, the
8 These ratios are 8-to-1 and 4-to-1 respectively for Hermitian matrices.
9 The ratio for reducing Hermitian matrices to tridiagonal form is 16 flops per read-modify-write operation.
eigenvector redistribution routine, and DGMR2D, the general purpose redistribution routine,
will have to be modified.
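The chunking itself is simple; here is a sketch (illustrative only, since the real change lives inside PDLARED2D's and DGMR2D's message loops) of splitting one large redistribution message into bounded pieces so neither side needs an n^2/p buffer:

```python
def send_in_chunks(buffer, max_chunk):
    """Split one large redistribution message into several bounded ones,
    so the receiver never stages more than max_chunk elements at a time.
    Yields (offset, chunk) pairs in order."""
    for off in range(0, len(buffer), max_chunk):
        yield off, buffer[off:off + max_chunk]

# Reassemble on the "receiving" side and check nothing was lost.
data = list(range(10_000))          # stands in for an n^2/p element message
received = [None] * len(data)
peak = 0
for off, chunk in send_in_chunks(data, 1024):
    peak = max(peak, len(chunk))    # largest single message actually sent
    received[off:off + len(chunk)] = chunk
assert received == data and peak <= 1024
```

The extra work is bookkeeping, not data movement, which is why the execution time is not significantly affected: the same elements cross the network either way.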
If the redistribution routines are not modified as described above, memory usage
would increase from 4n^2/p to 6n^2/p, and there would be a remote risk of the eigensolver
crashing. While both PDLARED2D and DGMR2D require n^2/p space and could use the same
space, they do not. PDLARED2D uses space passed to it in the WORK array, while DGMR2D calls
malloc to allocate space. The eigensolver could crash if a message of n^2/p elements were
sent and the communication system was unable to allocate a buffer of that size. Messages of
that size are not required during normal ScaLAPACK eigensolver tests, hence the eigensolver
could crash during regular use even after passing all tests and after months or even years
of flawless service. Modifying the redistribution routines as we propose eliminates this
potential problem.
Memory needs could be reduced from 4n^2/p to 3n^2/p by using the space allocated
to the input matrix, A, and the output matrix, Z, as internal workspace. This would
require a modification to the present calling sequence, probably in the form of a new data
descriptor. However, reducing memory usage by 25% may not justify a change to the calling
sequence.
Chapter 9
Advice to symmetric eigensolver users
Parallel dense symmetric eigensolvers should be used if none of the following
counter-indications hold. Use a serial eigensolver if the problem is small enough to fit1. Use
a sparse eigensolver if your input matrix is sparse2 and you don't need all the eigenvalues,
or if the matrix is dense and you only need a small fraction of the eigenvalues. Use a
Jacobi eigensolver if you need to compute small eigenvalues of a scaled diagonally dominant
matrix (or a matrix satisfying one of the other properties described by Demmel et al.[56])
accurately. Use a Jacobi eigensolver for small (n < 100 sqrt(p)) spectrally diagonally dominant
matrices3.
Currently the three most readily available parallel dense symmetric eigensolvers
are PeIGs and ScaLAPACK's PDSYEV and PDSYEVX. PeIGs and PDSYEV maintain orthogonality
among eigenvectors associated with clustered eigenvalues. PeIGs and PDSYEVX are faster
than PDSYEV. PDSYEVX scales better than either PeIGs or PDSYEV.
The choice between PeIGs and ScaLAPACK is probably more a matter of which
infrastructure4 is preferred and is outside the scope of this thesis. Furthermore, it is likely
that PeIGs will at some point use the ScaLAPACK symmetric eigensolver. Hence,
1 i.e. if memory allows.
2 The break-even point is not known, so I suggest that if your matrix is less than 10% non-zero and you need less than 10% of the eigenvalues, you should use a sparse eigensolver.
3 Spectrally diagonally dominant means that the eigenvector matrix, or a permutation thereof, is diagonally dominant.
4 PeIGs is built on top of Global Arrays[101] while ScaLAPACK is built on the BLACS or MPI.
the upgrade path for both may end up with the same underlying code. If you are not likely
to use more than 32 processors, PeIGs performance should be acceptable5. If your input
matrices do not include large clusters of eigenvalues or if you can accept non-orthogonal
eigenvectors, PDSYEVX is the right choice. Otherwise, i.e. if your input matrix has large
clusters of eigenvalues for which you need orthogonal eigenvectors, and you wish to use
more than 32 processors, PDSYEV is the right choice. Eventually, the improved version of
PDSYEVX described in Chapter 8 will be the method of choice in all cases.
5Since PeIGs uses a 1D data layout, its performance will degrade if you use more than 32 processors.
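The advice of this chapter can be condensed into a decision procedure. The following sketch encodes it with illustrative parameter names of my own; the thresholds are the rules of thumb given above, not hard cutoffs:

```python
def pick_eigensolver(n, p, fits_on_one_processor, sparse,
                     frac_eigenvalues_needed, accurate_small_on_sdd,
                     spectrally_diag_dominant, large_clusters,
                     need_orthogonal_eigenvectors):
    """Encode the eigensolver-selection advice as a decision procedure.
    accurate_small_on_sdd: need accurate small eigenvalues of a scaled
    diagonally dominant matrix (or one of the related matrix classes)."""
    if fits_on_one_processor:
        return "serial eigensolver"
    if (sparse and frac_eigenvalues_needed < 1.0) or \
            frac_eigenvalues_needed < 0.1:
        return "sparse eigensolver"
    if accurate_small_on_sdd:
        return "Jacobi eigensolver"
    if spectrally_diag_dominant and n < 100 * p ** 0.5:
        return "Jacobi eigensolver"   # small spectrally diag. dominant case
    if large_clusters and need_orthogonal_eigenvectors:
        # PeIGs' 1D layout limits it to roughly 32 processors.
        return "PDSYEV" if p > 32 else "PeIGs"
    return "PDSYEVX"
```

For example, a large dense problem with clustered eigenvalues, orthogonality requirements, and 64 processors lands on PDSYEV; drop the clustering and it lands on PDSYEVX.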
Part II
Second Part
Bibliography
[1] R.C. Agarwal, S.M. Balle, F.G. Gustavson, M. Joshi, and P. Palkar. A three-
dimensional approach to parallel matrix multiplication. IBM Journal of Research and
Development, 39(5), 1995. Also available as: http://www.almaden.ibm.com/journal/
rd/agarw/agarw.html.
[2] R.J. Allan and I.J. Bush. Parallel diagonalisation routines. Technical report, The
CCLRC HPCI Centre at Daresbury Laboratory, 1996. http://www.dl.ac.uk/TCSC/
Subjects/Parallel_Algorithms/diags/diags.doc.
[3] A. Anderson, D. Culler, D. Patterson, and the NOW Team. A case for networks
of workstations: NOW. IEEE Micro, Feb 1995. http://now.CS.Berkeley.EDU/
Papers2.
[4] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum,
S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users'
Guide (second edition). SIAM, Philadelphia, 1995. 324 pages.
[5] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. DuCroz, A. Greenbaum,
S. Hammarling, A. McKenney, and D. Sorensen. LAPACK: A portable linear algebra
library for high-performance computers. Computer Science Dept. Technical Report
CS-90-105, University of Tennessee, Knoxville, 1990. LAPACK Working Note #20
http://www.netlib.org/lapack/lawns/lawn20.ps.
[6] E. Anderson and J. Dongarra. Evaluating block algorithm variants in LAPACK. Com-
puter Science Dept. Technical Report CS-90-103, University of Tennessee, Knoxville,
1990. (LAPACK Working Note #19).
[7] ANSI/IEEE, New York. IEEE Standard for Binary Floating Point Arithmetic, Std
754-1985 edition, 1985.
[8] P. Arbenz, K. Gates, and Ch. Sprenger. A parallel implementation of the symmetric
tridiagonal QR algorithm. In Frontier's 92, McLean, Virginia, 1992.
[9] P. Arbenz and I. Slapničar. On an implementation of a one-sided block Jacobi method
on a distributed memory computer. Z. Angew. Math. Mech., (76, Suppl. 1):343-344,
1996. http://www.inf.ethz.ch/personal/arbenz/ICIAM_jacobi.ps.gz.
[10] Peter Arbenz and Michael Oettli. Block implementations of the symmetric QR and
Jacobi algorithms. Technical Report 178, Swiss Institute of Technology, 1995. ftp:
//ftp.inf.ethz.ch:/pub/publications/tech-reports/1xx/178.ps.
[11] K. Asanovic. IPM: Interval performance monitoring. http://www.icsi.berkeley.edu/~krste/ipm/IPM.html.
[12] C. Ashcraft. A taxonomy of distributed dense LU factorization methods. Technical
Report ECA-TR-161, Boeing Computer Services, March 1991.
[13] Z. Bai and J. Demmel. On a block implementation of Hessenberg multishift QR
iteration. International Journal of High Speed Computing, 1(1):97-112, 1989. (also
LAPACK Working Note #8 http://www.netlib.org/lapack/lawns/lawn8.ps).
[14] R. Barlow, D. Evans, and J. Shanehchi. Parallel multisection applied to the eigenvalue
problem. Comput. J., 6:6-9, 1983.
[15] R.H. Barlow and D.J. Evans. A parallel organization of the bisection algorithm. The
Computer Journal, 22(3), 1978.
[16] Mike Barnett, Lance Shuler, Robert van de Geijn, Satya Gupta, David Payne, and
Jerrell Watts. Interprocessor collective communication library (InterCom). In Pro-
ceedings of the Scalable High Performance Computing Conference, pages 357-364.
IEEE, 1994. ftp://ftp.cs.utexas.edu/pub/rvdg/shpcc.ps.
[17] A. Basermann and P. Weidner. A parallel algorithm for determining all eigenvalues of
large real symmetric tridiagonal matrices. Parallel Computing, 18:1129-1141, 1992.
[18] K. Bathe. Finite Element Procedures in Engineering Analysis. Prentice Hall, Inc.,
Englewood Cliffs, NJ, 1982.
[19] A. Beguelin, J. Dongarra, A. Geist, R. Manchek, and V. Sunderam. A users' guide
to PVM: Parallel virtual machine. Technical Report ORNL/TM-11826, Oak Ridge
National Laboratory, Oak Ridge, TN, July 1991.
[20] H. Bernstein and M. Goldstein. Parallel implementation of bisection for the calculation
of eigenvalues of tridiagonal symmetric matrices. Technical report, Courant
Institute, New York, NY, 1985.
[21] M. Berry and A. Sameh. Parallel algorithms for the singular value and dense symmetric
eigenvalue problems. J. Comput. and Appl. Math., 27:191-213, 1989.
[22] Allan J. Beveridge. A general atomic and molecular electronic structure system.
available as: http://gserv1.dl.ac.uk/CFS/gamess_4.html.
[23] J. Bilmes, K. Asanovic, J. Demmel, D. Lam, and C.-W. Chin. Optimizing matrix multiply
using PHiPAC: a portable, high-performance, ANSI C coding methodology. Computer
Science Dept. Technical Report CS-96-326, University of Tennessee, Knoxville,
May 1996. LAPACK Working Note #111 http://www.netlib.org/lapack/lawns/
lawn111.ps.
[24] C. Bischof and X. Sun. A framework for symmetric band reduction and tridiagonalization.
Technical report, Supercomputing Research Center, 1991. (Prism Working
Note #3 ftp://ftp.super.org/pub/prism/wn3.ps).
[25] C. Bischof, X. Sun, and B. Lang. Parallel tridiagonalization through two-step band
reduction. In Scalable High-Performance Computing Conference. IEEE Computer
Society Press, May 1994. (Also Prism Working Note #17 ftp://ftp.super.org/
pub/prism/wn17.ps).
[26] C. Bischof, X. Sun, A. Tsao, and T. Turnbull. A study of the invariant subspace
decomposition algorithm for banded symmetric matrices. In Proceedings of the Fifth
SIAM Conference on Applied Linear Algebra. IEEE Computer Society Press, June
1994. (Also Prism Working Note #16 ftp://ftp.super.org/pub/prism/wn16.ps).
[27] C. Bischof and C. Van Loan. The WY representation for products of Householder
matrices. SIAM J. Sci. Statist. Comput., 8:s2-s13, 1987.
[28] Christian Bischof, William George, Steven Huss-Lederman, Xiaobai Sun, Anna Tsao,
and Thomas Turnbull. Prism software, 1997. http://www.mcs.anl.gov/Projects/
PRISM/lib/software.html.
[29] Christian Bischof, William George, Steven Huss-Lederman, Xiaobai Sun, Anna Tsao,
and Thomas Turnbull. SYISDA User's Guide, version 2.0 edition, 1995. ftp://ftp.
super.org/pub/prism/UsersGuide.ps.
[30] R. H. Bisseling and J. G. G. van de Vorst. Parallel LU decomposition on a transputer
network. In G. A. van Zee and J. G. G. van de Vorst, editors, Lecture Notes in
Computer Science, Number 384, pages 61-77. Springer-Verlag, 1989.
[31] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra,
S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley.
ScaLAPACK Users' Guide. SIAM, Philadelphia, 1997. http://www.netlib.org/
scalapack/slug/scalapack_slug.html.
[32] Jerry Bolen, Arlin Davis, Bill Dazey, Satya Gupta, Greg Henry, David Robboy, Guy
Shiffer, David Scott, Mark Stallcup, Amir Taraghi, Stephen Wheat, LeeAnn Fisk,
Gabi Istrail, Chu Jong, Rolf Riesen, and Lance Shuler. Massively parallel distributed
computing: World's first 281 gigaflop supercomputer. In Intel Supercomputer User's
Group, 1995.
[33] R. P. Brent. Algorithms for minimization without derivatives. Prentice-Hall, 1973.
[34] K. Bromley and J. Speiser. Signal processing algorithms, architectures and applica-
tions. In Proceedings SPIE 27th Annual International Technical Symposium, 1983.
Tutorial 31.
[35] S. Carr and R. Lehoucq. Compiler blockability of dense matrix factorizations.
ACM TOMS, 1997. Also available as: ftp://info.mcs.anl.gov/pub/tech_reports/
lehoucq/block.ps.Z.
[36] S. Chakrabarti, J. Demmel, and K. Yelick. Modeling the benefits of mixed data and
task parallelism. In Symposium on Parallel Algorithms and Architectures (SPAA),
July 1995. http://HTTP.CS.Berkeley.EDU/~yelick/soumen/mixed-spaa95.ps.
[37] H. Chang, S. Utku, M. Salama, and D. Rapp. A parallel Householder tridiagonaliza-
tion strategem using scattered row decomposition. I. J. Num. Meth. Eng., 26:857{874,
1988.
[38] H.Y. Chang and M.Salama. A parallel Householder tridiagonalization stratagem using
scattered square decomposition. Parallel Computing, 6:297{312, 1988.
[39] S. Chinchalkar. Computing eigenvalues and eigenvectors of a dense real symmetric
matrix on the ncube 6400. Technical Report CTC91TR74, Advanced Computing
research Institute, June 1991.
[40] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley,
D. Walker, and R. C. Whaley. ScaLAPACK: A portable linear algebra library for
distributed memory computers - Design issues and performance. Computer Science
Dept. Technical Report CS-95-283, University of Tennessee, Knoxville, March 1995.
LAPACK Working Note #95 http://www.netlib.org/lapack/lawns/lawn95.ps.
[41] J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker, and R. C. Whaley. A
proposal for a set of parallel basic linear algebra subprograms. Computer Science Dept.
Technical Report CS-95-292, University of Tennessee, Knoxville, May 1995. LAPACK
Working Note #100 http://www.netlib.org/lapack/lawns/lawn100.ps.
[42] J. Choi, J. Dongarra, R. Pozo, and D. Walker. ScaLAPACK: A scalable linear algebra
library for distributed memory concurrent computers. In Proceedings of the Fourth
Symposium on the Frontiers of Massively Parallel Computation, pages 120{127. IEEE
Computer Society Press, 1992. LAPACK Working Note #55 http://www.netlib.
org/lapack/lawns/lawn55.ps.
[43] C-C Chou, Y. Deng, G. Li, and Y. Wang. Parallelizing Strassen's method for matrix
multiplication on distributed-memory MIMD architectures. International Journal of
Computers and Mathematics with Applications, 30:45-69, 1995.
[44] Almadena Chtchelkanova, John Gunnels, Greg Morrow, James Overfelt, and
Robert A. van de Geijn. Parallel implementation of BLAS: General techniques for Level
3 BLAS. Technical Report TR-95-40, Department of Computer Sciences, University of
Texas, October 1995. PLAPACK Working Note #4, to appear in Concurrency: Prac-
tice and Experience. http://www.cs.utexas.edu/users/plapack/plawns.html.
[45] M. Chu. A note on the homotopy method for linear algebraic eigenvalue problems.
Lin. Alg. Appl., 105:225-236, 1988.
[46] John M. Conroy and Louis J. Podrazik. A parallel inertia method for finding eigenvalues
on vector and SIMD architectures. SIAM Journal on Statistical Computing,
16:500-505, March 1995.
[47] F. J. Corbato. On the coding of Jacobi's method for computing eigenvalues and
eigenvectors of real symmetric matrices. Journal of the ACM, 10(2):123-125, 1963.
[48] S. Crivelli and E. R. Jessup. The cost of eigenvalue computation on distributed memory
MIMD multiprocessors. Parallel Computing, 21:401-422, 1995.
[49] J. Cullum and R. A. Willoughby. Lanczos algorithms for large symmetric eigenvalue
computations. Birkhäuser, Basel, 1985. Vol. 1, Theory; Vol. 2, Programs.
[50] J.J.M. Cuppen. A divide and conquer method for the symmetric tridiagonal eigen-
problem. Numer. Math., 36:177-195, 1981.
[51] M. Dayde, I. Duff, and A. Petitet. A Parallel Block Implementation of Level 3
BLAS for MIMD Vector Processors. ACM Transactions on Mathematical Software,
20(2):178-193, 1994.
[52] E. D'Azevedo. personal communication, 1997. http://www.epm.ornl.gov/
~efdazedo/.
[53] J. Demmel. CS 267 Course Notes: Applications of Parallel Processing. Computer
Science Division, University of California, 1991. 130 pages.
[54] J. Demmel, I. Dhillon, and H. Ren. On the correctness of some bisection-like parallel
eigenvalue algorithms in floating point arithmetic. Electronic Trans. Num. Anal.,
3:116-140, December 1995. LAPACK Working Note #70.
[55] J. Demmel, J. J. Dongarra, S. Hammarling, S. Ostrouchov, and K. Stanley. The
dangers of heterogeneous network computing: Heterogeneous networks considered harmful.
In Proceedings Heterogeneous Computing Workshop '96, pages 64-71. IEEE Computer
Society Press, 1996.
[56] J. Demmel, M. Gu, S. Eisenstat, I. Slapnicar, K. Veselic, and Z. Drmac. Computing
the singular value decomposition with high relative accuracy. Computer Science Dept.
Technical Report CS-97-348, University of Tennessee, Feb 1997. LAPACK Working
Note #119 http://www.netlib.org/lapack/lawns/lawn119.ps.
[57] J. Demmel and K. Stanley. The performance of finding eigenvalues and eigenvectors
of dense symmetric matrices on distributed memory computers. In Proceedings of the
Seventh SIAM Conference on Parallel Processing for Scientific Computing. SIAM,
1994.
[58] J. Demmel and K. Veselić. Jacobi's method is more accurate than QR. SIAM J. Mat.
Anal. Appl., 13(4):1204-1246, 1992. (also LAPACK Working Note #15).
[59] Inderjit Dhillon. A New O(n^2) Algorithm for the Symmetric Tridiagonal Eigenvalue/Eigenvector
Problem. PhD thesis, University of California at Berkeley, 1997.
[60] Stéphane Domas and Françoise Tisseur. Parallel implementation of a symmetric eigensolver
based on the Yau and Lu method. In International Journal of Supercomputer
Applications (proceedings of Environments and Tools For Parallel Scientific Computing
III, Faverges de la Tour, France, 21-23 August), 1996.
[61] J. Dongarra, J. Bunch, C. Moler, and G. W. Stewart. LINPACK User's Guide. SIAM,
Philadelphia, PA, 1979.
[62] J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling. A set of Level 3 Basic Linear
Algebra Subprograms. ACM Trans. Math. Soft., 16(1):1-17, March 1990.
[63] J. Dongarra, J. Du Croz, S. Hammarling, and Richard J. Hanson. An Extended Set of
FORTRAN Basic Linear Algebra Subroutines. ACM Trans. Math. Soft., 14(1):1-17,
March 1988.
[64] J. Dongarra, S. Hammarling, and D. Sorensen. Block reduction of matrices to con-
densed forms for eigenvalue computations. J. Comput. Appl. Math., 27:215-227, 1989.
LAPACK Working Note #2 http://www.netlib.org/lapack/lawns/lawn2.ps.
[65] J. Dongarra, R. Hempel, A. Hay, and D. Walker. A proposal for a user-level message
passing interface in a distributed memory environment. Technical Report ORNL/TM-
12231, Oak Ridge National Laboratory, Oak Ridge, TN, February 1993.
[66] J. Dongarra and D. Sorensen. A fully parallel algorithm for the symmetric eigenprob-
lem. SIAM J. Sci. Stat. Comput., 8(2):139-154, March 1987.
[67] J. Dongarra and R. van de Geijn. Reduction to condensed form for the eigenvalue
problem on distributed memory computers. Computer Science Dept. Technical Report
CS-91-130, University of Tennessee, Knoxville, 1991. LAPACK Working Note #30
http://www.netlib.org/lapack/lawns/lawn30.ps. Also in Parallel Computing.
[68] J. Dongarra, R. van de Geijn, and D. Walker. A look at scalable dense linear alge-
bra libraries. In Scalable High-Performance Computing Conference. IEEE Computer
Society Press, April 1992.
[69] J. Dongarra and R. C. Whaley. A user's guide to the BLACS v1.1. Technical report,
University of Tennessee, Knoxville, March 1995. LAPACK Working Note #94 http:
//www.netlib.org/lapack/lawns/lawn94.ps.
[70] C. C. Douglas, M. Heroux, G. Slishman, and R. M. Smith. GEMMW: A portable Level
3 BLAS Winograd variant of Strassen's matrix-matrix multiply algorithm. Journal of
Computational Physics, 110:1-10, 1994.
[71] Zlatko Drmač and Krešimir Veselić. Iterative refinement of the symmetric
eigensolution. Technical report, University of Colorado at Boulder, 1997.
[72] P.J. Eberlein and M. Mantharam. Jacobi sets for the eigenproblem and their effect
on convergence studied by graphic representations. Technical report, SUNY Buffalo,
1990.
[73] P.J. Eberlein and M. Mantharam. New Jacobi sets for parallel computations. Parallel
Computing, 19:437-454, 1993.
[74] G. Fann and R. Littlefield. Performance of a fully parallel dense real symmetric
eigensolver in quantum chemistry applications. In Proceedings of the Sixth SIAM
Conference on Parallel Processing for Scientific Computation. SIAM, 1994.
[75] G. Fann and R. J. Littlefield. A parallel algorithm for Householder tridiagonalization.
In Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific
Computing, pages 409-413. SIAM, 1993.
[76] R. Fellers. Performance of pdsyev, ... Mathematics Dept. Master's thesis, available
by anonymous ftp to http://cs-tr.CS.Berkeley.EDU/NCSTRL/, University of
California, 1997.
[77] V. Fernando, B. Parlett, and I. Dhillon. A way to find the most redundant equation
in a tridiagonal system. Berkeley Mathematics Dept. Preprint, 1995.
[78] UTK Joint Institute for Computational Science, 1997. http://www-jics.cs.utk.
edu/SP2/sp2_config.html.
[79] J.G.F. Francis. The QR transformation: A unitary analogue to the LR transformation,
parts I and II. The Computer Journal, 4:265-272, 332-345, 1961.
[80] K. Gates. A rank-two divide and conquer method for the symmetric tridiagonal
eigenproblem. In Frontier's 92, McLean, Virginia, 1992.
[81] Kevin Gates. Using inverse iteration to improve the divide and conquer algorithm.
Technical Report 159, Swiss Institute of Technology, 1991.
[82] Kevin Gates and Peter Arbenz. Parallel divide and conquer algorithms for the sym-
metric tridiagonal eigenproblem. Technical Report 222, Swiss Institute of Technology,
1995. ftp://ftp.inf.ethz.ch:/pub/publications/tech-reports/2xx/222.ps.
[83] W. Givens. Numerical computation of the characteristic values of a real matrix.
Technical Report 1574, Oak Ridge National Laboratory, 1954.
[84] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins
University Press, Baltimore, MD, 1983.
[85] J. Götze. On the parallel implementation of Jacobi and Kogbetliantz algorithms.
SIAM J. on Sci. Comput., pages 1331-1348, 1994. http://www.nws.e-technik.
tu-muenchen.de/~jugo/pub/SIAMjac.ps.Z.
[86] A. Greenbaum and J. Dongarra. Experiments with QL/QR methods for the sym-
metric tridiagonal eigenproblem. Computer Science Dept. Technical Report CS-
89-92, University of Tennessee, Knoxville, 1989. LAPACK Working Note #17
http://www.netlib.org/lapack/lawns/lawn17.ps.
[87] Numerical Algorithms Group, 1997. http://www.nag.co.uk/numeric.html.
[88] M. Gu and S. Eisenstat. A stable algorithm for the rank-1 modification of the symmetric
eigenproblem. Computer Science Dept. Report YALEU/DCS/RR-916, Yale
University, September 1992.
[89] M. Gu and S. C. Eisenstat. A divide-and-conquer algorithm for the symmetric tridi-
agonal eigenproblem. SIAM J. Mat. Anal. Appl., 16(1):172{191, January 1995.
[90] M. Hegland, M. H. Kahn, and Osborne M. R. A parallel algorithm for the reduction to
tridiagonal form for eigendecomposition. Technical Report TR-CS-96-06, Australian
National University, 1996. http://cs.anu.edu.au/techreports/1996/index.html.
[91] B. Hendrickson, E. Jessup, and C. Smith. A parallel eigensolver for dense symmetric
matrices. Technical Report SAND96{0822, Sandia National Labs, Albuquerque, NM,
March 1996. Submitted to SIAM J. Sci. Comput.
[92] G. Henry. personal communication, 1997. http://www.cs.utk.edu/~ghenry/.
[93] Greg Henry. Improving Data Re-Use in Eigenvalue-Related Computations. PhD thesis,
Cornell University, 1994.
[94] High Performance Fortran Forum. High Performance Fortran language speci�cation
version 1.0. Draft, January 1993. Also available as technical report CRPC-TR 92225,
Center for Research on Parallel Computation, Rice University.
[95] Y. Huo and R. Schreiber. E�cient, massively parallel eigenvalue computations.
preprint, 1993.
161
[96] S. Huss-Lederman, Jacobson E.M., J. R. Johnson, Tsao A., and T. Turnbull.
\strassen's algorithm for matrix multiplication: Modeling, analysis, and implementa-
tion". Technical report, Center for Computing Sciences, 1996. (Also Prism Working
Note #34 ftp://ftp.super.org/pub/prism/wn34.ps).
[97] S. Huss-Lederman, A. Tsao, and G. Zhang. A parallel implementation of the invariant
subspace decomposition algorithm for dense symmetric matrices. In Proceedings of the
Sixth SIAM Conference on Parallel Processing for Scientific Computing, March 1993.
(Also PRISM Working Note #9, ftp://ftp.super.org/pub/prism/wn9.ps).
[98] IBM, Kingston, NY. Engineering and Scientific Subroutine Library - Guide and
Reference, release 3 edition, 1988. Order No. SC23-0184.
[99] I. Ipsen and E. Jessup. Solving the symmetric tridiagonal eigenvalue problem on the
hypercube. SIAM J. Sci. Stat. Comput., 11(2):203-230, 1990.
[100] C.G.J. Jacobi. Über ein leichtes Verfahren die in der Theorie der Säcularstörungen
vorkommenden Gleichungen numerisch aufzulösen. Crelle's Journal, 30:51-94, 1846.
[101] Jarek Nieplocha, Pacific Northwest Laboratories, 1996.
http://www.emsl.pnl.gov:2080/docs/global/ga.html.
[102] E. Jessup and I. Ipsen. Improving the accuracy of inverse iteration. SIAM J. Sci.
Stat. Comput., 13(2):550-572, 1992.
[103] B. Kågström, P. Ling, and C. Van Loan. GEMM-Based Level 3 BLAS: High-
Performance Model Implementations and Performance Evaluation Benchmark. Re-
port UMINF-95.18, Department of Computing Science, Umeå University, S-901 87
Umeå, Sweden, 1995. To appear in ACM Trans. Math. Software. LAPACK Working
Note #107, http://www.netlib.org/lapack/lawns/lawn107.ps.
[104] B. Kågström, P. Ling, and C. Van Loan. GEMM-Based Level 3 BLAS: Portability
and Optimization Issues. Technical report, Department of Computing Science, Umeå
University, 1997. To appear in ACM Trans. Math. Software.
[105] W. Kahan. Accurate eigenvalues of a symmetric tridiagonal matrix. Computer Science
Dept. Technical Report CS41, Stanford University, Stanford, CA, July 1966 (revised
June 1968).
[106] R.K. Kamilla, X.G. Wu, and J.K. Jain. Composite fermion theory of collective exci-
tations in fractional quantum Hall effect. Physical Review Letters, 1996.
[107] R.M. Karp, A. Sahay, E. Santos, and K.E. Schauser. Optimal broadcast and summa-
tion in the LogP model. In Proc. 5th ACM Symposium on Parallel Algorithms and
Architectures, pages 142-153, 1993.
[108] L. Kaufman. Banded eigenvalue solvers on vector machines. ACM Trans. Math. Soft.,
10:73-86, 1984.
[109] L. Kaufman. A parallel QR algorithm for the symmetric tridiagonal eigenvalue problem.
Journal of Parallel and Distributed Computing, 23:429-434, 1994.
[110] C. Koelbel, D. Loveman, R. Schreiber, G. Steele, and M. Zosel. The High Performance
Fortran Handbook. MIT Press, Cambridge, 1994.
[111] A. S. Krishnakumar and M. Morf. Eigenvalues of a symmetric tridiagonal matrix: A
divide and conquer approach. Numer. Math., 48:349-368, 1986.
[112] Krystian Pracz, Martin Janssen, and Peter Freche. Correlation of eigenstates in the
critical regime of quantum Hall systems. J. Phys. Condens. Matter, 8:7147-7159, 1996.
Also available as http://xxx.lanl.gov/abs/cond-mat/9605012.
[113] D. Kuck and A. Sameh. A parallel QR algorithm for symmetric tridiagonal matrices.
IEEE Trans. Computers, C-26(2), 1977.
[114] J.R. Kuttler and V.G. Sigillito. Eigenvalues of the Laplacian in two dimensions. SIAM
Review, 26:163-193, 1984.
[115] M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations
of blocked algorithms. In Proceedings of the Fourth International Conference on
Architectural Support for Programming Languages and Operating Systems, pages
63-74, April 1991.
[116] B. Lang. A parallel algorithm for reducing symmetric banded matrices to tridiagonal
form. SIAM J. Sci. Comput., 14(6), November 1993.
[117] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh. Basic Linear Algebra Subprograms
for Fortran usage. ACM Trans. Math. Soft., 5:308-323, 1979.
[118] Thomas J. LeBlanc and Evangelos P. Markatos. Shared memory vs. message pass-
ing in shared-memory multiprocessors. In 4th Symp. on Parallel and Distributed
Processing, 1992. ftp://ftp.cs.rochester.edu/pub/papers/systems/92.ICPP.
locality_vs_load_balancing.ps.Z.
[119] R. B. Lehoucq. ARPACK software. http://www.mcs.anl.gov/home/lehoucq/
software.html.
[120] K. Li and T.-Y. Li. An algorithm for symmetric tridiagonal eigenproblems - divide
and conquer with homotopy continuation. SIAM J. Sci. Comp., 14(3), May 1993.
[121] Rencang Li and Huan Ren. An efficient tridiagonal eigenvalue solver on CM 5 with
Laguerre's iteration. Computer Science Division Report CSD-94-848, University of
California, 1994. http://sunsite.berkeley.edu/Dienst/UI/2.0/Describe/ncstrl.ucb%
2fCSD-94-848.
[122] T.-Y. Li and Z. Zeng. Laguerre's iteration in solving the symmetric tridiagonal eigen-
problem - a revisit. Michigan State University preprint, 1992.
[123] T.-Y. Li, H. Zhang, and X. H. Sun. Parallel homotopy algorithm for symmetric
tridiagonal eigenvalue problems. SIAM J. Sci. Stat. Comput., 12:464-485, 1991.
[124] W. Lichtenstein and S. L. Johnsson. Block cyclic dense linear algebra. SIAM J. Sci.
Comp., 14(6), November 1993.
[125] R.J. Littlefield and K. J. Maschhoff. Investigating the performance of parallel eigen-
solvers for large processor counts. Theoretica Chimica Acta, 84:457-473, 1993.
[126] S.-S. Lo, B. Phillipe, and A. Sameh. A multiprocessor algorithm for the symmetric
eigenproblem. SIAM J. Sci. Stat. Comput., 8(2):155-165, March 1987.
[127] Mi Lu and Xiangzhen Qiao. Applying parallel computer systems to solve symmetric
tridiagonal eigenvalue problems. Parallel Computing, 18:1301-1315, 1992.
[128] S. C. Ma, M. Patrick, and D. Szyld. A parallel, hybrid algorithm for the generalized
eigenproblem. In Garry Rodrigue, editor, Parallel Processing for Scientific Comput-
ing, chapter 16, pages 82-86. SIAM, 1989.
[129] R. S. Martin, C. Reinsch, and J. H. Wilkinson. Householder's tridiagonalization of a
symmetric matrix. Numerische Mathematik, 11:181-195, 1968.
[130] K. Maschhoff. PARPACK software. http://www.caam.rice.edu/~kristyn/parpack_
home.html.
[131] R. Mathias. The instability of parallel prefix matrix multiplication. SIAM J. Sci.
Stat. Comput., 16(4):956-973, July 1995.
[132] Gary Oas. Universal cubic eigenvalue repulsion for random normal matrices. Physical
Review E, 1996. Also available as http://xxx.lanl.gov/abs/cond-mat/9610073.
[133] David C. O'Neal and Raghurama Reddy. Solving symmetric eigenvalue problems on
distributed memory machines. In Proceedings of the Cray User's Group, pages 76-96.
Cray Inc., 1994.
[134] B. Parlett. The Symmetric Eigenvalue Problem. Prentice Hall, Englewood Cliffs, NJ,
1980.
[135] B. Parlett. Acta Numerica, chapter The new qd algorithms, pages 459-491. Cambridge
University Press, 1995.
[136] B. Parlett. The construction of orthogonal eigenvectors for tight clusters by use
of submatrices. Center for Pure and Applied Mathematics Report PAM-664, University
of California, Berkeley, CA, January 1996. Submitted to SIMAX.
[137] B. Parlett. Personal communication, 1997.
[138] B. N. Parlett. Laguerre's method applied to the matrix eigenvalue problem. Mathe-
matics of Computation, 18:464-485, 1964.
[139] B.N. Parlett and I.S. Dhillon. On Fernando's method to find the most redundant
equation in a tridiagonal system. Linear Algebra and its Applications, 267:247-279,
November 1997.
[140] Antoine Petitet. Algorithmic Redistribution Methods for Block Cyclic Decompositions.
PhD thesis, University of Tennessee, 1996.
[141] C. P. Potter. A parallel divide and conquer eigensolver. http://sawww.epfl.ch/
SIC/SA/publications/SCR95/7-95-27a.html.
[142] M. Pourzandi and B. Tourancheau. A parallel performance study of Jacobi-like eigen-
value solution. http://www.netlib.org/tennessee/ut-cs-94-226.ps.
[143] Earl Prohofsky. Statistical Mechanics and Stability of Macromolecules. Cambridge
University Press, 1995.
[144] B. Putnam, E. W. Prohofsky, K. C. Lu, and L. L. Van Zandt. Breathing modes and
induced resonant melting of the double helix. Physics Letters, 70A, 1979.
[145] C. Reinsch. A stable rational QR algorithm for the computation of the eigenvalues of
an Hermitian, tridiagonal matrix. Num. Math., 25:591-597, 1971.
[146] H. Ren. On error analysis and implementation of some eigenvalue and singular value
algorithms. PhD thesis, University of California at Berkeley, 1996.
[147] J. Rutter. A serial implementation of Cuppen's divide and conquer algorithm
for the symmetric eigenvalue problem. Mathematics Dept. Master's Thesis,
University of California, 1991. http://sunsite.berkeley.edu/Dienst/UI/2.0/Describe/
ncstrl.ucb%2fCSD-94-799.
[148] R. Saavedra, W. Mao, D. Park, J. Chame, and S. Moon. The combined effectiveness
of unimodular transformations, tiling, and software prefetching. In Proceedings of
the 10th International Parallel Processing Symposium. IEEE Computer Society, April
15-19, 1996.
[149] V. Sarkar. Automatic selection of high order transformations in the IBM ASTI
Optimizer. IBM Software Solutions Division Report, 1996.
[150] R. Schreiber. Solving eigenvalue and singular value problems on an undersized systolic
array. SIAM J. Sci. Stat. Comput., 7:441-451, 1986.
[151] D. Scott, M. Heath, and R. Ward. Parallel block Jacobi eigenvalue algorithms using
systolic arrays. Lin. Alg. & Appl., 77:345-355, 1986.
[152] G. Seifert, Th. Heine, O. Knospe, and R. Schmidt. Computer simulations for the
structure and dynamics of large molecules, clusters and solids. In Lecture Notes in
Computer Science, volume 1067, page 393. Springer-Verlag, 1996.
[153] B. T. Smith, J. M. Boyle, J. J. Dongarra, B. S. Garbow, Y. Ikebe, V. C. Klema, and
C. B. Moler. Matrix Eigensystem Routines - EISPACK Guide, volume 6 of Lecture
Notes in Computer Science. Springer-Verlag, Berlin, 1976.
[154] C. Smith, B. Hendrickson, and E. Jessup. A parallel algorithm for Householder tridiag-
onalization. In Proceedings of the Fifth SIAM Conference on Applied Linear Algebra,
pages 361-365. SIAM, 1994.
[155] D. Sorensen and P. Tang. On the orthogonality of eigenvectors computed by divide-
and-conquer techniques. SIAM J. Num. Anal., 28(6):1752-1775, 1991.
[156] J. Speiser and H. Whitehouse. Parallel processing algorithms and architectures for
real time processing. In Proceedings SPIE Real Time Signal Processing IV, 1981.
[157] V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354-
355, 1969.
[158] P. Strazdins. Matrix factorization using distributed panels on the Fujitsu AP1000.
In IEEE First International Conference on Algorithms And Architectures for Par-
allel Processing, Brisbane, April 1995. http://cs.anu.edu.au/people/Peter.
Strazdins/papers.html#DBLAS.
[159] P. Strazdins. A high performance, portable distributed BLAS implementation. In
Fifth Parallel Computing Workshop for the Fujitsu PCRF, Kawasaki, November 1996.
http://cs.anu.edu.au/people/Peter.Strazdins/papers.html#DBLAS.
[160] P. Strazdins. Personal communication, 1997. http://cs.anu.edu.au/people/
Peter.Strazdins.
[161] P. Strazdins. Reducing software overheads in parallel linear algebra libraries. Technical
report, Australian National University, 1997. Submitted to PART'97, The 4th Annual
Australasian Conference on Parallel And Real-Time Systems, 29-30 September 1997,
The University of Newcastle, Newcastle, Australia.
[162] P. Swarztrauber. A parallel algorithm for computing the eigenvalues of a symmetric
tridiagonal matrix. To appear in Math. Comp., 1993.
[163] D. Szyld. Criteria for combining inverse iteration and Rayleigh quotient iteration.
SIAM J. Num. Anal., 25(6):1369-1375, December 1988.
[164] Thinking Machines Corporation. CMSSL for CM Fortran: CM-5 Edition, version
3.1, 1993.
[165] S. Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM
Journal on Matrix Analysis and Applications, 18(4), 1997. http://theory.lcs.mit.
edu/~sivan/029774.ps.gz.
[166] Alessandro De Vita, Giulia Galli, Andrew Canning, and Roberto Car. A microscopic
model for surface-induced graphite-to-diamond transitions. Nature, 379, Feb 8, 1996.
[167] D. Watkins. Fundamentals of Matrix Computations. Wiley, 1991.
[168] R. Whaley. Automatically tunable linear algebra subroutines, 1997. http://www.
netlib.org/utk/projects/atlas.
[169] R. Clint Whaley. Basic linear algebra communication subroutines: Analysis and
implementation across multiple parallel architectures. Technical report, University of
Tennessee, Knoxville, June 1994. LAPACK Working Note #73, http://www.netlib.
org/lapack/lawns/lawn73.ps.
[170] J. H. Wilkinson. The Algebraic Eigenvalue Problem. Oxford University Press, Oxford,
1965.
[171] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. K.
Tjiang, Shih-Wei Liao, C. Tseng, Mary W. Hall, M. S. Lam, and J. L. Hennessy. SUIF:
An Infrastructure for Research on Parallelizing and Optimizing Compilers. HTML
from http://suif.stanford.edu/suif/suif-overview/suif.html.
[172] M. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.
[173] Y.-J. J. Wu, A. A. Alpatov, C. Bischof, and R. A. van de Geijn. A parallel implemen-
tation of symmetric band reduction using PLAPACK. In Scalable Parallel Library Con-
ference, 1996. (Also PRISM Working Note #35, ftp://ftp.super.org/pub/prism/
wn35.ps).
[174] Shing-Tung Yau and Ya Yan Lu. Reducing the symmetric matrix eigenvalue problem
to matrix multiplications. SIAM J. Sci. Comput., 14(1):121-136, January 1993.
[175] Yi-Shuen Mark Wu, Steven A. Cuccaro, Paul G. Hipes, and Aron Kuppermann.
Quantum-mechanical reactive scattering using a high-performance distributed-
memory parallel computer. Chem. Phys. Lett., 168:429-440, 1990.
Appendix A
Variables and abbreviations
Table A.1: Variable names and their uses

Name        Meaning
(a, b)      The processor in processor row a and processor column b.
A           The input matrix (partially reduced).
A(i, j)     The i, j element in the (partially reduced) matrix A.
c           The number of eigenvalues in the largest cluster of eigenvalues.
C           The set of all processor columns.
ca          The current processor column within the sub-grid.
cb          The current processor column sub-grid.
e           The number of eigenvalues required.
j           The current column, A(j:n, j:n) being the un-reduced portion of the matrix.
j'          The column within the current block column, j' = mod(j, nb).
lg(√p)      log2(√p).
m           The number of eigenvectors required.
mb          The row block size. Used only when we discuss rectangular blocks. In general, the row block size and column block size are assumed to be equal and are written as nb.
mullen      A compile-time parameter in the PBLAS which controls the panel size used in the PBLAS symmetric matrix-vector multiply routine, PDSYMV.
n           The size of the input matrix A.
nb          The blocking factor. In PDSYEVX the data layout and algorithmic blocking factors are the same. In HJS the data layout blocking factor is 1 and nb refers to the algorithmic blocking factor.
p           The number of processors used in the computation.
pbf         Panel blocking factor. The panel width used in DGEMV in PDSYEVX and DGEMM in PDSYEVX and HJS is pbf × nb.
pr          The number of processor rows in the process grid.
pr1         The number of processor rows in a sub-grid.
pr2         The number of processor sub-grid rows.
pc          The number of processor columns in the process grid.
pc1         The number of processor columns in a sub-grid.
pc2         The number of processor sub-grid columns.
R           The set of all processor rows.
ra          The current processor row within the sub-grid.
rb          The current processor row sub-grid.
spread      In a "spread across", every processor in the current processor column broadcasts to every other processor in the same processor row. In a "spread down", every processor in the current processor row broadcasts to every other processor in the same processor column.
tril(A, 0)  The lower triangular part, including the diagonal, of the un-reduced part of the input matrix A, i.e. A(j:n, j:n).
tril(A, -1) The lower triangular part, excluding the diagonal, of the un-reduced part of the input matrix A, i.e. A(j:n, j:n).
Table A.2: Variable names and their uses (continued)

Name                   Meaning
v                      The vector portion of the Householder reflector.
V                      The current column of Householder reflectors. Size: n - j + j' by j'.
V(j - j' : n, 1 : j')  The current column of Householder reflectors. Size: n - j + j' by j'.
vnb                    The imbalance in the 2D block-cyclic distribution of the eigenvector matrix.
w                      The companion update vector, i.e. the vector used in A = A - v w^T - w v^T to reduce A.
W                      The current column of companion update vectors. Size: n - j + j' by j'.
W(j - j' : n, 1 : j')  The current column of companion update vectors. Size: n - j + j' by j'.
Abbreviation  Meaning
CPU           Central Processing Unit
FPU           Floating Point Unit

Table A.3: Abbreviations
Symbol  Meaning                                                                  Terms included
α       The message initiation cost for BLACS send and receive.                  n lg(p); n
β       The inverse bandwidth cost for BLACS send and receive.                   n² lg(p)/√p; n²/√p; n nb lg(p)
α3      DGEMM (matrix-matrix multiply) subroutine overhead plus the time
        penalty associated with invoking DGEMM on small matrices.                n²/(nb² pbf); n/nb
γ3      Time required per DGEMM (matrix-matrix multiply) op.                     n³/p; n² nb/√p
α2      DGEMV (matrix-vector multiply) subroutine overhead plus the time
        penalty associated with invoking DGEMV on small matrices.                n
γ2      Time required per DGEMV (matrix-vector multiply) op.                     n³/p; n² nb/√p
δ       Time required per divide.                                                n²/p; n
√       Time required per square root.
γ1      Time required per BLAS1 (scalar-vector) op.                              n²/p; n
α1      Subroutine overhead for BLAS1 and similar codes.                         n²/√p
α4      Subroutine overhead for the PBLAS.                                       n

Table A.4: Model costs
Appendix B

Further details

B.1 Updating w during reduction to tridiagonal form
Line 4.1, w = w - W V^T v - V W^T v, in Figure 8.3 can be computed with minimal
communication, with minimal computation, or with an intermediate amount of both commu-
nication and computation. Indeed, Line 4.1 can be computed with O((n²/p + n² nb/p^r) γ2 +
n log(p^(r-0.5)) α) cost for various r ∈ [0.5, 1.0]. r = 1.0 corresponds to the minimal computa-
tion cost option (discussed in Section B.1.3) while r = 0.5 corresponds to the minimal (zero)
communication cost option (discussed in Section B.1.2). Section B.1.4 describes the inter-
mediate options in a generalized form which includes both the minimum communication
and minimum computation options as special cases.
The plethora of options for the update of w stems from the fact that the input ma-
trices W, V, W^T and V^T are replicated across the relevant processors while the input/output
vector v is stored as partial sums across the processor columns in each of the processor rows.
The input matrices are replicated because they will need to be replicated later to update
A. The vector v is stored as partial sums because that is how it is initially computed,
and because the combine operation used to compute v from the partial sums has not been
performed at this point.
Throughout this section we discuss only the computation of W V^T v; V W^T v can be
computed in a similar manner. Moreover, the two computations, and all associated communi-
cation, can be merged to reduce software overhead and message latency costs.
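The tradeoff can be made concrete by evaluating the modeled cost at the two extreme values of r. The following sketch (in Python rather than the thesis's Matlab, with illustrative machine parameters that are assumptions, not measurements) evaluates O((n²/p + n² nb/p^r) γ2 + n log(p^(r-0.5)) α) directly:

```python
import math

def update_cost(n, p, nb, r, gamma2, alpha):
    """Modeled cost of the Line 4.1 update for replication exponent r.

    r = 0.5 is the zero-added-communication option (the log term
    vanishes); r = 1.0 is the minimal-computation option."""
    comp = (n**2 / p + n**2 * nb / p**r) * gamma2
    comm = n * math.log2(p ** (r - 0.5)) * alpha   # zero when r = 0.5
    return comp + comm

# Illustrative parameters (assumptions, not measurements).
n, p, nb = 4000, 64, 32
gamma2, alpha = 0.02e-6, 66e-6

cost_r05 = update_cost(n, p, nb, 0.5, gamma2, alpha)  # min communication
cost_r10 = update_cost(n, p, nb, 1.0, gamma2, alpha)  # min computation
```

For these particular parameters the flop term dominates, so r = 1.0 is cheaper; with a much larger α the ranking reverses, which is the motivation for the intermediate options of Section B.1.4.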
B.1.1 Notation
In describing most parallel linear algebra codes, including all codes in this thesis
outside of this appendix, we need not explicitly state the processor on which a value is
stored. Ai;j is understood to live on the processor that owns row i and column j. The
nb0 element array tmp contains di�erent values on di�erent processors. Therefore, for the
discussion in this appendix, an additional subscript is added to tmp to indicate the processor
column. Furthermore, some entries in tmp are left unde�ned at various stages, therefore we
use j 2 fcag to indicate all columns j owned by processor column ca. i.e. tmpj2fcag;ca = val
means that 8j 2 fcag, tmpj on processor ca is assigned val . For extra clarity within a
display we write this as tmpj;caj2fcag
.
B.1.2 Updating w without added communication

Line 4.1, w = w - W V^T v - V W^T v, in Figure 8.3 can be computed without
any communication other than that needed to compute v without the update. It initially
appears that w = w - W · V^T v - V · W^T v requires communication, because computing
tmp = V^T v requires summing nb' values¹ within each processor column, and computing
w = w - W · tmp requires that tmp be broadcast within each processor column. However,
W · V^T v can be computed with a single sum within each processor row, and by delaying
the sum needed to compute w, one of them can be avoided completely. Figure B.1 derives
how W · V^T v can be computed with a single sum within each processor row.

Line 3  The transformation from Line 2 to Line 3 is the standard way that a matrix-vector
multiply is performed in parallel. The leftmost sum is the local portion; the middle
sum is the sum over all processors in the processor column.

Line 4  Delay the sum over all processors in the processor column until after multiplying
by W. The rightmost two sums involve only local values.

Figure B.2 shows how to compute W · V^T v without added communication.

Line 5  Local computation of V^T · v. Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  (2i/pr) nb' γ2  =  (1/2) (n² nb / pr) γ2

¹nb' is the number of columns in H.
Figure B.1: Avoiding communication in computing W · V^T v

    tmp = W · V^T v                                                 (Line 1)

    tmp_i = Σ_{1≤j≤nb'} W_{ij} Σ_{k∈{C}} V_{kj} v_k                 (Line 2)

    tmp_i = Σ_{1≤j≤nb'} W_{ij} Σ_{1≤R≤pr, k∈{C}} H_{kj} h_k         (Line 3)

    tmp_i = Σ_{C∈pc, 1≤j≤nb'} W_{ij} Σ_{k∈{C}} H_{kj} h_k           (Line 4)
Line 6  Local computation of W · tmp. Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  (2i/pr) nb' γ2  =  (1/2) (n² nb / pr) γ2

Line 7  Effect of summing res_i within each processor row. This operation is merged with
the unavoidable summation of w within each processor row, hence this operation is
not performed and has no cost.
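The algebra in Figures B.1 and B.2 rests on reordering two sums. The following pure-Python sketch (an illustration, not the ScaLAPACK code; the sizes and random data are arbitrary) mimics pc processor columns, each holding a slice of the rows of V and v, and checks that multiplying each local V^T v piece by W and summing afterwards matches summing first:

```python
import random

random.seed(0)
n, nbp, pc = 8, 3, 4      # rows, nb' columns of V and W, processor columns

W = [[random.random() for _ in range(nbp)] for _ in range(n)]
V = [[random.random() for _ in range(nbp)] for _ in range(n)]
v = [random.random() for _ in range(n)]

# Processor column C owns a disjoint (cyclic) set of the rows of V and v.
owned = [range(C, n, pc) for C in range(pc)]

def local_vtv(C):
    """tmp_C = V^T * v restricted to the rows owned by column C."""
    return [sum(V[k][j] * v[k] for k in owned[C]) for j in range(nbp)]

# Sum the partial tmp vectors first (needs a combine plus a broadcast),
# then multiply by W:
tmp = [sum(local_vtv(C)[j] for C in range(pc)) for j in range(nbp)]
res_sum_first = [sum(W[i][j] * tmp[j] for j in range(nbp)) for i in range(n)]

# Multiply each local piece by W and delay the sum (which then merges
# with the unavoidable row summation of w):
pieces = [local_vtv(C) for C in range(pc)]
res_sum_last = [sum(sum(W[i][j] * pieces[C][j] for j in range(nbp))
                    for C in range(pc)) for i in range(n)]

err = max(abs(a - b) for a, b in zip(res_sum_first, res_sum_last))
```

The two results agree to roundoff, which is exactly why the per-column combine can be deferred and folded into the later summation of w.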
B.1.3 Updating w with minimal computation cost

Figure B.3 shows how W · V^T v can be performed with only O(n²/√p + n² nb/p) com-
putation by distributing the computation of tmp = V^T · v and w = w + W · tmp over all
the processors. Each of the nb columns of V^T is assigned to one processor row, hence each
processor row is assigned nb/√p columns of V^T. Each processor row computes the portion of
V^T · v assigned to it, leaving the answer on the diagonal processor in this row. The diagonal
processors then broadcast the nb/√p elements of V^T · v which they own to all of the processors
within their processor column. Finally, each processor computes w = w + W · tmp for the
values of W and tmp which it owns.
Figure B.2: Computing W · V^T v without added communication

    tmp_{j,C} = Σ_{k∈{C}} V^T_{k,j} v_k                                    (Line 5)

    res_{i,C} (i∈{R}) = Σ_j W_{i,j} tmp_{j,C}                              (Line 6)
                      = Σ_j W_{i,j} Σ_{k∈{C}} V^T_{k,j} v_k

    Σ_C res_{i,C} (i∈{R}) = Σ_{j, 1≤C≤pc} W_{i,j} Σ_{k∈{C}} V^T_{k,j} v_k  (Line 7)
                          = Σ_j W_{i,j} Σ_k V^T_{k,j} v_k
Line 8  Local computation of V^T · v. Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  (2i/pr)(nb'/pc) γ2  =  (1/2) (n² nb / p) γ2

Line 9  Combine tmp_{j∈{R},C} within each processor column, leaving the answer on the di-
agonal processor. Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  log(pc) (α + (nb'/pc) β)  =  n log(pc) α + (1/2) (n nb / pc) log(pc) β

Line 10  Broadcast tmp_{j∈{C},C} within each processor row from the diagonal processor.
Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  log(pc) (α + (nb'/pc) β)  =  n log(pc) α + (1/2) (n nb / pc) log(pc) β

Line 11  Local computation of W · tmp. Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  (2i/pr)(nb'/pc) γ2  =  (1/2) (n² nb / p) γ2
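The closed forms above keep only the leading terms of the double sums. A quick numerical check (a Python sketch; n, nb, and the grid shape are arbitrary choices) confirms that Σ_{i=1,nb}^{n} Σ_{nb'=1}^{nb} (2i/pr)(nb'/pc) is close to (1/2) n² nb / (pr pc) once n >> nb:

```python
def local_flops(n, nb, pr, pc):
    """Exact double sum from the Line 8 / Line 11 operation counts:
    i steps over block columns, nb' over columns within a block."""
    total = 0.0
    for i in range(nb, n + 1, nb):
        for nbp in range(1, nb + 1):
            total += (2.0 * i / pr) * (nbp / pc)
    return total

n, nb, pr, pc = 2048, 32, 8, 8
exact = local_flops(n, nb, pr, pc)
leading = 0.5 * n**2 * nb / (pr * pc)
ratio = exact / leading        # approaches 1 as n/nb grows
```

For n = 2048 and nb = 32 the exact sum exceeds the leading term by under 5%, the lower-order terms the closed forms discard.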
Figure B.3: Computing W · V^T v with minimal computation

    tmp_{j,C} (j∈{R}) = Σ_{k∈{C}} V^T_{k,j} v_k                            (Line 8)

    ∀ R=C:
    tmp_{j,C} (j∈{C}) = Σ_{1≤cl≤pc, k∈{cl}} V^T_{k,j} v_k                  (Line 9)
                      = Σ_k V^T_{k,j} v_k

    tmp_{j,C} (j∈{C}) = Σ_{k∈{C}} V^T_{k,j} v_k                            (Line 10)

    res_{i,C} (i∈{R}) = Σ_{j∈{C}} W_{i,j} tmp_{j,C}                        (Line 11)
                      = Σ_{j∈{C}} W_{i,j} Σ_{k∈{C}} V^T_{k,j} v_k

    Σ_C res_{i,C} (i∈{R}) = Σ_{1≤C≤pc, j∈{C}} W_{i,j} Σ_{k∈{C}} V^T_{k,j} v_k   (Line 12)
                          = Σ_j W_{i,j} Σ_k V^T_{k,j} v_k

Line 12  Effect of summing res_i within each processor row. This operation is merged with
the unavoidable summation of w within each processor row, hence this operation is
not performed and has no cost.
The update of w in HJS requires similar communication and computation costs,
although the patterns of communication are quite different. HJS uses recursive halving to
spread the result of tmp = V^T v, computes W · tmp on all processors, and uses recursive
doubling to compute w while simultaneously spreading it to all processor columns. Although
the BLACS do not offer recursive halving and recursive doubling operations, we could build
them out of BLACS sends and receives, but that incurs higher latency costs.
B.1.4 Updating w with minimal total cost

Line 4.1, w = w - W V^T v - V W^T v, in Figure 8.3 can be computed with
O((n² nb / p^r) γ2 + n log(p^(r-0.5)) α) cost for any r ≥ 0.5. On a high latency machine, one can
reduce the total number of messages by increasing the load imbalance. On a low latency
machine, one can reduce the load imbalance by using more messages. The two options de-
scribed in the preceding sections are special cases of the general class of methods described
in this section: Section B.1.2 corresponds to r = 0.5, and Section B.1.3 corresponds to r = 1.0.
This method has not been implemented and hence has not been proven to result
in decreased execution times in practice.
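Even unimplemented, the model shows how the best r shifts with the machine. This Python sketch (with made-up parameter values; only the α/γ2 ratio matters) minimizes n² nb/p^r γ2 + n log(p^(r-0.5)) α over a grid of r values:

```python
import math

def modeled_cost(n, p, nb, r, gamma2, alpha):
    # n^2*nb/p^r flop term shrinks with r; the message term grows with r.
    return (n**2 * nb / p**r) * gamma2 + n * math.log2(p ** (r - 0.5)) * alpha

def best_r(n, p, nb, gamma2, alpha, steps=50):
    candidates = [0.5 + 0.5 * k / steps for k in range(steps + 1)]
    return min(candidates, key=lambda r: modeled_cost(n, p, nb, r, gamma2, alpha))

# Made-up parameters (assumptions for illustration only).
n, p, nb, gamma2 = 4000, 64, 32, 0.02e-6
r_fast_network = best_r(n, p, nb, gamma2, alpha=1e-6)   # low latency
r_slow_network = best_r(n, p, nb, gamma2, alpha=1e-2)   # high latency
```

A low-latency machine tolerates the extra log(p^(r-0.5)) messages and drives r toward 1.0 (minimal computation); a high-latency machine pushes r back toward 0.5 (no added communication), exactly the tradeoff described above.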
Methods corresponding to 0.5 < r < 1.0 require what amounts to a four dimen-
sional processor grid. The pr × pc processor grid is divided into pr2 × pc2 sub-grids, with each
sub-grid consisting of pr1 × pc1 processors. We restrict our attention to square processor
grids and square processor sub-grids, hence pr = pc, pr1 = pc1 and pr2 = pc2. Each processor
column is identified by a pair of numbers (ca, cb) such that 1 ≤ ca ≤ pc1 and 1 ≤ cb ≤ pc2. Like-
wise, each processor row is identified by a pair of numbers (ra, rb) such that 1 ≤ ra ≤ pr1 and
1 ≤ rb ≤ pr2. No modifications are needed to the BLACS to support this method, because
each processor belongs to only two 2-dimensional processor grids: the normal two dimen-
sional data layout and a two dimensional data layout containing only those processors in
the same processor sub-grid, i.e. with the same rb and cb.
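One plausible way to realize the four-index addressing (an assumption about the mapping; the text does not pin down whether sub-grids are formed from contiguous or cyclic slices of the grid) is to carve the grid into contiguous pr1 × pc1 blocks:

```python
def subgrid_coords(row, col, pr1, pc1):
    """Map a 0-based (row, col) grid position to 1-based 4D coordinates
    (ra, rb, ca, cb): position within its sub-grid (ra, ca) and which
    sub-grid it lies in (rb, cb), assuming contiguous sub-grid blocks."""
    ra, rb = row % pr1 + 1, row // pr1 + 1
    ca, cb = col % pc1 + 1, col // pc1 + 1
    return ra, rb, ca, cb

# A 4x4 grid split into 2x2 sub-grids of 2x2 processors each:
coords = {(r, c): subgrid_coords(r, c, 2, 2) for r in range(4) for c in range(4)}
```

Each processor then lives in exactly two BLACS grids: the full pr × pc grid, and the pr1 × pc1 grid of processors sharing its (rb, cb).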
Figure B.4 shows the general method for updating w using a 4 dimensional data
layout. The nb' elements of tmp are distributed over the pr1 processor rows and columns
within each processor block, such that each processor row and column owns roughly nb'/pr1
elements of tmp.
Figure B.4: Computing W · V^T v on a four dimensional processor grid

    tmp_{j,(ca,cb)} (j∈{ra}) = Σ_{k∈{(ca,cb)}} V^T_{k,j} v_k                   (Line 13)

    ∀ ra=ca:
    tmp_{j,(ca,cb)} (j∈{ca}) = Σ_{1≤cl≤pc1, k∈{(cl,cb)}} V^T_{k,j} v_k         (Line 14)
                             = Σ_{k∈{(·,cb)}} V^T_{k,j} v_k

    tmp_{j,(ca,cb)} (j∈{ca}) = Σ_{k∈{(ca,cb)}} V^T_{k,j} v_k                   (Line 15)

    res_{i,(ca,cb)} (i∈{(ra,rb)}) = Σ_{j∈{ca}} W_{i,j} tmp_{j,(ca,cb)}         (Line 16)
                                  = Σ_{j∈{ca}} W_{i,j} Σ_{k∈{(·,cb)}} V^T_{k,j} v_k

    Σ_{(ca,cb)} res_{i,(ca,cb)} (i∈{(ra,rb)})
        = Σ_{1≤ca≤pc1, 1≤cb≤pc2, j∈{ca}} W_{i,j} Σ_{k∈{(·,cb)}} V^T_{k,j} v_k  (Line 17)
        = Σ_{1≤ca≤pc1, j∈{ca}} W_{i,j} Σ_{1≤cb≤pc2, k∈{(·,cb)}} V^T_{k,j} v_k
        = Σ_j W_{i,j} Σ_k V^T_{k,j} v_k
B.1.5 Notes to Figure B.4

Line 13  Local computation of V^T · v. Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  (2i/pr)(nb'/pc1) γ2  =  (1/2) (n² nb / (pr pc1)) γ2

Line 14  Combine tmp_{j∈{ra},(ca,cb)} within each processor sub-grid column, leaving the an-
swer on the diagonal processor (i.e. ra = ca) within each sub-grid. Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  log(pc1) (α + (nb'/pc1) β)  =  n log(pc1) α + (1/2) (n nb / pc1) log(pc1) β

Line 15  Broadcast tmp_{j∈{ra},(ca,cb)} within each processor sub-grid row from the diagonal
processor in that sub-grid row. Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  log(pc1) (α + (nb'/pc1) β)  =  n log(pc1) α + (1/2) (n nb / pc1) log(pc1) β

Line 16  Local computation of W · tmp. Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  (2i/pr)(nb'/pc1) γ2  =  (1/2) (n² nb / (pr pc1)) γ2

Line 17  Effect of summing res_i within each processor row. This operation is merged with
the unavoidable summation of w within each processor row, hence this operation is
not performed and has no cost.
B.1.6 Overlap communication and computation as a last resort

There are numerous studies showing that overlapping communication and compu-
tation improves performance, but most of them show only modest improvement. Arbenz
and Slapnicar [9] show a 5% improvement by overlapping communication and computation,
while Pourzandi and Tourancheau show a 6% improvement. Those that show the greatest
improvement combine communication and computation overlap with other equally impor-
tant techniques such as pipelining and lookahead [32].
I don't know why overlapping communication and computation leads to only mod-
est improvements. In theory it ought to hide most of the communication costs. There are
several possible explanations, all of which presumably contribute. I suspect that the most
important reason for the disappointing savings from overlap is that overhead, and not com-
munication cost, is the primary factor limiting efficiency. A second important reason
is that most of the cost of communication on today's distributed memory machines is the
cost of moving the data between the node and the network, not moving data within the
network. The cost of moving data to and from the node always involves main memory cy-
cles, unless the main memory is dual ported (i.e. expensive), which must be stolen from the
execution of the rest of the code. Further, the latency cost is almost all software overhead,
hence during the message setup the CPU is busy and cannot compute.
The disadvantage of communication and computation overlap is that it adds com-
plexity which can be put to better use elsewhere. Both the Pourzandi/Tourancheau and
Arbenz/Slapnicar studies used a 1D data layout in Jacobi, although a 2D data layout offers
lower communication costs, O(n²/√p) versus O(n²), and lower overhead costs. They would
have done better to use a 2D data layout and delayed (potentially forever) consideration of
communication and computation overlap.
B.2 Matlab codes

B.2.1 Jacobi

The following is the Matlab code for Table 7.4.
n = 1000;
p = 64;
blacsalpha = 65.9e-6;
blacsbeta=.146e-6;
dividebeta=3.85e-6;
squarerootbeta=7.7e-6;
blasonebeta=.074e-6;
dgemmalpha=103e-6;
dgemmbeta=.0215e-6;
term(1) = 8 * sqrt(p) * ( log2(p) - 3 ) * blacsalpha
term(2) = 7/2 * n^2 / sqrt(p) * blacsbeta
term(3) = 1/8 * n^2 / sqrt(p) * log2(p) * blacsbeta
term(4) = 1/2 * n^2 / sqrt(p) * dividebeta
term(5) = 1/4 * n^2 / sqrt(p) * squarerootbeta
term(6) = 3/8 * n^3 / p * blasonebeta
term(7) = 8 * sqrt(p) * dgemmalpha
term(8) = 5 * n^3 / p * dgemmbeta
time = sum(term)
Appendix C

Miscellaneous Matlab codes

C.1 Reduction to tridiagonal form

The following Matlab code performs an unblocked reduction to tridiagonal form.
It produces the same values, up to roundoff, of D, E and TAU as LAPACK's DSYTRD and
ScaLAPACK's PDSYTRD.
%
% tridi - An unblocked, non-symmetric reduction to tridiagonal form
%
% This file creates an input matrix A, reduces it to tridiagonal form
% and tests to make sure that the reduction was performed correctly.
%
% outputs:
% D, E - The tridiagonal matrix
% tau
% A - The lower half holds the householder updates
%
%
% Produce the input matrix
%
N = 7;
A = hilb(N) + toeplitz( [ 1 (1:(N-1))*i ] );
B = A; % Keep a copy to check our work later.
%
% Reduce to tridiagonal form
%
n = size(A,1);
I = eye(N);
for j =1:n-1
%
% Compute the householder vector: v
%
clear v;
v(1:n,1) = zeros(n,1);
v(j+1:n,1) = A(j+1:n,j);
alpha = A(j+1,j);
beta = - norm(v) * real(alpha) / abs( real(alpha) ) ;
tau(j) = ( beta - alpha ) / beta ;
v = v / ( alpha - beta ) ;
v(j+1) = 1.0 ;
%
% Perform the matrix vector multiply:
%
w = A * v ;
%
% Compute the companion update vector: w
%
w = tau(j) * w ;
c = w' * v;
w = (w - (c * tau(j) / 2 ) * v );
D(j) = A(j,j);
E(j) = beta ;
%
% Update the trailing matrix
%
A = A - v * w' - w * v';
%
% Store the householder vector back into A
%
A(j+2:n,j) = v(j+2:n);
end
D(n) = A(n,n);
%
% Check to make sure that the reduction was performed correctly.
%
DE = diag(D) + diag(E,-1) + diag(E,1) ;
Q=I;
for j = 1:n-1
clear house
house(1:n,1) = zeros(n,1);
house(j+1:n,1) = A(j+1:n,j);
house(j+1,1) = 1.0;
Q = (I- tau(j)' * house * house') * Q ;
end
norm( B - Q' * DE * Q )
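The same reduction can be sketched in pure Python for a real symmetric matrix (a translation of the loop above for illustration, not a replacement for DSYTRD; the sign convention follows the Matlab code and no care is taken for zero columns):

```python
import math, random

def house_tridiag(A):
    """Unblocked Householder reduction of a real symmetric matrix
    (list of lists) to tridiagonal form. Returns (D, E) with D the
    diagonal and E the off-diagonal, as in the Matlab code above."""
    n = len(A)
    A = [row[:] for row in A]           # work on a copy
    D, E = [0.0] * n, [0.0] * (n - 1)
    for j in range(n - 1):
        # Householder vector v annihilating A(j+2:n, j)
        v = [0.0] * n
        for k in range(j + 1, n):
            v[k] = A[k][j]
        alpha = A[j + 1][j]
        nrm = math.sqrt(sum(x * x for x in v))
        beta = -nrm if alpha >= 0 else nrm
        tau = (beta - alpha) / beta
        for k in range(j + 1, n):
            v[k] /= (alpha - beta)
        v[j + 1] = 1.0
        # Companion update vector: w = tau*A*v - (c*tau/2)*v, c = w'v
        w = [tau * sum(A[i][k] * v[k] for k in range(n)) for i in range(n)]
        c = sum(w[k] * v[k] for k in range(n))
        w = [w[i] - 0.5 * c * tau * v[i] for i in range(n)]
        D[j], E[j] = A[j][j], beta
        # Rank-two update of the trailing matrix: A = A - v w' - w v'
        for i in range(n):
            for k in range(n):
                A[i][k] -= v[i] * w[k] + w[i] * v[k]
    D[n - 1] = A[n - 1][n - 1]
    return D, E

# Check on a random symmetric matrix: the orthogonal similarity
# preserves both the trace and the Frobenius norm.
random.seed(1)
n = 6
M = [[0.0] * n for _ in range(n)]
for i in range(n):
    for k in range(i, n):
        M[i][k] = M[k][i] = random.random()
D, E = house_tridiag(M)
trace_err = abs(sum(D) - sum(M[i][i] for i in range(n)))
fro_err = abs(sum(d * d for d in D) + 2 * sum(e * e for e in E)
              - sum(x * x for row in M for x in row))
```

The trace and Frobenius-norm checks play the role of the norm(B - Q' * DE * Q) test above without accumulating Q explicitly.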