Execution Time of Symmetric Eigensolvers
by
Kendall Swenson Stanley
B.S. (Purdue University) 1978
A dissertation submitted in partial satisfaction of the
requirements for the degree of
Doctor of Philosophy
in
Computer Science
in the
GRADUATE DIVISION
of the
UNIVERSITY of CALIFORNIA at BERKELEY
Committee in charge:
Professor James Demmel, Chair
Professor William Kahan
Professor Phil Colella
Fall 1997
The dissertation of Kendall Swenson Stanley is approved:
Chair Date
Date
Date
University of California at Berkeley
Fall 1997
Execution Time of Symmetric Eigensolvers
Copyright Fall 1997
by
Kendall Swenson Stanley
Abstract
Execution Time of Symmetric Eigensolvers
by
Kendall Swenson Stanley
Doctor of Philosophy in Computer Science
University of California at Berkeley
Professor James Demmel, Chair
The execution time of a symmetric eigendecomposition depends upon the application, the
algorithm, the implementation, and the computer. Symmetric eigensolvers are used in a
variety of applications, and the requirements of the eigensolver vary from application to
application. Many different algorithms can be used to perform a symmetric eigendecomposition,
each with differing computational properties. Different implementations of the
same algorithm may also have greatly differing computational properties. The computer on
which the eigensolver is run not only affects execution time but may favor certain algorithms
and implementations over others.
This thesis explains the performance of the ScaLAPACK symmetric eigensolver,
the algorithms that it uses, and other important algorithms for solving the symmetric eigenproblem
on today's fastest computers. We offer advice on how to pick the best eigensolver
for particular situations and propose a design for the next ScaLAPACK symmetric eigensolver,
which will offer greater flexibility and 50% better performance.
Professor James Demmel
Dissertation Committee Chair
To the memory of my father. My most ambitious goal is to be as good a father as
he was to me.
Contents
List of Figures viii
List of Tables x
I First Part  1

1 Summary - Interesting Observations  2
  1.1 Algorithms  6
  1.2 Software overhead and load imbalance costs are significant  8
  1.3 Effect of machine performance characteristics on PDSYEVX  10
  1.4 Prioritizing techniques for improving performance  11
  1.5 Reducing the execution time of symmetric eigensolvers  12
  1.6 Jacobi  14
  1.7 Where to obtain this thesis  14

2 Overview of the design space  15
  2.1 Motivation  15
  2.2 Algorithms  15
  2.3 Implementations  17
    2.3.1 Parallel abstraction and languages  17
    2.3.2 Algorithmic blocking  17
    2.3.3 Internal Data Layout  18
    2.3.4 Libraries  20
    2.3.5 Compilers  20
    2.3.6 Operating Systems  21
  2.4 Hardware  21
    2.4.1 Processor  21
    2.4.2 Memory  22
    2.4.3 Parallel computer configuration  24
  2.5 Applications  26
    2.5.1 Input matrix  27
    2.5.2 User request  28
    2.5.3 Accuracy and Orthogonality requirements  29
    2.5.4 Input and Output Data layout  29
  2.6 Machine Load  29
  2.7 Historical notes  30
    2.7.1 Reduction to tridiagonal form and back transformation  30
    2.7.2 Tridiagonal eigendecomposition  32
    2.7.3 Matrix-matrix multiply based methods  39
    2.7.4 Orthogonality  40

3 Basic Linear Algebra Subroutines  43
  3.1 BLAS design and implementation  43
  3.2 BLAS execution time  44
  3.3 Timing methodology  48
  3.4 The cost of code and data cache misses in DGEMV  50
  3.5 Miscellaneous timing details  50

4 Details of the execution time of PDSYEVX  52
  4.1 High level overview of PDSYEVX algorithm  52
  4.2 Reduction to tridiagonal form  53
    4.2.1 Householder's algorithm  53
    4.2.2 PDSYTRD implementation (Figure 4.4)  57
    4.2.3 PDSYTRD execution time summary  71
  4.3 Eigendecomposition of the tridiagonal  72
    4.3.1 Bisection  72
    4.3.2 Inverse iteration  72
    4.3.3 Load imbalance in bisection and inverse iteration  73
    4.3.4 Execution time model for tridiagonal eigendecomposition in PDSYEVX  74
    4.3.5 Redistribution  74
  4.4 Back Transformation  75

5 Execution time of the ScaLAPACK symmetric eigensolver, PDSYEVX on efficient data layouts on the Paragon  81
  5.1 Deriving the PDSYEVX execution time on the Intel Paragon (common case)  83
  5.2 Simplifying assumptions allow the full model to be expressed as a six term model  83
  5.3 Deriving the computation time during matrix transformations in PDSYEVX on the Intel Paragon  84
  5.4 Deriving the computation time during eigendecomposition of the tridiagonal matrix in PDSYEVX on the Intel Paragon  85
  5.5 Deriving the message initiation time in PDSYEVX on the Intel Paragon  86
  5.6 Deriving the inverse bandwidth time in PDSYEVX on the Intel Paragon  86
  5.7 Deriving the PDSYEVX order n imbalance and overhead term on the Intel Paragon  86
  5.8 Deriving the PDSYEVX order n²/√p imbalance and overhead term on the Intel Paragon  87

6 Performance on distributed memory computers  88
  6.1 Performance requirements of distributed memory computers for running PDSYEVX efficiently  88
    6.1.1 Bandwidth rule of thumb  89
    6.1.2 Memory size rule of thumb  89
    6.1.3 Performance requirements for minimum execution time  92
    6.1.4 Gang scheduling  94
  6.2 sec:gang  94
    6.2.1 Consistent performance on all nodes  94
  6.3 Performance characteristics of distributed memory computers  95
    6.3.1 PDSYEVX execution time (predicted and actual)  95

7 Execution time of other dense symmetric eigensolvers  98
  7.1 Implementations based on reduction to tridiagonal form  98
    7.1.1 PeIGs  98
    7.1.2 HJS  99
    7.1.3 Comparing the execution time of HJS to PDSYEVX  101
    7.1.4 PDSYEV  106
  7.2 Other techniques  106
    7.2.1 One dimensional data layouts  106
    7.2.2 Unblocked reduction to tridiagonal form  108
    7.2.3 Reduction to banded form  109
    7.2.4 One-sided reduction to tridiagonal form  110
    7.2.5 Strassen's matrix multiply  111
  7.3 Jacobi  112
    7.3.1 Jacobi versus Tridiagonal eigensolvers  112
    7.3.2 Overview of Jacobi Methods  113
    7.3.3 Jacobi Methods  114
    7.3.4 Computation costs  114
    7.3.5 Communication costs  121
    7.3.6 Blocking  124
    7.3.7 Symmetry  125
    7.3.8 Storing diagonal blocks in one-sided Jacobi  126
    7.3.9 Partial Eigensolver  126
    7.3.10 Threshold  128
    7.3.11 Pairing  129
    7.3.12 Pre-conditioners  131
    7.3.13 Communication overlap  132
    7.3.14 Recursive Jacobi  132
    7.3.15 Accuracy  133
    7.3.16 Recommendation  133
  7.4 ISDA  134
  7.5 Banded ISDA  135
  7.6 FFT  136

8 Improving the ScaLAPACK symmetric eigensolver  137
  8.1 The next ScaLAPACK symmetric eigensolver  137
  8.2 Reduction to tridiagonal form in the next ScaLAPACK symmetric eigensolver  138
  8.3 Making the ScaLAPACK symmetric eigensolver easier to use  141
  8.4 Details in reducing the execution time of the ScaLAPACK symmetric eigensolver  141
    8.4.1 Avoiding overflow and underflow during computation of the Householder vector without added messages  142
    8.4.2 Reducing communications costs  143
    8.4.3 Reducing load imbalance costs  144
    8.4.4 Reducing software overhead costs  145
  8.5 Separating internal and external data layout without increasing memory usage  146

9 Advice to symmetric eigensolver users  148

II Second Part  150

Bibliography  151

A Variables and abbreviations  169

B Further details  172
  B.1 Updating v during reduction to tridiagonal form  172
    B.1.1 Notation  173
    B.1.2 Updating v without added communication  173
    B.1.3 Updating w with minimal computation cost  174
    B.1.4 Updating w with minimal total cost  177
    B.1.5 Notes to figure B.4  178
    B.1.6 Overlap communication and computation as a last resort  179
  B.2 Matlab codes  180
    B.2.1 Jacobi  180

C Miscellaneous matlab codes  181
  C.1 Reduction to tridiagonal form  181
List of Figures
1.1 9 by 9 matrix distributed over a 2 by 3 processor grid with mb = nb = 2  4
1.2 Processor point of view for 9 by 9 matrix distributed over a 2 by 3 processor grid with mb = nb = 2  5

3.1 Performance of DGEMV on the Intel PARAGON  46
3.2 Additional execution time required for DGEMV when the code cache is flushed between each call. The y-axis shows the difference between the time required for a run which consists of one loop executing 16,384 no-ops after each call to DGEMV and the time required for a run which includes two loops, one executing DGEMV and one executing 16,384 no-ops.  48
3.3 Additional execution time required for DGEMV when the code cache is flushed between each call, as a percentage of the time required when the code is cached. See Figure 3.2.  49

4.1 PDSYEVX algorithm  53
4.2 Classical unblocked, serial reduction to tridiagonal form, i.e. EISPACK's TRED1 (The line numbers are consistent with figures 4.3, 4.4 and 4.5.)  55
4.3 Blocked, serial reduction to tridiagonal form, i.e. DSYEVX (See Figure 4.2 for unblocked serial code)  56
4.4 PDSYEVX reduction to tridiagonal form (See Figure 4.3 for further details)  58
4.5 Execution time model for PDSYEVX reduction to tridiagonal form (See Figure 4.4 for details about the algorithm and indices.)  59
4.6 Flops in the critical path during the matrix vector multiply  67

6.1 Relative cost of message volume as a function of the ratio between peak floating point execution rate in Megaflops, mfs, and the product of main memory size in Megabytes, M, and network bisection bandwidth in Megabytes/sec, mbs.  90
6.2 Relative cost of message latency as a function of the ratio between peak floating point execution rate in Megaflops, mfs, and main memory size in Megabytes, M.  91

7.1 HJS notation  100
7.2 Execution time model for HJS reduction to tridiagonal form. Line numbers match Figure 4.5 (PDSYEVX execution time)  105
7.3 Matlab code for two-sided cyclic Jacobi  115
7.4 Matlab code for two-sided blocked Jacobi  116
7.5 Matlab code for one-sided blocked Jacobi  117
7.6 Matlab code for an inefficient partial eigendecomposition routine  118
7.7 Pseudo code for one-sided parallel Jacobi with a 2D data layout with communication highlighted  119
7.8 Pseudo code for two-sided parallel Jacobi with a 2D data layout, as described by Schreiber[150], with communication highlighted  121

8.1 Data redistribution in the next ScaLAPACK symmetric eigensolver  138
8.2 Choosing the data layout for reduction to tridiagonal form  139
8.3 Execution time model for the new PDSYTRD. Line numbers match Figure 4.5 (PDSYTRD execution time) where possible.  140

B.1 Avoiding communication in computing W V^T v  174
B.2 Computing W V^T v without added communication  175
B.3 Computing W V^T v with minimal computation  176
B.4 Computing W V^T v on a four dimensional processor grid  178
List of Tables
3.1 BLAS execution time (Time = θi + number of flops × γi, in microseconds)  45

4.1 The cost of updating the current column of A in PDLATRD (Lines 1.1 and 1.2 in Figure 4.5)  62
4.2 The cost of computing the reflector (PDLARFG) (Line 2.1 in Figure 4.5)  63
4.3 The cost of all calls to PDSYMV from PDSYTRD  66
4.4 The cost of updating the matrix vector product in PDLATRD (Line 4.1 in Figure 4.5)  68
4.5 The cost of computing the companion update vector in PDLATRD (Line 5.1 in Figure 4.5)  69
4.6 The cost of performing the rank-2k update (PDSYR2K) (Lines 6.1 through 6.3 in Figure 4.5)  70
4.7 Computation cost in PDSYEVX  77
4.8 Computation cost (tridiagonal eigendecomposition) in PDSYEVX  78
4.9 Communication cost in PDSYEVX  79
4.10 The cost of back transformation (PDORMTR)  80

5.1 Six term model for PDSYEVX on the Paragon  82
5.2 Computation time in PDSYEVX  85
5.3 Execution time during tridiagonal eigendecomposition  85
5.4 Message initiations in PDSYEVX  86
5.5 Message transmission in PDSYEVX  86
5.6 Order n load imbalance cost on the PARAGON  87
5.7 Order n²/√p load imbalance and overhead term on the PARAGON  87

6.1 Performance  95
6.2 Hardware and software characteristics of the PARAGON and the IBM SP2  96
6.3 Predicted and actual execution times of PDSYEVX on xps5, an Intel PARAGON. Problem sizes which resulted in execution times more than 15% greater than predicted are marked with an asterisk. Many of these problem sizes were repeated to show that the unusually large execution times are aberrant.  97

7.1 Comparison between the cost of HJS reduction to tridiagonal form and PDSYTRD for n = 4000, p = 64, nb = 32. Values differing from the previous column are shaded.  107
7.2 Fastest eigendecomposition method  112
7.3 Performance model for my recommended Jacobi method  118
7.4 Estimated execution time per sweep for my recommended Jacobi on the PARAGON for n = 1000, p = 64  120
7.5 Performance models (flop counts) for one-sided Jacobi variants. Entries which differ from the previous column are shaded.  122
7.6 Performance models (flop counts) for two-sided Jacobi variants  123
7.7 Communication cost for Jacobi methods (per sweep)  124

A.1 Variable names and their uses  170
A.2 Variable names and their uses (continued)  171
A.3 Abbreviations  171
A.4 Model costs  171
Acknowledgements
I thank those that I have worked with during my wonderful years at Berkeley. Doug
Ghormley taught me all that I know about emacs, X, and tcsh. Susan Blackford, Clint
Whaley and Antoine Petitet patiently answered my stupid questions about ScaLAPACK. I
thank Bruce Hendrickson for numerous insights. Mark Sears and Greg Henry gave me the
opportunity to test out some of my ideas on a real application. Peter Strazdins' study of
software overhead convinced me to take a hard look at code cache misses. Ross Moore gave
me numerous typesetting hints and suggestions. Beresford Parlett helped me with the
section on Jacobi. Oliver Sharp helped convince me to ask Jim Demmel to be my advisor
and gave some early help with technical writing. I am indebted to the members of the
ScaLAPACK team whose effort made ScaLAPACK, and hence this thesis, possible.
My graduate studies would not have been possible were it not for my friends and
family who encouraged me to resume my education and continued to support me in that
decision, especially my wife (Marta Laskowski), Greg Lee, and Marta's parents Michael and
Joan. I also thank Chris Ranken for his friendship; my parents for bringing me into a loving
world and teaching me to love mathematics; and Howard and Nani Ranken who proved, by
example, that the two-body problem can be solved and inspired Marta and me to pursue the
dream of two academic careers in one household.
I thank the members of my committee for their help and advice. I thank my
advisor for allowing me the luxury of doing research without worrying about funding1 or
machine access at UC Berkeley2 and the University of Tennessee at Knoxville3. I thank Prof.
Kahan for his sage advice, not just on the technical aspects, but also on the non-technical
aspects of a research career and on life itself. I thank Phil Colella for his interest in my
work and for reading my thesis on extremely short notice.
Most importantly, I thank my wife for her love and never ending support and I
thank my daughter for making me smile.
1This work was supported primarily by the Defense Advanced Research Projects Agency of the Department of Defense under contracts DAAL03-91-C-0047 and DAAH04-95-1-0077, and with additional support provided by the Department of Energy grant DE-FG03-94ER25206. The information presented here does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.
2National Science Foundation Infrastructure grant Nos. CDA-9401156 and CDA-8722788.
3The University of Tennessee, Knoxville, acquired the IBM SP2 through an IBM Shared University Research Grant. Access to the machine and technical support was provided by the University of Tennessee / Oak Ridge National Laboratory Joint Institute for Computational Science.
Part I
First Part
Chapter 1
Summary - Interesting
Observations
The symmetric eigendecomposition of a real symmetric matrix is A = Q D Q^T,
where D is diagonal and Q is orthonormal, i.e. Q^T Q = I. Tridiagonal based methods
reduce A to a tridiagonal matrix through an orthonormal similarity transformation, i.e.
A = Z T Z^T, compute the eigendecomposition of the tridiagonal matrix T = U D U^T and,
if necessary, transform the eigenvectors of the tridiagonal matrix back into eigenvectors of
the original matrix A, i.e. Q = Z U. Non-tridiagonal based methods operate directly on
the original matrix A.
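The three stages of a tridiagonal based method can be sketched in a short NumPy program. This is an illustrative sketch only, not the ScaLAPACK code: the reduction below is the classical unblocked Householder algorithm, and numpy.linalg.eigh stands in for a dedicated tridiagonal eigensolver.

```python
import numpy as np

def tridiagonalize(A):
    """Classical unblocked Householder reduction A = Z T Z^T,
    where T is tridiagonal and Z is orthogonal."""
    n = A.shape[0]
    T = A.copy()
    Z = np.eye(n)
    for k in range(n - 2):
        x = T[k + 1:, k]
        alpha = -np.copysign(np.linalg.norm(x), x[0])
        v = x.copy()
        v[0] -= alpha                 # v = x - alpha*e1
        if np.linalg.norm(v) == 0.0:
            continue                  # column already in tridiagonal form
        v /= np.linalg.norm(v)
        H = np.eye(n)
        H[k + 1:, k + 1:] -= 2.0 * np.outer(v, v)   # Householder reflector
        T = H @ T @ H                 # orthonormal similarity transformation
        Z = Z @ H                     # accumulate Z
    return T, Z

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
A = (A + A.T) / 2                     # random symmetric test matrix

T, Z = tridiagonalize(A)              # step 1: A = Z T Z^T
w, U = np.linalg.eigh(T)              # step 2: T = U D U^T
Q = Z @ U                             # step 3: back transformation, Q = Z U

assert np.allclose(Q @ np.diag(w) @ Q.T, A)
```

The final assertion checks that Q D Q^T reproduces A, i.e. that the back-transformed vectors are eigenvectors of the original matrix.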
I am interested in understanding and minimizing the execution time of dense sym-
metric eigensolvers, as used in real applications, on distributed memory parallel computers.
I have modeled the performance of symmetric eigensolvers as a function of the algorithm,
the application, the implementation and the computer. Some applications require only a
partial eigendecomposition, i.e. only a few eigenvalues or eigenvectors. Different implementations
may require different communication or computation patterns, and they may use
different libraries and/or compilers. This thesis concentrates on the O(n^3) cost of reduction
to tridiagonal form and of transforming the eigenvectors back to the original space.
I have modeled the execution time of the ScaLAPACK[31] symmetric eigensolver,
PDSYEVX, in detail and validated this model against actual performance on a number of
distributed memory parallel computers. PDSYEVX, like most ScaLAPACK codes, uses calls
to the PBLAS[41, 140] to perform basic linear algebra operations such as matrix-matrix
multiply and matrix-vector multiply in parallel. PDSYEVX and the PBLAS use calls to the
Basic Linear Algebra Subroutines, BLAS[63, 62], to perform basic linear algebra operations
such as matrix-matrix multiply and matrix-vector multiply on data local to each processor,
and calls to the Basic Linear Algebra Communications Subroutines, BLACS[169, 69], to move
data between the processors. The level one BLAS involve only vectors and perform O(n)
flops on O(n) data, where n is the length of the vector. The level two BLAS involve one
matrix and one or two vectors and perform O(n^2) flops on O(n^2) data, where the matrix is
of size n x n. The level three BLAS involve only matrices and perform O(n^3) flops on O(n^2)
data, and offer the best opportunities to obtain peak floating point performance through
data re-use.
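As a rough back-of-the-envelope illustration (not a measurement), the flop-to-data ratios behind this hierarchy can be computed directly for representative operations:

```python
def blas_reuse(n):
    """Flops performed per matrix/vector element touched, for a
    representative operation at each BLAS level on problem size n."""
    level1 = (2 * n) / (2 * n)              # dot product: 2n flops, 2n words
    level2 = (2 * n * n) / (n * n + 2 * n)  # matrix-vector: 2n^2 flops, ~n^2 words
    level3 = (2 * n**3) / (3 * n * n)       # matrix-matrix: 2n^3 flops, 3n^2 words
    return level1, level2, level3

l1, l2, l3 = blas_reuse(1000)
# Levels 1 and 2 re-use each datum O(1) times, so they are memory bound;
# level 3 re-uses each datum O(n) times, which is what permits blocked
# implementations to approach peak floating point performance.
```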
PDSYEVX uses a 2D block cyclic data layout for all input, output and internal
matrices. 2D block cyclic data layouts have been shown to support scalable high performance
parallel dense linear algebra codes[32, 30, 124] and hence have been selected as the
primary data layout for HPF[110], ScaLAPACK[68] and other parallel dense linear algebra
libraries[98, 164]. A 2D block cyclic data layout is defined by the processor grid (pr by pc),
the local block size (mb by nb) and the location of the (1,1) element of the matrix. In
this thesis, we will assume that the (1,1) element of matrix A, i.e. A(1,1), is mapped to
the (1,1) element of the local matrix on processor (0,0). Hence, A(i,j) is stored in element
( floor((i-1)/(mb*pr))*mb + mod(i-1, mb) + 1, floor((j-1)/(nb*pc))*nb + mod(j-1, nb) + 1 )
of the local array on processor ( mod(floor((i-1)/mb), pr), mod(floor((j-1)/nb), pc) ).
Figures 1.1 and 1.2, reprinted from the ScaLAPACK
User's Guide[31], show how a 9 by 9 matrix would be distributed over a 2 by 3 processor
grid with mb = nb = 2. In general, we will assume that square blocks are used, since this is
best for the symmetric eigenproblem, and we will use nb to refer to both the row block size
and the column block size.
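This mapping is mechanical enough to express as a few lines of code. The helper below is hypothetical (it is not a ScaLAPACK routine; ScaLAPACK's TOOLS directory provides similar index conversions), and exists only to make the formula above concrete:

```python
def block_cyclic(i, j, mb, nb, pr, pc):
    """Map global entry A(i, j) (1-based) of a 2D block cyclic layout to
    ((processor row, processor col), (local row, local col)), assuming
    A(1, 1) lives in local element (1, 1) of processor (0, 0)."""
    proc = ((i - 1) // mb % pr, (j - 1) // nb % pc)
    local = ((i - 1) // (mb * pr) * mb + (i - 1) % mb + 1,
             (j - 1) // (nb * pc) * nb + (j - 1) % nb + 1)
    return proc, local

# The 9 by 9 example of Figure 1.1: a 2 by 3 grid with mb = nb = 2.
print(block_cyclic(1, 1, 2, 2, 2, 3))  # ((0, 0), (1, 1))
print(block_cyclic(3, 3, 2, 2, 2, 3))  # ((1, 1), (1, 1))
print(block_cyclic(9, 9, 2, 2, 2, 3))  # ((0, 1), (5, 3))
```

Note that the processor coordinates depend only on the block index modulo the grid shape, which is what gives the layout its cyclic load-balancing property.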
All ScaLAPACK codes, including PDSYEVX in version 1.5, use the data layout block
size as the algorithmic blocking factor. Hence, except as noted, we use nb to refer to
the algorithmic blocking factor as well as the data layout block size. Data layouts and
algorithmic blocking factors are discussed in Section 2.3.3.
PDSYEVX calls the following routines:
PDSYTRD Performs Householder reduction to tridiagonal form.
PDSTEBZ Computes the eigenvalues of a tridiagonal matrix using bisection.
PDSTEIN Computes the eigenvectors of the tridiagonal matrix using inverse iteration and
Figure 1.1: 9 by 9 matrix distributed over a 2 by 3 processor grid with mb = nb = 2
[Figure: the 9 x 9 matrix partitioned into 2 x 2 blocks; MB and NB label the block dimensions, M and N the matrix dimensions.]
Gram-Schmidt reorthogonalization.
PDORMTR Transforms the eigenvectors of the tridiagonal matrix back into eigenvectors of
the original matrix.
My performance models explain performance in terms of the following application param-
eters:
n The matrix size.
m The number of eigenvectors required.
e The number of eigenvalues required (e >= m).

the following machine parameters:

p The number of processors (arranged in a pr by pc grid as described below).

α The communication latency (secs/message).

β The inverse communication bandwidth (secs/double precision word). This means that
sending a message of k double precision words costs: α + kβ.
Figure 1.2: Processor point of view for 9 by 9 matrix distributed over a 2 by 3 processor grid with mb = nb = 2
[Figure: the same 9 x 9 matrix from the 2 x 3 process grid point of view, showing the local array owned by each process.]
γ1, γ2, γ3 Time per flop for BLAS1, BLAS2 and BLAS3 routines respectively.

θ1, θ2, θ3, θ4 Software overhead for BLAS1, BLAS2, BLAS3 and PBLAS routines respectively.
This means that a call to DGEMM (a BLAS3 routine) requiring c flops costs: θ3 + cγ3. See
Chapter 3 for details on the cost of the BLAS. The cost of the PBLAS routine PDSYMV
is shown in Table 4.3.
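These terms compose additively. As a hedged sketch of how the model is evaluated (the parameter values below are placeholders for illustration, not measured Paragon constants):

```python
def message_time(k, alpha, beta):
    """Time to send k double precision words: alpha + k*beta."""
    return alpha + k * beta

def blas_call_time(flops, theta, gamma):
    """Time for one BLAS/PBLAS call performing `flops` flops:
    theta + flops*gamma."""
    return theta + flops * gamma

# Hypothetical parameter values, for illustration only.
alpha, beta = 100e-6, 0.05e-6      # 100 us latency, 0.05 us per word
theta3, gamma3 = 30e-6, 0.025e-6   # BLAS3 overhead and time per flop

n = 200
t = blas_call_time(2 * n**3, theta3, gamma3) + message_time(n * n, alpha, beta)
# Small problems are dominated by the fixed costs theta and alpha;
# large problems by the per-flop and per-word costs gamma and beta.
```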
My model also uses the following algorithmic and data layout parameters:
pr The number of processor rows in the processor grid.
pc The number of processor columns in the processor grid.
nb The data layout block size and algorithmic blocking factor.
These and all other variables used in this thesis are listed in Table A.1 in Appendix A.
The rest of this chapter presents the most interesting results from my study of the
execution time of symmetric eigensolvers on distributed memory computers. Section 1.1
describes the algorithms commonly used for dense symmetric eigendecomposition on distributed
memory parallel computers. Section 1.2 describes how software overhead and load
imbalance costs are significant. Section 1.3 explains the two rules of thumb for ensuring
that a distributed memory parallel computer can achieve good performance on a dense linear
algebra code such as ScaLAPACK's symmetric eigensolver. Section 1.4 explains that it
is important to identify which techniques offer the greatest potential for improving performance
across a wide range of applications, problem sizes and distributed memory
parallel computers. Section 1.5 gives a synopsis of how the execution time of the ScaLAPACK
symmetric eigensolver could be reduced. Section 1.6 explains the types of applications on
which Jacobi can be expected to be as fast as, or faster than, tridiagonal based methods.
The rest of my thesis is organized as follows. Chapter 2 provides an introduction
and a historical perspective. Chapter 3 explains the performance of the Basic Linear Algebra
Subroutines (BLAS). Chapter 4 contains my complete execution time model for ScaLAPACK's
symmetric eigensolver, PDSYEVX. Chapter 5 simplifies the execution time model by concentrating
on a particular application on a particular distributed memory parallel computer,
the Intel Paragon. Chapter 6 explains the performance requirements of distributed memory
parallel computers and discusses the execution time of PDSYEVX. Chapter 7 explains the
performance of other dense symmetric eigensolvers. Chapter 8 provides a blueprint for reducing
the execution time of PDSYEVX. Chapter 9 offers concise advice to users of symmetric
eigensolvers.
1.1 Algorithms
There are many widely disparate symmetric eigendecomposition algorithms. Tridi-
agonal reduction based algorithms for the symmetric eigendecomposition require asymptot-
ically the fewest ops and have been historically the fastest and most popular[83, 79, 129,
153, 86, 145, 134, 50].
Iterative eigensolvers, e.g. Lanczos and conjugate gradient methods, are clearly superior if the input matrix is sparse and only a limited portion of the spectrum is needed[49, 119]. Iterative eigensolvers are outside the scope of this thesis.
Even for tridiagonal matrices, there are several algorithms worthy of attention for the tridiagonal eigendecomposition. The ideal method would require at most O(n²) floating point operations, O(n) message volume and O(p) messages. The recent work of Parlett and Dhillon[136, 139] renews hope that such a method will be available in the near future. Should this effort hit unexpected snags, other better known methods, such as QR[79, 86, 93], QD[135], bisection and inverse iteration[83, 102] and Cuppen's divide and conquer algorithm[50, 66, 147, 88], will remain common. Parallel codes have been written for QR[39, 8, 76, 125], bisection and inverse iteration[15, 75, 54, 81] and Cuppen's algorithm[82, 80, 141]. ScaLAPACK offers parallel QR and parallel bisection and inverse iteration codes, and Cuppen's algorithm[50, 66, 88], which has recently replaced QR as the fastest serial method[147], has been coded for inclusion in ScaLAPACK by Françoise Tisseur. Algorithms for the tridiagonal eigenproblem are discussed in Section 2.2, and parallel tridiagonal eigensolvers are discussed in Section 7.1.
A detailed comparison of tridiagonal eigensolvers would be premature until Parlett and Dhillon complete their prototype.
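From a present-day vantage point, these alternatives can be compared directly through LAPACK's tridiagonal drivers as exposed by scipy. The sketch below is illustrative only; the driver names ('stev' for the QR-style driver, 'stemr' for the algorithm that grew out of Parlett and Dhillon's work, 'stebz' for bisection) are scipy/LAPACK conventions, not part of this thesis:

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

rng = np.random.default_rng(0)
n = 100
d = rng.standard_normal(n)        # diagonal of the tridiagonal matrix
e = rng.standard_normal(n - 1)    # off-diagonal

# QR-style driver (xSTEV) vs. the Parlett/Dhillon-derived xSTEMR
w_qr, v_qr = eigh_tridiagonal(d, e, lapack_driver='stev')
w_mr, v_mr = eigh_tridiagonal(d, e, lapack_driver='stemr')
assert np.allclose(w_qr, w_mr)    # both compute the same spectrum

# Bisection (xSTEBZ) computes a subset of the spectrum cheaply
w_lo = eigh_tridiagonal(d, e, eigvals_only=True, select='i',
                        select_range=(0, 4), lapack_driver='stebz')
assert np.allclose(w_lo, w_qr[:5])
```

The select='i' call shows why bisection remains attractive when only a few eigenvalues are wanted.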
This thesis concentrates on the O(n³) cost of reduction to tridiagonal form and of transforming the eigenvectors back to the original space. Hendrickson, Jessup and Smith[91] showed that reduction to tridiagonal form can be performed 50% faster than ScaLAPACK does it. Lang's successive band reduction[116], SBR, is interesting, at least if only eigenvalues are to be computed, but the complexity of SBR has made it difficult to realize its theoretical advantages in practice. A performance model for PDSYEVX, ScaLAPACK's symmetric eigensolver, is given in Chapter 4. By restricting our attention to a single computer, and to the most common applications, the model is further simplified and discussed in Chapter 5.
Jacobi requires 4-20 times as many floating point operations as tridiagonal based methods, hence the set of problems on which Jacobi will be faster will always be limited. Jacobi is faster than tridiagonal based methods[125, 2] on small spectrally diagonally dominant matrices¹, despite requiring 4 times as many flops, because it has less overhead. However, on large problems tridiagonal based methods can achieve at least 25% efficiency and will hence be faster than any method requiring 4 times as many flops. And on matrices that are not spectrally diagonally dominant, Jacobi requires 20 or more times as many flops as tridiagonal based methods, a handicap that is simply too large to overcome. Jacobi's method is discussed in Section 7.3.
Methods that require multiple n by n matrix-matrix multiplies, such as the Invariant Subspace Decomposition Approach[97] (ISDA), and Yau and Lu's FFT based method[174] require roughly 30 times as many floating point operations as tridiagonal based methods and hence may never be faster than tridiagonal based methods. The ISDA for solving symmetric eigenproblems is discussed in Section 7.4.

¹ Spectrally diagonally dominant means that the eigenvector matrix, or a permutation thereof, is diagonally dominant.
Banded ISDA[26], an improvement on ISDA that begins with an initial bandwidth reduction, is nearly a tridiagonal method and offers performance that is nearly as good, at least if only eigenvalues are sought. However, since a banded ISDA code requires multiple bandwidth reductions, each of which requires a back transformation, if even a few eigenvectors are required a banded ISDA code must either store the back transformations in compact form or perform an additional O(n³) flops. No code available today stores and applies these back transformations in compact form. At present, the fastest banded ISDA code starts by reducing the matrix to tridiagonal form and is neither the fastest tridiagonal eigensolver nor the easiest to parallelize. Banded ISDA is discussed in Section 7.5.
In conclusion, reduction to tridiagonal form combined with Parlett and Dhillon's
tridiagonal eigensolver is likely to be the preferred method for eigensolution of dense matrices
for most applications.
In the meantime, until Parlett and Dhillon's code is available, we believe that PDSYEVX is the best general purpose symmetric eigensolver for dense matrices. It is available on any machine to which ScaLAPACK has been ported², it achieves 50% efficiency even when the flops in the tridiagonal eigensolution are not counted³, and it scales well, running efficiently on machines with thousands of nodes. It is faster than ISDA, and faster than Jacobi on large matrices and on matrices that are not spectrally diagonally dominant.
1.2 Software overhead and load imbalance costs are significant
In PDSYEVX, it is somewhat surprising but true that software overhead and load imbalance costs are larger than communications costs. In its broadest definition, software overhead is the difference between the actual execution time and the cost of communication and computation. Software overhead includes saving and restoring registers, parameter passing, error and special case checking, as well as those tasks which prevent calls to the BLAS involving few flops from being as efficient as calls to the BLAS involving many flops: loop overhead, border cases and data movement between memory hierarchies that gets amortized over all the operations in a given call to the BLAS. The cost of any operation which is performed by only a few of the processors (while the other processors are idle) is a load imbalance cost.

² Intel Paragon, Cray T3D, Cray T3E, IBM SP2, and any machine supporting the BLACS, MPI or PVM.
³ Our definition of efficiency is a demanding one: total time divided by the time required by reduction to tridiagonal form and back transformation assuming that these are performed at the peak floating point execution rate of the machine, i.e. time / ((10/3)(n³/p) × the peak time per flop).
Because software overhead is as significant as communication latency, the three term performance model introduced by Choi et al.[40] and used in my earlier work[57], which counts only flops, number of messages and words communicated, does not adequately model the performance of PDSYEVX. In addition to these three terms, a fourth term representing software overhead costs is required.
Software overhead is more difficult to measure, study, model and reason about than the other components of execution time. Measuring the execution time of a subroutine call requiring little or no work measures only subroutine call overhead, parameter passing and error checking. For the performance models in this thesis, we measure the execution time of each routine across a range of problem sizes (with code cached and data not cached) and use curve fitting to estimate the software overhead of an individual routine. Because we perform these timings with code cached but data not cached, this gives an estimate of all software overhead costs except code cache misses.
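The curve-fitting procedure can be sketched in a few lines. The routine timed, the sizes, and the least-squares model t = α + γ·flops below are illustrative assumptions, not the thesis's actual measurement harness:

```python
import time
import numpy as np

def median_time(fn, reps=50):
    """Median wall-clock time of fn() over several repetitions
    (code is warm in cache after the first call, as in the text)."""
    ts = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        ts.append(time.perf_counter() - t0)
    return float(np.median(ts))

# Time a BLAS2-style matrix-vector product across problem sizes.
sizes = [32, 64, 128, 256, 512]
times, flops = [], []
for n in sizes:
    A, x = np.ones((n, n)), np.ones(n)
    times.append(median_time(lambda: A.dot(x)))
    flops.append(2.0 * n * n)          # flop count of one matvec

# Fit t = alpha + gamma * flops: alpha estimates the fixed per-call
# software overhead, gamma the incremental time per flop.
X = np.column_stack([np.ones(len(sizes)), flops])
alpha, gamma = np.linalg.lstsq(X, np.array(times), rcond=None)[0]
```

The intercept α lumps together everything that does not scale with the flop count, which is exactly the quantity the plain "time an empty call" approach underestimates.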
We use times with the code cached but data not cached for our performance models because, for most problem sizes, the matrix is too large to fit in cache but it is less clear whether the code fits in cache or not. It is easy to compute the amount of data which must be cached, but there is no portable, automatic way to measure the amount of code which must be cached. Furthermore, the data cache needs for typical problem sizes are much larger than the code cache needs; hence, while it is usually clear that the data is not cached, the code cache needs and the code cache size are much closer.
A full study of software overhead costs is outside the scope of this thesis and remains a topic for future research. The overhead and load imbalance terms in the performance model for PDSYEVX on the Paragon are explained in Sections 5.7 and 5.8.
1.3 Effect of machine performance characteristics on PDSYEVX

The most important machine performance characteristic is the peak floating point rate. Bisection bandwidth essentially defines which machines ScaLAPACK can perform well on. Message latency and software overhead, being O(n) terms, are important primarily for small and medium matrices.
Most collections of computers fall into one of two groups: those connected by a switched network whose bisection bandwidth increases linearly (or nearly so) with the number of processors, and those connected by a network that allows only one processor to send at a time. All current distributed memory parallel computers that I am aware of have adequate bisection bandwidth⁴ to support good efficiency on PDSYEVX. On the other hand, no network that allows only one processor to send at a time can provide scalable performance, and none that I am aware of allows good performance with as many as 16 processors. As long as the bandwidth rule of thumb (explained in detail in Section 6.1.1) holds, bandwidth will not be the limiting factor in the performance of PDSYEVX.
Bandwidth rule of thumb: Bisection bandwidth per processor⁵ times the square root of memory size per processor should exceed floating point performance per processor:

    (Megabytes/sec per processor) × √(Megabytes per processor) > (Megaflops/sec per processor)

assures that bandwidth will not limit performance.
Assuming that the bandwidth is adequate, we consider next the problem size per processor. If the problem is large enough, i.e. n²/p > 2 × (Megaflops/sec per processor), then PDSYEVX should execute reasonably efficiently. This rule (explained in detail in Section 6.1.2) can be restated as:

Memory size rule of thumb: memory size should match floating point performance:
⁴ Few distributed memory parallel computers offer bandwidth that scales linearly with the number of processors, but most still have adequate bisection bandwidth.
⁵ Bisection bandwidth per processor is the total bisection bandwidth of the network divided by the number of processors.
    (Megabytes per processor) > (Megaflops/sec per processor)

assures that PDSYEVX will be efficient on large problems.
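Both rules of thumb reduce to one-line predicates. A sketch; the node parameters in the example are invented for illustration:

```python
import math

def bandwidth_rule_ok(mbytes_per_sec, mbytes, mflops):
    """Bisection bandwidth per processor (MB/s) times the square root
    of memory per processor (MB) should exceed Mflops/sec per processor."""
    return mbytes_per_sec * math.sqrt(mbytes) > mflops

def memory_rule_ok(mbytes, mflops):
    """Memory size per processor should exceed flop rate per processor."""
    return mbytes > mflops

# A hypothetical node: 40 MB/s bisection bandwidth per processor,
# 32 MB of memory, 100 Mflops/sec peak.
print(bandwidth_rule_ok(40.0, 32.0, 100.0))   # 40 * sqrt(32) ~ 226 > 100
print(memory_rule_ok(32.0, 100.0))            # 32 < 100: too little memory
```

Such a node would satisfy the bandwidth rule but fail the memory rule, so it could be expected to run PDSYEVX efficiently only if lower order terms are small.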
If the problem is not large enough, lower order terms, as explained in Chapter 4, will be significant. Unlike the peak flop rate, which can be substantially independent of main memory performance, the lower order terms (communication latency, communication bandwidth, software overhead and load imbalance) are strongly linked to main memory performance.
PDSYEVX can work well on machines with large, slow main memory (on large problems) and/or machines with small, fast main memory (on small problems). Most distributed memory parallel computers have sufficient memory size and network bisection bandwidth to allow PDSYEVX to achieve high efficiency on large problem sizes. The Cray T3E is one of the few machines that has sufficient main memory performance to allow PDSYEVX to achieve high performance on small problem sizes. The effect of machine performance characteristics on PDSYEVX is discussed in Chapter 6.
1.4 Prioritizing techniques for improving performance

One of the most important uses of performance modeling is to identify which techniques offer the most promise for performance improvement, because there are too many performance improvement techniques to allow one to try them all. One technique that appeared to be important early in my work, optimizing global communications, now appears less important in light of the discovery that software overhead and load imbalance are more significant than earlier thought. Here we talk about general conclusions; details are summarized in Section 1.5 and elaborated in Chapters 7 and 8.
Overlapping communication and computation, though it undeniably increases performance, should be implemented only after every effort has been made to reduce both communications and computations costs as much as possible. Overlapping communication and computation has proven to be more attractive in theory than in practice because not all communication costs overlap well and communication costs are not the only impediment to good parallel performance.
Although Strassen's matrix multiplication has been proven to offer performance better than can be achieved through traditional methods, it will be a long time before a Strassen's matrix multiply is shown to be twice as fast as a traditional method. A typical single processor computer would require 2-4 Gigabytes of main memory to achieve an effective flop rate of twice the machine's peak flop rate⁶ and 2-4 Terabytes of main memory to achieve 4 times the peak flop rate. Strassen's matrix multiplication will get increasing use in the coming years, because achieving 20% above "peak" performance is nothing to sneeze at, but Strassen's matrix multiply will not soon make matrix multiply based eigendecomposition such as ISDA faster than tridiagonal based eigendecomposition.
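For reference, a minimal recursive Strassen multiply, which trades one of the eight half-size products for extra additions at each recursion level (a textbook sketch, not a tuned implementation; the cutoff below is arbitrary):

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's multiply: 7 half-size products instead of 8 per level,
    falling back to the ordinary product below `cutoff`."""
    n = A.shape[0]
    if n <= cutoff or n % 2 != 0:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

Each level saves only 1/8 of the multiplications while adding O(n²) additions, so many recursion levels, and hence very large matrices, are needed before the effective flop rate doubles, which is the point made above.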
1.5 Reducing the execution time of symmetric eigensolvers
PDSYEVX can be improved. It does not work well on matrices with large clusters of eigenvalues. And it is not as efficient as it could be[91], achieving only 50% of peak efficiency on the Paragon, Cray T3D and Berkeley NOW even on large matrices. On small matrices it performs worse. Parlett and Dhillon's new tridiagonal eigensolver promises to solve the clustered eigenvalue problem, so we concentrate on improving the performance of reduction to tridiagonal form and back transformation.
Input and output data layout need not affect the execution time of a parallel symmetric eigensolver because data redistribution is cheap. Data redistribution requires only O(p) messages and O(n²/p) message volume per processor. This is modest compared to the O(n log(p)) messages and O(n²/√p) message volume per processor required by reduction to tridiagonal form and back transformation.
Separating internal and external data layout actually decreases the minimum execution time over all data layouts. Separating internal and external data layouts allows reduction to tridiagonal form and back transformation to use different data layouts. It also allows codes to concentrate only on the best data layout, reducing software overhead and allowing improvements which would be prohibitively complicated to implement if they had to work on all two-dimensional block cyclic data layouts.
Separating internal and external data layouts increases the minimum workspace requirement⁷ from 2.5n² to 3n². However, with minor improvements in the existing code, and without any changes to the interface, internal and external data layout can be separated without increasing the workspace requirement. See Section 8.5.

⁶ A dual processor computer would require twice as much memory.
⁷ Assuming that data redistribution is not performed in place. It is difficult to redistribute data in place
Lichtenstein and Johnson[124] point out that data layout is irrelevant to many linear algebra problems because one can solve a permuted problem instead of the original. This works for symmetric problems provided that the input data is distributed over a square processor grid and the row block size is equal to the column block size.
Hendrickson, Jessup and Smith[91] demonstrated that the performance of PDSYEVX can be improved substantially by reducing load imbalance, software overhead and communications costs. Most of the inefficiency in PDSYEVX is in reduction to tridiagonal form. Software overhead and load imbalance are responsible for more of the inefficiency than the cost of communications; hence, it is those areas that need to be sped up the most. Preliminary results[91] indicate that by abandoning the PBLAS interface, using BLAS and BLACS calls directly, and concentrating on the most efficient data layout, software overhead, load imbalance and communications costs can be cut in half. Strazdins has investigated reducing software overheads in the PBLAS[161], but it remains to be seen whether software overheads in the PBLAS can be reduced sufficiently to allow PDSYEVX to be as efficient as it could be. PDSYEVX performance can be improved further if the compiler can produce efficient code for simple doubly nested loops implementing merged BLAS Level 2 operations (like DSYMV and DSYR2).
For small matrices, software overhead dominates all costs, and hence one should minimize software overhead even at the expense of increasing the cost per flop. An unblocked code has the potential to do just that.
Although back transformation is more efficient than reduction to tridiagonal form, it can be improved. Whereas software overhead is the largest source of inefficiency in reduction to tridiagonal form, communications cost and load imbalance are the largest sources of inefficiency in back transformation. Load imbalance is hard to eliminate in a blocked data layout in reduction to tridiagonal form because the size of the matrix being updated is constantly changing (getting smaller), but in back transformation all eigenvectors are constantly updated, so statically balancing the number of eigenvalues assigned to each processor works well. Therefore the best data layout for back transformation is a two-dimensional rectangular block-cyclic data layout. The number of processor columns, pc, should exceed the number of processor rows by a factor of approximately 8. The optimal data layout column block size is ⌈n/(pc·k)⌉ for some small integer k. The row block size is less important in back transformation, and 32 is a reasonable choice, although setting it to the same value as the column block size will also work well if the BLAS are efficient on that block size and pr < pc. Many techniques used to improve performance in LU decomposition, such as overlapping communication and computation, pipelining communication and asynchronous message passing, can also be used to improve the performance of back transformation. Of these techniques, only asynchronous message passing (which eliminates all local memory movement) requires modification to the BLACS interface. The modification to the BLACS needed to support asynchronous message passing would allow forward and backward compatibility.

⁷ (cont.) between two arbitrary parallel data layouts. If efficient in-place data redistribution were feasible, separating internal and external data layout would require only a trivial increase in workspace.
All of these methods are discussed in Chapter 8.
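The grid-shape and block-size heuristics for back transformation can be collected into a small helper. The divisor search below is one plausible way to realize "pc ≈ 8·pr" and is my own illustrative choice, not code from the thesis:

```python
import math

def back_transform_grid(p, n, k=2, row_block=32):
    """Suggest a pr x pc processor grid and block sizes for back
    transformation: pc about 8x pr, column block size ceil(n/(pc*k))
    for a small integer k, and a fixed row block size."""
    # largest divisor pr of p such that pc = p/pr is still >= 8*pr
    candidates = [d for d in range(1, int(math.isqrt(p)) + 1)
                  if p % d == 0 and p // d >= 8 * d]
    pr = max(candidates) if candidates else 1
    pc = p // pr
    return pr, pc, row_block, math.ceil(n / (pc * k))

print(back_transform_grid(128, 4000))   # -> (4, 32, 32, 63)
```

For 128 processors this yields a 4 × 32 grid, matching the 1:8 aspect ratio recommended above.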
1.6 Jacobi
A one-sided Jacobi method with a two-dimensional data layout will beat tridiagonal based eigensolvers on small spectrally diagonally dominant matrices. The simpler one-dimensional data layout is sufficient for modest numbers of processors, perhaps as many as a few hundred, but does not scale well. Tridiagonal based methods, because they require fewer flops, will beat Jacobi methods on random matrices regardless of their size, and on large (n > 200√p) matrices even if they are spectrally diagonally dominant. Jacobi also remains of interest in some cases when high accuracy is desired[58]. Jacobi's method is discussed in Section 7.3.
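For intuition, here is a tiny serial two-sided cyclic Jacobi eigensolver. The parallel one-sided variant discussed in Section 7.3 organizes the same plane rotations differently; this sketch makes no attempt at efficiency:

```python
import numpy as np

def jacobi_eig(A, sweeps=10, tol=1e-20):
    """Cyclic two-sided Jacobi: sweep over all (p, q) pairs, zeroing
    A[p, q] with a plane rotation each time. O(n^3) work per rotation
    here because J is applied as a dense matrix (for clarity only)."""
    A = A.copy().astype(float)
    n = A.shape[0]
    V = np.eye(n)
    for _ in range(sweeps):
        off = np.sum(A**2) - np.sum(np.diag(A)**2)   # off-diagonal mass
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                # rotation angle that annihilates A[p, q]
                theta = 0.5 * np.arctan2(2.0 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J                       # two-sided update
                V = V @ J                             # accumulate vectors
    w = np.diag(A)
    order = np.argsort(w)
    return w[order], V[:, order]
```

Every rotation touches the whole matrix, which is where the extra flops relative to tridiagonal based methods come from.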
1.7 Where to obtain this thesis
This thesis is available at: http://www.cs.berkeley.edu/stanley/thesis
Chapter 2
Overview of the design space
2.1 Motivation
The execution time of any computational solution to a problem is a single-valued function (time) on a multi-dimensional and non-uniform domain. This domain includes the problem being solved, the algorithm, the implementation of the algorithm and the underlying hardware and software (sometimes referred to collectively as the computer). By studying one problem, the symmetric eigenproblem, in detail we gain insight into how each of these factors affects execution time.
Section 2.2 discusses the most important algorithms for dense symmetric eigendecomposition on distributed memory parallel computers. Section 2.3 discusses the effect that the implementation can have on execution time. Section 2.4 discusses the effect of various hardware characteristics on execution time. Section 2.5 lists several applications that use symmetric eigendecomposition and their differing needs. Section 2.6 discusses the direct and indirect effects of machine load on the execution time of a parallel code. Section 2.7 outlines the most important historical developments in parallel symmetric eigendecomposition.
2.2 Algorithms
The most common symmetric eigensolvers which compute the entire eigendecomposition use Householder reduction to tridiagonal form, form the eigendecomposition of the tridiagonal matrix, and transform the eigenvectors back to the original basis. Algorithms that do not begin with reduction to tridiagonal form require more floating point operations. Except for small spectrally diagonally dominant matrices, on which Jacobi will likely be faster than tridiagonal based methods, and scaled diagonally dominant matrices, on which Jacobi is more accurate[58], tridiagonal based codes will be best for the eigensolution of dense symmetric matrices. See Section 7.3 for details.
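The three phases can be sketched serially with scipy standing in for the LAPACK routines; for a symmetric matrix, Householder reduction to Hessenberg form yields a tridiagonal matrix. This is an illustration of the structure, not the ScaLAPACK code:

```python
import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

rng = np.random.default_rng(1)
n = 50
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                      # dense symmetric matrix

# Phase 1: Householder reduction. For symmetric A the Hessenberg
# form is tridiagonal: A = Q T Q^T.
T, Q = hessenberg(A, calc_q=True)
d, e = np.diag(T), np.diag(T, -1)

# Phase 2: eigendecomposition of the tridiagonal matrix.
w, S = eigh_tridiagonal(d, e)

# Phase 3: back transformation of the eigenvectors.
V = Q @ S
assert np.allclose(A @ V, V * w, atol=1e-8)
```

Phases 1 and 3 each cost O(n³) flops, phase 2 only O(n²) or so per eigenvalue, which is why the matrix transformations dominate for large n.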
The recent work of Parlett and Dhillon offers the promise of computing the tridiagonal eigendecomposition with O(n²) flops and O(p) messages. Should some unexpected hitch prevent this from being satisfactory on some matrix types, there are several other algorithms from which to choose. Experience with existing implementations shows that for most matrices of size 2000 by 2000 or larger, the tridiagonal eigendecomposition is a modest component of total execution time.
Reduction to tridiagonal form and back transformation are the most time consuming steps in the symmetric eigendecomposition of dense matrices. These two steps require more flops (O(n³) vs. O(n²)), more message volume (O(n²√p) vs. O(n²)) and more messages (O(n log(p)) vs. O(p)) than the eigendecomposition of the tridiagonal matrix. Since the cost of the matrix transformations (reduction to tridiagonal form and back transformation) grows faster than the cost of the tridiagonal eigendecomposition, the matrix transformations are the dominant cost for larger matrices.
Reduction to tridiagonal form and back transformation require different communication patterns. Reduction to tridiagonal form is a two-sided transformation requiring multiplication by Householder reflectors from both the left and the right side. Two-sided reductions require that every element in the trailing matrix be read for each column eliminated, hence half of the flops are BLAS2 matrix-vector flops and O(n log(p)) messages are required.

Equally importantly, two-sided reductions require significant calculations within the inner loop, which translates into large software overhead. Indeed, on the computers that we considered, software overhead appears to be a larger factor in limiting the efficiency of reduction to tridiagonal form than communication.
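An unblocked serial sketch makes the structure concrete: for each eliminated column, a symmetric matrix-vector product (a DSYMV) reads the entire trailing matrix, followed by a symmetric rank-2 update (a DSYR2). This is illustrative code, not the ScaLAPACK implementation:

```python
import numpy as np

def tridiagonalize(A):
    """Unblocked Householder reduction to tridiagonal form.
    Returns (d, e, Q) with A = Q * tridiag(d, e) * Q^T."""
    A = A.copy().astype(float)
    n = A.shape[0]
    Q = np.eye(n)
    for k in range(n - 2):
        x = A[k + 1:, k]
        norm_x = np.linalg.norm(x)
        if norm_x == 0.0:
            continue                               # column already zero
        alpha = -norm_x if x[0] >= 0 else norm_x   # avoid cancellation
        v = x.copy()
        v[0] -= alpha
        v /= np.linalg.norm(v)                     # unit Householder vector
        B = A[k + 1:, k + 1:]                      # trailing matrix (a view)
        # Symmetric matrix-vector product (DSYMV): reads the whole
        # trailing matrix -- half the flops of the reduction live here.
        p = 2.0 * (B @ v)
        w = p - (p @ v) * v
        # Symmetric rank-2 update (DSYR2): B <- B - v w^T - w v^T
        B -= np.outer(v, w) + np.outer(w, v)
        A[k + 1, k] = A[k, k + 1] = alpha
        A[k + 2:, k] = A[k, k + 2:] = 0.0
        Q[:, k + 1:] -= 2.0 * np.outer(Q[:, k + 1:] @ v, v)
    return np.diag(A).copy(), np.diag(A, -1).copy(), Q
```

The DSYMV inside the loop is exactly the inner-loop work that cannot be deferred into a blocked BLAS3 call, which is the source of both the BLAS2 flops and the software overhead discussed above.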
Back transformation is a one-sided transformation with updates that can be formed any time prior to their application. Hence back transformation requires O(n/nb) messages (where nb is the data layout block size) and far less software overhead than reduction to tridiagonal form.

Chapters 4 and 5 discuss the execution time of reduction to tridiagonal form and back transformation, as implemented in ScaLAPACK, in detail.
2.3 Implementations
2.3.1 Parallel abstraction and languages
There are three common ways of expressing parallelism in linear algebra codes: message passing, shared memory and calls to the BLAS. Message passing programs tend to keep communication to a minimum, in part because the communication is specified directly. Shared memory codes can outperform message passing codes when load imbalance costs outweigh communication costs[118]. All calls to the BLAS offer potential parallelism, though the potential for speedup varies. ScaLAPACK uses message passing, while LAPACK exposes parallelism through calls to the BLAS.
In some cases, recent compilers are able to identify the parallelism in codes that may not have been written specifically for parallel execution[172, 171]. However, experience has shown that programs designed for sequential machines rarely exhibit the properties necessary for efficient parallel execution; hence some research into parallelizing compilers has switched its emphasis to parallelizing codes written in languages such as HPF[94, 110] which allow the programmer to express parallelism and allow some control over data layout.
Codes written in any standard sequential language, such as C, C++ or Fortran, can achieve high performance, especially if the majority of the operations are performed within calls to the BLAS. If the flops are performed within codes written in the language itself, the execution time will depend upon the code and the compiler more than on the language used. If pointers are used carelessly in C, the compiler may not be able to determine the data dependencies exactly and may have to forgo certain optimizations[172]. On the other hand, carefully crafted C codes, tuned for individual architectures and compiled with modern optimizing compilers, can achieve performance that rivals that of carefully tuned assembly codes[23, 168].
2.3.2 Algorithmic blocking
A blocked code is one that has been recast to allow some of the flops to be performed as efficient BLAS3 matrix-matrix multiply flops[6, 4]. Typically a block of columns is reduced using an unblocked code, followed by a matrix-matrix update of the trailing matrix. The algorithmic blocking factor is the number of columns (or rows) in the block column. In serial codes, data layout blocking does not exist and hence the algorithmic blocking factor is referred to simply as the blocking factor. In ScaLAPACK version 1.5, the algorithmic blocking factor is set to match the data layout blocking factor.
2.3.3 Internal Data Layout
Most of the flops in blocked dense linear algebra codes involve a rank-k update, i.e. A′ = A + B·C where A ∈ R^(m×n), B ∈ R^(m×k), C ∈ R^(k×n), m and n are O(n), and k is the algorithmic blocking factor (a tuning parameter typically much smaller than n or m). A may be triangular, and B and/or C may be transposed or conjugate transposed. Hence the internal data layout must support good performance on such rank-k updates.

A is typically updated in place, i.e. the node which owns element A_{i,j} computes and stores A′_{i,j}. This is called the owner computes rule and is motivated by the high cost of data movement relative to the cost of floating point computation. If k is large enough, a 3D data layout is more efficient[1, 12], and performance can be improved further by using Strassen's matrix multiply[157, 96, 70]. Some dense linear algebra codes, including LU, can be recursively partitioned[165], resulting in large values of k for the majority of the flops. Nonetheless, though a 3D data layout might be best for a recursively partitioned LU, reduction to tridiagonal form is most efficient with a modest algorithmic blocking factor, and hence it is more efficient to update A in place; we will make that assumption for the rest of this discussion.
If A is to be updated in place, a 2D layout minimizes the total communication requirement for rank-k updates. The elements of B and C which must be sent to each node are determined by the elements of A owned by that node. The node that owns element A_{i,j} must obtain a copy of row i of B and column j of C. The number of elements of matrices B and C that a given node must obtain is k times the number of rows and columns of A for which the node owns at least one element. If a node must own r² elements, the number of elements of B and C which must be obtained is minimized if the node owns a square submatrix of A corresponding to r rows and r columns. In a 2D layout, the processors are arranged in a rectangular grid. Each row of the matrix is assigned to a row of the processor grid. Each column is assigned to a column of the processor grid.
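The counting argument above is easy to check numerically. The function below compares a 1D row layout with a square 2D layout (an illustrative sketch):

```python
import math

def recv_volume(n, p, k, layout):
    """Elements of B and C one node must obtain for the rank-k update
    A <- A + B*C under the owner-computes rule: k per owned row of A
    (from B) plus k per owned column of A (from C)."""
    if layout == "1d":                 # a node owns n/p complete rows of A
        rows, cols = n / p, n
    elif layout == "2d":               # square grid: an (n/sqrt(p))^2 block
        rows = cols = n / math.sqrt(p)
    else:
        raise ValueError(layout)
    return k * (rows + cols)

print(recv_volume(1024, 64, 32, "1d"))   # 32 * (16 + 1024)
print(recv_volume(1024, 64, 32, "2d"))   # 32 * (128 + 128)
```

With 64 processors the 1D layout forces each node to fetch all of C, roughly four times the volume of the 2D layout in this example.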
The common ways of assigning the rows and columns to the processor grid in a 2D layout are: block, cyclic and block-cyclic. For the following descriptions, we will assume that we are distributing n rows of A over pr processor rows. In a cyclic layout, row i is assigned to processor row (i − 1) mod pr. In a block layout, row i is assigned to processor row ⌊(i − 1)/⌈n/pr⌉⌋. In a block-cyclic data layout, row i is assigned to processor row ⌊(i − 1)/nb⌋ mod pr, where nb is the data layout block size. The block-cyclic data layout includes the other two as special cases.
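The three assignments, and the claim that block-cyclic subsumes the other two, can be verified directly (processor rows numbered from 0, matrix rows from 1):

```python
import math

def cyclic(i, pr):
    return (i - 1) % pr

def block(i, n, pr):
    return (i - 1) // math.ceil(n / pr)

def block_cyclic(i, nb, pr):
    return ((i - 1) // nb) % pr

n, pr = 10, 3
rows = range(1, n + 1)

# nb = 1 reduces block-cyclic to the cyclic layout ...
assert [block_cyclic(i, 1, pr) for i in rows] == [cyclic(i, pr) for i in rows]
# ... and nb = ceil(n/pr) reduces it to the block layout.
nb = math.ceil(n / pr)
assert [block_cyclic(i, nb, pr) for i in rows] == [block(i, n, pr) for i in rows]
```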
Block-cyclic data layouts simplify algorithmic blocking and are used in most parallel dense linear algebra libraries[68, 98, 164]. However, by separating algorithmic blocking from data blocking it is usually¹ possible to achieve high performance from a cyclic data layout[91, 140, 44, 158].

One-dimensional data layouts require O(n²) data movement per node (compared to O(n²/√p) for 2D data layouts) and are generally less efficient. However, there are certain situations in which 1D data layouts are preferred. If the communication pattern is strictly one-dimensional (i.e. only along rows or columns), a 1D data layout requires no communication. Furthermore, some applications, such as LU, require much more communication in one direction than in the other². Hence, for modest numbers of processors it may be better to use a 1D data layout.
A square processor grid can greatly simplify symmetric reductions, allowing lower overhead codes. Furthermore, I believe that pipelining and lookahead (see Section 2.4.2) can only be used effectively on symmetric reductions (such as Cholesky and reduction from generalized to standard form) when a square processor grid is used³.

All existing parallel dense linear algebra libraries use the same input data layout as the internal data layout. In Chapter 8 I will demonstrate that this is not necessary to achieve high performance, and that in fact performance can be improved by using a different data layout internally than the input and output data layout.
¹ Block-cyclic data layouts still maintain an advantage over cyclic data layouts on machines with high communication latency, especially in those algorithms, such as Cholesky and back transformation, that require only O(n/nb) messages, where nb is the data layout block size.
² LU with partial pivoting requires O(n log(p)) messages within the processor columns but only O(n/nb) messages within the processor rows[31, 40, 30]. The total volume of communication, however, is similar in both directions.
³ Pipelining and lookahead cannot be used in reduction to tridiagonal form because of its synchronous nature.
2.3.4 Libraries
Software libraries can improve portability, robustness, performance and software re-use. ScaLAPACK is built on top of the BLAS and BLACS and hence will run on any system on which a copy of the BLAS[63, 62] and BLACS[169, 69] can be obtained.

Libraries, and their interfaces, have both a positive and a negative effect on performance. The existence of a standard interface to the BLAS means that by improving the performance of a limited set of routines, i.e. the BLAS, one can improve the performance of the entire LAPACK and ScaLAPACK library and other codes as well. Hence, many manufacturers have written optimized BLAS for their machines. In addition, Bilmes et al.[23, 168] have written a portable high performance matrix-matrix multiply, and two other research groups have written high performance BLAS that depend only on the existence of a high performance matrix-matrix multiply[51, 103, 104]. Portable high performance BLAS offer the promise of high performance on LAPACK and ScaLAPACK codes without the expense of hand coded BLAS.
However, adhering to a particular library interface necessarily rules out some possibilities. The BLACS do not support asynchronous receives, a costly limitation on the Paragon. The BLAS do not meet all computational needs[108], especially in parallel codes[91]; hence the programmer is faced with the choice of reformulating code to use what the BLAS offer or avoiding the BLAS and trusting the compiler to produce high performance code. Furthermore, the interface itself implies some overhead, at the very least a subroutine call but typically much more than that[161]. Strazdins[161] showed that software overhead in ScaLAPACK accounts for 15-20% of total execution time even for the largest problems that fit in memory on a Fujitsu VP1000.
2.3.5 Compilers
Compiler code generation is relatively unimportant to LAPACK and ScaLAPACK
performance, because these codes are written so that most of the work is done in the calls
to the BLAS. By contrast, EISPACK is written in Fortran without calls to the BLAS and hence
its performance is dependent on the quality of the code generated by the Fortran compiler.
Lehoucq and Carr[35] argue that compilers now have the capability to perform many of the optimizations that the LAPACK project performed by hand. Although no compilers existing today can produce code as efficient as LAPACK from simple three line loops, the compiler technology exists[149, 115, 148].
Today, most compilers are able to produce good code for single loops, reducing the
performance advantage of the BLAS1 routines. Soon compilers will be able to produce good
code for BLAS2 and even BLAS3 routines. This will require us to rethink certain decisions,
especially where the precise functionality that we would like is lacking. There will be an
awkward period, probably lasting decades, during which some but not all compilers will be
able to perform comparably to the BLAS.
2.3.6 Operating Systems
Operating systems are largely irrelevant to serial codes such as LAPACK, but they can have a significant impact on parallel codes. Consider, for example, the broadcast capability inherent in Ethernet hardware: it is unavailable to applications because the TCP/IP protocol does not expose it. Furthermore, at least 90% of message latency cost is attributable to software, and the operating system often makes it difficult to reduce that cost. Part of the NOW[3] project involves finding ways to reduce the large message latency inherent in Unix operating systems by using user-level to user-level communication, avoiding the operating system entirely.
2.4 Hardware
2.4.1 Processor
The processor, or more specifically the floating point unit, is the fundamental source of processing power or the ultimate limit on performance, depending on your point of view. The combined speed of all of the floating point units is the peak performance, or speed of light, for that computer. For many dense linear algebra codes, the number of floating point operations cannot be reduced substantially, and hence the goal is to perform the necessary flops as fast (i.e. as close to peak performance) as possible.
Floating point arithmetic
The increasing adherence to the IEEE standard 754 for binary floating point arithmetic[7] benefits performance in two ways: it reduces the effort needed to make codes work across multiple platforms, and it allows one to take advantage of details of the underlying arithmetic in a portable code. The developers of LAPACK had to expend considerable effort to make their codes work on machines with non-IEEE arithmetic, notably older Cray machines. By contrast, the developers of ScaLAPACK chose to concentrate on machines conforming to IEEE standard 754, allowing them not only to avoid the hassles of old Cray arithmetic, but also to check the sign bit directly when using bisection[54] to compute the eigenvalues of a tridiagonal matrix.
Consistent floating point arithmetic is also important for execution on heterogeneous machines. Demmel et al.[54] discuss ways to achieve correct results in bisection on a heterogeneous machine. I have proposed having each process compute a subset of eigenvalues, chosen by index, sharing those eigenvalues among all processes and then having each process independently sort the eigenvalues[55].
Ironically, the one place where the IEEE standard 754 allows some flexibility has caused problems for heterogeneous machines. The IEEE standard 754 allows several options for handling sub-normalized numbers, i.e. numbers that are too small to be represented as normalized numbers. During ScaLAPACK testing it was discovered that a sub-normalized number could be produced on a machine that adheres to the IEEE standard 754 completely, and that when this number is passed to the DEC Alpha 21064 processor, the DEC Alpha 21064 does not recognize it as a legitimate number and aborts. Fixing this would have required xdr to be smart enough to recognize this unusual situation4 or making one of the processors work in a manner different from its default5.
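The hazard is easy to reproduce in any language whose floats are IEEE doubles. The sketch below (Python, illustrative only) produces a sub-normalized number by gradual underflow and detects it by comparison against the smallest normalized double; on a processor that, like the DEC Alpha 21064 in its default mode, traps on sub-normalized operands, merely consuming such a value aborts the program.

```python
import sys

# Smallest positive normalized IEEE double, 2^-1022.
smallest_normal = sys.float_info.min

# Ordinary arithmetic produces a sub-normalized (denormal) result by
# gradual underflow; a processor that traps on such operands aborts here.
subnormal = smallest_normal / 2.0

def is_subnormal(x):
    """True for nonzero doubles smaller in magnitude than the smallest normal."""
    return x != 0.0 and abs(x) < sys.float_info.min
```

Note that gradual underflow preserves ordering (subnormal is still strictly greater than zero), which is precisely why a conforming machine can produce such values in the course of a correct computation.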
2.4.2 Memory
The slower speed of main memory (as compared to cache or registers) affects performance in three ways. It reduces the performance of matrix-matrix multiply slightly and greatly complicates the task of coding an efficient matrix-matrix multiply. It bounds from below the algorithmic blocking factor needed to achieve high performance in matrix-matrix multiply. And, it limits the performance of BLAS1 and BLAS2 codes.
The last two factors listed above combine in an unfortunate manner: slow main memory increases the number of BLAS1 and BLAS2 flops and reduces the rate at which they are executed. The number of BLAS1 and BLAS2 flops is typically O(n^2 nb), where nb is the algorithmic blocking factor, which, as stated above, must be larger when main memory is slow. The ratio of peak floating point performance to main memory speed is large enough on some machines that the O(n^2 nb) cost of the BLAS1 and BLAS2 flops can no longer be ignored.
4. This would slow down xdr, possibly significantly.
5. This too would result in slower execution.
Improving the load balance of the O(n^2 nb) BLAS1 and BLAS2 flops
In a blocked dense linear algebra transformation, such as LU decomposition, Cholesky or QR, there are O(n^2 nb) BLAS1 and BLAS2 flops[30, 53]. PDSYEVX includes two blocked dense linear algebra transformations: reduction to tridiagonal form, PDSYTRD, is described in Section 4.2, and back transformation, PDORMTR, is described in Section 4.4.
In ScaLAPACK version 1.5, the O(n^2 nb) BLAS1 and BLAS2 flops are performed by just one row or column of processors. This leads to load imbalance and causes these flops to account for O(n^2 nb / sqrt(p)) execution time. If these flops can be performed on all p processors, instead of just one row or column, they will account for only O(n^2 nb / p) execution time.
There are two ways to spread the cost of the O(n^2 nb) BLAS1 and BLAS2 flops over all the processors: take them out of the critical path or distribute them over all processors. Transformations such as LU and back transformation (applying a series of Householder vectors) can be pipelined, allowing each processor column (or row) to execute asynchronously. Pipelining in turn allows lookahead, a process by which the active column performs only those computations in the critical path before sending that data on to the next column[32].
Distributing the BLAS1 and BLAS2 flops over all of the processors, as discussed in the last paragraph, requires a different data distribution, a different broadcast and a significant change to the code. The difference is best illustrated by considering LU. In a 2D blocked LU, LU is first performed on a block of columns, and the resulting LU decomposition is broadcast, or spread, across all processor columns. One way to broadcast k elements to p processors is to combine a Reduce scatter (which takes k elements and sends k/p to each processor) with an Allgather (which takes k/p elements from each processor and spreads them out to all processors, giving each processor a copy of all k elements). There are three ways to perform LU on this column block of data: 1) Before the column block is broadcast to all processors (as ScaLAPACK does), in which case only the current column of processors is involved in performing the column LU, and the Reduce scatter and Allgather combine to broadcast the block LU decomposition. 2) After the broadcast, in which case the Reduce scatter and Allgather combine to broadcast the block column prior to the LU decomposition; all processor columns would have a copy of the block column, and each processor column could perform the column block LU redundantly. 3) After the Reduce scatter but before the Allgather. In this case, the Reduce scatter operates on the column block prior to the LU decomposition but the Allgather operates on the block column after the LU decomposition. All processors can be involved in the LU decomposition.
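The equivalence underlying all three options, that a Reduce scatter followed by an Allgather amounts to an all-reduce, and hence to a broadcast when only one process contributes, can be checked with a toy model. The Python below simulates p processes as lists; reduce_scatter and allgather are simplified stand-ins for the MPI collectives MPI_Reduce_scatter and MPI_Allgather:

```python
def reduce_scatter(local_data, p):
    """local_data[rank] is each process's length-k vector; returns, for each
    process, its k/p chunk of the elementwise sum across all processes."""
    k = len(local_data[0])
    chunk = k // p
    total = [sum(col) for col in zip(*local_data)]
    return [total[i * chunk:(i + 1) * chunk] for i in range(p)]

def allgather(chunks):
    """Each process contributes its chunk; every process receives the
    concatenation of all chunks."""
    full = [x for c in chunks for x in c]
    return [full[:] for _ in range(len(chunks))]

p, k = 4, 8
# Broadcast as a special case of all-reduce: only process 0 contributes.
data = [[float(i) for i in range(k)] if rank == 0 else [0.0] * k
        for rank in range(p)]
result = allgather(reduce_scatter(data, p))
# every process now holds process 0's k elements
```

Option 3 exploits exactly the intermediate state of this pipeline: after the Reduce scatter, each process owns a distinct k/p chunk, so all p processes can work on their chunks before the Allgather reassembles the full result everywhere.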
In HJS, Hendrickson, Jessup and Smith's symmetric eigensolver[91, 154] discussed in Section 7.1.2, the BLAS1 and BLAS2 flops are analogously distributed over all of the processors.
Lookahead does not improve performance unless the execution of the code is
pipelined, i.e. proceeds in a wave pattern over the processes. Two-sided reductions, like
tridiagonal reduction, do not allow pipelining. And, pipelining may be limited on reductions
of symmetric or Hermitian matrices (such as Cholesky)6.
Memory size
The amount of main memory limits the size of the problem that can be executed efficiently, while the amount of virtual memory limits the size of the problem that can be run at all. ScaLAPACK's symmetric eigensolvers, PDSYEVX and PDSYEV, require roughly 4n^2 and 2n^2 double precision words of virtual memory respectively. However, both can be run efficiently provided that physical memory can contain7 the n^2/2 elements of the triangular matrix A. Ed D'Azevedo[52] has written an out-of-core symmetric eigensolver for ScaLAPACK and studied the performance of PDSYEV and PDSYEVX on large problem sizes.
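A hypothetical back-of-the-envelope helper (not part of ScaLAPACK) makes these workspace figures concrete:

```python
def eigensolver_memory_gb(n, words_factor):
    """Approximate virtual-memory footprint in GB: words_factor * n^2
    double precision (8-byte) words. Per the text, words_factor is
    roughly 4 for PDSYEVX and 2 for PDSYEV."""
    return words_factor * n * n * 8 / 2**30

# A 10000 x 10000 PDSYEVX run needs roughly 3 GB of virtual memory.
approx = eigensolver_memory_gb(10000, 4)
```
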
2.4.3 Parallel computer configuration
I will discuss primarily distributed memory computers with one processor per node, discussing shared memory computers (SMPs), clusters of workstations and clusters of shared memory computers only briefly.
Four machine characteristics are important for distributed memory computers: peak floating point performance, software overhead, communication latency and communication (bisection) bandwidth. Software overhead and communication latency are the dominant costs for small problems8. Peak floating point performance is the dominant cost for large problems9.
6. I believe that pipelining can be used in Cholesky if a square processor grid is used. Work in progress.
7. Depending on the page size, keeping an n by n triangular matrix in memory may require as few as n^2/2 memory locations (if the page size is 1) or as many as n^2 (if the page size is >= n).
Interconnection network
Bisection bandwidth and communication latency are the two important measures of an interconnection network. Networks which allow only one pair of nodes to communicate at a time do not offer adequate bisection bandwidth, and hence parallel dense linear algebra (with the possible exception of huge matrix-matrix multiplies) will not perform well on such a network.
As long as the bisection bandwidth is adequate, the topology of the interconnection network has not proven to be an important factor in the performance of parallel dense linear algebra.
Shared Memory Multiprocessing
Users of dense linear algebra codes have two choices on shared memory multiprocessors. They can use a serial code, such as LAPACK, that has been coded in terms of the BLAS and, provided that the manufacturer has supplied an optimized BLAS, they will achieve good performance. Or, provided that the manufacturer provides MPI[65], PVM[19] or the BLACS, they can use ScaLAPACK.
LeBlanc and Markatos[118] argue that shared memory codes typically get better load balance while message passing codes typically incur lower communication costs. However, the real difference could well come down to how efficient the underlying libraries are.
Clusters of workstations
Some clusters of workstations, notably the NOW project[3] at Berkeley, offer communication performance comparable to that of distributed memory computers. However, the vast majority of networks of workstations in present use are still connected by Ethernet or FDDI rings and hence do not have the low latency and high bisection bandwidth required to perform dense linear algebra reductions efficiently in parallel.
8. On current architectures, n < 100*sqrt(p) is small for our purposes.
9. On current architectures, n > 1000*sqrt(p) is large for our purposes.
Cluster of SMPs (CLUMPS)
Dense linear algebra codes have two choices on clusters of SMPs: they can assign one process to each processor or one process to each multi-processor node. The tradeoff will be similar to the shared-memory versus message-passing question on shared memory computers.
If each processor is assigned a separate process, the details of how the processes are assigned to what is essentially a two-level grid of processors will be important. For a modest cluster of SMPs (say 4 nodes, each with 4 processors) it might make sense to assign one dimension within the node and the other across the nodes. However, this will not scale well: adding nodes will require increasing the bandwidth per node, else all dense linear algebra transformations will become bandwidth limited as the number of nodes increases. A layout that is 2 dimensional within the nodes and 2 dimensional among the nodes allows both the number of processors per node and the number of nodes to increase, provided only that bisection bandwidth grows with the number of processors and that internal bisection bandwidth (i.e. main memory bandwidth) grows with the number of processors per node.
On the first CLUMPS, how well each of the libraries is implemented is likely to outweigh theoretical considerations. Shared memory BLAS are not trivial, nor will be communication systems that properly handle two levels of processor hierarchy, i.e. communication within a node and communication between nodes.
On most distributed memory systems, the logical to physical processor grid mapping is of secondary importance. I suspect that this will not be the case for clusters of SMPs. It will be important to have the processes assigned to the processors of a particular node be nearby in the logical process grid as well.
2.5 Applications
Large symmetric eigenproblems arise in a variety of applications, including: real-time signal processing[156, 34], modeling of acoustic and electro-magnetic waveguides[114], quantum chemistry[74, 22, 175], numerical simulations of disordered electronic systems[95], vibration mode superposition analysis[18], statistical mechanics[132], molecular dynamics[152], quantum Hall systems[112, 106], material science[166], and biophysics[143, 144].
The needs of these applications differ considerably. Many require considerable execution time to build the matrix, and hence the eigensolution remains a modest part of the total execution time. However, building the matrix often parallelizes easily and its cost grows much more slowly than the O(n^3) cost of eigensolution. Hence, for these applications, the eigensolver becomes the bottleneck as larger problems are solved in parallel. Few applications require the entire spectrum, but most of those listed above require at least 10% of the spectrum and hence are best solved by dense techniques. Some have large clusters of eigenvalues[74], while others do not.
2.5.1 Input matrix
Three features of the input matrix affect the execution time of symmetric eigensolvers: sparsity, eigenvalue clustering and spectral diagonal dominance.
Sparsity
Some algorithms and codes are specifically designed for sparse input matrices. Lanczos[49] has traditionally been used to find a few eigenvalues and eigenvectors at the ends of the spectrum. Recently, ARPACK[119] and PARPACK[130] have been developed based on Lanczos with full re-orthogonalization. They can therefore compute as much of the spectrum as the user chooses.
The Invariant Subspace Decomposition Approach and algorithms based on reduction to tridiagonal form can both be run from either a dense or a banded matrix. In this dissertation, I discuss only dense matrices.
Spectrum
Some algorithms are more dependent on the spectrum than others. Most are dependent in some manner, but that dependence differs from one algorithm to another.
It is difficult to maintain orthogonality of the eigenvectors when computing the eigendecomposition of matrices with tight clusters of eigenvalues. Such matrices require special techniques in divide and conquer and in inverse iteration (see Section 2.7.4). On the other hand, divide and conquer experiences the most deflation, and hence the greatest efficiency, on matrices with clustered eigenvalues.
The Invariant Subspace Decomposition Approach maintains orthogonality on matrices with clustered eigenvalues. However, it may have difficulty picking a good split point if the clustering causes the eigenvalues to be unevenly distributed.
Spectral diagonal dominance
Spectral diagonal dominance10 speeds convergence of the Jacobi algorithm. Indeed, if the input matrix is sufficiently diagonally dominant, Jacobi may converge in as few as two steps (versus 10 to 20 for non diagonally dominant matrices). But, spectral diagonal dominance has little effect on any of the other algorithms.
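The behavior described above is easy to observe in miniature. The following pure-Python sketch (a textbook cyclic Jacobi sweep, not any production code) applies one sweep of plane rotations to a strongly diagonally dominant 3 by 3 matrix; a single sweep collapses the off-diagonal mass:

```python
import math

def jacobi_sweep(A):
    """One cyclic Jacobi sweep: for each off-diagonal pair (p, q), apply the
    orthogonal plane rotation that zeroes A[p][q]. A is a symmetric
    list-of-lists, modified in place."""
    n = len(A)
    for p in range(n - 1):
        for q in range(p + 1, n):
            if A[p][q] == 0.0:
                continue
            # rotation angle from the 2x2 symmetric eigenproblem in plane (p, q)
            theta = 0.5 * math.atan2(2.0 * A[p][q], A[q][q] - A[p][p])
            c, s = math.cos(theta), math.sin(theta)
            for k in range(n):      # apply rotation to columns p and q
                akp, akq = A[k][p], A[k][q]
                A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
            for k in range(n):      # apply rotation to rows p and q
                apk, aqk = A[p][k], A[q][k]
                A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk

def off(A):
    """Frobenius norm of the off-diagonal part of A."""
    return math.sqrt(sum(A[i][j] ** 2 for i in range(len(A))
                         for j in range(len(A)) if i != j))

# Diagonally dominant test matrix: large, well-separated diagonal entries.
A = [[10.0, 0.1, 0.2],
     [0.1, 20.0, 0.3],
     [0.2, 0.3, 30.0]]
before = off(A)
jacobi_sweep(A)
after = off(A)   # off-diagonal mass drops sharply after one sweep
```

Because the rotations are orthogonal similarity transformations, the trace (and the spectrum) is preserved while the matrix is driven toward diagonal form; on a matrix this dominant, the rotation angles are tiny and fill-in is second order, which is the mechanism behind the fast convergence noted above.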
2.5.2 User request
The portion of the spectrum that the user needs, i.e. the number of eigenvalues and/or eigenvectors, affects the execution time of some, but not all, eigensolvers.
Two step band reduction (to tridiagonal form) is most attractive when only eigenvalues are requested, because the back transformation task is expensive in two step band reduction.
The cost of bisection and inverse iteration depends upon the number of eigenvalues and eigenvectors requested. These costs are O(n^2) and generally not significant for large problem sizes. However, back transformation requires 2n^2 m flops, where m is the number of eigenvectors required.
Iterative methods, such as Lanczos[49] and implicitly restarted Lanczos[119], are clearly superior if only a few eigenvectors are required.
10. Spectrally diagonally dominant means that the eigenvector matrix, or a permutation thereof, is diagonally dominant. Most, but not all, diagonally dominant matrices are spectrally diagonally dominant. For example, if you take a dense matrix with elements randomly chosen from [-1, 1] and scale the diagonal elements by 1e3, the resulting diagonally dominant matrix will be spectrally diagonally dominant. However, if you take that same matrix and add 1e3 to each diagonal element, the eigenvector matrix is unchanged even though the matrix is clearly diagonally dominant.
2.5.3 Accuracy and orthogonality requirements
Demmel and Veselić[58] prove that on scaled diagonally dominant matrices11, Jacobi can compute small eigenvalues with high relative accuracy while tridiagonal based methods can fail to do so.
At present, ScaLAPACK offers two symmetric eigensolvers: PDSYEVX and PDSYEV. PDSYEVX, which is based on bisection and inverse iteration (DSTEBZ and DSTEIN from LAPACK), is faster and scales better but does not guarantee orthogonality among eigenvectors associated with clustered eigenvalues. PDSYEV, which is based on QR iteration (DSTEQR from LAPACK), is slower and does not scale as well but does guarantee orthogonal eigenvectors.
2.5.4 Input and Output Data layout
At present, the execution time of the ScaLAPACK symmetric eigensolver is strongly dependent on the data layout chosen by the user for the input and output matrices. 1D data layouts are not scalable and lead to both high communication costs and poor load balance. Suboptimal block sizes can likewise affect performance significantly. In particular, a block size of 1, i.e. a cyclic data layout, causes ScaLAPACK to send a large number of small messages, resulting in unacceptable message latency costs and a huge number of calls to the BLAS. If the block size is too large, load balance suffers.
There are a couple of ways to reduce this dependence on the data layout chosen by the user. If algorithmic blocking is separated from data layout blocking[140, 91, 159], small data layout block-sizes can be handled much more efficiently. However, small block-sizes (especially cyclic layouts) still require more messages than larger block-sizes. And, large block sizes still lead to load imbalance.
In Chapter 8 I will show that redistributing the data to an internal format that is near optimal for the particular machine and algorithm involved improves performance and makes that performance independent of the input and output data layout.
2.6 Machine Load
The load on the machine, in addition to the direct effect of offering your program only a portion of the total cycles, can have several indirect effects. If each processor is individually scheduled, performance can be arbitrarily poor, because significant progress is only possible when all processes are scheduled concurrently. A loaded machine may also cause your data to be swapped out to disk, which can greatly reduce performance. Finally, it is the most heavily loaded processor which controls execution time. If your code is running on 9 unloaded processors and one processor with a load factor of 5, you will get no more than a factor of 10/5 speedup. A ScaLAPACK user has reported performance degradation and speedup less than 1 (i.e. more processors taking longer to complete the same sized eigendecomposition) on the IBM SP2. I have also witnessed this behavior on the IBM SP2 at the University of Tennessee at Knoxville, and I have reason to suspect that the IBM SP2 is not gang scheduled and that this fact accounts for a large part of the poor performance of PDSYEVX that the user and I have witnessed on the IBM SP2.
11. A matrix A is scaled diagonally dominant if and only if DAD, with D = |diag(A)|^(-1/2), is diagonally dominant.
Space sharing, i.e. allocating disjoint subsets of the processors to each job, solves all of these problems, but has problems of its own. On some machines, jobs running on different partitions share the same communication paths, and hence if one job saturates the network, all jobs may suffer.
2.7 Historical notes
2.7.1 Reduction to tridiagonal form and back transformation
Householder reduction to tridiagonal form is a two-sided reduction, which requires multiplication by Householder reflectors from both the left and the right side. Martin et al. implemented reduction to tridiagonal form in Algol[129]. TRED1 and TRED2 perform reduction to tridiagonal form in EISPACK[153]. Dongarra, Hammarling and Sorensen[64] showed that Householder reduction to tridiagonal form can be performed using half matrix-vector and half matrix-matrix multiply flops. This has been implemented as DSYTRD in LAPACK[5, 67] for scalar and shared memory multiprocessors and as PDSYTRD for distributed memory computers in ScaLAPACK[42]. Chang et al. implemented one of the first parallel codes for reduction to tridiagonal form, first using a 1D cyclic data layout[37] and then a 2D cyclic data layout[38].
Smith, Hendrickson and Jessup[91] show that data layout blocking is not required for efficient algorithmic blocking and that PDSYTRD pays a substantial execution time penalty for its generality (accepting any processor layout) and portability (being built on top of the PBLAS, BLACS and BLAS). By restricting their attention to square processor layouts on the PARAGON, they were able to dramatically reduce the overhead incurred in reduction to tridiagonal form in HJS. HJS does not have the redundant communication found in PDSYEVX; it makes many fewer BLAS calls, avoids the overhead of the PBLAS calls, and spreads the work more evenly among all the processors (improving load balance). Furthermore, HJS, by using communication primitives better suited to the task, reduces both the number of messages sent and the total volume of communication substantially. Some, but not all, of these advantages require that the processor layout be square. HJS is discussed in Section 7.1.2.
Other ways to reduce the execution time of reduction to tridiagonal form do not require that the processor layout be square. Bischof and Sun[25] and Lang[116] showed that in a two step band reduction to tridiagonal form, asymptotically all of the flops can be performed in matrix multiply routines. Karp, Sahay, Santos and Schauser[107] showed that subset broadcasts and reductions can be performed optimally. Van de Geijn and others[16] are working to implement improved subset broadcast and reduction primitives.
Hegland et al.[90] argue that the fastest way to reduce a symmetric matrix A to tridiagonal form on the VPP500 (a multiprocessor vector supercomputer by Fujitsu) is to compute L_1 D L_1^T = A and then compute a series of L_i using orthonormal transformations such that L_{n+p-1} D L_{n+p-1}^T is tridiagonal. Their technique is, in essence, a two step band reduction in which the two steps are performed within the same loop. Let L_i[:, own(α)] represent the columns of L_i owned by processor α, and let αQ_i denote the portion of Q_i which processor α owns.
The code is:
  L_1 D L_1^T = A
  For i = 1 to n-1 do:
    Each processor independently performs:
      αQ_i = House(L_i[:, own(α)] D_i[own(α), own(α)] L_i[:, own(α)]^T)
      L_{i+1}[:, own(α)] = αQ_i L_i[:, own(α)]
    The processors together perform:
      Allgather(L_{i+1}[:, i+1 : i+p])
    Each processor performs redundantly:
      Q'_i = House(L_{i+1}[:, i+1 : i+p] D[i+1 : i+p, i+1 : i+p] L_{i+1}[:, i+1 : i+p]^T)
      L_{i+1}[:, i+1 : i+p] = Q'_i L_{i+1}[:, i+1 : i+p]
In Allgather(L_{i+1}[:, i+1 : i+p]), each processor contributes the column of L_{i+1}[:, i+1 : i+p] which it owns, and all processors end up with identical copies of L_{i+1}[:, i+1 : i+p].
The loop invariants are as follows. Let T_i = (L_i) D (L_i)^T. Then:
  T_i(j, k) = 0 for all j < i and k > j + p    (Line 1)
  T_i(1 : i-p, 1 : i-p) is tridiagonal    (Line 2)
For p = 1, the serial case, both of these conditions are identical, and meeting them requires computing the first column of (L_i) D (L_i)^T, computing the Householder vector and applying it to L_i to yield L_{i+1}.
For p > 1, the parallel case, the first loop invariant is maintained by each processor independently computing the first column of (L_i) D (L_i)^T, using only the local columns12 of L_i. A Householder vector is computed from this and applied to the local columns of L_i. The second loop invariant is maintained redundantly on all processors. All processors obtain copies of columns i to i+p-1 of L_i and compute A(1:p, 1) = L_i(i : i+p-1, i : i+p-1) D(i : i+p-1, i : i+p-1) L(i : i+p-1, i)^T. A Householder vector is computed from A(1:p, 1) and applied to L_i(i : i+p-1, :), redundantly on all processors, maintaining the second loop invariant.
This one-sided transformation requires fewer messages than Hessenberg reduction to tridiagonal form and, for small p, less message volume, but requires twice as many flops.
2.7.2 Tridiagonal eigendecomposition
Sequential symmetric QL and QR algorithms
The implicit QL and QR algorithms have been the most commonly used methods for solving the symmetric eigenproblem for the last couple of decades. Francis[79] wrote the first implementation of the QR algorithm based on Rutishauser's LR transformation. The QL algorithm is the basis of the EISPACK routine IMTQL1, while the LAPACK routine DSTEQR uses either implicit QR or implicit QL depending on the top and bottom diagonal elements[86]. Henry[93] shows that if, between each sweep of QR (or QL) in which the eigenvectors are updated, an additional sweep is performed in which the eigenvectors are not updated, better shifts can be used, reducing the total number of flops from roughly 6n^3 to 4n^3.
Reinsch[145] wrote EISPACK's TQLRAT, which computes eigenvalues without square roots. LAPACK's DSTERF improves on TQLRAT using a root free variant developed by Pal, Walker and Kahan[134]. Like DSTEQR, DSTERF uses either implicit QR or implicit QL depending on the top and bottom diagonal elements.
12. Their implementation uses a column cyclic data distribution.
Parallel symmetric QL and QR algorithms
QR requires O(n^2) effort to compute the eigenvalues and O(n^3) effort to compute the eigenvectors. No one has found a good, stable way to parallelize the O(n^2) cost of computing the eigenvalues and reflectors. Sameh and Kuck[113] use parallel prefix to parallelize QR for eigenvalue extraction. They obtain O(1/log(p)) speedup, but they do not show how their method can be used to generate reflectors and hence eigenvectors.
However, parallelizing the O(n^3) effort of computing the eigenvectors is straightforward, as shown by Chinchalkar and Coleman[39] and Arbenz et al.[8], and implemented for ScaLAPACK by Fellers[76].
Symmetric QR parallelizes nicely in a MIMD programming style, but efforts to parallelize it on a shared memory machine in which the parallelism is strictly within the calls to the BLAS have produced only modest speedups. Bai and Demmel[13] first suggested using multiple shifts in non-symmetric QR. Arbenz and Oettli[10] showed that blocking and multiple shifts could be used to obtain modest improvements in the speed (roughly a factor of 2 on 8 processors) of QR for eigenvalues and eigenvectors on the ALLIANT FX/80. Kaufman[109] showed that multi-shift QR could be used to speed eigenvalue extraction by a factor of 3 on a 2-processor Cray YMP despite tripling the number of flops performed.
Sturm sequence methods
Givens[83] used bisection to compute the eigenvalues of a tridiagonal matrix based on Wilkinson's original idea. Kahan[105] showed that bisection can compute small eigenvalues with tiny componentwise relative backward error, and sometimes high relative accuracy. High relative accuracy is required for inverse iteration on a few matrices. Barlow and Evans were the first to use bisection in a parallel code[15].
Computing the eigenvalues of a tridiagonal matrix can be split into three phases: isolation, separation and extraction. The isolation phase identifies, for each eigenvalue, an interval which contains that eigenvalue and no other. The separation phase improves the eigenvalue estimate. And the extraction phase computes the eigenvalue to within some tolerance. Bisection can be used for all three phases.
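All three phases rest on one kernel: a Sturm sequence count of the eigenvalues of T lying below a shift x, i.e. the negative inertia of T - xI. The sketch below is a serial Python toy, not PDSTEBZ; count_less_than and bisect_eigenvalue are illustrative names:

```python
def count_less_than(d, e, x):
    """Number of eigenvalues of the symmetric tridiagonal matrix with
    diagonal d and off-diagonal e that are less than x, via the standard
    Sturm sequence recurrence (the LDL^T pivots of T - xI); a zero pivot
    is perturbed to avoid division by zero."""
    count, t = 0, 1.0
    for i in range(len(d)):
        b2 = e[i - 1] ** 2 if i > 0 else 0.0
        t = (d[i] - x) - b2 / t
        if t == 0.0:
            t = 1e-300
        if t < 0.0:
            count += 1
    return count

def bisect_eigenvalue(d, e, k, lo, hi, tol=1e-12):
    """Extraction phase: find the k-th smallest eigenvalue (k = 1, 2, ...)
    by bisection, assuming [lo, hi] brackets the whole spectrum."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if count_less_than(d, e, mid) < k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# T = tridiag(-1, 2, -1) of order 4; smallest eigenvalue is 2 - 2cos(pi/5).
d, e = [2.0] * 4, [-1.0] * 3
lam1 = bisect_eigenvalue(d, e, 1, 0.0, 4.0)
```

Isolation uses the same count (an interval [a, b) contains exactly count_less_than(b) - count_less_than(a) eigenvalues), which is why one routine can drive all three phases.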
Neither the existing codes nor the literature explicitly distinguish between these three phases, but they have very different computational aspects. Isolation, at least to the point of identifying p intervals so that each processor is responsible for one interval, is difficult to parallelize, whereas the other phases are fairly straightforward. The separation phase is typically the challenge for most root finders, and the area where they distinguish themselves from other codes. Divide and conquer techniques, which use the eigenvalues of perturbed matrices as estimates of the eigenvalues of the original matrix, isolate and may separate the roots.
Techniques for eigenvalue isolation include: multi-section[126, 14], assigning different parts of the spectrum to different processors[95, 20], divide and conquer, and using multiple processors to compute the inertia of a tridiagonal matrix[123]. In multi-section, each processor computes the inertia at a single point, splitting an interval into p+1 intervals. Although multi-section requires communication, Crivelli and Jessup[48] show that the communication cost is often a modest part of the total cost. Divide and conquer splits the matrix by perturbing or ignoring a couple of elements, typically near the center of the matrix, to separate the matrix into two tridiagonal matrices whose eigenvalues can be computed separately. If a rank 1 perturbation is chosen, the merged set of eigenvalues provides a set of intervals in each of which exactly one eigenvalue lies.
There are a number of ways to use multiple processors to compute the inertia of a tridiagonal matrix. Lu and Qiao [127] discuss using parallel prefix to compute the Sturm sequence as the sub-products of a series of 2 by 2 matrices; Mathias [131] did an error analysis and showed that this approach is unstable. Ren [146] tried unsuccessfully to repair parallel prefix. Conroy and Podrazik [46] perform LU on a block arrowhead matrix; each block is tridiagonal and the arrow has width equal to the number of blocks. Swarztrauber [162] and Krishnakumar and Morf [111] discuss ways of computing the determinants of 4 matrices of size roughly n by n from the determinants of 8 matrices of size roughly n/2 by n/2. Each of these methods performs 2 to 4 times more floating point operations than a serial Sturm sequence count would and requires O(log(p)) messages. Except for Conroy and Podrazik's method, they all use multiplies instead of divides. Multiplies are faster than divides, but require special checks to avoid overflow.
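The serial Sturm sequence count at the heart of these methods is short enough to sketch. The following is an illustrative Python version (the function name and the pivot guard constant are my own; the production codes are written in Fortran and C with additional safeguards):

```python
def sturm_count(d, e, sigma, pivmin=1e-300):
    """Count eigenvalues of the symmetric tridiagonal matrix T
    (diagonal d, off-diagonal e) that are less than sigma.

    Uses the LDL^T pivot recurrence: by Sylvester's law of inertia,
    the number of negative pivots of T - sigma*I equals the number of
    eigenvalues below the shift.  Note the divide and the comparison
    in the inner loop, which dominate the cost on most machines."""
    count = 0
    t = d[0] - sigma
    if t < 0:
        count += 1
    for i in range(1, len(d)):
        if abs(t) < pivmin:          # guard against a zero pivot
            t = -pivmin
        t = (d[i] - sigma) - e[i - 1] ** 2 / t
        if t < 0:
            count += 1
    return count
```

The sign-bit trick mentioned below replaces the `t < 0` comparison; the divide is the other inner-loop cost the proposed optimizations target.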
The computation of the inertia is slowed by the existence of a divide and a comparison in the inner loop. A couple of tricks can potentially be used to speed the computation of the inertia, either by reducing the number of divides and comparisons or by making them faster. ScaLAPACK's PDSYEVX uses signed zeroes and the C language's ability to extract the sign bit of a floating point number to avoid a comparison in the inner loop [54]. I have proposed perturbing tiny entries in the tridiagonal matrix to guarantee that negative zero will never occur, thus allowing a standard C or Fortran comparison against zero; such a comparison would allow compilers to produce more efficient code. I have also proposed reducing the number of divides in the inner loop by taking advantage of the fixed exponent and mantissa sizes in IEEE double precision numbers. I have not implemented either of these ideas. Some machines have two types of divide: a fast hardware divide that may be incorrect in the last couple of bits, and a slower but correct software divide.
Demmel, Dhillon and Ren [54] give a proof of correctness for PDSTEBZ, ScaLAPACK's bisection code for computing the eigenvalues of a tridiagonal matrix, in the face of heterogeneity and non-monotonic arithmetic (such as sloppy divides). This shows that bisection can be robust even in the face of incorrect divides.
Many techniques have been used to accelerate eigenvalue extraction, including: the secant method [33], Laguerre's iteration [138], Rayleigh quotient iteration [163], secular equation root finding [50] and homotopy continuation [120, 45]. Bassermann and Weidner use a Newton-like root finder called the Pegasus method [17]. These acceleration techniques converge super-linearly as long as the eigenvalues are separated.
Li and Ren [121] accelerate eigenvalue separation in their Laguerre-based root finder by detecting linear convergence and estimating the effect of the next several steps. Brent [33] discusses ways of separating eigenvalues when the secant method is used. Li and Zeng use an estimate of the multiplicity in their root finder based on Laguerre iteration [122]. Szyld [163] uses inverse iteration, with the shift set to the middle of the interval known to contain only one eigenvalue, to separate eigenvalues before switching to Rayleigh quotient iteration. Cuppen's method takes advantage of multiple eigenvalues through deflation.
Eigenvalue extraction can be performed in parallel with no communication, or with a small constant amount of communication. However, eigenvalue extraction can exhibit poor load balance, especially if acceleration techniques are used. Ma and Szyld [128] use a task queue to improve load balance. Li and Ren [121] minimize load imbalance by concentrating on worst case performance.
ScaLAPACK chose bisection and inverse iteration for its first tridiagonal eigensolver, PDSYEVX, because they are fast, well known, robust, simple and parallelize easily. ScaLAPACK has since added a QR-based tridiagonal eigensolver for those applications needing guarantees on orthogonality among eigenvectors corresponding to large clusters of eigenvalues. See Section 4.3 for details.
Divide and Conquer
Cuppen [50] showed that by making a small perturbation to a tridiagonal matrix it could be split into two separate tridiagonal matrices, each of which could be solved independently, and that the eigendecomposition of the original tridiagonal matrix could then be constructed from the eigendecompositions of the two independent tridiagonal matrices and the perturbation.
There are many ways to perturb a tridiagonal matrix such that the result is two separate tridiagonal matrices. The following four have been implemented. Cuppen's algorithm [50] subtracts beta * u * u^T from the tridiagonal matrix, where u = e_{n/2} + e_{n/2+1} and beta = T_{n/2,n/2+1}. Gu and Eisenstat [89] set all elements in row and column i to zero. Gates and Arbenz [82] call this a rank-one extension and refer to it as permuting row and column n/2 to the last row and column (as opposed to setting all elements in row and column i to zero). Gates [80] uses a rank-two perturbation: T_{n/2,n/2+1} (e_{n/2} e_{n/2+1}^T + e_{n/2+1} e_{n/2}^T) is subtracted from the original tridiagonal.
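The rank-one tearing can be checked numerically. The sketch below (my own illustration in NumPy, not code from any of the cited packages) splits a tridiagonal matrix in Cuppen's manner and verifies that the original matrix is recovered as the block-diagonal pair plus beta * u * u^T:

```python
import numpy as np

def cuppen_split(d, e):
    """Split the symmetric tridiagonal T (diagonal d, off-diagonal e)
    into two independent tridiagonal halves plus a rank-one correction,
    T = diag(T1, T2) + beta * outer(u, u), with u = e_m + e_{m+1}
    (unit vectors, m = n/2) and beta the torn-out coupling entry."""
    n = len(d)
    m = n // 2
    beta = e[m - 1]
    d1, e1 = d[:m].copy(), e[:m - 1].copy()
    d2, e2 = d[m:].copy(), e[m:].copy()
    # Cuppen subtracts beta from the two diagonal entries adjacent to
    # the torn coupling so that the remainder is exactly rank one.
    d1[-1] -= beta
    d2[0] -= beta
    u = np.zeros(n)
    u[m - 1] = u[m] = 1.0
    return (d1, e1), (d2, e2), beta, u

def tridiag(d, e):
    """Assemble a dense symmetric tridiagonal matrix for checking."""
    return np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
```

The merged eigenvalues of the two halves then provide the intervals, each containing exactly one eigenvalue of T, mentioned earlier.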
Cuppen's original divide and conquer method can result in a loss of orthogonality among the eigenvectors. Three methods of maintaining orthogonality have been implemented. Sorensen and Tang [155] calculate the roots to double precision. Gu and Eisenstat [89] compute the eigenvectors of a slightly perturbed problem. Gates [81] showed that inverse iteration and Gram-Schmidt re-orthogonalization could be used in divide and conquer codes to compute orthogonal eigenvectors.
Several divide and conquer codes are available today. The first publicly available divide and conquer code, TREEQL, was written by Dongarra and Sorensen [66]. The fastest reliable serial code currently available for computing the full eigendecomposition of a tridiagonal matrix is LAPACK's DSTEDC [147]. It is based on Cuppen's divide and conquer [50] and uses Gu and Eisenstat's [88] method to maintain orthogonality.
There has long been interest in parallelizing divide and conquer codes because of the obvious parallelism in the early stages. There are three reasons why this technique has proven difficult to parallelize. The first is that the majority of the flops are performed at the root of the divide and conquer tree, and hence the parallelism at the leaves is less valuable [36]. The second is that deflation, the property that makes DSTEDC the fastest serial code, leads to dynamic load imbalance in parallel codes. The third is the complexity of the serial code itself.
Dongarra and Sorensen's parallel code [66], SESUPD, was written for a shared memory machine. The first parallel divide and conquer codes written for distributed memory computers used a 1D data layout (thus limiting their scalability) [99, 81]. Potter [141] has written a parallel divide and conquer code for small matrices (it requires a full copy of the matrix on each node). Françoise Tisseur has written a parallel divide and conquer code for inclusion in ScaLAPACK.
Inverse Iteration
Inverse iteration with eigenvalue shifts is typically used to compute the eigenvectors once the eigenvalues are known [170]. Jessup and Ipsen [102] explain the use of Gram-Schmidt re-orthogonalization to ensure that the eigenvectors are orthogonal. Fann and Littlefield [75] found that inverse iteration and Gram-Schmidt can be performed in parallel, greatly improving their efficiency. Parlett and Dhillon [139, 59] are working on a method, based on work by Fernando, Parlett and Dhillon [77], that may avoid, or greatly reduce, the need for re-orthogonalization.
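The basic iteration is easy to sketch. The following illustrative NumPy version (not DSTEIN; a dense solve stands in for the tridiagonal factorization, and the function name is my own) computes the eigenvector nearest a given shift:

```python
import numpy as np

def inverse_iteration(T, sigma, iters=4, seed=0):
    """Compute an eigenvector of the symmetric matrix T whose
    eigenvalue is closest to the shift sigma, by repeatedly solving
    (T - sigma*I) v_new = v and normalizing.  Each solve amplifies the
    component along the nearby eigenvector by roughly 1/(lambda - sigma),
    so convergence is fast when sigma is an accurate eigenvalue
    estimate."""
    n = T.shape[0]
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)          # random starting vector
    M = T - sigma * np.eye(n)
    for _ in range(iters):
        v = np.linalg.solve(M, v)
        v /= np.linalg.norm(v)
    lam = v @ T @ v                     # Rayleigh quotient estimate
    return lam, v
```

When two shifts are nearly equal, the two computed vectors converge toward the same direction, which is why the re-orthogonalization discussed here becomes necessary.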
The Jacobi method
The Jacobi method for the symmetric eigenproblem consists of applying a series of rotations, each of which forces a single off-diagonal element to zero. Each such rotation reduces the square of the Frobenius norm of the off-diagonal elements by the square of the element which was eliminated. Hence, as long as the off-diagonal elements to be eliminated are reasonably chosen, the norm of the off-diagonal converges to zero [167].
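A single cyclic sweep can be sketched as follows (an illustrative NumPy version; real implementations update only the two affected rows and columns rather than forming the full rotation matrix, and the function names are my own):

```python
import numpy as np

def jacobi_sweep(A):
    """One cyclic Jacobi sweep: for each off-diagonal position (p, q),
    apply a rotation chosen to zero A[p, q].  Each rotation reduces
    the squared Frobenius norm of the off-diagonal part by
    2*A[p, q]**2.  A is overwritten and returned."""
    n = A.shape[0]
    for p in range(n - 1):
        for q in range(p + 1, n):
            if A[p, q] == 0.0:
                continue
            # Rotation angle that annihilates A[p, q] (the standard
            # textbook choice): tan(2*theta) = 2 A[p,q] / (A[q,q] - A[p,p]).
            theta = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
            c, s = np.cos(theta), np.sin(theta)
            J = np.eye(n)
            J[p, p] = J[q, q] = c
            J[p, q], J[q, p] = s, -s
            A[:] = J.T @ A @ J
    return A

def offdiag_norm(A):
    """Frobenius norm of the off-diagonal part of A."""
    return np.sqrt(np.sum(A ** 2) - np.sum(np.diag(A) ** 2))
```

After a few sweeps the diagonal converges to the eigenvalues, and the accumulated rotations (not kept here) form the eigenvectors.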
There are several variations of the Jacobi method. Classical Jacobi [100] selects the largest off-diagonal element as the element to eliminate at each step, and hence requires the fewest steps. However, O(n^2) comparisons are required at each step to select the largest element, requiring O(n^4) comparisons per sweep, rendering it unattractive. Cyclic Jacobi annihilates every element once per sweep in some specified order. Threshold Jacobi differs from cyclic Jacobi in that only those elements larger than a given threshold are annihilated. Block Jacobi annihilates an entire block of elements at each step.
Cyclic, threshold and block variants of Jacobi each have their advantages. Cyclic Jacobi is the simplest to implement. Block Jacobi requires fewer flops (and, if done in parallel, fewer messages) per element annihilated. Threshold Jacobi requires fewer steps and converges more surely than cyclic Jacobi; however, a parallel threshold Jacobi requires more communication. Scott et al. showed that a block threshold Jacobi method [151] is the best Jacobi method for distributed memory machines; however, it would also be the most complex to implement. Littlefield and Maschhoff [125] found that, for large numbers of processors, a parallel block Jacobi beat the tridiagonal-based methods available at that time.
One-sided Jacobi methods apply rotations to only one side of the matrix and force the columns of the matrix to be orthogonal; the columns hence represent scaled eigenvectors. One-sided Jacobi methods require fewer flops and may parallelize better [10, 21].
Existing parallel implementations of the Jacobi algorithm are based on a 1D data layout. Arbenz and Oettli [10] implemented a blocked one-sided Jacobi. Pourzandi and Tourancheau [142] show that overlapping communication and computation is effective in a Jacobi implementation on the i860-based NCUBE. Although a 1D data layout is not scalable, the huge computation to communication ratio in the Jacobi algorithm hides this on all machines available today.
There are two publicly available parallel Jacobi codes. Fernando wrote a parallel Jacobi code for NAG [87]. O'Neal and Reddy [133] wrote a parallel Jacobi code, PJAC, for the Pittsburgh Supercomputing Center.
Demmel and Veselić [58] prove that, on scaled diagonally dominant matrices, Jacobi can compute small eigenvalues with high relative accuracy while tridiagonal-based methods cannot. Demmel et al. [56] give a comprehensive discussion of the situations in which Jacobi is more accurate than other available algorithms.

The Jacobi method is discussed further in Section 7.3.
2.7.3 Matrix-matrix multiply based methods
There are several methods for solving the symmetric eigenproblem which can be made to use only matrix-matrix multiply.

Matrix-multiply-based methods are attractive because they can be performed efficiently on all computers, and they scale well. However, they require many more flops (typically 6 to 60 times more) than reduction to tridiagonal form, tridiagonal eigensolution and back transformation. Hence, these methods only make sense if tridiagonal-based methods cannot be performed efficiently or do not yield answers that are sufficiently accurate.
Invariant Subspace Decomposition Algorithm
The Invariant Subspace Decomposition Algorithm [97], ISDA, for solving the symmetric eigenproblem involves recursively decoupling the matrix A into two smaller matrices. Each decoupling is achieved by applying an orthogonal similarity transformation, Q^T A Q, such that the first columns of Q span an invariant subspace of A. Such a Q is found by computing a polynomial function of A, p(A), which maps all the eigenvalues of A nearly to 0 or 1, and then taking the QR decomposition of p(A). One such polynomial can be computed by first shifting and scaling A to obtain A_0, all of whose eigenvalues are known to lie between 0 and 1 (by Gershgorin's theorem), and then repeatedly computing the beta function, A_{i+1} = 3A_i^2 - 2A_i^3, until all of the eigenvalues of A_i are effectively either 0 or 1. (All of the eigenvalues of A_0 that are less than 0.5 are mapped to 0; all the eigenvalues of A_0 that are greater than 0.5 are mapped to 1.)
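One decoupling step can be sketched as follows (an illustrative NumPy version, not the PRISM code; the fixed iteration count and the use of an unpivoted QR are simplifications of my own):

```python
import numpy as np

def isda_split(A, iters=50):
    """One ISDA decoupling step (illustrative sketch).  A is shifted
    and scaled via Gershgorin bounds so its eigenvalues lie in [0, 1],
    then the beta function B <- 3B^2 - 2B^3 is iterated, driving
    eigenvalues below 0.5 to 0 and those above 0.5 to 1.  The QR
    factorization of the (nearly) idempotent result yields a Q whose
    leading k columns span an invariant subspace of A."""
    n = A.shape[0]
    radii = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))
    lo = np.min(np.diag(A) - radii)
    hi = np.max(np.diag(A) + radii)
    B = (A - lo * np.eye(n)) / (hi - lo)   # eigenvalues now in [0, 1]
    for _ in range(iters):
        B2 = B @ B
        B = 3 * B2 - 2 * (B2 @ B)          # the beta function
    k = int(round(np.trace(B)))            # eigenvalues mapped to 1
    Q, _ = np.linalg.qr(B)
    return Q, k
```

After the step, Q^T A Q is (nearly) block diagonal, and the algorithm recurses on the two blocks.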
The ISDA parallelizes well because each of the tasks involved performs well in parallel [97]. Unfortunately, the ISDA requires far more floating point operations (roughly 100 n^3) than eigensolvers that are based on reducing the matrix first to tridiagonal form (which require 8n^3 + O(n^2) or fewer flops).
Applying the ISDA to banded matrices greatly reduces the flop count [26]. Furthermore, the banded matrix multiplications can still be performed efficiently, and the bandwidth does not triple with each application of A_{i+1} = 3A_i^2 - 2A_i^3 as one would expect with random banded matrices. Nonetheless, the bandwidth does grow enough to necessitate several band reductions, each of which requires a corresponding back transformation step.

A publicly available code based on the ISDA is available from the PRISM group [28].
The ISDA applied directly to the full matrix requires roughly 100n^3 flops, or 30 times as many as tridiagonal-reduction-based methods, and hence will never be as fast. Banded ISDA is almost a tridiagonal-based method, but is not likely to be the fastest method: the quickest way to compute eigenvalues from a banded matrix is to reduce the matrix first to tridiagonal form, and, if eigenvectors are required, banded ISDA will require at least twice and probably three times as many flops in back transformation.
FFT based invariant subspace decomposition
Yau and Lu [174] implemented an FFT-based invariant subspace decomposition method. This method requires O(log(n)) matrix multiplications. Tisseur and Domas [60] have written a parallel implementation of the Yau and Lu method.

FFT-based invariant subspace decomposition, like ISDA applied to dense matrices, requires roughly 100n^3 flops. Hence it, like ISDA, will never be as fast as tridiagonal-reduction-based methods.
Strassen's matrix multiply
Strassen's matrix-matrix multiply [157] can decrease the execution time for very large matrix-matrix multiplies by up to 20%, but will not make ISDA competitive. Several implementations of Strassen's matrix multiply have been able to demonstrate performance superior to conventional matrix-matrix multiply [96, 43]. However, Strassen's method is only useful when performing matrix-matrix multiplies in which all three matrices are very large, and Strassen's flop count advantage grows very slowly as the matrix size grows. In order to double Strassen's flop count advantage, the matrices being multiplied must be sixteen times as large, and hence memory usage must increase a thousand fold.
2.7.4 Orthogonality
Some methods, notably inverse iteration, require extra care to ensure that the eigenvectors are orthogonal. In exact arithmetic, if two eigenvalues differ, their corresponding eigenvectors will be orthogonal. However, if the input matrix has, say, a double eigenvalue, the eigenvectors corresponding to this double eigenvalue span a two-dimensional subspace, and hence there is no guarantee that two eigenvectors chosen at random from this space will be orthogonal. In floating point arithmetic, inverse iteration without re-orthogonalization may not produce orthogonal eigenvectors when two or more eigenvalues are nearly identical. In DSTEIN, LAPACK's inverse iteration code, when computing the eigenvectors for a cluster of eigenvalues, modified Gram-Schmidt re-orthogonalization is employed after each iteration to re-orthogonalize the iterate against all of the other eigenvectors in the cluster [102]. Modified Gram-Schmidt re-orthogonalization parallelizes poorly because it is a series of dot products and DAXPYs, each of which depends upon the result of the immediately preceding operation. PeIGs [74] and PDSYEVX [68] have chosen different responses to the fact that the re-orthogonalization in DSYEVX parallelizes poorly.
PeIGs alternates inverse iteration and re-orthogonalization in a different manner than DSYEVX. Instead of computing one eigenvector at a time, all of the eigenvectors within a cluster are computed simultaneously. For each cluster, PeIGs first performs a round of inverse iteration without re-orthogonalization using random starting vectors. Then, PeIGs performs modified Gram-Schmidt re-orthogonalization twice to orthogonalize the eigenvectors. PeIGs next performs a second round of inverse iteration without re-orthogonalization, using the output from the previous step as the starting vectors, repeating until sufficient accuracy is obtained for each eigenvector. Finally, PeIGs performs modified Gram-Schmidt re-orthogonalization one last time. The authors have shown that this method works on application matrices with large clusters of eigenvalues.
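Modified Gram-Schmidt itself is short; the sketch below (illustrative NumPy, not the DSTEIN or PeIGs code) shows the chain of dot products and axpy updates that creates the sequential dependence discussed above:

```python
import numpy as np

def modified_gram_schmidt(V):
    """Orthonormalize the columns of V in place by modified
    Gram-Schmidt.  Each column is orthogonalized against every
    previously finished column via a dot product followed by an
    axpy; each update depends on the result of the previous one,
    which is what makes this step hard to parallelize across the
    vectors of one cluster."""
    n, k = V.shape
    for j in range(k):
        for i in range(j):
            V[:, j] -= (V[:, i] @ V[:, j]) * V[:, i]
        V[:, j] /= np.linalg.norm(V[:, j])
    return V
```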
PDSYEVX attempts to assign the computation of all eigenvectors associated with each cluster of eigenvalues to a single processor. When enough space is available to accomplish this, PDSYEVX produces exactly the same results as DSYEVX. When the user does not provide enough local workspace, PDSYEVX relaxes the definition of a cluster repeatedly until it can assign the computation of all eigenvectors associated with each cluster of eigenvalues to a single processor.
When the input matrix contains one or more very large clusters of eigenvalues, PDSYEVX performs poorly: if enough workspace is available, PDSYEVX gives the same results as DSYEVX, but runs very slowly; if insufficient workspace is available, PDSYEVX does not guarantee orthogonality. Dhillon explains the fundamental problems in inverse iteration [59].

Recently, Parlett and Dhillon have identified new techniques for computing the eigenvectors of a symmetric tridiagonal matrix [136, 139]. These new results raise the hope that we will soon have an O(n^2) method for computing the eigenvectors of a symmetric tridiagonal matrix which parallelizes well and avoids the problems with computing the eigenvectors associated with clustered eigenvalues. ScaLAPACK looks forward to applying these new techniques in a future release.
Chapter 3
Basic Linear Algebra Subroutines
3.1 BLAS design and implementation
The BLAS [117, 63, 62], Basic Linear Algebra Subroutines, were designed to allow portable codes, most of whose operations are matrix-matrix multiplications, matrix-vector multiplications, and related linear algebra operations, to achieve high performance, provided that the BLAS achieve high performance. In LAPACK [4], the BLAS were used to re-express the linear algebra algorithms in the previous libraries LINPACK [61] and EISPACK [153], thereby achieving performance portability.
The BLAS routines are split into three sets. BLAS Level 1 routines involve only vectors, require O(n) flops (on input vectors of length n), and perform two or three memory operations for every two flops. BLAS Level 2 routines involve one n by n matrix, O(n^2) flops, and one or two memory operations for every two flops (rectangular matrices are also supported). BLAS Level 3 routines involve only matrices, O(n^3) flops and O(n^2) memory operations. BLAS Level 1 routines, because they involve only O(n) operations per invocation, have the least flexibility in how the operations are ordered, and require the most memory operations per flop. Hence, BLAS Level 1 routines have the lowest peak floating point operation rate. They also have the lowest software overhead, an important consideration because they perform few operations. BLAS Level 3 routines have the most flexibility in how the operations are ordered and require the fewest memory operations per flop, and hence achieve the highest performance on large tasks. BLAS Level 1 and 2 routines are typically limited by the speed of memory. BLAS Level 3 routines typically execute very near the peak speed of the floating point unit.
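The flop-to-memory-operation ratios behind these statements can be made concrete. The sketch below uses my own simple accounting (each matrix and vector element read or written is counted once, and cache reuse of the vectors is ignored):

```python
def blas_ratios(n):
    """Flops per memory operation for representative Level 1, 2, 3
    BLAS operations on problems of size n, under a simple counting
    model.  Level 1 is fixed at 2/3, Level 2 approaches 2, and
    Level 3 grows like n/2 -- which is why only Level 3 routines can
    run near the peak speed of the floating point unit."""
    # Level 1, DAXPY  y <- a*x + y : 2n flops; read x and y, write y.
    axpy = (2 * n) / (3 * n)
    # Level 2, DGEMV  y <- A*x + y : 2n^2 flops; read A (n^2 elements)
    # plus the vector traffic.
    gemv = (2 * n * n) / (n * n + 3 * n)
    # Level 3, DGEMM  C <- A*B + C : 2n^3 flops; read A, B, C, write C.
    gemm = (2 * n ** 3) / (4 * n * n)
    return axpy, gemv, gemm
```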
Typical hardware architectures make it possible, but not easy, to achieve high floating point execution rates for matrix-matrix multiply. Floating point units can initiate floating point operations every 2 to 5 nanoseconds, though floating point operations take 10 to 30 nanoseconds to complete, and main memory requires 20 to 60 nanoseconds per random data fetch. Floating point units achieve high throughput through concurrency, allowing multiple operations to be performed simultaneously, and pipelining, starting operations before the previous operation is complete. Register files are made large enough to provide source and target registers for as many operations as can be active at one time. Main memory throughput can be enhanced by interleaving memory banks and by fetching several words simultaneously (or nearly so) from main memory. Memory performance is further enhanced by the use of caches. Two levels of cache are now typical, and systems are now being designed with three levels.
High performance BLAS routines typically incur significant software overhead, because to achieve near the floating point unit's peak performance, BLAS routines need an inner loop that can keep the floating point units busy, surrounded by one or more levels of blocking to keep the memory accesses in the fastest memory possible. Managing concurrency and/or pipelining requires a long inner loop which operates on several vectors at once. Each level of blocking requires additional control code and separate loops to handle portions of the matrix that are not exact multiples of the block size. For example, DGEMV(1) (double precision matrix-vector multiplication) on the PARAGON has an average software overhead of 23 microseconds (over 1000 cycles at 50 MHz) and includes 200 instructions of error checking and case selection, 750 instructions for the transpose case and 500 for the non-transpose case(2).
3.2 BLAS execution time
The execution time for each call to a BLAS routine depends upon the hardware, the BLAS implementation, the operation requested and the state of the machine, especially the contents of the caches, at the time of the call. The time per DGEMV, or BLAS Level 2, flop is limited by the speed of the memory hierarchy level at which the matrix resides. The time per DGEMM(3), or BLAS Level 3, flop is typically limited primarily by the rate at which the floating point unit can issue and complete instructions. We will concentrate on DGEMM and DGEMV because they perform most of the flops in PDSYEVX.

Table 3.1: BLAS execution time (Time = alpha_i + number of flops * gamma_i, in microseconds)

                               BLAS Level 3           BLAS Level 2           BLAS Level 1
                peak flop rate overhead  time/flop    overhead  time/flop    overhead  time/flop
                (Mflops/sec)   alpha_3   gamma_3      alpha_2   gamma_2      alpha_1   gamma_1
  PARAGON
  Basic Math Library
  Software (Release 5.0)  50     300     .024  (41)      87     .026  (38)      3      .10  (10)
  IBM SP2
  ESSL 2.2.2.2           480       0     .0037 (270)      5     .0055 (180)    1.2     .01 (100)

(Times per flop are in microseconds; the equivalent Mflops/sec rates appear in parentheses.)

(1) DGEMV performs y = alpha*A*x + beta*y or y = alpha*A^T*x + beta*y, where A is a matrix, x and y are vectors, and alpha and beta are scalars.
(2) These instruction counts include all instructions routinely executed during the main loop in reduction to tridiagonal form. Not all are executed during each call to DGEMV.
Table 3.1 shows the software overhead and time per flop for the BLAS routines. These times are based on independent timings, with code cached but not data cached, using invocations that are typical for PDSYEVX. Recall that these parameters are used in a linear model of performance:

    Time = alpha + (number of flops) * gamma        (Line 1)
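Given measured times, the parameters of this linear model can be recovered by a least-squares fit. The sketch below is illustrative only, using synthetic "measurements" manufactured from the model (with the PARAGON Level 2 parameters of Table 3.1 as the true values) rather than real timings:

```python
import numpy as np

def fit_linear_model(flops, times):
    """Least-squares fit of the model  time = alpha + flops * gamma.
    Returns (alpha, gamma): the software overhead and the time per
    flop."""
    A = np.column_stack([np.ones_like(flops, dtype=float), flops])
    (alpha, gamma), *_ = np.linalg.lstsq(A, times, rcond=None)
    return alpha, gamma
```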
In PDSYEVX we are most concerned with the time per flop for Level 3 routines and secondarily concerned with the time per flop and software overhead for Level 2 routines. For n = 3840 and p = 64 on the PARAGON, the three largest components attributable to the items in Table 3.1 are: 28% of the PDSYEVX execution time is attributable to BLAS Level 3 floating point execution (not including software overhead), 8% is attributable to BLAS Level 2 floating point execution, and 5% is attributable to BLAS Level 2 software overhead. (See Chapter 5 for details.) The fact that the BLAS 3 software overhead for the IBM SP2 is listed as 0 stems from the fact that matrix-matrix multiply is faster for small problem sizes because they fit in cache(4).

(3) DGEMM performs C = alpha*A*B + beta*C or C = alpha*A^T*B + beta*C, where A, B and C are matrices, and alpha and beta are scalars.
[Figure 3.1: Performance of DGEMV on the Intel PARAGON. Each point plots the ratio of expected to actual execution time against the number of flops (values greater than 1 mean the actual call was faster than predicted); separate points are shown for data cached and data not cached.]

Figure 3.1 shows how actual DGEMV performance differs from the performance predicted by Line 1 on the PARAGON. Each point represents the time required for a call to DGEMV, with parameters that are typical of calls to DGEMV made in PDSYEVX, divided by the time predicted by our performance model. The timings are made by an independent timer as described in Section 3.3. The model matches quite well on most calls to DGEMV. It also shows a modest, but noticeable, difference between the cost when data is cached versus when it is not. If the software overhead term were removed (i.e., using number of flops * gamma_2 as the model), the model would underestimate execution time by a factor of two hundred or more on small problem sizes.

(4) We did not pursue this because BLAS 3 software overhead has little impact on PDSYEVX execution time.
Some calls to DGEMV require much less time than expected, as little as 1/9 of the predicted time, indicating that the software overhead is not independent of the type of call made. In particular, calls which involve very few flops can vary widely in their execution time relative to the predicted time. However, not many calls differ widely in their execution time, and those that do require few flops (hence little execution time). The fact that they do not match well does not significantly affect the accuracy of my performance model for PDSYEVX (given in Chapter 4), and hence I did not study them further.
Figure 3.2 shows that DGEMV on the PARAGON requires 10 to 50 microseconds longer if the code is not cached at the time it is called. The additional time required is estimated by subtracting the cost of running DGEMV alone from the cost of running DGEMV followed by 16,384 no-ops(5), while accounting for the execution time of the 16,384 no-ops themselves. The extra time required increases as the number of flops increases, and the extra time is greater when the data is not cached than when it is cached(6). It is not surprising that the extra time required when the code is not cached increases as the number of flops increases, because when few flops are involved, the code does not execute as many loops. However, it is surprising that the code cache miss cost in the "Data not cached" case appears to increase almost linearly with the number of flops; I would expect to see something closer to a step function. This deserves further study if it is determined that code cache misses substantially affect execution time.

Figure 3.3 shows that the extra time required by DGEMV ranges from 1.5% (when DGEMV performs many flops) to over 10% (when DGEMV performs few flops). Only calls made to DGEMV with parameters that are typical of the calls commonly made by PDSYEVX are shown. The extra time required when code is not cached can be up to 80% on calls made to DGEMV requiring very few flops, but these are rare in PDSYEVX.

(5) The code cache holds 8,192 no-ops. Hence, executing 16,384 no-ops guarantees that the no-ops are not in cache, making their execution time independent of what is in the code cache at the time they are executed.
(6) I compare the execution time when neither code nor data is cached to the execution time when code is cached but data is not when estimating the extra time required when data is not cached.
[Figure 3.2: Additional execution time required for DGEMV when the code cache is flushed between each call. The y-axis shows the difference between the time required for a run which executes 16,384 no-ops after each call to DGEMV and the time required for a run which executes the DGEMV calls and the 16,384 no-ops in two separate loops; the x-axis shows the number of flops. Curves are shown for data cached and data not cached.]
3.3 Timing methodology
Each routine is timed with several sets of input parameters. To time a routine with a given set of input parameters, the routine is run three times and the time from the third run is used. Each run consists of calling the routine to be timed repeatedly within a loop. The first run, in which the loop is run only once, ensures that the code is paged in. The second run, in which the loop is run just long enough to exceed the timer resolution, provides an estimate that is used to determine how many times to run the loop in the third run. The third run, in which the loop is run for approximately one second, is the only one whose execution time is recorded. We record both CPU time and wall clock time; the plots shown here are based on CPU time.

[Figure 3.3: Additional execution time required for DGEMV when the code cache is flushed between each call, as a percentage of the time required when the code is cached. See Figure 3.2. The x-axis shows the number of flops and the y-axis the percentage overhead; curves are shown for data cached and data not cached.]
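The three-run scheme can be sketched as follows (an illustrative Python harness of my own; the thesis timings use the BLACS timing routines described in Section 3.5):

```python
import time

def time_routine(routine, resolution=1e-3, target=1.0):
    """Three-run timing harness.  Run 1 pages the code in; run 2
    doubles the iteration count until the elapsed time exceeds the
    timer resolution; run 3 runs for about `target` seconds and is
    the only run whose time is reported (per call)."""
    routine()                                   # run 1: page code in
    iters = 1
    while True:                                 # run 2: beat the timer resolution
        t0 = time.perf_counter()
        for _ in range(iters):
            routine()
        elapsed = time.perf_counter() - t0
        if elapsed > resolution:
            break
        iters *= 2
    iters = max(1, int(iters * target / elapsed))
    t0 = time.perf_counter()                    # run 3: the recorded run
    for _ in range(iters):
        routine()
    return (time.perf_counter() - t0) / iters
```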
The input parameters for each run are randomly selected such that they match the input parameters of a typical call to DGEMV from PDSYEVX. Randomly selecting the input parameters provides advantages over a systematic choice. A systematic choice of input parameters might include, for example, only even values of k, whereas odd values of k might require significantly longer. Random selection means that the likelihood of identifying anomalous behavior is directly related to how often that behavior occurs in calls within PDSYEVX. Random selection also scales well: it is easy to increase or decrease the number of timings and/or the number of processors used.
3.4 The cost of code and data cache misses in DGEMV
Each set of input parameters is timed under four different cache situations:

  - Code and data cached
  - Code cached but data not cached
  - Code not cached but data cached
  - Neither code nor data cached

Data can be allowed to remain in cache (to the extent that it fits in cache) by using the same arrays in each call within the timing loop. Likewise, data can be prevented from remaining in cache by using different arrays for each call within the timing loop.
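The same-array versus different-arrays trick can be sketched as follows (my own illustration; the pool size calculation simply ensures the working set is much larger than the cache):

```python
import itertools
import numpy as np

def timing_arrays(n, cache_bytes, data_cached):
    """Yield the argument array for each successive timed call.  If
    data_cached, the same array is reused so it stays resident in
    cache; otherwise we cycle through enough distinct arrays that
    each one has been evicted before it is used again."""
    if data_cached:
        a = np.zeros((n, n))
        while True:
            yield a
    else:
        # Enough copies that the pool is about twice the cache size
        # (8 bytes per double-precision element).
        copies = max(2, (2 * cache_bytes) // (n * n * 8) + 1)
        pool = [np.zeros((n, n)) for _ in range(copies)]
        yield from itertools.cycle(pool)
```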
Allowing data to reside in cache reduces execution time in two ways: it reduces the cost of accessing the data in the arrays being operated on, and it reduces the software overhead cost, because software overhead also involves reading and writing data, notably while saving and restoring registers.

Code and data cache misses are more important in DGEMV than in DGEMM, because DGEMV is called more often than DGEMM, and because the ratio of flops to data movement is higher for DGEMM than for DGEMV, which reduces the cost of data cache misses in DGEMM.
3.5 Miscellaneous timing details
We make sure that timings are not affected by conditions which are not likely to be encountered in a typical run of PDSYEVX. Exceptional numbers (subnormal numbers and infinities) will occur only rarely in PDSYEVX(7). Hence, we make sure that exceptional numbers do not appear during our timing runs.

We do not time PDSYEVX on problem sizes that do not fit in physical memory. Hence, when timing the individual BLAS routines, we make sure that the arrays fit in physical memory. Ed D'Azevedo has written an out-of-core symmetric eigensolver and studied the effect of paging on PDSYEVX [52].

(7) The matrix is scaled before reduction to tridiagonal form to avoid being close to the overflow or underflow threshold. Although this does not prevent underflows (or subnormal numbers), it causes them to be rare. NaNs will never appear in PDSYEVX unless NaNs appear in the input.
We measure and report both wall clock time and CPU time. Wall clock time may
differ from CPU time for several reasons, including: time spent waiting for communication,
time spent on other processes, and time spent on paging and other operating system services.
When timing the BLAS, we are primarily interested in CPU time because there is no
communication and we are not interested in measuring the time spent waiting on other
processes. However, we measure and report wall clock time as well because for all other timings
we must rely on wall clock timings8. When the wall clock time differs substantially from
the CPU time on calls to the BLAS on time-shared systems (such as the IBM SP2), we use
the ratio of wall clock time to CPU time as a crude measure of the load on the system.
We use the timing routines included in the BLACS routines developed at the University
of Tennessee at Knoxville [169, 69] (which are not a part of the BLACS specification).
Many modern computers have cycle time counters which would allow much more detailed
measurement of execution time and often of other machine characteristics. These detailed
timing routines are not portable, and I chose to stick to portable timing techniques. Alternatively,
Krste Asanovic has developed a portable interface for taking performance-related
statistics over an "interval" of a code's execution [11].
8 CPU time is often meaningless when communication is involved.
Chapter 4
Details of the execution time of
PDSYEVX
4.1 High level overview of PDSYEVX algorithm
Figure 4.1 shows how PDSYEVX reduces the original (dense) matrix to tridiagonal
form (Line 1), uses bisection and inverse iteration to solve the tridiagonal eigenproblem
(Line 2), and then transforms the eigenvectors of the tridiagonal matrix back into the
eigenvectors of the original dense matrix (Line 3). PDSYEVX uses a two-dimensional block cyclic
data layout with an algorithmic block size equal to the data layout block size in both
Householder reduction to tridiagonal form and back transformation. When using bisection
to compute the eigenvalues, it assigns each process an essentially equal number of
eigenvalues to compute. For inverse iteration, PDSYEVX attempts to assign roughly equal
numbers of eigenvectors to each process while assigning all eigenvectors corresponding to
a given cluster of eigenvalues to the same process. Gram-Schmidt re-orthogonalization is
performed locally within each process, and hence orthogonality is not guaranteed for
eigenvectors corresponding to eigenvalues within a cluster that is too large to fit on a single
process.
We assume that only the lower triangle of the square symmetric matrix A contains
valid data on input and the algorithms only read and write this lower triangle. The general
conclusions of this thesis apply to the upper triangular case as well.
Please refer to Table A.1, Table A.2, and Table A in Appendix A for the list of
Figure 4.1: PDSYEVX algorithm

(Line 1)  A = Q T Qᵀ
          A ∈ Rⁿˣⁿ is the matrix whose eigendecomposition we seek. T is tridiagonal. Q is orthogonal.
(Line 2)  T = U Λ Uᵀ
          Λ = diag(λ₁, …, λₙ) is the diagonal matrix of eigenvalues. The columns of U = [u₁ … uₙ] are the eigenvectors of T: T uᵢ = λᵢ uᵢ.
(Line 3)  V = Q U
          The columns of V = [v₁ … vₙ] are the eigenvectors of A: A vᵢ = λᵢ vᵢ.
notation used in this chapter.
Section 4.2 describes and models reduction to tridiagonal form as performed by
PDSYTRD. Section 4.3 describes and models the tridiagonal eigensolution as performed by
PDSTEBZ (bisection) and PDSTEIN (inverse iteration). Section 4.4 describes and models back
transformation as performed by PDORMTR.
4.2 Reduction to tridiagonal form
4.2.1 Householder's algorithm
Figure 4.4 shows Householder's reduction to tridiagonal form, and Figure 4.5 shows a
model for the runtime of ScaLAPACK's reduction to tridiagonal form code, PDSYTRD. The
rest of this section explains the computation and communication pattern in PDSYTRD. We
begin by describing the classical (serial and unblocked) algorithm (essentially the EISPACK
algorithm TRED1 and also LAPACK's DSYTD2), then the blocked (but still serial) algorithm
(essentially the LAPACK algorithm DSYTRD), and finally the parallel blocked ScaLAPACK
algorithm PDSYTRD.
Classical (serial and unblocked) Householder reduction (Figure 4.2)
Figure 4.2 shows the algorithm for the classical (serial and unblocked) Householder
reduction to tridiagonal form (essentially the algorithm used in LAPACK's DSYTD2).
The first iteration through the loop performs an orthogonal similarity transformation
of the form A ← (I − βvvᵀ)A(I − βvvᵀ), where β = 2/‖v‖₂², such that only the first
two elements in the first column (and hence the first two elements in the first row) of A
are non-zero. Each iteration through the loop repeats these steps on the trailing submatrix
A(2:n, 2:n) to reduce A to tridiagonal form by a series of similarity transformations.
Compute an appropriate reflector (Line 2.1 in Figure 4.2)
We seek a reflector of the form I − βvvᵀ, with β = 2/(vᵀv), such that the first row and
column of (I − βvvᵀ)A(I − βvvᵀ) has zeroes in all entries except the first two.
Let z be the column vector A(2:n, 1). In exact arithmetic, any vector v = c[z₁ ± ‖z‖₂, z₂, …, zₙ] for any scalar c will suffice, and determines what value β must take. LAPACK
and ScaLAPACK choose the sign (±‖z‖₂) to match the sign of z₁ to minimize roundoff
errors, and choose c such that v(1) = 1.0. c can also be chosen to be 1, avoiding the
need to multiply z by c, at some small risk of over/underflow.
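The reflector computation just described can be sketched in Python/NumPy as follows (an illustrative serial version, not the LAPACK or ScaLAPACK code; the function name `house` follows the figures, c is chosen so that v(1) = 1.0, and the sign matches z₁):

```python
import numpy as np

def house(z):
    """Compute (beta, v) with v[0] = 1 such that (I - beta*v*v^T) z is
    zero except for its first entry.  Assumes z is a nonzero vector.
    The sign of the shift matches sign(z[0]) to avoid cancellation,
    as LAPACK and ScaLAPACK do."""
    z = np.asarray(z, dtype=float)
    v = z.copy()
    v[0] = z[0] + np.copysign(np.linalg.norm(z), z[0])
    v /= v[0]                      # scale so that v[0] = 1.0
    beta = 2.0 / np.dot(v, v)      # beta = 2 / (v^T v)
    return beta, v
```

Applying H = I − βvvᵀ to z then leaves only the first entry nonzero, with magnitude ‖z‖₂.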
Form the matrix vector product y = Av (Line 3.3 in Figure 4.2)
This is a matrix vector multiply (Basic Linear Algebra Subroutines Level 2) requiring
2(n−i)² flops, which when summed from i = 1 to n−1 totals (2/3)n³ flops.
Compute the companion update vector w = βy − (β²/2)(yᵀv)v (Line 5.1 in Figure 4.2)
The vector w (which is computed here with a dot product and a DAXPY) has the
property that (I − βvvᵀ)A(I − βvvᵀ) = A − vwᵀ − wvᵀ.
Update the matrix (Line 6.3 in Figure 4.2)
Compute A = A − vwᵀ − wvᵀ, a BLAS Level 2 rank-2 update. A rank-2 update requires
4 flops per element updated; only the lower triangular portion of A is updated, so this
requires 2(n−i)² flops, which summed over i = 1 to n−1 is (2/3)n³ flops.
Figure 4.2: Classical unblocked, serial reduction to tridiagonal form, i.e. EISPACK's TRED1 (the line numbers are consistent with Figures 4.3, 4.4 and 4.5)

do i = 1, n
  Compute reflector
  2.1  [β, v] = house(A(i+1:n, i))
       [v ∈ Rⁿ⁻ⁱ; β is a scalar. house computes a Householder vector such that
        (I − βvvᵀ)A(i+1:n, i) is zero except for the top element.]
  Perform matrix-vector multiply
  3.3  w = tril(A(i+1:n, i+1:n))·v + tril(A(i+1:n, i+1:n), −1)ᵀ·v
       [w ∈ Rⁿ⁻ⁱ; tril() is MATLAB notation for the lower triangular portion of
        a matrix (including the diagonal). tril(·, −1) refers to the portion of the
        matrix below the diagonal.]
  Compute companion update vector
  5.1  c = wᵀv;  w = β(w − (cβ/2)v)
  Perform rank-2 update
  6.3  A(i+1:n, i+1:n) = tril(A(i+1:n, i+1:n) − wvᵀ − vwᵀ)
       [Here we use tril to indicate that only the lower triangular portion of A
        need be updated.]
end do i = 1, n
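The loop of Figure 4.2 can be written out serially in NumPy (a didactic model of the classical algorithm, not the EISPACK or LAPACK code; for clarity the full symmetric matrix is updated rather than only its lower triangle):

```python
import numpy as np

def tridiagonalize(A):
    """Reduce symmetric A to tridiagonal form by Householder similarity
    transformations, following the loop of Figure 4.2."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for i in range(n - 2):
        z = A[i+1:, i].copy()
        if np.linalg.norm(z[1:]) == 0.0:
            continue                      # column already tridiagonal
        norm_z = np.linalg.norm(z)
        # Compute the reflector (Line 2.1): H = I - beta*v*v^T
        v = z.copy()
        v[0] += np.copysign(norm_z, z[0])
        beta = 2.0 / np.dot(v, v)
        # Matrix-vector product with the trailing submatrix (Line 3.3)
        y = A[i+1:, i+1:] @ v
        # Companion update vector (Line 5.1): H A H = A - v*w^T - w*v^T
        w = beta * (y - (beta * np.dot(y, v) / 2.0) * v)
        # Rank-2 update of the trailing submatrix (Line 6.3)
        A[i+1:, i+1:] -= np.outer(v, w) + np.outer(w, v)
        # The transformed column i is a multiple of e_1
        alpha = -np.copysign(norm_z, z[0])
        A[i+1, i] = A[i, i+1] = alpha
        A[i+2:, i] = 0.0
        A[i, i+2:] = 0.0
    return A
```

Since each step is an orthogonal similarity transformation, the result has the same eigenvalues as the input.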
Blocked Householder reduction to tridiagonal form (Figure 4.3)
In the above algorithm, nearly all the flops are performed in the product y = Av
or the rank-2 update A − vwᵀ − wvᵀ, both of which are BLAS Level 2 operations. Through
blocking, half of the flops can be executed as BLAS 3 flops, because k matrix updates
can be performed as one rank-2k update instead of k rank-2 updates. This is done in
Line 6.3 in Figure 4.3. The cost of blocking is significant in PDSYTRD, but so is the gain;
see Section 7.2.2. Blocking allows the matrix update to be considerably more efficient, but it
complicates the computation of the reflector and the computation of the companion update
vector, because PDSYTRD must work with an out-of-date matrix. Starting with A₀, the
computation of the first reflector v₀, the matrix vector product, and w₀ are unchanged, but
as soon as PDSYTRD attempts to compute the second reflector, v₁, it has to deal with the fact
that A₁ is known only in factored form, i.e. A₁ = A₀ − v₀w₀ᵀ − w₀v₀ᵀ. This does not greatly
complicate computing the reflector because the reflector needs only the first column of A₁.
Figure 4.3: Blocked, serial reduction to tridiagonal form, i.e. DSYTRD (see Figure 4.2 for unblocked serial code)

do ii = 1, n, nb
  mxi = min(ii + nb, n)
  do i = ii, mxi
    Update current (ith) column of A
    1.2  A(:, i) = A(:, i) − W(:, ii:i−1)·V(i, ii:i−1)ᵀ − V(:, ii:i−1)·W(i, ii:i−1)ᵀ
    Compute reflector
    2.1  [β, v] = house(A(i+1:n, i))        [v ∈ Rⁿ⁻ⁱ; β is a scalar]
    Perform matrix-vector multiply
    3.3  w = tril(A(i+1:n, i+1:n))·v + tril(A(i+1:n, i+1:n), −1)ᵀ·v        [w ∈ Rⁿ⁻ⁱ]
    Update the matrix-vector product
    4.1  w = w − W(:, ii:i−1)·(V(:, ii:i−1)ᵀ·v) − V(:, ii:i−1)·(W(:, ii:i−1)ᵀ·v)
    Compute companion update vector
    5.1  c = wᵀv;  w = β(w − (cβ/2)v)
         W(i+1:n, i) = w;  V(i+1:n, i) = v
  end do i = ii, mxi
  Perform rank-2k update
  6.3  A(mxi+1:n, mxi+1:n) = tril(A(mxi+1:n, mxi+1:n)
         − W(mxi+1:n, ii:mxi)·V(mxi+1:n, ii:mxi)ᵀ
         − V(mxi+1:n, ii:mxi)·W(mxi+1:n, ii:mxi)ᵀ)
end do ii = 1, n, nb
However, computing w₁ requires the computation of A₁v, hence we must either update the
entire matrix A₁, returning to an unblocked code, or compute y = (A₀ − v₀w₀ᵀ − w₀v₀ᵀ)v.
Computing the reflectors and the companion update vectors now requires that the current
column be updated (Line 1.2 in Figure 4.3) and that the matrix vector product be updated
(Line 4.1 in Figure 4.3).
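The equivalence that blocking exploits, namely that k accumulated rank-2 updates can be applied as one rank-2k update (two matrix-matrix products), can be seen in a few lines of NumPy (illustrative only; V and W hold the k reflectors and companion update vectors as columns):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 3
A = rng.standard_normal((n, n))
V = rng.standard_normal((n, k))   # k Householder vectors, one per column
W = rng.standard_normal((n, k))   # k companion update vectors

# k separate rank-2 updates (BLAS 2 style)
A1 = A.copy()
for j in range(k):
    A1 -= np.outer(V[:, j], W[:, j]) + np.outer(W[:, j], V[:, j])

# one rank-2k update (BLAS 3 style): two matrix-matrix products
A2 = A - V @ W.T - W @ V.T

assert np.allclose(A1, A2)
```

The BLAS 3 form touches each matrix element once per block rather than once per column, which is where the efficiency gain comes from.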
4.2.2 PDSYTRD implementation (Figure 4.4)
Figure 4.5 shows Householder's reduction to tridiagonal form along with a model
for the runtime of each step in ScaLAPACK's reduction to tridiagonal form code, PDSYTRD.
The rest of this section explains the computation and communication pattern in PDSYTRD,
and hence the inefficiencies.
Figure 4.4: PDSYEVX reduction to tridiagonal form (see Figure 4.3 for further details)

do ii = 1, n, nb
  mxi = min(ii + nb, n)
  do i = ii, mxi
    Update current (ith) column of A (Table 4.1)
    1.1  spread V(i, ii:i−1)ᵀ and W(i, ii:i−1)ᵀ down
         [the processor owning V(i, ii:i−1) and W(i, ii:i−1) broadcasts to all other
          processors in its processor column]
    1.2  A(:, i) = A(:, i) − W(:, ii:i−1)·V(i, ii:i−1)ᵀ − V(:, ii:i−1)·W(i, ii:i−1)ᵀ
         [V and W are used as they are stored; no data movement required]
    Compute reflector (Table 4.2)
    2.1  [β, v] = house(A(i+1:n, i))        [v ∈ Rⁿ⁻ⁱ; β is a scalar]
    Perform matrix-vector multiply (Table 4.3)
    3.1  spread v across
    3.2  transpose v, spread down
    3.3  w1 = tril(A(i+1:n, i+1:n))·v           [w1 is distributed like row A(i, :)]
         w2 = tril(A(i+1:n, i+1:n), −1)ᵀ·v      [w2 is distributed like column A(:, i)]
    3.4  sum w1 row-wise
    3.5  sum w2 column-wise
    3.6  w = w1 + w2
         [w is distributed like column A(:, i); hence w1 must be transposed]
    Update the matrix-vector product (Table 4.4)
    4.1  w = w − W(:, ii:i−1)·(V(:, ii:i−1)ᵀ·v) − V(:, ii:i−1)·(W(:, ii:i−1)ᵀ·v)
    Compute companion update vector (Table 4.5)
    5.1  c = wᵀv;  w = β(w − (cβ/2)v)
         W(i+1:n, i) = w;  V(i+1:n, i) = v
  end do i = ii, mxi
  Perform rank-2k update (Table 4.6)
  6.1  spread V(mxi+1:n, ii:mxi), W(mxi+1:n, ii:mxi) across
       [processors in the current column of processors broadcast to processors in
        other processor columns]
  6.2  transpose V(mxi+1:n, ii:mxi), W(mxi+1:n, ii:mxi), spread down
  6.3  A(mxi+1:n, mxi+1:n) = tril(A(mxi+1:n, mxi+1:n)
         − W(mxi+1:n, ii:mxi)·V(mxi+1:n, ii:mxi)ᵀ
         − V(mxi+1:n, ii:mxi)·W(mxi+1:n, ii:mxi)ᵀ)
end do ii = 1, n, nb
Figure 4.5: Execution time model for PDSYEVX reduction to tridiagonal form (see Figure 4.4 for details about the algorithm and indices; α = message latency, β = per-word transfer cost, γᵢ = per-flop costs; costs are totals over the whole reduction, assuming the standard data layout pr = pc = √p)

do ii = 1, n, nb;  mxi = min(ii + nb, n);  do i = ii, mxi
  Update current (ith) column of A
    1.1  spread Vᵀ and Wᵀ down:        2n·lg(√p)·α
    1.2  A = A − W·Vᵀ − V·Wᵀ:          2n·γ₄ + (n²·nb/√p)·γ₂ + 2n·lg(√p)·β
  Compute reflector
    2.1  v = house(A):                 n·γ₄ + 3n·lg(√p)·α
  Perform matrix-vector multiply
    3.1  spread v across:              n·lg(√p)·α + ½·(n²·lg(√p)/√p)·β
    3.2  transpose v, spread down:     (n²/√p)·γ₁ + n·lg(√p)·α + ½·(n²·lg(√p)/√p)·β
    3.3  w = tril(A)·v, wᵀ = vᵀ·tril(A, −1):
                                       n·γ₄ + (n²/(nb·√p))·γ₂ + (2/3)·(n³/p)·γ₂ + 3·(n²·nb/√p)·γ₂
    3.4  sum w row-wise:               n·lg(√p)·α + ½·(n²·lg(√p)/√p)·β
    3.5  sum wᵀ column-wise:           n·lg(√p)·α + ½·(n²·lg(√p)/√p)·β
    3.6  w = w + transpose(wᵀ)
  Update the matrix-vector product
    4.1  w = w − W·Vᵀ·v − V·Wᵀ·v:      4n·γ₄ + 2·(n²·nb/√p)·γ₂ + 6n·lg(√p)·α + (n²·lg(√p)/√p)·β
  Compute companion update vector
    5.1  c = wᵀv;  w = β(w − (cβ/2)v): n·γ₄ + 2n·lg(√p)·α
end do i = ii, mxi
  Perform rank-2k update
    6.1  spread V, W across:           (n²·lg(√p)/√p)·β
    6.2  transpose V, W, spread down:  (n²·lg(√p)/√p)·β
    6.3  A = A − W·Vᵀ − V·Wᵀ:          2·(n²/(nb²·√p))·γ₃ + (2/3)·(n³/p)·γ₃ + 3·(n²·nb/√p)·γ₃
end do ii = 1, n, nb
Distribution of data and computation in PDSYTRD
In PDSYEVX, the matrix being reduced, A, is distributed across a two-dimensional grid
of processors. The computation is distributed in a like manner, i.e. computations involving
matrix element A(i, j) are performed by the processor which owns matrix element A(i, j).
Vectors are distributed across the processors within a given column of processors. At the
ith step, i.e. when reducing A(i :n; i :n) to A(i+1:n; i+1:n), the vectors are distributed
amongst the processors which own some portion of the vector A(i :n; i). Within calls to
the PBLAS, these vectors are sometimes replicated across all processor columns, or even
transposed and replicated across all processor rows. However, between PBLAS calls, each
vector element is owned by just one processor.
Critical path in PDSYTRD
For steps 1.1, 1.2, 2.1, 4.1, 5.1, 6.1, 6.2, 6.3 in Figure 4.5, i.e. all steps except
"forming the matrix vector product", the processor owning the most rows in the current
column of the remaining matrix has the most work to do and hence it is on the critical path.
When the matrix vector product is being formed (steps 3.1 through 3.6), the processor
which owns the most rows and the most columns in the remaining matrix has the most
work (both communication and computation) and hence is on the critical path.
Load imbalance
Load imbalance occurs when some processor(s) take longer to perform certain
operations1, requiring other processors to wait. Each processor is responsible for computations
on the portion of the matrix and/or vectors that it owns. Some processors own a larger
portion of the matrix and/or vectors. Since PDSYTRD has regular synchronization points2,
the processor which takes the longest to complete any given step determines the execution
time for that step.
If row j is the first row in a data layout block, the processor which owns A(j, j) will
own the most rows in A(j:n, j:n): ⌊(n−j+1)/(pr·nb)⌋·nb + min(n−j+1 − ⌊(n−j)/(pr·nb)⌋·nb·pr, nb). However,
if row j is not the first row in a data layout block, even this formula is too simplistic.
1 Load imbalance also occurs during communication, but for PDSYTRD on the machines that we studied the communication load imbalance was negligible.
2 Computing the reflector (Line 2.1) and computing the companion update vector (Line 5.1) require all the processors in the processor column owning column i of the matrix and are hence synchronization points.
Fortunately, (n−j+1)/pr + nb/2 is an excellent approximation, on average, for the maximum number
of rows of A(j:n, j:n) owned by any processor. (n−j+1)/pr + (nb/2)·(pr−1)/pr is more accurate, but the
difference is too small to be useful.
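The approximation can be checked against an exact count. The helper below (an illustration with hypothetical parameter names, not ScaLAPACK code) deals m remaining rows out block-cyclically over pr processors and compares the maximum to m/pr + nb/2:

```python
def rows_owned(m, nb, pr):
    """Rows owned by each of pr processors when m rows are dealt out
    block-cyclically in blocks of nb, starting at processor 0."""
    counts = [0] * pr
    for first in range(0, m, nb):
        block = min(nb, m - first)          # last block may be partial
        counts[(first // nb) % pr] += block
    return counts

m, nb, pr = 1000, 16, 7
exact_max = max(rows_owned(m, nb, pr))
approx = m / pr + nb / 2
# the approximation is good to within one block
assert abs(exact_max - approx) <= nb
```

Averaged over many values of m, the discrepancy is far smaller than the one-block worst case.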
The second source of load imbalance is that many of the computations are performed
only by the processors which own the current column of the matrix.
Updating the current column of A
As shown in Table 4.1, PDSYTRD updates the current column of A through two calls
to PDGEMV, one at line 350 of pdlatrd.f and one at line 355 of pdlatrd.f. Each of these calls
to PDGEMV requires that the first few elements of a column vector (W or V) be transposed
and replicated among all the processors in that column. The transposition is fast because
these elements are entirely contained within one processor, but the replication requires a
spread down (column-wise broadcast) of nb or fewer items.
Standard data layout model
By making a few assumptions, we can significantly simplify the model. By assuming
that pr = pc = √p, many of the terms coalesce. We also assume that the panel blocking
factor4, pbf, = 2, as it is in ScaLAPACK 1.5.
This standard data layout is also assumed in Figure 4.5 and in Chapter 5. The
models used in Figure 4.5 and in Chapter 5 are subsets, including only the most important
terms, of the "standard data layout" models shown in Tables 4.2 through 4.10.
Computing the reflector (Line 2.1 in Figure 4.5)
PDLARFG computes the reflector as shown in Table 4.2. First, it broadcasts α = A(j+1, j)
to all processes that own column A(:, j). Then, it computes the norm σ = ‖A(j+1:n, j)‖₂, leaving the result replicated across all processors that own column A(:, j).
The rest of the computation is entirely local and requires only 2n²/√p + O(n) flops,
hence does not contribute significantly to total execution time.
4 The matrix vector multiplies are each performed in panels of size pbf·nb. See Section 4.2.2.
Table 4.1: The cost of updating the current column of A in PDLATRD (Lines 1.1 and 1.2 in Figure 4.5). For each task: the file:line number or subroutine, the execution time contribution from columns j = 1 to n shown explicitly, and the simplified execution time.

Broadcast W(j, 1:j′−1)ᵀ within current column3
  pdlatrd.f:350, pdgemv_.c, pbdgemv.f:560, dgebs2d
  Σ_{j=1}^{n} ( ⌈log₂ pr⌉α + γ₄ + j′⌈log₂ pr⌉β )
  = n⌈log₂ pr⌉α + n·γ₄ + 0.5·n·nb⌈log₂ pr⌉β

Compute local portion of A(j:n, j) = A(j:n, j) − V(j:n, 1:j′−1)·W(j, 1:j′−1)ᵀ
  pdlatrd.f:350, pdgemv_.c, pbdgemv.f:580, dgemv
  Σ_{j=1}^{n} ( γ₂ + 2(n−j)(j′/pr)γ₂ )
  = n·γ₂ + 0.5(n²·nb/pr)γ₂

Broadcast V(j, 1:j′−1)ᵀ within current column
  pdlatrd.f:355, pdgemv_.c, pbdgemv.f:560, dgebs2d
  Σ_{j=1}^{n} ( ⌈log₂ pr⌉α + γ₄ + j′⌈log₂ pr⌉β )
  = n⌈log₂ pr⌉α + n·γ₄ + 0.5·n·nb⌈log₂ pr⌉β

Compute local portion of A(j:n, j) = A(j:n, j) − W(j:n, 1:j′−1)·V(j, 1:j′−1)ᵀ
  pdlatrd.f:355, pdgemv_.c, pbdgemv.f:580, dgemv
  Σ_{j=1}^{n} ( γ₂ + 2(n−j)(j′/pr)γ₂ )
  = n·γ₂ + 0.5(n²·nb/pr)γ₂

Total: 2n⌈log₂ pr⌉α + n·nb⌈log₂ pr⌉β + 2n·γ₂ + (n²·nb/pr)γ₂ + 2n·γ₄
Standard data layout (see Section 4.2.2): 2n⌈log₂ √p⌉α + n·nb⌈log₂ √p⌉β + 2n·γ₂ + (n²·nb/√p)γ₂ + 2n·γ₄
Table 4.2: The cost of computing the reflector (PDLARFG) (Line 2.1 in Figure 4.5). For each task: the file:line number or subroutine, the execution time contribution from columns j = 1 to n shown explicitly, and the simplified execution time.

Broadcast α = A(j+1, j)
  pdlatrd.f:364, pdlarfg.f:213, dgebs2d
  Σ_{j=1}^{n} ⌈log₂ pr⌉α  =  n⌈log₂ pr⌉α

xnorm = ‖A(j+1:n, j)‖₂
  pdlatrd.f:364, pdlarfg.f:229, pdnrm2
  Σ_{j=1}^{n} ( 2⌈log₂ pr⌉α + ½γ₄ )  =  2n⌈log₂ pr⌉α + (n/2)γ₄

τ = (β + α)/β
  pdlatrd.f:364, pdlarfg.f:271
  negligible

A(j+2:n, j) = A(j+2:n, j)/(α + β)
  pdlatrd.f:364, pdlarfg.f:272, pdscal
  Σ_{j=1}^{n} ½γ₄  =  (n/2)γ₄

E(j) = A(j+1, j) = β
  pdlatrd.f:364, pdlarfg.f:273
  negligible

Total: 3n⌈log₂ pr⌉α + n·γ₄
Standard data layout (see Section 4.2.2): 3n⌈log₂ √p⌉α + n·γ₄
Forming the matrix vector product using PDSYMV (Lines 3.1 through 3.6 in Figure 4.5)
The matrix A is laid out in a block cyclic manner as described in Section 2.5.4.
Computing the matrix vector product y = Av requires that v be copied to all processes
that own a part of A that needs to be multiplied by v. The vector v must be transposed5:
each processor in the processor column that owns v sends each processor in the processor
row exactly the elements that it needs, and vᵀ is then spread down; because only half of A
is stored, v must also be spread across. Then, the matrix vector multiplies6, w1 = tril(A, 0)v
and w2 = tril(A, −1)ᵀv, are performed locally. w1 is summed within columns, transposed,
and added to w2 = tril(A, −1)ᵀv, which is summed to the active column of processors. The
algorithm used by PDSYMV is:
Algorithm 4.1 PDSYMV as used to compute Av
1 Broadcast v within each row of processors (Line 3.1 in Figure 4.4)
2 Transpose v within each column of processors (Line 3.2 in Figure 4.4)
3 Broadcast vT within each column of processors (Line 3.2 in Figure 4.4)
4 Form diagonal portion of A (Line 3.3 in Figure 4.4)
5 w1 = locally available portion of tril(A; 0)v (Line 3.3 in Figure 4.4)
6 w2 = vT tril(A;�1) (Line 3.3 in Figure 4.4)
7 Sum w1 within each column of processors (Line 3.4 in Figure 4.4)
8 Sum w2 within each row of processors (Line 3.5 in Figure 4.4)
9 Transpose w1 and add to w2 (Line 3.6 in Figure 4.4)
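The serial identity underlying steps 5 and 6, that Av can be assembled from the stored lower triangle alone, is easy to check in NumPy (an illustration of the kernel, not the PBLAS code):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 9
A = rng.standard_normal((n, n))
A = A + A.T                      # symmetric; only tril(A) would be stored
v = rng.standard_normal(n)

w1 = np.tril(A) @ v              # lower triangle, including the diagonal
w2 = np.tril(A, -1).T @ v        # strictly lower triangle, transposed
assert np.allclose(w1 + w2, A @ v)
```

For a symmetric A, tril(A, 0) + tril(A, −1)ᵀ = A, so the two partial products sum to the full product.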
The two transpose operations, steps {2, 3} and step 9 in Algorithm 4.1, though both
are performed by PBDTRNV, use different communication patterns. The transpose performed
in steps 2 and 3 is an all-to-all. It takes v replicated across the processor columns and
distributed across the processor rows and produces vᵀ replicated across the processor rows
and distributed across the processor columns. The transpose performed in step 9 is a one-to-one
transpose. It takes y_uᵀ, distributed across the processor columns within one processor
row, and produces y_u, distributed across the processor rows within the current processor
column.
5 The non-transposed v is distributed like column A(:, i); the transposed v is distributed like row A(i, :).
6 tril() is MATLAB notation for the lower triangular portion of a matrix (including the diagonal). tril(·, −1)
refers to the portion of the matrix below the diagonal.
The all-to-all transposition is performed in two steps (steps 2 and 3 in Algorithm 4.1).
Since each column of processors contains a complete copy of the vector v,
each acts independently, first collecting the portion of vᵀ that belongs to this processor
column onto one processor7 and then broadcasting it to all processor columns. The operation of
collecting the portion of vᵀ that belongs to this processor column onto one processor is done as
a tree-based reduction, requiring ⌈log₂(lcm(pr, pc))⌉ messages and a total of
((lcm(pr, pc) − 1)/lcm(pr, pc))·(j/pc) words, which I model as j/pc words. The broadcast
which completes the transpose (step 3) requires ⌈log₂ pc⌉ messages and ⌈log₂ pc⌉·(j/pc) words.
The one-to-one transpose (step 9) is accomplished as a single set of direct messages.
Every word in y_uᵀ is owned by exactly one processor, and every word in y_u should be sent to one
processor. Every word in y_uᵀ is sent from the processor that owns it to the processor that
needs the corresponding word in y_u. All words being sent between the same two processors
are sent in a single message. Each processor that owns a part
of y_uᵀ sends every word that it owns, i.e. j/pc words, in lcm(pr, pc)/pc messages. Every processor that
needs a part of y_u receives the number of words that it needs, j/pr, in lcm(pr, pc) messages.
The two matrix vector multiplies are each performed in panels of size pbf·nb.
pbf, the panel blocking factor, is set to max(mullen, lcm(pr, pc)/pc), where mullen is a tuning
parameter set at compile time to 2 in ScaLAPACK 1.5.
The cost of the matrix vector multiply is detailed in Table 4.3.
The number of flops in the matrix vector multiply which any given processor must
perform is controlled by the size and shape of the local portion of the trailing matrix.
The processor holding the largest portion of the trailing matrix holds a matrix of size
approximately8 ⌈(n−j)/(mb·pr)⌉·mb by ⌈(n−j)/(nb·pc)⌉·nb. Because we update only the lower triangular portion
of the matrix, each element in the lower triangular portion of the matrix is used in two matrix
vector multiplies. And, because the shape of the local portion of the matrix is irregular
(a column block stair step with some diagonal steps), the matrix vector computation is
performed by column blocks. The irregular pattern repeats every (lcm(pr, pc)/pc)·nb columns, so pbf, the
panel blocking factor, is chosen to be max(mullen, lcm(pr, pc)/pc), where mullen is a compile time
7 If pr = lcm(pc, pr), the portion of data that belongs to this processor column is already on one processor and hence this "collection" is a null operation.
8 The largest local matrix size differs from this only when mod(n−j, nb·pr) < nb or mod(n−j, nb·pc) < nb.
Table 4.3: The cost of all calls to PDSYMV from PDSYTRD. For each task: the file:line number or subroutine, the execution time contribution from columns j = 1 to n shown explicitly, and the simplified execution time.

Broadcast v within each processor row (Line 3.1)
  pdlatrd.f:370, pdsymv_.c, pbdsymv.f:406, dgebr2d
  Σ_{j=1}^{n} ( ⌈log₂ pc⌉α + ⌈log₂ pc⌉((n−j)/pr + nb/2)β )
  = n⌈log₂ pc⌉α + 0.5(n²⌈log₂ pc⌉/pr)β + 0.5·n·nb⌈log₂ pc⌉β

Transpose v (Line 3.2)
  pdlatrd.f:370, pdsymv_.c, pbdsymv.f:421, pbdtrnv.f:385, pbdtrget
  Σ_{j=1}^{n} ( ⌈log₂(lcm(pr, pc)/pc)⌉α + ((n−j)/pc)β )
  = n⌈log₂ lcm(pr, pc)⌉α + 0.5(n²/pc)β

Broadcast vᵀ down within each processor column (Line 3.2)
  pdlatrd.f:370, pdsymv_.c, pbdsymv.f:421, pbdtrnv.f:400, dgebs2d
  Σ_{j=1}^{n} ( ⌈log₂ pr⌉α + ⌈log₂ pr⌉((n−j)/pc + nb/2)β )
  = n⌈log₂ pr⌉α + 0.5(n²⌈log₂ pr⌉/pc)β + 0.5·n·nb⌈log₂ pr⌉β

Form diagonal portion of matrix, padded with zeroes (Line 3.3)
  pdlatrd.f:370, pdsymv_.c, pbdsymv.f:685, pbdlacp1
  Σ_{j=1}^{n} ( ((n−j)/pc)γ₁ + ((n−j)/pc)γ₁ )
  = 0.5(n²/pc)γ₁ + 0.5(n²/pc)γ₁

w = tril(A, 0)v, wᵀ = vᵀ·tril(A, −1), local computation (Line 3.3)
  pdlatrd.f:370, pdsymv_.c, pbdsymv.f:702,704,757,759, dgemv
  2 Σ_{j=1}^{n} ( 2(n−j)/(pbf·nb·pc)·γ₂ + ⌈(n−j)/(nb·pr)⌉nb·⌈(n−j)/(nb·pc)⌉nb·γ₂ + j·pbf·nb/pr·γ₂ )
  = 2(n²/(pbf·nb·pc))γ₂ + (2/3)(n³/p)γ₂ + 0.5(n²·nb/pr)γ₂ + 0.5(n²·nb/pc)γ₂ + (n²·nb·pbf/pr)γ₂

Sum w row-wise (Line 3.4)
  pdlatrd.f:364, pdsymv_.c, pbdsymv.f:801, dgsum2d
  Σ_{j=1}^{n} ( ⌈log₂ pc⌉α + ⌈log₂ pc⌉((n−j)/pr + nb/2)β )
  = n⌈log₂ pc⌉α + 0.5(n²⌈log₂ pc⌉/pr)β + 0.5·n·nb⌈log₂ pc⌉β

Sum wᵀ column-wise (Line 3.5)
  pdlatrd.f:364, pdsymv_.c, pbdsymv.f:809, dgsum2d
  Σ_{j=1}^{n} ( ⌈log₂ pr⌉α + ⌈log₂ pr⌉((n−j)/pc + nb/2)β )
  = n⌈log₂ pr⌉α + 0.5(n²⌈log₂ pr⌉/pc)β + 0.5·n·nb⌈log₂ pr⌉β

Transpose wᵀ and sum into w (Line 3.6)
  pdlatrd.f:370, pdsymv_.c, pbdsymv.f:811, pbdtrnv
  Σ_{j=1}^{n} ( (lcm(pr, pc)/pr)α + (lcm(pr, pc)/pc)α + ((n−j)/pc)β + ((n−j)/pr)β + ((n−j)/(nb·pc))γ₁ + ((n−j)/(nb·pr))γ₁ )
  = n(lcm(pr, pc)/pr)α + n(lcm(pr, pc)/pc)α + 0.5(n²/pc)β + 0.5(n²/pr)β + 0.5(n²/(nb·pc))γ₁ + 0.5(n²/(nb·pr))γ₁

Total:
2n⌈log₂ pc⌉α + 2n⌈log₂ pr⌉α + n(lcm(pr, pc)/pr)α + n(lcm(pr, pc)/pc)α + n⌈log₂ lcm(pr, pc)⌉α + (n²⌈log₂ pc⌉/pr)β + (n²⌈log₂ pr⌉/pc)β + (n²/pc)β + 0.5(n²/pr)β + n·nb⌈log₂ pc⌉β + n·nb⌈log₂ pr⌉β + 2(n²/(pbf·nb·pc))γ₂ + (2/3)(n³/p)γ₂ + 0.5(n²·nb/pr)γ₂ + 0.5(n²·nb/pc)γ₂ + (n²·nb·pbf/pr)γ₂ + (n²/pc)γ₁ + n·γ₄

Standard data layout (see Section 4.2.2):
4n⌈log₂ √p⌉α + 2nα + 2(n²⌈log₂ √p⌉/√p)β + 1.5(n²/√p)β + 2n·nb⌈log₂ √p⌉β + (n²/(nb·√p))γ₂ + (2/3)(n³/p)γ₂ + 3(n²·nb/√p)γ₂ + (n²/√p)γ₁ + n·γ₄
parameter, set to 2 in the standard PBLAS release. The column panels are filled out with
zeroes to make the matrix vector multiply efficient. Even the act of filling the diagonal
blocks with zeroes, because it is done inefficiently, is noticeable on modest problem sizes.
The number of flops required for a global (n−j) × (n−j) matrix vector multiply
is approximately:

2 × 2 × ( ½·⌈(n−j)/(nb·pr)⌉·nb · ⌈(n−j)/(nb·pc)⌉·nb + (n−j)·pbf·nb/(2·pr) ).
The first 2 is because multiplies and adds are counted separately. Each element in the lower
triangular portion of the matrix is involved twice, hence the second 2. The first term stems
directly from the size of the local matrix. The second term stems from the odd shape of
the local matrix and is primarily the result of the unnecessary flops (zero matrix elements)
added to reduce the number of dgemv calls.
We use the following equality, dropping the O(n) term:

Σ_{i=1}^{n} ⌈i/a⌉·⌈i/b⌉ = n³/(3ab) + n²/(4a) + n²/(4b) + O(n).
flops = 2 × 2 × Σ_{j=1}^{n} ( ½·⌈(n−j)/(nb·pr)⌉·nb · ⌈(n−j)/(nb·pc)⌉·nb + j·pbf·nb/(2·pr) )
      = 2 × 2 × ½ × ( n³/(3·pr·pc) + ¼·(n²/(nb·pc))·nb² + ¼·(n²/(nb·pr))·nb² + (n²/2)·(pbf·nb/pr) )
      = (2/3)·n³/p + n²·nb/(2·pr) + n²·nb/(2·pc) + n²·nb·pbf/pr

Figure 4.6: Flops in the critical path during the matrix vector multiply
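Both the summation equality above and the closed form of Figure 4.6 can be sanity-checked numerically (a quick check with illustrative parameter values, not a timing model):

```python
import math

def ceil_sum(n, a, b):
    """Left-hand side of the equality: sum of ceil(i/a)*ceil(i/b)."""
    return sum(math.ceil(i / a) * math.ceil(i / b) for i in range(1, n + 1))

def flops_exact(n, nb, pr, pc, pbf):
    """Critical-path flops in the matrix vector multiply, summed directly."""
    return 2 * 2 * sum(
        0.5 * math.ceil((n - j) / (nb * pr)) * nb
            * math.ceil((n - j) / (nb * pc)) * nb
        + j * pbf * nb / (2 * pr)
        for j in range(1, n + 1))

def flops_model(n, nb, pr, pc, pbf):
    """Closed form from Figure 4.6."""
    p = pr * pc
    return (2 / 3) * n**3 / p + n**2 * nb / (2 * pr) \
        + n**2 * nb / (2 * pc) + n**2 * nb * pbf / pr

# the equality holds up to an O(n) error term
n, a, b = 2000, 16, 24
assert abs(ceil_sum(n, a, b)
           - (n**3 / (3 * a * b) + n**2 / (4 * a) + n**2 / (4 * b))) < 10 * n

# the closed form tracks the direct sum to within a few percent
n, nb, pr, pc, pbf = 1500, 16, 4, 4, 2
assert abs(flops_exact(n, nb, pr, pc, pbf) - flops_model(n, nb, pr, pc, pbf)) \
    < 0.05 * flops_model(n, nb, pr, pc, pbf)
```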
Updating the matrix vector product
Updating the matrix vector product, y = y − V·Wᵀ·v − W·Vᵀ·v, requires four matrix vector
multiplies. temp = Wᵀv and temp = Vᵀv each multiply an (n−j) × j′ matrix by a vector of
length n−j, where j′ = j mod nb. Both the matrix and the vector are stored in the current
process column. No data movement is required to perform the computation; however, the
result, a vector of length j′−1, is the sum of the matrix vector multiplies performed on each
of the processes in the process column.
Table 4.4: The cost of updating the matrix vector product in PDLATRD (Line 4.1 in Figure 4.5). For each task: the file:line number or subroutine, the execution time contribution from columns j = 1 to n shown explicitly, and the simplified execution time.

Broadcast Wᵀ unnecessarily for temp = Wᵀv
  pdlatrd.f:373, pbdgemv.f:826, dgebs2d
  Σ_{j=1}^{n} ( γ₄ + ⌈log₂ pc⌉α + ⌈log₂ pc⌉((n−j)/pr)β + ⌈log₂ pc⌉(nb/2)β )
  = n·γ₄ + n⌈log₂ pc⌉α + 0.5(n²⌈log₂ pc⌉/pr)β + 0.5·n·nb⌈log₂ pc⌉β

Local computation of temp = Wᵀv
  pdlatrd.f:373, pbdgemv.f:846, dgemv
  Σ_{j=1}^{n} ( γ₂ + ((n−j)·nb/pr)γ₂ )
  = n·γ₂ + 0.5(n²·nb/pr)γ₂

Sum the contribution of temp from all processes in the column
  pdlatrd.f:373, pbdgemv.f:858, dgsum2d
  Σ_{j=1}^{n} ( ⌈log₂ pr⌉α + ⌈log₂ pr⌉(nb/2)β )
  = n⌈log₂ pr⌉α + 0.5·n·nb⌈log₂ pr⌉β

Broadcast temp (row-wise) to all processes in this column
  pdlatrd.f:376, pbdgemv.f:579, dgebs2d
  Σ_{j=1}^{n} ( γ₄ + ⌈log₂ pr⌉α + ⌈log₂ pr⌉(nb/2)β )
  = n·γ₄ + n⌈log₂ pr⌉α + 0.5·n·nb⌈log₂ pr⌉β

Local computation of y = V·temp
  pdlatrd.f:376, pbdgemv.f:600, dgemv
  Σ_{j=1}^{n} ( γ₂ + ((n−j)·nb/pr)γ₂ )
  = n·γ₂ + 0.5(n²·nb/pr)γ₂

y = y + W·Vᵀ·v is identical to y = y + V·Wᵀ·v
  pdlatrd.f:379, pdlatrd.f:382
  = n⌈log₂ pc⌉α + 2n⌈log₂ pr⌉α + 0.5(n²⌈log₂ pc⌉/pr)β + 0.5·n·nb⌈log₂ pc⌉β + n·nb⌈log₂ pr⌉β + 2n·γ₂ + (n²·nb/pr)γ₂ + 2n·γ₄

Total:
2n⌈log₂ pc⌉α + 4n⌈log₂ pr⌉α + (n²⌈log₂ pc⌉/pr)β + n·nb⌈log₂ pc⌉β + 2n·nb⌈log₂ pr⌉β + 4n·γ₂ + 2(n²·nb/pr)γ₂ + 4n·γ₄

Standard data layout (see Section 4.2.2):
6n⌈log₂ √p⌉α + (n²⌈log₂ √p⌉/√p)β + 3n·nb⌈log₂ √p⌉β + 4n·γ₂ + 2(n²·nb/√p)γ₂ + 4n·γ₄
The other two matrix vector multiplies, y = V·temp and y = W·temp, each multiply an
(n−j) × (j′−1) matrix by a vector of length j′−1. Again, the computation is performed
entirely within the current process column. The 1 × (j′−1) vector temp must be spread
down, i.e. broadcast column-wise, to all processes in this process column; however, no further
communication is necessary in order to update y, as y is perfectly aligned with V.
Details are given in Table 4.4.
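Serially, the update amounts to two short matrix vector products that produce temp followed by two more that apply it, avoiding any n × n intermediate (a NumPy illustration, not the parallel code):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 10, 4                      # k = number of columns accumulated so far
V = rng.standard_normal((n, k))
W = rng.standard_normal((n, k))
v = rng.standard_normal(n)
y = rng.standard_normal(n)

# two short matvecs produce the small intermediates ...
temp_w = W.T @ v                  # in PDSYTRD: local products, then a column sum
temp_v = V.T @ v                  # ... then temp is broadcast down the column
# ... and two more apply them
y_updated = y - V @ temp_w - W @ temp_v

# same result as forming the (much larger) n-by-n products explicitly
assert np.allclose(y_updated, y - (V @ W.T) @ v - (W @ V.T) @ v)
```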
Computing the companion update vector
The details involved in computing the companion update vector are shown in table 4.5.
Table 4.5: The cost of computing the companion update vector in PDLATRD (Line 5.1 in Figure 4.5). For each task: the file:line number or subroutine, the execution time contribution from columns j = 1 to n shown explicitly, and the simplified execution time.

Compute y = β·y
  pdlatrd.f:385, pdscal
  Σ_{j=1}^{n} (1/3)γ₄  =  (1/3)n·γ₄

Compute δ = −0.5·β·yᵀv
  pdlatrd.f:386, pddot
  Σ_{j=1}^{n} ( ⌈log₂ pr⌉α + (1/3)γ₄ )  =  n⌈log₂ pr⌉α + (1/3)n·γ₄

Compute w = y + δ·v
  pdlatrd.f:390, pdaxpy
  Σ_{j=1}^{n} ( ⌈log₂ pr⌉α + (1/3)γ₄ )  =  n⌈log₂ pr⌉α + (1/3)n·γ₄

Total: 2n⌈log₂ pr⌉α + n·γ₄
Standard data layout (see Section 4.2.2): 2n⌈log₂ √p⌉α + n·γ₄
Performing the rank-2k update
The rank-2k update is performed once per block column (i.e. n/nb times):

A = A − V·Wᵀ − W·Vᵀ.

PDSYTRD broadcasts V and W along processor rows, transposes them and then
broadcasts them along processor columns. I ignore the α (latency) cost of the transpose
here, because it is less significant (by a factor of nb) than the similar cost for the transpose in
the matrix-vector multiply and because it is only relevant when lcm(pr, pc)/pc is very large. The
third β term in the transpose and broadcast operation should be multiplied by
(lcm(pr, pc)/pc − 1)/(lcm(pr, pc)/pc), but the added complexity is not justified for a small term.
The number of flops performed during the rank two update of A(j:n, j:n) is
modeled as:

2 × 2 × nb × ( ½·((n−j)/pr + nb/2)·((n−j)/pc + nb/2) + n·nb·pbf/(2·pr) ).

The number of flops performed per matrix element involved in the rank-2 update is 2 × 2 × nb.
The number of elements in the lower triangular matrix is given by the sum of the terms
within the parentheses.
The total number of flops for all rank two updates is modeled as the sum of this
quantity as j ranges from nb to n by nb.
Table 4.6: The cost of performing the rank-2k update (PDSYR2K) (Lines 6.1 through 6.3 in Figure 4.5). For each task: the file:line number or subroutine, the execution time contribution (summed over j = nb, 2nb, …, n) shown explicitly, and the simplified execution time.

Broadcast V and W within process rows (Line 6.1)
  pdsytrd.f:354, pdsyr2k_.c, pdsyr2k.f:454,477, dgebs2d
  Σ_{j=nb,2nb,…,n} ( 2⌈log₂ pc⌉α + 2((n−j)/pr + nb/2)⌈log₂ pc⌉β )
  = 2(n/nb)⌈log₂ pc⌉α + (n²/pr)⌈log₂ pc⌉β − n·nb⌈log₂ pc⌉β

Transpose and broadcast V and W within process columns (Line 6.2)
  pdsytrd.f:354, pdsyr2k_.c, pdsyr2k.f:491,847, pbdtran
  Σ_{j=nb,2nb,…,n} ( 2⌈log₂ pr⌉α + 2((n−j)/pc + nb/2)·nb⌈log₂ pr⌉β + ((n−j)/pc)β )
  = 2(n/nb)⌈log₂ pr⌉α + (n²/pc)⌈log₂ pr⌉β − n·nb⌈log₂ pr⌉β + (n²/pc)β

tril(A, 0) = tril(A, 0) + V·Wᵀ + W·Vᵀ (Line 6.3)
  pdsytrd.f:354, pdsyr2k_.c, pdsyr2k.f:655-660, 1052-1057, pdgemm
  Σ_{j=nb,2nb,…,n} ( 4(n−j)/(nb·pc·pbf)·γ₃ + 4·nb·( ½((n−j)/pr + nb/2)((n−j)/pc + nb/2) + n·nb·pbf/(2·pr) )·γ₃ )
  = 2(n²/(nb²·pc·pbf))γ₃ + (2/3)(n³/p)γ₃ + 0.5(n²·nb/pr)γ₃ + 0.5(n²·nb/pc)γ₃ + (n²·nb·pbf/pr)γ₃

Total:
2(n/nb)⌈log₂ pc⌉α + 2(n/nb)⌈log₂ pr⌉α + (n²/pr)⌈log₂ pc⌉β + (n²/pc)⌈log₂ pr⌉β + (n²/pc)β − n·nb⌈log₂ pc⌉β − n·nb⌈log₂ pr⌉β + 2(n²/(nb²·pc·pbf))γ₃ + (2/3)(n³/p)γ₃ + 0.5(n²·nb/pr)γ₃ + 0.5(n²·nb/pc)γ₃ + (n²·nb·pbf/pr)γ₃

Standard data layout (see Section 4.2.2):
4(n/nb)⌈log₂ √p⌉α + 2(n²/√p)⌈log₂ √p⌉β + (n²/√p)β − 2n·nb⌈log₂ √p⌉β + (n²/(nb²·√p))γ₃ + (2/3)(n³/p)γ₃ + 3(n²·nb/√p)γ₃
The negative term (−2(n²·nb/p)γ₃), which results from the fact that j starts at nb, is
ignored because it is O(n²·nb/p) and hence too small.
Details are given in table 4.6.
4.2.3 PDSYTRD execution time summary
Table 4.7 shows that the computation cost in PDSYTRD is:

(2/3)(n³/p)γ₃ + (2/3)(n³/p)γ₂ + (n²·nb·pbf/pr)γ₂ + (7/2)(n²·nb/pr)γ₂ + (1/2)(n²·nb/pc)γ₂ + (n²·nb·pbf/pr)γ₃ + (1/2)(n²·nb/pr)γ₃ + (1/2)(n²·nb/pc)γ₃ + 2(n²/(nb²·pc·pbf))γ₃ + 6n·γ₂ + 2(n²/(pr·pbf·nb))γ₂ + (n²/pc)γ₁ + 9n·γ₄.
The most important terms in the computation cost are the O(n³/p) flops. The
relative importance of the other (o(n³)) terms depends on the computer. On the PARAGON
none stands out above the rest. Indeed, on the PARAGON none of the o(n³) terms accounts
for more than 3% of the total execution time of PDSYEVX when n = 3480 and p = 64.
However, all of the o(n³) terms combined account for 21% of the total execution time on that
same problem.
Table 4.8 shows that the computation cost in the tridiagonal eigendecomposition
in PDSYEVX is:

53(n·e/p)δ + 3(n·m/p)δ + 112n·δ + 265(n·e/p)γ1 + 45(n·m/p)γ1 + 620n·γ1 + 6n·c²·γ1.
The execution time of tridiagonal eigendecomposition is dominated by the cost
of divides and by the size of the largest cluster, c. The load imbalance terms (112n·δ and
620n·γ1) are negligible.
Table 4.9 shows that the communication cost in PDSYTRD is:

4n⌈log2(pc)⌉α + 13n⌈log2(pr)⌉α + n(lcm(pr, pc)/pr)β + n⌈log2(lcm(pr, pc))⌉α + 3(n²/pr)⌈log2(pc)⌉β + 2(n²/pc)⌈log2(pr)⌉β + (1/2)(n²/pr)β + 2(n²/pc)β.
Most of the messages are in broadcasts and reductions (i.e. the O(n log(p))
terms), and most of the broadcasts and reductions (13n) are within processor rows, versus
only 4n broadcasts and reductions within processor columns. By contrast, the message
volume is fairly evenly split between broadcasts and reductions within processor rows
(3(n²/pr)⌈log2(pc)⌉β) and broadcasts and reductions within processor columns
(2(n²/pc)⌈log2(pr)⌉β).

The lcm terms are negligible unless p is very large, in which case it is important
to make sure that lcm(pr, pc) is reasonable (say < 10·max(pr, pc)).
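As a quick illustration (grid shapes chosen arbitrarily), power-of-two grids keep lcm(pr, pc) small, while nearly square grids with coprime dimensions violate the guideline:

```python
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

def grid_lcm_ok(pr, pc):
    """Rule of thumb from the text: lcm(pr, pc) should stay below 10*max(pr, pc)."""
    return lcm(pr, pc) < 10 * max(pr, pc)

print(grid_lcm_ok(16, 32))   # lcm = 32: fine
print(grid_lcm_ok(31, 33))   # lcm = 1023: unreasonable
```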
4.3 Eigendecomposition of the tridiagonal
The execution time of tridiagonal eigendecomposition is dominated by two factors:
the size of the largest cluster of eigenvalues and the speed of the divide.
4.3.1 Bisection
During bisection, in DSTEBZ, each Sturm count requires n divisions and 5n other
flops to produce one additional bit of accuracy. Hence, it takes roughly 53n divisions and
53·5n flops⁹ for each eigenvalue, and 53·n·e total divisions for all eigenvalues in IEEE
double precision, where e is the number of eigenvalues to be computed. The exact number
of divisions and flops depends on the actual eigenvalues, the parallelization strategy and
other factors. However, this simple model suffices for our purposes.
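A serial sketch of a Sturm count and the bisection loop just described (the zero-pivot perturbation mimics the spirit of DSTEBZ's pivot safeguard; the bracketing interval [lo, hi] is assumed to contain the spectrum):

```python
def sturm_count(d, e, x):
    """Count eigenvalues of the symmetric tridiagonal matrix (diagonal d,
    off-diagonal e) that are less than x, by counting negative pivots of
    T - x*I.  One divide plus a few other flops per row, as in the model."""
    count = 0
    q = 1.0
    for i in range(len(d)):
        ei2 = e[i - 1] ** 2 if i > 0 else 0.0
        q = d[i] - x - ei2 / q      # the divide that dominates on a slow-divide machine
        if q == 0.0:
            q = -1e-300             # perturb exact-zero pivots (pivmin-style safeguard)
        if q < 0.0:
            count += 1
    return count

def bisect_eig(d, e, k, lo, hi, iters=53):
    """Bisect for the k-th smallest eigenvalue; ~53 halvings suffice in IEEE double."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sturm_count(d, e, mid) <= k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For the 3-by-3 matrix with diagonal 2 and off-diagonal 1, bisect_eig recovers the smallest eigenvalue 2 − √2.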
4.3.2 Inverse iteration
Inverse iteration typically requires 3n divides and 45n flops per eigenvalue, plus
the cost of re-orthogonalization.
In PDSYEVX the number of flops performed by any particular processor, p_i, during
re-orthogonalization is:

Σ_{C ∈ {clusters assigned to p_i}} Σ_{i=1}^{size(C)} 4·n_iter(i)·n·(i − 1),

where n_iter(i) is the number of inverse iterations performed for eigenvalue i (typically 3).
If the size of the largest cluster is greater than n/p, the processor which is responsible for
this cluster will not be responsible for any eigenvalues outside of this cluster.
Hence, if the size of the largest cluster is greater than n/p, the number of flops
performed by the processor to which this cluster is assigned is (on average):

4·n_iter·n·(1/2)c² = 6n·c²,
⁹Although these are not all BLAS Level 1 flops, they have the same ratio of memory operations to flops that is typical of BLAS Level 1 operations.
where c = max_{C ∈ {clusters}} size(C), i.e. the number of eigenvalues in the largest cluster, and
n_iter = 3 is the average number of inverse iterations performed for each eigenvalue.
As the problem size and number of processors grow, the largest cluster that
PDSYEVX is able to reorthogonalize properly gets smaller (relative to n). As a consequence,
reorthogonalization will not require large execution time.¹⁰ Specifically, if the largest cluster
has fewer than n/p eigenvalues (i.e. fits easily on one processor), the number of eigenvalues
that will be assigned to any one processor, and hence the total number of flops it must
perform, is limited. The worst case is where there are p + 1 clusters, each of size n/(p+1).
In this case, one processor must be assigned 2 clusters of size n/(p+1), requiring (on average)
2·6n·(n/(p+1))², or roughly 12n³/p².¹¹
Our model for the execution time of Gram-Schmidt re-orthogonalization
(Σ_{i=1}^{c} 4n·i = 2n·c²·γ1, where c is the size of the largest cluster) assumes that the
processor to which the largest cluster is assigned is not assigned any other clusters. This is
true if the largest cluster has more than n/p eigenvalues in it. If the largest cluster of
eigenvalues contains fewer than n/p eigenvalues, reorthogonalization is relatively unimportant.
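The 2n·c²·γ1 cost corresponds to a Gram-Schmidt pass over the c iterates of one cluster; a serial sketch in modified Gram-Schmidt form (cluster size and data are illustrative):

```python
import numpy as np

def reorthogonalize(Z):
    """Modified Gram-Schmidt on the c eigenvector iterates of one cluster
    (columns of Z).  Column i is orthogonalized against the i-1 previous
    columns at about 4*n*(i-1) flops, so the cluster costs roughly 2*n*c**2."""
    Q = Z.copy()
    n, c = Q.shape
    for i in range(c):
        for j in range(i):
            Q[:, i] -= (Q[:, j] @ Q[:, i]) * Q[:, j]   # about 4n flops
        Q[:, i] /= np.linalg.norm(Q[:, i])
    return Q

rng = np.random.default_rng(0)
Q = reorthogonalize(rng.standard_normal((100, 5)))
print(np.allclose(Q.T @ Q, np.eye(5)))   # columns are now orthonormal
```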
Inderjit Dhillon, Beresford Parlett and Vince Fernando's recent work [139, 77] on
the tridiagonal eigenproblem substantially reduces the motivation to model the existing
ScaLAPACK tridiagonal eigensolution code in great detail, since we expect their algorithm to
replace the current code with something that costs O(n²/p) flops, O(n²/p) message volume
and O(p) messages, which is negligible compared to tridiagonal reduction.
4.3.3 Load imbalance in bisection and inverse iteration

Load imbalance during the tridiagonal eigendecomposition is caused in part by
the fact that not all processes will be assigned the same number of eigenvalues and eigenvectors
and in part by the fact that different eigenvalues and eigenvectors will require
slightly different amounts of computation. Our experience indicates that the load imbalance
corresponds roughly to the cost of finding two eigenvalues (2·(53n·δ + 53·5n·γ1))
and two eigenvectors (2·(3n·δ + 45n·γ1)) on one processor. Hence, our execution time
model for the load imbalance during tridiagonal eigendecomposition is:
(2·53 + 2·3)n·δ + (2·53·5 + 2·45)n·γ1 = 112n·δ + 620n·γ1.

¹⁰This is not to suggest that reorthogonalization in PDSYEVX gets better as n and p increase (indeed, PDSYEVX may fail to reorthogonalize large clusters for large n and p). It just means that reorthogonalization in PDSYEVX will not take a long time for large n and large p.
¹¹The appearance of p² in the denominator stems from the restriction c ≤ n/p, meaning that as p increases the largest cluster size that PDSYEVX can handle efficiently decreases.
In evaluating the cost of load imbalance in tridiagonal eigendecomposition, one
must include load imbalance in Gram-Schmidt reorthogonalization. Indeed, if the input
matrix has one cluster of eigenvalues that is substantially larger than all others (yet small
enough to fit on one processor so that PDSYEVX can reorthogonalize it), Gram-Schmidt
reorthogonalization is very poorly load balanced and could be treated almost entirely as a
load imbalance cost.
We do not separate the load imbalance cost of Gram-Schmidt from what the execution
time for Gram-Schmidt would be if the load were balanced, because doing so would
complicate the model without making it match actual execution time any better.
4.3.4 Execution time model for tridiagonal eigendecomposition in PDSYEVX

The cost of tridiagonal eigendecomposition in PDSYEVX is the sum of the cost of
bisection, inverse iteration and reorthogonalization. Hence:

53(n·e/p)δ + 3(n·m/p)δ + 112n·δ + 265(n·e/p)γ1 + 45(n·m/p)γ1 + 620n·γ1 + 2n·c²·γ1.

The load imbalance terms 112n·δ and 620n·γ1 stem partly from the fact that
some processors will typically be assigned at least one more eigenvalue and/or eigenvector
than other processors and partly from the fact that both bisection and inverse iteration are
iterative procedures requiring more time on some eigenvalues than on others.
4.3.5 Redistribution
Inverse iteration typically leaves the data distributed in a manner in which it would
be awkward and inefficient to perform back transformation. If each eigenvector is computed
entirely within one processor, as PDSTEIN does, inverse iteration requires no communication,
provided that all processors have a copy of the tridiagonal matrix and the eigenvalues. This,
however, leaves the eigenvector matrix distributed in a one-dimensional manner in which
back transformation would be inefficient. Furthermore, since different processors may have
been assigned to compute a different number of eigenvectors (to improve orthogonality
among the eigenvectors), the eigenvector matrix will typically not be distributed in a block
cyclic manner. Since PDORMTR (and all ScaLAPACK matrix transformations) requires that the
data be in a 2D block cyclic distribution, the eigenvectors must, at least, be redistributed to
a block cyclic distribution. For convenience and potential efficiency¹², PDSTEIN redistributes
the eigenvector matrix.
The simplest method of data redistribution is to have each processor send one
message to each of the other processors. That message contains the data owned by the
sender and needed by the receiver. Redistributing the data in this manner requires that
each processor send every element that it owns to other processors¹³ and receive what
it needs from other processors. Since each processor owns¹⁴ roughly (n·m)/p elements
and needs roughly (n·m)/p elements, the total data sent and received by each processor
is roughly 2(n·m)/p. In our experience, data redistribution is slightly less efficient than
other broadcasts and reductions, and hence we use 4(n·m)/p·β as our model for the data
redistribution cost.
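A sketch of this cost model (the β value is the Paragon word-transfer time used in Table 5.1; n, m, p are illustrative):

```python
def redistribution_time_us(n, m, p, beta):
    """PDSTEIN redistribution model: each processor sends and receives about
    2*n*m/p words; the factor 4 (not 2) reflects the observation that
    redistribution runs at roughly half the efficiency of other collectives."""
    return 4 * (n * m) / p * beta

# n = m = 3840, p = 64, beta = 0.146 microseconds/word
print(redistribution_time_us(3840, 3840, 64, 0.146) / 1e6, "seconds")
```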
4.4 Back Transformation
Transforming the eigenvectors of the tridiagonal matrix back to the eigenvectors of
the original matrix requires applying a series of Householder transformations. The Householder
updates can be applied in a blocked manner, with each update taking the form (I + V·T·Vᵀ),
where V ∈ R^(n′×nb) is the matrix of Householder vectors and T is an nb × nb triangular
matrix [27].
The following steps compute Z′ = (I + V·T·Vᵀ)Z. These are performed for each
block Householder update. The major contributors to the cost are noted below.
Compute T
Computing the nb × nb triangular matrix T requires nb calls to DGEMV, a summation
of nb²/2 elements within the current processor column, and nb calls to DTRMV. The
computation of T need not be in the critical path: there are n/nb different matrices
T that need to be computed, and they could be computed in advance in parallel.
Compute W = VᵀZ
Spread V across. Compute VᵀZ locally. Sum W within each processor column.
¹²The actual efficiency depends upon the data distribution chosen by the user for the input and output matrices.
¹³Although some data will not have to be sent because it is owned and needed by the same processor, this will typically be a minor savings.
¹⁴In the absence of large clusters of eigenvalues assigned to a single processor.
The spread across of V is performed on a ring topology because the processor columns
need not be synchronized. Each processor column must receive V and send V; hence
the cost for each processor column is (2n′·nb)/pr·β.
The local computation of VᵀZ is a call to DGEMM involving 2(m/pc + vnb)(n′/pr +
nb/2)·nb flops. Ignoring the lower-order vnb·nb² term, this is:

2(n′·m·nb)/p + 2(n′·vnb·nb)/pr + (m·nb²)/pc.
Compute W = TW
Local.
Compute Z = Z − V·W
Spread W down. Local computation. (Note: V has already been spread across.)
The local computation of Z − V·W, like the computation of VᵀZ, involves a call to
DGEMM performing 2(m/pc + vnb)(n′/pr + nb/2)·nb flops.
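The steps above can be sketched serially in NumPy, ignoring the parallel distribution; the data and the triangular factor T here are random placeholders (not a true compact-WY factor), which is enough to check the update algebra. Note that sign conventions vary: LAPACK's DLARFB applies I − V·T·Vᵀ, while the text's convention is I + V·T·Vᵀ.

```python
import numpy as np

rng = np.random.default_rng(1)
n, nb, m = 60, 4, 7
V = rng.standard_normal((n, nb))
T = np.triu(rng.standard_normal((nb, nb)))  # nb-by-nb triangular factor (placeholder)
Z = rng.standard_normal((n, m))

# The steps of one block update Z' = (I + V T V^T) Z:
W = V.T @ Z       # Compute W = V^T Z (big DGEMM; summed within a process column in parallel)
W = T @ W         # Compute W = T W   (local triangular multiply)
Znew = Z + V @ W  # Compute Z' = Z + V W (second big DGEMM)

# Check against forming the update explicitly
print(np.allclose(Znew, (np.eye(n) + V @ T @ V.T) @ Z))
```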
Back transformation differs from reduction to tridiagonal form in many ways. It
requires many fewer messages: O(n/nb) versus O(n). Because the back transformation of
each eigenvector is independent, the Householder updates can be applied in a pipelined
manner, allowing V to be broadcast in a ring instead of a tree topology. PDLARFB does
not use the PBLAS, allowing V to be broadcast once but used twice. Since the number of
eigenvectors does not change during the update, half of the load imbalance depends on
mod(n, nb·pc) and can be reduced significantly if mod(n, nb·pc) = 0. In the following
table, vnb is the imbalance in the 2D block-cyclic distribution of eigenvectors¹⁵.
The cost of back transformation, shown in Table 4.10, is asymmetric: the O(n²/pr)
cost is smaller than the O(n²/pc) cost. Furthermore, the O(n²/pr) cost can be reduced
further by computing T in parallel and choosing a data layout which will minimize vnb.
Reducing the O(n²/pr) cost would allow pr < pc, reducing the O(n²/pc) costs. This is
discussed further in Chapter 8.
¹⁵vnb is computed as follows: extravecsonproc1 − extravecs/pr, where extravecs = mod(n, nb·pc) and extravecsonproc1 = min(nb, extravecs).
Table 4.7: Computation cost in PDSYEVX. Each row gives a scale factor and its total
coefficient, summed over the phases: update current column (Table 4.1), compute reflector
(Table 4.2), matrix-vector product (Table 4.4), update vector product (Table 4.5), compute
update vector, rank-2k update (Table 4.6), tridiagonal eigendecomposition (Section 4.3),
and back transformation (Table 4.10).

  scale factor              total coefficient
  (n³/p)γ3                  2/3
  (n²·m/p)γ3                2
  (n³/p)γ2                  2/3
  (n²·nb·pbf/pr)γ2          1
  (n²·nb/pr)γ2              4
  (n²·nb/pc)γ2              1/2
  (n²·nb·pbf/pr)γ3          1
  (n²·nb/pr)γ3              1/2
  (n²·vnb/pr)γ3             2
  (n²·nb/pc)γ3              1/2
  (n·m·nb/pc)γ3             3
  (n/nb)α3                  3
  (n²/(nb²·pc))·pbf·α3      2
  n·α2                      8
  (n²/(pr·pbf·nb))α2        2
  (n²/pc)γ1                 1
  n·α4                      9
Table 4.8: Computation cost (tridiagonal eigendecomposition) in PDSYEVX. All
contributions come from the tridiagonal eigendecomposition (Section 4.3).

  scale factor     total coefficient
  (n·e/p)δ         53
  (n·m/p)δ         3
  n·δ              112
  (n·e/p)γ1        265
  (n·m/p)γ1        45
  n·γ1             620
  n·c²·γ1          6
Table 4.9: Communication cost in PDSYEVX. Each row gives a scale factor and its total
coefficient, summed over the same phases as in Table 4.7.

  scale factor                  total coefficient
  n⌈log2(pc)⌉α                  4
  n⌈log2(pr)⌉α                  13   (contributions 2 + 3 + 2 + 4 + 2)
  n(lcm(pr, pc)/pr)β            1
  n(lcm(pr, pc)/pc)β            1
  n⌈log2(lcm(pr, pc))⌉α         1
  (n²/pr)⌈log2(pc)⌉β            3
  (n²/pc)⌈log2(pr)⌉β            2
  (n²/pr)β                      1/2
  (n²/pc)β                      2
  n·nb·⌈log2(pc)⌉β              1    (contributions 1 + 1 − 1)
  n·nb·⌈log2(pr)⌉β              3    (contributions 1 + 1 + 2 − 1)
Table 4.10: The cost of back transformation (PDORMTR)

Compute T
(pdsyevx.f:855, pdormtr.f:408, pdormqr.f:394, pdlarft.f)
  Contribution from columns n′ = nb to n:
    Σ_{n′=nb,2nb,...}^{n} [ ⌈log2(pr)⌉(nb²/2)β + 2nb·α2 + (2n′·nb²/(2pr))γ2 ]
  Simplified:
    2n·α2 + (1/2)(n²·nb/pr)γ2

Compute W = VᵀZ
(pdsyevx.f:855, pdormtr.f:408, pdormqr.f:412, pdlarfb.f:322,398,405)
  Contribution:
    Σ [ (2n′·nb/pr)β + ⌈log2(pr)⌉(m·nb/pc)β + α3 + (2m·nb²/(2pc))γ3
        + (2vnb·n′·nb/pr)γ3 + (2m·n′·nb/p)γ3 ]
  Simplified:
    (n²/pr)β + (n·m/pc)⌈log2(pr)⌉β + (n/nb)α3 + (n·m·nb/pc)γ3 + (n²·vnb/pr)γ3 + (n²·m/p)γ3

Compute W = TW
(pdsyevx.f:855, pdormtr.f:408, pdormqr.f:412, pdlarfb.f:412)
  Contribution:
    Σ [ α3 + (2m·nb²/(2pc))γ3 ]
  Simplified:
    (n/nb)α3 + (n·m·nb/pc)γ3

Compute Z = Z − VW
(pdsyevx.f:855, pdormtr.f:408, pdormqr.f:412, pdlarfb.f:415,425)
  Contribution:
    Σ [ ⌈log2(pr)⌉(m·nb/pc)β + α3 + (2m·nb²/(2pc))γ3 + (2vnb·n′·nb/pr)γ3 + (2m·n′·nb/p)γ3 ]
  Simplified:
    (n·m/pc)⌈log2(pr)⌉β + (n/nb)α3 + (n·m·nb/pc)γ3 + (n²·vnb/pr)γ3 + (n²·m/p)γ3

Total:
  (n²/pr)β + 2(n·m/pc)⌈log2(pr)⌉β + 2n·α2 + (1/2)(n²·nb/pr)γ2 + 3(n/nb)α3
  + 3(n·m·nb/pc)γ3 + 2(n²·vnb/pr)γ3 + 2(n²·m/p)γ3

Standard data layout:
  (n²/√p)β + 2(n·m/√p)⌈log2(√p)⌉β + 2n·α2 + (1/2)(n²·nb/√p)γ2 + 3(n/nb)α3
  + 3(n·m·nb/√p)γ3 + 2(n²·vnb/√p)γ3 + 2(n²·m/p)γ3
Chapter 5
Execution time of the ScaLAPACK
symmetric eigensolver, PDSYEVX, on
efficient data layouts on the
Paragon
The detailed execution time model gives us confidence that we understand the
execution time of PDSYEVX. It explains performance on a wide range of problem sizes, data
layouts, input matrices, computers and user requests. However, the same complexity that
allows the detailed model to explain performance over such a large domain makes it difficult
to grasp, understand and interpret. The simple six-term model shown in this chapter is
designed to explain the performance of the common, efficient case on a well known computer.

PDSYEVX takes 205 seconds to compute the eigendecomposition of a 3840 by 3840
symmetric random matrix on a 64 node Paragon in double precision. Counting only the
(10/3)n³ flops, PDSYEVX achieves 920 Megaflops, which equals 14 Megaflops per node.

For large, well behaved¹ matrices, PDSYEVX is efficient, as detailed in Table 5.1.
For well behaved 3840 × 3840 matrices, PDSYEVX spends 63% = (28+35)% of its time on
necessary computation and only 35% of its time on communication, load imbalance and

¹For PDSYEVX's purpose, a well behaved matrix is one which does not have any large clusters of eigenvalues whose associated eigenvectors must be computed orthogonally.
overhead required for execution in parallel.

Table 5.1: Six term model for PDSYEVX on the Paragon

  Component                                        Model                                  % time (n = 3840, p = 64)
  matrix transformation computation (Section 5.3)  (10/3)(n³/p)γ3   (γ3 = 0.0215)         35
  tridiagonal eigendecomposition
    computation (Section 5.4)                      239(n²/p)                              28
  message initiation (Section 5.5)                 17n·log2(√p)·α   (α = 65.9)            10
  message transmission (Section 5.6)               7(n²/√p)·log2(√p)·β   (β = 0.146)       4
  order n overhead & imbalance (Section 5.7)       2780n                                   7
  order n² overhead & imbalance (Section 5.8)      14.0(n²/√p)                            14

  n    Matrix size
  p    Number of processors
  γ3   Matrix-matrix multiply time (= 0.0215 microseconds/flop)
  α    Message latency time (= 65.9 microseconds/message)
  β    Message throughput time (= 0.146 microseconds/word)
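The six terms can be evaluated directly (constants from Table 5.1); doing so reproduces the relative ordering of the components, with the model total coming out somewhat below the measured 205 seconds:

```python
from math import log2, sqrt

GAMMA3, ALPHA, BETA = 0.0215, 65.9, 0.146   # microseconds per flop / message / word

def six_term_model(n, p):
    """Seconds spent in each component of the six-term PDSYEVX model."""
    sp = sqrt(p)
    terms = {
        "matrix transformations":    (10 / 3) * n**3 / p * GAMMA3,
        "tridiagonal eigensolution": 239 * n**2 / p,
        "message initiation":        17 * n * log2(sp) * ALPHA,
        "message transmission":      7 * n**2 / sp * log2(sp) * BETA,
        "order n overhead":          2780 * n,
        "order n^2 overhead":        14.0 * n**2 / sp,
    }
    return {task: us / 1e6 for task, us in terms.items()}

for task, seconds in six_term_model(3840, 64).items():
    print(f"{task:26s} {seconds:6.1f} s")
```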
Although PDSYEVX is efficient on the PARAGON², Table 5.1 shows us that there is
room for improvement. Ignoring the execution time required for solution of the tridiagonal
eigenproblem for the moment, we note that the matrix transformations reach only about
50% of peak performance (35% vs. 35+10+4+7+14 = 70%) for this problem size (roughly
the largest that will fit on this PARAGON). Furthermore, efficiency will be lower for smaller
problem sizes.

Unfortunately, there is no single culprit that accounts for the inefficiency. Communication
accounts for a bit less than half of the inefficiency, while software overhead
accounts for a bit more than half of the inefficiency.

²Details about the hardware and software used for this timing run are given in table 6.3
One could argue that while n = 3840 on 64 nodes is the largest problem that
PDSYEVX can run on this particular computer, it is still a relatively small problem. However,
there are several reasons not to ignore this result. First, while it is true that newer
machines have more memory, they also have much faster floating point units, steeper memory
hierarchies, and few offer communication to computation ratios as high as the PARAGON.
Furthermore, we should strive to achieve high efficiency across a range of problem sizes,
not just for the largest problems that can fit on the computer. Achieving high efficiency on
small problem sizes means that users can efficiently use more processors and hence reduce
execution time.

In summary, PDSYEVX is a good starting point, but leaves room for improvement.
However, significantly improving performance will require attacking more than one source
of inefficiency.

The fact that PDSYEVX spends 28% of its total time in solving the tridiagonal eigenproblem
is a result of the slow divide on the PARAGON. The PARAGON offers two divides: a fast
divide and a slow divide that meets the IEEE 754 spec [7]. Although ScaLAPACK's bisection
and inverse iteration codes are designed to work with an inaccurate divide, ScaLAPACK
uses the slow correct divide by default.
5.1 Deriving the PDSYEVX execution time on the Intel Paragon (common case)

This six term model is based on the detailed model described in Chapter 4, which
has been validated on a number of distributed memory computers and a wide range of data
layouts and problem sizes.

5.2 Simplifying assumptions allow the full model to be expressed as a six term model

I assume that a reasonably efficient data layout is chosen. I set the data layout
parameters as follows:

nb = 32. The optimal block size on the Paragon is about 10; however, the reduction in
execution time obtained by using nb = 10 rather than nb = 32 is less than 10%, so
we stick to our standard suggested value of nb.

pr = pc = √p. PDSYEVX achieves the best performance³ when pc ≥ pr ≥ pc/4. Assuming that
pr = pc = √p allows the pr and pc terms to be coalesced into a single √p term.

pbf = 2. The panel blocking factor⁴, pbf = max(2, lcm(pr, pc)/pc) in ScaLAPACK version 1.5.

vnb = 0. vnb is the imbalance in the number of rows in the original matrix as distributed
amongst the processors. I assume that the matrix is initially balanced perfectly
amongst all processors, i.e. n is a multiple of pr·nb.

γ2 = γ3. We assume for the simplified model that all flops are performed at the peak flop
rate. This introduces an error equal to (2/3)(n³/p)(γ2 − γ3), which is typically no more
than 2-5% of the total time on the PARAGON.

m = e = n. Assume that a full eigendecomposition is required, i.e. all eigenvalues are
required (e = n) and all eigenvectors are required (m = n).

c = 1. Assume that the input matrix has no clusters of eigenvalues.

In addition, we set all of the machine parameters to constants measured or estimated
on the Intel Paragon, as shown in table 6.3, in order to coalesce the overhead, load
imbalance, and tridiagonal eigendecomposition terms into just three terms.
5.3 Deriving the computation time during matrix transformations in PDSYEVX on the Intel Paragon

Table 5.2 shows that PDSYTRD performs (4/3)(n³/p) + O(n²) flops per process. Of these,
(2/3)(n³/p) + O(n²) are matrix-vector multiply flops and (2/3)(n³/p) + O(n²) are matrix-matrix
multiply flops. PDSYTRD performs the same floating point operations that the LAPACK routine
DSYTRD does. And (4/3)n³ is the textbook [84] number of flops for reduction to tridiagonal form.

³Performance of PDSYEVX is not overly sensitive to the data layout, provided that nb is sufficiently large to allow good DGEMM performance, that the processor grid is reasonably close to square, and that lcm(pr, pc) is not outrageous compared to pc and pr. (The latter factor is only relevant when one is dealing with thousands of processors.) I have not performed a detailed study of when using fewer processors results in lower execution time. However, if you drop processors only when necessary to make pc ≥ pr ≥ pc/16 and lcm(pr, pc) ≤ 10pc, the processor grid chosen will allow performance within 10% of the optimal processor grid.
⁴The matrix-vector multiplies are each performed in panels of size pbf·nb. See Section 4.2.2.
Table 5.2: Computation time in PDSYEVX

  Task                                              Full model                       Six term model
  computation time during reduction to
    tridiagonal form (See section 4.2)              (2/3)(n³/p)γ2 + (2/3)(n³/p)γ3    (4/3)(n³/p)γ3
  computation time during back
    transformation (See table 4.10)                 2(n²·m/p)γ3                      2(n³/p)γ3
  Total                                                                              (10/3)(n³/p)γ3

Table 5.3: Execution time during tridiagonal eigendecomposition

  Task: computation time during tridiagonal eigendecomposition (See section 4.3)
  Full model:    265(n·e/p)γ1 + 45(n·m/p)γ1 + 53(n·e/p)δ + 3(n·m/p)δ + 2n·c²·γ1
  Paragon model: (310·0.074 + 56·3.85 + 0)(n²/p)
  Paragon time (Total): 239(n²/p)
PDORMTR performs 2(n³/p) + O(n²) flops per process. Again this is the same as the
LAPACK routine.

5.4 Deriving the computation time during eigendecomposition of the tridiagonal matrix in PDSYEVX on the Intel Paragon

The computation time during tridiagonal eigendecomposition, in the absence of
clusters of eigenvalues, is O(n²) and hence for large n becomes less important.

The simplified model for the execution time of the tridiagonal eigensolution on the
PARAGON in table 5.3 is obtained from the detailed model by replacing γ1 and δ with their
values on the PARAGON and by assuming that all clusters of eigenvalues are of modest size.

Load imbalance during the tridiagonal eigendecomposition is caused in part by the
fact that not all processes will be assigned the same number of eigenvalues and eigenvectors
and in part by the fact that different eigenvalues and eigenvectors will require slightly
different amounts of computation. Our experience indicates that the load imbalance corresponds
roughly to the cost of finding two eigenvalues and two eigenvectors.
Table 5.4: Message initiations in PDSYEVX

  Task: message initiation during reduction to tridiagonal form (See table 4.9)
  Full model:     (13⌈log2(pr)⌉ + 4⌈log2(pc)⌉)n·α
  Six term model: 17n·log2(√p)·α
  Total:          17n·log2(√p)·α

Table 5.5: Message transmission in PDSYEVX

  Task: message transmission time during reduction to tridiagonal form (See table 4.9)
  Full model:     (3⌈log2(pc)⌉(n²/pr) + 2⌈log2(pr)⌉(n²/pc))β
  Six term model: 5(n²/√p)·log2(√p)·β

  Task: message transmission time during back transformation (See table 4.10)
  Full model:     2⌈log2(pr)⌉(n·m/pc)β
  Six term model: 2(n²/√p)·log2(√p)·β

  Total:          7(n²/√p)·log2(√p)·β
5.5 Deriving the message initiation time in PDSYEVX on the Intel Paragon

Table 5.4 shows that PDSYEVX requires 17n·log2(√p) message initiations.

5.6 Deriving the inverse bandwidth time in PDSYEVX on the Intel Paragon

Table 5.5 shows that PDSYEVX transmits 7(n²/√p)·log2(√p) words per node.

5.7 Deriving the PDSYEVX order n imbalance and overhead term on the Intel Paragon

Table 5.6 shows the origin of the Θ(n) load imbalance cost on the Intel Paragon.
Table 5.6: Θ(n) load imbalance cost on the PARAGON

  Task: load imbalance during eigendecomposition (See section 4.3)
  Full model: 620γ1 + 112δ;   Paragon model: 620·0.0740 + 112·3.85;   Paragon time: 477n

  Task: order n overhead term in reduction to tridiagonal form (See table 4.7)
  Full model: 9α4 + 6α2;   Paragon model: 9·235 + 6·23.5;   Paragon time: 2256n

  Task: order n overhead term in back transformation (See table 4.10)
  Full model: 2α2;   Paragon model: 2·23.5;   Paragon time: 47n

  Total: 2780n

Table 5.7: Order n²/√p load imbalance and overhead term on the PARAGON

  Task: order n²/√p overhead term in reduction to tridiagonal form (See table 4.7)
  Full model:    2(n²/(nb·pbf·pc))α2 + 2(n²/(nb²·pbf·pc))α3 + (n²/pc)γ1
  Paragon model: (2·23.5/(32·2) + 2·103/(32·32·2) + 3.97)(n²/√p)
  Paragon time:  4.70(n²/√p)

  Task: order n²/√p load imbalance term in reduction to tridiagonal form (See table 4.7)
  Full model:    (7/2)(n²·nb/pr)γ2 + (1/2)(n²·nb/pc)γ2 + (n²·nb·pbf/pr)γ2
                 + (1/2)(n²·nb/pr)γ3 + (1/2)(n²·nb/pc)γ3 + (n²·nb·pbf/pr)γ3
  Paragon model: (((7/2)·32 + (1/2)·32 + 32·2)·0.0247 + ((1/2)·32 + (1/2)·32 + 2·32)·0.0215)(n²/√p)
  Paragon time:  6.81(n²/√p)

  Task: order n²/√p load imbalance term in back transformation (See table 4.10)
  Full model:    0.5(n²·nb/pr)γ2 + 3(n·m·nb/pc)γ3 + 2(n²·vnb/pc)γ3
  Paragon model: (0.5·32·0.0247 + 3·32·0.0215 + 2·0.0215·0)(n²/√p)
  Paragon time:  2.46(n²/√p)

  Total: 14.0(n²/√p)
5.8 Deriving the PDSYEVX order n²/√p imbalance and overhead term on the Intel Paragon

The order n²/√p load imbalance and overhead term on the Intel Paragon, 14.0(n²/√p), is
shown in table 5.7.

See section 5.2 for details on the assumptions made to simplify the full model to
the six term model. Note that vnb is assumed to be zero and that pbf is assumed to be 2.
Chapter 6
Performance on distributed memory
computers
6.1 Performance requirements of distributed memory computers for running PDSYEVX efficiently

The most important feature of a parallel computer is its peak flop rate. Indeed,
everything else is measured against the peak flop rate. The second most important feature
is main memory, but which feature of main memory is most important depends on whether
you want peak efficiency (i.e. using as few processors as possible) or minimum execution
time (i.e. using more processors). If you plan to use only as many processors as necessary,
filling each processor's memory completely, then main memory size is the most important
factor controlling efficiency. If you plan to use more processors, main memory random
access time becomes the most important factor.

Network performance of today's distributed memory computers is good enough to
keep communication cost from being the limiting factor on performance. Furthermore, if
the network performance (either latency or bandwidth) were the limiting factor, there are
ways that we could reduce the communication cost by as much as log(√p) [107]. Still, if one
has a network of workstations connected by a single ethernet or FDDI ring, the very low
bisection bandwidth will always keep efficiency low. See section 8.4.2 for details.
6.1.1 Bandwidth rule of thumb

Bandwidth rule of thumb: Bisection bandwidth per processor¹ times the square root
of memory size per processor should exceed floating point performance per processor.

(Megabytes/sec per processor) × √(Megabytes per processor) > (Megaflops/sec per processor)

assures that bandwidth will not limit performance.
The bandwidth rule of thumb shows that if memory size grows as fast as peak
floating point execution rate, the network bisection bandwidth need only grow as the square
root of the peak floating point execution rate. This is very encouraging for the future of
parallel computing. This rule also shows that the bandwidth requirement grows as the
problem size decreases. This rule does not make as wide a claim as the memory rule of
thumb: it does not promise that PDSYEVX will be efficient, only that bandwidth will not be
the limiting factor.

Provided the bandwidth rule of thumb holds, execution time attributable to message
volume will not exceed 40% of the time devoted to floating point execution in PDSYEVX
on problems that nearly fill memory.
6.1.2 Memory size rule of thumb

Memory size rule of thumb: memory size should match floating point performance.

(Megabytes per processor) > (Megaflops/sec per processor)

assures that PDSYEVX will be efficient on large problems.

This rule is sufficient because it holds even if message latency and software overhead
hold constant as peak performance increases and network bisection bandwidth and
BLAS2 performance increase as slowly as the square root of the increase in the peak flop rate.

¹Bisection bandwidth per processor is the total bisection bandwidth of the network divided by the number of processors.
message transmission time / floating point execution time
  = [7.5(n²/√p)⌈log2(√p)⌉β] / [(10/3)(n³/p)γ3]             (Table 5.1)
  = [7.5⌈log2(√p)⌉β] / [(10/3)(n/√p)γ3]                    (cancel n²/√p)
  = [7.5⌈log2(√p)⌉β] / [(10/3)√(M·10⁶/(6·8))γ3]            (n/√p = √(M·10⁶/(6·8)); PDSYEVX uses 6n²/p DP words)
  = [7.5·3·8·10⁻⁶/mbs] / [(10/3)√(M·10⁶/48)·10⁻⁶/mfs]      (β = 8·10⁻⁶/mbs, ⌈log2(√p)⌉ = 3, γ3 = 10⁻⁶/mfs)
  = 0.374·mfs/(√M·mbs)                                     (simplify: 0.374 = 7.5·3·8·√(6·8)/((10/3)·10³))

Figure 6.1: Relative cost of message volume as a function of the ratio between peak floating
point execution rate in Megaflops, mfs, and the product of main memory size in Megabytes,
M, and network bisection bandwidth in Megabytes/sec, mbs.
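The constant 0.374 in Figure 6.1 can be checked numerically:

```python
from math import sqrt

# 7.5 * ceil(log2(sqrt(p))) * 8 * sqrt(6*8) / ((10/3) * 10**3), with
# ceil(log2(sqrt(p))) = 3; the 6*8 converts M megabytes into the 6*n^2/p
# double-precision (8-byte) words that PDSYEVX stores per processor.
c = 7.5 * 3 * 8 * sqrt(6 * 8) / ((10 / 3) * 10**3)
print(round(c, 3))
```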
Message latency and software overhead are limited by main memory access time, which
decreases slowly, but bisection bandwidth and BLAS2 performance (which is limited by main
memory bandwidth) continue to improve, though not as rapidly as peak performance.

When the number of megabytes of main memory equals the peak floating point
rate (in megaflops/sec), message latency will typically account for ten times less execution
time than the time devoted to floating point execution in PDSYEVX on problems that nearly
fill memory. The arithmetic in figure 6.2 justifies this statement provided that message
latency does not exceed 100 microseconds.

The memory rule of thumb is too simple to capture all aspects of any computer;
nonetheless we have found it to be useful. The derivation in figure 6.2 makes two main
assumptions: latency is around 100 microseconds and ⌈log2(√p)⌉ = 3. Seldom will either
be exactly correct, but in our experience neither will tend to be off by more than a factor
of 2 (i.e. p ≤ 4096). The memory rule of thumb also depends on sufficient bandwidth and
on reasonable BLAS2 and software overhead costs. As we will show next, network bandwidth
capacity and BLAS2 performance need not grow rapidly to support this rule, and software
overhead costs need only remain constant.

The memory rule of thumb holds for all computers marketed as distributed memory
message latency time / floating point execution time
  = [17n⌈log2(√p)⌉α] / [(10/3)(n³/p)γ3]                    (Table 5.1)
  = [17⌈log2(√p)⌉α] / [(10/3)(n²/p)γ3]                     (cancel n)
  = [17⌈log2(√p)⌉α] / [(10/3)(M·10⁶/48)γ3]                 (n²/p = M·10⁶/48; PDSYEVX uses 6n²/p DP words)
  = [17·3·100·10⁻⁶] / [(10/3)(M·10⁶/48)(10⁻⁶/mfs)]         (α = 100·10⁻⁶, ⌈log2(√p)⌉ = 3, γ3 = 10⁻⁶/mfs)
  = 0.073·mfs/M                                            (0.073 = 17·3·100·10⁻⁶ / ((10/3)/48))

Figure 6.2: Relative cost of message latency as a function of the ratio between peak floating
point execution rate in Megaflops, mfs, and main memory size in Megabytes, M.
computers, but does not hold for non-scalable or extremely low bandwidth networks. One
could design a distributed memory computer for which this rule does not hold, but the
features that are necessary for this rule to hold are also important for a range of other
applications, and hence we expect this rule to hold for essentially all distributed memory
computers.
The memory rule of thumb, while sufficient, is not necessary. It is possible to achieve
efficiency in PDSYEVX on computers whose memory is smaller than that suggested by this
rule2. In section 6.1.3 I discuss what properties a computer must have to allow efficient
execution on smaller problem sizes.
Though meeting the memory rule of thumb is not necessary to achieve high perfor-
mance, there are reasons to believe that it will be useful for several years. Software latencies
are not decreasing rapidly. Software overhead, since it is tied to main memory latency, is
not decreasing rapidly either. Bisection bandwidth and BLAS2 performance are increasing,
but not as fast as peak floating point performance.
On the other hand, improvements to PDSYEVX will make it possible to achieve high
performance with less memory and may someday obsolete the memory rule of thumb.
2The PARAGON is an example.
6.1.3 Performance requirements for minimum execution time
If you intend to use as many processors as possible to minimize execution time,
the second most important machine characteristic (after peak floating point rate) is main
memory speed. Main memory speed affects three of the four sources of inefficiency in
PDSYEVX: message initiation, load imbalance and software overhead. Message initiation and
software overhead costs are controlled by how long it takes to execute a stream of code
with little data or code locality. Since the communication software initiation code offers
little code or data locality, its execution time is largely dependent on main memory latency.
Load imbalance consists mainly of BLAS2 row and column operations. The BLAS2 flop rate
is controlled by main memory bandwidth. Smaller main memory bandwidth also requires a
larger blocking factor in order to achieve peak floating point performance in matrix-matrix
multiply. Larger blocking factors mean more BLAS2 row and column operations. Hence
reduced main memory speed has a double effect on the cost of row and column operations:
it increases their number while increasing the cost per operation.
Caches can be used to improve memory performance; however, the value of caches
is reduced by several factors. The inner loop in reduction to tridiagonal form, the source
of most of the inefficiency in PDSYEVX, is substantial and includes many subroutine calls.
ScaLAPACK is a layered library which includes the PBLAS, BLAS, BLACS and the underlying
communication software. The inner loop in reduction to tridiagonal form touches every
element in the unreduced (trailing) part of the matrix. The second level cache is typically
shared between code and data. Even the way that BLAS routines are typically coded impacts
the value of caches in PDSYEVX. The fact that the inner loop in reduction to tridiagonal form
includes many subroutine calls, combined with ScaLAPACK's layered approach, means that
this inner loop typically involves many code cache misses. Indeed, even the much simpler
inner loop in LU involves many code cache misses in ScaLAPACK[160]. Since this same inner
loop touches every element in the matrix, the secondary cache, typically shared by both
code and data, will be completely flushed each time through the loop, meaning that code
cache misses will have to be satisfied by main memory.
The way that BLAS routines are typically optimized leads to a high code cache miss
rate. BLAS routines are typically coded and optimized by timing them on a representative
set of requests[92]. Each request, however, is typically run many times and the times are
averaged. Each run may involve different data to ensure that the times represent the cost
of moving the data from main memory. However, no effort is made3 to account for the
cost of moving the code from main memory. Hence, the code cache is a resource to which
no cost is assigned during optimization. Loop unrolling can vastly expand the code cache
requirements, but it can also improve performance, at least if the code is in cache. Hence
it is likely that in optimizing BLAS codes, some loops get unrolled to the point where they
use half or more of the code cache. If two such codes are called in the same loop, code
cache misses are inevitable. The unfortunate aspect of this is that the hardware designer is
powerless to prevent it. Increasing the size of the code cache might lead to even more loop
unrolling and even worse performance.
There are two ways that hardware manufacturers could make caches more useful.
One would be to improve the way that BLAS codes are optimized to ensure that the code
cache is a recognized resource (either by measuring code cache use in each call or by having
the codes optimized on a system with smaller cache sizes than those offered to the public).
The second would be to allow a path from main memory to the register file that bypasses
the cache. In the inner loop of reduction to tridiagonal form, every element of the matrix
is touched, but there is no temporal locality and no point in moving these elements up the
cache hierarchy. If these calls to the BLAS matrix-vector multiply routine, DGEMV, could be
made to bypass the caches, these caches would remain useful in the other portions of the
code: i.e. software overhead and communication latency. Even row and column operations
would benefit. These operations involve data locality across loop iterations; that locality
is currently made worthless by the fact that the loop touches every element in the matrix
each time through, but it could be useful if certain DGEMV calls could be made to bypass the
caches. This would require a coordinated software and hardware effort.
Secondary caches are of little importance in determining PDSYEVX execution time
because the inner loop traverses the entire matrix without any temporal data locality within
the loop. Secondary caches are important to achieving peak matrix-matrix multiply perfor-
mance, but that is their only use in PDSYEVX. In principle, if the secondary cache were
large enough and the problem small enough, the secondary cache could hold the entire
matrix and hence act as fast main memory. Unfortunately, secondary caches are never
large enough to support an efficient problem size.
I would hope that, if there are other applications like PDSYEVX that could make
3It is difficult to account for the cost of moving the code from main memory.
efficient use of smaller, faster memories, some vendor or vendors will build machines
with smaller, faster main memory. I suspect that more applications need large slow mem-
ory than small fast memory. Indeed, PDSYEVX can work well either way. But, especially
with improvements to PDSYEVX that will allow it to achieve high performance on smaller
problem sizes, PDSYEVX could achieve impressive results on a distributed memory machine
with half the main memory now typical of distributed memory parallel computers if that
smaller main memory could be made modestly, say 20%, faster. With the out-of-core sym-
metric eigensolver being developed by Ed D'Azevedo (based on my suggestion to reduce
main memory requirements from 4n² to ½n² by using symmetric packed storage during
the reduction to tridiagonal form and two passes through back transformation), the main
memory requirements of PDSYEVX will drop by a factor of 6 to 12, furthering the argument
for smaller, faster main memory.
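The memory arithmetic behind that suggestion can be sketched as follows (a hypothetical helper of my own; it assumes 8-byte double-precision words and the 4n² and ½n² word counts quoted above):

```python
def workspace_mbytes(n, packed=False):
    # PDSYEVX currently needs about 4*n^2 double-precision words; symmetric
    # packed storage during the reduction drops this to about n^2/2 words.
    words = 0.5 * n**2 if packed else 4 * n**2
    return words * 8 / 1e6  # 8 bytes per DP word

full = workspace_mbytes(4000)                 # 512 MB
packed = workspace_mbytes(4000, packed=True)  # 64 MB
```

The factor of 8 between the two is consistent with the "factor of 6 to 12" quoted above.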
As ScaLAPACK improves, it will be able to achieve high efficiency on smaller problem
sizes. This will mean that the best machines for ScaLAPACK will have less memory than
that suggested by the memory rule of thumb at the top of this chapter.
6.1.4 Gang scheduling
A code which involves frequent synchronizations, such as reduction to tridiagonal
form, requires either dedicated use of the nodes upon which it runs or gang scheduling.
If even one node is not participating in the computation, the computation will stall at the
next synchronization point.
6.2.1 Consistent performance on all nodes
A statically load balanced code, such as PDSYEVX, will execute only as fast as
the slowest node on which it is run. This, like the need for gang scheduling, is obvious.
Yet occasionally nodes which have identical specifications perform differently. Kathy Yelick
noticed that some nodes of the CM5 at Berkeley were slower than others. And I have reason
to believe that at least two of the nodes on the PARAGON at the University of Tennessee at
Knoxville are slower than the others (see Table 6.3).
Table 6.1: Performance characteristics (microseconds; Mflops or Mbytes/s in parentheses).

                                                 IBM SP2       PARAGON
    message latency (α)                          54            66
    transmission cost per word (β)               0.12 (67)     0.14 (57)
    matrix-vector multiply flop rate (γ₂)        .0037 (270)   0.0235 (42)
    BLAS1 flop rate (γ₁)                         .25 (4)       3.8 (.26)
    matrix-vector multiply software overhead (α₂) 5            80
    matrix-matrix multiply flop rate (γ₃)        ??            ??
    divide (δ)                                   ??            ??

The people who design and maintain distributed memory parallel computers should
make sure that slow nodes are identified and marked as such or taken off-line.
6.3 Performance characteristics of distributed memory computers
6.3.1 PDSYEVX execution time (predicted and actual)
Table 6.3 compares predicted and actual performance on the Intel PARAGON. Actual
PDSYEVX performance never exceeds the performance predicted by our model and usually
is within 15% of the predicted performance. Every run whose actual execution time is
more than 15% greater than expected execution time is marked with an asterisk.
I would be satisfied with a performance model that is within 20% to 25%, and would not
expect this performance model to match to within 15% on other machines. I have checked
several of these runs and have noticed that in them one or two processors have noticeably
slower performance on DGEMV than the other processors. I have also rerun many of these
aberrant timings, and for each that I have rerun, at least one of the runs completed within
15% of predicted performance. Nonetheless, this aberrant behavior deserves further study.
                            PARAGON MP                                    IBM SP2
    Processor               50 MHz i860 XP                                120 MHz POWER2 SC
    Location                xps5.ccs.ornl.gov                             chowder.ccs.utk.edu
    Data cache              16 Kbytes, 4-way set-associative,             128 Kbytes
                            write-back, 32-byte lines
    Code cache              16 Kbytes, 4-way set-associative,             32 Kbytes
                            32-byte blocks
    Second level cache      None                                          None
    Processors per node     1                                             1
    Memory per node         32 Mbytes                                     256 Mbytes
    Operating system        Paragon OSF/1 1.0.4 R1.4.5                    AIX
    ScaLAPACK               1.5                                           1.5
    BLAS                    -lkmath                                       -lesslp2
    BLACS                   NX BLACS                                      MPL BLACS
    Communication software  NX                                            MPI
    Precision               Double (64 bits)                              Double (64 bits)

Table 6.2: Hardware and software characteristics of the PARAGON and the IBM SP2.
Table 6.3: Predicted and actual execution times of PDSYEVX on xps5, an Intel PARAGON.
Problem sizes which resulted in execution time more than 15% greater than predicted
are marked with an asterisk. Many of these problem sizes were repeated to show that the
unusually large execution times are aberrant.

    n     nprow  npcol  nb   Actual time (s)   Estimated time (s)   Estimated/Actual
375 2 4 32 8.51 8.24 0.97
375 4 8 32 6.34 4.65 0.73*
750 2 4 32 31.2 30.1 0.96
750 2 4 32 31.3 30.1 0.96
750 2 4 32 31.5 30.1 0.96
750 2 4 32 41.2 30.1 0.73*
750 2 4 32 43.3 30.1 0.7*
750 4 4 32 20.3 18.9 0.93
750 4 6 32 16.5 15.3 0.93
750 4 6 32 22.3 15.3 0.69*
750 4 6 32 23.1 15.3 0.66*
750 4 8 32 14.1 13.2 0.93
1000 2 4 32 55.8 53.8 0.96
1000 2 4 8 52.9 54.4 1
1000 4 2 32 56.5 54.9 0.97
1000 4 2 8 56.2 59.3 1.1
1125 2 4 32 72.2 68.8 0.95
1125 4 8 32 38.2 26.7 0.7*
1500 2 4 32 133 127 0.95
1500 2 4 32 134 127 0.95
1500 2 4 32 134 127 0.95
1500 2 4 32 176 127 0.73*
1500 2 4 32 183 127 0.7*
1500 4 4 32 77.2 72.9 0.94
1500 4 6 32 77 55 0.71*
1500 4 6 32 59.3 55 0.93
1500 4 6 32 80.9 55 0.68*
1500 4 8 32 48.6 45.2 0.93
1875 4 8 32 99.7 70.9 0.71*
2250 4 4 32 186 175 0.94
2250 4 6 32 138 127 0.92
2250 4 6 32 179 127 0.71*
2250 4 6 32 182 127 0.7*
2250 4 8 32 112 102 0.91
2625 4 8 32 203 144 0.71*
3000 4 8 32 214 191 0.89
Chapter 7

Execution time of other dense symmetric eigensolvers
In this chapter, I present models for performance of other symmetric eigensolvers. These
models have not been fully validated, although some have been partly validated.
7.1 Implementations based on reduction to tridiagonal form
7.1.1 PeIGs
PeIGs[74], like PDSYEVX, uses reduction to tridiagonal form, bisection, inverse iteration and
back transformation to perform the parallel eigendecomposition of a dense symmetric matrix.
The execution time of PeIGs differs from that of PDSYEVX for two significant reasons: PeIGs
is coded differently (using a different language and different libraries) than PDSYEVX, and
it uses a different re-orthogonalization strategy. I am more interested in the difference
resulting from the different re-orthogonalization strategy.
In PDSYEVX the number of flops performed by any particular processor, p_i, during
re-orthogonalization is:

    Σ_{C ∈ clusters assigned to p_i} Σ_{i=1}^{size(C)} 4 · n_iter(i) · n · (i − 1)

where n_iter(i) is the number of inverse iterations performed for eigenvalue i (typically 3).
If the size of the largest cluster is greater than n/p, the processor which is responsible for
this cluster will not be responsible for any eigenvalues outside of this cluster.
Hence, if the size of the largest cluster is greater than n/p, the number of flops
performed by the processor to which this cluster is assigned is (on average):

    4 · n_iter · n · ½c² = 6n·c²

where c = max_{C ∈ clusters} size(C), i.e. the number of eigenvalues in the largest cluster,
and n_iter = 3 is the average number of inverse iterations performed for each eigenvalue.
If the largest cluster has fewer than n/p eigenvalues, the number of eigenvalues that
will be assigned to any one processor, and hence the total number of flops it must perform,
is limited. The worst case is where there are p + 1 clusters, each of size n/(p+1). In this case,
one processor must be assigned 2 clusters of size n/(p+1), requiring (on average)
2 · 6n·(n/(p+1))², or roughly 12n³/p² flops.
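The worst case can be checked with a few lines (the function name is mine):

```python
def reorth_flops_worst_case(n, p, n_iter=3):
    # Worst case from the text: p + 1 clusters, each of size n/(p+1); one
    # processor is assigned two of them.  A cluster of size c costs about
    # 4 * n_iter * n * c^2 / 2 = 6*n*c^2 flops, so the loaded processor
    # performs 2 * 6*n*(n/(p+1))^2, roughly 12*n^3/p^2 flops.
    c = n / (p + 1)
    return 2 * 6 * n * c**2
```

For n = 1000 and p = 9 this gives 1.2e8 flops, close to the 12n³/p² approximation.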
In contrast, PeIGs uses multiple processors and simultaneous iteration to maintain
orthogonality among eigenvectors associated with clustered eigenvalues. Traditional inverse
iteration[102] computes one eigenvector at a time, re-orthogonalizing against all previous
eigenvectors associated with eigenvalues in the same cluster after each iteration. PeIGs,
in what they refer to as simultaneous iteration, performs one step of inverse iteration on
all eigenvectors associated with a cluster of eigenvalues and then reorthogonalizes all the
eigenvectors. This allows the re-orthogonalization to be performed efficiently in parallel.
PeIGs is more accurate but slower than PDSYEVX if the input matrix has large
clusters of eigenvalues1. The cost of re-orthogonalization in PeIGs is O(n²c/p) flops versus
O(nc²) flops in PDSYEVX.
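A rough sketch of the simultaneous-iteration idea follows. This is an illustration in the spirit of PeIGs, not the PeIGs implementation: the function name, the dense solve, and the QR-based block re-orthogonalization are my simplifications.

```python
import numpy as np

def simultaneous_inverse_iteration(T, cluster_eigs, iters=3):
    # Iterate every eigenvector of a cluster together, then re-orthogonalize
    # the whole block at once (the block step is what parallelizes well).
    n = T.shape[0]
    k = len(cluster_eigs)
    X = np.random.default_rng(0).standard_normal((n, k))
    for _ in range(iters):
        for j, lam in enumerate(cluster_eigs):
            # one inverse-iteration step per eigenvector (shift slightly off
            # the eigenvalue so the shifted matrix is nonsingular)
            X[:, j] = np.linalg.solve(T - lam * np.eye(n), X[:, j])
        # block re-orthogonalization of all cluster eigenvectors at once
        X, _ = np.linalg.qr(X)
    return X

# tiny example with a cluster of two eigenvalues near 1
T = np.diag([1.0, 1.001, 4.0, 7.0])
X = simultaneous_inverse_iteration(T, [1.0 + 1e-6, 1.001 + 1e-6])
```

Traditional inverse iteration would instead orthogonalize each new vector against all previously computed vectors in the cluster, which serializes the work.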
7.1.2 HJS
Hendrickson, Jessup and Smith[91] wrote a symmetric eigensolver, HJS, for the PARAGON
which is significantly faster than PDSYEVX, but which has never been released and only
works on the Intel PARAGON.
HJS requires that the data layout block size be 1 (i.e. a cyclic distribution), that
the processor grid be square (i.e. pr = pc), and that intermediate matrices be replicated
across processor columns and distributed across processor rows. The requirement that the
processor grid be square limits efficiency when used on a non-square processor grid. They
show that the algorithmic block size need not be tied to the data layout block size. At the
time that PDSYTRD was written, the PBLAS could not efficiently use a cyclic distribution and
did not support matrices replicated in one processor dimension and distributed across the
other.
1PDSYEVX can maintain orthogonality among eigenvectors associated with clusters of up to n/p eigenvalues easily and efficiently.
HJS has several advantages over PDSYEVX. It uses a more efficient transpose oper-
ation, eliminates redundant communication, reduces the number of messages by combining
some, and reduces the number of words transmitted per process by using recursive halving
and doubling. HJS also reduces the load imbalance by a factor of √p by using a cyclic data
layout and using all processors in all calculations2. ScaLAPACK will incorporate several of
these ideas into the next version of PDSYEVX.
HJS notation
HJS also differs in a couple of other rather minor aspects. They compute the norm
of v in a manner which could overflow, and they represent the reflector in a manner which
could likewise overflow. These choices reduce execution time and program complexity slightly.
Their manner of counting the cost of messages in their performance model also
differs from ours. They count the cost of a message swap (sending a message to and simul-
taneously receiving a message from another processor) as equal to the cost of sending a single
message. This reflects reality on the PARAGON and many, but not all, distributed memory ma-
chines. Using their method would not significantly change the model for PDSYEVX because
PDSYEVX does not use message swap operations.
In their paper[91], they use different variable names for the result of each compu-
tation, and show all indices explicitly. Figure 7.1 relates their notation to ours.

Figure 7.1: HJS notation

    HJS   our equivalent    details
    L     tril(A)
    x     w                 w = tril(A) v
    y     wT                wT = tril(A,-1) vT
    p     w                 w = w + transpose(wT)
    c                       not mathematically identical

2PDSYTRD uses only pr processors in many computations
7.1.3 Comparing the execution time of HJS to PDSYEVX
The HJS implementation of parallel blocked Householder tridiagonalization performs essen-
tially the same computation as PDSYEVX. The difference is in the communication, load
balance and overhead costs. However, the operations are not performed in the same order,
and hence the steps don't match exactly. Some of the costs, particularly communication
costs, could easily have been assigned to a different operation than the one that I assigned
them to. Hence, the execution time models for each of the individual tasks should not be
taken in isolation but understood as an aid in understanding the total.
Updating the current column of A (Line 1.1 in Figure 7.2)
As shown in table 4.1, the cost of updating the current column of A in PDSYTRD is:

    2n⌈log2(√p)⌉α + (n/nb)⌈log2(√p)⌉α + 2n·α₂ + (n²·nb/p)γ₂ + 2n·α₄

In Figure 6[91], steps Y2, 10.1, 10.2 and 10.3 of HJS are involved in updating the current
column of A, and the cost of these steps is:

    n·α + ½(n²/√p)β + 2n·α₂ + (n²·nb/p)γ₂
In PDSYEVX, a small part of vT and wT must be broadcast within the current column
of processors. In HJS, there is no need to broadcast vT because it is already replicated across
all processor rows. Instead of broadcasting the piece of wT that is necessary for this update,
HJS transposes all of wT (cost: n·α + ½(n²/√p)β), anticipating the need for this in the rank
2k update.
The number of DGEMV flops performed does not change, but they are distributed
across all of the processors instead of being shared only by one column of processors. In
order to allow these flops to be distributed across all the processors, this update is performed
in a right-looking manner, i.e. the entire block column of the remaining matrix is updated
with the Householder reflector. In PDSYEVX, this update is performed in a left-looking
manner: only the current column is updated (with a matrix-vector multiply). In PDSYEVX,
the right-looking variant does not spread the work any better, and hence the left-looking
variant is preferred because it involves a matrix-vector multiply, DGEMV, rather than a rank-
one update, DGER. Matrix-vector multiply requires only that every matrix element be read.
A rank-one update requires that every matrix element be read and then re-written.
The α₄ term does not exist for HJS because they do not use the PBLAS, avoiding
the error checking and overhead associated with the PBLAS.
Computing the reflector (Line 2.1 in Figure 7.2)
As shown in table 4.2, the cost in PDSYTRD is:

    3n⌈log2(pr)⌉α + n·α₄

In Figure 6[91], steps 2, 3, 4, 5, 6 and X of HJS are involved in computing the reflector,
and the cost of these steps is:

    n⌈log2(p)⌉α

a little less than the cost in PDSYTRD.
Step 1 in HJS is also used in the computation of the reflector; however, step 1 isn't
necessary to compute the reflector, and it is necessary for the matrix-vector multiply,
hence I assign the cost of Step 1 to the matrix-vector multiply.
Both routines perform essentially the same operations. HJS appends the broadcast
of A(J+1, J) to the computation of xnorm (though HJS actually computes xnorm²), which
HJS performs as a sum-to-all. On the other hand, they involve all processors rather than
just one column of processors, hence the sum costs ⌈log2(p)⌉ rather than ⌈log2(pr)⌉.
The difference in performance would appear more dramatic if I included the cost
of the BLAS1 operations in my PDSYEVX model. I do not because they account for an
insignificant O((n²/pr)γ₁) execution time. HJS performs fewer BLAS1 flops (because they do
not go to the extremes that PDSYTRD does to avoid overflow), and the flops that they perform
are distributed over all processors instead of over only one column of processors.
The cost of the matrix-vector multiply (Lines 3.1-3.6 in Figure 7.2)
As shown in table 4.3, the cost of the matrix-vector multiply in PDSYTRD is:

    4n⌈log2(√p)⌉α + 2n·α + 2(n²⌈log2(√p)⌉/√p)β + 1.5(n²/√p)β + 2(n/nb)⌈log2(√p)⌉α
      + (n²/(nb·√p))α₂ + (2/3)(n³/p)γ₂ + 3(n²·nb/√p)γ₂ + (n²/√p)γ₁ + n·α₄

In Figure 6[91], steps 1, Y1, 7.1, 7.2, and 7.3 are involved in the matrix-vector multiply,
and the cost of these steps is:

    2n⌈log2(√p)⌉α + 2n·α + ½(n²/√p)⌈log2(√p)⌉β + (3/2)(n²/√p)β + 2n·α₂ + (2/3)(n³/p)γ₂

The model for HJS is much simpler because 1) the local portion of the matrix-vector multiply
requires just a single call to DGEMV and 2) the load imbalance in HJS is negligible (O(n²/p)
versus O(n²/√p) in PDSYEVX).
The communication performed in HJS during the matrix-vector multiply includes:

                                  Figure 6[91]     Execution time model
    Broadcast v within a row      Step 1           n⌈log2(√p)⌉α + ½(n²⌈log2(√p)⌉/√p)β
    Transpose v and y             Steps Y1, 7.3    2n·α + (n²/√p)β
    Recursive halve w             Step 7.3         n⌈log2(√p)⌉α + ½(n²/√p)β

The transpose operations take advantage of the fact that pr = pc. Each processor
(a, b) simply sends its local portion of the vector to processor (b, a) while receiving the
transpose from that same processor.
The recursive halving operation is a distributed sum in which each of the pc pro-
cessors in the row starts with k values and ends up with k/pc of the sums.
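The recursive halving pattern can be simulated sequentially (a sketch of my own; real implementations exchange messages, here each "processor" is just a list):

```python
def recursive_halving_sum(chunks):
    # Simulated recursive halving among p processors (p a power of two).
    # Each "processor" starts with k values; after log2(p) exchange rounds,
    # processor r holds the r-th k/p-element slice of the element-wise sum,
    # having sent only about k words in total rather than k*log2(p).
    p = len(chunks)
    data = [list(c) for c in chunks]
    group = p
    while group > 1:
        half = group // 2
        for base in range(0, p, group):
            for i in range(half):
                lo, hi = base + i, base + half + i
                m = len(data[lo]) // 2
                low = [a + b for a, b in zip(data[lo][:m], data[hi][:m])]
                high = [a + b for a, b in zip(data[lo][m:], data[hi][m:])]
                # partners exchange halves: lo keeps the low half, hi the high
                data[lo], data[hi] = low, high
        group = half
    return data

pieces = recursive_halving_sum([[10 * r + j for j in range(8)] for r in range(4)])
```

Each round halves the message size, so the total volume per processor is about k(1 - 1/p) words.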
Updating the matrix-vector product (Line 4.1 in Figure 7.2)
As shown in table 4.4, the cost of updating the matrix-vector product in PDSYTRD is:

    6n⌈log2(√p)⌉α + (n²⌈log2(√p)⌉/pr)β + 3(n/nb)⌈log2(√p)⌉α + 4n·α₂ + 2(n²·nb/√p)γ₂ + 4n·α₄

In Figure 6[91], step 7.4 updates the matrix-vector product, and the cost of this step is:

    2n·α₂ + (n²·nb/p)γ₂
Computing the companion update vector, w (Line 5.1 in Figure 7.2)
As shown in table 4.5, the cost of computing the companion update vector in PDSYTRD is:

    2n⌈log2(√p)⌉α + n·α₄

In Figure 6[91], steps 8 and 9 compute the companion update vector, and the cost of these
steps is:

    5n⌈log2(√p)⌉α + (n²/√p)β

Just as in the computation of the reflector, the O(n²) cost of the BLAS1 operations
is insignificant. HJS performs these more efficiently than PDSYEVX because it uses all the
processors in these computations.
Performing the rank 2k update (Line 6.3 in Figure 7.2)
As shown in table 4.6, the cost of the rank 2k update in PDSYTRD is:

    (4n/nb)⌈log2(√p)⌉α + 2(n²/√p)⌈log2(√p)⌉β + (n²/√p)β + (2n/nb)⌈log2(√p)⌉β
      + 4(n²/(nb²·√p))·pbf·α₃ + (2/3)(n³/p)γ₃ + 3(n²·nb/√p)γ₃

In Figure 6[91], step 10.4 performs the rank 2k update, and the cost of this step is:

    2(n²/(nb²·√p))α₃ + (2/3)(n³/p)γ₃

HJS does not require any communication here because W and V are already
replicated across the processor rows, while WT and VT are already replicated across all
the processor columns.
Both HJS and PDSYEVX must perform the rank 2k update as a series of panel
updates using DGEMM. Both PDSYTRD and HJS use a panel width of twice the algorithmic
blocking factor.
Figure 7.2 summarizes the main sources of inefficiency in HJS reduction to tridi-
agonal form.
Table 7.1 compares the execution time of PDSYEVX and HJS reduction to tridiagonal
form. Each row represents a particular operation; the first column is that operation's scale
factor. The second column is the time (in seconds) associated with the given operation in
PDSYEVX. The third column shows the count of the given operation performed in PDSYEVX.
The product of the third column with the first column, after substituting the costs given
in section 5.2 and n = 4000 and p = 64, is the second column. For example, the cost of
matrix-matrix multiply flops in PDSYTRD on the PARAGON is:
2/3 · (n = 4000)³/(p = 64) · (γ₃ = 0.0215·10⁻⁶) = 14.3 seconds. Likewise, the second-to-last
column (the count of the given operation performed in reduction to tridiagonal form in HJS)
times the first column equals the last column (the time associated with the given operation
in reduction to tridiagonal form in HJS.)
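That worked example can be reproduced directly (γ₃ = 0.0215 microseconds per flop is the PARAGON DGEMM cost from section 5.2):

```python
# One entry of Table 7.1: the matrix-matrix multiply time in PDSYTRD.
n, p = 4000, 64
gamma3 = 0.0215e-6               # seconds per flop for DGEMM on the PARAGON
t_dgemm = (2.0 / 3.0) * n**3 / p * gamma3   # about 14.3 seconds
```

The other rows of the table are built the same way: count times scale factor, with the machine parameters substituted in.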
Columns 4 through 10 represent unimplemented intermediate variations on reduction to
tridiagonal form. Column 4, labeled "minus PBLAS inefficiencies", assumes that a
couple of inefficiencies of the PBLAS are removed (a bug in the PBLAS causing unnecessary
communication, and the PBLAS overhead). Column 5, labeled "be less paranoid", assumes
that in addition PDSYTRD computes reflectors in the slightly faster, slightly riskier manner
Figure 7.2: Execution time model for HJS reduction to tridiagonal form. Line numbers
match Figure 4.5 (PDSYEVX execution time).

    do ii = 1, n, nb
      mxi = min(ii + nb, n)
      do i = ii, mxi
        Update current (ith) column of A
          1.1 transpose w                  latency: n·lg(√p)α    bandwidth: ½(n²/√p)β
          1.2 A = A - W·VT - V·WT
        Compute reflector
          2.1 v = house(A)                 latency: 2n·lg(√p)α
        Perform matrix-vector multiply
          3.1 spread v across              latency: n·lg(√p)α    bandwidth: ½(n²·lg(√p)/√p)β
          3.2 transpose v                  bandwidth: ½(n²/√p)β
          3.3 w = tril(A)·v;               computation: (2/3)(n³/p)γ₂
              wT = tril(A,-1)·vT
          3.5 recursive halve w            bandwidth: ½(n²/√p)β
          3.6 w = w + transpose(wT)        bandwidth: ½(n²/√p)β
        Update the matrix-vector product
          4.1 w = w - W·VT·v - V·WT·v      latency: 3n·lg(√p)α   bandwidth: (n²/√p)β
        Compute companion update vector
          5.1 c = wT·v;                    latency: 2n·lg(√p)α
              w = τ·w - (c·τ/2)·v
      end do i = ii, mxi
      Perform rank 2k update
        6.3 A = A - W·VT - V·WT            overhead: 2(n²/(nb²·√p))α₃   computation: (2/3)(n³/p)γ₃
    end do ii = 1, n, nb
that HJS does. Column 6 assumes direct transpose operations. Column 7 assumes that
certain messages are combined, reducing the message latency cost. Column 8 assumes that
sum-to-all is used instead of sum-to-one followed by a broadcast, reducing the latency cost.
Column 9 assumes that V, W, VT and WT are stored replicated across processor columns;
this eliminates all communication in the rank 2k update. Storing the data replicated also
allows all processors to be involved in all computations, but this is not assumed until
column 11. Column 10 assumes a cyclic data layout, eliminating some load imbalance.
Column 11 assumes that all processors are involved in all computations, eliminating the
load imbalance which was not eliminated by using a cyclic data layout.
7.1.4 PDSYEV
PDSYEV uses the QR algorithm to solve the tridiagonal eigenproblem. Each eigenvector is
spread evenly among all the processors. Each processor redundantly computes the rotations
and updates the portion of each eigenvector which it owns. Computing the rotations requires
O(n²) flops, whereas updating the eigenvectors requires O(n³) flops. Hence PDSYEV scales
reasonably well as long as all the eigenvectors are required.
Each rotation requires 2 divides, 1 square root and approximately 20 flops to compute,
and 6 flops to apply.
The cost of the QR-based tridiagonal eigensolution in PDSYEV is:

    Σ_{j=1}^{n} sweeps(j) · (n - j) · (2δ + δ_sqrt + 20γ₁ + (6n/p)γ₁)

(where δ is the cost of a divide and δ_sqrt the cost of a square root). On average, it takes
two sweeps per eigenvalue, so we set sweeps(j) = 2 and simplify:

    2n²δ + n²δ_sqrt + 20n²γ₁ + (6n³/p)γ₁
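The per-rotation costs can be illustrated with a scalar Givens-style sketch (the function names are mine; math.hypot hides the square root, and production QR implementations guard against overflow more carefully):

```python
import math

def compute_rotation(a, b):
    # Form c, s with c*a + s*b = r and -s*a + c*b = 0: one square root
    # (inside hypot) and two divides, plus a handful of other flops,
    # roughly the "20 flops" in the model.
    r = math.hypot(a, b)
    return a / r, b / r

def apply_rotation(c, s, x, y):
    # Applying the rotation to one pair of eigenvector entries costs
    # 4 multiplies and 2 adds: the "6 flops to apply".
    return c * x + s * y, -s * x + c * y

c, s = compute_rotation(3.0, 4.0)
r, z = apply_rotation(c, s, 3.0, 4.0)   # rotates (3, 4) onto (5, 0)
```

Since applying rotations is pure flops with no communication, distributing the eigenvector rows across p processors divides only the apply cost, which is why the (6n³/p)γ₁ term carries the 1/p factor.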
7.2 Other techniques
7.2.1 One dimensional data layouts
One dimensional data layouts can improve the performance of dense linear algebra codes
on modest numbers of processors, especially on one-sided reductions like LU and QR
decomposition. In general, one dimensional data layouts require fewer communication calls
in the inner loop but more words transmitted per process. One-sided reductions typically
require fewer messages within rows than within columns, sometimes by a factor as high as
nb; other times the advantage is a more modest log(√p). One-sided reductions often require
fewer words to be transmitted between columns than between rows of processors, usually
by a factor of nb.
One dimensional data layouts also offer less overhead. Often an entire block column
can be computed by a call to the corresponding LAPACK code rather than the ScaLAPACK
code, saving significant overhead costs.
Table 7.1: Comparison between the cost of HJS reduction to tridiagonal form and PDSYTRD
on n = 4000, p = 64, nb = 32. Values differing from the previous column are shaded (in
the original). Column key: (1) scale factor; (2) PDSYTRD estimated time; (3) PDSYTRD
counts; (4) minus PBLAS inefficiency; (5) be less paranoid; (6) direct transpose; (7) merge
operations; (8) use sum-to-all; (9) store V, W, VT, WT replicated; (10) no data blocking;
(11) all processors compute (i.e. HJS); (12) HJS estimated time.

    (1) scale factor        (2)   (3)  (4)  (5)  (6)  (7)  (8)  (9)  (10) (11)  (12)
    n³/p·γ₃                 14.3  2/3  2/3  2/3  2/3  2/3  2/3  2/3  2/3  2/3   14.3
    n³/p·γ₂                 16.4  2/3  2/3  2/3  2/3  2/3  2/3  2/3  2/3  2/3   16.4
    n²/(√p·nb²)·pbf·α₃      2.1   2    2    2    2    1/2  1/2  1/2  1/2  1/2   0.5
    n·α₂                    0.6   6    6    6    6    5    5    5    4    4     0.4
    n²/(√p·nb)·pbf·α₂       4.7   2    2    2    2    1    1    1    0    0     0.0
    n²/√p·γ₁                4.0   1/2  0    0    0    0    0    0    0    0     0.0
    n²/√p·γ₁                1.7   1/2  0    0    0    0    0    0    0    0     0.0
    n·α₄                    8.9   9.5  0    0    0    0    0    0    0    0     0.0
    n²·nb·pbf/√p·γ₂         0.9   1    1    1    1    1    1    1    0    0     0.0
    n²·nb/√p·γ₂             1.7   4    4    4    4    4    4    4    3    0     0.0
    n²·nb·pbf/√p·γ₃         1.0   1    1    1    1    1    1    1    1    1     1.0
    n²·nb/√p·γ₃             0.5   1    1    1    1    1    1    1    0    0     0.0
    n⌈log2(√p)⌉α            13.5  17   15   14   12   9    6    6    6    9     7.1
    n·α                     0.5   2    2    2    4    4    4    3    3    3     0.8
    (n/nb)⌈log2(√p)⌉α       0.3   4    4    4    2    1    1    0    0    0     0.0
    n²/√p·⌈log2(√p)⌉β       4.4   5    4    4    2    2    2    1.5  1.5  0.5   0.8
    n²/√p·β                 0.6   2    2    2    2    2    2    1.5  1.5  2.5   0.7
    n·nb/√p·β               0.02  8    7    7    5    5    5    2.5  0    0     0

    Total est. time         76         59   58   55   51   49   48   41    42
    Actual time             93                                               61
Both LU decomposition and back transformation would benefit considerably from
one-dimensional data layouts when p is small, although the advantage would be most pro-
nounced in LU. One-sided reductions require O(n) reductions across processor rows but
only O(n/nb) reductions across processor columns. On a high latency system, such as a net-
work of workstations, the performance improvement from using a one-dimensional data
layout could be substantial, since LU requires O(nb) fewer messages on a one-dimensional
data layout.
ScaLAPACK does not take full advantage of one-dimensional data layouts because
it calls the ScaLAPACK code even when the LAPACK code would do the job faster.
Two-sided reductions, such as reduction to tridiagonal form, do not benefit from
one dimensional data layouts. Two-sided reductions require O(n) reductions across pro-
cessor rows and O(n) reductions across processor columns; hence eliminating the reductions
across processor rows (by using a 1D data decomposition) will not substantially reduce the
number of messages in two-sided reductions.
7.2.2 Unblocked reduction to tridiagonal form
Unblocked reduction to tridiagonal form can outperform blocked reduction for small and
modest sized problems, especially if a good compiler is available for the inner kernel. Un-
blocked reduction to tridiagonal form must perform all of its flops as BLAS2 flops, whereas
blocked reduction to tridiagonal form performs half of its flops as BLAS3 flops. However,
unblocked reduction to tridiagonal form requires much less overhead. Blocked reduction
to tridiagonal form requires at least 6n calls to DGEMV; unblocked reduction to tridiagonal
form requires only n calls to DSYMV and n calls to DGER.
If a compiler is available that will efficiently compile the following kernel, unblocked
reduction to tridiagonal form could require only n BLAS2 calls and still attain near peak
performance on large problem sizes, especially for Hermitian eigenproblems3. The kernel
shown below requires only that each element of A be read once and written once, while per-
forming 8 flops. This ratio, 1 memory read and 1 memory write per 8 flops, is one that many
modern computers can handle at near peak speed, even from main memory, in part because
the accesses are essentially all stride 1.
3Complex arithmetic requires only half as much memory traffic per flop
for i = 1, n {
  for j = 1, i {
    A(i,j) = A(i,j) - v(i) * wt(j) - w(i) * vt(j);
    nwt(i) = nwt(i) + A(i,j) * nv(j);
    nw(j)  = nw(j)  + A(i,j) * nvt(i);
  }
}
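As a sanity check, the fused loop above can be compared against an equivalent formulation built from whole-array operations (a rank-2 update of the lower triangle followed by two matrix-vector products). The following Python/NumPy sketch is an illustration only and is not part of the original text:

```python
import numpy as np

def fused_kernel(A, v, wt, w, vt, nv, nvt):
    # Lower-triangular fused update: rank-2 update of A plus the
    # accumulation of two matrix-vector products in a single pass.
    n = A.shape[0]
    nwt, nw = np.zeros(n), np.zeros(n)
    for i in range(n):
        for j in range(i + 1):
            A[i, j] -= v[i] * wt[j] + w[i] * vt[j]
            nwt[i] += A[i, j] * nv[j]
            nw[j] += A[i, j] * nvt[i]
    return nwt, nw

# Reference computation with whole-array operations on the lower triangle.
rng = np.random.default_rng(0)
n = 6
A = np.tril(rng.standard_normal((n, n)))
v, wt, w, vt, nv, nvt = (rng.standard_normal(n) for _ in range(6))
B = A - np.tril(np.outer(v, wt) + np.outer(w, vt))
ref_nwt = B @ nv        # row-wise products with the updated triangle
ref_nw = B.T @ nvt      # column-wise products with the updated triangle
nwt, nw = fused_kernel(A.copy(), v, wt, w, vt, nv, nvt)
```

The point of the fused form is that A is streamed through memory exactly once for all three operations.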
7.2.3 Reduction to banded form
Reducing a dense matrix to banded form can be more efficient than reduction to tridiagonal
form[24, 25, 116]; however, it is not clear that this can be made fast enough to overcome
the added costs to the rest of the code. Reduction to banded form requires less execution
time than reduction to tridiagonal form because it requires fewer messages (O(n/nb) instead
of O(n)) and because asymptotically all of the flops can be performed as BLAS3 flops rather
than half as BLAS2 flops.
An efficient eigensolver based on reduction to banded form could be designed as follows:
1. Reduce to banded form
2. Reduce from banded form to tridiagonal form (do not save rotations)
3. Compute eigenvalues using bisection on tridiagonal form
4. Perform inverse iteration on banded form
5. Back transform the eigenvectors
This would be even simpler if only eigenvalues were required, as that eliminates the inverse
iteration and back transformation steps.
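Bisection on the tridiagonal form rests on the Sturm sequence property: the number of negative pivots in the LDL' factorization of T - xI counts the eigenvalues below x. The following Python sketch is an illustration of the idea only (the routine names are invented, not code from this thesis):

```python
import math

def sturm_count(d, e, x):
    """Number of eigenvalues of the symmetric tridiagonal matrix
    (diagonal d, off-diagonal e) that are strictly less than x."""
    count, q = 0, 1.0
    for i in range(len(d)):
        q = d[i] - x - (e[i - 1] ** 2 / q if i > 0 else 0.0)
        if q == 0.0:              # guard against an exact zero pivot
            q = -1e-300
        if q < 0.0:
            count += 1
    return count

def kth_eigenvalue(d, e, k, tol=1e-12):
    """k-th smallest eigenvalue (k = 1, 2, ...) by bisection."""
    # Gershgorin bounds enclose the whole spectrum.
    r = [abs(e[i - 1]) if i > 0 else 0.0 for i in range(len(d))]
    for i in range(len(d) - 1):
        r[i] += abs(e[i])
    lo = min(di - ri for di, ri in zip(d, r))
    hi = max(di + ri for di, ri in zip(d, r))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sturm_count(d, e, mid) >= k:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# The (-1, 2, -1) tridiagonal has known eigenvalues 2 - 2 cos(k*pi/(n+1)).
n = 8
d, e = [2.0] * n, [-1.0] * (n - 1)
lam1 = kth_eigenvalue(d, e, 1)
exact = 2 - 2 * math.cos(math.pi / (n + 1))
```

Each bisection step costs O(n) flops per eigenvalue, which is why bisection parallelizes trivially across eigenvalues.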
If only a few eigenvectors are required, one could reduce from banded form to
tridiagonal form, saving the rotations. This would allow the eigenvectors to be computed
on the tridiagonal using inverse iteration (or the new Parlett/Dhillon work). Then the
rotations could be applied as necessary and finally the eigenvectors would be transformed
back. This would result in a complex code.
If two-step band reduction to tridiagonal form were performed as above and the
eigenvectors were computed on the tridiagonal matrix, the cost of transforming them back to
the original problem would be at least 4n^3 flops, adding 60% more O(n^3) flops to a full
tridiagonal eigendecomposition. This could be done in two steps, applying first the rotations
accrued during reduction from banded to tridiagonal form and then transforming the
eigenvectors of the banded form back to the original problem. A cleaner, though more costly, solution
would be to form the back transformation matrix after (or during) reduction to banded
form, update it during reduction from banded to tridiagonal form and then use it to transform the
eigenvectors of the tridiagonal back to the original problem.
Using reduction to banded form in an eigensolver requires, at a minimum, that
two-step band reduction to tridiagonal form be faster than direct reduction to tridiagonal
form. If eigenvectors are required, it must be significantly faster in order to overcome the
additional 2n^3 flop cost of back transformation.
So far, no one has demonstrated that two-step reduction to tridiagonal form can
be performed faster than direct reduction on distributed memory computers. Alpatov,
Bischof and van de Geijn's two-step reduction to tridiagonal form[173] is not faster than
PDSYTRD. They assert that it can be optimized, but that is also true of PDSYTRD. So it
is not yet clear whether two-step reduction to tridiagonal form will be significantly faster
than direct reduction to tridiagonal form on any important subset of distributed memory
parallel computers.
I believe that software overhead plays a significant role in limiting the performance
of two-step reduction to banded form.
7.2.4 One-sided reduction to tridiagonal form
Hegland et al.[90] show that one can reduce the Cholesky factor (of a shifted input
matrix) to bidiagonal form updating from only one side. The result, in their implementation,
is a code which requires (10/3)(n^3/p) + n^2·√p flops per processor, n^2·√p words communicated
per processor and n·√p messages per processor.
They argue that this technique, despite requiring 2.5 times as many flops, yields
better performance on their target machine than conventional methods for reduction to
tridiagonal form. They use a 1D processor grid, an unblocked algorithm and a non-scalable
communication and computation pattern, and they ignore symmetry. By ignoring much of the
conventional wisdom they have achieved a simple, high performance code for their target
(vector) machine.
7.2.5 Strassen's matrix multiply
The number of flops in Strassen's matrix-matrix multiply is:

2mnk · (min(m,n,k) / s_{1/2})^(log2(7) − 3)

where s_{1/2} is the break-even point for a particular Strassen implementation, i.e.
the point at which one additional Strassen divide-and-conquer step neither increases nor
decreases execution time. Three factors combine to prevent the use of Strassen's in
reduction to tridiagonal form and back transformation:

1. s_{1/2} is still too large. Lederman et al.[96] have reduced s_{1/2} to the range 100 to 500.

2. k is modest (where k is the block size). We can increase the block size, but only at the
cost of additional load imbalance.

3. n^(log2(7)−3) = n^(−0.193) shrinks slowly. Increasing n by enough to improve the ratio of
"Strassen flops" to standard matrix multiply flops by 50% requires a thousand-fold increase
in the amount of memory required. (32^(−0.193) ≈ 0.5, hence n must increase by a factor of 32
to improve the ratio of Strassen flops to standard matrix multiply flops.) Improving the
ratio of "Strassen flops" to standard matrix multiply flops by increasing the number of
processors involved is even more difficult. Although Chou et al.[43] have shown that 7^k
processors can be used to do the work of 8^k, it takes 7^5 = 16807 processors to get a factor
of two advantage this way. (7^5/8^5 ≈ 0.51)
It is this last point that prevents Strassen's from rescuing ISDA (which is described below).
Because 32^(−0.193) ≈ 0.5, the problem size must be 32·s_{1/2} in order to halve the number of flops
required in ISDA. Halving the number of flops again would require that n be increased by
another factor of 30, increasing memory by another factor of 900 and the total number
of flops, even after the factor of two savings, by (1/2)·30^3 = 13,500. I have not yet seen a
Strassen's matrix-matrix multiply that achieves twice the performance of a regular matrix-
matrix multiply.
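The arithmetic behind these factors is easy to check. The Python sketch below (an illustration, not part of the original text) evaluates the ratio of Strassen flops to standard matrix-multiply flops:

```python
import math

def strassen_ratio(n, s_half=1.0):
    """Ratio of Strassen flops to standard flops for an n x n multiply,
    measured relative to the break-even size s_half."""
    return (n / s_half) ** (math.log2(7) - 3)

# Growing n by 32x roughly halves the flop ratio ...
r32 = strassen_ratio(32)
# ... at a cost of 32^2 = 1024x the memory (the "thousand-fold" above).
mem_growth = 32 ** 2
# Using 7^5 processors in place of 8^5 also gives about a factor of two.
r_proc = (7 / 8) ** 5
```

The exponent log2(7) − 3 ≈ −0.193 is so close to zero that both routes to a factor-of-two saving are prohibitively expensive.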
Table 7.2: Fastest eigendecomposition method

                                            n > 500√p                        n < 500√p
  Random matrices                           Tridiagonal (> 4 times faster)   Tridiagonal
  Spectrally diagonally dominant matrices   Tridiagonal                      Jacobi
7.3 Jacobi
7.3.1 Jacobi versus Tridiagonal eigensolvers
This section is based on models that have only been informally validated. I have compared
my models to those used by Arbenz and Slapnicar[9] and Littlefield and Maschhoff[125] as
well as against the execution times reported in these papers but have not performed any
independent validation. Hence, the opinions that I express in this section should be taken
as conjectures.
Large matrices4 can be solved faster by a tridiagonal based eigensolver than by a
Jacobi eigensolver, but it is likely that Jacobi will outperform tridiagonal based eigensolvers
on small spectrally diagonally dominant matrices5. Since tridiagonal based methods require,
asymptotically, no more than a quarter as many flops as blocked Jacobi methods, even on
spectrally diagonally dominant matrices, I expect that tridiagonal based methods will
win on large matrices, even spectrally diagonally dominant ones, because tridiagonal based
methods can achieve 25% of peak performance on large matrices, as shown in Chapter 5. I
also expect that tridiagonal based eigensolvers will beat Jacobi eigensolvers on random ma-
trices regardless of their size because on random matrices tridiagonal eigensolvers perform
roughly 16 times fewer flops6, and I don't think that Jacobi methods will be 16 times faster
per flop regardless of the input size. Table 7.2 summarizes which eigensolution method I
expect to be faster as a function of these input matrix characteristics.
4On current machines, (n > 500√p) is sufficiently large to allow a tridiagonal eigensolver to outperform Jacobi.
5Spectrally diagonally dominant means that the eigenvector matrix, or a permutation thereof, is diagonally
dominant. Most, but not all, diagonally dominant matrices are spectrally diagonally dominant. For example,
if you take a dense matrix with elements randomly chosen from [−1, 1] and scale the diagonal elements by 1e3,
the resulting diagonally dominant matrix will generally be spectrally diagonally dominant. However, if you
take that same matrix and add 1e3 to each diagonal element, the eigenvector matrix is unchanged even though
the matrix is clearly diagonally dominant.
6Assuming Jacobi converges in an optimistic 8 sweeps.
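The distinction drawn in footnote 5 can be verified numerically: adding a multiple of the identity shifts every eigenvalue but leaves the eigenvector matrix unchanged. The following Python/NumPy sketch (an illustration, not part of the original text) checks this:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
A = rng.uniform(-1, 1, (n, n))
A = (A + A.T) / 2                        # symmetrize

# A + c*I is strongly diagonally dominant, yet it has exactly the same
# eigenvectors as A, so it is no "easier" for a threshold Jacobi method.
c = 1e3
w1, V1 = np.linalg.eigh(A)
w2, V2 = np.linalg.eigh(A + c * np.eye(n))
```

Scaling the diagonal by 1e3, by contrast, genuinely changes the eigenvector matrix, which is why that construction does tend to produce a spectrally diagonally dominant matrix.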
7.3.2 Overview of Jacobi Methods
Despite Jacobi's simplicity there are several possible variants, especially for a par-
allel code, each of which has advantages. In section 7.3.16 I describe the code that I would
write if I were going to write a parallel code. I recommend a 2D data layout if one wishes
to be able to run efficiently on large numbers of processors (say 48 or more). However, a
1D data layout is considerably simpler to implement, and a simpler implementation translates
into less software overhead. On some computers, Jacobi with a 1D data layout might be
efficient for hundreds of processors. I recommend using a one-sided, blocked, non-threshold
Jacobi[9] with a caterpillar track pairing[150] and distinct communication and computation
phases, but other methods cannot be entirely rejected. For a spectrally diagonally dominant
matrix the fastest serial Jacobi algorithm is a threshold Jacobi, hence threshold methods cannot be
ignored. A threshold method would almost certainly have to be two-sided, use a different
pairing strategy and either a non-blocked code or some unconventional blocking strategy.
Non-blocked codes may make sense for small matrices and large numbers of processors
as well as for machines, such as vector architectures, which offer comparable BLAS1 and
BLAS3 performance. Overlapping communication and computation will save time, but my
experience indicates that the savings is limited.
My recommendation is weighted toward small matrices that are modestly spec-
trally diagonally dominant, but not so dominant that certain matrix entries can be com-
pletely ignored. If the input matrix is sparse and so strongly spectrally diagonally dominant
that the matrix never fills in, one would have to consider threshold methods and methods
that don't update parts of the matrix that remain zero. On the other hand, if the matrix
is quite large, performance could be further improved by using a different data layout from
the one that I recommend.
There are many implementation options available to anyone writing a Jacobi code.
I will discuss many of these implementation options in the following sections. Section 7.3.3
explains the basic variants and data layout options. Section 7.3.4 explains the computa-
tion requirements of each of the basic variants. Section 7.3.5 explains the communication
requirements of each of the basic variants. Section 7.3.6 discusses blocking (both commu-
nication and computation). Section 7.3.7 discusses the importance of exploiting symmetry.
Section 7.3.8 explains that one-sided methods need not recompute diagonal blocks of A'A.
Section 7.3.9 discusses options for the partial eigendecomposition required by a blocked
Jacobi method. Section 7.3.10 discusses threshold strategies. Section 7.3.12 discusses pre-
conditioners. Section 7.3.13 discusses overlapping communication and computation.
7.3.3 Jacobi Methods
The matlab code for the classical, two-sided, Jacobi method shown in figure 7.3 differs from
textbook descriptions only in that the rotation is computed by calling parteig and the
off-diagonals are compared to the diagonals (in the threshold test) in an unusual manner.
Figure 7.6 gives inefficient matlab code for parteig which calls matlab's eig() routine and
sorts the eigenvalues to guarantee convergence. In a real implementation, parteig would
be one or two sweeps of two-sided Jacobi.
A two-sided blocked Jacobi matlab code is given in figure 7.4. Because the code in
figure 7.3 uses parteig to compute the rotations and the norm in the threshold test, the only
difference between the blocked and unblocked versions is the definition of I and J. parteig
is typically not a full eigendecomposition; more often it is a single sweep of Jacobi.
The one-sided Jacobi variants can operate on any matrix whose left singular vectors
are the same as, or related to, the eigenvectors of the input matrix. This allows many choices
for preconditioning the input matrix, several of which are discussed in section 7.3.12.
The one-sided Jacobi methods lose symmetry, but still require fewer flops than the
two-sided Jacobi methods because they do not have to update the eigenvectors separately7.
Furthermore, the one-sided Jacobi methods always access the matrix in one direction (by
column for Fortran). A typical one-sided Jacobi method is shown in figure 7.5.
Parallel Jacobi methods require two forms of communication. The columns and/or
rows of the matrix must be exchanged in order to compute the rotations, and the rotations
must be broadcast. The basic communication for one-sided Jacobi is shown in figure 7.7
while the communication pattern for two-sided Jacobi is given in figure 7.8.
7.3.4 Computation costs
The computation and communication costs for the Jacobi method which I recommend for
non-vector distributed memory computers with many nodes, a one-sided blocked Jacobi on
a 2-dimensional (pr × pc) processor grid, are shown in table 7.3. Definitions for all symbols
used here can be found in Appendix A.
7They also avoid applying rotations from both sides, but this advantage is negated by the fact that they
must perform dot products to form the square submatrices to be diagonalized.
Figure 7.3: Matlab code for two-sided cyclic Jacobi
function [Q,D] = jac2(A)
%
% Classical two-sided threshold Jacobi
%
thresh = 1e-15;
maxiter = 25;
n = size(A,2)
iter = 0
mods = 1
Q = eye(n);
while (iter < maxiter & mods > 0 )
mods = 0;
for I = 1:n
for J = 1:I-1
blkA = A([J,I],[J,I]) ;
if ( norm(blkA-diag(diag(blkA))) > ( norm(blkA)*thresh))
mods = mods + 1;
[R,D] = parteig(A([J,I],[J,I]));
A([J,I],:) = R' * A([J,I],:) ;
A(:,[J,I]) = A(:,[J,I]) * R ;
Q(:,[J,I]) = Q(:,[J,I]) * R ;
end % if
end % for J
end % for I
iter = iter + 1
end % while
D = diag(diag(A)) ;
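As a companion to figure 7.3, the following Python/NumPy sketch (an illustration, not part of the original text) implements the classical two-sided cyclic method, using the closed-form 2-by-2 symmetric Schur rotation in place of parteig:

```python
import numpy as np

def jacobi2(A, tol=1e-12, maxiter=25):
    """Two-sided cyclic Jacobi; returns eigenvector matrix Q and eigenvalues."""
    A = A.copy()
    n = A.shape[0]
    Q = np.eye(n)
    for _ in range(maxiter):
        mods = 0
        for i in range(1, n):
            for j in range(i):
                if abs(A[i, j]) > tol * np.hypot(A[i, i], A[j, j]):
                    mods += 1
                    # closed-form 2x2 rotation annihilating A[i, j]
                    tau = (A[i, i] - A[j, j]) / (2 * A[i, j])
                    t = np.sign(tau) / (abs(tau) + np.hypot(1, tau)) if tau != 0 else 1.0
                    c = 1 / np.hypot(1, t)
                    s = t * c
                    R = np.array([[c, s], [-s, c]])
                    idx = [j, i]
                    A[idx, :] = R.T @ A[idx, :]   # rotate from the left
                    A[:, idx] = A[:, idx] @ R     # rotate from the right
                    Q[:, idx] = Q[:, idx] @ R     # accumulate eigenvectors
        if mods == 0:
            break
    return Q, np.diag(A)

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 8))
A = (A + A.T) / 2
Q, w = jacobi2(A)
```

The invariant Q A_k Q' = A_0 holds at every step, so on convergence the diagonal of A_k holds the eigenvalues and Q the eigenvectors.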
Figure 7.4: Matlab code for two-sided blocked Jacobi
function [Q,D] = bjac2( A )
%
% Two sided blocked threshold Jacobi
%
maxiter = 25 ;
thresh = 1e-15;
nb = 1;
n = size(A,2)
iter = 0;
mods = 1;
Q = eye(n);
while (iter < maxiter & mods > 0 )
A = ( A + A' ) / 2; % restore symmetry
mods = 0;
for i = 1:nb:n
maxi = min(i+nb-1,n);
I = i:maxi;
for j = 1:nb:i-1
maxj = min(j+nb-1,n);
J = j:maxj;
blkA = A([J,I],[J,I]) ;
if ( norm(blkA-diag(diag(blkA))) > ( norm(blkA)*sqrt(nb)*thresh))
mods = mods + 1 ;
[R,D] = parteig(A([J,I],[J,I])) ;
A([J,I],:) = R' * A([J,I],:) ;
A(:,[J,I]) = A(:,[J,I]) * R ;
Q(:,[J,I]) = Q(:,[J,I]) * R ;
end % if
end % for j
end % for i
iter = iter + 1
end % while
D = diag(diag(A)) ;
Figure 7.5: Matlab code for one-sided blocked Jacobi
function [ Q, D ] = bjac1( A )
%
% One sided blocked Jacobi
%
thresh = 1e-15 ;
nb = 2 ;
maxiter = 25;
n = size(A,2)
B = A;
iter = 0 ;
mods = 1 ;
while (iter < maxiter & mods > 0)
mods = 0 ;
for i = 1:nb:n
maxi = min(i+nb-1,n);
I = i:maxi;
for j = 1:nb:i-1
maxj = min(j+nb-1,n);
J = j:maxj;
blkA = A(:,[J,I])' * A(:,[J,I]) ;
if (norm(blkA-diag(diag(blkA))) > norm(blkA)*sqrt(nb)*thresh)
mods = mods + 1 ;
[R,D] = parteig(blkA) ;
A(:,[J,I]) = A(:,[J,I]) * R ;
end % if
end % for j
end % for i
iter = iter + 1
end % while
D = A' * A;
Q = A * diag(1./sqrt(diag(D))) ;
D = Q' * B * Q ;
D = diag(diag(D)) ;
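A corresponding Python/NumPy sketch of an unblocked one-sided method follows (again an illustration, not part of the original text; it is restricted here to positive definite inputs so that the left singular vectors coincide with the eigenvectors without sign fixes):

```python
import numpy as np

def jacobi1(A, tol=1e-12, maxiter=25):
    """One-sided Jacobi: orthogonalize the columns of A by plane rotations.
    For symmetric positive definite A this yields its eigensystem."""
    B = A.copy()
    n = B.shape[0]
    for _ in range(maxiter):
        mods = 0
        for i in range(1, n):
            for j in range(i):
                # 2x2 block of B'B, computed from the two columns only
                app = B[:, j] @ B[:, j]
                aqq = B[:, i] @ B[:, i]
                apq = B[:, j] @ B[:, i]
                if abs(apq) > tol * np.sqrt(app * aqq):
                    mods += 1
                    tau = (aqq - app) / (2 * apq)
                    t = np.sign(tau) / (abs(tau) + np.hypot(1, tau)) if tau != 0 else 1.0
                    c = 1 / np.hypot(1, t)
                    s = t * c
                    bj, bi = B[:, j].copy(), B[:, i].copy()
                    B[:, j] = c * bj - s * bi     # apply the rotation from
                    B[:, i] = s * bj + c * bi     # the right only
        if mods == 0:
            break
    norms = np.sqrt(np.sum(B * B, axis=0))
    Q = B / norms                      # normalized columns = eigenvectors
    w = np.diag(Q.T @ A @ Q)           # Rayleigh quotients = eigenvalues
    return Q, w

rng = np.random.default_rng(3)
M = rng.standard_normal((8, 8))
A = M.T @ M + np.eye(8)                # symmetric positive definite test matrix
Q, w = jacobi1(A)
```

Note that each rotation touches only two columns, which is what makes the one-sided form access the matrix in a single direction.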
Figure 7.6: Matlab code for an ine�cient partial eigendecomposition routine
%
% parteig - eigendecomposition with eigenvalues sorted
%
function [ Q, D ] = parteig( A )
[QQ,DD ] = eig(A) ;
[tmp,Index] = sort(- diag(DD));
D = DD(Index,Index) ;
Q = QQ(:,Index) ;
Table 7.3: Performance model for my recommended Jacobi method

Columns: (1) cost per parallel pairing; (2) cost per sweep, i.e. (n/nb)^2/(2pc) parallel
pairings; (3) cost for the recommended data layout (nb = n/(2pc), pc = 16pr = 4√p).

Move column for this pairing(a):
  (1) 2(2·nb·pc/n)(α + (n·nb/pr)β)
  (2) (n/nb)α + (n^2/pr)β
  (3) 8√p·α + 4(n^2/√p)β

diag = A([I,J],:)' * A([I,J],:)(b):
  (1) α_3 + (2n·nb^2/pr)γ_3
  (2) ((n/nb)^2/(2pc))α_3 + (n^3/p)γ_3
  (3) 8√p·α_3 + (n^3/p)γ_3

Sum diag within each processor column:
  (1) lg(pr)(2·nb·pc/n)α + lg(pr)·nb^2·β
  (2) (n/nb)lg(pr)α + (n^2/(2pc))lg(pr)β
  (3) 4√p(lg(p)−4)α + (n^2/(16√p))(lg(p)−4)β

[Q,D] = parteig(diag)(c):
  (1) 2nb^2(2γ_div + γ_sqrt) + 6(2nb)^3·γ_1
  (2) 2(n^2/pc)γ_div + (n^2/pc)γ_sqrt + 24(n^2·nb/pc)γ_1
  (3) (1/2)(n^2/√p)γ_div + (1/4)(n^2/√p)γ_sqrt + (3/4)(n^3/p)γ_1 (see note d)

Broadcast Q within each processor column:
  (1) lg(pr)(2·nb·pc/n)α + lg(pr)·nb^2·β
  (2) (n/nb)lg(pr)α + (n^2/(2pc))lg(pr)β
  (3) 4√p(lg(p)−4)α + (n^2/(16√p))(lg(p)−4)β

A = Q*A:
  (1) α_3 + 2(n/pr)(2nb)^2·γ_3
  (2) ((n/nb)^2/(2pc))α_3 + 4(n^3/p)γ_3
  (3) 8√p·α_3 + 4(n^3/p)γ_3

Total:
  (2) (n/nb)α + 2(n/nb)lg(pr)α + (n^2/pr)β + (n^2/pc)lg(pr)β + 2(n^2/pc)γ_div
      + (n^2/pc)γ_sqrt + 24(n^2·nb/pc)γ_1 + ((n/nb)^2/pc)α_3 + 5(n^3/p)γ_3
  (3) 8√p(lg(p)−3)α + (7/2)(n^2/√p)β + (n^2/(8√p))lg(p)β + (1/2)(n^2/√p)γ_div
      + (1/4)(n^2/√p)γ_sqrt + (3/4)(n^3/p)γ_1 + 8√p·α_3 + 5(n^3/p)γ_3

aMy models assume that sends and receives do not overlap, hence the factor of 2. The factor of
(2·nb·pc/n) represents the number of parallel pairings that can be performed on the data local to
one processor column.
bOnly A(I,:)' * A(J,:) need be computed. See section 7.3.8.
cPartial eigendecomposition of the (2nb) × (2nb) matrix performed with one pass of an unblocked two-
sided Jacobi method exploiting symmetry; see the column labeled "exploit symmetry" in table 7.6.
dWith nb = n/(2pc) and pc = 4√p: 24(n^2·nb/pc) = 24·n^3/(2pc^2) = (24/32)(n^3/p) = (3/4)(n^3/p).
Figure 7.7: Pseudo code for one-sided parallel Jacobi with a 2D data layout, with communication highlighted
Until convergence do:
Foreach pairing do:
Move column data (A) to adjacent columns of processors
Compute ATA locally (i.e. blkA = A(:,[I,J])' * A(:,[I,J]))
Combine ATA within each column of processors
Partial eigendecomposition of diagonal block (i.e. [R;D] = eig(ATA))
Broadcast R within each row of processors
Compute A R locally
End Foreach
End Until
Table 7.4 shows the estimated execution time for one sweep of my recommended
Jacobi on a matrix of size 1000 by 1000 on a 64-node PARAGON. As this model has not
been validated, these estimates must be viewed with caution. Actual performance will be
different, but the model gives some idea of how important the various aspects may be.
This model is given in matlab form in section B.2.1. Table 7.4 suggests that Jacobi is
indeed efficient (1.68/2.69 = 62%) even on such small problems. It also suggests that the
optimal data layout may be even taller and thinner than my recommended data layout:
pc = 32, pr = 2. A taller and thinner layout (specifically pc = 64, pr = 1) would double the
cost of message transmission between columns but would decrease the cost of the partial
eigensolver. The cost of the divides and square roots in the partial eigensolver would
decrease by a factor of 64/32 = 2 because all 64 processors would participate in the partial
eigensolver. And the cost of accumulating the rotations within the partial eigensolver would
decrease by 2 × 2 = 4. The first factor of 2 stems from the fact that all processors would share
in the work, while the second factor of 2 stems from the fact that the block size would be
smaller by a factor of 2 and the cost of accumulating rotations grows as O(n^2·nb).
Table 7.5 gives computation cost models for 6 one-sided Jacobi variants. These models are
not complete (they overlook many overhead and load imbalance costs), nor have they been
validated. This table is designed mainly to put the various variants in perspective and not
Table 7.4: Estimated execution time per sweep for my recommended Jacobi on the PARAGON,
n = 1000, p = 64

Task                         Performance model        Operation cost(a)    Estimated time (s)
Message latency              8√p(lg(p)−3)α            α = 65.9e−6          0.01
Message transmission
  between columns            (7/2)(n^2/√p)β           β = .146e−6          0.06
Message transmission
  within columns             (1/8)(n^2/√p)lg(p)β      β = .146e−6          0.01
Computing rotations          (1/2)(n^2/√p)γ_div       γ_div = 3.85e−6      0.24
Computing rotations          (1/4)(n^2/√p)γ_sqrt      γ_sqrt = 7.7e−6      0.24
Accumulating rotations
  in partial eigensolver     (3/8)(n^3/p)γ_1          γ_1 = .074e−6        0.43
Software overhead            8√p·α_3                  α_3 = 103e−6         0.01
A = Q*A                      5(n^3/p)γ_3              γ_3 = .0215e−6       1.68
Total (per sweep)                                                          2.68

aSee section 6.1.
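The per-term arithmetic of table 7.4 can be reproduced directly. The Python sketch below (a transcription for illustration; section B.2.1 gives the matlab version) evaluates each term for n = 1000, p = 64:

```python
import math

# Machine parameters for the PARAGON (from table 7.4), in seconds.
alpha, beta = 65.9e-6, 0.146e-6               # message latency, per-word transfer
g_div, g_sqrt = 3.85e-6, 7.7e-6               # divide, square root
g1, g3, alpha3 = 0.074e-6, 0.0215e-6, 103e-6  # BLAS1 flop, BLAS3 flop, BLAS3 call

def sweep_time(n, p):
    """Estimated seconds per sweep of the recommended one-sided blocked Jacobi."""
    sp = math.sqrt(p)
    terms = {
        "latency":      8 * sp * (math.log2(p) - 3) * alpha,
        "between cols": 3.5 * n**2 / sp * beta,
        "within cols":  0.125 * n**2 / sp * math.log2(p) * beta,
        "divides":      0.5 * n**2 / sp * g_div,
        "square roots": 0.25 * n**2 / sp * g_sqrt,
        "accumulate":   0.375 * n**3 / p * g1,
        "overhead":     8 * sp * alpha3,
        "A = Q*A":      5 * n**3 / p * g3,
    }
    return terms, sum(terms.values())

terms, total = sweep_time(1000, 64)
```

The matrix-multiply term dominates at about 1.68 s of a roughly 2.7 s sweep, which is the 62% efficiency figure quoted in the text.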
Figure 7.8: Pseudo code for two-sided parallel Jacobi with a 2D data layout, as described by Schreiber[150], with communication highlighted
Until convergence do:
Foreach pairing do:
Move row and column data (A) to diagonally adjacent processors
Compute partial eigendecomposition of diagonal block
Broadcast R within each row of processors
Broadcast R' within each column of processors
Compute R A R' locally
Compute Q R locally
End Foreach
End Until
to establish which is best. Communication costs are considered in section 7.3.5.
I have attempted to list the variants that have been implemented as well as the
most promising suggestions. For each variant I have, where appropriate, followed my rec-
ommendations for implementing a Jacobi code made in section 7.3.16.
Table 7.6 gives performance models for 5 commonly mentioned two-sided Jacobi variants.
Like the performance models for one-sided Jacobi variants, these models are incomplete and
have not been validated.
7.3.5 Communication costs
Table 7.7 summarizes the communication costs for parallel Jacobi methods. I assume that
the communication block size is chosen to be as large as possible.
A performance model for Jacobi could be created by selecting the appropriate
computation costs from table 7.5 or table 7.6 and the appropriate communication cost from
table 7.7. Not all load imbalance and overhead costs are covered in either of these tables,
and the models have not been validated.
122
Table 7.5: Performance models (flop counts) for one-sided Jacobi variants

Unblocked variants:

Littlefield/Maschhoff(b):
  A'A:                          (3/2)n^2·α_1 + 3n^3·γ_1
  parteig(A'A), one sweep(h):   (1/2)n^2(γ_div + γ_sqrt)
  A*Q:                          2n^2·α_1 + 3n^3·γ_1
  V*Q:                          2n^2·α_1 + 3n^3·γ_1
  Total (per sweep):            (7/2)n^2·α_1 + (1/2)n^2(γ_div + γ_sqrt) + 9n^3·γ_1
  With nb = n/(2pc), pc = 16pr(i):
                                (7/8)(n^2/√p)α_1 + (1/8)(n^2/√p)(γ_div + γ_sqrt) + 9(n^3/p)γ_1

Exploit symmetry(c):
  A'A:                          (3/2)n^2·α_1 + 3n^3·γ_1
  parteig(A'A):                 (1/2)n^2(γ_div + γ_sqrt)
  A*Q:                          2n^2·α_1 + 3n^3·γ_1
  V*Q:                          0
  Total (per sweep):            (7/2)n^2·α_1 + (1/2)n^2(γ_div + γ_sqrt) + 6n^3·γ_1
  With recommended layout:      (7/8)(n^2/√p)α_1 + (1/8)(n^2/√p)(γ_div + γ_sqrt) + 6(n^3/p)γ_1

Store diagonals(d):
  A'A:                          (1/2)n^2·α_1 + n^3·γ_1
  parteig(A'A):                 (1/2)n^2(γ_div + γ_sqrt)
  A*Q:                          2n^2·α_1 + 3n^3·γ_1
  V*Q:                          0
  Total (per sweep):            (5/2)n^2·α_1 + (1/2)n^2(γ_div + γ_sqrt) + 4n^3·γ_1
  With recommended layout:      (5/8)(n^2/√p)α_1 + (1/8)(n^2/√p)(γ_div + γ_sqrt) + 4(n^3/p)γ_1

Fast givens(e):
  A'A:                          (1/2)n^2·α_1 + n^3·γ_1
  parteig(A'A):                 (1/2)n^2(γ_div + γ_sqrt)
  A*Q:                          n^2·α_1 + 2n^3·γ_1
  V*Q:                          0
  Total (per sweep):            (3/2)n^2·α_1 + (1/2)n^2(γ_div + γ_sqrt) + 3n^3·γ_1
  With recommended layout:      (3/8)(n^2/√p)α_1 + (1/8)(n^2/√p)(γ_div + γ_sqrt) + 3(n^3/p)γ_1

Blocked variants(a):

Exploit symmetry(f):
  A'A:                          2pc^2·α_3 + 2n^3·γ_3
  parteig, one sweep(h):        8n^2·α_1 + n^2(γ_div + γ_sqrt) + 24n^2·nb·γ_1
  A*Q:                          2pc^2·α_3 + 4n^3·γ_3
  V*Q:                          0
  Total (per sweep):            4pc^2·α_3 + 8n^2·α_1 + n^2(γ_div + γ_sqrt) + 24n^2·nb·γ_1 + 6n^3·γ_3
  With recommended layout:      64p·α_3 + 2(n^2/√p)α_1 + (1/4)(n^2/√p)(γ_div + γ_sqrt)
                                + (3/8)(n^3/p)γ_1 + 6(n^3/p)γ_3

Store diagonals(g):
  A'A:                          2pc^2·α_3 + n^3·γ_3
  parteig, one sweep(h):        8n^2·α_1 + n^2(γ_div + γ_sqrt) + 24n^2·nb·γ_1
  A*Q:                          2pc^2·α_3 + 4n^3·γ_3
  V*Q:                          0
  Total (per sweep):            4pc^2·α_3 + 8n^2·α_1 + n^2(γ_div + γ_sqrt) + 24n^2·nb·γ_1 + 5n^3·γ_3
  With recommended layout:      64p·α_3 + 2(n^2/√p)α_1 + (1/4)(n^2/√p)(γ_div + γ_sqrt)
                                + (3/8)(n^3/p)γ_1 + 5(n^3/p)γ_3

aFor parallel codes we assume that the block size is chosen to be as large as possible, i.e. nb = n/(2pc)
where pc is the number of processor columns. For a serial code pc = n/(2·nb) can be chosen arbitrarily.
bThis is the one-sided method used by Littlefield and Maschhoff[125].
cThis is the method shown in figure 7.5.
dThis is the method used by Arbenz and Oettli[10].
eUsing fast Givens is often mentioned, but rarely implemented. Perhaps the benefit is not as good as this
model would suggest.
fThis is the method shown in figure 7.4.
gThis is the method used by Arbenz and Slapnicar[9].
hOne sweep of Jacobi on a matrix of size 2nb by 2nb.
iI also assume that only one processor in each processor column is involved in each partial eigendecom-
position.
Table 7.6: Performance models (flop counts) for two-sided Jacobi variants

Unblocked variants:

Ignore symmetry(b):
  parteig(A([I,J],[I,J])), one sweep(e):  (1/2)n^2(γ_div + γ_sqrt)
  Q'*A*Q (rotate from both sides):        4n^2·α_1 + 6n^3·γ_1
  Q*Z (update eigenvectors):              2n^2·α_1 + 3n^3·γ_1
  Total (per sweep):                      (1/2)n^2(γ_div + γ_sqrt) + 6n^2·α_1 + 9n^3·γ_1
  With nb = n/(2pc), pc = 16pr(f):        (1/8)(n^2/√p)(γ_div + γ_sqrt) + (3/2)(n^2/√p)α_1 + 9(n^3/p)γ_1

Exploit symmetry:
  parteig, one sweep(e):                  (1/2)n^2(γ_div + γ_sqrt)
  Q'*A*Q:                                 2n^2·α_1 + 3n^3·γ_1
  Q*Z:                                    2n^2·α_1 + 3n^3·γ_1
  Total (per sweep):                      (1/2)n^2(γ_div + γ_sqrt) + 4n^2·α_1 + 6n^3·γ_1
  With recommended layout:                (1/8)(n^2/√p)(γ_div + γ_sqrt) + (n^2/√p)α_1 + 6(n^3/p)γ_1

Fast givens(c):
  parteig, one sweep(e):                  (1/2)n^2(γ_div + γ_sqrt)
  Q'*A*Q:                                 n^2·α_1 + 2n^3·γ_1
  Q*Z:                                    n^2·α_1 + 2n^3·γ_1
  Total (per sweep):                      (1/2)n^2(γ_div + γ_sqrt) + 2n^2·α_1 + 4n^3·γ_1
  With recommended layout:                (1/8)(n^2/√p)(γ_div + γ_sqrt) + (1/2)(n^2/√p)α_1 + 4(n^3/p)γ_1

Blocked variants(a):

Ignore symmetry(d):
  parteig, one sweep(e):                  8n^2·α_1 + n^2(γ_div + γ_sqrt) + 24n^2·nb·γ_1
  Q'*A*Q:                                 4pc^2·α_3 + 8n^3·γ_3
  Q*Z:                                    2pc^2·α_3 + 4n^3·γ_3
  Total (per sweep):                      6pc^2·α_3 + 8n^2·α_1 + n^2(γ_div + γ_sqrt) + 24n^2·nb·γ_1 + 12n^3·γ_3
  With recommended layout:                96p·α_3 + 2(n^2/√p)α_1 + (1/4)(n^2/√p)(γ_div + γ_sqrt)
                                          + (3/8)(n^3/p)γ_1 + 12(n^3/p)γ_3

Exploit symmetry:
  parteig, one sweep(e):                  8n^2·α_1 + n^2(γ_div + γ_sqrt) + 24n^2·nb·γ_1
  Q'*A*Q:                                 4pc^2·α_3 + 4n^3·γ_3
  Q*Z:                                    2pc^2·α_3 + 4n^3·γ_3
  Total (per sweep):                      6pc^2·α_3 + 8n^2·α_1 + n^2(γ_div + γ_sqrt) + 24n^2·nb·γ_1 + 8n^3·γ_3
  With recommended layout:                96p·α_3 + 2(n^2/√p)α_1 + (1/4)(n^2/√p)(γ_div + γ_sqrt)
                                          + (3/8)(n^3/p)γ_1 + 8(n^3/p)γ_3

aFor parallel codes we assume that the block size is chosen to be as large as possible, i.e. nb = n/(2pc)
where pc is the number of processor columns. For a serial code pc = n/(2·nb) can be chosen arbitrarily.
bThis is the method used by Pourzandi and Tourancheau[142], by Schreiber[150], and the method described
in figure 7.3.
cUsing fast Givens is often mentioned, but rarely implemented. Perhaps the benefit is not as good as this
model would suggest.
dThis is the method shown in figure 7.4.
eOne sweep of Jacobi on a matrix of size 2nb by 2nb.
fI also assume that only one processor in each processor column is involved in each partial eigendecom-
position.
Table 7.7: Communication cost for Jacobi methods (per sweep)

One-sided, 1-D data layout(a):
  Exchange column vectors:   4p·α + 2n^2·β
  Reduce A'A:                0
  Broadcast rotations(f):    0

One-sided, 2-D data layout(b):
  Exchange column vectors:   4pc·α + 2(n^2/pr)β
  Reduce A'A:                2pc·lg(pr)α + (n^2/(2pc))lg(pr)β
  Broadcast rotations(f):    2pc·lg(pr)α + (n^2/(2pc))lg(pr)β

Two-sided, 1-D data layout(c):
  Exchange column vectors:   4p·α + 2n^2·β
  Reduce A'A:                0
  Broadcast rotations(f):    2p·lg(p)α + (n^2/p)lg(p)β

Two-sided, 2-D data layout(d):
  Exchange row and column vectors:  6pc·log(√p)α + (3/2)(n^2/pr)log(√p)β
  Reduce A'A:                0
  Broadcast rotations(f):    4pc·log(pr)α + 4pc·log(pc)α + (n^2/(2pr))log(pc)β + (n^2/(2pc))log(pr)β

Two-sided, 2-D data layout(e):
  Exchange row and column vectors:  12√p·α + 3(n^2/√p)β
  Reduce A'A:                0
  Broadcast rotations(f):    8√p·α + 2(n^2/√p)β

aThis is the method used by Arbenz and Slapnicar[9].
bThis is the method used by Littlefield and Maschhoff[125].
cThis is the method used by Pourzandi and Tourancheau[142].
dThis is the 2D method most likely to be used today.
eThis is the method used by Schreiber[150].
fOn the unblocked methods we assume that communication is blocked even though the computation is
not. We also assume that each rotation is sent as a single floating point number. This is natural if you are
using fast Givens but requires extra divides and square roots if fast Givens are not used.
7.3.6 Blocking
Classical Jacobi methods annihilate individual entries whereas blocked Jacobi
methods use a partial eigendecomposition on blocks. Cyclic Jacobi methods use fewer
flops, especially if fast Givens rotations are used. But almost all of the floating point oper-
ations in blocked Jacobi methods are performed in matrix-matrix multiply operations, the
most efficient operation.
Both cyclic and blocked Jacobi methods can be blocked for communication. The
communication block size need only be an integer multiple of the computation block size.
Blocking for communication may be more important than blocking for computation because
it reduces the number of messages by a factor equal to the communication block size.
Blocking allows greater possibilities for the partial eigendecomposition. A better
partial eigendecomposition will lead to faster convergence. For example, performing two
Jacobi sweeps in the partial eigendecomposition would result in fewer sweeps through the
entire matrix. However, initial experiments indicate that on random matrices the best that
one can hope for is a reduction of lg(nb) in the number of full sweeps even if one uses a
complete eigendecomposition as the "partial eigendecomposition".
Using a block size that is smaller than the maximum allowed (i.e. nb < n/(2pc))
offers various possibilities. It allows communication to be pipelined to some extent. Alter-
natively, it allows more than pc processors to be involved in computing the partial eigende-
compositions.
The per sweep cost of the partial eigensolutions grows as the square of the block
size because larger block sizes mean that fewer processors are involved in the partial eigen-
decomposition8.
I recommend keeping the code simple by keeping the communication and compu-
tation block sizes equal and setting nb = n/(2pc) so that each parallel pairing involves one
partial eigendecomposition per processor column. Using a rectangular process grid such that
(16pr ≤ pc ≤ 32pr) requires a lower nb and hence allows the code to keep communication
and computation block sizes equal while holding the cost of the partial eigendecomposition
to between (3/8)(n^3/p)γ_1 and (3/4)(n^3/p)γ_1. On most machines this will be no more than half
the 5(n^3/p)γ_3 cost, in part because the partial eigendecomposition will fit in the highest
level data cache.
A larger computational block size increases the cost of partial eigendecomposition
and decreases the cost of the BLAS3 operations. A larger communication block size decreases
message latency cost but leaves less opportunity for overlapping communication with com-
putation. A larger ratio of pc to pr increases message latency but reduces the partial
eigendecomposition cost9. See section 7.3.9 for details on the partial eigendecomposition
cost.
7.3.7 Symmetry
Exploiting symmetry in two-sided Jacobi methods is important because it reduces
the number of flops per sweep from 12n^3 to 8n^3. However, exploiting symmetry while
maintaining load balance is difficult. If, in a blocked Jacobi method, the block size were set
to the largest value possible, i.e. n/(2pc), and a standard rectangular grid of processors were
used, half of the processors (either those above or below the diagonal) would be idle all
the time. Using a smaller block size would allow better load balance but gives up some
of the benefits of blocking. Alternatives, such as using a different processor layout for the
eigenvector update, are feasible, but their complexity makes them unattractive.
8This does not hold for nb < n/(2p).
9Assuming only one processor per processor column is involved in computing partial eigendecompositions.
In one-sided Jacobi methods, A'A is symmetric and only half of it need be com-
puted. In fact, only a quarter of it must be computed, as shown in the following section.
7.3.8 Storing diagonal blocks in one-sided Jacobi
One-sided Jacobi methods must compute diagonal blocks of A'A. This is shown
in the matlab code given in figure 7.5 as: blkA = A(:,[J,I])' * A(:,[J,I]). This is inefficient
because not only does it compute both halves of a symmetric (or Hermitian) matrix, but
A(:,I)'*A(:,I) and A(:,J)'*A(:,J) are already known. They are the diagonal blocks returned
by parteig on the most recent previous pairing which involved I and J respectively. Storing
these blocks for future use avoids the need to recompute them, although they may need to
be refreshed from time to time for accuracy reasons.
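The bookkeeping can be sketched as follows in Python/NumPy (an illustration only; the helper name and cache layout are invented): cache the diagonal Gram blocks and assemble blkA from the cache plus one fresh off-diagonal product.

```python
import numpy as np

def gram_block(A, I, J, cache):
    """Assemble blkA = A[:, I+J]' A[:, I+J], recomputing only the
    off-diagonal block; diagonal blocks come from a per-block cache."""
    for K in (I, J):
        key = tuple(K)
        if key not in cache:                 # refresh on first use
            cache[key] = A[:, K].T @ A[:, K]
    off = A[:, I].T @ A[:, J]                # the only new product
    top = np.hstack([cache[tuple(I)], off])
    bot = np.hstack([off.T, cache[tuple(J)]])
    return np.vstack([top, bot])

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 4))
I, J = [0, 1], [2, 3]
blk = gram_block(A, I, J, {})
```

In a real code the cached entries would be overwritten with the diagonal blocks returned by parteig after each pairing, and refreshed occasionally for accuracy.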
7.3.9 Partial Eigensolver
My performance models suggest that execution time is likely to be minimized
when the partial eigendecomposition consists of either one or two sweeps of Jacobi. The
per sweep cost of the partial eigensolver grows as O(n^2·nb/√p). In my recommended Jacobi
method, the partial eigensolver consists of one sweep of Jacobi, and based on the data
layout which I recommend, it costs (3/8)(n^3/p)γ_1 + O(n^2/√p), or roughly 10% to 30% of the total
cost of the sweep. Preliminary experiments indicate that with a block size of 32, using a
full eigendecomposition instead of a partial eigendecomposition may reduce the number of
sweeps by as much as 20%. Assuming that a full eigendecomposition of a 32 by 32 matrix
costs 6 times what a single sweep of Jacobi would cost, this analysis suggests that the added
cost of a full eigendecomposition will not reduce the number of sweeps sufficiently to result
in a net decrease in execution time, especially if DGEMM performs efficiently on a smaller
block size10. On the other hand, since most of the advantage of a full eigendecomposition
will come from the second sweep, using two sweeps of Jacobi in the partial eigensolver
may result in a net decrease in execution time. This analysis depends on a great many
assumptions and should be taken as a guide, not a prediction. Schreiber[150] reached a
similar conclusion.
In a non-blocked code, the "partial eigendecomposition" should consist of a rota-
tion, i.e. a full eigendecomposition. In a non-blocked code, the cost of the partial eigensolver,
10A smaller block size reduces the cost of the partial eigensolver.
though still O(n2nb), is lower because nb = 1 and for a 2 by 2 matrix, a single sweep of
Jacobi is a full eigendecomposition. Except for very small n, say n < 100, partial eigendecompositions, such as those suggested by Götze[85], are not likely to result in lower total
execution time.
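To make the 2 by 2 case concrete, here is a small Python sketch (illustrative, not the recommended implementation): a single exact Jacobi rotation diagonalizes a symmetric 2 by 2 matrix, i.e. one rotation is a full eigendecomposition.

```python
import math

def jacobi_2x2(a, b, d):
    """One exact two-sided Jacobi rotation on the symmetric 2x2 matrix
    [[a, b], [b, d]].  Returns ((e1, e2), (c, s)): for a 2x2 matrix a
    single rotation IS a full eigendecomposition, so e1, e2 are the
    eigenvalues."""
    if b == 0.0:
        return (a, d), (1.0, 0.0)
    theta = 0.5 * math.atan2(2.0 * b, d - a)
    c, s = math.cos(theta), math.sin(theta)
    e1 = c * c * a - 2.0 * c * s * b + s * s * d
    e2 = s * s * a + 2.0 * c * s * b + c * c * d
    # the rotated off-diagonal entry, which should vanish
    off = c * s * (a - d) + b * (c * c - s * s)
    assert abs(off) < 1e-12 * (abs(a) + abs(b) + abs(d))
    return (e1, e2), (c, s)

# [[4, 1], [1, 3]] has eigenvalues (7 +/- sqrt(5)) / 2
eigs, rot = jacobi_2x2(4.0, 1.0, 3.0)
assert abs(sum(eigs) - 7.0) < 1e-12  # trace is preserved
```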
In a blocked eigensolver, one must compute a partial eigendecomposition for each
pairing. Most commonly, a single sweep of two-sided Jacobi is used as the partial eigendecomposition. Since the elements in the diagonal blocks A[I,I] and A[J,J] are involved in
more pairings than the elements in the off-diagonal block A[I,J], they need not be annihilated in every
pairing.
The number of partial eigenproblems that can be performed simultaneously is
n/(2nb). If this is less than p, either the partial eigenproblems must themselves be performed
in parallel or some processors will be idle. Unless nb is quite large, say nb ≥ 64, it is
likely to be faster to compute them each on a single processor, especially since the partial
eigendecomposition is a two-sided, not one-sided, sweep.
If n/(2nb) = pc, it is natural to assign one processor within each processor column
to perform the partial eigendecomposition. If n/(2nb) > pc, each parallel pairing will have
more partial eigenproblems than processor columns, hence the code could assign different
partial eigenproblems to different processors within each processor column. The other alternative is to increase pc (decreasing pr). Hence, assigning different partial eigenproblems to
different processors within a column only makes sense if bandwidth cost makes increasing pc
unattractive. On the other hand, the only disadvantage to assigning different partial eigenproblems to different processors within a column (as opposed to increasing pc) is increased
code complexity.
If the cost of the divisions and square roots needed to form the rotations (O(n²)
per sweep, spread over the pc processor columns) is significant, one should
consider inexact rotations in the partial eigensolver. Götze points out that one need not
perform exact rotations and suggests a number of approximate rotations which avoid divides
and square roots[85]. It would be counterproductive to use inexact rotations (saving O(n²)
flops at the expense of increasing the number of sweeps and the accompanying O(n³) flops)
in a parallel cyclic Jacobi method. Likewise, I would be hesitant to use inexact rotations in
the partial eigensolver unless doing so makes it feasible to perform two sweeps in the partial
eigensolver. However, it is entirely possible that more sweeps with inexact rotations might
be better than fewer sweeps using exact rotations in the partial eigensolver.
Using a classical threshold scheme in the partial eigensolver is likely to save little
time, but using thresholds to perform more important rotations might improve performance.
A classical threshold scheme is not attractive because the processors performing fewer rotations would simply sit idle. However, having each processor compute the same number of
rotations, while using thresholds to skip some rotations, might allow the rotations performed
to be more productive.
7.3.10 Threshold
For serial cyclic codes, thresholds can significantly reduce the total number of
floating point operations performed, especially on spectrally diagonally dominant matrices.
Since Jacobi methods are most likely to be attractive on spectrally diagonally dominant
matrices, thresholds cannot be rejected as unimportant. However, in a blocked parallel program, an entire block can only be skipped if the whole block requires no rotations. As an
example, consider a blocked parallel Jacobi eigensolution of a 1024 by 1024 matrix on a
1024 node computer using a block size of 16. This would involve 63 (or 64) steps, each of
which would consist of 32 pairings performed in parallel. Each pairing involves a partial
eigendecomposition of a 2×16 by 2×16 matrix. If any of the off-diagonal elements in any
of the 32 pairings requires annihilation, no savings is achieved in that step. Hence, in the
worst case, if just 63 of the 499,001 off-diagonal elements (one per step) require annihilation,
the threshold algorithm realizes no benefit.
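The counting behind this worst case can be checked directly (a sketch; a round-robin schedule with blocks−1 parallel steps per sweep is assumed, matching the figures in the text):

```python
# Worst-case counting for the blocked-threshold example above:
# n = 1024 on 1024 nodes with block size nb = 16.
n, nb = 1024, 16
blocks = n // nb                   # 64 block columns
steps_per_sweep = blocks - 1       # 63 parallel steps per sweep
pairings_per_step = blocks // 2    # 32 pairings performed in parallel
pairing_dim = 2 * nb               # each pairing solves a 32 x 32 problem
assert (steps_per_sweep, pairings_per_step, pairing_dim) == (63, 32, 32)
# one stray off-diagonal element per step (63 in all) is enough to
# deny the threshold scheme any savings
```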
Corbato[47] devised a method for implementing a classical Jacobi method in O(n³)
time. His method involves keeping track of the largest off-diagonal element in each column.
The cost of maintaining this data structure would more than double the cost of each rotation
and may not lead to reduced execution time even in serial codes. However, Beresford
Parlett[137] pointed out to me that one need not keep track of the true largest element;
since each rotation must maintain the sum of the squares of the elements, allowing
the list of "largest" off-diagonal elements to be out-of-date would not seriously undermine the
advantage and would significantly reduce the overhead. This deserves further study.
Untested Threshold methods
One could design a code that used variable block sizes and/or switched from a
one-sided non-threshold Jacobi to a two-sided threshold Jacobi. A code could even scan
the matrix, identify the elements that need to be eliminated, and select pairings and block
sizes that would eliminate those elements as efficiently as possible. In our worst case example
given in the preceding paragraph, it might be that those 63 off-diagonal elements could be
annihilated in just two parallel steps, each requiring only a two element rotation.
Scanning all off-diagonal elements and choosing the largest n non-interfering elements might be an attractive compromise between the classical Jacobi method, which examines all off-diagonal elements and annihilates the largest, and the cyclic Jacobi method,
which annihilates all elements without regard to size. If software overhead could be kept
modest, such a method might pay off on small spectrally diagonally dominant matrices,
precisely the matrices that are best suited to Jacobi methods.
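A hypothetical sketch of such a scan (the greedy selection and the function name are mine, not the dissertation's): sort the off-diagonal entries by magnitude and keep the largest set whose index pairs are pairwise disjoint, so the corresponding rotations do not interfere and can be applied in parallel.

```python
def largest_noninterfering(a):
    """Greedily pick the largest off-diagonal entries of the symmetric
    matrix `a` (list of lists) whose index pairs are pairwise disjoint,
    so the corresponding rotations can be performed simultaneously."""
    n = len(a)
    candidates = sorted(
        ((abs(a[i][j]), i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    used, chosen = set(), []
    for mag, i, j in candidates:
        if mag > 0.0 and i not in used and j not in used:
            chosen.append((i, j))
            used.update((i, j))
    return chosen

pairs = largest_noninterfering([[0, 5, 1, 0],
                                [5, 0, 2, 9],
                                [1, 2, 0, 3],
                                [0, 9, 3, 0]])
assert pairs == [(1, 3), (0, 2)]  # (1,3) is largest; (0,2) next that fits
```

The greedy pass is O(n² log n) per step, far below the O(n⁴) per sweep of a strict classical search.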
Jacobi methods that attempt to annihilate larger elements, i.e. threshold methods,
work best with two-sided Jacobi methods. This is unfortunate because it appears that
one-sided Jacobi is otherwise preferred.
As mentioned in section 7.3.9, thresholds might be useful in the partial eigendecomposition.
7.3.11 Pairing
The order in which the off-diagonal elements are annihilated is referred to as the
pairing strategy. Eliminating off-diagonal element A(i,j) in a two-sided Jacobi requires that
rows i and j of A and columns i and j of A be rotated. Hence, rows i and j of A must be
distributed similarly, i.e. A(i,k) and A(j,k) must both reside on the same processor. Likewise,
columns i and j of A must be distributed similarly. Orthogonalizing vectors i and j in
a one-sided Jacobi also requires that the two vectors be distributed similarly. In order to
annihilate multiple off-diagonal elements simultaneously, they must reside on different sets
of processors.
The pairing strategy affects execution time through communication cost, the number of pairings per sweep, and the number of sweeps required for convergence. Different pairing strategies require different communication patterns and hence different communication
costs. Some pairing strategies require slightly more pairings than others. Mantharam and
Eberlein argue that some pairings lead to faster convergence than others[72].
In this section, we illustrate two pairing strategies, showing how each would pair
8 elements in 4 sets at a time. The elements might be individual indices (in a non-blocked
Jacobi) or blocks of indices. The sets might correspond to individual processors (in a 1D
data layout) or columns of processors (in a one-sided Jacobi on a 2D layout) or rows and
columns of processors (in a two-sided Jacobi on a 2D layout). Furthermore, several sets
might be assigned to the same processor or column of processors.
The classic round robin pairing strategy[84] leaves one element stationary and
rotates the other elements. As the following diagram shows, in 7 pairings, each element is
paired exactly once with each of the other elements. Elements 3 through 8 follow elements
2 through 7 respectively, while element 2 follows element 8.
    1 2 3 4    1 3 4 5    1 4 5 6    1 5 6 7
    8 7 6 5    2 8 7 6    3 2 8 7    4 3 2 8

    1 6 7 8    1 7 8 2    1 8 2 3
    5 4 3 2    6 5 4 3    7 6 5 4
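The rotation rule above can be written as a short generator (a sketch; the function name is mine):

```python
def round_robin(n):
    """Round-robin pairing schedule for an even number n of elements:
    element 1 stays put while the others shift, so that (as in the
    diagram above) elements 3..n follow 2..n-1 and element 2 follows
    element n.  The n-1 steps pair every element with every other
    element exactly once."""
    assert n % 2 == 0
    others = list(range(2, n + 1))
    schedule = []
    for _ in range(n - 1):
        line = [1] + others
        top, bottom = line[: n // 2], line[n // 2:][::-1]
        schedule.append(list(zip(top, bottom)))
        others = others[1:] + others[:1]  # shift the ring
    return schedule

steps = round_robin(8)
assert steps[0] == [(1, 8), (2, 7), (3, 6), (4, 5)]
all_pairs = {frozenset(p) for step in steps for p in step}
assert len(all_pairs) == 8 * 7 // 2   # every pair occurs exactly once
```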
A slight variation, called the caterpillar pairing method[72, 73, 150], cuts the
communication cost in half at the expense of increasing the number of pairings from n−1
to n. The caterpillar method, modified so that communication is always performed in the
same direction, is shown below. Only the elements in the top line rotate, and they always
rotate to the left. The elements in the bottom line get swapped into the top
line one at a time. In this pairing method, it takes 8 pairings in order for each element to
be paired with every other element. The swapped elements need not perform any work,
but must exchange the blocks assigned to them prior to the next communication step. This
pairing strategy requires 16 (in general 2n) pairings to come back to the original pairing,
but the second n pairings duplicate the first n.
    1 2 3 4    2 3 4 8    3 4 8 2    4 8 7 3    8 7 3 4
    5 6 7 8    5 6 7 1    5 6 7 1    5 6 2 1    5 6 2 1

    7 6 4 8    6 4 8 7    5 8 7 6    8 7 6 5
    5 3 2 1    5 3 2 1    4 3 2 1    4 3 2 1
Mantharam and Eberlein[72] suggest that some pairing strategies may lead to
convergence in fewer steps than others.
7.3.12 Pre-conditioners
One-sided Jacobi methods compute eigenvectors by orthogonalizing a matrix which
has the same or related left singular vectors as the original matrix. Some options include:

[U,D,V] = svd(A): U contains the eigenvectors of A; D is the absolute value of the
eigenvalues of A. This method is used by Berry and Sameh[21].

[U,D,V] = svd(chol(A)): U contains the eigenvectors of A; D is the square root of the
eigenvalues of A. This is used by Arbenz and Slapničar[9] and is mathematically
equivalent to classical Jacobi.

[Q,R] = qr(A); [U,D,V] = svd(R): Q·U contains the eigenvectors of A; D contains the
absolute value of the eigenvalues of A.
In addition, there are pivoting counterparts to both Cholesky and QR, indeed many flavors
of QR with pivoting, which would improve these pre-conditioners. If A is spectrally
diagonally dominant, permuting A so that the diagonal elements are non-increasing might
provide most of the benefit that Cholesky with pivoting does, and at considerably lower
cost.
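The permutation is cheap to sketch (illustrative Python; here P A Pᵀ denotes the symmetric permutation, and the function name is mine):

```python
def sort_by_diagonal(a):
    """Symmetrically permute the symmetric matrix `a` (list of lists)
    so that its diagonal entries are non-increasing.  Returns the
    permutation and P A P' -- a cheap stand-in for Cholesky with
    pivoting on spectrally diagonally dominant matrices."""
    n = len(a)
    perm = sorted(range(n), key=lambda i: -a[i][i])
    pap = [[a[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
    return perm, pap

perm, b = sort_by_diagonal([[1, 2, 3], [2, 5, 4], [3, 4, 2]])
assert [b[i][i] for i in range(3)] == [5, 2, 1]     # diagonal now sorted
assert all(b[i][j] == b[j][i] for i in range(3) for j in range(3))
```

The permutation costs O(n log n) comparisons plus a data movement pass, versus the O(n³)/3 flops of a pivoted Cholesky.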
7.3.13 Communication overlap
Overlapping communication and computation is attractive because in theory it reduces the total cost from the sum of the computation and communication costs to their maximum. Arbenz and Slapničar demonstrated that overlapping communication and computation is straightforward in a one-sided Jacobi method with a one-dimensional data layout[10].
But overlapping communication and computation when using a two-dimensional data layout is not as straightforward. Furthermore, actual experience with communication and
computation overlap has been disappointing; see section B.1.6.
7.3.14 Recursive Jacobi
The partial eigendecomposition could be a recursive call to a Jacobi eigensolver. A
recursive Jacobi could offer all the benefits shown by Toledo on LU[165], notably excellent
use of the memory hierarchy. Unfortunately, each level of recursion requires 6 calls, tripling
the software overhead. Therefore, the number of subroutine calls, and hence the software
overhead, grows at an unacceptably high O(n^lg(6)).
Increasing software overhead in order to reduce the number of sweeps will make
sense for large matrices but not for small matrices. Since Jacobi is unlikely to be faster than
tridiagonal based methods for large matrices, I feel that it is more important to concentrate
on making Jacobi fast on smaller matrices. Hence, I do not include recursion as a part of my
recommended Jacobi method. Nonetheless, it may be that one step of recursion (tripling
the software overhead) and conceivably two steps of recursion (increasing software overhead
by a factor of 9) may reduce total execution time, but I would not expect the improvement
to be significant.
7.3.15 Accuracy
Demmel and Veselić[58] prove that on scaled diagonally dominant matrices, Jacobi
can compute small eigenvalues with high relative accuracy while tridiagonal based methods
cannot. Drmač and Veselić[71] show that Jacobi methods can be used to refine an eigensolution, thereby providing high relative accuracy on scaled diagonally dominant matrices
at lower total cost than a full Jacobi. Demmel et al.[56] give a comprehensive discussion of
the situations in which Jacobi is more accurate than other available algorithms.
7.3.16 Recommendation
If I were asked to write one Jacobi method for all non-vector distributed memory
computers, it would be a one-sided blocked Jacobi method. It would use a one-dimensional
data layout on computers with fewer than 48 nodes and a two-dimensional data layout on
computers with 48 or more nodes. It would use 16-32 times as many processor columns as
rows in a two-dimensional data layout.¹¹ It would use a computational and communication
block size equal to¹² max(n/(2pc), 8), leaving processors idle if n/(2pc) < 8. It would
compute the partial eigendecompositions on just one processor in each processor column.
It would avoid recomputing diagonal entries unnecessarily, use a one-directional caterpillar
track pairing and one sweep of Jacobi for the partial eigendecomposition. It would use the
largest block size possible for both computation and communication.
If I had time to experiment, I would investigate different partial eigendecompositions, pre-conditioners, and pairing strategies, in that order. Overlapping communication
and computation appears to offer greater performance improvements in theory than in practice. I would use thresholds as a part of the stopping criteria, but wouldn't count on them
to avoid unnecessary flops. I would check to make sure that my suggested data layout (1D
for p < 48; 16pr < pc < 32pr for p ≥ 48; and nb = max(n/(2pc), 8)) was reasonable on
several computers, but unless there was a substantial benefit to tuning the data layout to
each machine I would hesitate to do so.
For vector machines I recommend an unblocked code with fast Givens rotations
if the cost of BLAS1 operations is no more than twice that of BLAS3 operations. If the
BLAS1 operations cost just twice what BLAS3 operations cost, the flop cost in an unblocked
¹¹The ratio pc/pr can be made to fall in the 16-32 range for any number of processors except 1 to 15, 32 to 63, and 128 to 144. No more than 2.1% of the processors are left idle following these rules.
¹²Definitions for all symbols used here can be found in Appendix A.
code would be 6/5 that of the blocked code (because unblocked codes using fast Givens
require 3/5 as many flops). Savings on other aspects can be expected to make up for this
difference on all but the largest matrices. Communication should still be blocked, however.
A one-dimensional data layout can be used for more nodes if a cyclic code is used, perhaps
as many as a hundred nodes, since block size is not an issue. As long as n ≥ 2p, a one-dimensional data layout is limited only by communication costs.
Combining elements of classical and cyclic Jacobi is an interesting long shot. Classical Jacobi always annihilates the largest off-diagonal element but requires O(n⁴) comparisons per sweep.¹³ Annihilating the n largest off-diagonal elements each time would roughly
match the number of comparisons performed to the number of flops performed. To parallelize this idea, one would have to choose the n largest non-interfering elements.
7.4 ISDA
The total execution time for the ISDA[97] for solving the symmetric eigenproblem¹⁴ will be no less than 100n³ on typical matrices. The execution time depends largely
on how many decouplings are required to make each of the smaller matrices no larger than
half the size of the original matrix. It also depends on the cost of each decoupling, but this
will not vary that much.
The ISDA achieves high floating point execution rates, but in order to beat tridiagonal methods it must achieve 100/(10/3) = 30 times higher floating point rates, which
it does not. The PRISM implementation of ISDA takes 36 minutes = 2160 seconds to
compute the eigendecomposition of a matrix of size 4800 by 4800 on the 100 node SP2 at
Argonne[29]; ScaLAPACK's PDSYEVX takes 397 seconds to compute the eigendecomposition
of a matrix of size 5000 by 5000 on a 64 node SP2[31]. ISDA should not require as large
a granularity, n/√p, as PDSYEVX because of its heavy reliance on matrix-matrix multiply.
However, at present, the PRISM implementation is still at least three times slower than
PDSYEVX even on small matrices. Solving a matrix of size 800 by 800 on 64 nodes takes 60
seconds using the PRISM ISDA code, whereas PDSYEVX can solve a matrix of size 1000 by
1000 on 64 nodes of an SP2 in 16 seconds.
¹³Or increased overhead if Corbato's method[47] is used.
¹⁴See section 2.7.3 for a brief description of the ISDA.

The cost of each decoupling depends upon how close the split happens to come to
an eigenvalue of the matrix being split. The number of beta function evaluations required
for a given decoupling is roughly −log(min_i |split − λ_i|), where split is the split point
selected for this decoupling. The distance between split and the nearest eigenvalue cannot be
computed in advance, but the number of evaluations is likely to fall in the range
(log(n)/log(1.5) + 2, log(n)/log(1.5) + 8). This is consistent with empirical results. For our
purposes we will say that the number of beta function evaluations is log(1500)/log(1.5) + 2 ≈ 20.
The cost per beta function evaluation is 2 matrix-matrix multiplies at 2(n′)³/p γ₃
each, where n′ is the size of the matrix being decoupled. Hence the cost for the first
decoupling is 2 × 2 × 20 n³/p γ₃ = 80n³/p γ₃.
If each decoupling splits the matrix exactly in half, round i of decouplings involves
2^i decouplings, each involving a matrix of size n/2^i, at a total cost of 2^i × 80(n/2^i)³/p γ₃ =
80n³/(4^i p) γ₃. The sum over all rounds would then be Σ_{i=0}^∞ 80n³/(4^i p) γ₃ = 80 × 4/3 n³/p γ₃ ≈ 107n³/p γ₃.
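The arithmetic above can be bundled into a toy cost model (units of n³/p, with the γ₃ time-per-flop factor dropped; the finite round cutoff is an assumption standing in for the infinite sum, and the function names are mine):

```python
import math

def beta_evaluations(n, slack=2):
    """The text's estimate of beta-function evaluations per decoupling:
    log(n)/log(1.5) + slack, with slack assumed between 2 and 8."""
    return math.log(n) / math.log(1.5) + slack

assert round(beta_evaluations(1500)) == 20   # the figure used in the text

def isda_cost(evals=20, rounds=60):
    """Total flop count of the ISDA in units of n^3/p: if every
    decoupling splits its matrix in half, round i performs 2**i
    decouplings on matrices of size n/2**i, each costing
    2 multiplies * evals * 2*(n')**3 flops."""
    return sum(2 ** i * 2 * evals * 2 * (0.5 ** i) ** 3
               for i in range(rounds))

# the geometric series sums to 80 * 4/3 ~ 107 (in units of n^3/p)
assert abs(isda_cost() - 320 / 3) < 1e-9
```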
The ISDA for symmetric eigendecomposition may require substantially longer on
some matrices with a single cluster of eigenvalues containing more than half of the eigenvalues and on matrices with most of the eigenvalues at one end of the spectrum.¹⁵ It is
unlikely that the first split point chosen for decoupling will lie in the middle of a cluster.
Hence, if the matrix contains one large cluster, that cluster will likely remain completely
in one of the two submatrices, making the decoupling less even and hence less successful.
Likewise, if most of the eigenvalues are at one end of the spectrum, the submatrix on that
end of the spectrum will likely be much larger than the other after the first decoupling. If
each decoupling splits off only 20% of the spectrum, the total time will be twice what it
would be if each decoupling split the spectrum exactly in half.
One could check to make sure that a reasonable split point has been chosen by
performing an LDLᵀ decomposition on the shifted matrix and counting the number of
positive or negative values in D. An LDLᵀ decomposition costs n³/3 flops, or about 0.5%
of the flops required to perform the full decoupling.
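A minimal sketch of that check (unpivoted LDLᵀ in pure Python; a real code would need pivoting or a perturbed shift when a pivot comes out exactly zero):

```python
def ldlt_inertia(a, shift=0.0):
    """Unpivoted LDL' factorization of a - shift*I (the symmetric
    matrix `a` given as a list of lists); returns the number of
    negative entries of D.  By Sylvester's law of inertia this equals
    the number of eigenvalues below the shift, so it verifies how a
    proposed ISDA split point divides the spectrum.  Assumes no pivot
    comes out exactly zero."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    d = [0.0] * n
    for j in range(n):
        d[j] = a[j][j] - shift - sum(L[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, n):
            L[i][j] = (a[i][j] - sum(L[i][k] * L[j][k] * d[k]
                                     for k in range(j))) / d[j]
    return sum(1 for x in d if x < 0.0)

# [[4, 1], [1, 3]] has eigenvalues (7 +/- sqrt(5))/2 ~ 2.38 and 4.62
assert ldlt_inertia([[4.0, 1.0], [1.0, 3.0]], shift=3.0) == 1
```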
7.5 Banded ISDA
Banded ISDA is very nearly a tridiagonal based method and hence offers performance that is nearly as good as tridiagonal based methods. PRISM's single processor
implementation of banded ISDA is two to three times slower than bisection (DSTEBZ)[26].

¹⁵Fann et al.[75] present a couple of examples of real applications that fit this description.
Computing eigenvectors using banded ISDA will not only be more difficult to code, it will
require about twice as many flops as inverse iteration. Banded ISDA requires additional
bandwidth reductions, each of which requires up to 2n³ additional flops during back transformation.¹⁶

Banded ISDA could make sense if reduction to banded form were twice as fast as
reduction to tridiagonal form, although even then one has to question whether it makes sense
to use banded ISDA instead of a banded solver.

Banded ISDA should perform a few shifted LDLᵀ decompositions to make sure
that the selected shift will leave at least 1/3 of the matrix in each of the two submatrices.
7.6 FFT
Yau and Lu[174] have implemented an FFT based invariant subspace decomposition method. It, like ISDA, uses efficient matrix-matrix multiply flops, but since it requires
100n³ flops, the same analysis which shows that ISDA will not be faster applies to it as well.
Domas and Tisseur have implemented a parallel version of the Yau and Lu method[60].

¹⁶The first bandwidth reduction essentially always requires the full 2n³ flops during back transformation, though later ones typically require less than that. However, taking advantage of the opportunity to perform fewer flops either means a complex data structure or that the update matrix Q be formed and then applied, adding another 4/3 n³ flops.
Chapter 8
Improving the ScaLAPACK
symmetric eigensolver
8.1 The next ScaLAPACK symmetric eigensolver
The next ScaLAPACK symmetric eigensolver will be 50% faster than the ScaLAPACK
symmetric eigensolver in version 1.5 and provide performance that is independent of the
user's data layout. Separating internal and external data layout will not only make the code
easier to use, because the user need not modify their storage scheme, it will also improve
performance. The next ScaLAPACK symmetric eigensolver will select the fastest of four
methods for reduction to tridiagonal form¹ and use Parlett and Dhillon's new tridiagonal
eigensolver[139].
Separating internal and external data layout allows execution time to be reduced
for several reasons. It allows reduction to tridiagonal form and back transformation to use
different data layouts. It allows reduction to tridiagonal form to use a square processor grid,
significantly reducing message latency and software overhead. It allows the code to support
any input and output data layout without all the layers of software required to support
any data layout. Last but not least, by concentrating our coding efforts on the simple but
efficient square cyclic data layout, we can implement several reduction to tridiagonal codes
and incorporate ideas that would be prohibitively complicated in a code that had to support
multiple data layouts.
¹On machines where timers are not available, a heuristic will be used which may not always pick the fastest.
The rest of this section concentrates on improving execution time in reduction to
tridiagonal form. Back transformation is already very efficient and hence leaves less room
for improvement. We leave the tridiagonal eigensolver to others[139]. Figure 8.1 gives a
top-level description of the next ScaLAPACK symmetric eigensolver.
Figure 8.1: Data redistribution in the next ScaLAPACK symmetric eigensolver

    Choose a data layout for reduction to tridiagonal form (see figure 8.2)
    Redistribute A to the reduction to tridiagonal form data layout
    Reduce to tridiagonal form
    Replicate the diagonal, (D), and sub-diagonal, (E), to all processors
    Use Parlett and Dhillon's tridiagonal eigendecomposition scheme
    Choose a data layout for back transformation:
        BCK-pr = ⌈√p/15⌉ ; BCK-pc = ⌊p/pr⌋ ; BCK-nb = ⌈n/(k pc)⌉
    If space is limited, redistribute A back to the original data layout
    Redistribute the eigenvectors, Z, to the back transformation data layout
    Redistribute A to the back transformation data layout
    Perform back transformation
    Redistribute the eigenvectors to the user's format
8.2 Reduction to tridiagonal form in the next ScaLAPACK symmetric eigensolver
Figure 8.2 shows how the data layout for reduction to tridiagonal form will be
chosen. The data layout and the code used for reduction to tridiagonal form must be
chosen in tandem.
Although the new PDSYTRD has three variants, they all share the same pattern of
communication and computation shown in figure 8.3.
Message initiations are reduced by using techniques first used in HJS, and several
new ones. HJS stores V and W in a row-distributed/column-replicated manner, which avoids
the need to broadcast them repeatedly. HJS also keeps the number of messages small by
combining messages wherever possible.
Our communication pattern has three advantages over HJS: it requires fewer messages, does not risk over/underflow, and uses only the BLACS communication primitives.²
The manner in which we compute the Householder vector requires the same number of
message initiations as HJS, but avoids the risk of over/underflow in the computation of
the norm. We use fewer messages than HJS because we update w in a novel manner (see
the discussion of Line 4.1 below) and we delay the spread of w (which HJS naturally performs
at the bottom of the loop) to the top of the loop so that it can be spread in the same
message that spreads v.

²Whether the right communication primitives were chosen for the BLACS may be debatable, but they are what is available for use within ScaLAPACK.

Figure 8.2: Choosing the data layout for reduction to tridiagonal form

    If timers (or an environmental inquiry routine) are available
        Time select operations
        Determine the best data layout for each of the four reduction to tridiagonal form codes
        Estimate the execution time for each of the four reduction to tridiagonal form codes
        Select the fastest code and the corresponding data layout
    elseif p/⌊√p⌋² ≥ 1.5 (i.e. if p = 2, 3, 6, 7, 14, 15)
        TRD-pr = ⌊p/7.5⌋ + 1
        TRD-pc = p/pr
        TRD-nb = 32
        Use old PDSYTRD
    else
        TRD-pc = ⌊√p⌋
        TRD-pr = TRD-pc
        TRD-nb = 1
        if the compiler is good
            if (n > 200√p)
                Use new PDSYTRD with compiled kernel
            else
                Use unblocked reduction to tridiagonal form (no BLAS)
            endif
        else
            if (n > 100√p)
                Use new PDSYTRD with DGEMV
            else
                Use unblocked reduction to tridiagonal form (no BLAS)
            endif
        endif
    endif
Our communication pattern has one disadvantage compared to HJS: it requires redundant
computation in the update of w. The discussion of Line 4.1 below explains that we can
choose to eliminate this redundant computation by increasing the number of messages.
Line 2.1 in Figure 8.3: In Section 8.4.1 we show how to avoid overflow while using just
2n lg(√p) messages.
Lines 3.2 and 3.6 in Figure 8.3: Only 2 messages are required to transpose a matrix
when a square processor layout is used. Each processor (a, b) must send a message
to, and receive a message from, its transpose processor (b, a). The required time is:

    Σ_{n′=1}^{n} 2(α + 2n′β/√p) = 2nα + 2n²β/√p
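A quick numerical check of this closed form (the α, β and p values are made up for illustration; the exact partial sum carries an (n+1) factor that the leading-order expression drops):

```python
import math

# alpha = latency, beta = per-element transfer time, p = processors
alpha, beta, p, n = 1e-5, 1e-8, 64, 1000
rp = math.sqrt(p)
summed = sum(2 * (alpha + 2 * k * beta / rp) for k in range(1, n + 1))
closed = 2 * n * alpha + 2 * n * (n + 1) * beta / rp
assert abs(summed - closed) < 1e-9 * closed
# dropping the lower-order term gives the 2n*alpha + 2n^2*beta/sqrt(p)
# quoted in the text
```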
Figure 8.3: Execution time model for the new PDSYTRD. Line numbers match Figure 4.5 (PDSYTRD execution time) where possible.

    do ii = 1, n, nb
        mxi = min(ii+nb, n)
        do i = ii, mxi
            Update current (ith) column of A:
            1.2       A = A − W Vᵀ − V Wᵀ
            Compute reflector:
            2.1       v = house(A)             [latency: 2n lg(√p) α]
            Perform matrix-vector multiply:
            1.1, 3.1  spread v, w              [latency: n lg(√p) α; bandwidth: n² lg(√p)/√p β]
            3.2       transpose v, w           [bandwidth: 2n²/√p β]
            3.3       w = tril(A) v,           [computation: (2/3)n³/p γ₂]
                      wᵀ = tril(A,−1) vᵀ       [bandwidth: 2n²/√p β]
            Update the matrix-vector product:
            4.1       w = w − W Vᵀv − V Wᵀv    [imbalance: n²/√p γ]
            3.6       w = w + transpose(wᵀ)
            Compute companion update vector:
            5.1       c = w · vᵀ,              [latency: 2n lg(√p) α]
                      w = τ w − (c τ/2) v
        end do
        Perform rank 2k update:
        6.3       A = A − W Vᵀ − V Wᵀ          [computation: (2/3)n³/p γ₃]
    end do

Line 4.1 in Figure 8.3: w = w − W Vᵀv − V Wᵀv can be computed in a number of
ways. W, V and v are distributed across processor rows and replicated across processor
columns. Wᵀ, Vᵀ and vᵀ are distributed across processor columns and replicated
across processor rows. Furthermore, since only the partial sums contributing to w are
known, the updates to w can be made on any processor column, and even spread across
various processor columns. Appendix B.1 shows how this update is performed without communication and shows that there is a range of options which trade off communication
and load imbalance.
Line 1.1 in Figure 8.3 updates the current block column. This can be implemented in
several ways. LAPACK's DSYTRD uses a right looking update³ because a matrix-matrix
multiply is more efficient than an outer product update. HJS uses a left looking update
because, on their cyclic data layout, the left looking update allows all processors to be
involved, reducing load imbalance.
Line 5.1 in Figure 8.3: Computing c = w vᵀ requires summing c within a processor column. In order to compute w in Line 5.1, c must be known throughout a processor
column. To allow w and v to be broadcast in the same message (Line 3.1), c is summed
and broadcast in the column that owns column i+1 of the matrix.
Line 6.3 in Figure 8.3: No communication is required here. W, Vᵀ and Wᵀ are already
replicated as necessary.
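To make the structure of Figure 8.3 concrete, here is a serial, unblocked Python sketch of the same Householder reduction (no W/V blocking and no communication; the function names are mine, and the plain sum-of-squares norm in `house` is exactly the computation that section 8.4.1 replaces with an overflow-safe version in parallel):

```python
import math

def house(x):
    """Householder vector: returns (v, tau, y) such that
    (I - tau v v') x = y e1, with the plain sum-of-squares norm."""
    sigma = math.sqrt(sum(t * t for t in x))
    v = list(x)
    if sigma == 0.0:
        return v, 0.0, 0.0
    v[0] += math.copysign(sigma, x[0])
    tau = 2.0 / sum(t * t for t in v)
    return v, tau, -math.copysign(sigma, x[0])

def tridiagonalize(a):
    """Serial, unblocked Householder reduction of the symmetric matrix
    `a` (list of lists) to tridiagonal form -- the computation that
    Figure 8.3 organizes into blocked, parallel form."""
    n = len(a)
    a = [row[:] for row in a]
    for i in range(n - 2):
        m = n - i - 1
        v, tau, sub = house([a[k][i] for k in range(i + 1, n)])
        a[i + 1][i] = a[i][i + 1] = sub
        for k in range(i + 2, n):
            a[k][i] = a[i][k] = 0.0
        if tau == 0.0:
            continue
        # p = tau * A22 v, then the symmetric rank-2 update
        p = [tau * sum(a[i + 1 + r][i + 1 + c] * v[c] for c in range(m))
             for r in range(m)]
        gamma = 0.5 * tau * sum(p[k] * v[k] for k in range(m))
        w = [p[k] - gamma * v[k] for k in range(m)]
        for r in range(m):
            for c in range(m):
                a[i + 1 + r][i + 1 + c] -= v[r] * w[c] + w[r] * v[c]
    return a

t = tridiagonalize([[4.0, 1.0, 2.0, 0.0],
                    [1.0, 3.0, 1.0, 1.0],
                    [2.0, 1.0, 5.0, 2.0],
                    [0.0, 1.0, 2.0, 4.0]])
# a similarity transform preserves the trace, and everything beyond
# the first off-diagonal is driven (exactly) to zero
assert abs(sum(t[i][i] for i in range(4)) - 16.0) < 1e-9
assert abs(t[0][2]) + abs(t[0][3]) + abs(t[1][3]) < 1e-12
```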
8.3 Making the ScaLAPACK symmetric eigensolver easier to
use
The next ScaLAPACK symmetric eigensolver will separate internal data layout from
external data layout while executing 50% faster than PDSYEVX on a large range of problem
sizes on most distributed memory parallel computers, and while requiring less memory. Separating
internal and external data layout allows the user to choose whatever data layout is most
appropriate for the rest of their code and to use that data layout regardless of the problem
size and computer they are using. Separating internal and external data layouts also makes
it easy for the ScaLAPACK symmetric eigensolver to add support for additional data layouts.
However, while these ease-of-use issues are the most important advantages of separating
internal and external data layout, we will focus further discussion on how this separation improves
performance.
8.4 Details in reducing the execution time of the ScaLAPACK
symmetric eigensolver
Separating internal and external data layout will improve the performance of
PDSYEVX by allowing PDSYEVX to use different data layouts for different tasks, and by allowing PDSYEVX to concentrate only on the most efficient data layout for each task. A reduction
to tridiagonal form which only works on a cyclic data layout on a square processor grid will
not only have lower overhead and load imbalance than the present reduction to tridiagonal
form, but will be able to incorporate techniques that would be prohibitively complicated if
they were implemented in a code that must support all data layouts.

³A right looking update updates the current column with a matrix-matrix multiply. A left looking update updates every column in the block column with an outer product update.
Significant reduction of the execution time in PDSYEVX, the ScaLAPACK symmetric
eigensolver, requires that all four sources of inefficiency (message latency, message transmission, software overhead and load imbalance) be reduced. Fortunately, as Hendrickson,
Jessup and Smith[91] have shown, all of these can be reduced. PDSYEVX sends 3 times as
many messages as necessary⁴ and requires 3 times as much message volume as well.⁵ Overhead and load imbalance costs are harder to quantify. Load imbalance costs will be reduced
by using data layouts appropriate to each task.⁶ If necessary, load imbalance costs can be
further reduced at the expense of increasing the number of messages sent. Overhead will
be reduced by eliminating the PBLAS, reducing the number of calls to the BLAS and, where
a sufficiently good compiler is available, eliminating the calls to the BLAS entirely.
8.4.1 Avoiding over ow and under ow during computation of the House-
holder vector without added messages
Over ow and under ow can be avoided during the computation of the Householder
vector without added messages by using the pdnrm2 routine to broadcast values. The eas-
iest way to compute the norm of a vector in parallel is to sum the squares of the elements.
However, this will lead to over ow if the square of one of the elements or one of the inter-
mediate values are greater than the over ow threshold (likewise under ow occurs if one or
more of the squares of the elements or the intermediate vallues is less than the under ow
threshold). The ScaLAPACK routine pdnrm2 avoids under ow and over ow during reduc-
tions by computing the norm directly leaving the result on all processors in the processor
column. The requires 2 lg(pr)� execution time. In PDSYTRD, � = A(i+1; i) is broadcast
4 PDSYEVX uses 17 n log(sqrt(p)) messages and HJS uses 9 n log(sqrt(p)); we will show that this can be reduced to 5 n log(sqrt(p)), but do not claim that this is minimal.
5 PDSYEVX sends (5 log(sqrt(p)) + 2) n^2/sqrt(p) elements per processor and HJS reduces this to (1/2 log(sqrt(p)) + 5/2) n^2/sqrt(p) elements per processor. The design I suggest requires (3/2 log(sqrt(p)) + 5/2) n^2/sqrt(p) elements per processor but requires fewer messages.
6 Statically balancing the number of eigenvectors assigned to each processor column will reduce load imbalance in back transformation. Using a smaller block size will reduce load imbalance in reduction to tridiagonal form.
to all processors in the processor column; this broadcast also requires 2 lg(pr) α execution
time. In HJS, they sum the squares of the elements and broadcast β = A(i+1, i) at the
same time by summing an additional value in the reduction. All processors except the
processor that owns A(i+1, i) contribute 0 to the sum, while the processor owning A(i+1, i)
contributes A(i+1, i).
In the new PDSYEVX, we will employ this trick to broadcast β at the same time
as the norm is computed. It is slightly more complicated because norm computations do
not preserve negative numbers. Hence, we compute two norms: max(0, β) and max(0, −β),
from which β is easily recovered. Ideally, we need a new PBLAS or BLACS routine which
would simultaneously compute a norm and broadcast both it and other values.
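The combined reduction can be sketched as follows. This is a Python illustration of the idea only, assuming a dlassq-style (scale, sumsq) representation for the scaled sum of squares; the four-tuple format and function names are mine, not pdnrm2's actual interface. The two extra slots must be nonnegative because they ride along in the same reduction using max:

```python
import math

def combine(a, b):
    """Merge two (scale, sumsq, pos, neg) contributions. (scale, sumsq)
    follow the LAPACK dlassq convention, so the running sum of squares
    never overflows or underflows; pos and neg are combined with max."""
    (sa, qa, pa, na), (sb, qb, pb, nb) = a, b
    if sa < sb:
        sa, qa, sb, qb = sb, qb, sa, qa
    if sa > 0.0:
        qa += qb * (sb / sa) ** 2
    return (sa, qa, max(pa, pb), max(na, nb))

def local_part(x_local, beta_if_owner=None):
    """One processor's contribution: scaled sum of squares of its local
    elements, plus max(0, beta) and max(0, -beta) (zero on non-owners)."""
    scale, sumsq = 0.0, 1.0
    for xi in x_local:
        if xi != 0.0:
            ax = abs(xi)
            if scale < ax:
                sumsq = 1.0 + sumsq * (scale / ax) ** 2
                scale = ax
            else:
                sumsq += (ax / scale) ** 2
    b = beta_if_owner if beta_if_owner is not None else 0.0
    return (scale, sumsq, max(0.0, b), max(0.0, -b))

# Simulate the reduction over four "processors"; only one owns beta.
beta = -3.5
parts = [local_part([1.0, 2.0]), local_part([2.0], beta_if_owner=beta),
         local_part([4.0]), local_part([])]
acc = parts[0]
for part in parts[1:]:
    acc = combine(acc, part)
scale, sumsq, pos, neg = acc
norm = scale * math.sqrt(sumsq)      # ||(1, 2, 2, 4)|| = 5
recovered_beta = pos - neg           # -3.5, despite the max-based combine
```

Because combine is associative, the same result is obtained regardless of the reduction tree, which is what lets the trick piggyback on the existing norm reduction.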
8.4.2 Reducing communications costs
Communications costs can be reduced in both reduction to tridiagonal form and
back transformation, but by vastly different methods. PDSYTRD, ScaLAPACK's reduction to
tridiagonal form code, will use a cyclic data layout on a square processor grid to simplify
the code, allowing PDSYEVX to use the techniques demonstrated by Hendrickson, Jessup and
Smith[91]: direct transpose, a column-replicated/row-distributed data layout for intermediate
matrices, and combining messages. In addition, PDSYTRD will delay the last operation
in the loop to combine it with the first, reducing the number of messages per loop iteration
from 6 to 5.
Communication costs will be reduced in back transformation by using a rectangular
grid and a relatively large block size. Most of the communication in back transformation
is within processor columns, and the communication within processor columns cannot be
pipelined (meaning that it grows as log(pr)), hence setting pc to be substantially larger
(roughly 4-8 times larger) than pr will cut message volume nearly in half compared to the
message volume required for a square processor grid.
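As a rough illustration of why a tall-narrow grid helps, the following toy model (my own simplification, not a cost formula from this thesis) charges each processor column n^2/pc words of broadcast traffic, multiplied by log2(pr) because the column broadcasts cannot be pipelined:

```python
import math

def column_comm(n, pr, pc):
    """Toy estimate of column-broadcast volume in back transformation:
    roughly n^2/pc words travel down each processor column, and an
    unpipelined tree broadcast over pr rows multiplies that by log2(pr).
    The constants are illustrative assumptions only."""
    return (n * n / pc) * max(1.0, math.log2(pr))

n = 4096
square = column_comm(n, 8, 8)    # square grid, p = 64
rect = column_comm(n, 4, 16)     # pc = 4 * pr, same p = 64
assert rect < square             # the rectangular grid communicates less
```

Under this particular model the 4-to-1 rectangular grid carries one third of the square grid's column-broadcast volume; the real savings depend on the machine and the share of row communication.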
Communications cost could be reduced further on select computers by writing machine-specific
BLACS implementations7, but I don't think that the benefit will justify the
7 Karp et al.[107] proved that a broadcast or reduction of k elements on px processors can be executed in log(px) α + k β time. Equally importantly, the latency term can be reduced significantly by machine-specific code because latency is primarily a software cost; the actual hardware latency is typically less than one tenth of the total observed latency. I believe that by coding broadcasts and reductions in a machine-specific manner, I could reduce the latency to α_software + log(px) α_hardware. It might be possible to achieve a similar result using active messages. Machine-specific optimization of the BLACS broadcast and reduction codes is attractive because it would benefit all of the ScaLAPACK matrix transformation codes. However,
cost. In PDSYEVX as shipped in version 1.5 of ScaLAPACK, software overhead and load imbalance
are roughly twice as high as communications cost on the PARAGON. The new PDSYEVX
should reduce communications by at least a factor of 2, and though I hope it will reduce
software overhead and load imbalance by close to a factor of 4, overhead and load imbalance
will probably remain larger than communications cost. The fact that communications cost
is not the dominant factor limiting efficiency limits the improvement that one can expect
from machine-specific BLACS implementations.
Communications cost in back transformation could be reduced further by overlapping
communication and computation and/or using an all-to-all broadcast pattern instead
of a series of broadcasts. Back transformation enjoys the luxury of being able to compute
the majority of what it needs to communicate in advance. This allows many possibilities
for reducing the communications bandwidth cost. The fact that message latency, load
imbalance and software overhead costs are modest in back transformation means that a
reduction in the communications bandwidth cost ought to result in significant performance
improvement in back transformation. However, overlapping communication and computation
has historically offered less benefit in practice than in theory (see section B.1.6),
so I approach this with caution and will not pursue it without first convincing myself that
the benefit is significant on several platforms.
8.4.3 Reducing load imbalance costs
Load imbalance can be reduced in both reduction to tridiagonal form and back
transformation by careful selection of the block size. The number of messages in reduction
to tridiagonal form is not dependent on the data layout block size, hence a cyclic data
layout (i.e. block size of 1) will be used, reducing load imbalance. The fact that only half
of the flops in reduction to tridiagonal form are BLAS3 flops and the large number of load-imbalanced
row operations combine to make the optimal algorithmic block size for reduction
to tridiagonal form small.
Load imbalance is minimized in back transformation by choosing a block size
which assigns a nearly equal number of eigenvectors to each column of processors (nb =
ceil(n/(k pc)) for some small integer k). A block cyclic data layout reduces execution time
in back transformation by reducing the number of messages sent, hence we must look for
purely from the point of view of improving the performance of the ScaLAPACK symmetric eigensolver, this effort probably would not be worthwhile.
other ways to reduce load imbalance. Fortunately, all eigenvectors must be updated at each
step, hence a good static load balance of eigenvectors across processor columns eliminates
most of the load imbalance in back transformation. The load imbalance within each column
of processors is less important because the number of processor rows will be small. The
computation of T can be performed simultaneously on all processor columns, eliminating
the load imbalance in that step.
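To make the block-size rule concrete, here is a small sketch (my own illustration; the values of n, pc and k are arbitrary) counting how many eigenvector columns each processor column owns under a block-cyclic layout:

```python
import math

def vectors_per_column(n, nb, pc):
    """Eigenvector columns owned by each of pc processor columns under
    a block-cyclic layout with block size nb."""
    counts = [0] * pc
    for j in range(n):
        counts[(j // nb) % pc] += 1
    return counts

n, pc, k = 1000, 8, 2
nb = math.ceil(n / (k * pc))            # the nb = ceil(n/(k pc)) rule above
balanced = vectors_per_column(n, nb, pc)
naive = vectors_per_column(n, 200, pc)  # an ill-chosen large block size
spread = max(balanced) - min(balanced)      # small: good static balance
bad_spread = max(naive) - min(naive)        # here 3 of 8 columns own nothing
```

With nb chosen by the rule every processor column owns nearly the same number of eigenvectors (the spread is at most one block), while the ill-chosen block size leaves some columns completely idle during back transformation.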
8.4.4 Reducing software overhead costs
There are many ways to reduce software overhead, but software overhead is poorly
understood and hence it is hard to predict which method will be best. Hendrickson, Jessup
and Smith[91] showed that using a cyclic data layout and a square processor grid reduces the
number of DTRMV calls from O(n^2/nb) to O(n) because each local matrix is triangular. Using
lightweight (no error checking, minimal overhead) BLAS would reduce software overhead, but
these are still in the planning stages. If the compiler produces efficient code for a simple
doubly nested loop, software overhead can be further reduced by using compiled code
instead of calls to the BLAS. Peter Strazdins has shown that software overhead within the
PBLAS can be reduced by up to 50%[161, 160]. Alternatively, eliminating the PBLAS entirely
would eliminate the overhead associated with the PBLAS. I would prefer to reduce the PBLAS
overhead and continue to use the PBLAS, but that is likely to be much harder than simply
abandoning the PBLAS.
When PDSYTRD, ScaLAPACK's reduction to tridiagonal form, was written, the PBLAS
did not support column-replicated/row-distributed matrices or algorithmic blocking. Hence,
many of the ideas mentioned here for improving the performance of PDSYTRD were not
available to a PBLAS-based code. PBLAS version 2 now offers these capabilities.
Software overhead cannot be measured separately from other costs and is hence
difficult to measure, understand and reason about. It varies widely from machine to machine
and can change just by changing the order in which subroutines are linked. We do not,
for example, know how much can be attributed to subroutine calls, how much is caused
by error checking, how much is caused by loop unrolling and how much is caused by code
cache misses.
A good compiler should be able to compute the local portion of Av faster than
two calls to DTRMV because a simple doubly nested loop could access each element in the
local portion of A only once, whereas two calls to DTRMV would require that each element
in A be read twice. The result is that the ratio of flops to main memory reads is 4-to-1
in the doubly nested loop versus 2-to-1 in DTRMV8. Furthermore, a compiled kernel would
avoid the BLAS overhead and might involve less loop unrolling, reducing overhead directly
and reducing code cache pressure as well. However, compiler technology is uneven, so we
would make using compiled code instead of the BLAS optional.
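The fused loop can be sketched as follows (a plain Python illustration of the access pattern, not the proposed compiled kernel itself); each stored off-diagonal element is loaded once and used in two multiply-adds, which is where the 4-to-1 flop-to-read ratio comes from:

```python
def sym_matvec_fused(A_lower, v):
    """y = A*v for symmetric A stored as its lower triangle (a list of
    rows, row i holding A[i][0..i]).  Each stored off-diagonal element
    is read once and used twice: once for row i and once, by symmetry,
    for row j.  Two triangular matvecs would read each element twice."""
    n = len(v)
    y = [0.0] * n
    for i in range(n):
        row = A_lower[i]
        for j in range(i):
            y[i] += row[j] * v[j]   # contribution of A[i][j]
            y[j] += row[j] * v[i]   # contribution of A[j][i] = A[i][j]
        y[i] += row[i] * v[i]       # diagonal term
    return y

# A = [[2, 1], [1, 3]] stored as its lower triangle; A * [1, 1] = [3, 4]
assert sym_matvec_fused([[2.0], [1.0, 3.0]], [1.0, 1.0]) == [3.0, 4.0]
```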
Unblocked reduction to tridiagonal form will likely be faster than blocked reduction
to tridiagonal form on problem sizes where software overhead is the dominant cost.
Unblocked reduction to tridiagonal form on a cyclic data layout eliminates load imbalance
and requires a minimum of communication and software overhead. The only disadvantage is
that all of the 4/3 n^3 flops are BLAS2 flops. However, with a good compiler, these BLAS2
flops can perform well on most computers. The kernel in an unblocked reduction to tridiagonal
form involves 8 flops per read-modify-write memory access9. Most computers have
adequate main memory bandwidth to handle this at full speed. However, not all compilers
are good enough yet.
8.5 Separating internal and external data layout without increasing memory usage
Separating internal and external data layout will require memory-intensive data redistribution,
but making the data redistribution codes more space-efficient will save enough
memory space to offset the memory needs of separating internal and external data layout.
Data redistributions between two data layouts with different values of pr, pc or nb use
messages of O(n^2/p^(3/2) + nb^2) data elements. However, degenerate data redistributions
between two data layouts with the same values of pr, pc or nb use messages of roughly
n^2/p elements. In order to avoid treating degenerate data redistributions separately, the
current redistribution codes require n^2/p buffer space for all redistributions. Splitting one
large message into several smaller ones is not conceptually difficult but will require that the
code be rewritten and that the testing be augmented to properly exercise the new
paths. However, the execution time will not be significantly affected. Both PDLARED2D, the
8 These ratios are 8-to-1 and 4-to-1 respectively for Hermitian matrices.
9 The ratio for reducing Hermitian matrices to tridiagonal form is 16 flops per read-modify-write operation.
eigenvector redistribution routine, and DGMR2D, the general purpose redistribution routine,
will have to be modified.
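The chunking itself is simple; here is a sketch (illustrative only, since the real change lives inside PDLARED2D's and DGMR2D's message loops) of splitting one large redistribution message into bounded pieces so neither side needs an n^2/p buffer:

```python
def send_in_chunks(buffer, max_chunk):
    """Split one large redistribution message into several bounded ones,
    so the receiver never stages more than max_chunk elements at a time.
    Yields (offset, chunk) pairs in order."""
    for off in range(0, len(buffer), max_chunk):
        yield off, buffer[off:off + max_chunk]

# Reassemble on the "receiving" side and check nothing was lost.
data = list(range(10_000))          # stands in for an n^2/p element message
received = [None] * len(data)
peak = 0
for off, chunk in send_in_chunks(data, 1024):
    peak = max(peak, len(chunk))    # largest single message actually sent
    received[off:off + len(chunk)] = chunk
assert received == data and peak <= 1024
```

The extra work is bookkeeping, not data movement, which is why the execution time is not significantly affected: the same elements cross the network either way.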
If the redistribution routines are not modified as described above, memory usage
would increase from 4n^2/p to 6n^2/p, and there would be a remote risk of the eigensolver
crashing. While both PDLARED2D and DGMR2D require n^2/p space and could use the same
space, they do not. PDLARED2D uses space passed to it in the WORK array, while DGMR2D calls
malloc to allocate space. The eigensolver could crash if a message of n^2/p elements were
sent and the communication system was unable to allocate a buffer of that size. Messages of
that size are not required during normal ScaLAPACK eigensolver tests, hence the eigensolver
could crash during regular use even after passing all tests and after months or even years
of flawless service. Modifying the redistribution routines as we propose eliminates this
potential problem.
Memory needs could be reduced from 4n^2/p to 3n^2/p by using the space allocated
to the input matrix, A, and the output matrix, Z, as internal workspace. This would
require a modification to the present calling sequence, probably in the form of a new data
descriptor. However, reducing memory usage by 25% may not justify a change to the calling
sequence.
Chapter 9
Advice to symmetric eigensolver users
Parallel dense symmetric eigensolvers should be used if none of the following
counter-indications hold. Use a serial eigensolver if the problem is small enough to fit1. Use
a sparse eigensolver if your input matrix is sparse2 and you don't need all the eigenvalues,
or if the matrix is dense and you only need a small fraction of the eigenvalues. Use a
Jacobi eigensolver if you need to compute small eigenvalues of a scaled diagonally dominant
matrix (or a matrix satisfying one of the other properties described by Demmel et al.[56])
accurately. Use a Jacobi eigensolver for small (n < 100 sqrt(p)) spectrally diagonally dominant
matrices3.
Currently the three most readily available parallel dense symmetric eigensolvers
are PeIGs and ScaLAPACK's PDSYEV and PDSYEVX. PeIGs and PDSYEV maintain orthogonality
among eigenvectors associated with clustered eigenvalues. PeIGs and PDSYEVX are faster
than PDSYEV. PDSYEVX scales better than either PeIGs or PDSYEV.
The choice between PeIGs and ScaLAPACK is probably more a matter of which
infrastructure4 is preferred and is outside the scope of this thesis. Furthermore, it is likely
that PeIGs will at some point use the ScaLAPACK symmetric eigensolver. Hence,
1 i.e. if memory allows.
2 The break-even point is not known, so I suggest that if your matrix is less than 10% non-zero and you need less than 10% of the eigenvalues, you should use a sparse eigensolver.
3 Spectrally diagonally dominant means that the eigenvector matrix, or a permutation thereof, is diagonally dominant.
4 PeIGs is built on top of Global Arrays[101] while ScaLAPACK is built on the BLACS or MPI.
the upgrade path for both may end up with the same underlying code. If you are not likely
to use more than 32 processors, PeIGs performance should be acceptable5. If your input
matrices do not include large clusters of eigenvalues or if you can accept non-orthogonal
eigenvectors, PDSYEVX is the right choice. Otherwise, i.e. if your input matrix has large
clusters of eigenvalues for which you need orthogonal eigenvectors, and you wish to use
more than 32 processors, PDSYEV is the right choice. Eventually, the improved version of
PDSYEVX described in Chapter 8 will be the method of choice in all cases.
5Since PeIGs uses a 1D data layout, its performance will degrade if you use more than 32 processors.
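The advice of this chapter can be condensed into a decision procedure. The following sketch encodes it with illustrative parameter names of my own; the thresholds are the rules of thumb given above, not hard cutoffs:

```python
def pick_eigensolver(n, p, fits_on_one_processor, sparse,
                     frac_eigenvalues_needed, accurate_small_on_sdd,
                     spectrally_diag_dominant, large_clusters,
                     need_orthogonal_eigenvectors):
    """Encode the eigensolver-selection advice as a decision procedure.
    accurate_small_on_sdd: need accurate small eigenvalues of a scaled
    diagonally dominant matrix (or one of the related matrix classes)."""
    if fits_on_one_processor:
        return "serial eigensolver"
    if (sparse and frac_eigenvalues_needed < 1.0) or \
            frac_eigenvalues_needed < 0.1:
        return "sparse eigensolver"
    if accurate_small_on_sdd:
        return "Jacobi eigensolver"
    if spectrally_diag_dominant and n < 100 * p ** 0.5:
        return "Jacobi eigensolver"   # small spectrally diag. dominant case
    if large_clusters and need_orthogonal_eigenvectors:
        # PeIGs' 1D layout limits it to roughly 32 processors.
        return "PDSYEV" if p > 32 else "PeIGs"
    return "PDSYEVX"
```

For example, a large dense problem with clustered eigenvalues, orthogonality requirements, and 64 processors lands on PDSYEV; drop the clustering and it lands on PDSYEVX.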
Part II
Second Part
Bibliography
[1] R.C. Agarwal, S.M. Balle, F.G. Gustavson, M. Joshi, and P. Palkar. A three-
dimensional approach to parallel matrix multiplication. IBM Journal of Research and
Development, 39(5), 1995. Also available as: http://www.almaden.ibm.com/journal/
rd/agarw/agarw.html.
[2] R.J. Allan and I.J. Bush. Parallel diagonalisation routines. Technical report, The
CCLRC HPCI Centre at Daresbury Laboratory, 1996. http://www.dl.ac.uk/TCSC/
Subjects/Parallel_Algorithms/diags/diags.doc.
[3] A. Anderson, D. Culler, D. Patterson, and the NOW Team. A case for networks
of workstations: NOW. IEEE Micro, Feb 1995. http://now.CS.Berkeley.EDU/
Papers2.
[4] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum,
S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users'
Guide (second edition). SIAM, Philadelphia, 1995. 324 pages.
[5] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. DuCroz, A. Greenbaum,
S. Hammarling, A. McKenney, and D. Sorensen. LAPACK: A portable linear algebra
library for high-performance computers. Computer Science Dept. Technical Report
CS-90-105, University of Tennessee, Knoxville, 1990. LAPACK Working Note #20
http://www.netlib.org/lapack/lawns/lawn20.ps.
[6] E. Anderson and J. Dongarra. Evaluating block algorithm variants in LAPACK. Com-
puter Science Dept. Technical Report CS-90-103, University of Tennessee, Knoxville,
1990. (LAPACK Working Note #19).
[7] ANSI/IEEE, New York. IEEE Standard for Binary Floating Point Arithmetic, Std
754-1985 edition, 1985.
[8] P. Arbenz, K. Gates, and Ch. Sprenger. A parallel implementation of the symmetric
tridiagonal QR algorithm. In Frontier's 92, McLean, Virginia, 1992.
[9] P. Arbenz and I. Slapničar. On an implementation of a one-sided block Jacobi method
on a distributed memory computer. Z. Angew. Math. Mech., (76, Suppl. 1):343-344,
1996. http://www.inf.ethz.ch/personal/arbenz/ICIAM_jacobi.ps.gz.
[10] Peter Arbenz and Michael Oettli. Block implementations of the symmetric QR and
Jacobi algorithms. Technical Report 178, Swiss Institute of Technology, 1995. ftp:
//ftp.inf.ethz.ch:/pub/publications/tech-reports/1xx/178.ps.
[11] K. Asanovic. IPM: Interval performance monitoring. http://www.icsi.berkeley.edu/~krste/ipm/IPM.html.
[12] C. Ashcraft. A taxonomy of distributed dense LU factorization methods. Technical
Report ECA-TR-161, Boeing Computer Services, March 1991.
[13] Z. Bai and J. Demmel. On a block implementation of Hessenberg multishift QR
iteration. International Journal of High Speed Computing, 1(1):97-112, 1989. (also
LAPACK Working Note #8 http://www.netlib.org/lapack/lawns/lawn8.ps).
[14] R. Barlow, D. Evans, and J. Shanehchi. Parallel multisection applied to the eigenvalue
problem. Comput. J., 6:6-9, 1983.
[15] R.H. Barlow and D.J. Evans. A parallel organization of the bisection algorithm. The
Computer Journal, 22(3), 1978.
[16] Mike Barnett, Lance Shuler, Robert van de Geijn, Satya Gupta, David Payne, and
Jerrell Watts. Interprocessor collective communication library (InterCom). In Pro-
ceedings of the Scalable High Performance Computing Conference, pages 357-364.
IEEE, 1994. ftp://ftp.cs.utexas.edu/pub/rvdg/shpcc.ps.
[17] A. Basermann and P. Weidner. A parallel algorithm for determining all eigenvalues of
large real symmetric tridiagonal matrices. Parallel Computing, 18:1129-1141, 1992.
[18] K. Bathe. Finite Element Procedures in Engineering Analysis. Prentice Hall, Inc.,
Englewood Cliffs, NJ, 1982.
[19] A. Beguelin, J. Dongarra, A. Geist, R. Manchek, and V. Sunderam. A users' guide
to PVM: Parallel virtual machine. Technical Report ORNL/TM-11826, Oak Ridge
National Laboratory, Oak Ridge, TN, July 1991.
[20] H. Bernstein and M. Goldstein. Parallel implementation of bisection for the calculation
of eigenvalues of tridiagonal symmetric matrices. Technical report, Courant
Institute, New York, NY, 1985.
[21] M. Berry and A. Sameh. Parallel algorithms for the singular value and dense symmetric
eigenvalue problems. J. Comput. and Appl. Math., 27:191-213, 1989.
[22] Allan J. Beveridge. A general atomic and molecular electronic structure system.
available as: http://gserv1.dl.ac.uk/CFS/gamess_4.html.
[23] J. Bilmes, K. Asanovic, J. Demmel, D. Lam, and C.-W. Chin. Optimizing matrix multiply
using PHiPAC: a portable, high-performance, ANSI C coding methodology. Computer
Science Dept. Technical Report CS-96-326, University of Tennessee, Knoxville,
May 1996. LAPACK Working Note #111 http://www.netlib.org/lapack/lawns/
lawn111.ps.
[24] C. Bischof and X. Sun. A framework for symmetric band reduction and tridiagonalization.
Technical report, Supercomputing Research Center, 1991. (Prism Working
Note #3 ftp://ftp.super.org/pub/prism/wn3.ps).
[25] C. Bischof, X. Sun, and B. Lang. Parallel tridiagonalization through two-step band
reduction. In Scalable High-Performance Computing Conference. IEEE Computer
Society Press, May 1994. (Also Prism Working Note #17 ftp://ftp.super.org/
pub/prism/wn17.ps).
[26] C. Bischof, X. Sun, A. Tsao, and T. Turnbull. A study of the invariant subspace
decomposition algorithm for banded symmetric matrices. In Proceedings of the Fifth
SIAM Conference on Applied Linear Algebra. IEEE Computer Society Press, June
1994. (Also Prism Working Note #16 ftp://ftp.super.org/pub/prism/wn16.ps).
[27] C. Bischof and C. Van Loan. The WY representation for products of Householder
matrices. SIAM J. Sci. Statist. Comput., 8:s2-s13, 1987.
[28] Christian Bischof, William George, Steven Huss-Lederman, Xiaobai Sun, Anna Tsao,
and Thomas Turnbull. Prism software, 1997. http://www.mcs.anl.gov/Projects/
PRISM/lib/software.html.
[29] Christian Bischof, William George, Steven Huss-Lederman, Xiaobai Sun, Anna Tsao,
and Thomas Turnbull. SYISDA User's Guide, version 2.0 edition, 1995. ftp://ftp.
super.org/pub/prism/UsersGuide.ps.
[30] R. H. Bisseling and J. G. G. van de Vorst. Parallel LU decomposition on a transputer
network. In G. A. van Zee and J. G. G. van de Vorst, editors, Lecture Notes in
Computer Science, Number 384, pages 61-77. Springer-Verlag, 1989.
[31] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra,
S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley.
ScaLAPACK Users' Guide. SIAM, Philadelphia, 1997. http://www.netlib.org/
scalapack/slug/scalapack_slug.html.
[32] Jerry Bolen, Arlin Davis, Bill Dazey, Satya Gupta, Greg Henry, David Robboy, Guy
Shiffer, David Scott, Mark Stallcup, Amir Taraghi, Stephen Wheat, LeeAnn Fisk,
Gabi Istrail, Chu Jong, Rolf Riesen, and Lance Shuler. Massively parallel distributed
computing: World's first 281 gigaflop supercomputer. In Intel Supercomputer User's
Group, 1995.
[33] R. P. Brent. Algorithms for minimization without derivatives. Prentice-Hall, 1973.
[34] K. Bromley and J. Speiser. Signal processing algorithms, architectures and applica-
tions. In Proceedings SPIE 27th Annual International Technical Symposium, 1983.
Tutorial 31.
[35] S. Carr and R. Lehoucq. Compiler blockability of dense matrix factorizations.
ACM TOMS, 1997. Also available as: ftp://info.mcs.anl.gov/pub/tech_reports/
lehoucq/block.ps.Z.
[36] S. Chakrabarti, J. Demmel, and K. Yelick. Modeling the benefits of mixed data and
task parallelism. In Symposium on Parallel Algorithms and Architectures (SPAA),
July 1995. http://HTTP.CS.Berkeley.EDU/~yelick/soumen/mixed-spaa95.ps.
[37] H. Chang, S. Utku, M. Salama, and D. Rapp. A parallel Householder tridiagonaliza-
tion strategem using scattered row decomposition. I. J. Num. Meth. Eng., 26:857{874,
1988.
[38] H.Y. Chang and M.Salama. A parallel Householder tridiagonalization stratagem using
scattered square decomposition. Parallel Computing, 6:297{312, 1988.
[39] S. Chinchalkar. Computing eigenvalues and eigenvectors of a dense real symmetric
matrix on the ncube 6400. Technical Report CTC91TR74, Advanced Computing
research Institute, June 1991.
[40] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley,
D. Walker, and R. C. Whaley. ScaLAPACK: A portable linear algebra library for
distributed memory computers - Design issues and performance. Computer Science
Dept. Technical Report CS-95-283, University of Tennessee, Knoxville, March 1995.
LAPACK Working Note #95 http://www.netlib.org/lapack/lawns/lawn95.ps.
[41] J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker, and R. C. Whaley. A
proposal for a set of parallel basic linear algebra subprograms. Computer Science Dept.
Technical Report CS-95-292, University of Tennessee, Knoxville, May 1995. LAPACK
Working Note #100 http://www.netlib.org/lapack/lawns/lawn100.ps.
[42] J. Choi, J. Dongarra, R. Pozo, and D. Walker. ScaLAPACK: A scalable linear algebra
library for distributed memory concurrent computers. In Proceedings of the Fourth
Symposium on the Frontiers of Massively Parallel Computation, pages 120{127. IEEE
Computer Society Press, 1992. LAPACK Working Note #55 http://www.netlib.
org/lapack/lawns/lawn55.ps.
[43] C-C Chou, Y. Deng, G. Li, and Y. Wang. Parallelizing Strassen's method for matrix
multiplication on distributed-memory MIMD architectures. International Journal of
Computers and Mathematics with Applications, 30:45-69, 1995.
[44] Almadena Chtchelkanova, John Gunnels, Greg Morrow, James Overfelt, and
Robert A. van de Geijn. Parallel implementation of BLAS: General techniques for Level
3 BLAS. Technical Report TR-95-40, Department of Computer Sciences, University of
Texas, October 1995. PLAPACK Working Note #4, to appear in Concurrency: Prac-
tice and Experience. http://www.cs.utexas.edu/users/plapack/plawns.html.
[45] M. Chu. A note on the homotopy method for linear algebraic eigenvalue problems.
Lin. Alg. Appl., 105:225-236, 1988.
[46] John M. Conroy and Louis J. Podrazik. A parallel inertia method for finding eigenvalues
on vector and SIMD architectures. SIAM Journal on Statistical Computing,
16:500-505, March 1995.
[47] F. J. Corbato. On the coding of Jacobi's method for computing eigenvalues and
eigenvectors of real symmetric matrices. Journal of the ACM, 10(2):123-125, 1963.
[48] S. Crivelli and E. R. Jessup. The cost of eigenvalue computation on distributed memory
MIMD multiprocessors. Parallel Computing, 21:401-422, 1995.
[49] J. Cullum and R. A. Willoughby. Lanczos algorithms for large symmetric eigenvalue
computations. Birkhäuser, Basel, 1985. Vol. 1, Theory; Vol. 2, Programs.
[50] J.J.M. Cuppen. A divide and conquer method for the symmetric tridiagonal eigen-
problem. Numer. Math., 36:177-195, 1981.
[51] M. Dayde, I. Duff, and A. Petitet. A Parallel Block Implementation of Level 3
BLAS for MIMD Vector Processors. ACM Transactions on Mathematical Software,
20(2):178-193, 1994.
[52] E. D'Azevedo. personal communication, 1997. http://www.epm.ornl.gov/
~efdazedo/.
[53] J. Demmel. CS 267 Course Notes: Applications of Parallel Processing. Computer
Science Division, University of California, 1991. 130 pages.
[54] J. Demmel, I. Dhillon, and H. Ren. On the correctness of some bisection-like parallel
eigenvalue algorithms in floating point arithmetic. Electronic Trans. Num. Anal.,
3:116-140, December 1995. LAPACK Working Note #70.
[55] J. Demmel, J. J. Dongarra, S. Hammarling, S. Ostrouchov, and K. Stanley. The
dangers of heterogeneous network computing: Heterogeneous networks considered harmful.
In Proceedings Heterogeneous Computing Workshop '96, pages 64-71. IEEE Computer
Society Press, 1996.
[56] J. Demmel, M. Gu, S. Eisenstat, I. Slapnicar, K. Veselic, and Z. Drmac. Computing
the singular value decomposition with high relative accuracy. Computer Science Dept.
Technical Report CS-97-348, University of Tennessee, Feb 1997. LAPACK Working
Note #119 http://www.netlib.org/lapack/lawns/lawn119.ps.
[57] J. Demmel and K. Stanley. The performance of finding eigenvalues and eigenvectors
of dense symmetric matrices on distributed memory computers. In Proceedings of the
Seventh SIAM Conference on Parallel Processing for Scientific Computing. SIAM,
1994.
[58] J. Demmel and K. Veselić. Jacobi's method is more accurate than QR. SIAM J. Mat.
Anal. Appl., 13(4):1204-1246, 1992. (also LAPACK Working Note #15).
[59] Inderjit Dhillon. A New O(n^2) Algorithm for the Symmetric Tridiagonal Eigenvalue/Eigenvector
Problem. PhD thesis, University of California at Berkeley, 1997.
[60] Stéphane Domas and Françoise Tisseur. Parallel implementation of a symmetric eigensolver
based on the Yau and Lu method. In International Journal of Supercomputer
Applications (proceedings of Environments and Tools For Parallel Scientific Computing
III, Faverges de la Tour, France, 21-23 August), 1996.
[61] J. Dongarra, J. Bunch, C. Moler, and G. W. Stewart. LINPACK User's Guide. SIAM,
Philadelphia, PA, 1979.
[62] J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling. A set of Level 3 Basic Linear
Algebra Subprograms. ACM Trans. Math. Soft., 16(1):1-17, March 1990.
[63] J. Dongarra, J. Du Croz, S. Hammarling, and Richard J. Hanson. An Extended Set of
FORTRAN Basic Linear Algebra Subroutines. ACM Trans. Math. Soft., 14(1):1-17,
March 1988.
[64] J. Dongarra, S. Hammarling, and D. Sorensen. Block reduction of matrices to con-
densed forms for eigenvalue computations. J. Comput. Appl. Math., 27:215-227, 1989.
LAPACK Working Note #2 http://www.netlib.org/lapack/lawns/lawn2.ps.
[65] J. Dongarra, R. Hempel, A. Hay, and D. Walker. A proposal for a user-level message
passing interface in a distributed memory environment. Technical Report ORNL/TM-
12231, Oak Ridge National Laboratory, Oak Ridge, TN, February 1993.
[66] J. Dongarra and D. Sorensen. A fully parallel algorithm for the symmetric eigenprob-
lem. SIAM J. Sci. Stat. Comput., 8(2):139-154, March 1987.
[67] J. Dongarra and R. van de Geijn. Reduction to condensed form for the eigenvalue
problem on distributed memory computers. Computer Science Dept. Technical Report
CS-91-130, University of Tennessee, Knoxville, 1991. LAPACK Working Note #30
http://www.netlib.org/lapack/lawns/lawn30.ps. Also in Parallel Computing.
[68] J. Dongarra, R. van de Geijn, and D. Walker. A look at scalable dense linear alge-
bra libraries. In Scalable High-Performance Computing Conference. IEEE Computer
Society Press, April 1992.
[69] J. Dongarra and R. C. Whaley. A user's guide to the BLACS v1.1. Technical report,
University of Tennessee, Knoxville, March 1995. LAPACK Working Note #94 http:
//www.netlib.org/lapack/lawns/lawn94.ps.
[70] C. C. Douglas, M. Heroux, G. Slishman, and R. M. Smith. GEMMW: A portable Level
3 BLAS Winograd variant of Strassen's matrix-matrix multiply algorithm. Journal of
Computational Physics, 110:1-10, 1994.
[71] Zlatko Drmač and Krešimir Veselić. Iterative refinement of the symmetric
eigensolution. Technical report, University of Colorado at Boulder, 1997.
[72] P.J. Eberlein and M. Mantharam. Jacobi sets for the eigenproblem and their effect
on convergence studied by graphic representations. Technical report, SUNY Buffalo,
1990.
[73] P.J. Eberlein and M. Mantharam. New Jacobi sets for parallel computations. Parallel
Computing, 19:437-454, 1993.
[74] G. Fann and R. Littlefield. Performance of a fully parallel dense real symmetric
eigensolver in quantum chemistry applications. In Proceedings of the Sixth SIAM
Conference on Parallel Processing for Scientific Computation. SIAM, 1994.
[75] G. Fann and R. J. Littlefield. A parallel algorithm for Householder tridiagonalization.
In Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific
Computing, pages 409-413. SIAM, 1993.
[76] R. Fellers. Performance of pdsyev, ... Mathematics Dept. Master's thesis, available
by anonymous ftp to http://cs-tr.CS.Berkeley.EDU/NCSTRL/, University of
California, 1997.
[77] V. Fernando, B. Parlett, and I. Dhillon. A way to find the most redundant equation
in a tridiagonal system. Berkeley Mathematics Dept. Preprint, 1995.
[78] UTK Joint Institute for Computational Science, 1997. http://www-jics.cs.utk.
edu/SP2/sp2_config.html.
[79] J.G.F. Francis. The QR transformation: A unitary analogue to the LR transformation,
parts I and II. The Computer Journal, 4:265-272, 332-345, 1961.
[80] K. Gates. A rank-two divide and conquer method for the symmetric tridiagonal
eigenproblem. In Frontier's 92, McLean, Virginia, 1992.
[81] Kevin Gates. Using inverse iteration to improve the divide and conquer algorithm.
Technical Report 159, Swiss Institute of Technology, 1991.
[82] Kevin Gates and Peter Arbenz. Parallel divide and conquer algorithms for the sym-
metric tridiagonal eigenproblem. Technical Report 222, Swiss Institute of Technology,
1995. ftp://ftp.inf.ethz.ch:/pub/publications/tech-reports/2xx/222.ps.
[83] W. Givens. Numerical computation of the characteristic values of a real matrix.
Technical Report 1574, Oak Ridge National Laboratory, 1954.
[84] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins
University Press, Baltimore, MD, 1983.
[85] J. Götze. On the parallel implementation of Jacobi and Kogbetliantz algorithms.
SIAM J. on Sci. Comput., pages 1331-1348, 1994. http://www.nws.e-technik.
tu-muenchen.de/~jugo/pub/SIAMjac.ps.Z.
[86] A. Greenbaum and J. Dongarra. Experiments with QL/QR methods for the sym-
metric tridiagonal eigenproblem. Computer Science Dept. Technical Report CS-
89-92, University of Tennessee, Knoxville, 1989. LAPACK Working Note #17
http://www.netlib.org/lapack/lawns/lawn17.ps.
[87] Numerical Algorithms Group, 1997. http://www.nag.co.uk/numeric.html.
[88] M. Gu and S. Eisenstat. A stable algorithm for the rank-1 modification of the symmetric
eigenproblem. Computer Science Dept. Report YALEU/DCS/RR-916, Yale
University, September 1992.
[89] M. Gu and S. C. Eisenstat. A divide-and-conquer algorithm for the symmetric tridi-
agonal eigenproblem. SIAM J. Mat. Anal. Appl., 16(1):172{191, January 1995.
[90] M. Hegland, M. H. Kahn, and Osborne M. R. A parallel algorithm for the reduction to
tridiagonal form for eigendecomposition. Technical Report TR-CS-96-06, Australian
National University, 1996. http://cs.anu.edu.au/techreports/1996/index.html.
[91] B. Hendrickson, E. Jessup, and C. Smith. A parallel eigensolver for dense symmetric
matrices. Technical Report SAND96{0822, Sandia National Labs, Albuquerque, NM,
March 1996. Submitted to SIAM J. Sci. Comput.
[92] G. Henry. personal communication, 1997. http://www.cs.utk.edu/~ghenry/.
[93] Greg Henry. Improving Data Re-Use in Eigenvalue-Related Computations. PhD thesis,
Cornell University, 1994.
[94] High Performance Fortran Forum. High Performance Fortran language speci�cation
version 1.0. Draft, January 1993. Also available as technical report CRPC-TR 92225,
Center for Research on Parallel Computation, Rice University.
[95] Y. Huo and R. Schreiber. E�cient, massively parallel eigenvalue computations.
preprint, 1993.
161
[96] S. Huss-Lederman, Jacobson E.M., J. R. Johnson, Tsao A., and T. Turnbull.
\strassen's algorithm for matrix multiplication: Modeling, analysis, and implementa-
tion". Technical report, Center for Computing Sciences, 1996. (Also Prism Working
Note #34 ftp://ftp.super.org/pub/prism/wn34.ps).
[97] S. Huss-Lederman, A. Tsao, and G. Zhang. A parallel implementation of the invariant
subspace decomposition algorithm for dense symmetric matrices. In Proceedings of the
Sixth SIAM Conference on Parallel Processing for Scientific Computing, March 1993.
(Also PRISM Working Note #9, ftp://ftp.super.org/pub/prism/wn9.ps).
[98] IBM, Kingston, NY. Engineering and Scientific Subroutine Library - Guide and
Reference, release 3 edition, 1988. Order No. SC23-0184.
[99] I. Ipsen and E. Jessup. Solving the symmetric tridiagonal eigenvalue problem on the
hypercube. SIAM J. Sci. Stat. Comput., 11(2):203-230, 1990.
[100] C.G.J. Jacobi. Über ein leichtes Verfahren die in der Theorie der Säcularstörungen
vorkommenden Gleichungen numerisch aufzulösen. Crelle's Journal, 30:51-94, 1846.
[101] Jarek Nieplocha, Pacific Northwest Laboratories, 1996.
http://www.emsl.pnl.gov:2080/docs/global/ga.html.
[102] E. Jessup and I. Ipsen. Improving the accuracy of inverse iteration. SIAM J. Sci.
Stat. Comput., 13(2):550-572, 1992.
[103] B. Kågström, P. Ling, and C. Van Loan. GEMM-Based Level 3 BLAS: High-
Performance Model Implementations and Performance Evaluation Benchmark. Re-
port UMINF-95.18, Department of Computing Science, Umeå University, S-901 87
Umeå, Sweden, 1995. To appear in ACM Trans. Math. Software. LAPACK Working
Note #107, http://www.netlib.org/lapack/lawns/lawn107.ps.
[104] B. Kågström, P. Ling, and C. Van Loan. GEMM-Based Level 3 BLAS: Portability
and Optimization Issues. Technical report, Department of Computing Science, Umeå
University, 1997. To appear in ACM Trans. Math. Software.
[105] W. Kahan. Accurate eigenvalues of a symmetric tridiagonal matrix. Computer Science
Dept. Technical Report CS41, Stanford University, Stanford, CA, July 1966 (revised
June 1968).
[106] R.K. Kamilla, X.G. Wu, and J.K. Jain. Composite fermion theory of collective exci-
tations in fractional quantum Hall effect. Physical Review Letters, 1996.
[107] R.M. Karp, A. Sahay, E. Santos, and K.E. Schauser. Optimal broadcast and summa-
tion in the LogP model. In Proc. 5th ACM Symposium on Parallel Algorithms and
Architectures, pages 142-153, 1993.
[108] L. Kaufman. Banded eigenvalue solvers on vector machines. ACM Trans. Math. Soft.,
10:73-86, 1984.
[109] L. Kaufman. A parallel QR algorithm for the symmetric tridiagonal eigenvalue problem.
Journal of Parallel and Distributed Computing, 23:429-434, 1994.
[110] C. Koelbel, D. Loveman, R. Schreiber, G. Steele, and M. Zosel. The High Performance
Fortran Handbook. MIT Press, Cambridge, 1994.
[111] A. S. Krishnakumar and M. Morf. Eigenvalues of a symmetric tridiagonal matrix: A
divide and conquer approach. Numer. Math., 48:349-368, 1986.
[112] Krystian Pracz, Martin Janssen, and Peter Freche. Correlation of eigenstates in the
critical regime of quantum Hall systems. J. Phys. Condens. Matter, 8:7147-7159, 1996.
Also available as http://xxx.lanl.gov/abs/cond-mat/9605012.
[113] D. Kuck and A. Sameh. A parallel QR algorithm for symmetric tridiagonal matrices.
IEEE Trans. Computers, C-26(2), 1977.
[114] J.R. Kuttler and V.G. Sigillito. Eigenvalues of the Laplacian in two dimensions. SIAM
Review, 26:163-193, 1984.
[115] M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations
of blocked algorithms. In Proceedings of the Fourth International Conference on
Architectural Support for Programming Languages and Operating Systems, pages
63-74, April 1991.
[116] B. Lang. A parallel algorithm for reducing symmetric banded matrices to tridiagonal
form. SIAM J. Sci. Comput., 14(6), November 1993.
[117] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh. Basic Linear Algebra Subprograms
for Fortran usage. ACM Trans. Math. Soft., 5:308-323, 1979.
[118] Thomas J. LeBlanc and Evangelos P. Markatos. Shared memory vs. message pass-
ing in shared-memory multiprocessors. In 4th Symp. on Parallel and Distributed
Processing, 1992. ftp://ftp.cs.rochester.edu/pub/papers/systems/92.ICPP.
locality_vs_load_balancing.ps.Z.
[119] R. B. Lehoucq. ARPACK software. http://www.mcs.anl.gov/home/lehoucq/
software.html.
[120] K. Li and T.-Y. Li. An algorithm for symmetric tridiagonal eigenproblems - divide
and conquer with homotopy continuation. SIAM J. Sci. Comp., 14(3), May 1993.
[121] Rencang Li and Huan Ren. An efficient tridiagonal eigenvalue solver on CM 5 with
Laguerre's iteration. Computer Science Division Report CSD-94-848, University of
California, 1994. http://sunsite.berkeley.edu/Dienst/UI/2.0/Describe/ncstrl.ucb%
2fCSD-94-848.
[122] T.-Y. Li and Z. Zeng. Laguerre's iteration in solving the symmetric tridiagonal eigen-
problem - a revisit. Michigan State University preprint, 1992.
[123] T.-Y. Li, H. Zhang, and X. H. Sun. Parallel homotopy algorithm for symmetric
tridiagonal eigenvalue problems. SIAM J. Sci. Stat. Comput., 12:464-485, 1991.
[124] W. Lichtenstein and S. L. Johnsson. Block cyclic dense linear algebra. SIAM J. Sci.
Comp., 14(6), November 1993.
[125] R.J. Littlefield and K. J. Maschhoff. Investigating the performance of parallel eigen-
solvers for large processor counts. Theoretica Chimica Acta, 84:457-473, 1993.
[126] S.-S. Lo, B. Phillipe, and A. Sameh. A multiprocessor algorithm for the symmetric
eigenproblem. SIAM J. Sci. Stat. Comput., 8(2):155-165, March 1987.
[127] Mi Lu and Xiangzhen Qiao. Applying parallel computer systems to solve symmetric
tridiagonal eigenvalue problems. Parallel Computing, 18:1301-1315, 1992.
[128] S. C. Ma, M. Patrick, and D. Szyld. A parallel, hybrid algorithm for the generalized
eigenproblem. In Garry Rodrigue, editor, Parallel Processing for Scientific Comput-
ing, chapter 16, pages 82-86. SIAM, 1989.
[129] R. S. Martin, C. Reinsch, and J. H. Wilkinson. Householder's tridiagonalization of a
symmetric matrix. Numerische Mathematik, 11:181-195, 1968.
[130] K. Maschhoff. PARPACK software. http://www.caam.rice.edu/~kristyn/parpack_
home.html.
[131] R. Mathias. The instability of parallel prefix matrix multiplication. SIAM J. Sci.
Stat. Comput., 16(4):956-973, July 1995.
[132] Gary Oas. Universal cubic eigenvalue repulsion for random normal matrices. Physical
Review E, 1996. Also available as http://xxx.lanl.gov/abs/cond-mat/9610073.
[133] David C. O'Neal and Raghurama Reddy. Solving symmetric eigenvalue problems on
distributed memory machines. In Proceedings of the Cray User's Group, pages 76-96.
Cray Inc., 1994.
[134] B. Parlett. The Symmetric Eigenvalue Problem. Prentice Hall, Englewood Cliffs, NJ,
1980.
[135] B. Parlett. Acta Numerica, chapter The new qd algorithms, pages 459-491. Cambridge
University Press, 1995.
[136] B. Parlett. The construction of orthogonal eigenvectors for tight clusters by use
of submatrices. Center for Pure and Applied Mathematics Report PAM-664, University
of California, Berkeley, CA, January 1996. Submitted to SIMAX.
[137] B. Parlett. Personal communication, 1997.
[138] B. N. Parlett. Laguerre's method applied to the matrix eigenvalue problem. Mathe-
matics of Computation, 18:464-485, 1964.
[139] B.N. Parlett and I.S. Dhillon. On Fernando's method to find the most redundant
equation in a tridiagonal system. Linear Algebra and its Applications, 267:247-279,
November 1997.
[140] Antoine Petitet. Algorithmic Redistribution Methods for Block Cyclic Decompositions.
PhD thesis, University of Tennessee, 1996.
[141] C. P. Potter. A parallel divide and conquer eigensolver. http://sawww.epfl.ch/
SIC/SA/publications/SCR95/7-95-27a.html.
[142] M. Pourzandi and B. Tourancheau. A parallel performance study of Jacobi-like eigen-
value solution. http://www.netlib.org/tennessee/ut-cs-94-226.ps.
[143] Earl Prohofsky. Statistical Mechanics and Stability of Macromolecules. Cambridge
University Press, 1995.
[144] B. Putnam, E. W. Prohofsky, K. C. Lu, and L. L. Van Zandt. Breathing modes and
induced resonant melting of the double helix. Physics Letters, 70A, 1979.
[145] C. Reinsch. A stable rational QR algorithm for the computation of the eigenvalues of
an Hermitian, tridiagonal matrix. Num. Math., 25:591-597, 1971.
[146] H. Ren. On error analysis and implementation of some eigenvalue and singular value
algorithms. PhD thesis, University of California at Berkeley, 1996.
[147] J. Rutter. A serial implementation of Cuppen's divide and conquer algorithm
for the symmetric eigenvalue problem. Mathematics Dept. Master's Thesis,
University of California, 1991. http://sunsite.berkeley.edu/Dienst/UI/2.0/Describe/
ncstrl.ucb%2fCSD-94-799.
[148] R. Saavedra, W. Mao, D. Park, J. Chame, and S. Moon. The combined effectiveness
of unimodular transformations, tiling, and software prefetching. In Proceedings of
the 10th International Parallel Processing Symposium. IEEE Computer Society, April
15-19, 1996.
[149] V. Sarkar. Automatic selection of high order transformations in the IBM ASTI
Optimizer. IBM Software Solutions Division Report, 1996.
[150] R. Schreiber. Solving eigenvalue and singular value problems on an undersized systolic
array. SIAM J. Sci. Stat. Comput., 7:441-451, 1986.
[151] D. Scott, M. Heath, and R. Ward. Parallel block Jacobi eigenvalue algorithms using
systolic arrays. Lin. Alg. & Appl., 77:345-355, 1986.
[152] G. Seifert, Th. Heine, O. Knospe, and R. Schmidt. Computer simulations for the
structure and dynamics of large molecules, clusters and solids. In Lecture Notes in
Computer Science, volume 1067, page 393. Springer-Verlag, 1996.
[153] B. T. Smith, J. M. Boyle, J. J. Dongarra, B. S. Garbow, Y. Ikebe, V. C. Klema, and
C. B. Moler. Matrix Eigensystem Routines - EISPACK Guide, volume 6 of Lecture
Notes in Computer Science. Springer-Verlag, Berlin, 1976.
[154] C. Smith, B. Hendrickson, and E. Jessup. A parallel algorithm for Householder tridiag-
onalization. In Proceedings of the Fifth SIAM Conference on Applied Linear Algebra,
pages 361-365. SIAM, 1994.
[155] D. Sorensen and P. Tang. On the orthogonality of eigenvectors computed by divide-
and-conquer techniques. SIAM J. Num. Anal., 28(6):1752-1775, 1991.
[156] J. Speiser and H. Whitehouse. Parallel processing algorithms and architectures for
real time processing. In Proceedings SPIE Real Time Signal Processing IV, 1981.
[157] V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354-
355, 1969.
[158] P. Strazdins. Matrix factorization using distributed panels on the Fujitsu AP1000.
In IEEE First International Conference on Algorithms And Architectures for Par-
allel Processing, Brisbane, April 1995. http://cs.anu.edu.au/people/Peter.
Strazdins/papers.html#DBLAS.
[159] P. Strazdins. A high performance, portable distributed BLAS implementation. In
Fifth Parallel Computing Workshop for the Fujitsu PCRF, Kawasaki, November 1996.
http://cs.anu.edu.au/people/Peter.Strazdins/papers.html#DBLAS.
[160] P. Strazdins. Personal communication, 1997. http://cs.anu.edu.au/people/
Peter.Strazdins.
[161] P. Strazdins. Reducing software overheads in parallel linear algebra libraries. Technical
report, Australian National University, 1997. Submitted to PART'97, The 4th Annual
Australasian Conference on Parallel And Real-Time Systems, 29-30 September 1997,
The University of Newcastle, Newcastle, Australia.
[162] P. Swarztrauber. A parallel algorithm for computing the eigenvalues of a symmetric
tridiagonal matrix. To appear in Math. Comp., 1993.
[163] D. Szyld. Criteria for combining inverse iteration and Rayleigh quotient iteration.
SIAM J. Num. Anal., 25(6):1369-1375, December 1988.
[164] Thinking Machines Corporation. CMSSL for CM Fortran: CM-5 Edition, version
3.1, 1993.
[165] S. Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM
Journal on Matrix Analysis and Applications, 18(4), 1997. http://theory.lcs.mit.
edu/~sivan/029774.ps.gz.
[166] Alessandro De Vita, Giulia Galli, Andrew Canning, and Roberto Car. A microscopic
model for surface-induced graphite-to-diamond transitions. Nature, 379, Feb 8, 1996.
[167] D. Watkins. Fundamentals of Matrix Computations. Wiley, 1991.
[168] R. Whaley. Automatically tunable linear algebra subroutines, 1997. http://www.
netlib.org/utk/projects/atlas.
[169] R. Clint Whaley. Basic linear algebra communication subroutines: Analysis and
implementation across multiple parallel architectures. Technical report, University of
Tennessee, Knoxville, June 1994. LAPACK Working Note #73, http://www.netlib.
org/lapack/lawns/lawn73.ps.
[170] J. H. Wilkinson. The Algebraic Eigenvalue Problem. Oxford University Press, Oxford,
1965.
[171] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. K.
Tjiang, Shih-Wei Liao, C. Tseng, Mary W. Hall, M. S. Lam, and J. L. Hennessy. SUIF:
An Infrastructure for Research on Parallelizing and Optimizing Compilers. HTML
from http://suif.stanford.edu/suif/suif-overview/suif.html.
[172] M. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.
[173] Y.-J. J. Wu, A. A. Alpatov, C. Bischof, and R. A. van de Geijn. A parallel implemen-
tation of symmetric band reduction using PLAPACK. In Scalable Parallel Library Con-
ference, 1996. (Also PRISM Working Note #35, ftp://ftp.super.org/pub/prism/
wn35.ps).
[174] Shing-Tung Yau and Ya Yan Lu. Reducing the symmetric matrix eigenvalue problem
to matrix multiplications. SIAM J. Sci. Comput., 14(1):121-136, January 1993.
[175] Yi-Shuen Mark Wu, Steven A. Cuccaro, Paul G. Hipes, and Aron Kuppermann.
Quantum-mechanical reactive scattering using a high-performance distributed-
memory parallel computer. Chem. Phys. Lett., 168:429-440, 1990.
Appendix A
Variables and abbreviations
Table A.1: Variable names and their uses

Name        Meaning
(a, b)      The processor in processor row a and processor column b.
A           The input matrix (partially reduced).
A(i, j)     The i, j element in the (partially reduced) matrix A.
c           The number of eigenvalues in the largest cluster of eigenvalues.
C           The set of all processor columns.
ca          The current processor column within the sub-grid.
cb          The current processor column sub-grid.
e           The number of eigenvalues required.
j           The current column, A(j:n, j:n) being the un-reduced portion of the matrix.
j'          The column within the current block column, j' = mod(j, nb).
lg(√p)      log2(√p).
m           The number of eigenvectors required.
mb          The row block size. Used only when we discuss rectangular blocks. In general, the row block size and column block size are assumed to be equal and are written as nb.
mullen      A compile-time parameter in the PBLAS which controls the panel size used in the PBLAS symmetric matrix-vector multiply routine, PDSYMV.
n           The size of the input matrix A.
nb          The blocking factor. In PDSYEVX the data layout and algorithmic blocking factors are the same. In HJS the data layout blocking factor is 1 and nb refers to the algorithmic blocking factor.
p           The number of processors used in the computation.
pbf         Panel blocking factor. The panel width used in DGEMV in PDSYEVX and DGEMM in PDSYEVX and HJS is pbf × nb.
pr          The number of processor rows in the process grid.
pr1         The number of processor rows in a sub-grid.
pr2         The number of processor sub-grid rows.
pc          The number of processor columns in the process grid.
pc1         The number of processor columns in a sub-grid.
pc2         The number of processor sub-grid columns.
R           The set of all processor rows.
ra          The current processor row within the sub-grid.
rb          The current processor row sub-grid.
spread      In a "spread across", every processor in the current processor column broadcasts to every other processor in the same processor row. In a "spread down", every processor in the current processor row broadcasts to every other processor in the same processor column.
tril(A, 0)  The lower triangular part, including the diagonal, of the un-reduced part of the input matrix A, i.e. A(j:n, j:n).
tril(A, -1) The lower triangular part, excluding the diagonal, of the un-reduced part of the input matrix A, i.e. A(j:n, j:n).
Table A.2: Variable names and their uses (continued)

Name                   Meaning
v                      The vector portion of the Householder reflector.
V                      The current column of Householder reflectors. Size: n - j + j' by j'.
V(j - j' : n, 1 : j')  The current column of Householder reflectors. Size: n - j + j' by j'.
vnb                    The imbalance in the 2D block-cyclic distribution of the eigenvector matrix.
w                      The companion update vector, i.e. the vector used in A = A - v w^T - w v^T to reduce A.
W                      The current column of companion update vectors. Size: n - j + j' by j'.
W(j - j' : n, 1 : j')  The current column of companion update vectors. Size: n - j + j' by j'.
Abbreviation  Meaning
CPU           Central Processing Unit
FPU           Floating Point Unit

Table A.3: Abbreviations
Symbol  Meaning                                                                  Terms included
α       The message initiation cost for BLACS send and receive.                  n lg(p); n
β       The inverse bandwidth cost for BLACS send and receive.                   n² lg(p)/√p; n²/√p; n nb lg(p)
α3      DGEMM (matrix-matrix multiply) subroutine overhead plus the time
        penalty associated with invoking DGEMM on small matrices.                n²/(nb² pbf); n/nb
γ3      Time required per DGEMM (matrix-matrix multiply) op.                     n³/p; n² nb/√p
α2      DGEMV (matrix-vector multiply) subroutine overhead plus the time
        penalty associated with invoking DGEMV on small matrices.                n
γ2      Time required per DGEMV (matrix-vector multiply) op.                     n³/p; n² nb/√p
δ       Time required per divide.                                                n²/p; n
√       Time required per square root.
γ1      Time required per BLAS1 (scalar-vector) op.                              n²/p; n
α1      Subroutine overhead for BLAS1 and similar codes.                         n²/√p
α4      Subroutine overhead for the PBLAS.                                       n

Table A.4: Model costs
Appendix B

Further details

B.1 Updating w during reduction to tridiagonal form
Line 4.1, w = w - W V^T v - V W^T v, in Figure 8.3 can be computed with minimal
communication, with minimal computation, or with an intermediate amount of both commu-
nication and computation. Indeed, Line 4.1 can be computed with O((n²/p + n² nb/p^r) γ2 +
n log(p^(r-0.5)) α) cost for various r ∈ [0.5, 1.0]. r = 1.0 corresponds to the minimal computa-
tion cost option (discussed in Section B.1.3) while r = 0.5 corresponds to the minimal (zero)
communication cost option (discussed in Section B.1.2). Section B.1.4 describes the inter-
mediate options in a generalized form which includes both the minimum communication
and minimum computation options as special cases.
The plethora of options for the update of w stems from the fact that the input ma-
trices W, V, W^T and V^T are replicated across the relevant processors while the input/output
vector v is stored as partial sums across the processor columns in each of the processor rows.
The input matrices are replicated because they will need to be replicated later to update
A. The vector v is stored as partial sums because that is how it is initially computed,
and because the combine operation used to compute v from the partial sums has not been
performed at this point.
Throughout this section we discuss only the computation of W V^T v; V W^T v can be
computed in a similar manner. Moreover, the two computations, and all associated communi-
cation, can be merged to reduce software overhead and message latency costs.
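The tradeoff can be made concrete by evaluating the modeled cost at the two extreme values of r. The following sketch (in Python rather than the thesis's Matlab, with illustrative machine parameters that are assumptions, not measurements) evaluates O((n²/p + n² nb/p^r) γ2 + n log(p^(r-0.5)) α) directly:

```python
import math

def update_cost(n, p, nb, r, gamma2, alpha):
    """Modeled cost of the Line 4.1 update for replication exponent r.

    r = 0.5 is the zero-added-communication option (the log term
    vanishes); r = 1.0 is the minimal-computation option."""
    comp = (n**2 / p + n**2 * nb / p**r) * gamma2
    comm = n * math.log2(p ** (r - 0.5)) * alpha   # zero when r = 0.5
    return comp + comm

# Illustrative parameters (assumptions, not measurements).
n, p, nb = 4000, 64, 32
gamma2, alpha = 0.02e-6, 66e-6

cost_r05 = update_cost(n, p, nb, 0.5, gamma2, alpha)  # min communication
cost_r10 = update_cost(n, p, nb, 1.0, gamma2, alpha)  # min computation
```

For these particular parameters the flop term dominates, so r = 1.0 is cheaper; with a much larger α the ranking reverses, which is the motivation for the intermediate options of Section B.1.4.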
B.1.1 Notation
In describing most parallel linear algebra codes, including all codes in this thesis
outside of this appendix, we need not explicitly state the processor on which a value is
stored. Ai;j is understood to live on the processor that owns row i and column j. The
nb0 element array tmp contains di�erent values on di�erent processors. Therefore, for the
discussion in this appendix, an additional subscript is added to tmp to indicate the processor
column. Furthermore, some entries in tmp are left unde�ned at various stages, therefore we
use j 2 fcag to indicate all columns j owned by processor column ca. i.e. tmpj2fcag;ca = val
means that 8j 2 fcag, tmpj on processor ca is assigned val . For extra clarity within a
display we write this as tmpj;caj2fcag
.
B.1.2 Updating w without added communication

Line 4.1, w = w - W V^T v - V W^T v, in Figure 8.3 can be computed without
any communication other than that needed to compute v without the update. It initially
appears that w = w - W · V^T v - V · W^T v requires communication, because computing
tmp = V^T v requires summing nb' values¹ within each processor column, and computing
w = w - W · tmp requires that tmp be broadcast within each processor column. However,
W · V^T v can be computed with a single sum within each processor row, and by delaying
the sum needed to compute w, one of them can be avoided completely. Figure B.1 derives
how W · V^T v can be computed with a single sum within each processor row.

Line 3  The transformation from Line 2 to Line 3 is the standard way that a matrix-vector
multiply is performed in parallel. The leftmost sum is the local portion; the middle
sum is the sum over all processors in the processor column.

Line 4  Delay the sum over all processors in the processor column until after multiplying
by W. The rightmost two sums involve only local values.

Figure B.2 shows how to compute W · V^T v without added communication.

Line 5  Local computation of V^T · v. Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  (2i/pr) nb' γ2  =  (1/2) (n² nb / pr) γ2

¹nb' is the number of columns in H.
Figure B.1: Avoiding communication in computing W · V^T v

    tmp = W · V^T v                                                 (Line 1)

    tmp_i = Σ_{1≤j≤nb'} W_{ij} Σ_{k∈{C}} V_{kj} v_k                 (Line 2)

    tmp_i = Σ_{1≤j≤nb'} W_{ij} Σ_{1≤R≤pr, k∈{C}} H_{kj} h_k         (Line 3)

    tmp_i = Σ_{C∈pc, 1≤j≤nb'} W_{ij} Σ_{k∈{C}} H_{kj} h_k           (Line 4)
Line 6  Local computation of W · tmp. Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  (2i/pr) nb' γ2  =  (1/2) (n² nb / pr) γ2

Line 7  Effect of summing res_i within each processor row. This operation is merged with
the unavoidable summation of w within each processor row, hence this operation is
not performed and has no cost.
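The algebra in Figures B.1 and B.2 rests on reordering two sums. The following pure-Python sketch (an illustration, not the ScaLAPACK code; the sizes and random data are arbitrary) mimics pc processor columns, each holding a slice of the rows of V and v, and checks that multiplying each local V^T v piece by W and summing afterwards matches summing first:

```python
import random

random.seed(0)
n, nbp, pc = 8, 3, 4      # rows, nb' columns of V and W, processor columns

W = [[random.random() for _ in range(nbp)] for _ in range(n)]
V = [[random.random() for _ in range(nbp)] for _ in range(n)]
v = [random.random() for _ in range(n)]

# Processor column C owns a disjoint (cyclic) set of the rows of V and v.
owned = [range(C, n, pc) for C in range(pc)]

def local_vtv(C):
    """tmp_C = V^T * v restricted to the rows owned by column C."""
    return [sum(V[k][j] * v[k] for k in owned[C]) for j in range(nbp)]

# Sum the partial tmp vectors first (needs a combine plus a broadcast),
# then multiply by W:
tmp = [sum(local_vtv(C)[j] for C in range(pc)) for j in range(nbp)]
res_sum_first = [sum(W[i][j] * tmp[j] for j in range(nbp)) for i in range(n)]

# Multiply each local piece by W and delay the sum (which then merges
# with the unavoidable row summation of w):
pieces = [local_vtv(C) for C in range(pc)]
res_sum_last = [sum(sum(W[i][j] * pieces[C][j] for j in range(nbp))
                    for C in range(pc)) for i in range(n)]

err = max(abs(a - b) for a, b in zip(res_sum_first, res_sum_last))
```

The two results agree to roundoff, which is exactly why the per-column combine can be deferred and folded into the later summation of w.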
B.1.3 Updating w with minimal computation cost

Figure B.3 shows how W · V^T v can be performed with only O(n²/√p + n² nb/p) com-
putation by distributing the computation of tmp = V^T · v and w = w + W · tmp over all
the processors. Each of the nb columns of V^T is assigned to one processor row, hence each
processor row is assigned nb/√p columns of V^T. Each processor row computes the portion of
V^T · v assigned to it, leaving the answer on the diagonal processor in this row. The diagonal
processors then broadcast the nb/√p elements of V^T · v which they own to all of the processors
within their processor column. Finally, each processor computes w = w + W · tmp for the
values of W and tmp which it owns.
Figure B.2: Computing W · V^T v without added communication

    tmp_{j,C} = Σ_{k∈{C}} V^T_{k,j} v_k                                    (Line 5)

    res_{i,C} (i∈{R}) = Σ_j W_{i,j} tmp_{j,C}                              (Line 6)
                      = Σ_j W_{i,j} Σ_{k∈{C}} V^T_{k,j} v_k

    Σ_C res_{i,C} (i∈{R}) = Σ_{j, 1≤C≤pc} W_{i,j} Σ_{k∈{C}} V^T_{k,j} v_k  (Line 7)
                          = Σ_j W_{i,j} Σ_k V^T_{k,j} v_k
Line 8  Local computation of V^T · v. Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  (2i/pr)(nb'/pc) γ2  =  (1/2) (n² nb / p) γ2

Line 9  Combine tmp_{j∈{R},C} within each processor column, leaving the answer on the di-
agonal processor. Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  log(pc) (α + (nb'/pc) β)  =  n log(pc) α + (1/2) (n nb / pc) log(pc) β

Line 10  Broadcast tmp_{j∈{C},C} within each processor row from the diagonal processor.
Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  log(pc) (α + (nb'/pc) β)  =  n log(pc) α + (1/2) (n nb / pc) log(pc) β

Line 11  Local computation of W · tmp. Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  (2i/pr)(nb'/pc) γ2  =  (1/2) (n² nb / p) γ2
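The closed forms above keep only the leading terms of the double sums. A quick numerical check (a Python sketch; n, nb, and the grid shape are arbitrary choices) confirms that Σ_{i=1,nb}^{n} Σ_{nb'=1}^{nb} (2i/pr)(nb'/pc) is close to (1/2) n² nb / (pr pc) once n >> nb:

```python
def local_flops(n, nb, pr, pc):
    """Exact double sum from the Line 8 / Line 11 operation counts:
    i steps over block columns, nb' over columns within a block."""
    total = 0.0
    for i in range(nb, n + 1, nb):
        for nbp in range(1, nb + 1):
            total += (2.0 * i / pr) * (nbp / pc)
    return total

n, nb, pr, pc = 2048, 32, 8, 8
exact = local_flops(n, nb, pr, pc)
leading = 0.5 * n**2 * nb / (pr * pc)
ratio = exact / leading        # approaches 1 as n/nb grows
```

For n = 2048 and nb = 32 the exact sum exceeds the leading term by under 5%, the lower-order terms the closed forms discard.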
Figure B.3: Computing W · V^T v with minimal computation

    tmp_{j,C} (j∈{R}) = Σ_{k∈{C}} V^T_{k,j} v_k                            (Line 8)

    ∀ R=C:
    tmp_{j,C} (j∈{C}) = Σ_{1≤cl≤pc, k∈{cl}} V^T_{k,j} v_k                  (Line 9)
                      = Σ_k V^T_{k,j} v_k

    tmp_{j,C} (j∈{C}) = Σ_{k∈{C}} V^T_{k,j} v_k                            (Line 10)

    res_{i,C} (i∈{R}) = Σ_{j∈{C}} W_{i,j} tmp_{j,C}                        (Line 11)
                      = Σ_{j∈{C}} W_{i,j} Σ_{k∈{C}} V^T_{k,j} v_k

    Σ_C res_{i,C} (i∈{R}) = Σ_{1≤C≤pc, j∈{C}} W_{i,j} Σ_{k∈{C}} V^T_{k,j} v_k   (Line 12)
                          = Σ_j W_{i,j} Σ_k V^T_{k,j} v_k

Line 12  Effect of summing res_i within each processor row. This operation is merged with
the unavoidable summation of w within each processor row, hence this operation is
not performed and has no cost.
The update of w in HJS requires similar communication and computation costs,
although the patterns of communication are quite different. HJS uses recursive halving to
spread the result of tmp = V^T v, computes W · tmp on all processors, and uses recursive
doubling to compute w while simultaneously spreading it to all processor columns. Although
the BLACS do not offer recursive halving and recursive doubling operations, we could build
them out of BLACS sends and receives, but that incurs higher latency costs.
B.1.4 Updating w with minimal total cost

Line 4.1, w = w - W V^T v - V W^T v, in Figure 8.3 can be computed with
O((n² nb / p^r) γ2 + n log(p^(r-0.5)) α) cost for any r ≥ 0.5. On a high latency machine, one can
reduce the total number of messages by increasing the load imbalance. On a low latency
machine, one can reduce the load imbalance by using more messages. The two options de-
scribed in the preceding sections are special cases of the general class of methods described
in this section: Section B.1.2 corresponds to r = 0.5, and Section B.1.3 corresponds to r = 1.0.
This method has not been implemented and hence has not been proven to result
in decreased execution times in practice.
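Even unimplemented, the model shows how the best r shifts with the machine. This Python sketch (with made-up parameter values; only the α/γ2 ratio matters) minimizes n² nb/p^r γ2 + n log(p^(r-0.5)) α over a grid of r values:

```python
import math

def modeled_cost(n, p, nb, r, gamma2, alpha):
    # n^2*nb/p^r flop term shrinks with r; the message term grows with r.
    return (n**2 * nb / p**r) * gamma2 + n * math.log2(p ** (r - 0.5)) * alpha

def best_r(n, p, nb, gamma2, alpha, steps=50):
    candidates = [0.5 + 0.5 * k / steps for k in range(steps + 1)]
    return min(candidates, key=lambda r: modeled_cost(n, p, nb, r, gamma2, alpha))

# Made-up parameters (assumptions for illustration only).
n, p, nb, gamma2 = 4000, 64, 32, 0.02e-6
r_fast_network = best_r(n, p, nb, gamma2, alpha=1e-6)   # low latency
r_slow_network = best_r(n, p, nb, gamma2, alpha=1e-2)   # high latency
```

A low-latency machine tolerates the extra log(p^(r-0.5)) messages and drives r toward 1.0 (minimal computation); a high-latency machine pushes r back toward 0.5 (no added communication), exactly the tradeoff described above.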
Methods corresponding to 0.5 < r < 1.0 require what amounts to a four dimen-
sional processor grid. The pr × pc processor grid is divided into pr2 × pc2 sub-grids, with each
sub-grid consisting of pr1 × pc1 processors. We restrict our attention to square processor
grids and square processor sub-grids, hence pr = pc, pr1 = pc1 and pr2 = pc2. Each processor
column is identified by a pair of numbers (ca, cb) such that 1 ≤ ca ≤ pc1 and 1 ≤ cb ≤ pc2. Like-
wise, each processor row is identified by a pair of numbers (ra, rb) such that 1 ≤ ra ≤ pr1 and
1 ≤ rb ≤ pr2. No modifications are needed to the BLACS to support this method, because
each processor belongs to only two 2-dimensional processor grids: the normal two dimen-
sional data layout and a two dimensional data layout containing only those processors in
the same processor sub-grid, i.e. with the same rb and cb.
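One plausible way to realize the four-index addressing (an assumption about the mapping; the text does not pin down whether sub-grids are formed from contiguous or cyclic slices of the grid) is to carve the grid into contiguous pr1 × pc1 blocks:

```python
def subgrid_coords(row, col, pr1, pc1):
    """Map a 0-based (row, col) grid position to 1-based 4D coordinates
    (ra, rb, ca, cb): position within its sub-grid (ra, ca) and which
    sub-grid it lies in (rb, cb), assuming contiguous sub-grid blocks."""
    ra, rb = row % pr1 + 1, row // pr1 + 1
    ca, cb = col % pc1 + 1, col // pc1 + 1
    return ra, rb, ca, cb

# A 4x4 grid split into 2x2 sub-grids of 2x2 processors each:
coords = {(r, c): subgrid_coords(r, c, 2, 2) for r in range(4) for c in range(4)}
```

Each processor then lives in exactly two BLACS grids: the full pr × pc grid, and the pr1 × pc1 grid of processors sharing its (rb, cb).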
Figure B.4 shows the general method for updating w using a 4 dimensional data
layout. The nb' elements of tmp are distributed over the pr1 processor rows and columns
within each processor block, such that each processor row and column owns roughly nb'/pr1
elements of tmp.
Figure B.4: Computing W · V^T v on a four dimensional processor grid

    tmp_{j,(ca,cb)} (j∈{ra}) = Σ_{k∈{(ca,cb)}} V^T_{k,j} v_k                   (Line 13)

    ∀ ra=ca:
    tmp_{j,(ca,cb)} (j∈{ca}) = Σ_{1≤cl≤pc1, k∈{(cl,cb)}} V^T_{k,j} v_k         (Line 14)
                             = Σ_{k∈{(·,cb)}} V^T_{k,j} v_k

    tmp_{j,(ca,cb)} (j∈{ca}) = Σ_{k∈{(ca,cb)}} V^T_{k,j} v_k                   (Line 15)

    res_{i,(ca,cb)} (i∈{(ra,rb)}) = Σ_{j∈{ca}} W_{i,j} tmp_{j,(ca,cb)}         (Line 16)
                                  = Σ_{j∈{ca}} W_{i,j} Σ_{k∈{(·,cb)}} V^T_{k,j} v_k

    Σ_{(ca,cb)} res_{i,(ca,cb)} (i∈{(ra,rb)})
        = Σ_{1≤ca≤pc1, 1≤cb≤pc2, j∈{ca}} W_{i,j} Σ_{k∈{(·,cb)}} V^T_{k,j} v_k  (Line 17)
        = Σ_{1≤ca≤pc1, j∈{ca}} W_{i,j} Σ_{1≤cb≤pc2, k∈{(·,cb)}} V^T_{k,j} v_k
        = Σ_j W_{i,j} Σ_k V^T_{k,j} v_k
B.1.5 Notes to Figure B.4

Line 13  Local computation of V^T · v. Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  (2i/pr)(nb'/pc1) γ2  =  (1/2) (n² nb / (pr pc1)) γ2

Line 14  Combine tmp_{j∈{ra},(ca,cb)} within each processor sub-grid column, leaving the an-
swer on the diagonal processor (i.e. ra = ca) within each sub-grid. Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  log(pc1) (α + (nb'/pc1) β)  =  n log(pc1) α + (1/2) (n nb / pc1) log(pc1) β

Line 15  Broadcast tmp_{j∈{ra},(ca,cb)} within each processor sub-grid row from the diagonal
processor in that sub-grid row. Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  log(pc1) (α + (nb'/pc1) β)  =  n log(pc1) α + (1/2) (n nb / pc1) log(pc1) β

Line 16  Local computation of W · tmp. Operations:

    Σ_{i=1,nb}^{n}  Σ_{nb'=1}^{nb}  (2i/pr)(nb'/pc1) γ2  =  (1/2) (n² nb / (pr pc1)) γ2

Line 17  Effect of summing res_i within each processor row. This operation is merged with
the unavoidable summation of w within each processor row, hence this operation is
not performed and has no cost.
B.1.6 Overlap communication and computation as a last resort

There are numerous studies showing that overlapping communication and compu-
tation improves performance, but most of them show only modest improvement. Arbenz
and Slapnicar [9] show a 5% improvement by overlapping communication and computation,
while Pourzandi and Tourancheau show a 6% improvement. Those that show the greatest
improvement combine communication and computation overlap with other equally impor-
tant techniques such as pipelining and lookahead [32].
I don't know why overlapping communication and computation leads to only mod-
est improvements. In theory it ought to hide most of the communication costs. There are
several possible explanations, all of which presumably contribute. I suspect that the most
important reason for the disappointing savings from overlap is that overhead, and not com-
munication cost, is the primary factor limiting efficiency. A second important reason
is that most of the cost of communication on today's distributed memory machines is the
cost of moving the data between the node and the network, not moving data within the
network. The cost of moving data to and from the node always involves main memory cy-
cles, unless the main memory is dual ported (i.e. expensive), which must be stolen from the
execution of the rest of the code. Further, the latency cost is almost all software overhead,
hence during the message setup the CPU is busy and cannot compute.
The disadvantage of communication and computation overlap is that it adds com-
plexity which can be put to better use elsewhere. Both the Pourzandi/Tourancheau and
Arbenz/Slapnicar studies used a 1D data layout in Jacobi, although a 2D data layout offers
lower communication costs, O(n²/√p) versus O(n²), and lower overhead costs. They would
have done better to use a 2D data layout and delayed (potentially forever) consideration of
communication and computation overlap.
B.2 Matlab codes

B.2.1 Jacobi

The following is the Matlab code for Table 7.4.
n = 1000;
p = 64;
blacsalpha = 65.9e-6;
blacsbeta=.146e-6;
dividebeta=3.85e-6;
squarerootbeta=7.7e-6;
blasonebeta=.074e-6;
dgemmalpha=103e-6;
dgemmbeta=.0215e-6;
term(1) = 8 * sqrt(p) * ( log2(p) - 3 ) * blacsalpha
term(2) = 7/2 * n^2 / sqrt(p) * blacsbeta
term(3) = 1/8 * n^2 / sqrt(p) * log2(p) * blacsbeta
term(4) = 1/2 * n^2 / sqrt(p) * dividebeta
term(5) = 1/4 * n^2 / sqrt(p) * squarerootbeta
term(6) = 3/8 * n^3 / p * blasonebeta
term(7) = 8 * sqrt(p) * dgemmalpha
term(8) = 5 * n^3 / p * dgemmbeta
time = sum(term)
Appendix C

Miscellaneous Matlab codes

C.1 Reduction to tridiagonal form

The following Matlab code performs an unblocked reduction to tridiagonal form.
It produces the same values, up to roundoff, of D, E and TAU as LAPACK's DSYTRD and
ScaLAPACK's PDSYTRD.
%
% tridi - An unblocked, non-symmetric reduction to tridiagonal form
%
% This file creates an input matrix A, reduces it to tridiagonal form
% and tests to make sure that the reduction was performed correctly.
%
% outputs:
% D, E - The tridiagonal matrix
% tau
% A - The lower half holds the householder updates
%
%
% Produce the input matrix
%
N = 7;
A = hilb(N) + toeplitz( [ 1 (1:(N-1))*i ] );
B = A; % Keep a copy to check our work later.
%
% Reduce to tridiagonal form
%
n = size(A,1);
I = eye(N);
for j =1:n-1
%
% Compute the householder vector: v
%
clear v;
v(1:n,1) = zeros(n,1);
v(j+1:n,1) = A(j+1:n,j);
alpha = A(j+1,j);
beta = - norm(v) * real(alpha) / abs( real(alpha) ) ;
tau(j) = ( beta - alpha ) / beta ;
v = v / ( alpha - beta ) ;
v(j+1) = 1.0 ;
%
% Perform the matrix vector multiply:
%
w = A * v ;
%
% Compute the companion update vector: w
%
w = tau(j) * w ;
c = w' * v;
w = (w - (c * tau(j) / 2 ) * v );
D(j) = A(j,j);
E(j) = beta ;
%
% Update the trailing matrix
%
A = A - v * w' - w * v';
%
% Store the householder vector back into A
%
A(j+2:n,j) = v(j+2:n);
end
D(n) = A(n,n);
%
% Check to make sure that the reduction was performed correctly.
%
DE = diag(D) + diag(E,-1) + diag(E,1) ;
Q=I;
for j = 1:n-1
clear house
house(1:n,1) = zeros(n,1);
house(j+1:n,1) = A(j+1:n,j);
house(j+1,1) = 1.0;
Q = (I- tau(j)' * house * house') * Q ;
end
norm( B - Q' * DE * Q )
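The same reduction can be sketched in pure Python for a real symmetric matrix (a translation of the loop above for illustration, not a replacement for DSYTRD; the sign convention follows the Matlab code and no care is taken for zero columns):

```python
import math, random

def house_tridiag(A):
    """Unblocked Householder reduction of a real symmetric matrix
    (list of lists) to tridiagonal form. Returns (D, E) with D the
    diagonal and E the off-diagonal, as in the Matlab code above."""
    n = len(A)
    A = [row[:] for row in A]           # work on a copy
    D, E = [0.0] * n, [0.0] * (n - 1)
    for j in range(n - 1):
        # Householder vector v annihilating A(j+2:n, j)
        v = [0.0] * n
        for k in range(j + 1, n):
            v[k] = A[k][j]
        alpha = A[j + 1][j]
        nrm = math.sqrt(sum(x * x for x in v))
        beta = -nrm if alpha >= 0 else nrm
        tau = (beta - alpha) / beta
        for k in range(j + 1, n):
            v[k] /= (alpha - beta)
        v[j + 1] = 1.0
        # Companion update vector: w = tau*A*v - (c*tau/2)*v, c = w'v
        w = [tau * sum(A[i][k] * v[k] for k in range(n)) for i in range(n)]
        c = sum(w[k] * v[k] for k in range(n))
        w = [w[i] - 0.5 * c * tau * v[i] for i in range(n)]
        D[j], E[j] = A[j][j], beta
        # Rank-two update of the trailing matrix: A = A - v w' - w v'
        for i in range(n):
            for k in range(n):
                A[i][k] -= v[i] * w[k] + w[i] * v[k]
    D[n - 1] = A[n - 1][n - 1]
    return D, E

# Check on a random symmetric matrix: the orthogonal similarity
# preserves both the trace and the Frobenius norm.
random.seed(1)
n = 6
M = [[0.0] * n for _ in range(n)]
for i in range(n):
    for k in range(i, n):
        M[i][k] = M[k][i] = random.random()
D, E = house_tridiag(M)
trace_err = abs(sum(D) - sum(M[i][i] for i in range(n)))
fro_err = abs(sum(d * d for d in D) + 2 * sum(e * e for e in E)
              - sum(x * x for row in M for x in row))
```

The trace and Frobenius-norm checks play the role of the norm(B - Q' * DE * Q) test above without accumulating Q explicitly.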