PARALLEL ALGORITHMS FOR LOW-RANK APPROXIMATIONS OF MATRICES AND TENSORS
BY
LAWTON MANNING
A Thesis Submitted to the Graduate Faculty of
WAKE FOREST UNIVERSITY GRADUATE SCHOOL OF ARTS AND SCIENCES
in Partial Fulfillment of the Requirements
for the Degree of
MASTER OF SCIENCE
Computer Science
May 2021
Winston-Salem, North Carolina
Approved By:
Grey Ballard, Ph.D., Advisor
Jennifer Erway, Ph.D., Chair
Samuel Cho, Ph.D.
Table of Contents
List of Figures
Abstract
Chapter 1 Introduction
1.1 Low-Rank Approximations
1.2 Distributed-Memory Parallel Algorithms
1.3 Applications
Chapter 2 Preliminaries
2.1 Distributed-Memory Parallel Computing
2.1.1 MPI Cost Model
2.1.2 MPI Collectives
2.1.3 Parallel Scaling
2.2 Matrices
2.2.1 Singular Value Decomposition
2.2.2 Truncated SVD
2.2.3 QR Decomposition
2.2.4 Nonnegative Matrix Factorization
2.2.5 Hierarchical NMF
2.3 Tensors
2.3.1 Tensor Train
2.3.2 Tensor Train Notation
2.3.3 TT Rounding
Chapter 3 Parallel Hierarchical Clustering using Rank-Two Nonnegative Matrix Factorization
3.1 Abstract
3.2 Introduction
3.3 Preliminaries and Related Work
3.3.1 Non-negative Matrix Factorization (NMF)
3.3.2 Parallel NMF
3.3.3 Communication Model
3.4 Algorithms
3.4.1 Sequential Algorithms
3.4.2 Parallelization
3.5 Experimental Results
3.5.1 Experimental Platform
3.5.2 Datasets
3.5.3 Performance
3.6 Conclusion
Chapter 4 Tensor Train Rounding using Gram Matrices
4.1 Abstract
4.2 Preliminaries
4.2.1 Tensor Train Notation
4.2.2 Cholesky QR and Gram SVD
4.2.3 Cookies Problem and TT-GMRES
4.2.4 TT-Rounding via Orthogonalization
4.2.5 Previous Work on Parallel TT-Rounding
4.3 Introduction
4.4 Truncation of Matrix Product
4.4.1 Truncation via Orthogonalization
4.4.2 Truncation via Gram SVD
4.4.3 Complexity Analysis
4.4.4 Numerical Examples
4.5 TT-Rounding via Gram SVD
4.5.1 TT Rounding Structure
4.5.2 Structured Gram Matrix Computation
4.5.3 Algorithms
4.5.4 Parallelization
4.5.5 Complexity Analysis
4.6 Numerical Results
4.6.1 Experimental Setup
4.6.2 Parallel Scaling of TT Rounding
4.6.3 Time Breakdown of TT Rounding
4.6.4 TT-GMRES Performance
4.7 Conclusion
Chapter 5 Conclusion
Bibliography
Curriculum Vitae
List of Figures
2.1 NMF matrix diagram
2.2 Tensor Train format of a five-way tensor
2.3 Unfoldings for TT Tensors
3.1 Hierarchical Clustering of DC Mall HSI
3.2 Hierarchy node classification
3.3 Parallel cluster splitting using Rank-2 NMF
3.4 Strong Scaling for Clustering on DC-HYDICE
3.5 Strong Scaling Speedup for Rank-2 NMF
3.6 Time Breakdown for Rank-2 NMF on Synthetic
3.7 Time Breakdown for Rank-2 NMF on SIIM-ISIC
3.8 Strong Scaling Speedup for Clustering
3.9 Time Breakdown for Clustering on Synthetic
3.10 Time Breakdown for Clustering on SIIM-ISIC
3.11 Level Times for 1 Compute Node on Synthetic
3.12 Level Times for 40 Compute Nodes on Synthetic
3.13 Rank Scaling for Hierarchical and Flat NMF
4.1 Numerical results for truncation of matrix product X = AB^T
4.2 Tensor network diagrams
4.3 Strong Scaling for Model 2
4.4 Performance results for Model 3
4.5 Weak scaling time breakdowns for Model 1
4.6 TT-GMRES timing for MATLAB implementation
4.7 TT-GMRES Weak Scaling
Abstract
Low-rank approximations are useful in the compression and interpretation of large datasets. Distributed parallel algorithms for such approximations, like those for matrices and tensors, are applicable to even larger datasets that cannot conceivably fit on one computer. In this thesis I present parallelizations of two such approximation algorithms: Hierarchical Nonnegative Matrix Factorization and Tensor Train Rounding. In both cases, the distributed parallel algorithms outperform the state of the art.
Nonnegative Matrix Factorization (NMF) is a tool for clustering nonnegative matrix data. A Hierarchical NMF clustering can be achieved by recursively clustering a dataset using Rank-2 (two-cluster) NMF. The hierarchical clustering algorithm can reveal more detailed information about the data. It is also faster than a flat clustering of the same size, since Rank-2 NMF is faster and scales better than the general NMF algorithm as the number of clusters increases.
Tensor Train (TT) uses a series of 3-dimensional TT cores to approximate an N-dimensional tensor. TT ranks determine the sizes of these cores. Arithmetic with Tensor Train causes an artificial increase in the TT ranks, and thus in the sizes of the TT cores. So, TT applications use an algorithm called TT rounding to truncate TT ranks subject to some approximation error. The TT rounding algorithm can be thought of as a Truncated Singular Value Decomposition (tSVD) of a product of highly structured matrices. The state-of-the-art approach requires a slow orthogonalization phase. A faster Gram SVD algorithm avoids this slow phase, which reduces the computation time of TT Rounding and improves its parallel scalability.
Chapter 1: Introduction
Low-rank approximations are useful in the compression and interpretation of large
datasets. Distributed-memory parallel algorithms for such approximations, like those
for matrices and tensors, are applicable to even larger datasets that cannot conceivably
fit on one computer. In this thesis we present parallelizations of two such approximation
algorithms: Hierarchical Nonnegative Matrix Factorization and Tensor Train
Rounding. In both cases, the distributed-memory parallel algorithms outperform the
state of the art.
1.1 Low-Rank Approximations
There is a wide variety of low-rank approximations that are used in a range of applications
such as facial recognition [18], dimensionality reduction [62], hyperspectral image
segmentation [25], and data completion [55]. Some of these low-rank approximations
include: Singular Value Decomposition (SVD), Nonnegative Matrix Factorization
(NMF), Principal Component Analysis (PCA), the tensor CP Decomposition, and
Tensor Train (TT).
For example, hyperspectral image segmentation is a popular application for Non-
negative Matrix Factorization (NMF). NMF is a clustering algorithm that can cluster
individual pixels in a hyperspectral image. The resulting NMF clustering also con-
tains feature signatures for each cluster and fractional cluster membership for each
pixel. For hyperspectral images, these feature signatures can describe the types of
materials each pixel captures as different materials reflect light at different spectra
(colors) [25].
Another example is the low-rank approximation of incomplete tensor data called
tensor completion. Tensor completion is the problem of filling missing or unobserved
entries of partially observed tensors [55]. Filling missing entries in a tensor gives many
degrees of freedom for what those entries could ultimately be, so tensor completion
problems require constraints so that they can be solvable. One of the common con-
straints is maintaining a low rank in the resulting completed tensor. There are several
definitions of rank for a tensor approximation, depending on the type of approxima-
tion used. One of the common tensor decompositions used for tensor completion is
the CP decomposition. After computing the CP decomposition that best fits the
observed data and has a minimal rank, the unobserved data is predicted using the
corresponding value from that CP model.
1.2 Distributed-Memory Parallel Algorithms
In 1965, Gordon Moore observed that the number of transistors on a single silicon
chip had doubled every year and proposed that it would continue
to do so for at least the next 10 years [42]. This observation, now known as Moore’s
Law, has been generalized over time to computational density rather than transistor density. As
engineers met the physical limits of transistor density, other strategies were developed
to meet the extended Moore’s law, such as multiple processor cores on a single chip
and GPU accelerators. However, even as computers become more and more powerful,
there are still problems that take too long to solve. These problems also typically
require large amounts of memory as well. Both computational and storage bottlenecks
lead us to work on distributed-memory systems such as supercomputers.
The most powerful supercomputers in the world are not made up of futuristic
processors or overly large hard drives. Instead, they are giant networks of individual
computers made of commercially available technology. For example, the Summit
supercomputer at Oak Ridge National Laboratory was the most powerful supercomputer
in the world, with 4608 individual “nodes”, each with 2 IBM POWER9 CPUs
and 6 NVIDIA Volta GPUs [39]. Although each of these nodes is powerful in its
own right, the ability to use multiple nodes in tandem is what makes distributed-memory
parallel algorithms high performing.
The Summit nodes each contain 512 GB of main memory for use by the proces-
sors [39]. If a problem requires more than this amount of memory, which is likely
for problems requiring high performance computing, adding more nodes to the com-
putation can allow for the distribution of that problem’s data across many nodes.
However, distributing memory like this comes with a downside, which is the commu-
nication between nodes.
Relative to the speed of computation on an individual node, the cost
of communicating data between two nodes is orders of magnitude higher. In the
worst cases, the majority of the time spent in a distributed-memory algorithm goes to
this slow communication of data instead of the actual computations of the algorithm,
which limits parallel scalability. This is why we must design parallel algorithms that
avoid communication as much as possible. The algorithms presented in this thesis
both avoid communicating the bulk of their data and instead communicate only the results
of smaller, intermediate calculations.
1.3 Applications
This thesis will cover distributed-memory parallel algorithms for two low-rank
approximations: Nonnegative Matrix Factorization and Tensor Train. Nonnegative Matrix
Factorization (NMF) is a clustering algorithm for nonnegative data that can extract
feature signatures and cluster membership for individual samples. Hierarchical
Clustering with Rank-2 NMF (HierNMF) results from an optimization on a flat NMF
clustering algorithm. HierNMF can give a more detailed answer than the flat algorithm and
potentially do it faster. This algorithm is discussed further in chapter 3. Tensor Train
(TT) is a data compression format for tensors, which are multidimensional arrays with
any number of dimensions. TT allows computations to be done on tensors implicitly,
without uncompressing them. TT Rounding is a common bottleneck subroutine
used in many TT applications, so chapter 4 proposes another approach to that
subroutine that reduces both communication and computation to produce a faster
approximation.
Chapter 2: Preliminaries
This chapter provides background on how distributed-memory
algorithms are designed and implemented using the Message Passing Interface (MPI)
and analyzed using the α − β − γ model, as well as the linear algebra concepts needed to
understand the content of later chapters.
2.1 Distributed-Memory Parallel Computing
Distributed-memory parallel architectures consist of multiple processors, each with
its own local memory. We use the Message Passing Interface (MPI) to allow
processors to explicitly send and receive data. MPI is a standard interface for writing
distributed-memory parallel code in C, C++, and FORTRAN. Unlike shared-memory
interfaces like OpenMP, MPI requires that data be explicitly passed between
processors, often through collectively invoked functions.
2.1.1 MPI Cost Model
In analyzing MPI algorithms, there are the normal costs of computation as well as the
additional communication costs of passing data between processors. Communication
costs can be broken down into two parts: bandwidth and latency. Bandwidth is
the cost associated with the amount of data sent between processors. Latency is
the overhead cost of sending any amount of data in MPI. To analyze these costs
together, we use the α − β − γ model defined in [11]. This model combines the
costs of latency, bandwidth, and computation by assigning the coefficients α, β, and
γ to each, respectively. On distributed-memory systems, latency is the most costly,
followed by bandwidth and then computation, so α ≫ β ≫ γ. In this model,
sending a message of w words costs α + βw.
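To make the model concrete, the following is a minimal sketch of how costs could be tallied under the α − β − γ model. The function names and the numerical values of the three coefficients are hypothetical and chosen only for illustration; they are not measurements of any machine discussed in this thesis.

```python
def message_cost(words, alpha, beta):
    """Modeled time to send one message of `words` words: alpha + beta * words."""
    return alpha + beta * words


def total_cost(flops, words, messages, alpha, beta, gamma):
    """Modeled time for `flops` operations, `words` words sent, and `messages` messages."""
    return gamma * flops + beta * words + alpha * messages


# Hypothetical coefficients in seconds (illustrative only, with alpha >> beta >> gamma).
ALPHA, BETA, GAMMA = 1e-6, 1e-9, 1e-11

# Sending the same 10^6 words as one large message or as 1000 small messages.
one_big = message_cost(1_000_000, ALPHA, BETA)
many_small = 1000 * message_cost(1_000, ALPHA, BETA)
print(one_big, many_small)
```

Under these assumed coefficients the 1000 small messages cost about twice as much as the single large one, because each message pays the latency term α again.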
2.1.2 MPI Collectives
MPI collectives are commonly used functions where groups of processors invoke one
function to pass data collectively between them. Table 2.1 shows the MPI collectives
used in this thesis and their initial and final data distributions. For example, given
elements of a vector x scattered across processors, AllGather will gather those ele-
ments so that all processors have a full copy of x. If instead each processor had a
local x, AllReduce would sum the individual x and store the result to all processors.
ReduceScatter would sum the local x on each processor and distribute the elements
of that sum across processors [11].
Operation        Before                              After
                 p0         p1         p2            p0             p1             p2

All-Reduce       x^(0)      x^(1)      x^(2)         Σ_j x^(j)      Σ_j x^(j)      Σ_j x^(j)

Reduce-Scatter   x_0^(0)    x_0^(1)    x_0^(2)       Σ_j x_0^(j)
                 x_1^(0)    x_1^(1)    x_1^(2)                      Σ_j x_1^(j)
                 x_2^(0)    x_2^(1)    x_2^(2)                                     Σ_j x_2^(j)

All-Gather       x_0                                 x_0            x_0            x_0
                            x_1                      x_1            x_1            x_1
                                       x_2           x_2            x_2            x_2

Table 2.1: MPI collective algorithm data distributions [11]. x_i is a segment of a vector x. x^(j) is data originally belonging to processor p_j.
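The data movement of these three collectives can be imitated on a single machine. The sketch below is not MPI code: it models each processor as one entry of a Python list, and the function names are hypothetical stand-ins for the semantics of the MPI routines MPI_Allgather, MPI_Allreduce, and MPI_Reduce_scatter.

```python
import numpy as np


def all_gather(segments):
    """segments[i] is the piece x_i held by processor i; every processor gets all of x."""
    full = np.concatenate(segments)
    return [full.copy() for _ in segments]


def all_reduce(local_vectors):
    """local_vectors[j] is x^(j) on processor j; every processor gets the sum over j."""
    total = np.sum(local_vectors, axis=0)
    return [total.copy() for _ in local_vectors]


def reduce_scatter(local_vectors):
    """Sum the local vectors, then leave segment i of the sum on processor i."""
    total = np.sum(local_vectors, axis=0)
    return np.array_split(total, len(local_vectors))


p = 3
# Processor j holds a local vector x^(j) of length 6.
local = [np.arange(6, dtype=float) + 10 * j for j in range(p)]

reduced = all_reduce(local)
scattered = reduce_scatter(local)
regathered = all_gather(scattered)
print(reduced[0])
print([s.tolist() for s in scattered])
```

In this model a Reduce-Scatter followed by an All-Gather leaves every processor with the same data as a single All-Reduce.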
Table 2.2 shows the minimal α−β−γ costs of each of the three collectives described
in Table 2.1. As the number of processors p increases, the latency costs increase,
eventually creating a bottleneck in any distributed-memory parallel algorithm.
Collective       Computation (γ)   Bandwidth (β)   Latency (α)
All-Reduce       O(n)              O(n)            O(log_2 p)
Reduce-Scatter   O(n)              O(n)            O(log_2 p)
All-Gather       —                 O(n)            O(log_2 p)

Table 2.2: MPI collective algorithm costs using the α − β − γ model [11]. The costs assume an input array of n words that is communicated using p processors.
2.1.3 Parallel Scaling
Scaling is useful for analyzing parallel algorithms. There are two types of scaling:
strong and weak. Strong scaling is measured by observing how performance improves as
the number of processors increases while the problem stays the same. An algorithm
is said to have perfect strong scaling when the “speed-up” relative to
one processor is identical to the number of processors used (e.g. 8x speed-up for 8
processors). Perfect strong scaling is possible when the problem is computationally
bound and the computations can be evenly distributed between processors. However,
after a certain point the communication cost in a parallel algorithm will start to
dominate, since it can grow with the number of processors used.
Weak scaling is done by observing the performance as the number of processors
increases in step with the size of the problem. Applications for weak scaling are
generally problems where resolution can be increased. This could be the number of
spatial grid points in a simulation, for example.
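The two notions of scaling can be summarized by a few ratios. The sketch below uses hypothetical helper names and hypothetical timings; the timings are chosen so that the speed-up matches the 5.9x reported in the abstract of chapter 3 when scaling from 10 to 80 nodes.

```python
def speedup(t_base, t_p):
    """Ratio of the baseline running time to the running time on more processors."""
    return t_base / t_p


def strong_scaling_efficiency(t_base, t_p, p_base, p):
    """Achieved speed-up divided by the ideal speed-up p / p_base (1.0 is perfect)."""
    return speedup(t_base, t_p) / (p / p_base)


def weak_scaling_efficiency(t_base, t_p):
    """Problem size grows with p, so the ideal running time stays constant."""
    return t_base / t_p


# Hypothetical timings in seconds giving a 5.9x speed-up from 10 to 80 nodes.
s = speedup(59.0, 10.0)
eff = strong_scaling_efficiency(59.0, 10.0, p_base=10, p=80)
print(s, eff)
```

A speed-up of 5.9x on 8 times as many nodes corresponds to a strong scaling efficiency of roughly 74%.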
2.2 Matrices
A matrix is a two-dimensional grid of numbers and is a useful data storage format. In
this work, a matrix called “A” is written as A. One of the important characteristics
of a matrix that is explored in this thesis is its rank. We will explore the rank further
in section 2.2.1.
Low-rank approximations of matrices extract the most useful features out of the
original matrix. This can be useful in things like image compression as the resulting
representation of the matrix can be smaller but still maintain the essence of the
original data.
2.2.1 Singular Value Decomposition
The Singular Value Decomposition (SVD) is a popular factorization of real or complex
matrices into interpretable component matrices. The SVD is given by
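\[ A = U \Sigma V^T \tag{2.1} \]
where $A \in \mathbb{R}^{m \times n}$, $U \in \mathbb{R}^{m \times n}$, $\Sigma \in \mathbb{R}^{n \times n}$, $V \in \mathbb{R}^{n \times n}$, and $m \ge n$.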
U and V are orthonormal matrices. Orthonormal matrices have orthogonal col-
umn vectors with unit norms. This means that each column vector is perpendicular
to the other column vectors in the matrix and their “length” is 1. In the case of the
SVD, the column vectors of U and V are called the left and right singular vectors,
respectively.
Σ is a diagonal matrix with nonnegative diagonal entries in descending order. This means that
only the entries along the main diagonal from upper-left to lower-right can be nonzero
while the rest of the matrix is zero. These diagonal entries are called the singular
values, and they are unique to the matrix A.
The SVD has many properties that are useful for Numerical Linear Algebra. The
rank r of the matrix A is defined as the number of nonzero singular values in Σ.
Since the number of singular values is bounded by the number of diagonal entries of
the matrix Σ, the rank is also bounded as r ≤ n. If r = n, a matrix is said to be full
rank.
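As an illustration of these properties, the sketch below computes the SVD of a small matrix with NumPy's `numpy.linalg.svd`. The test matrix and the tolerance used to decide which singular values count as nonzero are assumptions made for this example; in floating point arithmetic, singular values that are zero in exact arithmetic usually appear as tiny positive numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a 6 x 4 matrix of rank 2 as a product of two thin random factors.
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))

# With full_matrices=False, U is m x n, s holds the n singular values, Vt is V^T (n x n).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Numerical rank: count singular values above a relative tolerance (assumed here).
tol = 1e-10 * s[0]
rank = int(np.sum(s > tol))

print(s)
print(rank)
```

The singular values are returned in descending order, and only two of them are significantly larger than zero for this matrix.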
2.2.2 Truncated SVD
Given a matrix A with rank r and SVD $A = U \Sigma V^T$, the best rank $k \le r$ approximation
of A can be defined as
\[ A_k = \sum_{j=1}^{k} \sigma_j u_j v_j^T \tag{2.2} \]
as provided by [60], where $\sigma_j$ are the k largest singular values of A and $u_j$ and $v_j$
are the corresponding column vectors of U and V.
From eq. (2.2), the truncated SVD consists of the first k columns of U and V and the first
k singular values from the full SVD of a matrix A. The Truncated SVD is represented
as
\[ A_k = \hat{U} \hat{\Sigma} \hat{V}^T, \]
where the hats denote the truncated factors.
So, after computing the full SVD as described in section 2.2.1, the truncated SVD
for any rank k is trivial to compute.
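The truncation in eq. (2.2) amounts to slicing the factors of the full SVD. The following sketch, with a hypothetical helper `truncated_svd` and a random test matrix, checks that the sliced factors reproduce the sum of rank-one terms and that the error in the spectral norm equals the first discarded singular value, which is the Eckart-Young property of the best rank-k approximation.

```python
import numpy as np


def truncated_svd(A, k):
    """Keep the first k singular triplets of the full SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))
k = 2

Uk, sk, Vtk = truncated_svd(A, k)
A_k = Uk @ np.diag(sk) @ Vtk

# The same matrix written as the sum of k rank-one terms from eq. (2.2).
A_k_sum = sum(sk[j] * np.outer(Uk[:, j], Vtk[j, :]) for j in range(k))

all_singular_values = np.linalg.svd(A, compute_uv=False)
spectral_error = np.linalg.norm(A - A_k, 2)
print(spectral_error, all_singular_values[k])
```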
2.2.3 QR Decomposition
Similar to the Singular Value Decomposition (section 2.2.1), the QR decomposition
takes any matrix A and computes
\[ A = QR \tag{2.3} \]
where $A \in \mathbb{R}^{m \times n}$, $Q \in \mathbb{R}^{m \times n}$, and $R \in \mathbb{R}^{n \times n}$. Like U and V in the SVD, Q has
orthonormal columns. R is an upper triangular matrix: its nonzero entries
lie on or above the main diagonal, and every entry below the main diagonal
is zero.
The QR decomposition is useful for solving least squares problems. As will be
explained in chapter 4, it can also be used in computing the Truncated SVD, as it is less
computationally expensive to compute than the SVD.
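As a small illustration of the least squares use case, the sketch below factors a random tall matrix with `numpy.linalg.qr` and solves min ||Ax − b|| through the triangular system Rx = Q^T b. The matrix sizes and data are arbitrary choices for this example, and it assumes A has full column rank so that R is invertible.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((7, 3))   # tall matrix, m >= n
b = rng.standard_normal(7)

# Reduced QR: Q is 7 x 3 with orthonormal columns, R is 3 x 3 upper triangular.
Q, R = np.linalg.qr(A)

# Least squares solution of min ||A x - b||_2 from R x = Q^T b.
x = np.linalg.solve(R, Q.T @ b)

# Reference solution from NumPy's least squares routine.
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(x, x_ref)
```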
[Figure 2.1 diagram: A (m×n) ≈ W (m×k) H^T (k×n)]
Figure 2.1: Nonnegative Matrix Factorization (NMF) of a matrix A by factor matrices W and H. The dimensions of each matrix are listed in parentheses below the boxes. The boxes of each matrix are relative in size to one another given dimension choices.
2.2.4 Nonnegative Matrix Factorization
Nonnegative Matrix Factorization (NMF) is an approximation of a matrix with high
dimensions as a product of two lower dimensional nonnegative matrices. The approx-
imation is written as
\[ A \approx W H^T \tag{2.4} \]
where $A \in \mathbb{R}_+^{m \times n}$ is a data matrix and $W \in \mathbb{R}_+^{m \times k}$ and $H \in \mathbb{R}_+^{n \times k}$ are both nonnegative
factor matrices. The chosen $k \le \min(m, n)$ is a parameter; it is the
rank of the factor matrices and also the nonnegative rank of the approximation of A.
This approximation is also depicted in fig. 2.1.
There are several methods for computing a NMF. One of these methods is the
Alternating Nonnegative Least Squares (ANLS) method [38]. This method starts
with the minimization problem
\[ \min_{W \ge 0} \| A - W H^T \| \tag{2.5} \]
for finding W and the similar problem of
\[ \min_{H \ge 0} \| A^T - H W^T \| \tag{2.6} \]
for finding H. These are both constrained Least Squares (LS) problems with nonnegativity
constraints. They are referred to as Nonnegative Least Squares (NNLS).
By fixing either W or H and solving the linear system for the other, an alternating
update algorithm can converge to a stopping point, since both minimizations are
convex problems [38].
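The alternating structure can be sketched in a few lines. The code below is an illustration under several assumptions and is not the implementation used in this thesis: the norm is taken to be the Frobenius norm, the function names are hypothetical, and each NNLS subproblem is solved by a brute-force search over all support sets, which is only practical for very small k and stands in for a real NNLS solver.

```python
import itertools
import numpy as np


def nnls_exhaustive(C, b):
    """Solve min_{x >= 0} ||C x - b||_2 by trying every support set (small k only)."""
    k = C.shape[1]
    best_x, best_res = np.zeros(k), np.linalg.norm(b)   # empty support: x = 0
    for size in range(1, k + 1):
        for support in itertools.combinations(range(k), size):
            cols = list(support)
            x = np.zeros(k)
            # Unconstrained least squares restricted to the chosen columns.
            x[cols] = np.linalg.lstsq(C[:, cols], b, rcond=None)[0]
            res = np.linalg.norm(C @ x - b)
            if np.all(x >= 0) and res < best_res:
                best_x, best_res = x, res
    return best_x


def anls(A, k, iters=10, seed=0):
    """Alternate between the two NNLS problems (2.5) and (2.6)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    errors = []
    for _ in range(iters):
        # Fix W: row j of H solves min ||W h - A[:, j]|| with h >= 0.
        H = np.array([nnls_exhaustive(W, A[:, j]) for j in range(n)])
        # Fix H: row i of W solves min ||H w - A[i, :]|| with w >= 0.
        W = np.array([nnls_exhaustive(H, A[i, :]) for i in range(m)])
        errors.append(np.linalg.norm(A - W @ H.T))
    return W, H, errors


rng = np.random.default_rng(0)
A = rng.random((10, 2)) @ rng.random((2, 8))   # nonnegative test data of rank 2
W, H, errors = anls(A, k=2, iters=10)
print(errors[0], errors[-1])
```

Because each half-step solves its subproblem exactly, the recorded error cannot increase from one iteration to the next, apart from rounding effects.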
There are different algorithms used to solve the NNLS problems described in
eq. (2.5) and eq. (2.6). One of these methods, Block Principal Pivoting (BPP),
is described in [35] and [31]. BPP uses the active set method to compute
the NNLS solution. The active set method deals with the nonnegativity constraint of NNLS
by iteratively computing the unconstrained LS solution and grouping negative contributions.
This active set method is well-defined for the vector case, and is extended to the
matrix case by going column-by-column.
2.2.5 Hierarchical NMF
NMF can be used to cluster data by interpreting the W and H factor matrices. For
example, if columns of a data matrix represent samples of data and rows represent
features of those samples, then the k columns of W represent k clusters of data and
the k rows of $H^T$ represent the membership of each data point in the k clusters.
Since NMF can naturally be used as a clustering algorithm, recursively calling
NMF with k = 2 on data can result in a hierarchical tree of clusters. This is the
basic premise of the Hierarchical NMF algorithm. In Hierarchical NMF, k refers to
the number of leaf clusters in the resulting tree.
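A minimal sketch of how a factor matrix H could be read as a clustering is given below. The values in H are made up for illustration, and both the rule of assigning each sample to its largest entry and the row normalization into fractional memberships are assumptions of this example rather than the procedure of chapter 3.

```python
import numpy as np

# Hypothetical H factor for n = 5 samples and k = 2 clusters (row j belongs to sample j).
H = np.array([
    [0.9, 0.1],
    [0.8, 0.0],
    [0.2, 0.7],
    [0.0, 1.1],
    [0.5, 0.4],
])

# Hard clustering: each sample goes to the cluster with its largest loading.
labels = np.argmax(H, axis=1)

# Soft clustering: scale each row so the memberships of a sample sum to one.
membership = H / H.sum(axis=1, keepdims=True)

# With k = 2 this is one split of a hierarchy: two child clusters of sample indices.
left = np.flatnonzero(labels == 0)
right = np.flatnonzero(labels == 1)
print(labels, left, right)
```

Applying the same split again to the samples in `left` and `right` would produce the next level of the tree.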
From section 2.2.4, BPP is a general approach to solving NNLS for any k and
scales like O(k). In [38], the authors propose a faster NNLS algorithm that requires k = 2.
When k = 2, there are only 4 possible active sets, so they can be
checked exhaustively at little cost. Since the algorithm proposed in [38]
is so simple to compute for k = 2, the authors proposed that it be used as a subroutine
for Hierarchical NMF. In chapter 3, we parallelize this Hierarchical NMF algorithm
using a parallel Rank-2 NMF.
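The k = 2 case can be sketched as follows. This is an illustration written for this section and not the code of [38]: the function name is hypothetical, it assumes the two columns of W are linearly independent, and it works through the 2 × 2 Gram matrix W^T W. For every right-hand side it first tries the unconstrained solution; if that has a negative entry, it falls back to the better of the two single-column solutions, clipped at zero, which covers the remaining active sets.

```python
import numpy as np


def nnls_rank2(W, A):
    """Row j of the result solves min_{h >= 0} ||W h - A[:, j]||_2, where W is m x 2."""
    G = W.T @ W                              # 2 x 2 Gram matrix
    B = W.T @ A                              # 2 x n right-hand sides
    H = np.linalg.solve(G, B).T.copy()       # unconstrained solutions, n x 2
    infeasible = np.any(H < 0, axis=1)

    # Best solutions using only column 0 or only column 1, clipped at zero.
    c0 = np.maximum(B[0], 0.0) / G[0, 0]
    c1 = np.maximum(B[1], 0.0) / G[1, 1]
    only0 = np.column_stack([c0, np.zeros_like(c0)])
    only1 = np.column_stack([np.zeros_like(c1), c1])

    # c^2 * ||w||^2 is the drop in squared residual, so the larger value wins.
    pick0 = (c0 ** 2 * G[0, 0]) >= (c1 ** 2 * G[1, 1])
    fallback = np.where(pick0[:, None], only0, only1)

    H[infeasible] = fallback[infeasible]
    return H


W = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
# Columns chosen so that each of the 4 active sets occurs once.
A = np.array([[1.0, 2.0, -1.0, -1.0],
              [-1.0, 3.0, -2.0, 4.0],
              [0.0, 5.0, 0.0, 0.0]])
H = nnls_rank2(W, A)
print(H)
```

No iteration is needed: every column is resolved by a fixed number of arithmetic operations, which is what makes the k = 2 case attractive as a subroutine.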
2.3 Tensors
Tensors are a generalization of matrices in higher dimensions. In this work, a tensor
called “T” is written as T. Tensors are popular in a number of fields such as sig-
nal processing, numerical linear algebra, computer vision, numerical analysis, data
mining, graph analysis, and neuroscience [36].
2.3.1 Tensor Train
One of the problems of working with tensors is the so-called “curse of dimensionality”,
where the number of elements of the tensor is exponential in the number of modes [47].
Some tensor applications can use tens to thousands of modes and so can lead to tensors
of infeasible size in both storage and computation. A solution to this problem is to
use a tensor decomposition that can compress the data and is not exponential in the
number of modes. One such decomposition is called Tensor Train.
Tensor Train (TT) is a low-rank tensor decomposition. It has been used in areas
such as molecular simulations, data completion, uncertainty quantification, and
classification [1]. The “train” of tensor train is a series of tensors, called TT cores. Each
of these tensors, with the exception of the first and last tensors, is a three-way tensor.
The first and last tensor in the train are both matrices. Figure 2.2 shows a diagram
of a five-way tensor in TT format.
[Figure 2.2 diagram: five TT cores with mode sizes I_1, ..., I_5 and TT ranks R_1, ..., R_4; the indices i, j, k, l, m select one slice from each core.]
Figure 2.2: TT format of a five-way tensor $X \in \mathbb{R}^{I_1 \times I_2 \times I_3 \times I_4 \times I_5}$. Note that $R_0 = R_N = 1$ is shown through the first and last TT cores being matrices. The blue shaded regions represent the matrices and vectors required in computing eq. (2.7). Although the $I_n$ can be of any size, they are generally thought to be much larger relative to $R_n$ and so this representation shows tall TT cores.
2.3.2 Tensor Train Notation
Given a tensor $X \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ where N is the number of modes of X and each $I_k$ is
the dimension of that mode, if X can be represented in TT format, then there exist
positive integers $R_0, \ldots, R_N$ with $R_0 = R_N = 1$ and N TT cores where the nth TT
core is $T_{X,n} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$. In other words, X is in TT format if it can be represented
as
\[ X(i_1, \ldots, i_N) = T_{X,1}(i_1, :) \cdots T_{X,n}(:, i_n, :) \cdots T_{X,N}(:, i_N) \tag{2.7} \]
where $T_{X,n}$ is the nth tensor core of N cores in the train [47]. Figure 2.2 shows the
pattern of element access for the entry $X(i, j, k, l, m)$.
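Equation (2.7) says that one entry of X is a product of one slice from each core. The sketch below evaluates an entry this way and compares it with the explicitly assembled full tensor. The dimensions, ranks, and function names are arbitrary choices for this example, and for uniformity the first and last cores are stored as three-way arrays with a leading or trailing dimension of size 1 rather than as matrices.

```python
import numpy as np

rng = np.random.default_rng(3)
dims = [4, 3, 5, 2]          # I_1, ..., I_N
ranks = [1, 2, 3, 2, 1]      # R_0, ..., R_N with R_0 = R_N = 1

# Core n has shape R_{n-1} x I_n x R_n.
cores = [rng.standard_normal((ranks[n], dims[n], ranks[n + 1]))
         for n in range(len(dims))]


def tt_entry(cores, index):
    """Evaluate X(i_1, ..., i_N) as a product of core slices, following eq. (2.7)."""
    v = np.ones((1, 1))
    for core, i in zip(cores, index):
        v = v @ core[:, i, :]            # (1 x R_{n-1}) times (R_{n-1} x R_n)
    return v[0, 0]


def tt_full(cores):
    """Assemble the full tensor; only feasible for small examples."""
    full = cores[0]
    for core in cores[1:]:
        # Contract the trailing rank index with the leading rank index of the next core.
        full = np.tensordot(full, core, axes=1)
    return full[0, ..., 0]               # drop the size-1 R_0 and R_N dimensions


X = tt_full(cores)
entry = tt_entry(cores, (1, 2, 4, 0))
print(entry, X[1, 2, 4, 0])
```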
The integers $R_0, \ldots, R_N$ are called the TT ranks. Reducing these TT ranks
while approximating X yields a tensor in a more compressed format. This
TT rank reduction is called TT rounding.
One of the advantages of using Tensor Train over other low-rank tensor approximations
is that the number of elements of the TT format is linear rather than exponential
in the number of modes of the original tensor. In other words,
\[ |TT(X)| = \sum_{k=1}^{N} R_{k-1} I_k R_k \tag{2.8} \]
where $|TT(X)|$ is the number of elements of the TT representation of X. Note that
eq. (2.8) shows that
\[ |TT(X)| = O(N I R^2) \tag{2.9} \]
where N is the number of modes of X, I is the largest dimension of X and R is the
largest TT-rank of X.
By comparison to eq. (2.9), another decomposition called Tucker has $O(R^N + NIR)$
elements, where R is called the Tucker rank, which might be different from the TT ranks.
TT avoids having a number of elements that is exponential in the number of modes by limiting
the number of modes of each factor tensor.
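The difference between these counts is easiest to see with numbers. The sketch below evaluates eq. (2.8) and the Tucker count for a hypothetical tensor with N = 10 modes of size I = 100 and all interior ranks equal to R = 5; the function names and the sizes are assumptions chosen only for illustration.

```python
def tt_size(dims, ranks):
    """Number of elements in the TT cores, eq. (2.8); ranks = [R_0, ..., R_N]."""
    return sum(ranks[k] * dims[k] * ranks[k + 1] for k in range(len(dims)))


def full_size(dims):
    """Number of elements of the uncompressed tensor."""
    size = 1
    for d in dims:
        size *= d
    return size


def tucker_size(dims, rank):
    """Core tensor with rank^N elements plus one I_k x rank factor matrix per mode."""
    return rank ** len(dims) + sum(d * rank for d in dims)


N, I, R = 10, 100, 5
dims = [I] * N
ranks = [1] + [R] * (N - 1) + [1]

print(full_size(dims))        # 100000000000000000000
print(tucker_size(dims, R))   # 9770625
print(tt_size(dims, ranks))   # 21000
print(N * I * R ** 2)         # 25000, the bound from eq. (2.9)
```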
Some computations with tensors, such as the truncated SVD, require the individ-
ual TT cores to be “unfolded”. Figure 2.3 shows this pattern of unfolding for vertical
and horizontal unfoldings.
[Figure 2.3 diagram: a TT core $T_{X,n} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$, its horizontal unfolding $H(T_{X,n}) \in \mathbb{R}^{R_{n-1} \times I_n R_n}$, and its vertical unfolding $V(T_{X,n}) \in \mathbb{R}^{R_{n-1} I_n \times R_n}$.]
Figure 2.3: Types of unfolding for TT tensors. $T_{X,n}$ is the nth TT core. The blue shaded region is a slice of $T_{X,n}$. $H(T_{X,n})$ is the horizontal unfolding of $T_{X,n}$. $V(T_{X,n})$ is the vertical unfolding of $T_{X,n}$.
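Both unfoldings are reshapes of the same three-way array. The sketch below uses NumPy's default row-major ordering, which is an assumption of this example: it fixes the shapes stated above, but the order of the columns of the horizontal unfolding and of the rows of the vertical unfolding may differ from the convention used elsewhere in this thesis.

```python
import numpy as np

R_prev, I_n, R_next = 2, 4, 3
# A core of shape R_{n-1} x I_n x R_n filled with distinct values.
core = np.arange(R_prev * I_n * R_next, dtype=float).reshape(R_prev, I_n, R_next)

# Horizontal unfolding: R_{n-1} rows and I_n * R_n columns.
H_unf = core.reshape(R_prev, I_n * R_next)

# Vertical unfolding: R_{n-1} * I_n rows and R_n columns.
V_unf = core.reshape(R_prev * I_n, R_next)

a, i, b = 1, 2, 0
print(core[a, i, b], H_unf[a, i * R_next + b], V_unf[a * I_n + i, b])
```

Neither reshape copies or reorders the data in memory here; the two unfoldings are different matrix views of the same entries.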
2.3.3 TT Rounding
The truncated SVD is necessary to reduce the ranks of a TT tensor X. In general,
each TT rank Rn is reduced as the TT rounding algorithm proceeds down the train
of TT cores. The current state-of-the-art method of computing TT rounding requires
an orthogonalization step using the QR decomposition. Although it is quite accurate,
this approach is slow. Chapter 4 describes an improvement on this method that
avoids using the QR orthogonalization step, improving the speed of the overall TT
Rounding algorithm.
Chapter 3: Parallel Hierarchical Clustering using
Rank-Two Nonnegative Matrix Factorization
The following chapter is a manuscript published at the International Conference
on High Performance Computing (HiPC’20), authored by myself, Grey Ballard,
Ramakrishnan Kannan, and Haesun Park. For this work, I contributed to designing and
implementing the parallel algorithms identified in the paper. I also contributed to the
experimental section of the manuscript by reporting results and choosing data sets
for experimentation.
3.1 Abstract
Nonnegative Matrix Factorization (NMF) is an effective tool for clustering nonnega-
tive data, either for computing a flat partitioning of a dataset or for determining a
hierarchy of similarity. In this paper, we propose a parallel algorithm for hierarchical
clustering that uses a divide-and-conquer approach based on rank-two NMF to split a
data set into two cohesive parts. Not only does this approach uncover more structure
in the data than a flat NMF clustering, but also rank-two NMF can be computed
more quickly than for general ranks, providing comparable overall time to solution.
Our data distribution and parallelization strategies are designed to maintain compu-
tational load balance throughout the data-dependent hierarchy of computation while
limiting interprocess communication, allowing the algorithm to scale to large dense
and sparse data sets. We demonstrate the scalability of our parallel algorithm in terms
of data size (up to 800 GB) and number of processors (up to 80 nodes of the Summit
supercomputer), applying the hierarchical clustering approach to hyperspectral imag-
ing and image classification data. Our algorithm for Rank-2 NMF scales perfectly
on up to 1000s of cores and the entire hierarchical clustering method achieves 5.9x
speedup scaling from 10 to 80 nodes on the 800 GB dataset.
3.2 Introduction
Nonnegative Matrix Factorization (NMF) has been demonstrated to be an effective
tool for unsupervised learning problems including clustering [15, 51, 65]. An NMF
consists of two tall-and-skinny nonnegative matrices whose product approximates a
nonnegative data matrix. That is, given an m × n data matrix A, we seek nonnegative
matrices W and H that each have k columns so that $A \approx WH^T$. Each pair of
corresponding columns of W and H forms a latent component of the NMF. If the
rows of A correspond to features and the columns to samples, the ith row of the H
matrix represents the loading of sample i onto each latent component and provides a
soft clustering. Because the W factor is also nonnegative, each column can typically
be interpreted as a latent feature vector for each cluster.
Hierarchical clustering is the process of recursively partitioning a group of samples.
While standard NMF is interpreted as a flat clustering, it can also be extended for
hierarchical clustering. Kuang and Park [38] propose a method that uses rank-2 NMF
to recursively bipartition the samples. The method determines a binary tree such that
all leaves contain unique samples and the structure of the tree determines hierarchical
clusters. A single W vector for each node can also be used for cluster interpretation.
We discuss the hierarchical method in more detail in Section 3.3 and Section 3.4.1.
We illustrate the output of the hierarchical clustering method with an example
data set and output tree. Following Gillis et al. [25], we apply the method to a
hyperspectral imaging (HSI) data set of the Washington, D.C. National Mall, which
has pixel dimensions 1280 × 307 and 191 spectral bands. Figure 3.1 visualizes the
output tree with 6 leaves along with their hierarchical relationships. The root node,
labeled 0, is a flattening of the HSI data to a 2D grayscale image. Each other node is
represented by an overlay of the member pixels of the clusters (in blue) on the original
grayscale image. The first bipartitioning separates vegetation (cluster 1) from non-
vegetation (cluster 2), the bipartitioning of cluster 1 separates grass (cluster 3) from
trees (cluster 4), the bipartitioning of cluster 2 separates buildings (cluster 5) from
sidewalks/water (cluster 6), and so on. If the algorithm continues, it chooses to split
the leaf node that provides the greatest benefit to the overall tree, which can be
quantified as a node’s “score” in various ways.
While the hierarchical clustering method offers advantages in terms of interpre-
tation as well as execution time compared to flat NMF, implementations of the al-
gorithm are limited to single workstations and the dataset must fit in the available
memory. Currently available implementations can utilize multiple cores via MAT-
LAB [38] or explicit shared-memory parallelization in the SmallK library [17].
The goal of this work is to use distributed-memory parallelism to scale the algo-
rithm to large datasets that require the memory of multiple compute nodes and to
high processor counts. While flat NMF algorithms have been scaled to HPC plat-
forms [6, 21, 32, 41], our implementation is the first to our knowledge to scale a hier-
archical NMF method to 1000s of cores. As discussed in detail in Section 3.4.2, we
choose to parallelize the computations associated with each node in the tree, which
involve a Rank-2 NMF and the computation of the node’s score. We choose a data
matrix distribution across processors that avoids any redistribution of the input ma-
trix regardless of the data-dependent structure of the tree’s splitting decisions so that
the communication required involves only the small factor matrices. Analysis of the
algorithm shows the dependence of execution time on computation and communica-
tion costs as well as on k, the number of clusters computed. In particular, we confirm
that many of the dominant costs are logarithmic in k, which compares favorably with
the linear or sometimes superlinear dependence of flat NMF algorithms.
We demonstrate in Section 3.5 the efficiency and scalability of our parallel al-
gorithm on three data sets, including the HSI data of the DC mall and an image
classification data set involving skin melanoma. The experimental results show that
our parallelization of Rank-2 NMF is highly scalable, maintaining computation-bound
performance on 1000s of cores. We also show the limits of strong scalability when
scaling to large numbers of clusters (leaf nodes), as the execution time shifts to
becoming interprocessor bandwidth bound and eventually latency bound. The image
classification data set requires 800 GB of memory across multiple nodes to process,
and in scaling from 10 nodes to 80 nodes of the Summit supercomputer (see Sec-
tion 3.5.1), we demonstrate parallel speedups of 7.1× for a single Rank-2 NMF and
5.9× for a complete hierarchical clustering.
3.3 Preliminaries and Related Work
3.3.1 Non-negative Matrix Factorization (NMF)
The NMF constrained optimization problem
$$\min_{W,H \ge 0} \|A - WH^T\|^2$$
is nonlinear and nonconvex, and various optimization techniques can be used to ap-
proximately solve it. A popular approach is to use alternating optimization of the
two factor matrices because each subproblem is a nonnegative least squares (NNLS)
problem, which is convex and can be solved exactly. Many block coordinate descent
(BCD) approaches are possible [34], and one 2-block BCD algorithm that solves the
NNLS subproblems exactly is block principal pivoting [35]. This NNLS algorithm is
an active-set-like method that determines the sets of entries in the solution vectors
that are zero and those that are positive through an iterative but finite process.
Figure 3.1: Hierarchical Clustering of DC Mall HSI
When the rank of the factorization (the number of columns of W and H) is
2, the NNLS subproblems can be solved much more quickly because the number
of possible active sets is only 4. As explained in more detail in Section 3.4.1, the
optimal solution across the 4 sets can be determined efficiently to solve the NNLS
subproblem more quickly than general-rank approaches like block principal pivoting.
Because of the relative ease of solving the NMF problem for the rank-2 case, Kuang
and Park [38] propose a recursive method to use a rank-2 NMF to partition the input
data into 2 parts, whereby each part can be further partitioned via rank-2 NMF
of the corresponding original data. This approach yields a hierarchical factorization,
potentially uncovering more global structure of the input data and allowing for better
scalability of the algorithm to large NMF ranks.
The hierarchical rank-2 NMF method has been applied to document clustering [38]
and hyperspectral image segmentation [25]. The leaves of the tree also yield a set of
column vectors that can be aggregated into an approximate W factor (ignoring their
hierarchical structure). Using this factor matrix to initialize a higher-rank NMF com-
putation leads to quick convergence and overall faster performance than initializing
NMF with random data; this approach is known as Divide-and-Conquer NMF [19].
We focus in this paper on parallelizing the hierarchical algorithms proposed by Kuang
and Park [38] and Gillis et al. [25].
3.3.2 Parallel NMF
Scaling algorithms for NMF to large data often requires parallelization in order to fit
the data across the memories of multiple compute nodes or speed up the computation
to complete in reasonable time. Parallelizations of multiple optimization approaches
have been proposed for general NMF [6, 17, 21, 32, 41]. In particular, we build upon
the work of Kannan et al. [20, 31, 32] and the open-source library PLANC, designed
for nonnegative matrix and tensor factorizations of dense and sparse data. In this
parallelization, the alternating optimization approach is employed with various op-
tions for the algorithm used to (approximately) solve the NNLS subproblems. The
efficiency of the parallelization is based on scalable algorithms for the parallel ma-
trix multiplications involved in all NNLS algorithms; these algorithms are based on
Cartesian distributions of the input matrix across 1D or 2D processor grids.
3.3.3 Communication Model
We use the α-β-γ model [4, 11, 58] for analysis of distributed-memory parallel algo-
rithms. In this model, the cost of sending a single message of n words of data between
two processors is α + β · n, so that α represents the latency cost of the message and
β represents the bandwidth cost of each word in the message. The γ parameter
represents the computational cost of a single floating point operation (flop). In this
simplified communication model, we ignore contention in the network, assuming in
effect a fully connected network, and other limiting factors in practice such as the
number of hops between nodes and the network injection rate [28]. We let p represent
the number of processors available on the machine.
All of the interprocessor communication in the algorithms presented in this work
is encapsulated in collective communication operations that involve the full set of
processors. Algorithms for implementing the collective operations are built out of
pairwise send and receive operations, and we assume the most efficient algorithms are
used in our analysis [11, 58]. The collectives used in our algorithms are all-reduce,
all-gather, and reduce-scatter. In an all-reduce, all processors start out with the same
amount of data and all end with a copy of the same result, which is in our case a sum
of all the inputs (and the same size as a single input). The cost of an all-reduce of size
n words is $\alpha \cdot O(\log p) + (\beta + \gamma) \cdot O(n)$ for $n > p$ and
$\alpha \cdot O(\log p) + (\beta + \gamma) \cdot O(n \log p)$ for $n < p$. In an all-gather,
all processors start out with separate data and all end with a copy of the same result,
which is the union of all the input data. If each processor starts with n/p data and ends
with n data, the cost of the all-gather is $\alpha \cdot O(\log p) + \beta \cdot O(n)$. In a
reduce-scatter, all processors start out with the same amount of data and all end with a
subset of the result, which is in our case a sum of all the inputs (and is smaller than its
input). If each processor starts with n data and ends with n/p data, the cost of the
reduce-scatter is $\alpha \cdot O(\log p) + (\beta + \gamma) \cdot O(n)$. In the case of
all-reduce and reduce-scatter, the computational cost is typically dominated by the
bandwidth cost because $\beta \gg \gamma$.
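The collective costs above can be written down as a small calculator. The following Python sketch is only an illustration of the model: the function names are hypothetical, the constants hidden by the O(·) notation are taken to be 1, logarithms are taken base 2, and the boundary case n = p (which the text leaves unspecified) is assumed to follow the n > p formula.

```python
import math

def all_reduce_cost(n, p, alpha, beta, gamma):
    """Modeled cost of an all-reduce of n words on p processors."""
    log_p = math.log2(p)
    if n >= p:  # assumption: n == p is treated like n > p
        return alpha * log_p + (beta + gamma) * n
    return alpha * log_p + (beta + gamma) * n * log_p

def all_gather_cost(n, p, alpha, beta):
    """Modeled cost of an all-gather whose result has n words."""
    return alpha * math.log2(p) + beta * n

def reduce_scatter_cost(n, p, alpha, beta, gamma):
    """Modeled cost of a reduce-scatter whose input has n words per processor."""
    return alpha * math.log2(p) + (beta + gamma) * n
```

With machine parameters α, β, and γ supplied by the user, these functions make it easy to see when a collective is latency bound (small n) or bandwidth bound (large n).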
3.4 Algorithms
3.4.1 Sequential Algorithms
Rank-2 NMF
Using the 2-block BCD approach for a rank-2 NMF yields NNLS subproblems of the
form $\min_{H \ge 0} \|WH^T - A\|$ and $\min_{W \ge 0} \|HW^T - A^T\|$. In each case, the
columns of the transposed variable matrix can be computed independently. Considering
the ith row of H, for example, the NNLS problem to solve is
$$\min_{h_{i,1},h_{i,2} \ge 0} \left\| \begin{bmatrix} w_1 & w_2 \end{bmatrix} \begin{bmatrix} h_{i,1} \\ h_{i,2} \end{bmatrix} - a_i \right\| = \min_{h_{i,1},h_{i,2} \ge 0} \left\| h_{i,1} w_1 + h_{i,2} w_2 - a_i \right\|$$
where $w_1$ and $w_2$ are the two columns of W and $a_i$ is the ith column of A. We note that
there are four possibilities of solutions, as each of the two variables may be positive
or zero.
As shown by Kuang and Park [38], determining which of the four possible solutions
is feasible and optimal can be done efficiently by exploiting the following properties:
• if the solution to the unconstrained least squares problem admits two positive
values, it is the optimal solution to the nonnegatively constrained problem,
• if W and A are both nonnegative, then the candidate solution with two zero
values is never (uniquely) optimal and can be discarded, and
• if the unconstrained problem does not admit a positive solution, the better of
the two remaining solutions can be determined by comparing $a_i^T w_1 / \|w_1\|$ and
$a_i^T w_2 / \|w_2\|$.
If the unconstrained problem is solved via the normal equations, then the temporary
matrices computed for the normal equations ($W^TW$ and $A^TW$) can be re-used to
determine the better of the two solutions with a single positive variable.
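The comparison used for the single-variable candidates follows from a one-line least squares calculation, given here as a supporting derivation rather than as part of the cited method. For the candidate in which only the first variable is nonzero (and assuming nonnegative data, so that $a_i^T w_1 \ge 0$),

$$
h^\star = \arg\min_{h \ge 0} \|h w_1 - a_i\|^2 = \frac{a_i^T w_1}{\|w_1\|^2},
\qquad
\|h^\star w_1 - a_i\|^2 = \|a_i\|^2 - \left(\frac{a_i^T w_1}{\|w_1\|}\right)^2 .
$$

The same expressions hold with $w_2$ in place of $w_1$, so the candidate with the larger value of $a_i^T w_k / \|w_k\|$ leaves the smaller residual. Since $a_i^T w_k$ is the entry $c_{ik}$ of $C = A^TW$ and $\|w_k\|^2$ is the diagonal entry $g_{kk}$ of $G = W^TW$, the test amounts to comparing $c_{i1}/\sqrt{g_{11}}$ with $c_{i2}/\sqrt{g_{22}}$, and the selected value is $c_{ik}/g_{kk}$.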
Algorithm 1 implements this strategy for all rows of H simultaneously. It takes as
input the matrices $C = A^TW$ and $G = W^TW$, first solves the normal equations for
the unconstrained problem, and then chooses between the two alternate possibilities as
necessary. We note that each row of H is independent, and therefore this algorithm is
easily parallelized. Solving for W can be done using inputs $C = AH$ and $G = H^TH$.
Given that the computational complexity of Algorithm 1 is O(n) (or O(m) when
computing W), and the complexity of computing $W^TW$ and $H^TH$ is O(m + n), the
typical dominant cost of each iteration of Rank-2 NMF is that of computing $A^TW$
and $AH$, which is O(mn).
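As a concrete illustration of the rank-2 solve described above, the following NumPy sketch vectorizes the row-wise logic of Algorithm 1. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `rank2_nls_solve` and its variable names are hypothetical, and G is assumed to be symmetric positive definite so that the linear solve is well defined.

```python
import numpy as np

def rank2_nls_solve(C, G):
    """Rank-2 NNLS solve: C is n x 2 (e.g. A^T W), G is 2 x 2 s.p.d. (e.g. W^T W)."""
    # Unconstrained solution H = C G^{-1}; G is symmetric, so solve G X = C^T.
    H = np.linalg.solve(G, C.T).T
    # Rows with a negative entry must fall back to a single-variable solution.
    needs_fix = (H < 0).any(axis=1)
    zeros = np.zeros(C.shape[0])
    only_first = np.column_stack([C[:, 0] / G[0, 0], zeros])
    only_second = np.column_stack([zeros, C[:, 1] / G[1, 1]])
    # Choose the candidate with the larger normalized inner product.
    prefer_second = C[:, 0] / np.sqrt(G[0, 0]) < C[:, 1] / np.sqrt(G[1, 1])
    single = np.where(prefer_second[:, None], only_second, only_first)
    return np.where(needs_fix[:, None], single, H)
```

A call such as `rank2_nls_solve(A.T @ W, W.T @ W)` updates H, and `rank2_nls_solve(A @ H, H.T @ H)` updates W, mirroring the two half-steps of one Rank-2 NMF iteration.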
Algorithm 1 Rank-2 Nonnegative Least Squares Solve [38]
Require: C is n × 2 and G is 2 × 2 and s.p.d.
 1: function H = Rank2-NLS-Solve(C, G)
 2:     H = C G^{-1}                              ▷ Solve unconstrained system
 3:     for i = 1 to n do
 4:         if h_{i1} < 0 or h_{i2} < 0 then
 5:             ▷ Choose between single-variable solutions
 6:             if c_{i1}/√g_{11} < c_{i2}/√g_{22} then
 7:                 h_{i1} = 0
 8:                 h_{i2} = c_{i2}/g_{22}
 9:             else
10:                 h_{i1} = c_{i1}/g_{11}
11:                 h_{i2} = 0
12:             end if
13:         end if
14:     end for
15: end function
Ensure: H = arg min_{H ≥ 0} ‖A − W H^T‖ is n × 2 with C = A^T W and G = W^T W

Hierarchical Clustering
A Rank-2 NMF can be used to partition the columns of the matrix into two parts.
In this case, the columns of the W factor represent feature weights for each of the
two latent components, and the strength of membership in the two components for
each column of A is given by the two values in the corresponding row of H. We can
determine part membership by comparing those values: if $h_{i1} > h_{i2}$, then column i of
A is assigned to the first part, which is associated with feature vector $w_1$. Membership
can be determined by other metrics that also take into account balance across parts
or attempt to detect outliers.
Given Rank-2 NMF as a splitting procedure, hierarchical clustering builds a binary
tree such that each node corresponds to a subset of samples from the original data
set and each node’s children correspond to a 2-way partition of the node’s samples.
In this way, the leaves form a partition of the original data, and the internal nodes
specify the hierarchical relationship among clusters. As the tree is built, nodes are
split in order of their score, or relative value to the overall clustering of the data.
The process can be continued until a target number of leaves is produced or until all
remaining leaves have a score below a given threshold.
Figure 3.2: Hierarchy node classification (internal, frontier, and leaf nodes)

A node’s score can be computed in different ways. For document clustering, Kuang
and Park [38] propose using modified normalized discounted cumulative gain, which
measures how distinct a node’s children are from each other using the feature weights
associated with the node and its children. For hyperspectral imaging data, Gillis et
al. [25] propose using the possible reduction in overall NMF error if the node is split
– the difference in error between using the node itself or using its children. We use
the latter in our implementation.
In any case, a node’s score depends on properties of its children, so the compu-
tation for a split must be done before the split is actually accepted. To this end,
we define a frontier node to be a parent of leaves; these are nodes whose children
have been computed but whose splits have not been accepted. Figure 3.2 depicts the
classification of nodes into internal, frontier, and leaf nodes. As the tree is built, the
algorithm selects the frontier node with the highest score to split, though no compu-
tation is required to split the node. When a frontier node split is accepted, it becomes
an internal node and its children are split (so that their scores can be computed) and
added to the set of frontier nodes. When the algorithm terminates, the leaves are
discarded and the frontier nodes become the leaves of the output tree.
Our hierarchical clustering algorithm is presented in Algorithm 2 and follows that
of Kuang and Park [38]. Each node includes a field A, which is a subset of columns
(samples) of the original data, a feature vector w, which is its corresponding column
of the W matrix from its parent’s Rank-2 NMF, a score, and pointers to its left and
right children. A priority queue Q tracks the frontier nodes so that the node with the
highest score is split at each step of the algorithm. We use a target number of leaf
clusters k as the termination condition. When a node is selected from the priority
queue, it is removed from the set of frontier nodes and its children are added.
Algorithm 2 Hierarchical Clustering [38]
Require: A is m × n, k is target number of leaf clusters
 1: function T = Hier-R2-NMF(A)
 2:     R = node(A)                     ▷ create root node
 3:     Split(R)
 4:     inject(Q, R.left)               ▷ create priority queue
 5:     inject(Q, R.right)              ▷ of frontier nodes
 6:     while size(Q) < k do
 7:         N = eject(Q)                ▷ frontier node with max score
 8:         Split(N.left)               ▷ split left child
 9:         inject(Q, N.left)           ▷ and add to Q
10:         Split(N.right)              ▷ split right child
11:         inject(Q, N.right)          ▷ and add to Q
12:     end while
13: end function
Ensure: T is binary tree rooted at R with k frontier nodes, each node has subset of
cols of A and feature vector w
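The frontier bookkeeping can be sketched with a binary heap. The following Python illustration is a minimal sketch, not the authors' implementation: the `Node` class, `build_hierarchy`, and the caller-supplied `split` callback are hypothetical names, and the sketch assumes that the queue is seeded with the already-split root so that every queued node has a score.

```python
import heapq
import itertools

class Node:
    """Tree node holding a set of samples; children and score are set by split()."""
    def __init__(self, items):
        self.items = items
        self.left = None
        self.right = None
        self.score = None

def build_hierarchy(root_items, k, split):
    """Grow a binary tree until it has k frontier nodes.

    split(node) is assumed to set node.left, node.right, and node.score.
    """
    root = Node(root_items)
    split(root)
    tie = itertools.count()  # tie-breaker so that nodes are never compared
    frontier = [(-root.score, next(tie), root)]  # max-heap via negated score
    while len(frontier) < k:
        _, _, node = heapq.heappop(frontier)  # frontier node with max score
        for child in (node.left, node.right):
            split(child)  # children must be split before they can be scored
            heapq.heappush(frontier, (-child.score, next(tie), child))
    return root, [entry[2] for entry in frontier]
```

Each pass through the loop removes one frontier node and adds two, so the loop ends with exactly k frontier nodes. In the clustering setting, `split` would run a Rank-2 NMF on the node's columns and compute the node's score.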
The splitting procedure is specified in Algorithm 3. After the Rank-2 NMF is
performed, the H factor is used to determine part membership, and the columns of
the W factor are assigned to the child nodes. The score of the node is computed as
the reduction in overall NMF error if the node is split, which can be computed from
the principal singular values of the subsets of columns of the node and its children,
as given in Line 6. The principal singular values of the children are computed via the
power method. Note that the principal singular value of the node itself need not be
recomputed as it was needed for its parent’s score.
Algorithm 3 Node Splitting via Rank-Two NMF
Require: N has a subset of columns given by field A
 1: function Split(N)
 2:     [W, H] = Rank2-NMF(N.A)             ▷ split N
 3:     partition N.A into A_1 and A_2 using H
 4:     N.left = node(A_1, w_1)             ▷ create left child
 5:     N.right = node(A_2, w_2)            ▷ create right child
 6:     N.score = σ_1^2(A_1) + σ_1^2(A_2) − σ_1^2(N.A)
 7: end function
Ensure: N has two children and a score
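The partitioning and scoring steps of Algorithm 3 can be illustrated in NumPy once the factors W and H are available. This is a hedged sketch: `split_node` and `sigma1_sq` are hypothetical names, the rank-2 factors are taken as inputs rather than computed, and the squared principal singular value is obtained from a dense matrix 2-norm as a stand-in for the power method used in the text.

```python
import numpy as np

def sigma1_sq(M):
    """Square of the largest singular value (dense stand-in for the power method)."""
    return np.linalg.norm(M, 2) ** 2

def split_node(A, W, H):
    """Partition the columns of A using H (n x 2) and score the split."""
    left = H[:, 0] > H[:, 1]          # column i goes left if h_i1 > h_i2
    A1, A2 = A[:, left], A[:, ~left]  # assumes both parts are nonempty
    score = sigma1_sq(A1) + sigma1_sq(A2) - sigma1_sq(A)
    return (A1, W[:, 0]), (A2, W[:, 1]), score
```

The score is large when the two children are each well approximated by a single rank-one term while the parent is not, which is the reduction in error that the splitting order relies on.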
3.4.2 Parallelization
In this section, we consider the options for parallelizing Hierarchical Rank-2 NMF
Clustering (Algorithm 2) and provide an analysis for our approach. The running
time of the algorithm is data dependent because not only does each Rank-2 NMF
computation require a variable number of iterations, but also the shape of the tree
can vary from a balanced binary tree with O(log k) levels to a tall, flat tree with O(k)
levels. For the sake of analysis, we will assume a fixed number of NMF iterations for
every node of the tree and we will analyze the cost of complete levels.
The first possibility for parallelization is across the nodes of the tree, as each Rank-
2 NMF split is independent. We choose not to parallelize across nodes in the tree for
two reasons. The first reason is that while the NMF computations are independent,
choosing which nodes to split may depend on global information. In particular, when
the global target is to determine k leaf clusters, the nodes must be split in order
of their scores, which leads to a serialization of the node splits. This serialization
might be relaxed using speculative execution, but it risks performing unnecessary
computation. If the global target is to split all nodes with sufficiently high scores,
then this serialization is also avoided and node splits become truly independent. We
choose not to parallelize in this way to remain agnostic to the global stopping criterion.
The second reason is that parallelizing across nodes requires redistribution of the
input data. Given a node split by p processors, in order to assign disjoint sets of
processors to each child node, each of the p processors would have to redistribute
their local data, sending data for samples not in their child’s set and receiving data
for those in their child’s set. The communication would be data dependent, but on
average, each processor would communicate half of its data in the redistribution set,
which could have an all-to-all communication pattern among the p processors. For
a node with n columns, the communication cost would be at least O(mn/p) words,
which is much larger than the communication cost per iteration of Parallel Rank-2
NMF, as we will see in Section 3.4.2.
By choosing not to parallelize across nodes in the tree, we employ all p proces-
sors on each node, and split nodes in sequence. The primary computations used to
split a node are the Rank-2 NMF and the score computation, which is based on an
approximation of the largest singular value. We use an alternating-updating algo-
rithm for Rank-2 NMF as described in Section 3.3, and we parallelize it following the
methodology proposed in [20] and presented in Algorithm 4.
The communication cost of the algorithm depends on the parallel distribution of
the input matrix data A. In order to avoid redistribution of the matrix data, we choose
a 1D row distribution so that each processor owns a subset of the rows of A. Because
the clustering partition splits the columns of A, each processor can partition its
local data into left and right children to perform the split without any interprocessor
communication. If we use a 2D distribution for a given node, then because the
partition is data dependent, a data redistribution is required in order to obtain a
load balanced distribution of both children. Figure 3.3 presents a visualization of the
node-splitting process using a 1D processor distribution. In the following subsections,
we describe the parallel algorithms for Rank-2 NMF and approximating the principal
singular value given this 1D data distribution and analyze their complexity in the
context of the hierarchical clustering algorithm.

Figure 3.3: Parallel splitting using Rank-2 NMF and 1D processor distribution. A
Rank-2 NMF computes factor matrices W and H to approximate A, the values of H
are used to determine child membership of each column (either red or blue), and the
corresponding column of the W matrix represents the part’s feature weighting. The
1D distribution is depicted for 3 processors to show that splitting requires no
interprocessor redistribution as children are evenly distributed identically to the parent.
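The reason a 1D row distribution avoids redistribution can be seen in a few lines of NumPy. In this illustrative sketch, a Python list of row blocks stands in for the processors' local matrices, and `split_local_blocks` is a hypothetical helper; no MPI calls are involved.

```python
import numpy as np

def split_local_blocks(A_blocks, left_mask):
    """Each simulated processor selects columns of its own row block.

    A_blocks: list of (m/p) x n arrays, one per processor.
    left_mask: boolean vector of length n, known redundantly on every processor.
    """
    left = [block[:, left_mask] for block in A_blocks]
    right = [block[:, ~left_mask] for block in A_blocks]
    return left, right
```

Every processor keeps the same number of rows in both children, so each child inherits the parent's load balance and no matrix entries cross processor boundaries.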
Algorithms
Parallel Rank-2 NMF Algorithm 4 presents the parallelization of an alternating-
updating scheme for NMF that uses the exact rank-2 solve algorithm presented in
Algorithm 1 to update each factor matrix. The algorithm computes the inputs to
the rank-2 solves in parallel and then exploits the parallelism across rows of the
factor matrix so that each processor solves for a subset of rows simultaneously. The
distribution of all matrices is 1D row distribution, so that each processor owns a
subset of the rows of A, W, and H. We use the notation A to refer to the (m/p)×n
local data matrix and W and H to refer to the (m/p)× 2 and (n/p)× 2 local factor
matrices. With this distribution, the computation of $W^TW$ and $H^TH$ each is done
via local multiplication followed by a single all-reduce collective. All processors own
the data they need to compute their contribution to $A^TW$; in order to distribute
the result to compute the rows of H independently, a reduce-scatter collective is used
to sum and simultaneously distribute across processors. To obtain the data needed
to compute W, each processor must access all of H, which is performed via an all-
gather collective. The iteration progresses until a convergence criterion is satisfied.
For performance benchmarking we use a fixed number of iterations, and in practice
we use relative change in objective function value (residual norm).
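The communication pattern of one iteration of Algorithm 4 can be simulated without MPI by keeping one array per processor in a Python list. The sketch below is an illustration under stated assumptions rather than the PLANC implementation: the function names are hypothetical, sums over the list stand in for all-reduce and reduce-scatter, `np.vstack` stands in for all-gather, and the Gram matrices are assumed to be nonsingular.

```python
import numpy as np

def rank2_nls_solve(C, G):
    """Vectorized form of Algorithm 1 (C is n x 2, G is 2 x 2 s.p.d.)."""
    H = np.linalg.solve(G, C.T).T
    needs_fix = (H < 0).any(axis=1)
    zeros = np.zeros(C.shape[0])
    only_first = np.column_stack([C[:, 0] / G[0, 0], zeros])
    only_second = np.column_stack([zeros, C[:, 1] / G[1, 1]])
    prefer_second = C[:, 0] / np.sqrt(G[0, 0]) < C[:, 1] / np.sqrt(G[1, 1])
    single = np.where(prefer_second[:, None], only_second, only_first)
    return np.where(needs_fix[:, None], single, H)

def parallel_rank2_nmf_iteration(A_blocks, W_blocks, n_counts):
    """One simulated iteration with p = len(A_blocks) processors.

    A_blocks[q], W_blocks[q]: rows of A and W owned by processor q.
    n_counts[q]: number of rows of H owned by processor q.
    """
    # Compute H
    G_W = sum(W.T @ W for W in W_blocks)                      # all-reduce (2 x 2)
    B_sum = sum(A.T @ W for A, W in zip(A_blocks, W_blocks))  # reduce ...
    C_blocks = np.split(B_sum, np.cumsum(n_counts)[:-1])      # ... and scatter
    H_blocks = [rank2_nls_solve(C, G_W) for C in C_blocks]    # local solves
    # Compute W
    G_H = sum(H.T @ H for H in H_blocks)                      # all-reduce (2 x 2)
    H_full = np.vstack(H_blocks)                              # all-gather (n x 2)
    W_blocks = [rank2_nls_solve(A @ H_full, G_H) for A in A_blocks]
    return W_blocks, H_blocks
```

Only the 2 × 2 Gram matrices and n × 2 factor data are combined across the simulated processors; the blocks of A are read locally and never exchanged.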
Algorithm 4 Parallel Rank-2 NMF
Require: A is m × n and row-distributed across processors so that A is local (m/p) × n
submatrix
 1: function [W, H] = Parallel-Rank2-NMF(A)
 2:     Initialize local W randomly
 3:     while not converged do
 4:         ▷ Compute H
 5:         G_W = W^T W
 6:         G_W = All-Reduce(G_W)
 7:         B = A^T W
 8:         C = Reduce-Scatter(B)
 9:         H = Rank2-NLS-Solve(C, G_W)
10:         ▷ Compute W
11:         G_H = H^T H
12:         G_H = All-Reduce(G_H)
13:         H = All-Gather(H)
14:         D = A H
15:         W = Rank2-NLS-Solve(D, G_H)
16:     end while
17: end function
Ensure: A ≈ W H^T with W, H row-distributed

Parallel Power Method In order to compute the score for a frontier node, we
use the difference between the principal singular value of the matrix columns of the
node and the sum of those of its children. Thus, we must determine the principal
singular value of every node in the tree once, including leaf nodes. We use the power
method to approximate it, repeatedly applying $A^TA$ to a vector until it converges to
the leading right singular vector. We present the power method in Algorithm 5. Note
that we do not normalize the approximate left singular vector so that the computed
value approximates the square of the largest singular value.
Given the 1D distribution, only one communication collective is required for the
pair of matrix-vector multiplications. That is, the approximate right singular vector v
is redundantly owned on each processor, and the approximate left singular vector u is
distributed across processors. Each processor can compute its local u from v without
communication and use the result for its contribution to $v = A^Tu$. An all-reduce
collective is used to obtain a copy of v on every processor for the next iteration,
and the norm is redundantly computed without further communication. We used the
relative change in σ as the stopping criterion for benchmarking.
Algorithm 5 Parallel Power Method
Require: A is m × n and row-distributed across processors so that A is local (m/p) × n
submatrix
 1: function σ = Parallel-Power-Method(A)
 2:     Initialize v randomly and redundantly
 3:     while not converged do
 4:         u = A v
 5:         z = A^T u
 6:         v = All-Reduce(z)
 7:         σ = ‖v‖
 8:         v = v/σ
 9:     end while
10: end function
Ensure: σ ≈ σ_1^2(A) is redundantly owned by all procs
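Algorithm 5 can likewise be simulated with a list of row blocks. The following Python sketch is illustrative only: `parallel_power_method`, `tol`, and `max_iters` are hypothetical names and defaults, the sum over blocks stands in for the all-reduce, and the starting vector is normalized for convenience (an addition not present in Algorithm 5).

```python
import numpy as np

def parallel_power_method(A_blocks, tol=1e-12, max_iters=1000, seed=0):
    """Approximate the squared largest singular value of the stacked blocks."""
    n = A_blocks[0].shape[1]
    v = np.random.default_rng(seed).random(n)  # same v on every processor
    v /= np.linalg.norm(v)
    sigma_sq = 0.0
    for _ in range(max_iters):
        u_blocks = [A @ v for A in A_blocks]                  # local, no communication
        z = sum(A.T @ u for A, u in zip(A_blocks, u_blocks))  # all-reduce of size n
        new_sigma_sq = np.linalg.norm(z)                      # computed redundantly
        v = z / new_sigma_sq
        if abs(new_sigma_sq - sigma_sq) <= tol * new_sigma_sq:
            return new_sigma_sq                               # relative change is small
        sigma_sq = new_sigma_sq
    return sigma_sq
```

Because u is never normalized, the norm of z approximates the square of the largest singular value, which is the quantity the node score needs.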
Analysis
Parallel Rank-2 NMF Each iteration of Algorithm 4 incurs the same cost, so we
analyze per-iteration computation and communication costs. We first consider the
cost of the Rank-2 NNLS solves, which are local computations. In the notation of
Algorithm 1, matrix G is 2 × 2, so solving the unconstrained system (via Cholesky
decomposition) and then choosing between single-positive-variables solutions if neces-
sary requires constant time per row of C. Thus, the cost of Algorithm 1 is proportional
to the number of rows of the first input matrix. In the context of Algorithm 4, the
per-iteration computational cost of rank-2 solves is then O((m + n)/p). The other
local computations are the matrix multiplications $W^TW$ and $H^TH$, which also amount
to O((m + n)/p) flops, and $A^TW$ and $AH$, which require O(mn/p) flops because they
involve the data matrix. Thus, the computation cost is $\gamma \cdot O((mn + m + n)/p)$ and
typically dominated by the multiplications involving A. We track the lower order
terms corresponding to NNLS solves because their hidden constants are larger than
that of the dominating term.
There are four communication collectives each iteration, and each involves all p
processors. The two all-reduce collectives to compute the Gram matrices of the factor
matrices involve 2×2 matrices and incur a communication cost of (γ+β+α)·O(log p).
The reduce-scatter and all-gather collectives involve n × 2 matrices (the size of H)
and require β ·O(n) +α ·O(log p) in communication cost (we ignore the computation
cost of the reduce-scatter because it is typically dominated by the bandwidth cost).
If the algorithm performs $\imath$ iterations, the overall cost of Algorithm 4 is
$$\gamma \cdot O\left(\frac{\imath(mn + m + n)}{p}\right) + \beta \cdot O(\imath n) + \alpha \cdot O(\imath \log p). \tag{3.1}$$
Parallel Power Method Similar to the previous analysis, we consider a single
iteration of the power method. The local computation is dominated by two matrix-
vector products involving the local data matrix of size O(mn/p) words, incurring
O(mn/p) flops. The single communication collective is an all-reduce of the approxi-
mate right singular vector, which is of size n, incurring β ·O(n) +α ·O(log p) commu-
nication. We ignore the O(n) computation cost of normalizing the vector, as it will
typically be dominated by the communication cost of the all-reduce. Over $\jmath$ iterations,
Algorithm 5 has an overall cost of
$$\gamma \cdot O\left(\frac{\jmath mn}{p}\right) + \beta \cdot O(\jmath n) + \alpha \cdot O(\jmath \log p). \tag{3.2}$$
Note the per-iteration cost of the power method differs by only a constant from the
per-iteration cost of Rank-2 NMF. Because the power method involves single vectors
rather than factor matrices with two columns, its constants are smaller than half the
size of their counterparts.
Hierarchical Clustering To analyze the overall cost of the hierarchical clustering
algorithm, we sum the costs over all nodes in the tree. Because the shape of the tree
is data dependent and affects the overall costs, for the sake of analysis we will analyze
only complete levels. The number of rows in any node is m, the same as the root node,
as each splitting corresponds to a partition of the columns. Furthermore, because each
split is a partition, every column of A is represented exactly once in every complete
level of the tree. If we assume that all nodes perform the same number of NMF
iterations ($\imath$) and power method iterations ($\jmath$), then the dominating cost of a node
with n columns is
$$\gamma \cdot O\left(\frac{(\imath + \jmath)mn + \imath(m + n)}{p}\right) + \beta \cdot O((\imath + \jmath)n) + \alpha \cdot O((\imath + \jmath) \log p).$$
Because the sum of the number of columns across any level of the tree is n, the cost
of the $\ell$th level of the tree is
$$\gamma \cdot O\left(\frac{(\imath + \jmath)mn + \imath m 2^\ell}{p}\right) + \beta \cdot O((\imath + \jmath)n) + \alpha \cdot O((\imath + \jmath) 2^\ell \log p). \tag{3.3}$$
Note that the only costs that depend on the level index $\ell$ are the latency cost and a
lower-order computational cost.
Summing over levels and assuming the tree is nearly balanced and has height
O(log k) where k is the number of frontier nodes, we obtain an overall cost of
Algorithm 2 of
$$\gamma \cdot O\left(\frac{(\imath + \jmath)mn}{p}\log k + \frac{\imath mk}{p}\right) + \beta \cdot O((\imath + \jmath)n \log k) + \alpha \cdot O((\imath + \jmath)k \log p). \tag{3.4}$$
We see that the leading order computational cost is logarithmic in k and perfectly
load balanced. If the overall running time is dominated by the computation (and
in particular the matrix multiplications involving A), we expect near-perfect strong
scaling. The bandwidth cost is also logarithmic in k but does not scale with the
number of processors. The latency cost grows most quickly with the target number
of clusters k but is also independent of the matrix dimensions m and n.
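The passage from the per-level cost (3.3) to the overall cost (3.4) rests on two elementary sums, spelled out here as a reading aid. The symbol L, the number of complete levels, is notation introduced only for this note. Terms that do not depend on the level index are multiplied by L, while terms proportional to $2^\ell$ form a geometric series:

$$
\sum_{\ell=0}^{L-1} 1 = L = O(\log k),
\qquad
\sum_{\ell=0}^{L-1} 2^\ell = 2^L - 1 = O(k) \quad \text{when } 2^L = O(k).
$$

This is why the matrix-multiplication and bandwidth terms pick up a factor of log k, whereas the lower-order computation term and the latency term grow linearly in k.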
3.5 Experimental Results
3.5.1 Experimental Platform
All the experiments in this section were conducted on Summit. Summit is a su-
percomputer created by IBM for the Oak Ridge National Laboratory. There are
approximately 4,600 nodes on Summit. Each node contains two IBM POWER9 pro-
cessors on separate sockets with 512 GB of DDR4 memory. Each POWER9 processor
utilizes 22 IBM SIMD Multi-Cores (SMCs), although one of these SMCs on each pro-
cessor is dedicated to memory transfer and is therefore not available for computation.
For node scaling experiments, all 42 available SMCs were utilized in each node so
that every node computed with 42 separate MPI processes. Additionally, every node
also supports six NVIDIA Volta V100 accelerators but these were unused by our
algorithm.
Our implementation builds on the PLANC open-source library [20] and uses the
Armadillo library (version 9.900.1) for all matrix operations. On Summit, we linked
this version of Armadillo with OpenBLAS (version 0.3.9) and IBM’s Spectrum MPI
(version 10.3.1.2-20200121).
3.5.2 Datasets
Hyperspectral Imaging We use the Hyperspectral Digital Imagery Collection Ex-
periment (HYDICE) image of the Washington DC Mall. We will refer to this dataset
as DC-HYDICE [40]. DC-HYDICE is formatted into a 3-way tensor representing two
spatial dimensions of pixels and one dimension of spectral bands. So, a slice along the
spectral band dimension would be the full DC-HYDICE image in that spectral band.
For hierarchical clustering, these tensors are flattened so that the rows represent the
191 spectral bands and the columns represent the 392960 pixels. The data set is
approximately 600 MB in size.
Image Classification The SIIM-ISIC Melanoma classification dataset, which we
will refer to as SIIM-ISIC [52], consists of 33126 RGB training images equally sized
at 1024 × 1024. Unlike with hyperspectral imaging, the resulting matrix used in hi-
erarchical clustering consists of image pixels along the rows and individual images
along the columns. So, the resulting matrix is 3145728 × 33126, which is
approximately 800 GB in size. Given its size, SIIM-ISIC requires 10 Summit nodes to
perform hierarchical clustering.
Synthetic Dataset Our synthetic dataset has the same aspect ratio as SIIM-ISIC
but has fewer rows and columns by a factor of 3. The resulting matrix is
1048576 × 11042. We choose the smaller size so that it fits on a single node for
scaling experiments.
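As a quick sanity check on the sizes quoted above, the following sketch recomputes the dense storage of each matrix. It assumes 8-byte double-precision entries, which the text does not state explicitly; the helper name `dense_size_bytes` is purely illustrative.

```python
# Assumption: entries are stored densely as 8-byte double-precision floats.
BYTES_PER_ENTRY = 8

def dense_size_bytes(rows, cols, bytes_per_entry=BYTES_PER_ENTRY):
    """Memory needed to store a dense rows x cols matrix."""
    return rows * cols * bytes_per_entry

# Matrix dimensions as given in the dataset descriptions.
datasets = {
    "DC-HYDICE": (191, 392960),
    "SIIM-ISIC": (3 * 1024 * 1024, 33126),   # 3 color channels x 1024 x 1024 pixels
    "Synthetic": (1048576, 11042),
}

for name, (rows, cols) in datasets.items():
    size_gb = dense_size_bytes(rows, cols) / 1e9
    print(f"{name}: {rows} x {cols} -> {size_gb:.1f} GB")
```

Under this assumption the synthetic matrix is well below the 512 GB of memory per node, which is consistent with the choice to use it for single-node baselines.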
3.5.3 Performance
For all hierarchical clustering experiments in this section, the number of tree leaf
nodes k was set at 100, the number of NMF iterations was set to 100, the power
iteration was allowed to stop iterating after convergence, and only complete levels
were considered for analysis purposes for both level and strong scaling plots.
Figure 3.4: Strong Scaling for Clustering on DC-HYDICE (x-axis: Number of Compute Cores; y-axis: Relative Speedup)
Single-Node Scaling for DC Dataset
DC-HYDICE is small compared to the other datasets, so it can easily fit on one com-
pute node. Also, its small number of 191 rows doesn’t allow for parallelizing beyond
that number of MPI processes. So, this dataset was used for a single-node scaling
experiment on Summit from 1 to 42 cores. Because Rank-2 NMF is memory band-
width bound, we expect limited speedup on one node due to the memory bandwidth
not scaling linearly with the number of cores. Figure 3.4 shows that there is enough
speedup (14× on 42 cores) for it to be worth parallelizing such a small problem, but
perfect scaling requires more memory bandwidth. In this experiment, the processes
were distributed across both sockets so that an even number of cores on each socket
are used.
Figure 3.5: Strong Scaling Speedup for Rank-2 NMF; panels (a) Synthetic Data and (b) SIIM-ISIC Data (x-axis: Number of Compute Nodes; y-axis: Relative Speedup)
Rank-2 NMF Strong Scaling
We perform strong scaling experiments for a single Rank-2 NMF (Algorithm 4) on
the synthetic and SIIM-ISIC datasets. The theory (Equation (3.1)) suggests that
perfect strong scaling is possible as long as the execution time is dominated by local
computation. Both the matrix multiplications and NNLS solves scale linearly with
1/p (we expect MatMul to dominate), but the bandwidth cost is independent of p
and latency increases slightly with p.
Figures 3.5a and 3.5b show performance relative to the smallest number of com-
pute nodes required to store data and factor matrices. For these data sets, we observe
nearly perfect strong scaling, with 42× speedup on 40 compute nodes (over 1 compute
node) for synthetic data and 7.1× speedup on 80 compute nodes (over 10 compute
nodes) for SIIM-ISIC data.
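To relate the reported speedups to ideal scaling, one can compute parallel efficiency as the measured speedup divided by the ratio of node counts. The short sketch below does this for the two figures quoted above; the function name is an illustrative choice.

```python
def parallel_efficiency(speedup, nodes, baseline_nodes=1):
    """Measured speedup divided by the ideal speedup nodes / baseline_nodes."""
    return speedup / (nodes / baseline_nodes)

# Speedups reported in the text for a single Rank-2 NMF.
synthetic_eff = parallel_efficiency(42.0, nodes=40, baseline_nodes=1)
siim_isic_eff = parallel_efficiency(7.1, nodes=80, baseline_nodes=10)

print(f"synthetic: {synthetic_eff:.2f}")
print(f"SIIM-ISIC: {siim_isic_eff:.2f}")
```

An efficiency near 1 corresponds to the "nearly perfect" scaling described above; the synthetic value is slightly above 1 because the reported speedup exceeds the node count.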
The relative time breakdowns are presented in Figures 3.6 and 3.7 and explain
the strong scaling performance. Each experiment is normalized to 100% time, so
comparisons cannot be readily made across numbers of compute nodes. For both data
sets, we see that the time is dominated by MatMul, which is the primary reason for
the scalability. The dominant matrix multiplications are between a large matrix and
a matrix with 2 columns, so it is locally memory bandwidth bound, with performance
proportional to the size of the large matrix. In each plot, we also see the relative time
of all-gather and reduce-scatter increasing, which is because the local computation is
decreasing while the communication cost is slightly increasing with p. This pattern
will continue as p increases, which will eventually limit scalability, but for these data
sets the MatMul takes around 80% of the time at over 2000 cores.
Figure 3.6: Time Breakdown for Rank-2 NMF on Synthetic (x-axis: Number of Compute Nodes; y-axis: Relative Time; legend: MatMul, NNLS, Gram, Comp-Sigma, AllGather, ReduceScatter, AllReduce, Comm-Sigma)
Figure 3.7: Time Breakdown for Rank-2 NMF on SIIM-ISIC (x-axis: Number of Compute Nodes; y-axis: Relative Time; same legend as Figure 3.6)
Hierarchical Clustering Strong Scaling
From Equation (3.4), we expect to see perfect strong scaling in a computationally
bound clustering problem with target cluster count k = 100. As k is large, we expect
the latency cost of small problems deep in the tree to limit scalability.
Figure 3.8a demonstrates the scalability of the synthetic data set on up to 40 nodes,
and we observe a 15× speedup compared to 1 node. Figure 3.9 shows the relative
time breakdown and explains the limitation on scaling. On 40 nodes, computation
still takes 60% of the total time, but the all-gather and reduce-scatter costs have
grown in relative time because they do not scale with p. Because all-reduce involves
only a constant amount of data and its time remains relatively small, we conclude
the communication is bandwidth bound at this scale.
Figure 3.8: Strong Scaling Speedup for Clustering; panels (a) Synthetic Data and (b) SIIM-ISIC Data (x-axis: Number of Compute Nodes; y-axis: Relative Speedup)
With the larger SIIM-ISIC dataset, it’s possible to scale much further as seen in
Figure 3.8b, where we observe a 5.9× speedup on 80 compute nodes compared to 10.
From Figure 3.10, we see that the communication cost constitutes less than 20% of
the total time even at 80 compute nodes.
We note that the speedup of the overall hierarchical clustering algorithm is not
as high as for a single Rank-2 NMF (measured at the root node). This is due to
inefficiencies in the lower levels of the tree, as we explore in the next section.
Level Scaling
To compare execution time across levels of a particular tree, we consider only complete
levels. From Equation (3.3), the dominant computational term (due to MatMul) is
constant per level, the lower order computational term (represented by NNLS) grows
like $O(2^\ell)$, and the latency cost grows similarly like $O(2^\ell)$.
Figure 3.9: Time Breakdown for Clustering on Synthetic (x-axis: Number of Compute Nodes; y-axis: Relative Time; legend: MatMul, NNLS, Gram, Comp-Sigma, AllGather, ReduceScatter, AllReduce, Comm-Sigma)
Figure 3.10: Time Breakdown for Clustering on SIIM-ISIC (x-axis: Number of Compute Nodes; y-axis: Relative Time; same legend as Figure 3.9)
Figure 3.11 shows absolute time across levels for the synthetic data set on 1 node.
The MatMul cost decreases slightly per level, which may be explained by cache effects
in the local matrix multiply, as each node’s subproblem decreases in size. The NNLS
grows exponentially, as expected, and communication is negligible.
Figure 3.12 shows the level breakdown for the synthetic data on 40 nodes, where we
see different behavior. MatMul cost is again constant across levels and the NNLS cost
becomes dominating at lower levels suggesting it does not scale as well as MatMul.
We also see all-reduce time becoming significant as communication time increases,
indicating that the nodes at lower levels are becoming more latency bound. Thus,
we see that the poorer scaling at the lower levels of the tree is the main reason the
overall hierarchical clustering algorithm does not scale as well as the single Rank-2
NMF at the root node.
Rank Scaling
To confirm the slow growth in running time of the hierarchical algorithm in terms
of the number of clusters k, we perform rank scaling experiments for DC-HYDICE
and synthetic data. Assuming a balanced tree and relatively small k, Equation (3.4)
shows that the dominant computational cost is proportional to log k, while a flat
NMF algorithm has a dominant cost that is linear in k [32]. Figure 3.13 shows the
raw time for various values of k, confirming that running time for HierNMF grows
more slowly in k than a flat NMF algorithm (based on Block Principal Pivoting)
from PLANC [20] with the same number of columns and processor grid. We see that
for sufficiently large k, the hierarchical algorithm outperforms flat NMF and it scales
much better with k.
Figure 3.11: Level Times for 1 Compute Node on Synthetic (x-axis: Levels; y-axis: Wall Clock Time (in Secs); legend: MatMul, NNLS, Gram, Comp-Sigma, AllGather, ReduceScatter, AllReduce, Comm-Sigma)
Figure 3.12: Level Times for 40 Compute Nodes on Synthetic (x-axis: Levels; y-axis: Wall Clock Time (in Secs); same legend as Figure 3.11)
Figure 3.13: Rank Scaling for Hierarchical and Flat NMF; panels (a) DC-HYDICE Data and (b) Synthetic Data (x-axis: Number of Clusters k; y-axis: Time (s); legend: Hier NMF, Flat NMF)
3.6 Conclusion
As shown in the theoretical analysis (Section 3.4.2) and experimental results (Sec-
tion 3.5.3), Algorithm 2 can efficiently scale to large p as long as the execution time
is dominated by local matrix multiplication. The principal barriers to scalability are
the bandwidth cost due to Rank-2 NMF, which is consistent across levels of the tree
and proportional to the number of columns n of the original data set, and the latency
cost due to large numbers of tree nodes in lower levels of the tree. When n is small
relative to m and the number of leaves k and levels $\ell$ are small, then these barriers
do not pose a problem until p is very large. However, if the input matrix is short and
fat (i.e., has many samples with few features), then the bandwidth cost can hinder
performance for smaller p. Likewise, if k is large or the tree is lopsided, then achieving
scalability for very small problems is more difficult. We also note that in the case of
sparse A, it becomes more difficult to hide communication behind the cheaper matrix
multiplications, and other costs may become more dominant.
One approach for reducing the bandwidth cost of Rank-2 NMF is to choose a
more balanced data distribution over a 2D grid, as proposed by Kannan et al. [31].
This will reduce the communicated data and achieve a local data matrix that is more
square, which can improve local matrix multiplication performance. The downside
of this approach is requiring a redistribution of the data for each split, but if many
NMF iterations are required, then the single upfront cost may be amortized.
Another approach to alleviate the rising latency costs of lower levels of the tree
is to parallelize across nodes of the tree. This will result in fewer processors working
on any given node, reducing the synchronization time among them, and it will allow
small, latency-bound problems to be solved simultaneously. Prioritizing the sequence
of node splits is more difficult in this case, but modifying the stopping criterion for
splitting to use a score threshold instead of a target number of leaves will allow truly
independent computation.
In the future, we also plan to compare performance of Algorithm 2 with flat NMF
algorithms and employ the Divide-and-Conquer NMF technique [19] of seeding an
iterative flat NMF algorithm with the feature vectors of the leaf nodes. The parallel
technique proposed here can be combined with the existing PLANC library [20] to
obtain faster overall convergence for very large datasets.
Chapter 4: Tensor Train Rounding using Gram Matrices
The following chapter is a manuscript that has been submitted. For this work, I
contributed mainly to the results section by performing experiments and generating
plots.
4.1 Abstract
Tensor Train (TT) is a low-rank tensor representation consisting of a series of three-
way cores whose dimensions specify the TT ranks. Formal tensor train arithmetic
often causes an artificial increase in the TT ranks. Thus, a key operation for appli-
cations that use the TT format is rounding, which truncates the TT ranks subject
to an approximation error guarantee. Truncation is performed via SVD of a highly
structured matrix, and current rounding methods require careful orthogonalization
to compute an accurate SVD. We propose a new algorithm for TT rounding based
on the Gram SVD algorithm that avoids the expensive orthogonalization phase. Our
algorithm performs less computation and can be parallelized more easily than ex-
isting approaches, at the expense of a slight loss of accuracy. We demonstrate that
our implementation of the rounding algorithm is efficient, scales well, and consistently
outperforms the existing state-of-the-art parallel implementation in our experiments.
4.2 Preliminaries
4.2.1 Tensor Train Notation
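An order-$N$ low-rank tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is in the Tensor Train (TT) format if there
exist strictly positive integers $R_0, \ldots, R_N$ with $R_0 = R_N = 1$ and $N$ order-3 tensors
$\mathcal{T}_{\mathcal{X},1}, \ldots, \mathcal{T}_{\mathcal{X},N}$, called TT cores, with $\mathcal{T}_{\mathcal{X},n} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$, such that:
$$\mathcal{X}(i_1, \ldots, i_N) = \mathcal{T}_{\mathcal{X},1}(i_1,:) \cdots \mathcal{T}_{\mathcal{X},n}(:,i_n,:) \cdots \mathcal{T}_{\mathcal{X},N}(:,i_N).$$
Since $R_0 = R_N = 1$, the first and last TT cores are (order-2) matrices, so $\mathcal{T}_{\mathcal{X},1}(i_1,:) \in \mathbb{R}^{R_1}$ and
$\mathcal{T}_{\mathcal{X},N}(:,i_N) \in \mathbb{R}^{R_{N-1}}$, and hence $\mathcal{T}_{\mathcal{X},1}(i_1,:) \cdots \mathcal{T}_{\mathcal{X},N}(:,i_N) \in \mathbb{R}$. We refer to the $R_{n-1} \times R_n$ matrix
$\mathcal{T}_{\mathcal{X},n}(:,i_n,:)$ as the $i_n$th slice of the $n$th TT core of $\mathcal{X}$, where $1 \le i_n \le I_n$.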
Different types of matricization (also known as unfolding) of a tensor are used
to express linear algebra operations on tensors. In this work, we will often use two
particular matricizations of 3D tensors. The horizontal unfolding of TT core $\mathcal{T}_{\mathcal{X},n}$
corresponds to stacking the slices for $i_n = 1, \ldots, I_n$ horizontally. The horizontal
unfolding operator is denoted by $\mathcal{H}$; therefore, $\mathcal{H}(\mathcal{T}_{\mathcal{X},n}) \in \mathbb{R}^{R_{n-1} \times R_n I_n}$. The vertical
unfolding corresponds to stacking the slices for $i_n = 1, \ldots, I_n$ vertically. The vertical
unfolding operator is denoted by $\mathcal{V}$; therefore, $\mathcal{V}(\mathcal{T}_{\mathcal{X},n}) \in \mathbb{R}^{R_{n-1} I_n \times R_n}$. These two
unfoldings are important for the linearization of tensor entries in memory as they
enable performing matrix operations on the TT core without shuffling or permuting
data.
Another type of unfolding, which we will use to express mathematical relationships
among TT cores, maps the first $n$ modes to rows and the rest to columns [49]. We use
the notation $X_{(1:n)}$ to represent this unfolding, so that $X_{(1:n)} \in \mathbb{R}^{I_1 \cdots I_n \times I_{n+1} \cdots I_N}$. The
$n$th TT rank of $\mathcal{X}$ is the rank of $X_{(1:n)}$.
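The notation above can be made concrete with a small NumPy sketch. It assumes each TT core is stored as an array of shape (R_{n-1}, I_n, R_n); the function names are illustrative and not taken from any library. Note that NumPy arrays are row-major by default, so the vertical unfolding below requires a transpose, whereas a column-major layout could expose both unfoldings without moving data.

```python
import numpy as np

rng = np.random.default_rng(0)
dims = [4, 5, 6]        # I_1, I_2, I_3
ranks = [1, 2, 3, 1]    # R_0, ..., R_N with R_0 = R_N = 1

# Assumed storage: core n has shape (R_{n-1}, I_n, R_n).
cores = [rng.standard_normal((ranks[n], dims[n], ranks[n + 1]))
         for n in range(len(dims))]

def tt_entry(cores, index):
    """Evaluate X(i_1, ..., i_N) as a product of core slices."""
    out = np.ones((1, 1))
    for core, i in zip(cores, index):
        out = out @ core[:, i, :]      # multiply by the i-th slice of this core
    return out[0, 0]

def horizontal_unfolding(core):
    """Slices placed side by side: shape R_{n-1} x (I_n R_n)."""
    r0, i, r1 = core.shape
    return core.reshape(r0, i * r1)

def vertical_unfolding(core):
    """Slices stacked on top of each other: shape (I_n R_{n-1}) x R_n."""
    r0, i, r1 = core.shape
    return core.transpose(1, 0, 2).reshape(i * r0, r1)

print(tt_entry(cores, (1, 2, 3)))
print(horizontal_unfolding(cores[1]).shape, vertical_unfolding(cores[1]).shape)
```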
4.2.2 Cholesky QR and Gram SVD
Given a tall and skinny matrix $A$, recall that the corresponding Gram matrices are
$AA^T$ and $A^TA$. We are typically interested in $G_A = A^TA$ for efficient algorithms
because it is a smaller matrix.
Cholesky QR is an algorithm that exploits the fact that, for $A$ full rank, the
upper triangular Cholesky factor of $G_A$ is also the upper triangular factor in the QR
decomposition of $A$. That is, for $A = QR$, we have $G_A = R^TQ^TQR = R^TR$. If
$A$ is full rank, then $R$ is invertible and $Q$ can be recovered as $Q = AR^{-1}$ using
a triangular solve. In finite precision, Cholesky QR obtains a small decomposition
error $\|A - QR\|$, but the orthogonality error $\|Q^TQ - I\|$ grows quadratically with the
condition number of $A$. By comparison, Householder QR obtains small orthogonality
error regardless of the conditioning of $A$ [30]. We note there are techniques for
improving the numerical properties of Cholesky QR, by using 2 or 3 passes [22, 23].
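The following NumPy sketch illustrates single-pass Cholesky QR and reports both error measures for matrices of increasing condition number. The helper `matrix_with_condition` is a hypothetical test-matrix generator, and a general linear solve stands in for a dedicated triangular solve to keep the example dependency-free.

```python
import numpy as np

def cholesky_qr(A):
    """QR factors of a full-rank tall-skinny A computed from its Gram matrix."""
    G = A.T @ A                        # small Gram matrix G_A = A^T A
    R = np.linalg.cholesky(G).T        # NumPy returns the lower factor, so transpose
    # Q = A R^{-1}; a general solve stands in for a dedicated triangular solve.
    Q = np.linalg.solve(R.T, A.T).T
    return Q, R

def matrix_with_condition(m, n, cond, rng):
    """Hypothetical helper: m x n matrix with singular values from 1 down to 1/cond."""
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return (U * np.geomspace(1.0, 1.0 / cond, n)) @ V.T

rng = np.random.default_rng(0)
for cond in (1e1, 1e3, 1e5):
    A = matrix_with_condition(1000, 20, cond, rng)
    Q, R = cholesky_qr(A)
    dec_err = np.linalg.norm(A - Q @ R) / np.linalg.norm(A)
    orth_err = np.linalg.norm(Q.T @ Q - np.eye(20))
    print(f"cond={cond:.0e}  decomposition error={dec_err:.1e}  "
          f"orthogonality error={orth_err:.1e}")
```

Running the loop should show the decomposition error staying near machine precision while the orthogonality error degrades as the condition number grows, in line with the discussion above.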
Likewise, Gram SVD is an algorithm that exploits the connection between the SVD
of a matrix and the eigenvalue decompositions of its Gram matrices. For $A = U\Sigma V^T$,
we have $G_A = V\Sigma U^TU\Sigma V^T = V\Sigma^2V^T$. We see that the eigenvalues of $G_A$ are the
squares of the singular values of $A$ and the eigenvectors of $G_A$ are the right singular
vectors of $A$. We can recover the left singular vectors via $U = AV\Sigma^{-1}$ (assuming full
rank). Like Cholesky QR, Gram SVD computes an accurate decomposition but suffers
from higher orthogonality error of $U$ as well as reduced accuracy of the singular values.
SVD algorithms using orthogonal transformations compute singular values with error
proportional to $\|A\| \cdot \varepsilon$, where $\varepsilon$ is the working precision, while the error for Gram
SVD can be larger by a factor as large as the condition number of $A$ [60]. This implies
that backward stable SVD algorithms can compute singular values in a range of $1/\varepsilon$,
while Gram SVD is limited to computing singular values in a range of $1/\sqrt{\varepsilon}$.
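A minimal NumPy sketch of Gram SVD follows, assuming a full-rank input so that the division by the singular values is well defined. The second half compares singular values computed both ways on a hypothetical test matrix whose singular values span twelve orders of magnitude; no particular output is claimed here, but the smallest values are where the two methods are expected to disagree.

```python
import numpy as np

def gram_svd(A):
    """Thin SVD A = U diag(s) V^T from the eigendecomposition of A^T A.

    Assumes A has full column rank so that every singular value is nonzero.
    """
    lam, V = np.linalg.eigh(A.T @ A)       # eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]         # reorder to descending
    s = np.sqrt(np.maximum(lam, 0.0))      # clip tiny negative values from roundoff
    U = (A @ V) / s                        # U = A V Sigma^{-1}
    return U, s, V

# Illustration of the reduced accuracy range (singular values only).
rng = np.random.default_rng(0)
n = 12
U0, _ = np.linalg.qr(rng.standard_normal((500, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
s_true = np.geomspace(1.0, 1e-12, n)
A = (U0 * s_true) @ V0.T

s_svd = np.linalg.svd(A, compute_uv=False)
s_gram = np.sqrt(np.maximum(np.linalg.eigvalsh(A.T @ A)[::-1], 0.0))
for t, a, b in zip(s_true, s_svd, s_gram):
    print(f"true={t:.1e}  svd={a:.1e}  gram={b:.1e}")
```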
4.2.3 Cookies Problem and TT-GMRES
As a concrete example of a parametrized PDE for which TT methods work well, we
consider the two-dimensional cookies problem [37, 59] described as follows:
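$$-\operatorname{div}(\sigma(x, y; \rho)\nabla(u(x, y))) = f(x, y) \text{ in } \Omega, \qquad u(x, y) = 0 \text{ on } \delta\Omega,$$
where $\Omega$ is $(-1, 1) \times (-1, 1)$, $\delta\Omega$ is the boundary of $\Omega$, and $\sigma$ is defined as:
$$\sigma(x, y; \rho) = \begin{cases} 1 + \rho_i & \text{if } (x, y) \in D_i \\ 1 & \text{elsewhere} \end{cases}$$
where $D_i$ for $i = 1, \ldots, p$ are disjoint disks distributed in $\Omega$ such that their centers
are equidistant and $\rho_i$ is selected from a set of samples $J_i \subset \mathbb{R}$ for $i = 1, \ldots, p$. To
solve this problem, for each combination of values $(\rho_1, \ldots, \rho_p)$, one can solve the linear
system $(G_{1,1} + \sum_{i=1}^{p} \rho_i G_{i+1,1})\, u = f$, where $G_{1,1} \in \mathbb{R}^{I_1 \times I_1}$ is the discretization of the
operator $-\operatorname{div}(\nabla(\cdot))$ in $\Omega$, $G_{i+1,1}$ is the discretization of $-\operatorname{div}(\chi_{D_i}\nabla(\cdot))$ in $\Omega$, where
$\chi_S$ is the indicator function of the set $S$, and $f$ is the discretization of the function $f$.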
The number of linear systems to solve in that case is the product of the cardinalities
of the sets (Ji)1≤i≤p. Knowing that the set of solutions can be well approximated by a
low-rank tensor [13, 26], another approach to solve the problem is to use an iterative
method that exploits the low-rank structure and solves one large system including
all combinations of parameters. That is, to solve a (p + 1)-order problem of the
form $\mathcal{G}\mathcal{U} = \mathcal{F}$. The operator $\mathcal{G}$ is given as $\mathcal{G} = \sum_{i=1}^{p+1} G_{i,1} \otimes \cdots \otimes G_{i,p+1}$, where
$G_{i,i} \in \mathbb{R}^{I_i \times I_i}$ for $i = 2, \ldots, p + 1$ is a diagonal matrix containing the samples of $\rho_i$,
and the remaining matrices $G_{i,j}$ for $i = 1, \ldots, p + 1$, $j = 2, \ldots, p + 1$ and $j \ne i$ are
identity matrices of suitable size. The right-hand side is $\mathcal{F} = f \otimes 1_{I_2} \otimes \cdots \otimes 1_{I_{p+1}}$,
where $1_{I_i}$ is the vector of ones of size $I_i$.
In this application and many others, the operator G has an operator rank that is
low and the right-hand side F is given in a low-rank form [3,7,9,37,63,64]. One way to
approximate the solution by a low-rank tensor is to apply a Krylov method adapted
to low rank tensors such as TT-GMRES [16]. In each iteration, the operator G is
applied to a low rank tensor leading to a formal expansion of the ranks. Furthermore,
one needs to orthonormalize the new basis tensor against previous ones by using a
Gram–Schmidt procedure, see algorithm 6. Again, the ranks will increase formally.
In order to keep memory and computations tractable, one has to round the resulting
tensors after performing these two steps. Most of the time, a small reduction in
the final relative residual norm is sufficient, which allows performing aggressive TT
rounding with loose tolerances.
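Algorithm 6 TT-GMRES [16]
1: function $\mathcal{U}$ = TT-GMRES($\mathcal{G}$, $\mathcal{F}$, $m$, $\varepsilon$)
2:   Set $\beta = \|\mathcal{F}\|_F$, $\mathcal{V}_1 = \mathcal{F}/\beta$, $r = \beta$
3:   for $j = 1 : m$ do
4:     Set $\delta = \varepsilon\beta / r$
5:     $\mathcal{W}$ = TT-Round($\mathcal{G}\mathcal{V}_j$, $\delta$)
6:     for $i = 1 : j$ do
7:       $H(i, j)$ = InnerProd($\mathcal{W}$, $\mathcal{V}_i$)
8:     end for
9:     $\mathcal{W}$ = TT-Round($\mathcal{W} - \sum_{i=1}^{j} H(i, j)\mathcal{V}_i$, $\delta$)
10:    $H(j + 1, j) = \|\mathcal{W}\|_F$
11:    $r = \min_y \|H(1 : j + 1, 1 : j)\,y - \beta e_1\|_2$
12:    $\mathcal{V}_{j+1} = \mathcal{W} / H(j + 1, j)$
13:  end for
14:  $y_m = \operatorname{argmin}_y \|Hy - \beta e_1\|_2$
15:  $\mathcal{U} = \sum_{j=1}^{m} y_m(j)\mathcal{V}_j$
16: end function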
4.2.4 TT-Rounding via Orthogonalization
The standard algorithm for TT-rounding [47] is given in algorithm 7. This procedure
is composed of two phases, an orthogonalization phase and a truncation phase. The
orthogonalization phase consists of a sequence of QR decompositions of the vertical
unfolding of each core starting from the leftmost to orthonormalize its columns and
then a multiplication of the triangular factor by the following core. The truncation
phase consists of a sequence of truncated SVDs of the horizontal unfolding of each
core starting from the rightmost, leaving its rows orthonormal (set as the leading right
singular vectors), and multiplying the preceding core by the singular values and the
leading left singular vectors. The direction of these two phases can be reversed.
Given a required accuracy, the TT-Rounding procedure provides a quasi-optimal
approximation with given TT ranks [47].
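Algorithm 7 TT-Rounding via Orthogonalization [1, 47]
1: function $\mathcal{Y}$ = TT-Round-QR($\mathcal{X}$, $\varepsilon$)
2:   Set $\mathcal{T}_{\mathcal{Y},1} = \mathcal{T}_{\mathcal{X},1}$
3:   for $n = 1$ to $N - 1$ do
4:     $[\mathcal{V}(\mathcal{T}_{\mathcal{Y},n}), R]$ = QR($\mathcal{V}(\mathcal{T}_{\mathcal{Y},n})$)
5:     $\mathcal{H}(\mathcal{T}_{\mathcal{Y},n+1}) = R\,\mathcal{H}(\mathcal{T}_{\mathcal{X},n+1})$
6:   end for
7:   Compute $\|\mathcal{X}\|_F = \|\mathcal{T}_{\mathcal{Y},N}\|_F$ and $\varepsilon_0 = \frac{\|\mathcal{X}\|_F}{\sqrt{N-1}}\,\varepsilon$
8:   for $n = N$ down to 2 do
9:     $[Q, R]$ = QR($\mathcal{H}(\mathcal{T}_{\mathcal{Y},n})^T$)
10:    $[\hat{U}, \hat{\Sigma}, \hat{V}]$ = tSVD($R$, $\varepsilon_0$)
11:    $\mathcal{H}(\mathcal{T}_{\mathcal{Y},n})^T = Q\hat{U}$
12:    $\mathcal{V}(\mathcal{T}_{\mathcal{Y},n-1}) = \mathcal{V}(\mathcal{T}_{\mathcal{Y},n-1})\hat{V}\hat{\Sigma}$
13:  end for
14: end function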
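The two phases of the algorithm above can be sketched sequentially in NumPy. This is an illustrative simplification under stated assumptions, not the parallel implementation of [1]: cores are assumed to be stored with shape (R_{n-1}, I_n, R_n), the truncation phase applies an SVD directly to the horizontal unfolding instead of a QR followed by an SVD of the triangular factor, at least two cores are assumed, and the helpers `tt_add` and `tt_to_full` are hypothetical utilities used only to build a test case with formally inflated ranks.

```python
import numpy as np

def tt_round_qr(cores, eps):
    """Round a TT tensor given as a list of cores of shape (R_prev, I, R_next)."""
    cores = [c.copy() for c in cores]
    N = len(cores)
    # Orthogonalization phase (left to right): QR of each unfolded core,
    # then push the triangular factor into the next core.
    for n in range(N - 1):
        r0, i, r1 = cores[n].shape
        # Rows are ordered differently from the vertical unfolding in the text,
        # but a row permutation does not change the triangular factor.
        Q, R = np.linalg.qr(cores[n].reshape(r0 * i, r1))
        cores[n] = Q.reshape(r0, i, Q.shape[1])
        cores[n + 1] = np.tensordot(R, cores[n + 1], axes=(1, 0))
    # After orthogonalization the norm of the tensor sits in the last core.
    eps0 = np.linalg.norm(cores[-1]) * eps / np.sqrt(N - 1)
    # Truncation phase (right to left): truncated SVD of each horizontal unfolding.
    for n in range(N - 1, 0, -1):
        r0, i, r1 = cores[n].shape
        U, s, Vt = np.linalg.svd(cores[n].reshape(r0, i * r1), full_matrices=False)
        tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]   # tail[k] = norm of s[k:]
        rank = max(1, int(np.sum(tail > eps0)))
        cores[n] = Vt[:rank].reshape(rank, i, r1)       # rows stay orthonormal
        cores[n - 1] = np.tensordot(cores[n - 1], U[:, :rank] * s[:rank], axes=(2, 0))
    return cores

def tt_to_full(cores):
    """Hypothetical helper: contract all cores into the explicit tensor."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=(-1, 0))
    return full[0, ..., 0]

def tt_add(X, Y):
    """Hypothetical helper: formal TT sum, which adds the ranks of X and Y."""
    out = []
    for n, (a, b) in enumerate(zip(X, Y)):
        if n == 0:
            out.append(np.concatenate([a, b], axis=2))
        elif n == len(X) - 1:
            out.append(np.concatenate([a, b], axis=0))
        else:
            c = np.zeros((a.shape[0] + b.shape[0], a.shape[1], a.shape[2] + b.shape[2]))
            c[:a.shape[0], :, :a.shape[2]] = a
            c[a.shape[0]:, :, a.shape[2]:] = b
            out.append(c)
    return out

rng = np.random.default_rng(0)
dims, ranks = [5, 6, 7], [1, 3, 3, 1]
X = [rng.standard_normal((ranks[n], dims[n], ranks[n + 1])) for n in range(len(dims))]
doubled = tt_add(X, X)                    # represents 2X with formally doubled ranks
rounded = tt_round_qr(doubled, 1e-10)
exact = 2 * tt_to_full(X)
rel_err = np.linalg.norm(tt_to_full(rounded) - exact) / np.linalg.norm(exact)
print([c.shape for c in doubled], [c.shape for c in rounded], rel_err)
```

The example mirrors the rank growth described earlier: a formal sum doubles the TT ranks even though the underlying tensor has not changed rank, and rounding recovers a compact representation.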
4.2.5 Previous Work on Parallel TT-Rounding
Algorithm 7 has been parallelized by Al Daas et al. [1], who use a 1-D distribution
of TT cores to partition a TT tensor across processors. Each core is distributed over
all processors along the physical mode such that each processor owns $I_k/P$ slices of
the $k$th core. This distribution guarantees load balance and allows TT arithmetic
to be performed efficiently. In particular, the QR decompositions are performed via the
Tall-Skinny QR algorithm [14], and multiplications involving TT cores are parallelized
following the 1D distributions. We improve upon this prior work by using an alternate
TT-rounding approach that avoids QR decompositions, reducing arithmetic by a
constant factor and also reducing communication.
4.3 Introduction
Low-rank representations of tensors help to make algorithms addressing large-scale
multidimensional problems computationally feasible. While the size of explicit rep-
resentations of these tensors grows very quickly (an instance of the “curse of dimen-
sionality”), low-rank representations can often approximate explicit forms to sufficient
accuracy while requiring orders of magnitude less space and computational time. For
example, suppose a parametrized PDE depends on 10 parameters, where each parameter
has 10 possible values. Computing the solution for each of the $10^{10}$ configurations
becomes infeasible even for modest discretizations of the state space, but if the
solution depends smoothly on the parameters, then the qualitative behavior of the
solution over the entire configuration space can be captured using far fewer than $10^{10}$
parameters [13,26,37].
As we describe in detail in section 3.3, the Tensor Train (TT) format [47] is a
low-rank representation with a number of parameters that is linear in the sum of the
tensor dimensions, as compared to an explicit representation whose size is the prod-
uct of the tensor dimensions. The TT format consists of a series of 3-way tensors,
or TT cores, with one dimension corresponding to an original tensor dimension and
two dimensions corresponding to much smaller TT ranks. TT approximations can be
computed from explicit tensors as a means of compression for scientific computing and
machine learning applications [27,47,50,66], but they are also often used to represent
tensors that cannot be formed explicitly at all. In the context of parametrized PDEs,
the TT format has been used to represent both the discretized operators as well as
the solution, residual, and other related vectors [7–9, 16]. In this case, TT tensors
are manipulated using operations such as additions, dot products, and elementwise
multiplications, which causes the TT ranks to grow in size. The key operation that
prevents uncontrolled growth in TT ranks is known as TT rounding, in which a TT
tensor is approximated by another TT tensor with minimal ranks subject to a spec-
ified approximation error. This operation requires a sequence of highly structured
matrix singular value decomposition (SVD) problems, and is typically a computa-
tional bottleneck.
There exists a wide array of high-performance, parallel implementations of tensor
computations for computing decompositions such as CP and Tucker of dense and
sparse tensors [5, 10, 12, 20, 33, 53], as well as for performing contractions of dense,
sparse, and structured tensors [2, 54, 56]. However, the available software for com-
puting, manipulating, and rounding TT tensors is largely limited to productivity
languages such as MATLAB and Python [44, 61]. Aside from the work of Al Daas
et al. [1], which we describe in section 3.3 and compare against in section 4.6, we
are not aware of other HPC implementations of TT-based algorithms. One of the
aims of this paper is to raise the bar for parallel performance for TT rounding and
demonstrate that TT-based approaches can scale to scientific problems with more
and higher dimensions using efficient parallelization.
The TT rounding algorithm utilizes multiple truncated SVDs. The central con-
tribution of this paper is the development of a parallel algorithm that performs these
truncated SVDs more efficiently than the existing approach, by reducing both com-
putational and communication costs. The basic tool of the algorithm is the Gram
SVD algorithm, which exploits the connection between the SVD of a matrix A and
the eigenvalue decomposition of its Gram matrix $A^TA$. The truncated SVD must
be performed on a highly structured matrix, which is analogous to a matrix represented
as $X = AB^T$, where $A$ and $B$ are tall-skinny matrices. We present our approach in
full detail for this matrix analogue in section 4.4, including empirical results for the
numerical properties, and then show how it can be applied within the TT rounding
algorithm in section 4.5. The key to efficiency in the context of TT rounding is the
computation of Gram matrices of matrices with overlapping TT structure.
We present performance results in section 4.6, demonstrating the efficiency of our
algorithm compared to the existing state of the art. In a MATLAB-based experi-
ment, we show that improvement of a TT-rounding implementation leads to overall
performance improvement for a TT-based linear solver. Then we demonstrate that
our C/MPI implementation is both weakly and strongly scalable on TT tensors with
representative dimensions and ranks. In particular, we achieve up to Y× parallel
speedup when scaling to 64 nodes of a distributed-memory platform for a Z-way
tensor with dimensions of size W and TT ranks of size Q. We also achieve up to
an 8× speedup over a state-of-the-art implementation of the standard TT-rounding
approach. Our results demonstrate that TT rounding is highly scalable using our
algorithm, and we target parallelization of TT-based solvers based on our approach
as future work.
4.4 Truncation of Matrix Product
To gain intuition for the use of Gram SVD within TT-Rounding, we focus in this
section on the (degenerate) case of TT with 2 modes, with dimensions $I \times J$. In this
case, the tensor is a matrix represented by a low-rank product of matrices:
$$X = AB^T, \tag{4.1}$$
where $A$ and $B$ are tall and skinny matrices with $R$ columns. The goal is to
approximate $X$ with a lower rank representation
$$X \approx \hat{A}\hat{B}^T, \tag{4.2}$$
where $\hat{A}$ and $\hat{B}$ have $L < R$ columns.
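Algorithm 8 Rounding Matrix Product $AB^T$ using QR
function $[\hat{A}, \hat{B}]$ = Mat-Rounding-QR($A$, $B$, $\varepsilon$)
  $[Q_A, R_A]$ = QR($A$)
  $[Q_B, R_B]$ = QR($B$)
  $[\hat{U}, \hat{\Sigma}, \hat{V}]$ = tSVD($R_AR_B^T$, $\varepsilon$)
  $\hat{A} = Q_A(\hat{U}\hat{\Sigma}^{1/2})$
  $\hat{B} = Q_B(\hat{V}\hat{\Sigma}^{1/2})$
end function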
4.4.1 Truncation via Orthogonalization
A numerically accurate and reasonably efficient approach to truncate the represen-
tation of X is via orthogonalization. By computing (compact) QR decompositions
$A = Q_AR_A$ and $B = Q_BR_B$, we have
$$X = Q_AR_AR_B^TQ_B^T \tag{4.3}$$
and the SVD of $R_AR_B^T$ yields the (compact) SVD of $X$ because $Q_A$ and $Q_B$ have
orthonormal columns. Note that $R_AR_B^T$ is $R \times R$, so its SVD is much cheaper to
compute.
We formalize this approach in algorithm 8. In order to truncate the rank of $X$,
we can truncate the SVD of $R_AR_B^T$. To obtain factors $\hat{A}$ and $\hat{B}$, we apply $Q_A$ and
$Q_B$ to the left and right singular vectors, respectively. The singular values can be
distributed arbitrarily; we choose to distribute them evenly to the left and right factors.
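A compact NumPy sketch of this orthogonalization-based rounding follows. Two choices here are assumptions rather than details from the text: the truncation rule (discard the smallest singular values while their combined norm stays below ε times the norm of all singular values) and the use of explicit thin Q factors from `np.linalg.qr` instead of implicit Householder form.

```python
import numpy as np

def truncation_rank(s, eps):
    """Smallest L whose discarded singular values have norm at most eps * ||s||."""
    tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]   # tail[k] = norm of s[k:]
    return max(1, int(np.sum(tail > eps * np.linalg.norm(s))))

def mat_rounding_qr(A, B, eps):
    """Round X = A B^T to a lower-rank product via QR of both factors."""
    QA, RA = np.linalg.qr(A)             # explicit thin Q, not implicit Householder form
    QB, RB = np.linalg.qr(B)
    U, s, Vt = np.linalg.svd(RA @ RB.T)  # small R x R problem
    L = truncation_rank(s, eps)
    root = np.sqrt(s[:L])                # split singular values evenly
    return QA @ (U[:, :L] * root), QB @ (Vt[:L].T * root)

rng = np.random.default_rng(0)
QA0, _ = np.linalg.qr(rng.standard_normal((400, 20)))
A = QA0 * np.geomspace(1.0, 1e-8, 20)    # column scaling gives decaying singular values
B, _ = np.linalg.qr(rng.standard_normal((300, 20)))

A_hat, B_hat = mat_rounding_qr(A, B, 1e-4)
X = A @ B.T
rel_err = np.linalg.norm(X - A_hat @ B_hat.T) / np.linalg.norm(X)
print(A_hat.shape[1], rel_err)
```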
4.4.2 Truncation via Gram SVD
We now show our proposed method for a faster but potentially less accurate rounding
algorithm for the matrix product. Our method is based on the Gram SVD algorithm,
but we note it is not a straightforward application. For example, we can represent
$XX^T$ as $AB^TBA^T$, and while $B^TB$ is $R \times R$, we cannot obtain the eigenvalue
decomposition easily without orthogonalizing $A$. Instead, we consider the Gram matrices
of $A$ and $B$ separately, letting $G_A = A^TA$ and $G_B = B^TB$. For clarity, we first
describe the method using Cholesky QR, then discuss pivoting within Cholesky, and
finally explain the use of Gram SVD. We compare numerical results for the matrix
product case in section 4.4.4.
Cholesky QR
Let us first assume $A$ and $B$ are full rank, and use Cholesky QR to orthonormalize
the columns of $A$ and $B$. Computing Cholesky decompositions, we have $R_A^TR_A = G_A$
and $R_B^TR_B = G_B$. Then eq. (4.3) becomes
$$X = (AR_A^{-1})\,R_AR_B^T\,(BR_B^{-1})^T.$$
Given the truncated SVD $\hat{U}\hat{\Sigma}\hat{V}^T \approx R_AR_B^T$, we can compute
$$\hat{A} = A(R_A^{-1}\hat{U}\hat{\Sigma}^{1/2}) \quad \text{and} \quad \hat{B} = B(R_B^{-1}\hat{V}\hat{\Sigma}^{1/2})$$
to obtain eq. (4.2).
Pivoted Cholesky QR
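Now suppose that $A$ and $B$ are low rank with ranks $L_A$ and $L_B$. While the standard
Cholesky algorithm will fail in this case, we can employ pivoted Cholesky to obtain
$R_A^TR_A = P_A^TG_AP_A$ and $R_B^TR_B = P_B^TG_BP_B$, where $P_A$ and $P_B$ are permutation
matrices and $R_A$ and $R_B$ can be written
$$R_A = \begin{bmatrix} \bar{R}_A & \tilde{R}_A \\ 0 & 0 \end{bmatrix} \quad \text{and} \quad R_B = \begin{bmatrix} \bar{R}_B & \tilde{R}_B \\ 0 & 0 \end{bmatrix},$$
with $\bar{R}_A^{-1}$ and $\bar{R}_B^{-1}$ having dimensions $L_A \times L_A$ and $L_B \times L_B$, respectively. Then
eq. (4.3) becomes
$$X = Q_A \underbrace{\begin{bmatrix} \bar{R}_A & \tilde{R}_A \end{bmatrix} P_A^TP_B \begin{bmatrix} \bar{R}_B^T \\ \tilde{R}_B^T \end{bmatrix}}_{M} Q_B^T = Q_AMQ_B^T$$
where
$$Q_A = AP_A \begin{bmatrix} \bar{R}_A^{-1} \\ 0 \end{bmatrix} \quad \text{and} \quad Q_B = BP_B \begin{bmatrix} \bar{R}_B^{-1} \\ 0 \end{bmatrix}.$$
Given the truncated SVD $\hat{U}\hat{\Sigma}\hat{V}^T \approx M$, we compute
$$\hat{A} = A\left(P_A \begin{bmatrix} \bar{R}_A^{-1} \\ 0 \end{bmatrix} \hat{U}\hat{\Sigma}^{1/2}\right) \quad \text{and} \quad \hat{B} = B\left(P_B \begin{bmatrix} \bar{R}_B^{-1} \\ 0 \end{bmatrix} \hat{V}\hat{\Sigma}^{1/2}\right)$$
to obtain eq. (4.2).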
Gram SVD
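Pivoted Cholesky QR works well for the low rank case in exact arithmetic, but in the
case of numerically low rank matrices, it provides a sharp truncation for each of $A$ and
$B$ individually. We now consider using the Gram SVD approach, which we will see in
section 4.4.4 is more robust than pivoted Cholesky QR. Here, we consider $A$ and $B$ to
be possibly low rank. Given the SVDs $A = U_A\Sigma_AV_A^T$ and $B = U_B\Sigma_BV_B^T$, we have
eigenvalue decompositions $G_A = V_A\Sigma_A^2V_A^T = \bar{V}_A\bar{\Sigma}_A^2\bar{V}_A^T$ and
$G_B = V_B\Sigma_B^2V_B^T = \bar{V}_B\bar{\Sigma}_B^2\bar{V}_B^T$, where $\bar{\Sigma}_A$ and $\bar{\Sigma}_B$ represent the nonzero singular values and $\bar{V}_A$ and $\bar{V}_B$
are the corresponding vectors. We can then write the corresponding left singular
vectors via $\bar{U}_A = A\bar{V}_A\bar{\Sigma}_A^{-1}$ and $\bar{U}_B = B\bar{V}_B\bar{\Sigma}_B^{-1}$. With these quantities, eq. (4.1)
becomes
$$X = \underbrace{(A\bar{V}_A\bar{\Sigma}_A^{-1})}_{\bar{U}_A}\, \underbrace{\bar{\Sigma}_A\bar{V}_A^T\bar{V}_B\bar{\Sigma}_B}_{M}\, \underbrace{(B\bar{V}_B\bar{\Sigma}_B^{-1})^T}_{\bar{U}_B^T} = \bar{U}_AM\bar{U}_B^T.$$
Given the truncated SVD $\hat{U}\hat{\Sigma}\hat{V}^T \approx M$, we compute
$$\hat{A} = A(\bar{V}_A\bar{\Sigma}_A^{-1}\hat{U}\hat{\Sigma}^{1/2}) \quad \text{and} \quad \hat{B} = B(\bar{V}_B\bar{\Sigma}_B^{-1}\hat{V}\hat{\Sigma}^{1/2})$$
to obtain eq. (4.2). The algorithm for the Gram SVD approach is given as algorithm 9,
which can be adapted to pivoted Cholesky QR following the algebra of section 4.4.2.
Algorithm 9 Truncated SVD of $AB^T$ using Gram SVDs
1: function $[\hat{A}, \hat{B}]$ = tSVD-ABt-Gram($A$, $B$, $\varepsilon$)
2:   $G_A = A^TA$
3:   $G_B = B^TB$
4:   $[V_A, \Lambda_A]$ = Eig($G_A$)
5:   $[V_B, \Lambda_B]$ = Eig($G_B$)
6:   $[\hat{U}, \hat{\Sigma}, \hat{V}]$ = tSVD($\Lambda_A^{1/2}V_A^TV_B\Lambda_B^{1/2}$, $\varepsilon$)
7:   $\hat{A} = A(V_A\Lambda_A^{-1/2}\hat{U}\hat{\Sigma}^{1/2})$
8:   $\hat{B} = B(V_B\Lambda_B^{-1/2}\hat{V}\hat{\Sigma}^{1/2})$
9: end function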
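The steps of the Gram SVD approach can be sketched in NumPy as follows. The sketch assumes that $A$ and $B$ are numerically full rank, so that every Gram eigenvalue is positive and no eigenpairs need to be discarded, and it uses the same assumed relative truncation rule as a stand-in for tSVD; neither choice is prescribed by the text.

```python
import numpy as np

def truncation_rank(s, eps):
    """Smallest L whose discarded singular values have norm at most eps * ||s||."""
    tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]   # tail[k] = norm of s[k:]
    return max(1, int(np.sum(tail > eps * np.linalg.norm(s))))

def tsvd_abt_gram(A, B, eps):
    """Round X = A B^T using eigendecompositions of the two Gram matrices."""
    lamA, VA = np.linalg.eigh(A.T @ A)
    lamB, VB = np.linalg.eigh(B.T @ B)
    # Assumption: A and B are numerically full rank, so all eigenvalues are positive.
    sA, sB = np.sqrt(lamA), np.sqrt(lamB)
    M = (sA[:, None] * (VA.T @ VB)) * sB[None, :]   # Lambda_A^{1/2} V_A^T V_B Lambda_B^{1/2}
    U, s, Vt = np.linalg.svd(M)
    L = truncation_rank(s, eps)
    root = np.sqrt(s[:L])
    # Small matrices are combined first; A and B are each touched by one multiply.
    A_hat = A @ ((VA / sA) @ (U[:, :L] * root))
    B_hat = B @ ((VB / sB) @ (Vt[:L].T * root))
    return A_hat, B_hat

rng = np.random.default_rng(0)
QA0, _ = np.linalg.qr(rng.standard_normal((400, 20)))
A = QA0 * np.geomspace(1.0, 1e-4, 20)   # moderately ill-conditioned factor
B = rng.standard_normal((300, 20))

A_hat, B_hat = tsvd_abt_gram(A, B, 1e-2)
X = A @ B.T
rel_err = np.linalg.norm(X - A_hat @ B_hat.T) / np.linalg.norm(X)
print(A_hat.shape[1], rel_err)
```

For inputs that are numerically rank deficient, the eigenvalues below some threshold would need to be dropped before forming the inverse square roots, which corresponds to keeping only the nonzero singular values in the derivation above.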
4.4.3 Complexity Analysis
We now consider the computational complexity of the truncation methods, where
we assume $A$ is $I \times R$, $B$ is $J \times R$, $\hat{A}$ is $I \times L$, and $\hat{B}$
is $J \times L$. Truncation via orthogonalization is specified in algorithm 8. The QR
decompositions in lines 2 and 3 require $2(I+J)R^2$ flops, where we assume that the
orthogonal factors $Q_A$ and $Q_B$ are maintained in implicit (e.g., Householder) form.
The multiplication and truncated SVD of line 4 cost $O(R^3)$. Applying the implicit
orthogonal factors to $R \times L$ matrices to compute $\hat{A}$ and $\hat{B}$ requires
$4(I+J)RL$ flops, for a total cost bounded by
\[
2(I+J)R^2 + 4(I+J)RL + O(R^3). \tag{4.4}
\]
In the case of the Gram SVD approach, we unify the analysis for Cholesky QR
and Gram SVD. Algorithm 9 gives the explicit steps assuming Gram SVD is used.
The cost of lines 2 and 3 together is $(I+J)R^2$ operations, which are performed for
either method. The eigendecompositions of lines 4 and 5 cost $O(R^3)$. This is
approximately 10 times the cost of performing Cholesky decompositions of the Gram
matrices, but we note that $O(R^3)$ is a lower-order term compared to the cost of
computing the Gram matrices. The matrix multiplications and truncated SVD of line 6
are also $O(R^3)$ (possibly less if $A$ and $B$ are low rank) and are similar across the
two methods. Finally, lines 7 and 8 first involve computations with small matrices (of
size $R \times L$ or smaller) followed by a single multiplication with the large $A$ or
$B$ matrix, which together cost $2(I+J)RL$. Overall, the computational cost of the Gram
SVD method is bounded by
\[
(I+J)R^2 + 2(I+J)RL + O(R^3), \tag{4.5}
\]
which is about half the cost of the orthogonalization approach given in eq. (4.4).
Furthermore, the dominant costs of eq. (4.5) come from (symmetric) matrix
multiplication rather than computation of or with implicit orthogonal factors, so
we expect higher efficiency for the Gram SVD approach in addition to the reduced
arithmetic.
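As a rough illustration of eqs. (4.4) and (4.5), take the matrix sizes used in section 4.4.4, $I = J = 1000$ and $R = 50$, together with a hypothetical truncation rank $L = 10$ (our choice for illustration, not a value taken from the experiments). The leading terms are then
\[
\begin{aligned}
\text{orthogonalization:}\quad & 2(I+J)R^2 + 4(I+J)RL = 10{,}000{,}000 + 4{,}000{,}000 = 1.4 \times 10^7,\\
\text{Gram SVD:}\quad & (I+J)R^2 + 2(I+J)RL = 5{,}000{,}000 + 2{,}000{,}000 = 7 \times 10^6,
\end{aligned}
\]
while $R^3 = 125{,}000$, so the $O(R^3)$ terms stay lower order as long as their hidden constants are modest.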
4.4.4 Numerical Examples
In this section, we will demonstrate the empirical error of computing a truncated SVD
of $X = AB^T$ using Gram matrices and compare it to the more accurate orthogonalization
approach. We consider 3 synthetic input matrices with differing conditioning
properties to illustrate the differences among the three methods (including both
Cholesky QR and Gram SVD approaches).
In each case, we construct input matrices $A$ and $B$ each to be $1000 \times 50$ and
to have geometrically distributed singular values with random left and right singular
vectors. We use double precision in these experiments. In the first case, we construct
both $A$ and $B$ to have condition numbers of $10^6$: $\kappa(A) = \kappa(B) = 10^6$. That
is, the largest singular value of each matrix is $10^6$, the smallest is $10^0$, and the
rest are geometrically distributed within that range. The condition number of $X$ in this
case is bounded above by $10^{12}$. The second synthetic case has input matrices that are
more ill-conditioned: $\kappa(A) = \kappa(B) = 10^{12}$. The third case has input matrices
that are imbalanced, with $\kappa(A) = 10^{12}$ and $\kappa(B) = 100$.
Figure 4.1 reports the results from truncation via QR (algorithm 8), Gram SVD
(algorithm 9), and Cholesky QR (variant of algorithm 9 described in section 4.4.2).
Each column of the figure corresponds to a different pair of inputs. The top row
plots the computed relative singular values (normalized by $\sigma_1$ so that the first
value is equal to 1), the middle row reports the approximation error after truncation
for various tolerances, and the bottom row reports the computed truncation ranks.
In the left column, we see an example of a typical use case of the algorithm: all
algorithms perform equivalently and the approximation error matches the specified
tolerance. Note that when the tolerance is smaller than the smallest singular value,
no truncation is performed. If both input matrices have condition number smaller
than the inverse of the square root of machine precision, then we expect no distinction
among algorithms. In this case, the conditioning of the Gram matrices is such that
the eigenvalues can be computed accurately and Cholesky decomposition will not fail.
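The effect of conditioning on the Gram matrix can be explored with a small NumPy sketch. It builds a $1000 \times 50$ matrix with geometrically spaced singular values, as in the synthetic inputs described above, and compares the singular values recovered from the eigenvalues of $A^TA$ against the prescribed ones. The function names, the seed, and the clipping of negative eigenvalues to zero are our assumptions, not part of the experiments reported here.

```python
import numpy as np

def synthetic_matrix(m, n, cond, rng):
    """m x n matrix with singular values geometrically spaced from cond down to 1
    and random orthonormal singular vectors; also returns those singular values."""
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))   # random left singular vectors
    W, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random right singular vectors
    s = np.geomspace(cond, 1.0, n)
    return (U * s) @ W.T, s

def singular_values_via_gram(A):
    """Singular values estimated as square roots of the eigenvalues of A^T A."""
    lam = np.linalg.eigvalsh(A.T @ A)                  # ascending order
    return np.sqrt(np.clip(lam, 0.0, None))[::-1]      # clip roundoff negatives, descending

def relative_errors(cond, seed=0):
    """Relative error of each Gram-based singular value, largest value first."""
    rng = np.random.default_rng(seed)
    A, s = synthetic_matrix(1000, 50, cond, rng)
    return np.abs(singular_values_via_gram(A) - s) / s

if __name__ == "__main__":
    for cond in (1e6, 1e12):
        err = relative_errors(cond)
        print(f"cond = {cond:.0e}: error of largest = {err[0]:.1e}, of smallest = {err[-1]:.1e}")
```

For the better-conditioned input, one would expect every singular value to be recovered to several digits, whereas for $\kappa(A) = 10^{12}$ only the larger singular values should remain trustworthy.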
In the middle column, we see an example of input matrices whose condition numbers
are larger than $10^8$. In this case, the Gram matrices are numerically low rank,
causing truncation of the Cholesky decomposition and a loss of accuracy of the
smallest eigenvalues. This causes a sharp truncation of the rank in the case of Cholesky
QR and an overestimate of the singular values of $X$ in the case of Gram SVD. For
tolerances smaller than $10^{-8}$, we see that the approximation error of Cholesky QR
does not drop below the square root of machine precision. The Gram SVD approach's rank
selection deviates slightly from that of QR, but only for very small tolerances near
$10^{-14}$. We note that for tolerances larger than $10^{-8}$, we see no deviation in
behavior across all three algorithms.
[Figure 4.1 panels: columns $\kappa(A) = 10^6, \kappa(B) = 10^6$; $\kappa(A) = 10^{12}, \kappa(B) = 10^{12}$; $\kappa(A) = 10^{12}, \kappa(B) = 10^1$. Rows: computed singular values versus index, error versus tolerance, and ranks versus tolerance. Legend: QR, Gram SVD, CholQR.]
Figure 4.1: Numerical results for truncation of matrix product $X = AB^T$. Columns
correspond to input matrices with different conditioning properties (details given in
section 4.4.4). Top row specifies computed relative singular values before truncation,
middle row reports relative approximation error after truncation, and bottom row
specifies the truncation rank used for various requested tolerances.
In the right column, we consider input with one matrix close to low rank but
the other well conditioned. Again, for tolerances larger than $10^{-8}$, all algorithms
perform well. For tighter tolerances, however, we see that the inaccuracy of small
eigenvalues of the Gram matrix of $A$ causes deviation in truncation rank selection
and approximation error. As in the second case, the Cholesky QR approach does not
attain error below $10^{-8}$ because of the sharp truncation performed by the pivoted
algorithm. The Gram SVD approach computes approximation errors that match
the tolerance closely below $10^{-8}$, but as the tolerance tightens, the method begins
overestimating the truncation rank and eventually stops truncating at all. In this
way, the approximation error satisfies the tolerance, but the rank is not truncated as
much as possible.
Based on these results, we conclude that for tolerances greater than the square
root of machine precision, truncation using Gram matrices is sufficiently accurate.
While small singular values of $A$ and $B$ are not computed as accurately via the Gram
SVD approach, they are not necessary for computing low rank approximations with
large approximation error. We note that the relationship between the SVDs of $A$,
$B$, and $X$ has an effect on the overall accuracy. Even if a less accurate method
is used for the SVD of $A$ and $B$, these results show that the Gram SVD approach
can compute singular values of $X$ that are smaller than the square root of machine
precision. Although the cheapest approach, using pivoted Cholesky QR, is sufficiently
accurate for large tolerances, we use the Gram SVD approach in the context of TT
rounding because it is more robust for smaller tolerances and because the extra
computation has little effect on the overall run time.
[Figure 4.2 panels: (a) Equation (4.6) for $N = 6$, $n = 3$; (b) $\mathcal{A}_{(1:n)}^T \mathcal{A}_{(1:n)}$ for $n = 3$; (c) Computing $G^L_n$ from $G^L_{n-1}$ for $n = 3$.]
Figure 4.2: Tensor network diagrams
4.5 TT-Rounding via Gram SVD
We now apply the approach described in section 4.4 for $X = AB^T$ to the case of
TT rounding. In section 4.5.1, we explain the analogues of matrices $A$ and $B$ within
the TT rounding algorithm, and in section 4.5.2 we show how to compute the Gram
matrices for the associated structured matrices. We then present two algorithmic
variants of TT rounding based on the approach in section 4.5.3 and provide complexity
analysis in section 4.5.5 with comparison against the standard TT rounding via
orthogonalization.
4.5.1 TT Rounding Structure
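The $n$th TT rank of a tensor $\mathcal{X}$ is the rank of the unfolding
$\mathcal{X}_{(1:n)}$, which is an $I_1 \cdots I_n \times I_{n+1} \cdots I_N$ matrix where
each column is a vectorization of an $n$-mode subtensor. If $\mathcal{X}$ is already in TT
format, then $\mathcal{X}_{(1:n)}$ has the following structure [1, Eq. (2.3)]:
\[
\mathcal{X}_{(1:n)} = (I_{I_n} \otimes \mathcal{Q}_{(1:n-1)}) \, V(T_{X,n}) \, H(T_{X,n+1}) \, (I_{I_{n+1}} \otimes \mathcal{Z}_{(1)}), \tag{4.6}
\]
where $\mathcal{Q}$ is $I_1 \times \cdots \times I_{n-1} \times R_{n-1}$ with
\[
\mathcal{Q}(i_1, \ldots, i_{n-1}, r_{n-1}) = T_{X,1}(i_1, :) \cdots T_{X,n-1}(:, i_{n-1}, r_{n-1}),
\]
and $\mathcal{Z}$ is $R_{n+1} \times I_{n+2} \times \cdots \times I_N$ with
\[
\mathcal{Z}(r_{n+1}, i_{n+2}, \ldots, i_N) = T_{X,n+2}(r_{n+1}, i_{n+2}, :) \cdots T_{X,N}(:, i_N).
\]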
Truncating or rounding the TT rank of $\mathcal{X}$ in this case corresponds to performing
a truncated SVD of $\mathcal{X}_{(1:n)}$. The correctness of algorithm 7 stems from the fact
that at the $n$th step of the truncation loop, the matrix
$I_{I_n} \otimes \mathcal{Q}_{(1:n-1)}$ has orthonormal columns and the matrix
$H(T_{X,n+1})(I_{I_{n+1}} \otimes \mathcal{Z}_{(1)})$ has orthonormal rows, and therefore
the truncated SVD of $V(T_{X,n})$ yields the truncated SVD of $\mathcal{X}_{(1:n)}$.
In our proposed approach, we do not impose orthogonality on the exterior matrices
and instead use a Gram-SVD-based approach. To follow the analogy of section 4.4, we
consider $A = \mathcal{A}_{(1:n)} = (I_{I_n} \otimes \mathcal{Q}_{(1:n-1)}) V(T_{X,n})$ and
$B^T = \mathcal{B}_{(1)} = H(T_{X,n+1})(I_{I_{n+1}} \otimes \mathcal{Z}_{(1)})$, where
$\mathcal{A}$ and $\mathcal{B}$ are tensors with dimensions
$I_1 \times \cdots \times I_n \times R_n$ and $R_n \times I_{n+1} \times \cdots \times I_N$,
respectively. We visualize these relationships using a tensor network diagram [48] in
fig. 4.2a. In these diagrams, a node represents a tensor, edges represent modes (so that
the degree of a node is its dimension), and adjacent nodes represent contractions. To
perform the truncation, we first compute $A^TA$ and $B^TB$ as described in section 4.5.2.
Then we follow the approach of algorithm 9 and finally compute $\hat{A}$ and $\hat{B}$ by
updating only $V(T_{X,n})$ and $H(T_{X,n+1})$, leaving the TT cores that constitute
$\mathcal{Q}$ and $\mathcal{Z}$ unchanged.
4.5.2 Structured Gram Matrix Computation
Considering $\mathcal{A}_{(1:n)} = (I_{I_n} \otimes \mathcal{Q}_{(1:n-1)}) V(T_{X,n})$ as
the matrix $A$ in our matrix product example, our goal is to compute $A^TA$ exploiting the
structure of $A$ (and the internal structure of $\mathcal{Q}_{(1:n-1)}$). This can also be
seen as a contraction between $\mathcal{A}$, a tensor of dimension $n+1$, and itself in
the first $n$ modes.
The structure is easiest to understand in the form of a tensor network diagram, as
we show in fig. 4.2b. In the figure, we have $n = 3$, so that $\mathcal{A}$ is a 4-way
tensor composed of 3 TT cores. To visualize contracting $\mathcal{A}$ with itself and
compute $G^L_3 = \mathcal{A}_{(1:3)}^T \mathcal{A}_{(1:3)}$, we draw $\mathcal{A}$ twice and
connect edges corresponding to the modes with dimensions $I_1$, $I_2$, and $I_3$. After
all connected modes are contracted, we are left with 2 uncontracted modes, each of
dimension $R_3$, corresponding to a square output matrix (which is also symmetric). We use
the notation $G^L_3$ to signify that $\mathcal{A}$ is composed of left-most cores; the
matrix $G^L_3$ has dimension $R_3 \times R_3$.
The most efficient way to perform the contractions to compute
$G^L_n = \mathcal{A}_{(1:n)}^T \mathcal{A}_{(1:n)}$ is to work left to right, first
contracting the mode with dimension $I_1$. Because the operation involves two tensors with
dimension 2, it corresponds to the (symmetric) matrix multiplication
$G^L_1 = V(T_{X,1})^T V(T_{X,1})$, where we use the notation $G^L_1$ because the result is
the contraction between the left-most cores and has dimension $R_1 \times R_1$. The next
step is to contract the two $T_{X,2}$ nodes with $G^L_1$ to compute $G^L_2$. These two
contractions can be performed in either order or simultaneously, exploiting symmetry
as we describe below. We continue this process of computing each symmetric Gram
matrix from the previous mode's, finally computing $G^L_n$ from $G^L_{n-1}$ and the two
$T_{X,n}$ cores. Figure 4.2c shows the structure of the tensor network before $G^L_3$ is
computed from $G^L_2$ and the two $T_{X,3}$ cores.
The key to the efficiency of the structured Gram matrix computation in the context
of TT rounding is the fact that we obtain all Gram matrices $G^L_n$ as a by-product
of computing the last one, $G^L_{N-1}$. In this way, we have performed the $A^TA$-analogue
computations for truncating all TT ranks with one left-to-right pass over the TT
representation of the tensor. In order to compute the $B^TB$-analogue quantities, we
make a similar pass from right to left to obtain $G^R_n$ for $1 \le n \le N-1$. Note
that $G^R_n$ is the contraction between the right-most cores to the right of (and not
including) the $n$th core, so that $G^L_n$ and $G^R_n$ are the Gram matrices associated
with the truncation of the $n$th TT rank and are both $R_n \times R_n$.
We now consider two ways of computing $G^L_n$ from $G^L_{n-1}$ and two $T_{X,n}$ cores,
which we refer to as nonsymmetric and symmetric approaches. Computations for $G^R_n$ from
$G^R_{n+1}$ are analogous. In the nonsymmetric approach, we contract $G^L_{n-1}$ with one
of the cores, letting $T_{C,n}$ represent the temporary result as illustrated in
fig. 4.2c. Here we consider $\mathcal{C}$ to be a TT-format tensor with the same
dimensions and ranks as $\mathcal{X}$ for convenient notation. This contraction is a
tensor-times-matrix operation and can be expressed as
$T_{C,n} = T_{X,n} \times_1 G^L_{n-1}$ and computed as
$H(T_{C,n}) = G^L_{n-1} H(T_{X,n})$. After the first contraction, $T_{C,n}$ and the
remaining $T_{X,n}$ share two modes, and the second contraction is across both modes.
This operation can be performed via $G^L_n = V(T_{X,n})^T V(T_{C,n})$. Note that while the
result is symmetric in exact arithmetic, this approach does not assume symmetry, and the
result will not be bit-wise symmetric due to roundoff error.
In the symmetric approach, we can use the fact that every Gram matrix is symmetric
and positive semi-definite. Thus, we can compute a (pivoted) Cholesky decomposition
$G^L_{n-1} = LL^T$. Then we can contract each $L$ factor with one of the $T_{X,n}$ nodes,
permuting slices of $T_{X,n}$ if necessary. Here, one contraction is sufficient because
they are equivalent operations, and we can exploit the triangular structure of $L$ to
save half the arithmetic of the tensor-times-matrix operation. Letting
$T_{D,n} = T_{X,n} \times_1 L$ represent the result, the second contraction is performed
via $G^L_n = V(T_{D,n})^T V(T_{D,n})$, which can be performed symmetrically, again saving
half the arithmetic and producing an exactly symmetric result.
As illustrated in , $G^L_{n-1}$ is a matrix with dimension $R_{n-1} \times R_{n-1}$ and
$T_{X,n}$ has dimensions $R_{n-1} \times I_n \times R_n$. In the nonsymmetric approach,
the first contraction requires $2 I_n R_{n-1}^2 R_n$ operations, and the second
contraction requires $2 I_n R_{n-1} R_n^2$ operations. In the symmetric approach, the
Cholesky decomposition requires $O(R_{n-1}^3)$ operations, and the two contractions
together require $I_n R_{n-1}^2 R_n + I_n R_{n-1} R_n^2$ operations, not including any
pivoting that must be performed. Despite the fact that the symmetric approach saves half
the flops, we use the nonsymmetric approach in our later experiments because of the
empirical performance benefits. We found that the superior performance of gemm over trmm
and syrk (and the need to copy data for trmm) on our platform outweighs the reduction in
arithmetic.
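The nonsymmetric left-to-right pass can be sketched in NumPy as follows. This is a sketch under our own conventions, not the implementation used in the experiments: each core is stored as an array of shape $(R_{n-1}, I_n, R_n)$ with $R_0 = R_N = 1$, the unfoldings use row-major reshapes (which order the merged indices differently from the Kronecker convention of eq. (4.6) but leave the Gram matrices unchanged), and the function names are hypothetical.

```python
import numpy as np

def H(T):
    """Horizontal unfolding of a core: R_{n-1} x (I_n R_n)."""
    return T.reshape(T.shape[0], -1)

def V(T):
    """Vertical unfolding of a core: (R_{n-1} I_n) x R_n."""
    return T.reshape(-1, T.shape[2])

def left_gram_step(G_prev, T):
    """G^L_n from G^L_{n-1} and core T_{X,n} (nonsymmetric approach)."""
    C = (G_prev @ H(T)).reshape(T.shape)   # T_{C,n} = T_{X,n} x_1 G^L_{n-1}
    return V(T).T @ V(C)                   # contract the two shared modes

def left_grams(cores):
    """All left Gram matrices G^L_1, ..., G^L_{N-1} in one left-to-right pass."""
    G = V(cores[0]).T @ V(cores[0])        # G^L_1
    grams = [G]
    for T in cores[1:-1]:
        G = left_gram_step(G, T)
        grams.append(G)
    return grams
```

Both matrix products in `left_gram_step` are general (gemm-type) multiplications, which matches the choice made for the experiments.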
4.5.3 Algorithms
Given the approach to computing Gram matrices of the TT-structured matrices de-
scribed in section 4.5.2, we now present algorithms for TT-rounding using the Gram
SVD approach. We follow the basic steps outlined in section 4.4 and algorithm 9:
compute Gram matrices of factors, perform eigenvalue decompositions, truncate the
combined results using SVD, then apply updates to factors to reduce their dimensions.
As described in section 4.5.2, with a left-to-right and right-to-left pass of the TT
structure, we can obtain the Gram matrices associated with every TT rank truncation.
Given its pair of Gram matrices, each TT rank can be truncated independently of all
others. We call this approach the simultaneous variant to distinguish it from a more
computationally efficient method that truncates ranks in sequence (described below).
The simultaneous variant of the algorithm is given as algorithm 10. Lines 2 to 11
show the set of contractions used to obtain Gram matrices across all modes. Lines 14
to 16 perform the eigenvalue and singular value decompositions of small matrices.
Finally, lines 17 and 18 update the TT cores and reduce their dimension. Note that
the singular values are distributed evenly to each interior factor, as each is scaled by
$\Sigma^{1/2}$.
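A NumPy sketch of the simultaneous variant (algorithm 10) is given below. Several choices are our assumptions rather than part of the algorithm as stated: cores are arrays of shape $(R_{n-1}, I_n, R_n)$ with row-major unfoldings, `tsvd` truncates against an absolute tolerance on the discarded singular values, `eig_psd` discards eigenvalues below a relative floor so that $\Lambda^{-1/2}$ stays finite, and $\mathcal{Y}$ is initialized as a copy of $\mathcal{X}$ with both updates of each step applied to the cores of $\mathcal{Y}$.

```python
import numpy as np

def H(T): return T.reshape(T.shape[0], -1)      # horizontal unfolding
def V(T): return T.reshape(-1, T.shape[2])      # vertical unfolding

def eig_psd(G, floor=1e-14):
    """Eigenpairs of a symmetric PSD matrix, dropping numerically zero eigenvalues."""
    lam, W = np.linalg.eigh(G)
    keep = lam > floor * lam[-1]
    return lam[keep], W[:, keep]

def tsvd(M, tol):
    """Truncated SVD whose discarded singular values have 2-norm at most tol."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]
    k = max(1, int(np.sum(tail > tol)))
    return U[:, :k], s[:k], Vt[:k].T

def tt_round_gram_sim(X, eps):
    N = len(X)
    GL, GR = [None] * N, [None] * N
    GL[1] = V(X[0]).T @ V(X[0])                          # left-to-right Gram pass
    for n in range(2, N):
        T = X[n - 1]
        C = (GL[n - 1] @ H(T)).reshape(T.shape)
        GL[n] = V(T).T @ V(C)
    GR[N - 1] = H(X[N - 1]) @ H(X[N - 1]).T              # right-to-left Gram pass
    for n in range(N - 1, 0, -1):
        T = X[n - 1]
        C = (V(T) @ GR[n]).reshape(T.shape)
        GR[n - 1] = H(C) @ H(T).T
    tol = np.sqrt(GR[0][0, 0]) * eps / np.sqrt(N - 1)    # eps_0; GR[0] holds ||X||^2
    Y = [T.copy() for T in X]
    for n in range(1, N):                                # truncate every rank independently
        lamL, VL = eig_psd(GL[n])
        lamR, VR = eig_psd(GR[n])
        M = (np.sqrt(lamL)[:, None] * (VL.T @ VR)) * np.sqrt(lamR)
        U, s, W = tsvd(M, tol)
        WL = (VL / np.sqrt(lamL)) @ U * np.sqrt(s)                   # R_n x L_n
        WR = (np.sqrt(s)[:, None] * W.T) @ (VR / np.sqrt(lamR)).T    # L_n x R_n
        a, b = Y[n - 1], Y[n]
        Y[n - 1] = (V(a) @ WL).reshape(a.shape[0], a.shape[1], -1)
        Y[n] = (WR @ H(b)).reshape(-1, b.shape[1], b.shape[2])
    return Y
```

Because every pair $(G^L_n, G^R_n)$ comes from the original cores, the iterations of the final loop do not depend on one another except through the cores they overwrite.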
Algorithm 10 TT-Rounding via Gram SVD (Simultaneous)
 1: function $\mathcal{Y}$ = TT-Round-Gram-Sim($\mathcal{X}$, $\varepsilon$)
 2:   $G^L_1 = V(T_{X,1})^T V(T_{X,1})$
 3:   for $n = 2$ to $N-1$ do
 4:     $H(T_{C,n}) = G^L_{n-1} H(T_{X,n})$
 5:     $G^L_n = V(T_{X,n})^T V(T_{C,n})$
 6:   end for
 7:   $G^R_{N-1} = H(T_{X,N}) H(T_{X,N})^T$
 8:   for $n = N-1$ down to $1$ do
 9:     $V(T_{C,n}) = V(T_{X,n}) G^R_n$
10:     $G^R_{n-1} = H(T_{C,n}) H(T_{X,n})^T$
11:   end for
12:   Compute $\|\mathcal{X}\| = (G^R_0)^{1/2}$ and $\varepsilon_0 = \frac{\|\mathcal{X}\|}{\sqrt{N-1}} \varepsilon$
13:   for $n = 1$ to $N-1$ do
14:     $[V_L, \Lambda_L]$ = Eig($G^L_n$)
15:     $[V_R, \Lambda_R]$ = Eig($G^R_n$)
16:     $[U, \Sigma, V]$ = tSVD($\Lambda_L^{1/2} V_L^T V_R \Lambda_R^{1/2}$, $\varepsilon_0$)
17:     $V(T_{Y,n}) = V(T_{X,n}) \cdot (V_L \Lambda_L^{-1/2} U \Sigma^{1/2})$
18:     $H(T_{Y,n+1}) = (\Sigma^{1/2} V^T \Lambda_R^{-1/2} V_R^T) \cdot H(T_{Y,n+1})$
19:   end for
20: end function
Alternatively, we can truncate the TT ranks in sequence to save some arithmetic
by exploiting orthogonality. Following the original approach of TT-Rounding
via orthogonalization (algorithm 7), if we truncate the ranks from left to right and
pass all singular values to the right, then we maintain orthogonality of the left-most
cores. That is, when truncating the $n$th rank and considering eq. (4.6), we have that
$\mathcal{Q}_{(1:n-1)}$ has orthonormal columns. Thus, truncating $\mathcal{X}_{(1:n)}$ is
equivalent to truncating $V(T_{X,n}) H(T_{X,n+1}) (I_{I_{n+1}} \otimes \mathcal{Z}_{(1)})$.
In the standard approach, we also have that
$H(T_{X,n+1})(I_{I_{n+1}} \otimes \mathcal{Z}_{(1)})$ has orthonormal rows, but that does
not apply here. Instead, we use the analogue of $A = V(T_{X,n})$ and
$B^T = H(T_{X,n+1})(I_{I_{n+1}} \otimes \mathcal{Z}_{(1)})$. We note that $B^T$ is
identical to the simultaneous case, so $B^TB$ is exactly $G^R_n$. The $A$ matrix is
different, but because it corresponds to a single core, its Gram matrix is much cheaper
to compute: $G^L_n = V(T_{X,n})^T V(T_{X,n})$.
Thus, we can make a single right-to-left pass to pre-compute all Gram matrices
corresponding to $B^TB$, and then we can make a left-to-right truncation pass where we
maintain orthogonality of the left-most cores and compute Gram matrices for $A^TA$
in sequence. The other added benefit of this approach is that the $n$th core already
has one dimension truncated (from the previous mode) when its Gram matrix is
computed. This sequence variant is presented in algorithm 11.
Algorithm 11 TT-Rounding via Gram SVD (Sequence RLR)
 1: function $\mathcal{Y}$ = TT-Round-Gram-Seq($\mathcal{X}$, $\varepsilon$)
 2:   $G^R_{N-1} = H(T_{X,N}) H(T_{X,N})^T$
 3:   for $n = N-1$ down to $1$ do
 4:     $V(T_{C,n}) = V(T_{X,n}) G^R_n$
 5:     $G^R_{n-1} = H(T_{C,n}) H(T_{X,n})^T$
 6:   end for
 7:   Compute $\|\mathcal{X}\| = (G^R_0)^{1/2}$ and $\varepsilon_0 = \frac{\|\mathcal{X}\|}{\sqrt{N-1}} \varepsilon$
 8:   for $n = 1$ to $N-1$ do
 9:     $G^L_n = V(T_{X,n})^T V(T_{X,n})$
10:     $[V_L, \Lambda_L]$ = Eig($G^L_n$)
11:     $[V_R, \Lambda_R]$ = Eig($G^R_n$)
12:     $[U, \Sigma, V]$ = tSVD($\Lambda_L^{1/2} V_L^T V_R \Lambda_R^{1/2}$, $\varepsilon_0$)
13:     $V(T_{Y,n}) = V(T_{X,n}) \cdot (V_L \Lambda_L^{-1/2} U)$
14:     $H(T_{Y,n+1}) = (\Sigma V^T \Lambda_R^{-1/2} V_R^T) \cdot H(T_{Y,n+1})$
15:   end for
16: end function
We note that the sequence order is arbitrary. Algorithm 11 truncates ranks in
left-to-right order, but it can also truncate right-to-left if the Gram matrix sweep
is done left-to-right. Following prior work [1], we use the acronym RLR to signify
a right-to-left Gram sweep followed by a left-to-right truncation sweep, and we use
LRL to signify left-to-right Gram sweep followed by a right-to-left truncation sweep.
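The RLR sequence variant (algorithm 11) can be sketched in NumPy in the same style. As before, this is a sketch under our own assumptions: cores have shape $(R_{n-1}, I_n, R_n)$ with row-major unfoldings, the helpers `eig_psd` and `tsvd` and their thresholds are hypothetical, and the updates are applied to a working copy $\mathcal{Y}$ of $\mathcal{X}$, so that the Gram matrix in line 9 is formed from the core whose first mode was already truncated in the previous step.

```python
import numpy as np

def H(T): return T.reshape(T.shape[0], -1)      # horizontal unfolding
def V(T): return T.reshape(-1, T.shape[2])      # vertical unfolding

def eig_psd(G, floor=1e-14):
    """Eigenpairs of a symmetric PSD matrix, dropping numerically zero eigenvalues."""
    lam, W = np.linalg.eigh(G)
    keep = lam > floor * lam[-1]
    return lam[keep], W[:, keep]

def tsvd(M, tol):
    """Truncated SVD whose discarded singular values have 2-norm at most tol."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]
    k = max(1, int(np.sum(tail > tol)))
    return U[:, :k], s[:k], Vt[:k].T

def tt_round_gram_seq(X, eps):
    N = len(X)
    GR = [None] * N
    GR[N - 1] = H(X[N - 1]) @ H(X[N - 1]).T              # right-to-left Gram pass
    for n in range(N - 1, 0, -1):
        T = X[n - 1]
        C = (V(T) @ GR[n]).reshape(T.shape)
        GR[n - 1] = H(C) @ H(T).T
    tol = np.sqrt(GR[0][0, 0]) * eps / np.sqrt(N - 1)    # eps_0; GR[0] holds ||X||^2
    Y = [T.copy() for T in X]
    for n in range(1, N):                                # left-to-right truncation pass
        a, b = Y[n - 1], Y[n]
        lamL, VL = eig_psd(V(a).T @ V(a))                # Gram matrix of a single core
        lamR, VR = eig_psd(GR[n])
        M = (np.sqrt(lamL)[:, None] * (VL.T @ VR)) * np.sqrt(lamR)
        U, s, W = tsvd(M, tol)
        WL = (VL / np.sqrt(lamL)) @ U                    # makes core n left-orthonormal
        WR = (s[:, None] * W.T) @ (VR / np.sqrt(lamR)).T # singular values passed to the right
        Y[n - 1] = (V(a) @ WL).reshape(a.shape[0], a.shape[1], -1)
        Y[n] = (WR @ H(b)).reshape(-1, b.shape[1], b.shape[2])
    return Y
```

An LRL version would mirror this sketch: a left-to-right Gram pass followed by a right-to-left truncation pass that passes the singular values to the left.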
4.5.4 Parallelization
Algorithms 10 and 11 are presented as sequential algorithms. We describe the parallel
version of the algorithm in words here, as we have chosen the algorithm for its ease
of parallelization. We follow the same parallel distribution as prior work on TT-
Rounding via orthogonalization [1] described in section 4.2.5, with each TT core
distributed across all processors and each processor owning a subset of the slices in
1D-distribution fashion.
There are two main parallel operations to consider in these algorithms: (1) a
TT core times a small matrix in one mode (e.g., line 4 in algorithm 10), and (2)
the contraction of two TT cores across two modes (e.g., line 5 in algorithm 10).
Given the parallel distribution, a TT core times a small matrix in one mode (which is
expressed as pre-multiplication of the horizontal unfolding or post-multiplication of
the vertical unfolding by a small matrix) can be performed independently, with no
communication, if all processors have access to the small matrix. Also, the contraction
of two TT cores (expressed as the transpose of a vertical unfolding times another
vertical unfolding or a horizontal unfolding times the transpose of another horizontal
unfolding) can be performed via parallel reduction with a small matrix as output:
after local contraction, a single all-reduce computes and stores the result across all
processors.
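The two operations can be illustrated without MPI by simulating $P$ processors in a single NumPy process. In this sketch (ours, for illustration only) the helper `allreduce_sum` stands in for a sum all-reduce such as MPI_Allreduce, each simulated processor owns a block of slices of the core along its physical mode, and the function names are hypothetical.

```python
import numpy as np

def split_core(T, P):
    """1D distribution: each of P processors owns a block of slices (axis 1) of the core."""
    return np.array_split(T, P, axis=1)

def allreduce_sum(local_values):
    """Stand-in for a sum all-reduce: every processor receives the same total."""
    total = sum(local_values)
    return [total.copy() for _ in local_values]

def parallel_gram_step(G_prev, T_local):
    """One left Gram update, G^L_n from G^L_{n-1}, on P simulated processors.

    G_prev[p] is the replicated R_{n-1} x R_{n-1} matrix held by processor p and
    T_local[p] is that processor's block of the core T_{X,n}.
    """
    partial = []
    for G, T in zip(G_prev, T_local):
        r0, i_loc, r1 = T.shape
        C = G @ T.reshape(r0, i_loc * r1)       # core times small matrix: no communication
        partial.append(T.reshape(r0 * i_loc, r1).T @ C.reshape(r0 * i_loc, r1))  # local contraction
    return allreduce_sum(partial)               # the only communication in the step
```

After the all-reduce every processor again holds the full small matrix, which is what allows the next core-times-matrix operation to proceed locally.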
In the simultaneous variant (algorithm 10), computing the left and right Gram
matrices consists of alternating these two operations. Consider lines 2 to 6: if each
$G^L_n$ contraction operation uses an all-reduce, then the subsequent core-times-matrix
operation requires only local computation and no communication. The same pattern
applies to computing the $G^R_n$ matrices. Given that the Gram matrices are all available
on all processors, the EVD and SVD operations can be performed redundantly
so that the update operations in lines 17 and 18 also require no communication. We
note that in the simultaneous variant, the EVD and SVD operations are independent
across modes. It is thus possible to distribute these computations across processors,
allowing $N$ different processors to work simultaneously on all modes. In this case, the
processors need to broadcast their results in order to perform the update operations.
This optimization improves scalability at the expense of slightly higher communication
costs. We have not implemented this approach because the sequence variant of
the algorithm outperforms the simultaneous variant in our experiments.
In the sequence variant (algorithm 11), we pre-compute only one set of Gram
matrices. Computing these Gram matrices is parallelized the same as in the simultaneous
variant. The unique operation for the sequence variant is line 9, which is a
contraction of a TT core with itself, which is performed via local computation and
an all-reduce. As before, the EVD and SVD operations are performed redundantly
and the updates require no communication.
4.5.5 Complexity Analysis
We perform complexity analysis using the simplifying assumptions that all tensor
dimensions are equal, all ranks are equal, and all reduced ranks are equal. That is, we
assume that $I_n = I$ for $1 \le n \le N$ and that the original and reduced ranks are
$R_n = R$ and $L_n = L$ for $1 \le n \le N-1$. For comparison, the parallel cost of
TT-Rounding via orthogonalization (algorithm 7) is given by
\[
\gamma \cdot \left( NIR \, \frac{3R^2 + 6RL + 4L^2}{P} + O(NR^3 \log P) \right)
+ \beta \cdot O(NR^2 \log P) + \alpha \cdot O(N \log P),
\]
where $\gamma$, $\beta$, and $\alpha$ are the costs per flop, word, and message,
respectively [1, Eq. (3.6)].
Algorithm 10 (the simultaneous variant) performs two passes to compute Gram
matrices. For each mode, the local computation involves the multiplication between
a local tensor core of dimension $R \times (I/P) \times R$ with an $R \times R$ matrix,
for a cost of $2IR^3/P$ flops, and a contraction between two cores, which requires
$2IR^3/P$ flops. Thus, the total arithmetic cost of the Gram matrix computations is
$8NIR^3/P$. As described in section 4.5.2, by exploiting symmetry we can reduce the
constant factor from 8 to 4. The EVD and SVD operations are performed on $R \times R$
matrices for a total cost of $O(NR^3)$ flops (note there is no parallelism in these
operations). The updates of the cores are multiplications of the cores with two
$R \times L$ matrices. The first multiplication costs $2IR^2L/P$ flops, while the second
costs $2IRL^2/P$ because it involves a core with one mode of already reduced dimension.
Thus, the total arithmetic cost for the updates is $2NIR^2L/P + 2NIRL^2/P$.
The communication cost of algorithm 10 is that of two all-reduces for each mode
(one for each direction of Gram matrix computation). Thus, the communication
costs across all modes are $\beta \cdot O(NR^2) + \alpha \cdot O(N \log P)$, and the total
parallel cost for algorithm 10 (assuming symmetry is exploited) is
\[
\gamma \cdot \left( NIR \, \frac{4R^2 + 2RL + 2L^2}{P} + O(NR^3) \right)
+ \beta \cdot O(NR^2) + \alpha \cdot O(N \log P).
\]
Algorithm 11 (the sequence variant) performs only one pass to compute Gram
matrices, for an arithmetic cost of $4NIR^3/P$ flops across all modes, or $2NIR^3/P$
flops if we use the symmetric approach. Computing the Gram matrix for the $n$th TT
core in line 9 costs $IR^2L/P$ flops, because its first mode has already been reduced
in dimension from $R$ to $L$. The EVD and SVD operations and the updates of the
cores are the same as in the simultaneous variant. The communication costs are
identical to the simultaneous variant as well: there is one all-reduce for each mode in
the Gram pass and one all-reduce in each mode for line 9. Thus, the total parallel
cost for algorithm 11 (assuming symmetry is exploited) is
\[
\gamma \cdot \left( NIR \, \frac{2R^2 + 3RL + 2L^2}{P} + O(NR^3) \right)
+ \beta \cdot O(NR^2) + \alpha \cdot O(N \log P).
\]
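To make the constants in the leading arithmetic terms concrete, consider the case in which every rank is cut in half, $L = R/2$, as in the synthetic models of table 4.1. This substitution is only an illustration: it ignores the lower-order and communication terms and takes the symmetric Gram computation for granted. The three numerators then become
\[
\begin{aligned}
\text{orthogonalization:}\quad & 3R^2 + 6RL + 4L^2 = 3R^2 + 3R^2 + R^2 = 7R^2,\\
\text{Gram SVD (simultaneous):}\quad & 4R^2 + 2RL + 2L^2 = 4R^2 + R^2 + \tfrac{1}{2}R^2 = 5.5R^2,\\
\text{Gram SVD (sequence):}\quad & 2R^2 + 3RL + 2L^2 = 2R^2 + \tfrac{3}{2}R^2 + \tfrac{1}{2}R^2 = 4R^2.
\end{aligned}
\]
Under these assumptions the leading flop counts of the simultaneous and sequence variants are roughly 79% and 57% of that of the orthogonalization approach, respectively.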
We note that, compared to the orthogonalization approach, the Gram SVD approaches
have reduced constants on the leading arithmetic terms and smaller bandwidth
terms (by a factor of $O(\log P)$). We will see in the numerical results that
the reduced arithmetic provides significant speedup in practice, in part because the
performance of the operations (which are all based on gemm for Gram SVD) also
improves. At higher processor counts, the simplified communication structure (using
a single well-optimized collective) also provides speedup over the more complicated
communication of Tall-Skinny QR of the orthogonalization approach.

Model   Modes   Dimensions                      Memory
1       50      2K × ... × 2K                   77 MB
2       16      100M × 50K × ... × 50K × 1M     8 GB
3       30      2M × ... × 2M                   45 GB
4       10      10K × 20 × ... × 20             930 KB

Table 4.1: Synthetic TT models used for performance experiments. All formal ranks
are 20 and are cut in half to 10 by the TT rounding procedure.
4.6 Numerical Results
4.6.1 Experimental Setup
All parallel scaling experiments are performed on the Andes supercomputer at Oak
Ridge Leadership Computing Facility. Andes is a 704-node Linux cluster. Each node
contains 256 GB of RAM and 2 AMD EPYC 7302 16-Core processors for a total of
32 cores per node. We build our Gram rounding subroutines on top of the library
MPI ATTAC [43], and we use the OpenBLAS implementation of the BLAS and LAPACK
routines [46] together with OpenMPI [24].
As described in table 4.1, we use 4 synthetic TT models for scaling experiments.
Models 1-3 are analogous to the synthetic models used in prior work [1]. Model
4 is identical in shape to the problem we solve via TT-GMRES in the MATLAB
implementation of TT Rounding (see section 4.6.4). For each model, we scale using
the three Gram SVD algorithms described in section 4.5.3 and the original QR-based
TT Rounding algorithm given by algorithm 7. All reported numbers are the minimum
of 5 trials on 5 different allocations. The sequential experiments using MATLAB were
performed on a machine with an Intel Xeon Gold 6226R CPU and 256 GB of RAM.
[Figure 4.3 plot: time (s) versus number of cores ($2^5$ to $2^{11}$) for QR, SIM, LRL, and RLR.]
Figure 4.3: Strong Scaling for Model 2
[Figure 4.4 plots: (a) Strong Scaling, time (s) versus number of cores ($2^6$ to $2^{11}$); (b) Timing Breakdown, time (s) versus number of cores (64 to 2048); both for QR, SIM, LRL, and RLR.]
Figure 4.4: Performance results for Model 3. Dark signifies computation, and light
signifies communication.
4.6.2 Parallel Scaling of TT Rounding
Figures 4.3 and 4.4a present strong scalability comparisons among the different
rounding procedures using models 2 and 3, respectively. In fig. 4.3, we see that the
Gram-SVD-based rounding methods scale well to 32 nodes, with parallel speedups of 26×,
21×, and 21×. The LRL variant is fastest, reaching a speedup of up to 21× compared to
the QR-based rounding. We note that since the mode sizes of the boundary modes are
different, the computational costs of the LRL and RLR variants differ, with LRL
performing approximately half the flops of RLR. As expected, we see a performance
difference between LRL and RLR of nearly 2× when the performance is computation bound,
and the run times converge as communication costs begin to dominate. The scalability
limit is caused by the machine and is not inherent to the algorithm, as we explain in
section 4.6.3.
In the case of model 3, the mode sizes are all equal, and the complexity analysis in
section 4.5.5 tells us that the LRL and RLR approaches are about twice as fast as
the Gram-Sim approach. This analysis is confirmed by the experiment when the time
is computation bound, as we see in fig. 4.4a. Speedups of Gram SVD over QR range
from 6× to 8×, and the parallel speedups for the Gram SVD algorithms on 64 nodes
are 42×, 27×, and 15×.
4.6.3 Time Breakdown of TT Rounding
Figure 4.4b presents the relative communication/computation runtime of the strong
scalability test using model 3, matching the data of fig. 4.4a. We remark that the
communication time is more significant when using the QR-based TT rounding. In theory,
the communication costs of the QR-based rounding are a factor of $O(\log P)$ larger
than those of the Gram rounding procedures. Further, the Gram SVD variants use the
MPI_Allreduce routine, which seems to be more efficient than the TSQR implementation
used in the QR-based rounding.
Figure 4.5 presents the communication/computation runtime breakdown of a weak
scalability test using model 1 and different variants of TT rounding procedures. We
remark that the computation time for each method stays the same as the number of
processors increases, and the relative computation time affirms the theoretical
analysis of the constant factors in the leading terms. The communication time of the
Gram rounding procedures shows a logarithmic increase up to 32 nodes (1024 cores)
and increases significantly on 64 nodes. This behavior appears even earlier, at 256
processors, when using the QR-based TT rounding. In order to understand this
behavior, we performed a scalability test of the MPI_Allreduce routine on Andes
using a single scalar and observed behavior similar to that in fig. 4.5: the time
increases like $\log P$ until 32 nodes and then begins to increase more quickly than
theory suggests. Thus, we believe the scalability limit is reached due to an artifact
of the machine rather than a limitation of the algorithm, whose latency costs should
grow with $O(\log P)$.
[Figure 4.5 plot: time (s) versus number of cores (32 to 2048) for QR, SIM, LRL, and RLR.]
Figure 4.5: Weak scaling time breakdowns for Model 1. Dark signifies computation,
and light signifies communication.
4.6.4 TT-GMRES Performance
Here we consider a parameter-dependent PDE model where we seek an increasingly
accurate solution by refining the mesh in space. This mesh refinement will increase
the size of mode 1 and leave the sizes of the parameter modes the same.
[Figure 4.6 plot: time (s) versus tensor dimension (500 to 1,500) for QR, SIM, and LRL.]
Figure 4.6: TT-GMRES timing for MATLAB implementation. Dark signifies TT
rounding, and light signifies other computation.
MATLAB Performance for Small Problem
In this experiment, we use TT-GMRES to solve the Cookies problem described in
section 4.2.3 using $p = 4$ parameters. The values of each parameter are distributed
logarithmically in the interval $[0.1, 10]$. The discretization of the PDE is obtained
by using FreeFem++ [29]. For each variant of TT rounding, we perform 10 iterations.
The variation in the relative residual norms obtained by the different methods is
negligible. For all methods we obtain an accuracy of approximately $10^{-3}$.
Figure 4.6 shows the performance of the original TT Rounding using QR on a
MATLAB implementation of TT-GMRES compared to the Gram-Sim and Gram-Seq
(LRL) implementations of TT rounding in MATLAB. We note that TT-Rounding accounts
for at least half of the runtime of TT-GMRES using QR and that Gram-Sim gives
at least a 2× speedup over the QR implementation of TT-Rounding for an overall
faster TT-GMRES algorithm.
[Figure 4.7 plot: time (s) versus number of cores ($2^5$ to $2^{11}$) for QR, SIM, and LRL.]
Figure 4.7: TT-GMRES Weak Scaling
Weak Scaling of TT Rounding for Larger Problems
Using a TT tensor of the same dimensions as the one used in section 4.6.4, we weakly
scale the spatial dimension on Andes, keeping all other modes fixed, and report the
results in fig. 4.7. We remark that the LRL variant does less computation than
RLR, so we report only LRL performance, which we see weakly scales well until $2^{10}$
cores.
4.7 Conclusion
We present in this work a parallel rounding procedure for low-rank TT tensors based
on Gram SVD. In contrast with the orthogonalization-based rounding procedure that
relies heavily on QR decomposition of tall and skinny matrices, this method relies on
matrix multiplication. Not only does the Gram SVD approach reduce the computa-
tional complexity, but existing on-node implementations of matrix multiplication are
typically more efficient than those computing and multiplying by orthogonal matrices.
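The contrast between the two approaches can be illustrated on a single tall-and-skinny matrix. The following is a minimal NumPy sketch, not the thesis implementation: the function name `gram_truncated_svd` and the random test matrix are assumptions made for illustration, and the sketch covers only the truncated SVD of one matrix rather than the full TT rounding procedure. It shows that the only operations touching the large dimension are matrix multiplications, with the factorization itself applied to the small Gram matrix.

```python
import numpy as np

def gram_truncated_svd(A, rank):
    """Return U, sigma, V with A ~= (U * sigma) @ V.T, computed via the Gram matrix of A."""
    G = A.T @ A                                     # small n-by-n Gram matrix (matrix multiplication)
    evals, evecs = np.linalg.eigh(G)                # symmetric eigensolver, eigenvalues ascending
    keep = np.argsort(evals)[::-1][:rank]           # indices of the `rank` largest eigenvalues
    V = evecs[:, keep]                              # right singular vectors of A
    sigma = np.sqrt(np.maximum(evals[keep], 0.0))   # singular values are square roots of eigenvalues
    U = (A @ V) / sigma                             # left singular vectors; assumes every sigma > 0
    return U, sigma, V

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 8))                  # hypothetical tall-and-skinny test matrix
U, sigma, V = gram_truncated_svd(A, rank=8)
print(np.allclose(sigma, np.linalg.svd(A, compute_uv=False)))
print(np.allclose((U * sigma) @ V.T, A))
```

If the rows of A were split across processors, each processor could form its local contribution to the Gram matrix and the contributions could be summed with an all-reduce, which is consistent with the communication pattern described in this chapter.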
Our scalability experiments show that the proposed method scales as well as or
better than the state of the art, in large part because all the communication is cast
in terms of all-reduce collectives. We observe a maximum speedup over the previous
work of 21× for a 16-mode tensor on 16 nodes (512 cores). Our numerical experiments
also show that the loss of accuracy inherent in the Gram SVD does not affect the final
accuracy of the solution when used in iterative low-rank solvers such as TT-GMRES,
where aggressive truncation, and hence low accuracy, can be used.
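The accuracy loss mentioned above comes from forming the Gram matrix, which squares the singular values, so singular values much smaller than roughly the square root of machine precision times the largest one cannot be recovered. The NumPy sketch below illustrates this on a small synthetic matrix; the helper name `gram_singular_values` and the chosen singular values are assumptions for illustration only, not data from the experiments.

```python
import numpy as np

def gram_singular_values(A):
    """Singular values of A from the eigenvalues of A.T @ A, largest first."""
    evals = np.linalg.eigvalsh(A.T @ A)             # ascending eigenvalues of the Gram matrix
    return np.sqrt(np.maximum(evals, 0.0))[::-1]    # clip tiny negatives, then sort descending

rng = np.random.default_rng(1)
true_sigma = np.array([1.0, 1e-3, 1e-10])            # hypothetical singular values
U, _ = np.linalg.qr(rng.standard_normal((500, 3)))   # matrix with orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))     # orthogonal matrix
A = (U * true_sigma) @ V.T                           # matrix with the prescribed singular values

svd_sigma = np.linalg.svd(A, compute_uv=False)
gram_sigma = gram_singular_values(A)

svd_rel_err = np.abs(svd_sigma - true_sigma) / true_sigma
gram_rel_err = np.abs(gram_sigma - true_sigma) / true_sigma
print("SVD  relative errors:", svd_rel_err)
print("Gram relative errors:", gram_rel_err)
```

In this sketch the singular value 1e-3, comparable to the truncation accuracy used with TT-GMRES, is recovered accurately by both approaches, while only the singular value 1e-10 is lost by the Gram approach. This matches the observation that aggressive truncation tolerates the reduced accuracy.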
We consider simultaneous and sequence variants of the Gram SVD approach. The
theoretical analysis and experimental results show that the reduced arithmetic of the
sequence variants leads to shorter run times in almost all cases. Within the sequence
variant, we observe that the LRL and RLR orderings are both possible and typically
have comparable run times. We note that for some applications where the first mode
size is much larger than the last mode size (which is common for parametrized PDE
problems), the LRL approach should be used as it has lower computational complexity.
In light of the numerical experiments, we plan in the future to study randomized
methods for performing rounding procedures. Randomized methods could
outperform the proposed procedures as they reduce arithmetic further and also rely
on matrix multiplication. Encouraged by the results of the MATLAB implementation
of TT-GMRES, we also plan to develop a scalable implementation of the TT-based
linear solver that can use our parallel TT rounding algorithms.
Chapter 5: Conclusion
Low-rank approximations of matrices and tensors are applicable for compressing
and interpreting data. By designing and implementing distributed-memory parallel
algorithms for low-rank approximations, we can feasibly compute with larger datasets
in a reasonable amount of time and without exceeding memory constraints.
Chapter 3 presents a distributed-memory implementation of Hierarchical NMF. We
show that this algorithm scales well as long as the local matrix multiplication
problem dominates the run time. This holds true when the data matrices have many more
features than samples, and so have an aspect ratio that is “tall-and-skinny”. However,
applications like hyperspectral imaging [45] have many more samples (pixels) than
features (spectral bands) and so have an aspect ratio of “short-and-fat”. This is due
to the fact that the AVIRIS hyperspectral camera only captures 224 spectral bands,
while it can be used in high altitude imaging that covers thousands of miles for a
total of billions of pixels. This aspect ratio means that it is difficult to scale with
AVIRIS data using our current 1-D row distribution. In order to scale for this type
of data, future work should involve a more general 2-D row and column distribution.
This adds its own scaling difficulties since such distributions require processors to
redistribute data as the hierarchical tree is built.
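The scaling limit of a 1-D row distribution on short-and-fat data can be made concrete with a small sketch. The code below assumes, as the tall-and-skinny and short-and-fat wording above suggests, that rows correspond to features, and it assumes a contiguous block layout; the function name `block_row_ranges` and the processor counts are hypothetical and are not taken from the thesis implementation.

```python
def block_row_ranges(num_rows, num_procs):
    """Half-open row range [start, stop) owned by each processor in a 1-D block row layout."""
    base, extra = divmod(num_rows, num_procs)
    ranges, start = [], 0
    for rank in range(num_procs):
        # The first `extra` processors take one additional row each.
        stop = start + base + (1 if rank < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

bands = 224  # AVIRIS spectral bands, taken here as the rows of the data matrix
for procs in (64, 224, 512):
    owned = block_row_ranges(bands, procs)
    idle = sum(1 for start, stop in owned if start == stop)
    print(procs, "processors:", idle, "own no rows")
```

With 224 rows, any processor count above 224 leaves processors without data under this layout (288 of 512 in the last case), regardless of how many pixels the matrix has. A 2-D distribution would also split the columns, which is why it is the natural candidate for this kind of data.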
Chapter 4 describes an improvement on the distributed-memory Tensor Train
Rounding algorithm using Gram matrices. By using Gram matrices instead of QR to
compute truncated SVDs, this algorithm gives at least a 2× speedup over the
state-of-the-art approach. As in Chapter 3, these truncated SVDs work well when the
matrices are “tall-and-skinny”. This means that it works well when dimensions of a
tensor are much larger than its TT-ranks, as is the case for many problems arising from
parametrized PDEs. However, in applications like the TT-GMRES cookie example
described in section 4.2.3, the dimensions of the TT tensor are either of equal size or
smaller than the TT-ranks for some modes. In some quantum physics applications,
the tensor dimensions are very small (less than 10 even) and the ranks are very large
(greater than 1000) [57]. With this type of problem, the computational bottleneck
goes from computing the Gram matrices (done sequentially) to computing the SVDs
for truncation (which can be done in parallel). Future work should therefore implement
the simultaneous Gram variant for this application, since it can be advantageous in
parallelizing the SVD computations.
Bibliography
[1] Hussam Al Daas, Grey Ballard, and Peter Benner. Parallel algorithms for tensor
train arithmetic. Technical Report 2011.06532, arXiv, 2020. URL: https://
arxiv.org/abs/2011.06532.
[2] E. Apra, E. J. Bylaska, et al. NWChem: Past, present, and future. The Journal
of Chemical Physics, 152(18):184102, 2020.
[3] J. Ballani and L. Grasedyck. A projection method to solve linear systems
in tensor format. Numerical Linear Algebra with Applications, 20(1):27–43,
2013.
[4] G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz.
Communication lower bounds and optimal algorithms for numerical linear alge-
bra. Acta Numerica, 23:1–155, May 2014. doi:10.1017/S0962492914000038.
[5] Grey Ballard, Alicia Klinvex, and Tamara G. Kolda. TuckerMPI: A parallel
C++/MPI software package for large-scale data compression via the Tucker
tensor decomposition. ACM Transactions on Mathematical Software, 46(2),
June 2020. URL: https://dl.acm.org/doi/10.1145/3378445, doi:10.1145/
3378445.
[6] E. Battenberg and D. Wessel. Accelerating non-negative matrix factorization for
audio source separation on multi-core and many-core architectures. In ISMIR,
pages 501–506, 2009. URL: https://archives.ismir.net/ismir2009/paper/
000089.pdf.
[7] P. Benner, S. Dolgov, A. Onwunta, and M. Stoll. Low-rank solvers for unsteady
Stokes–Brinkman optimal control problem with random data. Computer Methods
in Applied Mechanics and Engineering, 304:26–54, 2016.
[8] P. Benner, S. Dolgov, A. Onwunta, and M. Stoll. Low-rank solution of an optimal
control problem constrained by random Navier-Stokes equations. International
Journal for Numerical Methods in Fluids, 92(11):1653–1678, 2020.
[9] Peter Benner, Serkan Gugercin, and Karen Willcox. A survey of projection-
based model reduction methods for parametric dynamical systems. SIAM
Review, 57(4):483–531, 2015. doi:10.1137/130932715.
[10] V. T. Chakaravarthy, J. W. Choi, D. J. Joseph, X. Liu, P. Murali, Y. Sabharwal,
and D. Sreedhar. On optimizing distributed Tucker decomposition for dense ten-
sors. In 2017 IEEE International Parallel and Distributed Processing Symposium
(IPDPS), pages 1038–1047, May 2017. doi:10.1109/IPDPS.2017.86.
[11] E. Chan, M. Heimlich, A. Purkayastha, and R. van de Geijn. Collective com-
munication: theory, practice, and experience. Concurrency and Computation:
Practice and Experience, 19(13):1749–1783, 2007. doi:10.1002/cpe.1206.
[12] Jee Choi, Xing Liu, and Venkatesan Chakaravarthy. High-performance dense
Tucker decomposition on GPU clusters. In Proceedings of the International Conference
for High Performance Computing, Networking, Storage, and Analysis,
SC ’18, pages 42:1–42:11, Piscataway, NJ, USA, 2018. IEEE Press. URL:
http://dl.acm.org/citation.cfm?id=3291656.3291712.
[13] Wolfgang Dahmen, Ronald DeVore, Lars Grasedyck, and Endre Suli. Tensor-
sparsity of solutions to high-dimensional elliptic partial differential equations.
Foundations of Computational Mathematics, 16(4):813–874, 2016.
[14] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-optimal
parallel and sequential QR and LU factorizations. SIAM Journal on Scientific
Computing, 34(1):A206–A239, 2012. URL: http://epubs.siam.org/doi/abs/
10.1137/080731992, doi:10.1137/080731992.
[15] C. Ding, X. He, and H.D. Simon. On the equivalence of nonnegative matrix
factorization and spectral clustering. In SDM ’05, pages 606–610. SIAM, 2005.
doi:10.1137/1.9781611972757.70.
[16] S. V. Dolgov. TT-GMRES: solution to a linear system in the structured tensor
format. Russian Journal of Numerical Analysis and Mathematical Modelling,
28(2):149–172, 01 Apr. 2013. doi:10.1515/rnam-2013-0009.
[17] B. Drake, S. Lee-Urban, and H. Park. Smallk v1.6.2. http://smallk.github.
io/, June 2017.
[18] Bruce A. Draper, Kyungim Baek, Marian Stewart Bartlett, and J. Ross Beveridge.
Recognizing faces with PCA and ICA. Computer Vision and Image Understanding,
91(1):115–137, 2003. Special Issue on Face Recognition. URL:
https://www.sciencedirect.com/science/article/pii/S1077314203000778,
doi:10.1016/S1077-3142(03)00077-8.
[19] R. Du, D. Kuang, B. Drake, and H. Park. DC-NMF: nonnegative matrix
factorization based on divide-and-conquer for fast clustering and topic mod-
eling. Journal of Global Optimization, 68(4):777–798, 2017. doi:10.1007/
s10898-017-0515-z.
[20] Srinivas Eswar, Koby Hayashi, Grey Ballard, Ramakrishnan Kannan, Michael A.
Matheson, and Haesun Park. PLANC: Parallel low rank approximation with
non-negativity constraints. Technical Report 1909.01149, arXiv, 2019. URL:
https://arxiv.org/abs/1909.01149.
[21] J.P. Fairbanks, R. Kannan, H. Park, and D.A. Bader. Behavioral clusters in
dynamic graphs. Parallel Computing, 47:38–50, 2015. doi:10.1016/j.parco.
2015.03.002.
[22] Takeshi Fukaya, Ramaseshan Kannan, Yuji Nakatsukasa, Yusaku Yamamoto,
and Yuka Yanagisawa. Shifted Cholesky QR for computing the QR factorization
of ill-conditioned matrices. SIAM Journal on Scientific Computing, 42(1):A477–
A503, 2020. doi:10.1137/18M1218212.
[23] Takeshi Fukaya, Yuji Nakatsukasa, Yuka Yanagisawa, and Yusaku Yamamoto.
CholeskyQR2: A simple and communication-avoiding algorithm for computing
a tall-skinny QR factorization on a large-scale parallel system. In Proceedings
of the 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale
Systems, ScalA ’14, pages 31–38, Piscataway, NJ, USA, 2014. IEEE Press. URL:
http://dx.doi.org/10.1109/ScalA.2014.11, doi:10.1109/ScalA.2014.11.
[24] Edgar Gabriel, Graham E. Fagg, et al. Open MPI: Goals, concept, and design of a
next generation MPI implementation. In Proceedings, 11th European PVM/MPI
Users’ Group Meeting, pages 97–104, Budapest, Hungary, September 2004.
[25] N. Gillis, D. Kuang, and H. Park. Hierarchical clustering of hyperspectral im-
ages using rank-two nonnegative matrix factorization. IEEE Transactions on
Geoscience and Remote Sensing, 53(4):2066–2078, April 2015. doi:10.1109/
TGRS.2014.2352857.
[26] Lars Grasedyck. Existence and computation of low Kronecker-rank approximations
for large linear systems of tensor product structure. Computing,
72(3-4):247–265, 2004.
[27] L. Grigori and S. Kumar. Parallel Tensor Train through Hierarchical Decompo-
sition. working paper or preprint, February 2021. URL: https://hal.inria.
fr/hal-03081555.
[28] W. Gropp, L.N. Olson, and P. Samfass. Modeling MPI communication per-
formance on SMP nodes: Is it time to retire the ping pong test. In EuroMPI
’16, pages 41–50, New York, NY, USA, 2016. ACM. doi:10.1145/2966884.
2966919.
[29] F. Hecht. New development in FreeFem++. J. Numer. Math., 20(3-4):251–265,
2012. URL: https://freefem.org/.
[30] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, Philadel-
phia, PA, 2nd edition, 2002.
[31] R. Kannan, G. Ballard, and H. Park. A high-performance parallel algorithm for
nonnegative matrix factorization. In PPoPP ’16, pages 9:1–9:11, New York, NY,
USA, February 2016. ACM. doi:10.1145/2851141.2851152.
[32] R. Kannan, G. Ballard, and H. Park. MPI-FAUN: An MPI-based framework for
alternating-updating nonnegative matrix factorization. IEEE Transactions on
Knowledge and Data Engineering, 30(3):544–558, March 2018. doi:10.1109/
TKDE.2017.2767592.
[33] Oguz Kaya and Bora Ucar. Scalable sparse tensor decompositions in distributed
memory systems. In Proceedings of the International Conference for High Per-
formance Computing, Networking, Storage and Analysis, SC ’15, pages 77:1–
77:11, New York, NY, USA, 2015. ACM. URL: http://doi.acm.org/10.1145/
2807591.2807624, doi:10.1145/2807591.2807624.
[34] J. Kim, Y. He, and H. Park. Algorithms for nonnegative matrix and ten-
sor factorizations: a unified view based on block coordinate descent frame-
work. Journal of Global Optimization, 58(2):285–319, 2014. doi:10.1007/
s10898-013-0035-4.
[35] J. Kim and H. Park. Fast nonnegative matrix factorization: An active-set-like
method and comparisons. SIAM Journal on Scientific Computing, 33(6):3261–
3281, 2011. doi:10.1137/110821172.
[36] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications.
SIAM Review, 51(3):455–500, September 2009. doi:10.1137/07070111X.
[37] Daniel Kressner and Christine Tobler. Krylov subspace methods for linear sys-
tems with tensor product structure. SIAM J. Matrix Anal. Appl., 31(4):1688–
1714, 2009/10. doi:10.1137/090756843.
[38] D. Kuang and H. Park. Fast rank-2 nonnegative matrix factorization for hierar-
chical document clustering. In KDD ’13, pages 739–747, New York, NY, USA,
2013. ACM. doi:10.1145/2487575.2487606.
[39] Oak Ridge National Laboratory. Summit: America’s newest and smartest su-
percomputer. https://www.olcf.ornl.gov/summit/.
[40] D. Landgrebe and L. Biehl. Multispec - hyperspectral images. https:
//engineering.purdue.edu/~biehl/MultiSpec/hyperspectral.html,
February 2020.
[41] G.E. Moon, J.A. Ellis, A. Sukumaran-Rajam, S. Parthasarathy, and P. Sadayap-
pan. ALO-NMF: Accelerated locality-optimized non-negative matrix factoriza-
tion. In KDD ’20, 2020. doi:10.1145/3394486.3403227.
[42] Gordon E. Moore. Cramming more components onto integrated circuits,
reprinted from Electronics, volume 38, number 8, April 19, 1965, pp. 114 ff. IEEE
Solid-State Circuits Society Newsletter, 11(3):33–35, 2006. doi:10.1109/N-SSC.
2006.4785860.
[43] MPI ATTAC. URL: https://gitlab.com/aldaas/mpi_attac.
[44] Alexander Novikov, Pavel Izmailov, Valentin Khrulkov, Michael Figurnov, and
Ivan V Oseledets. Tensor Train decomposition on TensorFlow (T3F). Journal
of Machine Learning Research, 21(30):1–7, 2020.
[45] Jet Propulsion Laboratory, California Institute of Technology. AVIRIS data portal
2006-2020. https://aviris.jpl.nasa.gov/dataportal.
[46] OpenBLAS. URL: https://github.com/xianyi/OpenBLAS.
[47] I. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Comput-
ing, 33(5):2295–2317, 2011. doi:10.1137/090752286.
[48] Roger Penrose. Applications of negative dimensional tensors. Combinatorial
mathematics and its applications, 1:221–244, 1971.
[49] Anh-Huy Phan, Petr Tichavsky, and Andrzej Cichocki. Fast alternating LS al-
gorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE
Transactions on Signal Processing, 61(19):4834–4846, Oct 2013. doi:10.1109/
TSP.2013.2269903.
[50] Melven Rohrig-Zollner, Jonas Thies, and Achim Basermann. Performance of
low-rank approximations in tensor train format (TT-SVD) for large dense tensors,
2021. arXiv:2102.00104.
[51] F. Shahnaz, M.W. Berry, V.P. Pauca, and R.J. Plemmons. Document clustering
using nonnegative matrix factorization. Information Processing & Management,
42(2):373–386, 2006. doi:10.1016/j.ipm.2004.11.005.
[52] The ISIC 2020 challenge dataset, 2020. doi:10.34970/2020-ds01.
[53] Shaden Smith, Niranjay Ravindran, Nicholas D. Sidiropoulos, and George
Karypis. SPLATT: Efficient and parallel sparse tensor-matrix multiplication.
In Proceedings of the 2015 IEEE International Parallel and Distributed Pro-
cessing Symposium, IPDPS ’15, pages 61–70, Washington, DC, USA, 2015.
IEEE Computer Society. URL: http://dx.doi.org/10.1109/IPDPS.2015.27,
doi:10.1109/IPDPS.2015.27.
[54] Edgar Solomonik, Devin Matthews, Jeff R Hammond, John F Stanton,
and James Demmel. A massively parallel tensor contraction framework for
coupled-cluster computations. Journal of Parallel and Distributed Computing,
74(12):3176–3190, 2014.
[55] Qingquan Song, Hancheng Ge, James Caverlee, and Xia Hu. Tensor comple-
tion algorithms in big data analytics. ACM Trans. Knowl. Discov. Data, 13(1),
January 2019. doi:10.1145/3278607.
[56] E. Stoudenmire and S. R. White. ITensor: A C++ library for creating efficient
and flexible physics simulations based on tensor product wavefunctions, 2016.
Available online. URL: http://itensor.org/.
[57] E. M. Stoudenmire and Steven R. White. Real-space parallel density matrix
renormalization group. Phys. Rev. B, 87:155137, Apr 2013. URL: https://
link.aps.org/doi/10.1103/PhysRevB.87.155137, doi:10.1103/PhysRevB.
87.155137.
[58] R. Thakur, R. Rabenseifner, and W. Gropp. Optimization of collective com-
munication operations in MPICH. International Journal of High Performance
Computing Applications, 19(1):49–66, 2005. doi:10.1177/1094342005051521.
[59] Christine Tobler. Low-rank Tensor Methods for Linear Systems and Eigen-
value Problems. PhD thesis, ETH Zurich, 2012. URL: http://sma.epfl.ch/
~anchpcommon/students/tobler.pdf.
[60] L.N. Trefethen and D. Bau. Numerical Linear Algebra. Society for Industrial
and Applied Mathematics, 1997.
[61] TT-Toolbox. URL: https://github.com/oseledets/TT-Toolbox.
[62] Laurens Van Der Maaten, Eric Postma, and Jaap Van den Herik. Dimensionality
reduction: a comparative. J Mach Learn Res, 10(66-71):13, 2009.
[63] R. Weinhandl, P. Benner, and T. Richter. Linear low-rank parameter-dependent
fluid-structure interaction discretization in 2D. PAMM, 18(1):e201800178,
2018.
[64] R. Weinhandl, P. Benner, and T. Richter. Low-rank linear fluid-structure inter-
action discretizations. ZAMM - Journal of Applied Mathematics and Mechanics
/ Zeitschrift für Angewandte Mathematik und Mechanik, 100(11):e201900205,
2020.
[65] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative ma-
trix factorization. In SIGIR ’03, pages 267–273, 2003. doi:10.1145/860435.
860485.
[66] Yassine Zniyed, Remy Boyer, Andre L. F. de Almeida, and Gerard Favier. A TT-based
hierarchical framework for decomposing high-order tensors. SIAM Journal
on Scientific Computing, 42(2):A822–A848, 2020.
Curriculum Vitae
Lawton Manning
Employment
• Graduate Research Assistant, Wake Forest University
August 2019 - May 2021
• Security Intern, Logikcull
June 2019 - August 2019
• MATLAB Software Developer, Wake Forest University
June 2018 - May 2019
Education
• Wake Forest University, Winston-Salem NC
M.S. in Computer Science, May 2021
• Wake Forest University, Winston-Salem NC
B.S. in Computer Science, May 2019
Publications
• L. Manning, G. Ballard, R. Kannan, H. Park, Parallel Hierarchical Clustering
using Rank-Two Nonnegative Matrix Factorization, in 2020 IEEE 27th International
Conference on High Performance Computing, Data, and Analytics
(HiPC), Pune, India, 2020, pp. 141-150