
PARALLEL ALGORITHMS FOR LOW-RANK APPROXIMATIONS OF MATRICES AND TENSORS

BY

LAWTON MANNING

A Thesis Submitted to the Graduate Faculty of

WAKE FOREST UNIVERSITY GRADUATE SCHOOL OF ARTS AND SCIENCES

in Partial Fulfillment of the Requirements

for the Degree of

MASTER OF SCIENCE

Computer Science

May 2021

Winston-Salem, North Carolina

Approved By:

Grey Ballard, Ph.D., Advisor

Jennifer Erway, Ph.D., Chair

Samuel Cho, Ph.D.

Table of Contents

List of Figures
Abstract
Chapter 1 Introduction
  1.1 Low-Rank Approximations
  1.2 Distributed-Memory Parallel Algorithms
  1.3 Applications
Chapter 2 Preliminaries
  2.1 Distributed-Memory Parallel Computing
    2.1.1 MPI Cost Model
    2.1.2 MPI Collectives
    2.1.3 Parallel Scaling
  2.2 Matrices
    2.2.1 Singular Value Decomposition
    2.2.2 Truncated SVD
    2.2.3 QR Decomposition
    2.2.4 Nonnegative Matrix Factorization
    2.2.5 Hierarchical NMF
  2.3 Tensors
    2.3.1 Tensor Train
    2.3.2 Tensor Train Notation
    2.3.3 TT Rounding
Chapter 3 Parallel Hierarchical Clustering using Rank-Two Nonnegative Matrix Factorization
  3.1 Abstract
  3.2 Introduction
  3.3 Preliminaries and Related Work
    3.3.1 Non-negative Matrix Factorization (NMF)
    3.3.2 Parallel NMF
    3.3.3 Communication Model
  3.4 Algorithms
    3.4.1 Sequential Algorithms
    3.4.2 Parallelization
  3.5 Experimental Results
    3.5.1 Experimental Platform
    3.5.2 Datasets
    3.5.3 Performance
  3.6 Conclusion
Chapter 4 Tensor Train Rounding using Gram Matrices
  4.1 Abstract
  4.2 Preliminaries
    4.2.1 Tensor Train Notation
    4.2.2 Cholesky QR and Gram SVD
    4.2.3 Cookies Problem and TT-GMRES
    4.2.4 TT-Rounding via Orthogonalization
    4.2.5 Previous Work on Parallel TT-Rounding
  4.3 Introduction
  4.4 Truncation of Matrix Product
    4.4.1 Truncation via Orthogonalization
    4.4.2 Truncation via Gram SVD
    4.4.3 Complexity Analysis
    4.4.4 Numerical Examples
  4.5 TT-Rounding via Gram SVD
    4.5.1 TT Rounding Structure
    4.5.2 Structured Gram Matrix Computation
    4.5.3 Algorithms
    4.5.4 Parallelization
    4.5.5 Complexity Analysis
  4.6 Numerical Results
    4.6.1 Experimental Setup
    4.6.2 Parallel Scaling of TT Rounding
    4.6.3 Time Breakdown of TT Rounding
    4.6.4 TT-GMRES Performance
  4.7 Conclusion
Chapter 5 Conclusion
Bibliography
Curriculum Vitae

List of Figures

2.1 NMF matrix diagram
2.2 Tensor Train format of a five-way tensor
2.3 Unfoldings for TT Tensors
3.1 Hierarchical Clustering of DC Mall HSI
3.2 Hierarchy node classification
3.3 Parallel cluster splitting using Rank-2 NMF
3.4 Strong Scaling for Clustering on DC-HYDICE
3.5 Strong Scaling Speedup for Rank-2 NMF
3.6 Time Breakdown for Rank-2 NMF on Synthetic
3.7 Time Breakdown for Rank-2 NMF on SIIM-ISIC
3.8 Strong Scaling Speedup for Clustering
3.9 Time Breakdown for Clustering on Synthetic
3.10 Time Breakdown for Clustering on SIIM-ISIC
3.11 Level Times for 1 Compute Node on Synthetic
3.12 Level Times for 40 Compute Nodes on Synthetic
3.13 Rank Scaling for Hierarchical and Flat NMF
4.1 Numerical results for truncation of matrix product X = AB^T
4.2 Tensor network diagrams
4.3 Strong Scaling for Model 2
4.4 Performance results for Model 3
4.5 Weak scaling time breakdowns for Model 1
4.6 TT-GMRES timing for MATLAB implementation
4.7 TT-GMRES Weak Scaling

Abstract

Low-rank approximations are useful in the compression and interpretation of large datasets. Distributed parallel algorithms for such approximations, like those for matrices and tensors, are applicable to even larger datasets that cannot conceivably fit on one computer. In this thesis I present parallelizations of two such approximation algorithms: Hierarchical Nonnegative Matrix Factorization and Tensor Train Rounding. In both cases, the distributed parallel algorithms outperform the state of the art.

Nonnegative Matrix Factorization (NMF) is a tool for clustering nonnegative matrix data. A Hierarchical NMF clustering can be achieved by recursively clustering a dataset using Rank-2 (two-cluster) NMF. The hierarchical clustering algorithm can reveal more detailed information about the data. It is also faster than a flat clustering of the same size, since Rank-2 NMF is faster and scales better than the general NMF algorithm as the number of clusters increases.

Tensor Train (TT) uses a series of 3-dimensional TT cores to approximate an N-dimensional tensor. TT ranks determine the sizes of these cores. Arithmetic with Tensor Train causes an artificial increase in the TT ranks, and thus in the sizes of the TT cores. TT applications therefore use an algorithm called TT rounding to truncate the TT ranks subject to some approximation error. The TT rounding algorithm can be thought of as a Truncated Singular Value Decomposition (tSVD) of a product of highly structured matrices. The state-of-the-art approach requires a slow orthogonalization phase. A faster Gram SVD algorithm avoids this slow phase, reducing the computation time of TT rounding and improving its parallel scalability.


Chapter 1: Introduction

Low-rank approximations are useful in the compression and interpretation of large datasets. Distributed-memory parallel algorithms for such approximations, like those for matrices and tensors, are applicable to even larger datasets that cannot conceivably fit on one computer. In this thesis we present parallelizations of two such approximation algorithms: Hierarchical Nonnegative Matrix Factorization and Tensor Train Rounding. In both cases, the distributed-memory parallel algorithms outperform the state of the art.

1.1 Low-Rank Approximations

There is a wide variety of low-rank approximations used in a range of applications such as facial recognition [18], dimensionality reduction [62], hyperspectral image segmentation [25], and data completion [55]. Some of these low-rank approximations include: Singular Value Decomposition (SVD), Nonnegative Matrix Factorization (NMF), Principal Component Analysis (PCA), the tensor CP Decomposition, and Tensor Train (TT).

For example, hyperspectral image segmentation is a popular application for Nonnegative Matrix Factorization (NMF). NMF is a clustering algorithm that can cluster individual pixels in a hyperspectral image. The resulting NMF clustering also contains feature signatures for each cluster and fractional cluster membership for each pixel. For hyperspectral images, these feature signatures can describe the types of materials each pixel captures, as different materials reflect light at different spectra (colors) [25].

Another example is the low-rank approximation of incomplete tensor data, called tensor completion. Tensor completion is the problem of filling missing or unobserved entries of partially observed tensors [55]. Filling missing entries in a tensor leaves many degrees of freedom for what those entries could ultimately be, so tensor completion problems require constraints to be solvable. One of the common constraints is maintaining a low rank in the resulting completed tensor. There are several definitions of rank for a tensor approximation, depending on the type of approximation used. One of the common tensor decompositions used for tensor completion is the CP decomposition. After computing the CP decomposition that best fits the observed data and has a minimal rank, the unobserved data is predicted using the corresponding value from that CP model.

1.2 Distributed-Memory Parallel Algorithms

In 1965, Gordon Moore observed that the number of transistors on a single silicon chip had increased by a factor of two per year and proposed that it would continue to do so for at least the next 10 years [42]. This observation, now known as Moore’s Law, has been generalized over time to computational power instead of transistor density. As engineers met the physical limits of transistor density, other strategies were developed to meet the extended Moore’s Law, such as multiple processor cores on a single chip and GPU accelerators. However, even as computers become more and more powerful, there are still problems that take too long to solve. These problems typically require large amounts of memory as well. Both computational and storage bottlenecks lead us to work on distributed-memory systems such as supercomputers.

The most powerful supercomputers in the world are not made up of futuristic processors or overly large hard drives. Instead, they are giant networks of individual computers made of commercially available technology. For example, the Summit supercomputer at Oak Ridge National Laboratory was the most powerful supercomputer in the world, with 4608 individual “nodes”, each with 2 IBM POWER-9 CPUs and 6 NVIDIA Volta GPUs [39]. Although each of these nodes is powerful in its own right, the ability to utilize multiple nodes in tandem is what makes distributed-memory parallel algorithms high performing.

The Summit nodes each contain 512 GB of main memory for use by the processors [39]. If a problem requires more than this amount of memory, which is likely for problems requiring high performance computing, adding more nodes to the computation can allow for the distribution of that problem’s data across many nodes. However, distributing memory like this comes with a downside, which is the communication between nodes.

Relative to the speed of computation on an individual node, the cost of communicating data between two nodes is orders of magnitude higher. In the worst cases, the majority of time spent in a distributed-memory algorithm can go to that slow communication of data instead of the actual computations of the algorithm, which limits parallel scalability. This is why we must design parallel algorithms that avoid this communication as much as possible. The algorithms presented in this thesis both avoid communicating the bulk of their data and instead communicate the results of smaller, intermediate calculations.

1.3 Applications

This thesis will cover two distributed-memory parallel algorithms for low-rank approximations: Nonnegative Matrix Factorization and Tensor Train. Nonnegative Matrix Factorization (NMF) is a clustering algorithm for nonnegative data that can extract feature signatures and cluster membership for individual samples. Hierarchical Clustering with Rank-2 NMF (HierNMF) results from an optimization on a flat NMF clustering algorithm. HierNMF can give a deeper answer than the flat algorithm and potentially do it faster. This algorithm is discussed further in chapter 3. Tensor Train (TT) is a data compression format for tensors, which are multidimensional arrays in any number of dimensions. TT allows computations to be done on tensors implicitly, without uncompressing them. TT Rounding is a common bottleneck subroutine used in many TT applications, so chapter 4 proposes another approach to that subroutine that avoids both communication and computation to result in a faster approximation.


Chapter 2: Preliminaries

This chapter provides background on how distributed-memory algorithms are designed and implemented using the Message Passing Interface (MPI) and analyzed using the α-β-γ model, as well as the linear algebra concepts needed to understand the content of later chapters.

2.1 Distributed-Memory Parallel Computing

Distributed-memory parallel architectures consist of multiple processors, each with its own local memory. We use the Message Passing Interface (MPI) to allow processors to explicitly send and receive data. MPI is a standard interface for writing distributed-memory parallel code in C, C++, and FORTRAN. Unlike shared-memory interfaces like OpenMP, MPI requires that data be explicitly passed between processors, often through collectively invoked functions.

2.1.1 MPI Cost Model

In analyzing MPI algorithms, there are the normal costs of computation as well as the additional communication costs of passing data between processors. Communication costs can be broken down into two parts: bandwidth and latency. Bandwidth is the cost associated with the amount of data sent between processors. Latency is the overhead cost of sending any amount of data in MPI. To analyze these costs together, we use the α-β-γ model defined in [11]. This model combines the costs of latency, bandwidth, and computation by assigning the coefficients α, β, and γ to each, respectively. On distributed-memory systems, latency is the most costly, followed by bandwidth and then computation, so α ≫ β ≫ γ. In this model, sending a message of w words costs α + βw.
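As a worked illustration of the message cost α + βw, the following Python sketch compares sending n words in one message against sending them one word at a time. The values of `ALPHA` and `BETA` and the helper name `message_cost` are assumptions chosen for illustration; they are not measurements from any machine discussed in this thesis.

```python
def message_cost(words, alpha, beta):
    """Cost of one message of `words` words in the alpha-beta model: alpha + beta * words."""
    return alpha + beta * words

# Hypothetical machine parameters (seconds per message, seconds per word).
ALPHA = 1e-6
BETA = 1e-9

n = 1_000_000
one_large = message_cost(n, ALPHA, BETA)        # one message carrying all n words
many_small = n * message_cost(1, ALPHA, BETA)   # n messages carrying one word each

print(f"one large message:   {one_large:.6f} s")
print(f"many small messages: {many_small:.6f} s")
```

Under these assumed parameters the bandwidth term is identical in both cases, and the difference comes entirely from paying the latency α once rather than n times.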

2.1.2 MPI Collectives

MPI collectives are commonly used functions where groups of processors invoke one function to pass data collectively between them. Table 2.1 shows the MPI collectives used in this thesis and their initial and final data distributions. For example, given elements of a vector x scattered across processors, All-Gather will gather those elements so that all processors have a full copy of x. If instead each processor had a local x, All-Reduce would sum the individual x and store the result on all processors. Reduce-Scatter would sum the local x on each processor and distribute the elements of that sum across processors [11].

Operation        Before                              After

All-Reduce       p0: x^(0)                           p0: Σ_j x^(j)
                 p1: x^(1)                           p1: Σ_j x^(j)
                 p2: x^(2)                           p2: Σ_j x^(j)

Reduce-Scatter   p0: x_0^(0), x_1^(0), x_2^(0)       p0: Σ_j x_0^(j)
                 p1: x_0^(1), x_1^(1), x_2^(1)       p1: Σ_j x_1^(j)
                 p2: x_0^(2), x_1^(2), x_2^(2)       p2: Σ_j x_2^(j)

All-Gather       p0: x_0                             p0: x_0, x_1, x_2
                 p1: x_1                             p1: x_0, x_1, x_2
                 p2: x_2                             p2: x_0, x_1, x_2

Table 2.1: MPI collective algorithm data distributions [11]. x_i is a segment of a vector x. x^(j) is data originally belonging to processor p_j.
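The data movement in Table 2.1 can be sketched with a serial simulation in which a Python list holds one NumPy array per processor. This is not MPI code: the function names `all_reduce`, `reduce_scatter`, and `all_gather` are hypothetical stand-ins that only reproduce the before and after distributions for p = 3.

```python
import numpy as np

def all_reduce(local):
    """Every processor ends with the elementwise sum of all local vectors."""
    total = np.sum(local, axis=0)
    return [total.copy() for _ in local]

def reduce_scatter(local):
    """Processor i ends with segment i of the elementwise sum."""
    total = np.sum(local, axis=0)
    return np.array_split(total, len(local))

def all_gather(segments):
    """Every processor ends with the concatenation of all segments."""
    full = np.concatenate(segments)
    return [full.copy() for _ in segments]

p = 3
# local[j] plays the role of x^(j), the vector owned by processor p_j.
local = [np.arange(6, dtype=float) + 10 * j for j in range(p)]

summed = all_reduce(local)
scattered = reduce_scatter(local)
gathered = all_gather(scattered)   # Reduce-Scatter followed by All-Gather
```

In this simulation, applying `all_gather` to the output of `reduce_scatter` gives the same result as `all_reduce`, which mirrors how the three distributions in the table relate to one another.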

Table 2.2 shows the minimal α-β-γ costs of each of the three collectives described in Table 2.1. As the number of processors p increases, the latency costs increase, eventually creating a bottleneck in any distributed-memory parallel algorithm.

Collective       Computation (γ)   Bandwidth (β)   Latency (α)

All-Reduce       O(n)              O(n)            O(log2 p)
Reduce-Scatter   O(n)              O(n)            O(log2 p)
All-Gather       none              O(n)            O(log2 p)

Table 2.2: MPI collective algorithm costs using the α-β-γ model [11]. The costs assume an input array of n words that is communicated using p processors.

2.1.3 Parallel Scaling

Scaling is useful for analyzing parallel algorithms. There are two types of scaling: strong and weak. Strong scaling is measured by observing the performance gain from increasing the number of processors working on the same problem. An algorithm is said to have perfect strong scaling when the performance “speed-up” relative to one processor is identical to the number of processors used (e.g. 8x speed-up for 8 processors). Perfect strong scaling is possible when the problem is computationally bound and the computations can be evenly distributed between processors. However, after a certain point the communication cost in a parallel algorithm will start to dominate entirely, since it can grow with the number of processors used.

Weak scaling is measured by observing the performance as the number of processors increases in step with the size of the problem. Applications for weak scaling are generally problems where resolution can be increased. This could be the number of spatial grid points in a simulation, for example.
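The strong-scaling quantities described above can be computed directly from measured run times. The sketch below is illustrative only: the timings are made-up numbers and `strong_scaling` is a hypothetical helper, not part of any code from this thesis.

```python
def strong_scaling(times):
    """Map processor count -> (speedup, efficiency) relative to the smallest count."""
    p0 = min(times)
    t0 = times[p0]
    return {p: (t0 / t, (t0 / t) * p0 / p) for p, t in sorted(times.items())}

# Hypothetical timings in seconds for a fixed problem size.
timings = {1: 80.0, 2: 40.0, 4: 22.0, 8: 16.0}
results = strong_scaling(timings)
for p, (speedup, efficiency) in results.items():
    print(f"p={p}: speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

An efficiency of 1.0 corresponds to perfect strong scaling. In these assumed timings the efficiency falls as p grows, which is the pattern expected once communication begins to dominate.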

2.2 Matrices

A matrix is a two-dimensional grid of numbers and is a useful data storage format. In this work, a matrix called “A” is written as A. One of the important characteristics of a matrix that is explored in this thesis is its rank. We will explore the rank further in section 2.2.1.


Low-rank approximations of matrices extract the most useful features out of the original matrix. This can be useful in applications such as image compression, as the resulting representation of the matrix can be smaller but still maintain the essence of the original data.

2.2.1 Singular Value Decomposition

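The Singular Value Decomposition (SVD) is a popular factorization of real or complex matrices into interpretable component matrices. The SVD is given by

\[ A = U \Sigma V^T \tag{2.1} \]

where $A \in \mathbb{R}^{m \times n}$, $U \in \mathbb{R}^{m \times n}$, $\Sigma \in \mathbb{R}^{n \times n}$, $V \in \mathbb{R}^{n \times n}$, and $m \ge n$.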

U and V are orthonormal matrices. Orthonormal matrices have orthogonal column vectors with unit norms. This means that each column vector is perpendicular to the other column vectors in the matrix and their “length” is 1. In the case of the SVD, the column vectors of U and V are called the left and right singular vectors, respectively.

Σ is a diagonal matrix with nonnegative diagonal entries in descending order. This means that only the entries along the main diagonal from upper-left to lower-right can be nonzero, while the rest of the matrix is zero. These diagonal entries are called the singular values, and they are uniquely determined by the matrix A.

The SVD has many properties that are useful for Numerical Linear Algebra. The rank r of the matrix A is defined as the number of nonzero singular values in Σ. Since the number of singular values is bounded by the number of diagonal entries of the matrix Σ, the rank is also bounded as r ≤ n. If r = n, a matrix is said to be full rank.
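The properties above can be checked numerically. The following NumPy sketch builds a small matrix of known rank, computes its SVD in the same shapes as eq. (2.1), and counts the singular values above a threshold. The matrix sizes and the relative threshold `1e-10` are assumptions made for this illustration, since in floating point the "zero" singular values are only approximately zero.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 8, 5, 3
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank 3 by construction

# full_matrices=False gives U of size m x n, n singular values, and V^T of size n x n.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

orth_err_U = np.linalg.norm(U.T @ U - np.eye(n))       # columns of U are orthonormal
orth_err_V = np.linalg.norm(Vt @ Vt.T - np.eye(n))     # columns of V are orthonormal
recon_err = np.linalg.norm(U @ np.diag(s) @ Vt - A)    # A = U Sigma V^T

# Numerical rank: singular values above an assumed relative threshold.
rank = int(np.sum(s > 1e-10 * s[0]))
print(rank)
```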


2.2.2 Truncated SVD

Given a matrix A with rank r and SVD $A = U \Sigma V^T$, the best rank $k \le r$ approximation of A can be defined as

\[ A_k = \sum_{j=1}^{k} \sigma_j u_j v_j^T \tag{2.2} \]

as provided by [60], where $\sigma_j$ are the k largest singular values of A and $u_j$ and $v_j$ are the corresponding column vectors of U and V.

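From eq. (2.2), the truncated SVD consists of the first k columns of U and V and the first k singular values from the full SVD of a matrix A. The Truncated SVD is represented as $A_k = \hat{U} \hat{\Sigma} \hat{V}^T$, where $\hat{U}$ and $\hat{V}$ contain the first k columns of U and V and $\hat{\Sigma}$ is the leading $k \times k$ block of $\Sigma$.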

So, after computing the full SVD as described in section 2.2.1, the truncated SVD for any rank k is trivial to compute.
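A minimal NumPy sketch of this truncation follows. The helper name `truncated_svd` and the test matrix are assumptions for illustration. The two error values it computes can be compared against the discarded singular values: the 2-norm error of the best rank-k approximation equals the (k+1)th singular value, and the Frobenius-norm error equals the root of the sum of squares of all discarded singular values.

```python
import numpy as np

def truncated_svd(A, k):
    """Keep the first k singular triplets of the full SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 6))
k = 2

Uk, sk, Vtk = truncated_svd(A, k)
Ak = Uk @ np.diag(sk) @ Vtk          # same as summing sigma_j * u_j * v_j^T for j = 1..k

s = np.linalg.svd(A, compute_uv=False)
err2 = np.linalg.norm(A - Ak, 2)     # compare with s[k], the (k+1)th singular value
errF = np.linalg.norm(A - Ak, "fro") # compare with sqrt(sum(s[k:] ** 2))
```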

2.2.3 QR Decomposition

Similar to the Singular Value Decomposition (section 2.2.1), the QR decomposition takes any matrix A and computes

\[ A = QR \tag{2.3} \]

where $A \in \mathbb{R}^{m \times n}$, $Q \in \mathbb{R}^{m \times n}$, and $R \in \mathbb{R}^{n \times n}$. Like U and V in the SVD, Q has orthonormal columns. R is an upper triangular matrix: its nonzero entries lie on or above the main diagonal, while every entry below the main diagonal is zero.

The QR decomposition is useful for solving least squares problems. As will be explained in chapter 4, it can also be used as a step in computing the Truncated SVD, since the QR decomposition is less computationally expensive than the SVD.
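As an illustration of the least squares use case, the sketch below solves min ‖Ax − b‖ through the reduced QR decomposition by solving R x = Q^T b, and compares the result with NumPy's own least squares routine. The matrix sizes and random data are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 4))
b = rng.standard_normal(20)

Q, R = np.linalg.qr(A)               # reduced QR: Q is 20 x 4, R is 4 x 4 upper triangular
x = np.linalg.solve(R, Q.T @ b)      # solve R x = Q^T b

# Reference solution from NumPy's least squares solver.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
```

A general solver is used for the triangular system here only to keep the example dependent on NumPy alone; a dedicated triangular solver would exploit the structure of R.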


[Figure 2.1 diagram: A (m×n) ≈ W (m×k) H^T (k×n)]

Figure 2.1: Nonnegative Matrix Factorization (NMF) of a matrix A by factor matrices W and H. The dimensions of each matrix are listed in parentheses below the boxes. The boxes of each matrix are relative in size to one another given dimension choices.

2.2.4 Nonnegative Matrix Factorization

Nonnegative Matrix Factorization (NMF) is an approximation of a high-dimensional matrix as a product of two lower-dimensional nonnegative matrices. The approximation is written as

\[ A \approx W H^T \tag{2.4} \]

where $A \in \mathbb{R}_+^{m \times n}$ is a data matrix and $W \in \mathbb{R}_+^{m \times k}$ and $H \in \mathbb{R}_+^{n \times k}$ are both nonnegative factor matrices. The chosen $k \le \min(m, n)$ is a parameter: it is the rank of the factor matrices and also the nonnegative rank of the approximation of A. This approximation is also depicted in fig. 2.1.

There are several methods for computing an NMF. One of these methods is the Alternating Nonnegative Least Squares (ANLS) method [38]. This method starts with the minimization problem

\[ \min_{W \ge 0} \| A - W H^T \| \tag{2.5} \]

for finding W and the similar problem of

\[ \min_{H \ge 0} \| A^T - H W^T \| \tag{2.6} \]

for finding H. These are both constrained Least Squares (LS) problems with nonnegativity constraints. They are referred to as Nonnegative Least Squares (NNLS).

By fixing either W or H and solving the NNLS problem for the other, an alternating update algorithm can converge to a stopping point, since each of the two minimizations is a convex problem [38].

There are different algorithms used to solve the NNLS problems described in eq. (2.5) and eq. (2.6). One of these methods, Block Principal Pivoting (BPP), is described in [35] and [31]. BPP uses the active set method to compute the NNLS solution. The active set method deals with the nonnegativity constraint of NNLS by iteratively computing the unconstrained LS solution and grouping the variables with negative contributions into the active set, where they are fixed at zero. This active set method is well-defined for the vector case, and is extended to the matrix case by going column-by-column.

2.2.5 Hierarchical NMF

NMF can be used to cluster data by interpreting the W and H factor matrices. For example, if columns of a data matrix represent samples of data and rows represent features of those samples, then the k columns of W represent k clusters of data and the k rows of H^T represent the membership of each data point in the k clusters.

Since NMF can naturally be used as a clustering algorithm, recursively calling NMF with k = 2 on data can result in a hierarchical tree of clusters. This is the basic premise of the Hierarchical NMF algorithm. In Hierarchical NMF, k refers to the number of leaf clusters in the resulting tree.

From section 2.2.4, BPP is a general approach to solving NNLS for any k and scales like O(k). In [38], the authors propose a faster NNLS algorithm that requires k = 2. With k = 2 there are only four possible active sets, so they can be enumerated exhaustively at little cost. Since the algorithm proposed in [38] is so simple to compute for k = 2, the authors proposed that it be used as a subroutine for Hierarchical NMF. In chapter 3, we parallelize this Hierarchical NMF algorithm using a parallel Rank-2 NMF.
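The exhaustive treatment of the four active sets can be sketched as follows. This is a simplified illustration under stated assumptions, not the optimized algorithm of [38] or the parallel implementation of chapter 3: the function names `nnls_rank2` and `rank2_nmf`, the random initialization, and the fixed iteration count are all hypothetical choices. For each vector, the code forms the candidate solution for every active set, discards the infeasible one, and keeps the candidate with the smallest residual. It then alternates these NNLS solves in the ANLS pattern of eqs. (2.5) and (2.6).

```python
import numpy as np

def nnls_rank2(W, a):
    """Solve min_{h >= 0} ||W h - a||_2 for a two-column W by trying all four active sets."""
    candidates = [np.zeros(2)]                        # both entries fixed at zero
    for j in range(2):                                # only entry j is free
        w = W[:, j]
        h = np.zeros(2)
        if w @ w > 0:
            h[j] = max((w @ a) / (w @ w), 0.0)
        candidates.append(h)
    h_free, *_ = np.linalg.lstsq(W, a, rcond=None)    # both entries free
    if np.all(h_free >= 0):
        candidates.append(h_free)
    # The problem is convex, so the best feasible candidate is an NNLS solution.
    return min(candidates, key=lambda h: np.linalg.norm(W @ h - a))

def rank2_nmf(A, iters=30, seed=0):
    """Alternate NNLS updates of H (W fixed) and W (H fixed) for A ~ W H^T with k = 2."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, 2))
    for _ in range(iters):
        H = np.array([nnls_rank2(W, A[:, i]) for i in range(n)])   # n x 2
        W = np.array([nnls_rank2(H, A[j, :]) for j in range(m)])   # m x 2
    return W, H

rng = np.random.default_rng(3)
A = rng.random((12, 2)) @ rng.random((2, 9))    # nonnegative test matrix of rank 2
W, H = rank2_nmf(A)
labels = H.argmax(axis=1)                       # one of two clusters per column of A
rel_err = np.linalg.norm(A - W @ H.T) / np.linalg.norm(A)
```

Assigning each sample to the larger of its two entries in H, as `labels` does, is one simple way to turn the factorization into a two-way split; applying the same split recursively to each part would give a hierarchy in the spirit of Hierarchical NMF.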

2.3 Tensors

Tensors are a generalization of matrices in higher dimensions. In this work, a tensor called “T” is written as T. Tensors are popular in a number of fields such as signal processing, numerical linear algebra, computer vision, numerical analysis, data mining, graph analysis, and neuroscience [36].

2.3.1 Tensor Train

One of the problems of working with tensors is the so-called “curse of dimensionality”, where the number of elements of the tensor is exponential in the number of modes [47]. Some tensor applications can use tens to thousands of modes, which leads to tensors of infeasible size in both storage and computation. A solution to this problem is to use a tensor decomposition that can compress the data and is not exponential in the number of modes. One such decomposition is called Tensor Train.

Tensor Train (TT) is a low-rank tensor decomposition. It has been used in areas such as molecular simulations, data completion, uncertainty quantification, and classification [1]. The “train” of tensor train is a series of tensors, called TT cores. Each of these tensors, with the exception of the first and last tensors, is a three-way tensor. The first and last tensors in the train are both matrices. Figure 2.2 shows a diagram of a five-way tensor in TT format.


[Figure 2.2 diagram: five TT cores with mode sizes I_1, ..., I_5 and TT ranks R_1, ..., R_4; the indices i, j, k, l, m each select one slice from the corresponding core.]

Figure 2.2: TT format of a five-way tensor $X \in \mathbb{R}^{I_1 \times I_2 \times I_3 \times I_4 \times I_5}$. Note that $R_0 = R_N = 1$ is shown through the first and last TT cores being matrices. The blue shaded regions represent the matrices and vectors required in computing eq. (2.7). Although the $I_n$ can be of any size, they are generally thought to be much larger relative to $R_n$ and so this representation shows tall TT cores.

2.3.2 Tensor Train Notation

Given a tensor $X \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, where N is the number of modes of X and each $I_k$ is the dimension of that mode, if X can be represented in TT format, then there exist positive integers $R_0, \ldots, R_N$ with $R_0 = R_N = 1$ and N TT cores, where the nth TT core is $T_{X,n} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$. In other words, X is in TT format if it can be represented as

\[ X(i_1, \ldots, i_N) = T_{X,1}(i_1, :) \cdots T_{X,n}(:, i_n, :) \cdots T_{X,N}(:, i_N) \tag{2.7} \]

where $T_{X,n}$ is the nth tensor core of N cores in the train [47]. Figure 2.2 shows the pattern of element access for the entry X(i, j, k, l, m).
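The element access pattern of eq. (2.7) can be sketched in NumPy as a chain of small matrix products, one slice per core. In this illustration every core is stored as a three-way array of shape (R_{n-1}, I_n, R_n), so the first and last cores carry a singleton rank dimension; that storage choice, the helper names `tt_entry` and `tt_full`, and the sizes used are assumptions for the example.

```python
import numpy as np

def tt_entry(cores, index):
    """Evaluate X(i_1, ..., i_N) as a product of one slice per TT core (eq. (2.7))."""
    v = np.ones((1, 1))
    for core, i in zip(cores, index):
        v = v @ core[:, i, :]          # (1 x R_{n-1}) times (R_{n-1} x R_n)
    return v[0, 0]

def tt_full(cores):
    """Contract all cores into the full tensor; only feasible for small examples."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=1)   # contract the shared rank index
    return full.reshape(full.shape[1:-1])         # drop the singleton R_0 and R_N

rng = np.random.default_rng(4)
dims, ranks = [4, 3, 5], [1, 2, 3, 1]
cores = [rng.standard_normal((ranks[n], dims[n], ranks[n + 1])) for n in range(len(dims))]

X = tt_full(cores)
value = tt_entry(cores, (2, 1, 4))    # one entry, computed without forming X
```

Evaluating a single entry this way touches only one slice of each core, which is what the shaded regions in Figure 2.2 depict.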

The integers $R_0, \ldots, R_N$ are called the TT ranks. Reducing these TT ranks while approximating X yields a tensor in a more compressed format. This TT rank reduction is called TT rounding.

One of the advantages of using Tensor Train over other tensor low-rank approximations is that the number of elements of the TT format is linear rather than exponential in the number of modes of the original tensor. In other words,

\[ |TT(X)| = \sum_{k=1}^{N} R_{k-1} I_k R_k \tag{2.8} \]

where $|TT(X)|$ is the number of elements of the TT representation of X. Note that eq. (2.8) shows that

\[ |TT(X)| = O(N I R^2) \tag{2.9} \]

where N is the number of modes of X, I is the largest dimension of X and R is the largest TT-rank of X.

By comparison to eq. (2.9), another decomposition called Tucker has $O(R^N + NIR)$ elements, where R is called the Tucker rank, which might be different than the TT ranks. TT avoids having a number of elements that is exponential in the number of modes by limiting the number of modes of the factor tensors.
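To make the storage comparison concrete, the following sketch counts elements for one hypothetical configuration: N = 10 modes, each of dimension I = 100, with all interior TT ranks and the Tucker rank set to R = 10. These sizes and the helper names `tt_size` and `tucker_size` are assumptions for illustration; the Tucker count takes one R^N core plus one I × R factor matrix per mode.

```python
def tt_size(dims, ranks):
    """Stored entries in TT format: sum over k of R_{k-1} * I_k * R_k (eq. (2.8))."""
    return sum(ranks[k] * dims[k] * ranks[k + 1] for k in range(len(dims)))

def tucker_size(dims, rank):
    """Stored entries in Tucker format: a rank^N core plus one I_k x rank factor per mode."""
    return rank ** len(dims) + sum(i * rank for i in dims)

N, I, R = 10, 100, 10
dims = [I] * N
ranks = [1] + [R] * (N - 1) + [1]     # R_0 = R_N = 1

full_entries = I ** N                     # 100^10 = 10^20 entries in the full tensor
tt_entries = tt_size(dims, ranks)         # 2 * 1000 + 8 * 10000 = 82000
tucker_entries = tucker_size(dims, R)     # 10^10 + 10000 = 10000010000

print(full_entries, tt_entries, tucker_entries)
```

For these assumed sizes the TT format stores tens of thousands of entries, while the Tucker core alone already needs 10^10.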

Some computations with tensors, such as the truncated SVD, require the individual TT cores to be “unfolded”. Figure 2.3 shows this pattern of unfolding for vertical and horizontal unfoldings.


[Figure 2.3 diagram: a TT core $T_{X,n} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$, its horizontal unfolding $H(T_{X,n}) \in \mathbb{R}^{R_{n-1} \times I_n R_n}$, and its vertical unfolding $V(T_{X,n}) \in \mathbb{R}^{R_{n-1} I_n \times R_n}$.]

Figure 2.3: Types of unfolding for TT tensors. $T_{X,n}$ is the nth TT core. The blue shaded region is a slice of $T_{X,n}$. $H(T_{X,n})$ is the horizontal unfolding of $T_{X,n}$. $V(T_{X,n})$ is the vertical unfolding of $T_{X,n}$.
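The two unfoldings can be sketched with NumPy reshapes. The ordering used here is an assumption read off the figure: the horizontal unfolding places the I_n slices of size R_{n-1} × R_n side by side, and the vertical unfolding stacks the same slices on top of one another. The function names are hypothetical, and a different memory layout (for example column-major storage) would need a different reshape to give the same matrices.

```python
import numpy as np

def horizontal_unfolding(core):
    """R_{n-1} x (I_n * R_n) matrix: slices core[:, i, :] placed side by side."""
    r0, i, r1 = core.shape
    return core.reshape(r0, i * r1)

def vertical_unfolding(core):
    """(R_{n-1} * I_n) x R_n matrix: slices core[:, i, :] stacked vertically."""
    r0, i, r1 = core.shape
    return core.transpose(1, 0, 2).reshape(i * r0, r1)

rng = np.random.default_rng(5)
r0, i_n, r1 = 2, 4, 3
core = rng.standard_normal((r0, i_n, r1))

H = horizontal_unfolding(core)   # shape (2, 12)
V = vertical_unfolding(core)     # shape (8, 3)
```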

2.3.3 TT Rounding

The truncated SVD is necessary to reduce the ranks of a TT tensor X. In general, each TT rank $R_n$ is reduced as the TT rounding algorithm proceeds down the train of TT cores. The current state-of-the-art method of computing TT rounding requires an orthogonalization step using the QR decomposition. Although it is quite accurate, this approach is slow. Chapter 4 describes an improvement on this method that avoids using the QR orthogonalization step, improving the speed of the overall TT Rounding algorithm.


Chapter 3: Parallel Hierarchical Clustering using Rank-Two Nonnegative Matrix Factorization

The following chapter is a manuscript published at the International Conference on High Performance Computing (HiPC’20), authored by myself, Grey Ballard, Ramakrishnan Kannan, and Haesun Park. For this work, I contributed to designing and implementing the parallel algorithms identified in the paper. I also contributed to the experimental section of the manuscript by reporting results and choosing data sets for experimentation.

3.1 Abstract

Nonnegative Matrix Factorization (NMF) is an effective tool for clustering nonnega-

tive data, either for computing a flat partitioning of a dataset or for determining a

hierarchy of similarity. In this paper, we propose a parallel algorithm for hierarchical

clustering that uses a divide-and-conquer approach based on rank-two NMF to split a

data set into two cohesive parts. Not only does this approach uncover more structure

in the data than a flat NMF clustering, but also rank-two NMF can be computed

more quickly than for general ranks, providing comparable overall time to solution.

Our data distribution and parallelization strategies are designed to maintain compu-

tational load balance throughout the data-dependent hierarchy of computation while

limiting interprocess communication, allowing the algorithm to scale to large dense

and sparse data sets. We demonstrate the scalability of our parallel algorithm in terms

of data size (up to 800 GB) and number of processors (up to 80 nodes of the Summit

supercomputer), applying the hierarchical clustering approach to hyperspectral imag-

ing and image classification data. Our algorithm for Rank-2 NMF scales perfectly

on up to 1000s of cores and the entire hierarchical clustering method achieves 5.9x

speedup scaling from 10 to 80 nodes on the 800 GB dataset.

3.2 Introduction

Nonnegative Matrix Factorization (NMF) has been demonstrated to be an effective

tool for unsupervised learning problems including clustering [15, 51, 65]. An NMF

consists of two tall-and-skinny non-negative matrices whose product approximates a

nonnegative data matrix. That is, given an m×n data matrix A, we seek nonnegative

matrices W and H that each have k columns so that $A \approx WH^T$. Each pair of

corresponding columns of W and H form a latent component of the NMF. If the

rows of A correspond to features and the columns to samples, the ith row of the H

matrix represents the loading of sample i onto each latent component and provides a

soft clustering. Because the W factor is also nonnegative, each column can typically

be interpreted as a latent feature vector for each cluster.

Hierarchical clustering is the process of recursively partitioning a group of samples.

While standard NMF is interpreted as a flat clustering, it can also be extended for

hierarchical clustering. Kuang and Park [38] propose a method that uses rank-2 NMF

to recursively bipartition the samples. The method determines a binary tree such that

all leaves contain unique samples and the structure of the tree determines hierarchical

clusters. A single W vector for each node can also be used for cluster interpretation.

We discuss the hierarchical method in more detail in Section 3.3 and Section 3.4.1.

We illustrate the output of the hierarchical clustering method with an example

data set and output tree. Following Gillis et al. [25], we apply the method to a

hyperspectral imaging (HSI) data set of the Washington, D.C. National Mall, which

has pixel dimensions 1280 × 307 and 191 spectral bands. Figure 3.1 visualizes the

output tree with 6 leaves along with their hierarchical relationships. The root node,

labeled 0, is a flattening of the HSI data to a 2D grayscale image. Each other node is

represented by an overlay of the member pixels of the clusters (in blue) on the original

grayscale image. The first bipartitioning separates vegetation (cluster 1) from non-

vegetation (cluster 2), the bipartitioning of cluster 1 separates grass (cluster 3) from

trees (cluster 4), the bipartitioning of cluster 2 separates buildings (cluster 5) from

sidewalks/water (cluster 6), and so on. If the algorithm continues, it chooses to split

the leaf node that provides the greatest benefit to the overall tree, which can be

quantified as a node’s “score” in various ways.

While the hierarchical clustering method offers advantages in terms of interpre-

tation as well as execution time compared to flat NMF, implementations of the al-

gorithm are limited to single workstations and the dataset must fit in the available

memory. Currently available implementations can utilize multiple cores via MAT-

LAB [38] or explicit shared-memory parallelization in the SmallK library [17].

The goal of this work is to use distributed-memory parallelism to scale the algo-

rithm to large datasets that require the memory of multiple compute nodes and to

high processor counts. While flat NMF algorithms have been scaled to HPC plat-

forms [6, 21, 32, 41], our implementation is the first to our knowledge to scale a hier-

archical NMF method to 1000s of cores. As discussed in detail in Section 3.4.2, we

choose to parallelize the computations associated with each node in the tree, which

involve a Rank-2 NMF and the computation of the node’s score. We choose a data

matrix distribution across processors that avoids any redistribution of the input ma-

trix regardless of the data-dependent structure of the tree’s splitting decisions so that

the communication required involves only the small factor matrices. Analysis of the

algorithm shows the dependence of execution time on computation and communica-

tion costs as well as on k, the number of clusters computed. In particular, we confirm

that many of the dominant costs are logarithmic in k, which compares favorably with the linear

or sometimes superlinear dependence of flat NMF algorithms.

We demonstrate in Section 3.5 the efficiency and scalability of our parallel al-

gorithm on three data sets, including the HSI data of the DC mall and an image

classification data set involving skin melanoma. The experimental results show that

our parallelization of Rank-2 NMF is highly scalable, maintaining computation bound

performance on 1000s of cores. We also show the limits of strong scalability when

scaling to large numbers of clusters (leaf nodes), as the execution time shifts to be-

coming interprocessor bandwidth bound and eventually latency bound. The image

classification data set requires 800 GB of memory across multiple nodes to process,

and in scaling from 10 nodes to 80 nodes of the Summit supercomputer (see Sec-

tion 3.5.1), we demonstrate parallel speedups of 7.1× for a single Rank-2 NMF and

5.9× for a complete hierarchical clustering.
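As a rough reading of these figures, parallel efficiency relative to the 10-node baseline can be taken as the measured speedup S divided by the factor by which the node count grows. This definition is an assumption of the worked example, not a quantity reported in the text.

```latex
\[
E = \frac{S}{80/10} = \frac{S}{8}, \qquad
E_{\text{Rank-2 NMF}} = \frac{7.1}{8} \approx 0.89, \qquad
E_{\text{hierarchical}} = \frac{5.9}{8} \approx 0.74 .
\]
```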

3.3 Preliminaries and Related Work

3.3.1 Non-negative Matrix Factorization (NMF)

The NMF constrained optimization problem

$$\min_{W,H \ge 0} \; \|A - WH^{T}\|^{2}$$

is nonlinear and nonconvex, and various optimization techniques can be used to ap-

proximately solve it. A popular approach is to use alternating optimization of the

two factor matrices because each subproblem is a nonnegative least squares (NNLS)

problem, which is convex and can be solved exactly. Many block coordinate descent

(BCD) approaches are possible [34], and one 2-block BCD algorithm that solves the

NNLS subproblems exactly is block principal pivoting [35]. This NNLS algorithm is

an active-set-like method that determines the sets of entries in the solution vectors

that are zero and those that are positive through an iterative but finite process.

[Figure: binary tree of cluster images with nodes labeled 0, 1, 2, 3, 4, 5, 6, 9, 10, 11, 12]

Figure 3.1: Hierarchical Clustering of DC Mall HSI

When the rank of the factorization (the number of columns of W and H) is

2, the NNLS subproblems can be solved much more quickly because the number

of possible active sets is only 4. As explained in more detail in Section 3.4.1, the

optimal solution across the 4 sets can be determined efficiently to solve the NNLS

subproblem more quickly than general-rank approaches like block principal pivoting.

Because of the relative ease of solving the NMF problem for the rank-2 case, Kuang

and Park [38] propose a recursive method to use a rank-2 NMF to partition the input

data into 2 parts, whereby each part can be further partitioned via rank-2 NMF

of the corresponding original data. This approach yields a hierarchical factorization,

potentially uncovering more global structure of the input data and allowing for better

scalability of the algorithm to large NMF ranks.

The hierarchical rank-2 NMF method has been applied to document clustering [38]

and hyperspectral image segmentation [25]. The leaves of the tree also yield a set of

column vectors that can be aggregated into an approximate W factor (ignoring their

hierarchical structure). Using this factor matrix to initialize a higher-rank NMF com-

putation leads to quick convergence and overall faster performance than initializing

NMF with random data; this approach is known as Divide-and-Conquer NMF [19].

We focus in this paper on parallelizing the hierarchical algorithms proposed by Kuang

and Park [38] and Gillis et al. [25].

3.3.2 Parallel NMF

Scaling algorithms for NMF to large data often requires parallelization in order to fit

the data across the memories of multiple compute nodes or speed up the computation

to complete in reasonable time. Parallelizations of multiple optimization approaches

have been proposed for general NMF [6, 17, 21, 32, 41]. In particular, we build upon

the work of Kannan et al. [20, 31, 32] and the open-source library PLANC, designed

for nonnegative matrix and tensor factorizations of dense and sparse data. In this

parallelization, the alternating optimization approach is employed with various op-

tions for the algorithm used to (approximately) solve the NNLS subproblems. The

efficiency of the parallelization is based on scalable algorithms for the parallel ma-

trix multiplications involved in all NNLS algorithms; these algorithms are based on

Cartesian distributions of the input matrix across 1D or 2D processor grids.

3.3.3 Communication Model

We use the α-β-γ model [4, 11, 58] for analysis of distributed-memory parallel algo-

rithms. In this model, the cost of sending a single message of n words of data between

two processors is α + β · n, so that α represents the latency cost of the message and

β represents the bandwidth cost of each word in the message. The γ parameter

represents the computational cost of a single floating point operation (flop). In this

simplified communication model, we ignore contention in the network, assuming in

effect a fully connected network, and other limiting factors in practice such as the

number of hops between nodes and the network injection rate [28]. We let p represent

the number of processors available on the machine.

All of the interprocessor communication in the algorithms presented in this work

is encapsulated in collective communication operations that involve the full set of

processors. Algorithms for implementing the collective operations are built out of

pairwise send and receive operations, and we assume the most efficient algorithms are

used in our analysis [11, 58]. The collectives used in our algorithms are all-reduce,

all-gather, and reduce-scatter. In an all-reduce, all processors start out with the same

amount of data and all end with a copy of the same result, which is in our case a sum

of all the inputs (and the same size as a single input). The cost of an all-reduce of size

n words is α ·O(log p) + (β+ γ) ·O(n) for n > p and α ·O(log p) + (β+ γ) ·O(n log p)

for n < p. In an all-gather, all processors start out with separate data and all end

with a copy of the same result, which is the union of all the input data. If each

processor starts with n/p data and ends with n data, the cost of the all-gather is

α · O(log p) + β · O(n). In a reduce-scatter, all processors start out with the same

amount of data and all end with a subset of the result, which is in our case a sum of

all the inputs (and is smaller than its input). If each processor starts with n data and

ends with n/p data, the cost of the reduce-scatter is α ·O(log p)+(β+γ) ·O(n). In the

case of all-reduce and reduce-scatter, the computational cost is typically dominated

by the bandwidth cost because β ≫ γ.
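The collective costs above can be sketched as small Python functions. This is a minimal illustration, assuming every hidden constant equals 1 and that logarithms are base 2; the function names are hypothetical and do not belong to MPI or PLANC.

```python
import math


def all_reduce_cost(n, p, alpha, beta, gamma):
    """Leading-order all-reduce cost for n words on p processors."""
    latency = alpha * math.log2(p)
    # The text distinguishes n > p from n < p; n == p is grouped with the
    # large-message case here, which is an arbitrary choice of this sketch.
    if n >= p:
        return latency + (beta + gamma) * n
    return latency + (beta + gamma) * n * math.log2(p)


def all_gather_cost(n, p, alpha, beta):
    """Each processor starts with n/p words and ends with n words."""
    return alpha * math.log2(p) + beta * n


def reduce_scatter_cost(n, p, alpha, beta, gamma):
    """Each processor starts with n words and ends with n/p words."""
    return alpha * math.log2(p) + (beta + gamma) * n
```

With unit parameters, a small message (n < p) pays the extra log p factor in its bandwidth term, which is why the 2 × 2 Gram matrix all-reduces later in the chapter are charged O(log p) in every term.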

3.4 Algorithms

3.4.1 Sequential Algorithms

Rank-2 NMF

Using the 2-block BCD approach for a rank-2 NMF yields NNLS subproblems of the

form $\min_{H \ge 0} \|WH^T - A\|$ and $\min_{W \ge 0} \|HW^T - A^T\|$. In each case, the columns of the

transposed variable matrix can be computed independently. Considering the ith row

of H, for example, the NNLS problem to solve is

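$$\min_{h_{i,1},h_{i,2} \ge 0} \left\| \begin{bmatrix} w_1 & w_2 \end{bmatrix} \begin{bmatrix} h_{i,1} \\ h_{i,2} \end{bmatrix} - a_i \right\| = \min_{h_{i,1},h_{i,2} \ge 0} \left\| h_{i,1} w_1 + h_{i,2} w_2 - a_i \right\|$$

where $w_1$ and $w_2$ are the two columns of W and $a_i$ is the ith column of A. We note that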

there are four possibilities of solutions, as each of the two variables may be positive

or zero.

As shown by Kuang and Park [38], determining which of the four possible solutions

is feasible and optimal can be done efficiently by exploiting the following properties:

• if the solution to the unconstrained least squares problem admits two positive

values, it is the optimal solution to the nonnegatively constrained problem,

• if W and A are both nonnegative, then the candidate solution with two zero

values is never (uniquely) optimal and can be discarded, and

• if the unconstrained problem does not admit a positive solution, the better of

the two remaining solutions can be determined by comparing $a_i^T w_1/\|w_1\|$ and

$a_i^T w_2/\|w_2\|$.

If the unconstrained problem is solved via the normal equations, then the temporary

matrices computed for the normal equations ($W^T W$ and $A^T W$) can be re-used to

determine the better of the two solutions with a single positive variable.

Algorithm 1 implements this strategy for all rows of H simultaneously. It takes as

input the matrices $C = A^T W$ and $G = W^T W$, first solves the normal equations for

the unconstrained problem, and then chooses between the two alternate possibilities as

necessary. We note that each row of H is independent, and therefore this algorithm is

easily parallelized. Solving for W can be done using inputs $C = AH$ and $G = H^T H$.

Given that the computational complexity of Algorithm 1 is O(n) (or O(m) when

computing W), and the complexity of computing $W^T W$ and $H^T H$ is O(m + n), the

typical dominant cost of each iteration of Rank-2 NMF is that of computing $A^T W$

and AH, which is O(mn).
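The rank-2 solve described above can be sketched in plain Python. This is an illustrative version, assuming G is symmetric positive definite and using the explicit 2 × 2 inverse in place of a Cholesky factorization; the function name `rank2_nls_solve` is hypothetical.

```python
import math


def rank2_nls_solve(C, G):
    """Solve min_{H >= 0} ||A - W H^T|| row by row from C = A^T W and G = W^T W.

    C is a list of n rows [c_i1, c_i2]; G is a 2x2 symmetric positive definite matrix.
    """
    g11, g12, g22 = G[0][0], G[0][1], G[1][1]
    det = g11 * g22 - g12 * g12
    H = []
    for c1, c2 in C:
        # Unconstrained solution of the normal equations G h = c.
        h1 = (g22 * c1 - g12 * c2) / det
        h2 = (g11 * c2 - g12 * c1) / det
        if h1 < 0 or h2 < 0:
            # Choose between the two single-variable solutions.
            if c1 / math.sqrt(g11) < c2 / math.sqrt(g22):
                h1, h2 = 0.0, c2 / g22
            else:
                h1, h2 = c1 / g11, 0.0
        H.append([h1, h2])
    return H
```

For example, with columns w_1 = (1, 0) and w_2 = (1, 1), so that G = [[1, 1], [1, 2]], the sample a = (0, 1) gives c = (0, 1); the unconstrained solution has a negative first entry, and the routine falls back to the single-variable solution on w_2.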

Hierarchical Clustering

A Rank-2 NMF can be used to partition the columns of the matrix into two parts.

In this case, the columns of the W factor represent feature weights for each of the

two latent components, and the strength of membership in the two components for

each column of A is given by the two values in the corresponding row of H. We can

Algorithm 1 Rank-2 Nonnegative Least Squares Solve [38]

Require: C is n × 2 and G is 2 × 2 and s.p.d.
 1: function H = Rank2-NLS-Solve(C, G)
 2:   H = C G^{-1}    ▷ Solve unconstrained system
 3:   for i = 1 to n do
 4:     if h_{i1} < 0 or h_{i2} < 0 then
 5:       ▷ Choose between single-variable solutions
 6:       if c_{i1}/√g_{11} < c_{i2}/√g_{22} then
 7:         h_{i1} = 0
 8:         h_{i2} = c_{i2}/g_{22}
 9:       else
10:         h_{i1} = c_{i1}/g_{11}
11:         h_{i2} = 0
12:       end if
13:     end if
14:   end for
15: end function
Ensure: H = arg min_{H≥0} ‖A − WH^T‖ is n × 2 with C = A^T W and G = W^T W

determine part membership by comparing those values: if $h_{i1} > h_{i2}$, then column i of

A is assigned to the first part, which is associated with feature vector w1. Membership

can be determined by other metrics that also take into account balance across parts

or attempt to detect outliers.

Given Rank-2 NMF as a splitting procedure, hierarchical clustering builds a binary

tree such that each node corresponds to a subset of samples from the original data

set and each node’s children correspond to a 2-way partition of the node’s samples.

In this way, the leaves form a partition of the original data, and the internal nodes

specify the hierarchical relationship among clusters. As the tree is built, nodes are

split in order of their score, or relative value to the overall clustering of the data.

The process can be continued until a target number of leaves is produced or until all

remaining leaves have a score below a given threshold.

A node’s score can be computed in different ways. For document clustering, Kuang

Figure 3.2: Hierarchy node classification (legend: internal node, frontier node, leaf node)

and Park [38] propose using modified normalized discounted cumulative gain, which

measures how distinct a node’s children are from each other using the feature weights

associated with the node and its children. For hyperspectral imaging data, Gillis et

al. [25] propose using the possible reduction in overall NMF error if the node is split

– the difference in error between using the node itself or using its children. We use

the latter in our implementation.

In any case, a node’s score depends on properties of its children, so the compu-

tation for a split must be done before the split is actually accepted. To this end,

we define a frontier node to be a parent of leaves; these are nodes whose children

have been computed but whose splits have not been accepted. Figure 3.2 depicts the

classification of nodes into internal, frontier, and leaf nodes. As the tree is built, the

algorithm selects the frontier node with the highest score to split, though no compu-

tation is required to split the node. When a frontier node split is accepted, it becomes

an internal node and its children are split (so that their scores can be computed) and

added to the set of frontier nodes. When the algorithm terminates, the leaves are

discarded and the frontier nodes become the leaves of the output tree.

Our hierarchical clustering algorithm is presented in Algorithm 2 and follows that

of Kuang and Park [38]. Each node includes a field A, which is a subset of columns

(samples) of the original data, a feature vector w, which is its corresponding column

of the W matrix from its parent’s Rank-2 NMF, a score, and pointers to its left and

right children. A priority queue Q tracks the frontier nodes so that the node with the

highest score is split at each step of the algorithm. We use a target number of leaf

clusters k as the termination condition. When a node is selected from the priority

queue, it is removed from the set of frontier nodes and its children are added.

Algorithm 2 Hierarchical Clustering [38]

Require: A is m × n, k is target number of leaf clusters
 1: function T = Hier-R2-NMF(A)
 2:   R = node(A)    ▷ create root node
 3:   Split(R)
 4:   inject(Q, R.left)    ▷ create priority queue
 5:   inject(Q, R.right)    ▷ of frontier nodes
 6:   while size(Q) < k do
 7:     N = eject(Q)    ▷ frontier node with max score
 8:     Split(N.left)    ▷ split left child
 9:     inject(Q, N.left)    ▷ and add to Q
10:     Split(N.right)    ▷ split right child
11:     inject(Q, N.right)    ▷ and add to Q
12:   end while
13: end function
Ensure: T is binary tree rooted at R with k frontier nodes, each node has subset of cols of A and feature vector w
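The priority-queue loop can be sketched in Python with the standard `heapq` module. This is a hedged illustration rather than a line-by-line transcription of Algorithm 2: it seeds the queue with the root once the root has been split, following the description of frontier nodes as nodes whose children have already been computed. The `Node` class and `toy_split` are hypothetical stand-ins; a real `split` would run a Rank-2 NMF and compute the score.

```python
import heapq
import itertools


class Node:
    def __init__(self, cols):
        self.cols = cols      # indices of the samples (columns of A) in this node
        self.left = None
        self.right = None
        self.score = 0.0


def toy_split(node):
    """Hypothetical stand-in for Split: halve the columns, score by node size."""
    mid = len(node.cols) // 2
    node.left = Node(node.cols[:mid])
    node.right = Node(node.cols[mid:])
    node.score = float(len(node.cols))


def hier_cluster(root, split, k):
    """Grow the tree until there are k frontier nodes and return them."""
    counter = itertools.count()   # tie-breaker so the heap never compares Node objects
    split(root)
    frontier = [(-root.score, next(counter), root)]   # max-heap via negated scores
    while len(frontier) < k:
        _, _, node = heapq.heappop(frontier)          # frontier node with max score
        for child in (node.left, node.right):
            split(child)                              # computes the child's score
            heapq.heappush(frontier, (-child.score, next(counter), child))
    return [node for _, _, node in frontier]
```

Each accepted split removes one frontier node and adds two, so the loop runs k − 1 times and performs two splits per iteration.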

The splitting procedure is specified in Algorithm 3. After the Rank-2 NMF is

performed, the H factor is used to determine part membership, and the columns of

the W factor are assigned to the child nodes. The score of the node is computed as

the reduction in overall NMF error if the node is split, which can be computed from

the principal singular values of the subsets of columns of the node and its children,

as given in Line 6. The principal singular values of the children are computed via the

power method. Note that the principal singular value of the node itself need not be

recomputed as it was needed for its parent’s score.

Algorithm 3 Node Splitting via Rank-Two NMF

Require: N has a subset of columns given by field A
1: function Split(N)
2:   [W, H] = Rank2-NMF(N.A)    ▷ split N
3:   partition N.A into A_1 and A_2 using H
4:   N.left = node(A_1, w_1)    ▷ create left child
5:   N.right = node(A_2, w_2)    ▷ create right child
6:   N.score = σ_1^2(A_1) + σ_1^2(A_2) − σ_1^2(N.A)
7: end function
Ensure: N has two children and a score
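The partition and score steps can be sketched as follows, assuming the H factor is already available from a Rank-2 NMF. The helper names are hypothetical, the power method uses a fixed iteration count, and the all-ones start vector is chosen only to make the example reproducible (the text initializes randomly, which avoids starting orthogonal to the leading singular vector).

```python
def sigma1_squared(A, iters=200):
    """Power-method estimate of the squared largest singular value of A (list of rows)."""
    n = len(A[0])
    v = [1.0] * n
    sigma = 0.0
    for _ in range(iters):
        u = [sum(a * x for a, x in zip(row, v)) for row in A]              # u = A v
        z = [sum(row[j] * ui for row, ui in zip(A, u)) for j in range(n)]  # z = A^T u
        sigma = sum(x * x for x in z) ** 0.5
        if sigma == 0.0:          # zero matrix or a child with no columns
            return 0.0
        v = [x / sigma for x in z]
    return sigma


def take_columns(A, cols):
    return [[row[j] for j in cols] for row in A]


def split_and_score(A, H):
    """Assign column i to the left part when h_i1 > h_i2, then score the split."""
    left = [j for j, (h1, h2) in enumerate(H) if h1 > h2]
    right = [j for j, (h1, h2) in enumerate(H) if h1 <= h2]
    A1, A2 = take_columns(A, left), take_columns(A, right)
    score = sigma1_squared(A1) + sigma1_squared(A2) - sigma1_squared(A)
    return left, right, score
```

For the 2 × 2 matrix with columns (1, 0) and (0, 2), separating the two columns gives squared principal singular values 1 and 4 for the children and 4 for the parent, so the score is 1.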

3.4.2 Parallelization

In this section, we consider the options for parallelizing Hierarchical Rank-2 NMF

Clustering (Algorithm 2) and provide an analysis for our approach. The running

time of the algorithm is data dependent because not only does each Rank-2 NMF

computation require a variable number of iterations, but also the shape of the tree

can vary from a balanced binary tree with O(log k) levels to a tall, flat tree with O(k)

levels. For the sake of analysis, we will assume a fixed number of NMF iterations for

every node of the tree and we will analyze the cost of complete levels.

The first possibility for parallelization is across the nodes of the tree, as each Rank-

2 NMF split is independent. We choose not to parallelize across nodes in the tree for

two reasons. The first reason is that while the NMF computations are independent,

choosing which nodes to split may depend on global information. In particular, when

the global target is to determine k leaf clusters, the nodes must be split in order

of their scores, which leads to a serialization of the node splits. This serialization

might be relaxed using speculative execution, but it risks performing unnecessary

computation. If the global target is to split all nodes with sufficiently high scores,

then this serialization is also avoided and node splits become truly independent. We

choose not to parallelize in this way to remain agnostic to the global stopping criterion.

The second reason is that parallelizing across nodes requires redistribution of the

input data. Given a node split by p processors, in order to assign disjoint sets of

processors to each child node, each of the p processors would have to redistribute

their local data, sending data for samples not in their child’s set and receiving data

for those in their child’s set. The communication would be data dependent, but on

average, each processor would communicate half of its data in the redistribution set,

which could have an all-to-all communication pattern among the p processors. For

a node with n columns, the communication cost would be at least O(mn/p) words,

which is much larger than the communication cost per iteration of Parallel Rank-2

NMF, as we will see in Section 3.4.2.

By choosing not to parallelize across nodes in the tree, we employ all p proces-

sors on each node, and split nodes in sequence. The primary computations used to

split a node are the Rank-2 NMF and the score computation, which is based on an

approximation of the largest singular value. We use an alternating-updating algo-

rithm for Rank-2 NMF as described in Section 3.3, and we parallelize it following the

methodology proposed in [20] and presented in Algorithm 4.

The communication cost of the algorithm depends on the parallel distribution of

the input matrix data A. In order to avoid redistribution of the matrix data, we choose

a 1D row distribution so that each processor owns a subset of the rows of A. Because

the clustering partition splits the columns of A, each processor can partition its

local data into left and right children to perform the split without any interprocessor

communication. If we use a 2D distribution for a given node, then because the

partition is data dependent, a data redistribution is required in order to obtain a

load balanced distribution of both children. Figure 3.3 presents a visualization of the

node-splitting process using a 1D processor distribution. In the following subsections,

we describe the parallel algorithms for Rank-2 NMF and approximating the principal

singular value given this 1D data distribution and analyze their complexity in the

Figure 3.3: Parallel splitting using Rank-2 NMF and 1D processor distribution. A Rank-2 NMF computes factor matrices W and H to approximate A, the values of H are used to determine child membership of each column (either red or blue), and the corresponding column of the W matrix represents the part's feature weighting. The 1D distribution is depicted for 3 processors to show that splitting requires no interprocessor redistribution as children are evenly distributed identically to the parent.

context of the hierarchical clustering algorithm.
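The reason a 1D row distribution needs no redistribution can be illustrated with a short Python sketch. Each simulated processor holds a block of rows and selects the same columns locally; the function name and the list-of-rows layout are assumptions of the example.

```python
def split_local_block(A_r, left_cols):
    """Split one processor's row block by columns; no data leaves the processor."""
    left = set(left_cols)
    A1 = [[x for j, x in enumerate(row) if j in left] for row in A_r]
    A2 = [[x for j, x in enumerate(row) if j not in left] for row in A_r]
    return A1, A2


# Two simulated processors, each owning one row of a 2 x 3 matrix.
blocks = [[[1, 2, 3]], [[4, 5, 6]]]
children = [split_local_block(A_r, [0, 2]) for A_r in blocks]
```

Every processor keeps the same number of rows in both children, so the children inherit the parent's load balance regardless of how the columns are divided.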

Algorithms

Parallel Rank-2 NMF Algorithm 4 presents the parallelization of an alternating-

updating scheme for NMF that uses the exact rank-2 solve algorithm presented in

Algorithm 1 to update each factor matrix. The algorithm computes the inputs to

the rank-2 solves in parallel and then exploits the parallelism across rows of the

factor matrix so that each processor solves for a subset of rows simultaneously. The

distribution of all matrices is 1D row distribution, so that each processor owns a

subset of the rows of A, W, and H. We use the notation Â to refer to the (m/p) × n

local data matrix and Ŵ and Ĥ to refer to the (m/p) × 2 and (n/p) × 2 local factor

matrices. With this distribution, the computation of $W^T W$ and $H^T H$ each is done

via local multiplication followed by a single all-reduce collective. All processors own

the data they need to compute their contribution to $A^T W$; in order to distribute

the result to compute the rows of H independently, a reduce-scatter collective is used

to sum and simultaneously distribute across processors. To obtain the data needed

to compute W, each processor must access all of H, which is performed via an all-

gather collective. The iteration progresses until a convergence criterion is satisfied.

For performance benchmarking we use a fixed number of iterations, and in practice

we use relative change in objective function value (residual norm).
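The way the inputs to the rank-2 solve are assembled from row blocks can be simulated in a single Python process. In this sketch the all-reduce and reduce-scatter collectives are replaced by ordinary sums and slicing, n is assumed to be divisible by the number of simulated processors, and all names are hypothetical.

```python
def matmul_t(X, Y):
    """Return X^T Y for matrices stored as lists of rows with equal row counts."""
    return [[sum(X[i][a] * Y[i][b] for i in range(len(X))) for b in range(len(Y[0]))]
            for a in range(len(X[0]))]


def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def distributed_inputs(A_blocks, W_blocks):
    """Form G = W^T W and the per-processor row blocks of C = A^T W."""
    G, C = None, None
    for A_r, W_r in zip(A_blocks, W_blocks):
        G_r = matmul_t(W_r, W_r)                 # local 2 x 2 Gram contribution
        B_r = matmul_t(A_r, W_r)                 # local n x 2 contribution to A^T W
        G = G_r if G is None else add(G, G_r)    # stands in for the all-reduce sum
        C = B_r if C is None else add(C, B_r)    # stands in for the reduce-scatter sum
    p = len(A_blocks)
    rows = len(C) // p
    C_blocks = [C[r * rows:(r + 1) * rows] for r in range(p)]   # scatter step
    return G, C_blocks
```

Each simulated processor would then pass its block of C together with the full G to the rank-2 solve to obtain its rows of H.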

Parallel Power Method In order to compute the score for a frontier node, we

use the difference between the principal singular value of the matrix columns of the

node and the sum of those of its children. Thus, we must determine the principal

singular value of every node in the tree once, including leaf nodes. We use the power

method to approximate it, repeatedly applying $A^T A$ to a vector until it converges to

the leading right singular vector. We present the power method in Algorithm 5. Note

that we do not normalize the approximate left singular vector so that the computed

value approximates the square of the largest singular value.

Given the 1D distribution, only one communication collective is required for the

pair of matrix-vector multiplications. That is, the approximate right singular vector v

is redundantly owned on each processor, and the approximate left singular vector u is

distributed across processors. Each processor can compute its local u from v without

Algorithm 4 Parallel Rank-2 NMF

Require: A is m × n and row-distributed across processors so that Â is the local (m/p) × n submatrix
 1: function [W, H] = Parallel-Rank2-NMF(A)
 2:   Initialize local Ŵ randomly
 3:   while not converged do
 4:     ▷ Compute H
 5:     G_W = Ŵ^T Ŵ
 6:     G_W = All-Reduce(G_W)
 7:     B = Â^T Ŵ
 8:     C = Reduce-Scatter(B)
 9:     Ĥ = Rank2-NLS-Solve(C, G_W)
10:     ▷ Compute W
11:     G_H = Ĥ^T Ĥ
12:     G_H = All-Reduce(G_H)
13:     H = All-Gather(Ĥ)
14:     D = Â H
15:     Ŵ = Rank2-NLS-Solve(D, G_H)
16:   end while
17: end function
Ensure: A ≈ WH^T with W, H row-distributed

communication and use the result for its contribution to $v = A^T u$. An all-reduce

collective is used to obtain a copy of v on every processor for the next iteration,

and the norm is redundantly computed without further communication. We used the

relative change in σ as the stopping criterion for benchmarking.

Algorithm 5 Parallel Power Method

Require: A is m × n and row-distributed across processors so that Â is the local (m/p) × n submatrix
 1: function σ = Parallel-Power-Method(A)
 2:   Initialize v randomly and redundantly
 3:   while not converged do
 4:     u = Â v
 5:     z = Â^T u
 6:     v = All-Reduce(z)
 7:     σ = ‖v‖
 8:     v = v/σ
 9:   end while
10: end function
Ensure: σ ≈ σ_1^2(A) is redundantly owned by all procs
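The communication pattern of the parallel power method can be simulated in one Python process. In this sketch `all_reduce_sum` is a hypothetical stand-in for an MPI all-reduce, the start vector is all ones for reproducibility (the algorithm initializes randomly), a fixed iteration count replaces the relative-change stopping criterion, and A is assumed to be nonzero.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def rmatvec(A, u):
    """Return A^T u for A stored as a list of rows."""
    return [sum(A[i][j] * u[i] for i in range(len(A))) for j in range(len(A[0]))]


def all_reduce_sum(vectors):
    """Every simulated processor receives the elementwise sum of all inputs."""
    total = [sum(vals) for vals in zip(*vectors)]
    return [list(total) for _ in vectors]


def parallel_power_method(blocks, iters=100):
    """blocks[r] holds the rows of A owned by simulated processor r."""
    n = len(blocks[0][0])
    v = [[1.0] * n for _ in blocks]     # redundant copy of v on every processor
    sigma = 0.0
    for _ in range(iters):
        # Local work: u_r = A_r v, then the contribution z_r = A_r^T u_r.
        z = [rmatvec(A_r, matvec(A_r, v_r)) for A_r, v_r in zip(blocks, v)]
        v = all_reduce_sum(z)           # the single collective per iteration
        sigma = sum(x * x for x in v[0]) ** 0.5   # identical on every processor
        v = [[x / sigma for x in v_r] for v_r in v]
    return sigma
```

Because u is never normalized, the returned value approximates the square of the largest singular value, matching the Ensure clause of Algorithm 5.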

Analysis

Parallel Rank-2 NMF Each iteration of Algorithm 4 incurs the same cost, so we

analyze per-iteration computation and communication costs. We first consider the

cost of the Rank-2 NNLS solves, which are local computations. In the notation of

Algorithm 1, matrix G is 2 × 2, so solving the unconstrained system (via Cholesky

decomposition) and then choosing between single-positive-variables solutions if neces-

sary requires constant time per row of C. Thus, the cost of Algorithm 1 is proportional

to the number of rows of the first input matrix. In the context of Algorithm 4, the

per-iteration computational cost of rank-2 solves is then O((m+n)/p). The other local

computations are the matrix multiplications $W^T W$ and $H^T H$, which also amount

to O((m+n)/p) flops, and $A^T W$ and $AH$, which require O(mn/p) flops because they

involve the data matrix. Thus, the computation cost is γ · O((mn + m + n)/p) and

typically dominated by the multiplications involving A. We track the lower order

terms corresponding to NNLS solves because their hidden constants are larger than

that of the dominating term.

There are four communication collectives each iteration, and each involves all p

processors. The two all-reduce collectives to compute the Gram matrices of the factor

matrices involve 2×2 matrices and incur a communication cost of (γ+β+α)·O(log p).

The reduce-scatter and all-gather collectives involve n × 2 matrices (the size of H)

and require β ·O(n) +α ·O(log p) in communication cost (we ignore the computation

cost of the reduce-scatter because it is typically dominated by the bandwidth cost).

If the algorithm performs ı iterations, the overall cost of Algorithm 4 is

$$\gamma \cdot O\!\left(\frac{\imath\,(mn + m + n)}{p}\right) + \beta \cdot O(\imath\, n) + \alpha \cdot O(\imath \log p). \tag{3.1}$$

Parallel Power Method Similar to the previous analysis, we consider a single

iteration of the power method. The local computation is dominated by two matrix-

vector products involving the local data matrix of size O(mn/p) words, incurring

O(mn/p) flops. The single communication collective is an all-reduce of the approxi-

mate right singular vector, which is of size n, incurring β ·O(n) +α ·O(log p) commu-

nication. We ignore the O(n) computation cost of normalizing the vector, as it will

typically be dominated by the communication cost of the all-reduce. Over ȷ iterations,

Algorithm 5 has an overall cost of

$$\gamma \cdot O\!\left(\frac{\jmath\, mn}{p}\right) + \beta \cdot O(\jmath\, n) + \alpha \cdot O(\jmath \log p). \tag{3.2}$$

Note the per-iteration cost of the power method differs by only a constant from the

per-iteration cost of Rank-2 NMF. Because the power method involves single vectors

rather than factor matrices with two columns, its constants are smaller than half the

size of their counterparts.

Hierarchical Clustering To analyze the overall cost of the hierarchical clustering

algorithm, we sum the costs over all nodes in the tree. Because the shape of the tree

is data dependent and affects the overall costs, for the sake of analysis we will analyze

only complete levels. The number of rows in any node is m, the same as the root node,

as each splitting corresponds to a partition of the columns. Furthermore, because each

split is a partition, every column of A is represented exactly once in every complete

level of the tree. If we assume that all nodes perform the same number of NMF

iterations (ı) and power method iterations (ȷ), then the dominating cost of a node

with n columns is

$$\gamma \cdot O\!\left(\frac{(\imath+\jmath)\,mn + \imath\,(m+n)}{p}\right) + \beta \cdot O((\imath+\jmath)\,n) + \alpha \cdot O((\imath+\jmath)\log p).$$

Because the sum of the number of columns across any level of the tree is n, the cost

of the ℓth level of the tree is

$$\gamma \cdot O\!\left(\frac{(\imath+\jmath)\,mn + \imath\, m 2^{\ell}}{p}\right) + \beta \cdot O((\imath+\jmath)\,n) + \alpha \cdot O((\imath+\jmath)\,2^{\ell}\log p). \tag{3.3}$$

Note that the only costs that depend on the level index ℓ are the latency cost and a

lower-order computational cost.

Summing over levels and assuming the tree is nearly balanced and has height

O(log k) where k is the number of frontier nodes, we obtain an overall cost of

Algorithm 2 of

$$\gamma \cdot O\!\left(\frac{(\imath+\jmath)\,mn}{p}\log k + \frac{\imath\, mk}{p}\right) + \beta \cdot O((\imath+\jmath)\,n\log k) + \alpha \cdot O((\imath+\jmath)\,k\log p). \tag{3.4}$$
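The step from the per-level cost to the overall cost can be spelled out. Assuming, for illustration only, a perfectly balanced tree whose complete levels are indexed from 0 to log_2 k, the level-dependent factor and the level-independent terms sum as follows:

```latex
\[
\sum_{\ell=0}^{\log_2 k} 2^{\ell} = 2k - 1 = O(k),
\qquad
\sum_{\ell=0}^{\log_2 k} 1 = \log_2 k + 1 = O(\log k).
\]
```

The first sum accounts for the factors of k in the latency term and the lower-order computation term, and the second for the factors of log k in the leading computation term and the bandwidth term.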

We see that the leading order computational cost is logarithmic in k and perfectly

load balanced. If the overall running time is dominated by the computation (and

in particular the matrix multiplications involving A), we expect near-perfect strong

scaling. The bandwidth cost is also logarithmic in k but does not scale with the

number of processors. The latency cost grows most quickly with the target number

of clusters k but is also independent of the matrix dimensions m and n.
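These scaling observations can be checked numerically with a small Python sketch of the three leading-order terms of the overall cost. Hidden constants are set to 1 and logarithms are base 2, both assumptions of the example; the function and parameter names are hypothetical.

```python
import math


def hier_cost_terms(m, n, p, k, nmf_iters, pm_iters):
    """Return (flops, words, messages), to be weighted by gamma, beta, and alpha."""
    both = nmf_iters + pm_iters
    flops = both * m * n / p * math.log2(k) + nmf_iters * m * k / p
    words = both * n * math.log2(k)
    messages = both * k * math.log2(p)
    return flops, words, messages


small = hier_cost_terms(m=8, n=4, p=2, k=4, nmf_iters=1, pm_iters=1)
doubled = hier_cost_terms(m=8, n=4, p=4, k=4, nmf_iters=1, pm_iters=1)
```

Doubling p halves the flop term, leaves the word count unchanged, and increases the message count, which mirrors the shift from computation-bound to communication-bound behavior described in the text.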

3.5 Experimental Results

3.5.1 Experimental Platform

All the experiments in this section were conducted on Summit. Summit is a su-

percomputer created by IBM for the Oak Ridge National Laboratory. There are

approximately 4,600 nodes on Summit. Each node contains two IBM POWER9 pro-

cessors on separate sockets with 512 GB of DDR4 memory. Each POWER9 processor

utilizes 22 IBM SIMD Multi-Cores (SMCs), although one of these SMCs on each pro-

cessor is dedicated to memory transfer and is therefore not available for computation.

For node scaling experiments, all 42 available SMCs were utilized in each node so

that every node computed with 42 separate MPI processes. Additionally, every node

also supports six NVIDIA Volta V100 accelerators but these were unused by our

algorithm.

Our implementation builds on the PLANC open-source library [20] and uses the

Armadillo library (version 9.900.1) for all matrix operations. On Summit, we linked

this version of Armadillo with OpenBLAS (version 0.3.9) and IBM’s Spectrum MPI

(version 10.3.1.2-20200121).

3.5.2 Datasets

Hyperspectral Imaging We use the Hyperspectral Digital Imagery Collection Ex-

periment (HYDICE) image of the Washington DC Mall. We will refer to this dataset

as DC-HYDICE [40]. DC-HYDICE is formatted into a 3-way tensor representing two

spatial dimensions of pixels and one dimension of spectral bands. So, a slice along the

spectral band dimension would be the full DC-HYDICE image in that spectral band.

For hierarchical clustering, these tensors are flattened so that the rows represent the

191 spectral bands and the columns represent the 392960 pixels. The data set is

approximately 600 MB in size.

Image Classification The SIIM-ISIC Melanoma classification dataset, which we

will refer to as SIIM-ISIC [52], consists of 33126 RGB training images equally sized

at 1024 × 1024. Unlike with hyperspectral imaging, the resulting matrix used in hi-

erarchical clustering consists of image pixels along the rows and individual images

along the columns. The resulting matrix is therefore 3145728 × 33126, which is approximately
800 GB in size. Given its size, SIIM-ISIC requires 10 Summit nodes to
perform hierarchical clustering.

Synthetic Dataset Our synthetic dataset has the same aspect ratio as SIIM-ISIC

but consists of fewer rows and columns by a factor of 3. The resulting matrix is

1048576 × 11042. We choose the smaller size in order to fit on a single node for

scaling experiments.
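The stated data set sizes can be checked with a few lines of arithmetic. The sketch below assumes dense storage with 8-byte double-precision entries, which is an assumption on our part but is consistent with the sizes quoted above; the helper name `dense_size_bytes` is illustrative.

```python
def dense_size_bytes(rows, cols, bytes_per_entry=8):
    """Storage for a dense rows x cols matrix (8 bytes per entry assumed)."""
    return rows * cols * bytes_per_entry

datasets = {
    "DC-HYDICE": (191, 392960),
    "SIIM-ISIC": (3145728, 33126),
    "Synthetic": (1048576, 11042),
}

for name, (rows, cols) in datasets.items():
    gigabytes = dense_size_bytes(rows, cols) / 1e9
    print(f"{name:10s} {rows:8d} x {cols:6d}  {gigabytes:8.1f} GB")
```

Under this assumption the synthetic matrix occupies roughly 93 GB, well within the 512 GB of a single Summit node, while SIIM-ISIC alone exceeds the memory of one node.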

3.5.3 Performance

For all hierarchical clustering experiments in this section, the number of tree leaf

nodes k was set at 100, the number of NMF iterations was set to 100, the power

iteration was allowed to stop iterating after convergence, and only complete levels

were considered for analysis purposes for both level and strong scaling plots.

Figure 3.4: Strong Scaling for Clustering on DC-HYDICE (plot of relative speedup versus number of compute cores)

Single-Node Scaling for DC Dataset

DC-HYDICE is small compared to the other datasets, so it can easily fit on one com-

pute node. Also, its small number of 191 rows doesn’t allow for parallelizing beyond

that number of MPI processes. So, this dataset was used for a single-node scaling

experiment on Summit from 1 to 42 cores. Because Rank-2 NMF is memory band-

width bound, we expect limited speedup on one node due to the memory bandwidth

not scaling linearly with the number of cores. Figure 3.4 shows that there is enough

speedup (14× on 42 cores) for it to be worth parallelizing such a small problem, but

perfect scaling requires more memory bandwidth. In this experiment, the processes

were distributed across both sockets so that an even number of cores on each socket

are used.

Figure 3.5: Strong Scaling Speedup for Rank-2 NMF (plots of relative speedup versus number of compute nodes; panels: (a) Synthetic Data, (b) SIIM-ISIC Data)

Rank-2 NMF Strong Scaling

We perform strong scaling experiments for a single Rank-2 NMF (Algorithm 4) on

the synthetic and SIIM-ISIC datasets. The theory (Equation (3.1)) suggests that

perfect strong scaling is possible as long as the execution time is dominated by local

computation. Both the matrix multiplications and NNLS solves scale linearly with

1/p (we expect MatMul to dominate), but the bandwidth cost is independent of p

and latency increases slightly with p.

Figures 3.5a and 3.5b show performance relative to the smallest number of com-

pute nodes required to store data and factor matrices. For these data sets, we observe

nearly perfect strong scaling, with 42× speedup on 40 compute nodes (over 1 compute

node) for synthetic data and 7.1× speedup on 80 compute nodes (over 10 compute

nodes) for SIIM-ISIC data.

The relative time breakdowns are presented in Figures 3.6 and 3.7 and explain

the strong scaling performance. Each experiment is normalized to 100% time, so

comparisons cannot be readily made across numbers of compute nodes. For both data

sets, we see that the time is dominated by MatMul, which is the primary reason for

the scalability. The dominant matrix multiplications are between a large matrix and
a matrix with 2 columns, so they are locally memory bandwidth bound, with performance
proportional to the size of the large matrix. In each plot, we also see the relative time
of all-gather and reduce-scatter increasing, which is because the local computation is
decreasing while the communication cost is slightly increasing with p. This pattern
will continue as p increases, which will eventually limit scalability, but for these data
sets the MatMul takes around 80% of the time at over 2000 cores.

Figure 3.6: Time Breakdown for Rank-2 NMF on Synthetic (plot of relative time for 1 to 40 compute nodes; categories: MatMul, NNLS, Gram, Comp-Sigma, AllGather, ReduceScatter, AllReduce, Comm-Sigma)

Figure 3.7: Time Breakdown for Rank-2 NMF on SIIM-ISIC (plot of relative time for 10 to 80 compute nodes; same categories as Figure 3.6)

Hierarchical Clustering Strong Scaling

From Equation (3.4), we expect to see perfect strong scaling in a computationally

bound clustering problem with target cluster count k = 100. As k is large, we expect

the latency cost of small problems deep in the tree to limit scalability.

Figure 3.8a demonstrates the scalability of the synthetic data set on up to 40 nodes,
and we observe a 15× speedup compared to 1 node. Figure 3.9 shows the relative
time breakdown and explains the limitation on scaling. On 40 nodes, computation
still takes 60% of the total time, but the all-gather and reduce-scatter costs have
grown in relative time because they do not scale with p. Because all-reduce involves
only a constant amount of data and its time remains relatively small, we conclude
the communication is bandwidth bound at this scale.

Figure 3.8: Strong Scaling Speedup for Clustering (plots of relative speedup versus number of compute nodes; panels: (a) Synthetic Data, (b) SIIM-ISIC Data)

With the larger SIIM-ISIC dataset, it’s possible to scale much further as seen in

Figure 3.8b, where we observe a 5.9× speedup on 80 compute nodes compared to 10.

From Figure 3.10, we see that the communication cost constitutes less than 20% of

the total time even at 80 compute nodes.

We note that the speedup of the overall hierarchical clustering algorithm is not

as high as for a single Rank-2 NMF (measured at the root node). This is due to

inefficiencies in the lower levels of the tree, as we explore in the next section.
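To put the reported speedups on a common footing, the following sketch converts each one into a parallel efficiency, defined here as the observed speedup divided by the ideal speedup (the ratio of processor counts). The speedup figures are the ones quoted in this section; the helper name `efficiency` is an illustrative choice.

```python
def efficiency(speedup, procs, base_procs=1):
    """Observed speedup divided by the ideal speedup procs / base_procs."""
    return speedup / (procs / base_procs)

# (description, reported speedup, processing units, baseline units)
reported = [
    ("Clustering, DC-HYDICE, 42 cores vs 1 core", 14.0, 42, 1),
    ("Rank-2 NMF, synthetic, 40 nodes vs 1 node", 42.0, 40, 1),
    ("Rank-2 NMF, SIIM-ISIC, 80 nodes vs 10 nodes", 7.1, 80, 10),
    ("Clustering, synthetic, 40 nodes vs 1 node", 15.0, 40, 1),
    ("Clustering, SIIM-ISIC, 80 nodes vs 10 nodes", 5.9, 80, 10),
]

for label, speedup, procs, base in reported:
    print(f"{label:48s} efficiency = {efficiency(speedup, procs, base):.2f}")
```

Read this way, the single Rank-2 NMF runs stay close to (or slightly above) ideal efficiency, while the full clustering runs lose efficiency, consistent with the remark that the lower levels of the tree scale less well.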

Level Scaling

To compare execution time across levels of a particular tree, we consider only complete

levels. From Equation (3.3), the dominant computational term (due to MatMul) is

constant per level, the lower order computational term (represented by NNLS) grows
like $O(2^{\ell})$, and the latency cost grows similarly like $O(2^{\ell})$.

Figure 3.9: Time Breakdown for Clustering on Synthetic (plot of relative time for 1 to 40 compute nodes; categories: MatMul, NNLS, Gram, Comp-Sigma, AllGather, ReduceScatter, AllReduce, Comm-Sigma)

Figure 3.10: Time Breakdown for Clustering on SIIM-ISIC (plot of relative time for 10 to 80 compute nodes; same categories as Figure 3.9)

Figure 3.11 shows absolute time across levels for the synthetic data set on 1 node.

The MatMul cost decreases slightly per level, which may be explained by cache effects

in the local matrix multiply, as each node’s subproblem decreases in size. The NNLS

grows exponentially, as expected, and communication is negligible.

Figure 3.12 shows the level breakdown for the synthetic data on 40 nodes, where we

see different behavior. MatMul cost is again constant across levels and the NNLS cost

becomes dominating at lower levels suggesting it does not scale as well as MatMul.

We also see all-reduce time becoming significant as communication time increases,

indicating that the nodes at lower levels are becoming more latency bound. Thus,

we see that the poorer scaling at the lower levels of the tree is the main reason the

overall hierarchical clustering algorithm does not scale as well as the single Rank-2

NMF at the root node.

Rank Scaling

To confirm the slow growth in running time of the hierarchical algorithm in terms

of the number of clusters k, we perform rank scaling experiments for DC-HYDICE

and synthetic data. Assuming a balanced tree and relatively small k, Equation (3.4)

shows that the dominant computational cost is proportional to log k, while a flat

NMF algorithm has a dominant cost that is linear in k [32]. Figure 3.13 shows the

raw time for various values of k, confirming that running time for HierNMF grows

more slowly in k than a flat NMF algorithm (based on Block Principal Pivoting)

from PLANC [20] with the same number of columns and processor grid. We see that

for sufficiently large k, the hierarchical algorithm outperforms flat NMF and it scales

much better with k.

Figure 3.11: Level Times for 1 Compute Node on Synthetic (plot of wall clock time in seconds for tree levels 0 to 5; categories: MatMul, NNLS, Gram, Comp-Sigma, AllGather, ReduceScatter, AllReduce, Comm-Sigma)

Figure 3.12: Level Times for 40 Compute Nodes on Synthetic (plot of wall clock time in seconds for tree levels 0 to 5; same categories as Figure 3.11)

Figure 3.13: Rank Scaling for Hierarchical and Flat NMF (plots of time in seconds versus number of clusters k for Hier NMF and Flat NMF; panels: (a) DC-HYDICE Data, (b) Synthetic Data)

3.6 Conclusion

As shown in the theoretical analysis (Section 3.4.2) and experimental results (Sec-

tion 3.5.3), Algorithm 2 can efficiently scale to large p as long as the execution time

is dominated by local matrix multiplication. The principal barriers to scalability are

the bandwidth cost due to Rank-2 NMF, which is consistent across levels of the tree

and proportional to the number of columns n of the original data set, and the latency

cost due to large numbers of tree nodes in lower levels of the tree. When n is small

relative to m and the number of leaves k and levels $\ell$ are small, then these barriers

do not pose a problem until p is very large. However, if the input matrix is short and

fat (i.e., has many samples with few features), then the bandwidth cost can hinder

performance for smaller p. Likewise, if k is large or the tree is lopsided, then achieving

scalability for very small problems is more difficult. We also note that in the case of

sparse A, it becomes more difficult to hide communication behind the cheaper matrix

multiplications, and other costs may become more dominant.

One approach for reducing the bandwidth cost of Rank-2 NMF is to choose a

more balanced data distribution over a 2D grid, as proposed by Kannan et al. [31].

This will reduce the communicated data and achieve a local data matrix that is more

square, which can improve local matrix multiplication performance. The downside

of this approach is requiring a redistribution of the data for each split, but if many

NMF iterations are required, then the single upfront cost may be amortized.

Another approach to alleviate the rising latency costs of lower levels of the tree

is to parallelize across nodes of the tree. This will result in fewer processors working

on any given node, reducing the synchronization time among them, and it will allow

small, latency-bound problems to be solved simultaneously. Prioritizing the sequence

of node splits is more difficult in this case, but modifying the stopping criterion for

splitting to use a score threshold instead of a target number of leaves will allow truly

independent computation.

In the future, we also plan to compare performance of Algorithm 2 with flat NMF

algorithms and employ the Divide-and-Conquer NMF technique [19] of seeding an

iterative flat NMF algorithm with the feature vectors of the leaf nodes. The parallel

technique proposed here can be combined with the existing PLANC library [20] to

obtain faster overall convergence for very large datasets.

Chapter 4: Tensor Train Rounding using Gram Matrices

The following chapter is a manuscript that has been submitted. For this work, I

contributed mainly to the results section by performing experiments and generating

plots.

4.1 Abstract

Tensor Train (TT) is a low-rank tensor representation consisting of a series of three-

way cores whose dimensions specify the TT ranks. Formal tensor train arithmetic

often causes an artificial increase in the TT ranks. Thus, a key operation for appli-

cations that use the TT format is rounding, which truncates the TT ranks subject

to an approximation error guarantee. Truncation is performed via SVD of a highly

structured matrix, and current rounding methods require careful orthogonalization

to compute an accurate SVD. We propose a new algorithm for TT rounding based

on the Gram SVD algorithm that avoids the expensive orthogonalization phase. Our

algorithm performs less computation and can be parallelized more easily than ex-

isting approaches, at the expense of a slight loss of accuracy. We demonstrate that

our implementation of the rounding algorithm is efficient, scales well, and consistently

outperforms the existing state-of-the-art parallel implementation in our experiments.

4.2 Preliminaries

4.2.1 Tensor Train Notation

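An order-$N$ low rank tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is in the Tensor Train (TT) format if there
exist strictly positive integers $R_0, \ldots, R_N$ with $R_0 = R_N = 1$ and $N$ order-3 tensors
$\mathcal{T}_{X,1}, \ldots, \mathcal{T}_{X,N}$, called TT cores, with $\mathcal{T}_{X,n} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$, such that:

$$\mathcal{X}(i_1, \ldots, i_N) = \mathcal{T}_{X,1}(i_1,:) \cdots \mathcal{T}_{X,n}(:,i_n,:) \cdots \mathcal{T}_{X,N}(:,i_N).$$

Since $R_0 = R_N = 1$, the first and last TT cores are (order-2) matrices, so $\mathcal{T}_{X,1}(i_1,:) \in \mathbb{R}^{R_1}$ and
$\mathcal{T}_{X,N}(:,i_N) \in \mathbb{R}^{R_{N-1}}$, and hence $\mathcal{T}_{X,1}(i_1,:) \cdots \mathcal{T}_{X,n}(:,i_n,:) \cdots \mathcal{T}_{X,N}(:,i_N) \in \mathbb{R}$. We refer to the $R_{n-1} \times R_n$ matrix $\mathcal{T}_{X,n}(:,i_n,:)$ as the $i_n$th slice
of the $n$th TT core of $\mathcal{X}$, where $1 \le i_n \le I_n$.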

Different types of matricization (also known as unfolding) of a tensor are used
to express linear algebra operations on tensors. In this work, we will often use two
particular matricizations of 3D tensors. The horizontal unfolding of TT core $\mathcal{T}_{X,n}$
corresponds to stacking the slices $\mathcal{T}_{X,n}(:,i_n,:)$ for $i_n = 1, \ldots, I_n$ horizontally. The horizontal
unfolding operator is denoted by $\mathcal{H}$; therefore, $\mathcal{H}(\mathcal{T}_{X,n}) \in \mathbb{R}^{R_{n-1} \times R_n I_n}$. The vertical
unfolding corresponds to stacking the slices $\mathcal{T}_{X,n}(:,i_n,:)$ for $i_n = 1, \ldots, I_n$ vertically. The vertical
unfolding operator is denoted by $\mathcal{V}$; therefore, $\mathcal{V}(\mathcal{T}_{X,n}) \in \mathbb{R}^{R_{n-1} I_n \times R_n}$. These two
unfoldings are important for the linearization of tensor entries in memory as they
enable performing matrix operations on the TT core without shuffling or permuting
data.

Another type of unfolding, which we will use to express mathematical relationships
among TT cores, maps the first $n$ modes to rows and the rest to columns [49]. We use
the notation $X_{(1:n)}$ to represent this unfolding, so that $X_{(1:n)} \in \mathbb{R}^{I_1 \cdots I_n \times I_{n+1} \cdots I_N}$. The
$n$th TT rank of $\mathcal{X}$ is the rank of $X_{(1:n)}$.
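The notation above can be made concrete with a small NumPy sketch. It stores each TT core as an array of shape `(R_{n-1}, I_n, R_n)`, evaluates one tensor entry as a product of slices, and forms the horizontal and vertical unfoldings by literally stacking slices. The helper names (`tt_entry`, `tt_full`, `h_unfold`, `v_unfold`) and the chosen dimensions are illustrative assumptions. Note also that NumPy is row-major, so the transpose in `v_unfold` makes a copy; the sketch illustrates the definitions only, not the memory layout used in the implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dims = [4, 5, 6]       # I_1, I_2, I_3 (illustrative)
ranks = [1, 2, 3, 1]   # R_0, ..., R_3 with R_0 = R_3 = 1
cores = [rng.standard_normal((ranks[n], dims[n], ranks[n + 1])) for n in range(3)]

def tt_entry(cores, idx):
    """Evaluate X(i_1, ..., i_N) as the product of the selected core slices."""
    out = np.ones((1, 1))
    for core, i in zip(cores, idx):
        out = out @ core[:, i, :]
    return out[0, 0]

def tt_full(cores):
    """Contract all cores into the explicit tensor of shape (I_1, ..., I_N)."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([full.ndim - 1], [0]))
    return full[0, ..., 0]

def h_unfold(core):
    """Horizontal unfolding: slices side by side, shape (R_{n-1}, I_n * R_n)."""
    r0, i, r1 = core.shape
    return core.reshape(r0, i * r1)

def v_unfold(core):
    """Vertical unfolding: slices on top of each other, shape (I_n * R_{n-1}, R_n)."""
    r0, i, r1 = core.shape
    return core.transpose(1, 0, 2).reshape(i * r0, r1)

X = tt_full(cores)
print(np.isclose(X[1, 2, 3], tt_entry(cores, (1, 2, 3))))
# The n-th TT rank is the rank of the unfolding X_(1:n).
print(np.linalg.matrix_rank(X.reshape(4, 30)), np.linalg.matrix_rank(X.reshape(20, 6)))
```

For generic random cores, as here, the ranks of the unfoldings coincide with the chosen TT ranks; in general they are only bounded above by them.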

4.2.2 Cholesky QR and Gram SVD

Given a tall and skinny matrix $A$, recall that the corresponding Gram matrices are
$AA^T$ and $A^TA$. We are typically interested in $G_A = A^TA$ for efficient algorithms
because it is a smaller matrix.

Cholesky QR is an algorithm that exploits the fact that, for $A$ full rank, the
upper triangular Cholesky factor of $G_A$ is also the upper triangular factor in the QR
decomposition of $A$. That is, for $A = QR$, we have $G_A = R^TQ^TQR = R^TR$. If
$A$ is full rank, then $R$ is invertible and $Q$ can be recovered as $Q = AR^{-1}$ using
a triangular solve. In finite precision, Cholesky QR obtains a small decomposition
error $\|A - QR\|$, but the orthogonality error $\|Q^TQ - I\|$ grows quadratically with the
condition number of $A$. By comparison, Householder QR obtains small orthogonality
error regardless of the conditioning of $A$ [30]. We note there are techniques for
improving the numerical properties of Cholesky QR by using 2 or 3 passes [22, 23].

Likewise, Gram SVD is an algorithm that exploits the connection between the SVD
of a matrix and the eigenvalue decompositions of its Gram matrices. For $A = U\Sigma V^T$,
we have $G_A = V\Sigma U^TU\Sigma V^T = V\Sigma^2V^T$. We see that the eigenvalues of $G_A$ are the
squares of the singular values of $A$ and the eigenvectors of $G_A$ are the right singular
vectors of $A$. We can recover the left singular vectors via $U = AV\Sigma^{-1}$ (assuming full
rank). Like Cholesky QR, Gram SVD computes an accurate decomposition but suffers
from higher orthogonality error of $U$ as well as reduced accuracy of the singular values.
SVD algorithms using orthogonal transformations compute singular values with error
proportional to $\|A\| \cdot \varepsilon$, where $\varepsilon$ is the working precision, while the error for Gram
SVD can be larger by a factor as large as the condition number of $A$ [60]. This implies
that backward stable SVD algorithms can compute singular values in a range of $1/\varepsilon$,
while Gram SVD is limited to computing singular values in a range of $1/\sqrt{\varepsilon}$.
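The numerical behavior described in this section can be observed with a short NumPy sketch. It is a minimal illustration under stated assumptions rather than the implementation used in this work: `cholesky_qr`, `gram_svd`, `make_matrix`, and `orth_error` are hypothetical helper names, the test matrices are synthetic with prescribed singular values, and a general linear solve stands in for the triangular solve.

```python
import numpy as np

def cholesky_qr(A):
    """Cholesky QR: R is the upper triangular Cholesky factor of A^T A, Q = A R^{-1}."""
    R = np.linalg.cholesky(A.T @ A).T
    # General solve of R^T Y = A^T, so Y^T = A R^{-1}; a triangular solve
    # would be used in practice.
    Q = np.linalg.solve(R.T, A.T).T
    return Q, R

def gram_svd(A):
    """Gram SVD: eigendecomposition of A^T A, then U = A V Sigma^{-1} (full rank assumed)."""
    lam, V = np.linalg.eigh(A.T @ A)
    lam, V = lam[::-1], V[:, ::-1]            # sort eigenpairs in descending order
    sigma = np.sqrt(np.maximum(lam, 0.0))
    U = (A @ V) / sigma
    return U, sigma, V

def make_matrix(m, n, cond, rng):
    """Tall m x n matrix with singular values log-spaced between 1 and 1/cond."""
    U0, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
    s = np.logspace(0, -np.log10(cond), n)
    return (U0 * s) @ V0.T, s

def orth_error(Q):
    return np.linalg.norm(Q.T @ Q - np.eye(Q.shape[1]))

rng = np.random.default_rng(0)
for cond in (1e2, 1e4, 1e6):
    A, s = make_matrix(1000, 10, cond, rng)
    Q_chol, _ = cholesky_qr(A)
    Q_house, _ = np.linalg.qr(A)
    U, sigma, V = gram_svd(A)
    rel_sv_err = np.max(np.abs(sigma - s) / s)
    print(f"cond={cond:.0e}  CholQR orth err={orth_error(Q_chol):.1e}  "
          f"Householder orth err={orth_error(Q_house):.1e}  "
          f"GramSVD orth err={orth_error(U):.1e}  sing. value rel err={rel_sv_err:.1e}")
```

The orthogonality errors of the Gram-based methods are expected to grow with the condition number while the Householder-based error stays near the working precision; the exact values printed depend on the random test matrices.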

4.2.3 Cookies Problem and TT-GMRES

As a concrete example of a parametrized PDE for which TT methods work well, we

consider the two-dimensional cookies problem [37, 59] described as follows:

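$$-\mathrm{div}\big(\sigma(x, y; \rho)\nabla u(x, y)\big) = f(x, y) \ \text{ in } \Omega, \qquad u(x, y) = 0 \ \text{ on } \delta\Omega,$$

where $\Omega$ is $(-1, 1) \times (-1, 1)$, $\delta\Omega$ is the boundary of $\Omega$, and $\sigma$ is defined as:

$$\sigma(x, y; \rho) = \begin{cases} 1 + \rho_i & \text{if } (x, y) \in D_i \\ 1 & \text{elsewhere} \end{cases}$$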

where $D_i$ for $i = 1, \ldots, p$ are disjoint disks distributed in $\Omega$ such that their centers
are equidistant and $\rho_i$ is selected from a set of samples $J_i \subset \mathbb{R}$ for $i = 1, \ldots, p$. To
solve this problem, for each combination of values $(\rho_1, \ldots, \rho_p)$, one can solve the linear
system $\left(G_{1,1} + \sum_{i=1}^{p} \rho_i G_{i+1,1}\right)\mathbf{u} = \mathbf{f}$, where $G_{1,1} \in \mathbb{R}^{I_1 \times I_1}$ is the discretization of the
operator $-\mathrm{div}(\nabla(\cdot))$ in $\Omega$, $G_{i+1,1}$ is the discretization of $-\mathrm{div}(\chi_{D_i}\nabla(\cdot))$ in $\Omega$ where
$\chi_S$ is the indicator function of the set $S$, and $\mathbf{f}$ is the discretization of the function $f$.
The number of linear systems to solve in that case is the product of the cardinalities
of the sets $(J_i)_{1 \le i \le p}$. Knowing that the set of solutions can be well approximated by a
low-rank tensor [13, 26], another approach to solve the problem is to use an iterative
method that exploits the low-rank structure and solves one large system including
all combinations of parameters. That is, one solves a $(p+1)$-order problem of the
form $\mathcal{G}\mathcal{U} = \mathcal{F}$. The operator $\mathcal{G}$ is given as $\mathcal{G} = \sum_{i=1}^{p+1} G_{i,1} \otimes \cdots \otimes G_{i,p+1}$, where
$G_{i,i} \in \mathbb{R}^{I_i \times I_i}$ for $i = 2, \ldots, p+1$ is a diagonal matrix containing the samples of $\rho_{i-1}$,
and the remaining matrices $G_{i,j}$ for $i = 1, \ldots, p+1$, $j = 2, \ldots, p+1$ and $j \ne i$ are
the identity matrices of suitable size. The right-hand side is $\mathcal{F} = \mathbf{f} \otimes 1_{I_2} \otimes \cdots \otimes 1_{I_{p+1}}$,
where $1_{I_i}$ is the vector of ones of size $I_i$.
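The equivalence between the one large Kronecker-structured system and the individual parametrized systems can be checked on a tiny example. In the sketch below, random symmetric positive definite matrices stand in for the discretized operators (they are not a finite element discretization of the cookies problem), and the sizes, sample sets, and helper names are illustrative assumptions. The explicit Kronecker products are formed only because the example is tiny.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
I1, p = 6, 2
# Hypothetical sample sets J_1 and J_2 for the parameters rho_1 and rho_2.
samples = [np.array([0.5, 1.0, 2.0]), np.array([0.1, 10.0])]
sizes = [len(s) for s in samples]

def random_spd(n, rng):
    """Random symmetric positive definite stand-in for a discretized operator."""
    B = rng.standard_normal((n, n))
    return B @ B.T + n * np.eye(n)

G_space = [random_spd(I1, rng) for _ in range(p + 1)]   # G_{1,1}, G_{2,1}, G_{3,1}
f = rng.standard_normal(I1)

def kron_all(mats):
    out = np.eye(1)
    for M in mats:
        out = np.kron(out, M)
    return out

# Assemble G = sum_{i=1}^{p+1} G_{i,1} (x) ... (x) G_{i,p+1}.
total = I1 * int(np.prod(sizes))
G = np.zeros((total, total))
for i in range(p + 1):
    factors = [G_space[i]] + [np.eye(sz) for sz in sizes]
    if i >= 1:
        factors[i] = np.diag(samples[i - 1])   # diagonal matrix of parameter samples
    G += kron_all(factors)

# Right-hand side F = f (x) 1 (x) 1, then solve the single large system.
F = kron_all([f.reshape(-1, 1)] + [np.ones((sz, 1)) for sz in sizes]).ravel()
U = np.linalg.solve(G, F).reshape([I1] + sizes)

# Compare against one small solve per parameter combination.
n_systems = 0
max_diff = 0.0
for a, b in product(range(sizes[0]), range(sizes[1])):
    A_ab = G_space[0] + samples[0][a] * G_space[1] + samples[1][b] * G_space[2]
    u_ab = np.linalg.solve(A_ab, f)
    max_diff = max(max_diff, np.max(np.abs(U[:, a, b] - u_ab)))
    n_systems += 1
print(n_systems, max_diff)
```

The number of small systems equals the product of the cardinalities of the sample sets, and each fiber `U[:, a, b]` of the large solution agrees with the corresponding small solve up to rounding.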

In this application and many others, the operator G has an operator rank that is

low and the right-hand side F is given in a low-rank form [3,7,9,37,63,64]. One way to

approximate the solution by a low-rank tensor is to apply a Krylov method adapted

to low rank tensors such as TT-GMRES [16]. In each iteration, the operator G is

applied to a low rank tensor leading to a formal expansion of the ranks. Furthermore,

one needs to orthonormalize the new basis tensor against previous ones by using a

Gram–Schmidt procedure, see algorithm 6. Again, the ranks will increase formally.

In order to keep memory and computations tractable, one has to round the resulting

tensors after performing these two steps. Most of the time, a small reduction in

the final relative residual norm is sufficient, which allows performing aggressive TT

rounding with loose tolerances.

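Algorithm 6 TT-GMRES [16]

function $\mathcal{U}$ = TT-GMRES($\mathcal{G}$, $\mathcal{F}$, $m$, $\varepsilon$)
    Set $\beta = \|\mathcal{F}\|_F$, $\mathcal{V}_1 = \mathcal{F}/\beta$, $r = \beta$
    for $j = 1 : m$ do
        Set $\delta = \varepsilon\beta / r$
        $\mathcal{W}$ = TT-Round($\mathcal{G}\mathcal{V}_j$, $\delta$)
        for $i = 1 : j$ do
            $H(i, j)$ = InnerProd($\mathcal{W}$, $\mathcal{V}_i$)
        end for
        $\mathcal{W}$ = TT-Round($\mathcal{W} - \sum_{i=1}^{j} H(i, j)\mathcal{V}_i$, $\delta$)
        $H(j+1, j) = \|\mathcal{W}\|_F$
        $r = \min_y \|H(1 : j+1, 1 : j)\,y - \beta e_1\|_2$
        $\mathcal{V}_{j+1} = \mathcal{W} / H(j+1, j)$
    end for
    $y_m = \arg\min_y \|Hy - \beta e_1\|_2$
    $\mathcal{U} = \sum_{j=1}^{m} y_m(j)\mathcal{V}_j$
end function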

4.2.4 TT-Rounding via Orthogonalization

The standard algorithm for TT-rounding [47] is given in algorithm 7. This procedure

is composed of two phases, an orthogonalization phase and a truncation phase. The

orthogonalization phase consists of a sequence of QR decompositions of the vertical

unfolding of each core starting from the leftmost to orthonormalize its columns and

then a multiplication of the triangular factor by the following core. The truncation

phase consists of a sequence of truncated SVDs of the horizontal unfolding of each

core starting from the rightmost, leaving its rows orthonormal (set as the leading right

singular vectors), and multiplying the preceding core by the singular values and the

leading left singular vectors. The direction of these two phases can be reversed.

Given a required accuracy, the TT-Rounding procedure provides a quasi-optimal

approximation with given TT ranks [47].

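Algorithm 7 TT-Rounding via Orthogonalization [1, 47]

function $\mathcal{Y}$ = TT-Round-QR($\mathcal{X}$, $\varepsilon$)
    Set $\mathcal{T}_{Y,1} = \mathcal{T}_{X,1}$
    for $n = 1$ to $N-1$ do
        $[\mathcal{V}(\mathcal{T}_{Y,n}), R]$ = QR($\mathcal{V}(\mathcal{T}_{Y,n})$)
        $\mathcal{H}(\mathcal{T}_{Y,n+1}) = R\,\mathcal{H}(\mathcal{T}_{X,n+1})$
    end for
    Compute $\|\mathcal{X}\| = \|\mathcal{T}_{Y,N}\|_F$ and $\varepsilon_0 = \frac{\|\mathcal{X}\|_F}{\sqrt{N-1}}\,\varepsilon$
    for $n = N$ down to $2$ do
        $[Q, R]$ = QR($\mathcal{H}(\mathcal{T}_{Y,n})^T$)
        $[\hat{U}, \hat{\Sigma}, \hat{V}]$ = tSVD($R$, $\varepsilon_0$)
        $\mathcal{H}(\mathcal{T}_{Y,n})^T = Q\hat{U}$
        $\mathcal{V}(\mathcal{T}_{Y,n-1}) = \mathcal{V}(\mathcal{T}_{Y,n-1})\,\hat{V}\hat{\Sigma}$
    end for
end function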
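The two phases of algorithm 7 can be sketched sequentially in NumPy as follows. This is a minimal illustration under stated assumptions, not the parallel implementation discussed in this chapter: cores are stored as arrays of shape `(R_{n-1}, I_n, R_n)`, plain reshapes (which group the same modes as the vertical and horizontal unfoldings, up to a permutation of rows or columns) stand in for the unfoldings, and the helpers `tt_full`, `tt_add`, and `trunc_rank` are hypothetical names introduced for the example. Formal TT addition is used to create a tensor whose ranks are artificially doubled.

```python
import numpy as np

def tt_full(cores):
    """Contract TT cores of shape (R_{n-1}, I_n, R_n) into the explicit tensor."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([full.ndim - 1], [0]))
    return full[0, ..., 0]

def tt_add(a, b):
    """Formal TT addition; the TT ranks of the result are the sums of the input ranks."""
    out, last = [], len(a) - 1
    for n, (x, y) in enumerate(zip(a, b)):
        if n == 0:
            out.append(np.concatenate([x, y], axis=2))
        elif n == last:
            out.append(np.concatenate([x, y], axis=0))
        else:
            z = np.zeros((x.shape[0] + y.shape[0], x.shape[1], x.shape[2] + y.shape[2]))
            z[:x.shape[0], :, :x.shape[2]] = x
            z[x.shape[0]:, :, x.shape[2]:] = y
            out.append(z)
    return out

def trunc_rank(s, tol):
    """Smallest L such that the discarded singular values satisfy ||s[L:]||_2 <= tol."""
    tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]   # tail[j] = ||s[j:]||_2
    above = np.nonzero(tail > tol)[0]
    return int(above[-1]) + 1 if above.size else 1

def tt_round_qr(cores, eps):
    cores = [c.copy() for c in cores]
    N = len(cores)
    # Orthogonalization phase, left to right.
    for n in range(N - 1):
        r0, i, r1 = cores[n].shape
        Q, R = np.linalg.qr(cores[n].reshape(r0 * i, r1))
        cores[n] = Q.reshape(r0, i, -1)
        cores[n + 1] = np.tensordot(R, cores[n + 1], axes=([1], [0]))
    eps0 = eps * np.linalg.norm(cores[-1]) / np.sqrt(N - 1)
    # Truncation phase, right to left.
    for n in range(N - 1, 0, -1):
        r0, i, r1 = cores[n].shape
        Q, R = np.linalg.qr(cores[n].reshape(r0, i * r1).T)
        U, s, Vt = np.linalg.svd(R, full_matrices=False)
        L = trunc_rank(s, eps0)
        cores[n] = (Q @ U[:, :L]).T.reshape(L, i, r1)
        cores[n - 1] = np.tensordot(cores[n - 1], Vt[:L].T * s[:L], axes=([2], [0]))
    return cores

rng = np.random.default_rng(0)
dims, ranks = [4, 5, 6], [1, 2, 2, 1]
X = [rng.standard_normal((ranks[n], dims[n], ranks[n + 1])) for n in range(3)]
S = tt_add(X, X)                  # represents 2X with formal ranks (4, 4)
Y = tt_round_qr(S, 1e-10)
print([c.shape for c in S], [c.shape for c in Y])
```

In this example the formal sum has TT ranks 4, and rounding recovers the ranks of the original tensor while representing the same data up to the requested tolerance.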

4.2.5 Previous Work on Parallel TT-Rounding

Algorithm 7 has been parallelized by Al Daas et al. [1], who use a 1-D distribution

of TT cores to partition a TT tensor across processors. Each core is distributed over

all processors along the physical mode such that each processor owns $I_k/P$ slices of
the $k$th core. This distribution guarantees load balance and allows TT arithmetic
to be performed efficiently. In particular, the QR decompositions are performed via the

Tall-Skinny QR algorithm [14], and multiplications involving TT cores are parallelized

following the 1D distributions. We improve upon this prior work by using an alternate

TT-rounding approach that avoids QR decompositions, reducing arithmetic by a

constant factor and also reducing communication.

4.3 Introduction

Low-rank representations of tensors help to make algorithms addressing large-scale

multidimensional problems computationally feasible. While the size of explicit rep-

resentations of these tensors grows very quickly (an instance of the “curse of dimen-

sionality”), low-rank representations can often approximate explicit forms to sufficient

accuracy while requiring orders of magnitude less space and computational time. For

example, suppose a parametrized PDE depends on 10 parameters, where each parameter
has 10 possible values. Computing the solution for each of the $10^{10}$ configurations
becomes infeasible even for modest discretizations of the state space, but if the solution
depends smoothly on the parameters, then the qualitative behavior of the
solution over the entire configuration space can be captured using far fewer than $10^{10}$
parameters [13,26,37].

As we describe in detail in section 4.2.1, the Tensor Train (TT) format [47] is a

low-rank representation with a number of parameters that is linear in the sum of the

tensor dimensions, as compared to an explicit representation whose size is the prod-

uct of the tensor dimensions. The TT format consists of a series of 3-way tensors,

or TT cores, with one dimension corresponding to an original tensor dimension and

two dimensions corresponding to much smaller TT ranks. TT approximations can be

computed from explicit tensors as a means of compression for scientific computing and

machine learning applications [27,47,50,66], but they are also often used to represent

tensors that cannot be formed explicitly at all. In the context of parametrized PDEs,

the TT format has been used to represent both the discretized operators as well as

the solution, residual, and other related vectors [7–9, 16]. In this case, TT tensors

are manipulated using operations such as additions, dot products, and elementwise

multiplications, which causes the TT ranks to grow in size. The key operation that

prevents uncontrolled growth in TT ranks is known as TT rounding, in which a TT

tensor is approximated by another TT tensor with minimal ranks subject to a spec-

ified approximation error. This operation requires a sequence of highly structured

matrix singular value decomposition (SVD) problems, and is typically a computa-

tional bottleneck.
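The difference between explicit and TT storage can be illustrated with a short count of parameters. The tensor below has 10 modes of size 10, as in the example above; the uniform TT rank of 20 is a hypothetical value chosen only for illustration, and the helper names are our own.

```python
from math import prod

def explicit_size(dims):
    """Number of entries of the explicit tensor: the product of the dimensions."""
    return prod(dims)

def tt_size(dims, ranks):
    """Number of entries of all TT cores: sum over n of R_{n-1} * I_n * R_n."""
    return sum(ranks[n] * dims[n] * ranks[n + 1] for n in range(len(dims)))

dims = [10] * 10                 # 10 parameters with 10 values each
ranks = [1] + [20] * 9 + [1]     # hypothetical TT ranks, R_0 = R_N = 1

print(explicit_size(dims))       # explicit representation
print(tt_size(dims, ranks))      # TT representation
```

For fixed ranks the TT count grows linearly with the sum of the dimensions, whereas the explicit count grows with their product.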

There exists a wide array of high-performance, parallel implementations of tensor

computations for computing decompositions such as CP and Tucker of dense and

sparse tensors [5, 10, 12, 20, 33, 53], as well as for performing contractions of dense,

sparse, and structured tensors [2, 54, 56]. However, the available software for com-

puting, manipulating, and rounding TT tensors is largely limited to productivity

languages such as MATLAB and Python [44, 61]. Aside from the work of Al Daas

et al. [1], which we describe in section 4.2.5 and compare against in section 4.6, we

are not aware of other HPC implementations of TT-based algorithms. One of the

aims of this paper is to raise the bar for parallel performance for TT rounding and

demonstrate that TT-based approaches can scale to scientific problems with more

and higher dimensions using efficient parallelization.

The TT rounding algorithm utilizes multiple truncated SVDs. The central con-

tribution of this paper is the development of a parallel algorithm that performs these

truncated SVDs more efficiently than the existing approach, by reducing both com-

putational and communication costs. The basic tool of the algorithm is the Gram

SVD algorithm, which exploits the connection between the SVD of a matrix A and

the eigenvalue decomposition of its Gram matrix $A^TA$. The truncated SVD must
be performed on a highly structured matrix which is analogous to a matrix represented
as $X = AB^T$, where $A$ and $B$ are tall-skinny matrices. We present our approach in

full detail for this matrix analogue in section 4.4, including empirical results for the

numerical properties, and then show how it can be applied within the TT rounding

algorithm in section 4.5. The key to efficiency in the context of TT rounding is the

computation of Gram matrices of matrices with overlapping TT structure.

We present performance results in section 4.6, demonstrating the efficiency of our

algorithm compared to the existing state of the art. In a MATLAB-based experi-

ment, we show that improvement of a TT-rounding implementation leads to overall

performance improvement for a TT-based linear solver. Then we demonstrate that

our C/MPI implementation is both weakly and strongly scalable on TT tensors with

representative dimensions and ranks. In particular, we achieve up to Y× parallel

speedup when scaling to 64 nodes of a distributed-memory platform for a Z-way

tensor with dimensions of size W and TT ranks of size Q. We also achieve up to

an 8× speedup over a state-of-the-art implementation of the standard TT-rounding

approach. Our results demonstrate that TT rounding is highly scalable using our

algorithm, and we target parallelization of TT-based solvers based on our approach

as future work.

4.4 Truncation of Matrix Product

To gain intuition for the use of Gram SVD within TT-Rounding, we focus in this
section on the (degenerate) case of TT with 2 modes, with dimensions $I \times J$. In this
case, the tensor is a matrix represented by a low-rank product of matrices:

$$X = AB^T, \tag{4.1}$$

where $A$ and $B$ are tall and skinny matrices with $R$ columns. The goal is to approximate
$X$ with a lower rank representation

$$X \approx \hat{A}\hat{B}^T, \tag{4.2}$$

where $\hat{A}$ and $\hat{B}$ have $L < R$ columns.

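Algorithm 8 Rounding Matrix Product $AB^T$ using QR

function $[\hat{A}, \hat{B}]$ = Mat-Rounding-QR($A$, $B$, $\varepsilon$)
    $[Q_A, R_A]$ = QR($A$)
    $[Q_B, R_B]$ = QR($B$)
    $[\hat{U}, \hat{\Sigma}, \hat{V}]$ = tSVD($R_AR_B^T$, $\varepsilon$)
    $\hat{A} = Q_A(\hat{U}\hat{\Sigma}^{1/2})$
    $\hat{B} = Q_B(\hat{V}\hat{\Sigma}^{1/2})$
end function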

4.4.1 Truncation via Orthogonalization

A numerically accurate and reasonably efficient approach to truncate the representation
of $X$ is via orthogonalization. By computing (compact) QR decompositions
$A = Q_AR_A$ and $B = Q_BR_B$, we have

$$X = Q_AR_AR_B^TQ_B^T \tag{4.3}$$

and the SVD of $R_AR_B^T$ yields the (compact) SVD of $X$ because $Q_A$ and $Q_B$ have
orthonormal columns. Note that $R_AR_B^T$ is $R \times R$, so its SVD is much cheaper to
compute.

We formalize this approach in algorithm 8. In order to truncate the rank of $X$,
we can truncate the SVD of $R_AR_B^T$. To obtain factors $\hat{A}$ and $\hat{B}$, we apply $Q_A$ and
$Q_B$ to the left and right singular vectors, respectively. The singular values can be
distributed arbitrarily; we choose to distribute them evenly to left and right factors.
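The steps of algorithm 8 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: `np.linalg.qr` forms the orthogonal factors explicitly rather than keeping them in implicit Householder form, the tolerance is interpreted as a bound on the discarded singular values relative to the norm of all singular values (the text does not fix this convention), and `trunc_rank` is a hypothetical helper.

```python
import numpy as np

def trunc_rank(s, tol):
    """Smallest L such that the discarded singular values satisfy ||s[L:]||_2 <= tol."""
    tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]   # tail[j] = ||s[j:]||_2
    above = np.nonzero(tail > tol)[0]
    return int(above[-1]) + 1 if above.size else 1

def mat_rounding_qr(A, B, eps):
    """Round X = A B^T to lower rank via QR of both factors and a small SVD."""
    QA, RA = np.linalg.qr(A)
    QB, RB = np.linalg.qr(B)
    U, s, Vt = np.linalg.svd(RA @ RB.T)             # R x R problem
    L = trunc_rank(s, eps * np.linalg.norm(s))      # relative tolerance (assumption)
    root = np.sqrt(s[:L])                           # split singular values evenly
    return QA @ (U[:, :L] * root), QB @ (Vt[:L].T * root)

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 8))   # 8 columns, rank 3
B = rng.standard_normal((150, 8))
X = A @ B.T
A_hat, B_hat = mat_rounding_qr(A, B, 1e-10)
print(A_hat.shape, B_hat.shape,
      np.linalg.norm(X - A_hat @ B_hat.T) / np.linalg.norm(X))
```

Here the product has exact rank 3 although both factors carry 8 columns, so the rounded factors are expected to have 3 columns each.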

4.4.2 Truncation via Gram SVD

We now show our proposed method for a faster but potentially less accurate rounding
algorithm for the matrix product. Our method is based on the Gram SVD algorithm,
but we note it is not a straightforward application. For example, we can represent
$XX^T$ as $AB^TBA^T$, and while $B^TB$ is $R \times R$, we cannot obtain the eigenvalue decomposition
easily without orthogonalizing $A$. Instead, we consider the Gram matrices
of $A$ and $B$ separately, letting $G_A = A^TA$ and $G_B = B^TB$. For clarity, we first
describe the method using Cholesky QR, then discuss pivoting within Cholesky, and
finally explain the use of Gram SVD. We compare numerical results for the matrix
product case in section 4.4.4.

Cholesky QR

Let us first assume $A$ and $B$ are full rank, and use Cholesky QR to orthonormalize
the columns of $A$ and $B$. Computing Cholesky decompositions, we have $R_A^TR_A = G_A$
and $R_B^TR_B = G_B$. Then eq. (4.3) becomes

$$X = (AR_A^{-1})R_AR_B^T(BR_B^{-1})^T.$$

Given the truncated SVD $\hat{U}\hat{\Sigma}\hat{V}^T$ of $R_AR_B^T$, we can compute

$$\hat{A} = A(R_A^{-1}\hat{U}\hat{\Sigma}^{1/2}) \quad \text{and} \quad \hat{B} = B(R_B^{-1}\hat{V}\hat{\Sigma}^{1/2})$$

to obtain eq. (4.2).

Pivoted Cholesky QR

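Now suppose that $A$ and $B$ are low rank with ranks $L_A$ and $L_B$. While the standard
Cholesky algorithm will fail in this case, we can employ pivoted Cholesky to obtain
$R_A^TR_A = P_A^TG_AP_A$ and $R_B^TR_B = P_B^TG_BP_B$, where $P_A$ and $P_B$ are permutation
matrices and $R_A$ and $R_B$ can be written

$$R_A = \begin{bmatrix} \bar{R}_A & \tilde{R}_A \\ 0 & 0 \end{bmatrix} \quad \text{and} \quad R_B = \begin{bmatrix} \bar{R}_B & \tilde{R}_B \\ 0 & 0 \end{bmatrix},$$

with $\bar{R}_A^{-1}$ and $\bar{R}_B^{-1}$ having dimensions $L_A \times L_A$ and $L_B \times L_B$, respectively. Then
eq. (4.3) becomes

$$X = Q_A \underbrace{\begin{bmatrix} \bar{R}_A & \tilde{R}_A \end{bmatrix} P_A^T P_B \begin{bmatrix} \bar{R}_B^T \\ \tilde{R}_B^T \end{bmatrix}}_{M} Q_B^T = Q_AMQ_B^T,$$

where

$$Q_A = AP_A \begin{bmatrix} \bar{R}_A^{-1} \\ 0 \end{bmatrix} \quad \text{and} \quad Q_B = BP_B \begin{bmatrix} \bar{R}_B^{-1} \\ 0 \end{bmatrix}.$$

Given the truncated SVD $\hat{U}\hat{\Sigma}\hat{V}^T$ of $M$, we compute

$$\hat{A} = A\left(P_A \begin{bmatrix} \bar{R}_A^{-1} \\ 0 \end{bmatrix} \hat{U}\hat{\Sigma}^{1/2}\right) \quad \text{and} \quad \hat{B} = B\left(P_B \begin{bmatrix} \bar{R}_B^{-1} \\ 0 \end{bmatrix} \hat{V}\hat{\Sigma}^{1/2}\right)$$

to obtain eq. (4.2).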

Gram SVD

Pivoted Cholesky QR works well for the low rank case in exact arithmetic, but in the

case of numerically low rank matrices, it provides a sharp truncation for each of A and

B individually. We now consider using the Gram SVD approach, which we will see in

section 4.4.4 is more robust than pivoted Cholesky QR. Here, we consider A and B to

be possibly low rank. Given the SVDs $A = U_A \Sigma_A V_A^T$ and $B = U_B \Sigma_B V_B^T$, we have
eigenvalue decompositions $G_A = V_A \Sigma_A^2 V_A^T$ and $G_B = V_B \Sigma_B^2 V_B^T$, where $\Sigma_A$ and $\Sigma_B$
represent the nonzero singular values and $V_A$ and $V_B$ are the corresponding right
singular vectors. We can then write the corresponding left singular vectors via
$U_A = A V_A \Sigma_A^{-1}$ and $U_B = B V_B \Sigma_B^{-1}$. With these quantities, eq. (4.1) becomes

$$X = (\underbrace{A V_A \Sigma_A^{-1}}_{U_A})\; \underbrace{\Sigma_A V_A^T V_B \Sigma_B}_{M}\; (\underbrace{B V_B \Sigma_B^{-1}}_{U_B})^T = U_A M U_B^T.$$

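Algorithm 9 Truncated SVD of $AB^T$ using Gram SVDs
 1: function $[\hat{A}, \hat{B}]$ = tSVD-ABt-Gram($A$, $B$, $\varepsilon$)
 2:     $G_A = A^T A$
 3:     $G_B = B^T B$
 4:     $[V_A, \Lambda_A]$ = Eig($G_A$)
 5:     $[V_B, \Lambda_B]$ = Eig($G_B$)
 6:     $[\hat{U}, \hat{\Sigma}, \hat{V}]$ = tSVD($\Lambda_A^{1/2} V_A^T V_B \Lambda_B^{1/2}$, $\varepsilon$)
 7:     $\hat{A} = A\,(V_A \Lambda_A^{-1/2} \hat{U} \hat{\Sigma}^{1/2})$
 8:     $\hat{B} = B\,(V_B \Lambda_B^{-1/2} \hat{V} \hat{\Sigma}^{1/2})$
 9: end function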


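Given the truncated SVD $\hat{U}\hat{\Sigma}\hat{V}^T = M$, we compute

$$\hat{A} = A\,(V_A \Sigma_A^{-1} \hat{U}\hat{\Sigma}^{1/2}) \quad\text{and}\quad \hat{B} = B\,(V_B \Sigma_B^{-1} \hat{V}\hat{\Sigma}^{1/2})$$

to obtain eq. (4.2). The algorithm for the Gram SVD approach is given as algorithm 9,
which can be adapted to pivoted Cholesky QR following the algebra of section 4.4.2.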
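The steps of algorithm 9 can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the thesis implementation: the function name `tsvd_abt_gram`, the relative eigenvalue cutoff `rel_tol` (used to discard non-positive or negligible eigenvalues before inverting their square roots), and the reading of the tolerance as a relative Frobenius-norm error are all choices made here.

```python
import numpy as np

def tsvd_abt_gram(A, B, eps, rel_tol=1e-14):
    """Sketch of algorithm 9: return A_hat, B_hat with A_hat @ B_hat.T close to A @ B.T.

    eps is treated as a relative Frobenius-norm tolerance (an assumption of this sketch).
    """
    lamA, VA = np.linalg.eigh(A.T @ A)             # lines 2 and 4: Gram matrix and EVD
    lamB, VB = np.linalg.eigh(B.T @ B)             # lines 3 and 5
    keepA = lamA > rel_tol * lamA.max()            # assumed cutoff for tiny eigenvalues
    keepB = lamB > rel_tol * lamB.max()
    sA, VA = np.sqrt(lamA[keepA]), VA[:, keepA]    # Lambda_A^{1/2} and matching vectors
    sB, VB = np.sqrt(lamB[keepB]), VB[:, keepB]
    M = (sA[:, None] * VA.T) @ (VB * sB)           # Lambda_A^{1/2} V_A^T V_B Lambda_B^{1/2}
    U, s, Vt = np.linalg.svd(M, full_matrices=False)   # line 6 (before truncation)
    tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]  # tail[k] = norm of s[k:]
    L = max(1, int(np.sum(tail > eps * tail[0])))  # smallest rank meeting the tolerance
    root = np.sqrt(s[:L])
    A_hat = A @ ((VA / sA) @ (U[:, :L] * root))    # line 7
    B_hat = B @ ((VB / sB) @ (Vt[:L].T * root))    # line 8
    return A_hat, B_hat

rng = np.random.default_rng(1)
decay = 0.5 ** np.arange(10)                       # column scaling gives decaying spectrum
A = rng.standard_normal((200, 10)) * decay
B = rng.standard_normal((150, 10)) * decay
X = A @ B.T
A_hat, B_hat = tsvd_abt_gram(A, B, 1e-3)
rel_err = np.linalg.norm(X - A_hat @ B_hat.T) / np.linalg.norm(X)
```

Only the two Gram products and the two final updates touch the tall matrices; every other operation involves matrices no larger than R × R.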

4.4.3 Complexity Analysis

We now consider the computational complexity of the truncation methods, where
we assume $A$ is $I \times R$, $B$ is $J \times R$, $\hat{A}$ is $I \times L$, and $\hat{B}$ is $J \times L$. Truncation via
orthogonalization is specified in algorithm 8. The QR decompositions in lines 2 and 3
require $2(I+J)R^2$ flops, where we assume that the orthogonal factors $Q_A$ and $Q_B$ are
maintained in implicit (e.g., Householder) form. The multiplication and truncated
SVD of line 4 cost $O(R^3)$. Applying the implicit orthogonal factors to $R \times L$ matrices
to compute $\hat{A}$ and $\hat{B}$ requires $4(I+J)RL$ flops, for a total cost bounded by

$$2(I+J)R^2 + 4(I+J)RL + O(R^3). \tag{4.4}$$

In the case of the Gram SVD approach, we unify the analysis for Cholesky QR

and Gram SVD. Algorithm 9 gives the explicit steps assuming Gram SVD is used.

The cost of lines 2 and 3 together is $(I+J)R^2$ operations, which are performed for
either method. The eigendecompositions of lines 4 and 5 cost $O(R^3)$. They are
approximately 10 times more expensive than performing Cholesky decomposition of
the Gram matrices, but we note that $O(R^3)$ is a lower order term compared to the cost
of computing the Gram matrices. The matrix multiplications and truncated SVD of
line 6 are also $O(R^3)$, possibly less if A and B are low rank, and are similar across the two
methods. Finally, lines 7 and 8 first involve computations of small matrices (of size
$R \times L$ or smaller) followed by a single multiplication with the large A or B matrices,
which together cost $2(I+J)RL$. Overall, the computational cost of the Gram SVD
method is bounded by

$$(I+J)R^2 + 2(I+J)RL + O(R^3), \tag{4.5}$$

which is about half the cost of the orthogonalization approach, given in

eq. (4.4). Furthermore, the dominant costs of eq. (4.5) come from (symmetric) ma-

trix multiplication rather than computation of/with implicit orthogonal factors, so

we expect higher efficiency for the Gram SVD approach in addition to the reduced

arithmetic.
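As a rough worked instance of eqs. (4.4) and (4.5), take the matrix sizes used in section 4.4.4, I = J = 1000 and R = 50, together with a truncation rank L = 25 that is chosen here purely for illustration, and keep only the leading terms (the O(R^3) contributions are dropped):

```latex
\begin{align*}
\text{orthogonalization (4.4):}\quad
  & 2(I+J)R^2 + 4(I+J)RL
    = 2\cdot 2000\cdot 2500 + 4\cdot 2000\cdot 50\cdot 25
    = 1.0\times 10^{7} + 1.0\times 10^{7} = 2.0\times 10^{7},\\
\text{Gram SVD (4.5):}\quad
  & (I+J)R^2 + 2(I+J)RL
    = 2000\cdot 2500 + 2\cdot 2000\cdot 50\cdot 25
    = 5.0\times 10^{6} + 5.0\times 10^{6} = 1.0\times 10^{7}.
\end{align*}
```

For these sizes the leading terms differ by exactly a factor of two, independent of the assumed value of L, since each term of eq. (4.5) is half the corresponding term of eq. (4.4).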

4.4.4 Numerical Examples

In this section, we will demonstrate the empirical error of computing a truncated SVD

of X = ABT using Gram matrices and compare it to the more accurate orthogonal-

ization approach. We consider 3 synthetic input matrices with differing condition-

ing properties to illustrate the differences among the three methods (including both

Cholesky QR and Gram SVD approaches).

In each case, we construct input matrices A and B each to be 1000 × 50 and

to have geometrically distributed singular values with random left and right singular

vectors. We use double precision in these experiments. In the first case, we construct

both A and B to have condition numbers of $10^6$: $\kappa(A) = \kappa(B) = 10^6$. That is,
the largest singular value of each matrix is $10^6$, the smallest is $10^0$, and the rest are
geometrically distributed within that range. The condition number of X in this case
is bounded above by $10^{12}$. The second synthetic case has input matrices that are
more ill-conditioned: $\kappa(A) = \kappa(B) = 10^{12}$. The third case has input matrices that
are imbalanced, with $\kappa(A) = 10^{12}$ and $\kappa(B) = 10^0$.
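One way such test matrices could be generated is sketched below. The construction (random orthonormal factors obtained from QR decompositions of Gaussian matrices, and the helper name `synthetic_matrix`) is an assumption of this example; the thesis does not specify how its random singular vectors were drawn.

```python
import numpy as np

def synthetic_matrix(m, n, kappa, rng):
    """m x n matrix with geometrically spaced singular values from kappa down to 1.

    Hypothetical generator for illustration only.
    """
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))   # random orthonormal columns
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal matrix
    sigma = np.geomspace(kappa, 1.0, n)                # kappa, ..., 1 (geometric spacing)
    return (U * sigma) @ V.T                           # U diag(sigma) V^T

rng = np.random.default_rng(0)
A = synthetic_matrix(1000, 50, 1e6, rng)               # first test case: kappa(A) = 1e6
cond_A = np.linalg.cond(A)                             # ratio of extreme singular values
```

Changing the `kappa` argument to `1e12` would give the more ill-conditioned inputs of the second case.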

Figure 4.1 reports the results from truncation via QR (algorithm 8), Gram SVD

(algorithm 9), and Cholesky QR (variant of algorithm 9 described in section 4.4.2).

Each column of the figure corresponds to a different pair of inputs. The top row

plots the computed relative singular values (normalized by σ1 so that the first index

is equal to 1), the middle row reports the approximation error after truncation for

various tolerances, and the bottom row reports the computed truncation ranks.

In the left column, we see an example of a typical use case of the algorithm: all

algorithms perform equivalently and the approximation error matches the specified

tolerance. Note that when the tolerance is smaller than the smallest singular value,

no truncation is performed. If both input matrices have condition number smaller

than the inverse of the square root of machine precision, then we expect no distinction

among algorithms. In this case, the conditioning of the Gram matrices is such that

the eigenvalues can be computed accurately and Cholesky decomposition will not fail.

In the middle column, we see an example of input matrices whose condition numbers
are larger than $10^8$. In this case, the Gram matrices are numerically low rank,
causing truncation of the Cholesky decomposition and a loss of accuracy of the smallest
eigenvalues. This causes a sharp truncation of the rank in the case of Cholesky
and an overestimate of the singular values of X in the case of Gram SVD. For tolerances
smaller than $10^{-8}$, we see that the approximation error of Cholesky QR does
not drop below the square root of machine precision. The Gram SVD approach's rank
selection deviates slightly from that of QR, but only for very small tolerances near
$10^{-14}$. We note that for tolerances larger than $10^{-8}$, we see no deviation in behavior
across all three algorithms.

[Figure 4.1 plots omitted. Panel columns: $\kappa(A)=10^6$, $\kappa(B)=10^6$; $\kappa(A)=10^{12}$, $\kappa(B)=10^{12}$; $\kappa(A)=10^{12}$, $\kappa(B)=10^1$. Rows: computed singular values versus index; error versus tolerance; ranks versus tolerance. Legend: QR, Gram SVD, CholQR.]

Figure 4.1: Numerical results for truncation of matrix product $X = AB^T$. Columns correspond to input matrices with different conditioning properties (details given in section 4.4.4). Top row specifies computed relative singular values before truncation, middle row reports relative approximation error after truncation, and bottom row specifies the truncation rank used for various requested tolerances.

In the right column, we consider input with one matrix close to low rank but
the other well conditioned. Again, for tolerances larger than $10^{-8}$, all algorithms
perform well. For tighter tolerances, however, we see that the inaccuracy of small
eigenvalues of the Gram matrix of A causes deviation in truncation rank selection
and approximation error. As in the second case, the Cholesky QR approach does not
attain error below $10^{-8}$ because of the sharp truncation performed by the pivoted
algorithm. The Gram SVD approach computes approximation errors that match
the tolerance closely below $10^{-8}$, but as the tolerance tightens, the method begins
overestimating the truncation rank and eventually stops truncating at all. In this
way, the approximation error satisfies the tolerance, but the rank is not truncated as
much as possible.

Based on these results, we conclude that for tolerances greater than the square

root of machine precision, truncation using Gram matrices is sufficiently accurate.

While small singular values of A and B are not computed as accurately via the Gram
SVD approach, they are not necessary for computing low rank approximations with
large approximation error. We note that the relationship between the SVDs of A,
B, and X has an effect on the overall accuracy. Even if a less accurate method

is used for the SVD of A and B, these results show that the Gram SVD approach

can compute singular values of X that are smaller than the square root of machine

precision. Despite the fact that the cheapest approach using pivoted Cholesky QR

is sufficiently accurate for large tolerances, we use the Gram SVD approach in the

context of TT rounding because it is more robust for smaller tolerances and because

the extra computation has little effect on the overall run time.

[Figure 4.2 diagrams omitted. Panels: (a) Equation (4.6) for N = 6, n = 3; (b) $\mathcal{A}_{(1:n)}^T \mathcal{A}_{(1:n)}$ for n = 3; (c) Computing $G^L_n$ from $G^L_{n-1}$ for n = 3.]

Figure 4.2: Tensor network diagrams

4.5 TT-Rounding via Gram SVD

We now apply the approach described in section 4.4 for $X = AB^T$ to the case of
TT rounding. In section 4.5.1, we explain the analogues of matrices A and B within
the TT rounding algorithm, and in section 4.5.2 we show how to compute the Gram
matrices for the associated structured matrices. We then present two algorithmic variants
of TT rounding based on the approach in section 4.5.3 and provide complexity
analysis in section 4.5.5 with comparison against the standard TT rounding via orthogonalization.

4.5.1 TT Rounding Structure

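The nth TT rank of a tensor $\mathcal{X}$ is the rank of the unfolding $\mathcal{X}_{(1:n)}$, which is an
$I_1 \cdots I_n \times I_{n+1} \cdots I_N$ matrix where each column is a vectorization of an n-mode subtensor. If
$\mathcal{X}$ is already in TT format, then $\mathcal{X}_{(1:n)}$ has the following structure [1, Eq. (2.3)]:

$$\mathcal{X}_{(1:n)} = (I_{I_n} \otimes \mathcal{Q}_{(1:n-1)})\, V(\mathcal{T}_{X,n})\, H(\mathcal{T}_{X,n+1})\, (I_{I_{n+1}} \otimes \mathcal{Z}_{(1)}), \tag{4.6}$$

where $\mathcal{Q}$ is $I_1 \times \cdots \times I_{n-1} \times R_{n-1}$ with

$$\mathcal{Q}(i_1, \dots, i_{n-1}, r_{n-1}) = \mathcal{T}_{X,1}(i_1, :) \cdots \mathcal{T}_{X,n-1}(:, i_{n-1}, r_{n-1}),$$

and $\mathcal{Z}$ is $R_{n+1} \times I_{n+2} \times \cdots \times I_N$ with

$$\mathcal{Z}(r_{n+1}, i_{n+2}, \dots, i_N) = \mathcal{T}_{X,n+2}(r_{n+1}, i_{n+2}, :) \cdots \mathcal{T}_{X,N}(:, i_N).$$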

Truncating or rounding the TT rank of $\mathcal{X}$ in this case corresponds to performing a
truncated SVD of $\mathcal{X}_{(1:n)}$. The correctness of algorithm 7 stems from the fact that
at the nth step of the truncation loop, the matrix $I_{I_n} \otimes \mathcal{Q}_{(1:n-1)}$ has orthonormal
columns and the matrix $H(\mathcal{T}_{X,n+1})(I_{I_{n+1}} \otimes \mathcal{Z}_{(1)})$ has orthonormal rows, and therefore
the truncated SVD of $V(\mathcal{T}_{X,n})$ yields the truncated SVD of $\mathcal{X}_{(1:n)}$.

In our proposed approach, we do not impose orthogonality on the exterior matrices
and instead use a Gram SVD based approach. To follow the analogy of
section 4.4, we consider $A = \mathcal{A}_{(1:n)} = (I_{I_n} \otimes \mathcal{Q}_{(1:n-1)}) V(\mathcal{T}_{X,n})$ and
$B^T = \mathcal{B}_{(1)} = H(\mathcal{T}_{X,n+1})(I_{I_{n+1}} \otimes \mathcal{Z}_{(1)})$, where $\mathcal{A}$ and $\mathcal{B}$ are tensors with dimensions
$I_1 \times \cdots \times I_n \times R_n$ and $R_n \times I_{n+1} \times \cdots \times I_N$, respectively. We visualize these relationships using a tensor
network diagram [48] in fig. 4.2a. In these diagrams, a node represents a tensor, edges
represent modes (so that the degree of a node is its dimension), and adjacent nodes
represent contractions. To perform the truncation, we first compute $A^TA$ and $B^TB$
as described in section 4.5.2. Then we follow the approach of algorithm 9 and finally
compute $\hat{A}$ and $\hat{B}$ by updating only $V(\mathcal{T}_{X,n})$ and $H(\mathcal{T}_{X,n+1})$, leaving the TT cores
that constitute $\mathcal{Q}$ and $\mathcal{Z}$ unchanged.

4.5.2 Structured Gram Matrix Computation

Considering $\mathcal{A}_{(1:n)} = (I_{I_n} \otimes \mathcal{Q}_{(1:n-1)}) V(\mathcal{T}_{X,n})$ as the matrix A in our matrix product
example, our goal is to compute $A^TA$ exploiting the structure of A (and the internal
structure of $\mathcal{Q}_{(1:n-1)}$). This can also be seen as a contraction between $\mathcal{A}$, a tensor of
dimension n + 1, and itself in the first n modes.

The structure is easiest to understand in the form of a tensor network diagram, as
we show in fig. 4.2b. In the figure, we have n = 3, so that $\mathcal{A}$ is a 4-way tensor composed
of 3 TT cores. To visualize contracting $\mathcal{A}$ with itself and compute $G^L_3 = \mathcal{A}_{(1:3)}^T \mathcal{A}_{(1:3)}$,
we draw $\mathcal{A}$ twice and connect edges corresponding to the modes with dimensions $I_1$,
$I_2$, and $I_3$. After all connected modes are contracted, we are left with 2 un-contracted
modes, each of dimension $R_3$, corresponding to a square output matrix (which is also
symmetric). We use the notation $G^L_3$ to signify that the matrix is built from the left-most cores;
it has dimension $R_3 \times R_3$.

The most efficient way to perform the contractions to compute $G^L_n = \mathcal{A}_{(1:n)}^T \mathcal{A}_{(1:n)}$
is to work left to right, first contracting the mode with dimension $I_1$. Because the
operation involves two tensors with dimension 2, it corresponds to the (symmetric)
matrix multiplication $G^L_1 = V(\mathcal{T}_{X,1})^T V(\mathcal{T}_{X,1})$, where we use the notation $G^L_1$ because
the result is the contraction between the left-most cores and has dimension $R_1 \times R_1$.
The next step is to contract the two $\mathcal{T}_{X,2}$ nodes with $G^L_1$ to compute $G^L_2$. These two
contractions can be performed in either order or simultaneously, exploiting symmetry
as we describe below. We continue this process of computing each symmetric Gram
matrix from the previous mode's, finally computing $G^L_n$ from $G^L_{n-1}$ and the two $\mathcal{T}_{X,n}$
cores. Figure 4.2c shows the structure of the tensor network before $G^L_3$ is computed
from $G^L_2$ and the two $\mathcal{T}_{X,3}$ cores.

The key to the efficiency of the structured Gram matrix computation in the context
of TT rounding is the fact that we obtain all Gram matrices $G^L_n$ as a by-product
of computing the last one, $G^L_{N-1}$. In this way, we have performed the $A^TA$-analogue
computations for truncating all TT ranks with one left-to-right pass over the TT
representation of the tensor. In order to compute the $B^TB$-analogue quantities, we
make a similar pass from right to left to obtain $G^R_n$ for $1 \le n \le N-1$. Note
that $G^R_n$ is the contraction between the right-most cores to the right of (and not
including) the nth core, so that $G^L_n$ and $G^R_n$ are the Gram matrices associated with
the truncation of the nth TT rank and are both $R_n \times R_n$.

We now consider two ways of computing $G^L_n$ from $G^L_{n-1}$ and two $\mathcal{T}_{X,n}$ cores, which
we refer to as nonsymmetric and symmetric approaches. Computations for $G^R_n$ from
$G^R_{n+1}$ are analogous. In the nonsymmetric approach, we contract $G^L_{n-1}$ with one of
the cores, letting $\mathcal{T}_{C,n}$ represent the temporary result as illustrated in fig. 4.2c. Here
we consider $\mathcal{C}$ to be a TT-format tensor with the same dimensions and ranks as $\mathcal{X}$
for convenient notation. This contraction is a tensor-times-matrix operation and can
be expressed as $\mathcal{T}_{C,n} = \mathcal{T}_{X,n} \times_1 G^L_{n-1}$ and computed as $H(\mathcal{T}_{C,n}) = G^L_{n-1} H(\mathcal{T}_{X,n})$.
After the first contraction, $\mathcal{T}_{C,n}$ and the remaining $\mathcal{T}_{X,n}$ share two modes, and the
second contraction is across both modes. This operation can be performed via
$G^L_n = V(\mathcal{T}_{X,n})^T V(\mathcal{T}_{C,n})$. Note that while the result is symmetric in exact arithmetic, this
approach does not assume symmetry, and the result will not be bit-wise symmetric
due to roundoff error.
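The left-to-right pass with the nonsymmetric contraction can be sketched in NumPy as below. The storage convention (every core, including the first and last, held as a 3-way array of shape R_{n-1} × I_n × R_n with boundary ranks equal to 1), the helper name `left_grams`, and the small random test cores are assumptions of this sketch. Unfoldings are formed with row-major reshapes.

```python
import numpy as np

def left_grams(cores):
    """Return [G^L_1, ..., G^L_{N-1}] from one left-to-right pass over the TT cores."""
    grams = []
    G = np.ones((1, 1))                       # trivial 1 x 1 Gram matrix left of core 1
    for core in cores[:-1]:
        r0, I, r1 = core.shape
        # First contraction: H(T_C) = G^L_{n-1} H(T_X), stored as a vertical unfolding.
        C = (G @ core.reshape(r0, I * r1)).reshape(r0 * I, r1)
        # Second contraction over two modes: G^L_n = V(T_X)^T V(T_C).
        G = core.reshape(r0 * I, r1).T @ C
        grams.append(G)
    return grams

rng = np.random.default_rng(0)
shapes = [(1, 4, 3), (3, 5, 2), (2, 6, 4), (4, 3, 1)]
cores = [rng.standard_normal(s) for s in shapes]
grams = left_grams(cores)

# Reference check: form the unfolding A_(1:3) explicitly and compare with G^L_3.
A = cores[0]
for core in cores[1:3]:
    A = np.tensordot(A, core, axes=1)         # contract the shared rank mode
A = A.reshape(-1, 4)                          # (I_1 I_2 I_3) x R_3
max_diff = np.abs(grams[2] - A.T @ A).max()
```

The explicit reference costs time and memory proportional to the product of the mode sizes, whereas the structured pass touches each core only once.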

In the symmetric approach, we can use the fact that every Gram matrix is symmetric
and positive semi-definite. Thus, we can compute a (pivoted) Cholesky decomposition
$G^L_{n-1} = LL^T$. Then we can contract each L factor with one of the $\mathcal{T}_{X,n}$ nodes,
permuting slices of $\mathcal{T}_{X,n}$ if necessary. Here, one contraction is sufficient because they
are equivalent operations, and we can exploit the triangular structure of L to save half
the arithmetic of the tensor-times-matrix operation. Letting $\mathcal{T}_{D,n} = \mathcal{T}_{X,n} \times_1 L$ represent
the result, the second contraction is performed via $G^L_n = V(\mathcal{T}_{D,n})^T V(\mathcal{T}_{D,n})$, which
can be performed symmetrically, again saving half the arithmetic and producing an
exactly symmetric result.

Recall that $G^L_{n-1}$ is a matrix with dimension $R_{n-1} \times R_{n-1}$ and $\mathcal{T}_{X,n}$ has
dimensions $R_{n-1} \times I_n \times R_n$. In the nonsymmetric approach, the first contraction
requires $2 I_n R_{n-1}^2 R_n$ operations, and the second contraction requires $2 I_n R_{n-1} R_n^2$
operations. In the symmetric approach, the Cholesky decomposition requires $O(R_{n-1}^3)$
operations, and the two contractions together require $I_n R_{n-1}^2 R_n + I_n R_{n-1} R_n^2$
operations, not including any pivoting that must be performed. Despite the fact that the
symmetric approach saves half the flops, we use the nonsymmetric approach in our
later experiments because of the empirical performance benefits. We found that the
superior performance of gemm over trmm and syrk (and the need to copy data for
trmm) on our platform outweighs the reduction in arithmetic.

4.5.3 Algorithms

Given the approach to computing Gram matrices of the TT-structured matrices de-

scribed in section 4.5.2, we now present algorithms for TT-rounding using the Gram

SVD approach. We follow the basic steps outlined in section 4.4 and algorithm 9:

compute Gram matrices of factors, perform eigenvalue decompositions, truncate the

combined results using SVD, then apply updates to factors to reduce their dimensions.

As described in section 4.5.2, with a left-to-right and right-to-left pass of the TT

structure, we can obtain the Gram matrices associated with every TT rank truncation.

Given its pair of Gram matrices, each TT rank can be truncated independently of all

others. We call this approach the simultaneous variant to distinguish it from a more

computationally efficient method that truncates ranks in sequence (described below).

The simultaneous variant of the algorithm is given as algorithm 10. Lines 2 to 11
show the set of contractions used to obtain Gram matrices across all modes. Lines 14
to 16 perform the eigenvalue and singular value decompositions of small matrices.
Finally, lines 17 and 18 update the TT cores and reduce their dimension. Note that
the singular values are distributed evenly to each interior factor, as each is scaled by
$\hat{\Sigma}^{1/2}$.

Algorithm 10 TT-Rounding via Gram SVD (Simultaneous)
 1: function $\mathcal{Y}$ = TT-Round-Gram-Sim($\mathcal{X}$, $\varepsilon$)
 2:     $G^L_1 = V(\mathcal{T}_{X,1})^T V(\mathcal{T}_{X,1})$
 3:     for n = 2 to N − 1 do
 4:         $H(\mathcal{T}_{C,n}) = G^L_{n-1} H(\mathcal{T}_{X,n})$
 5:         $G^L_n = V(\mathcal{T}_{X,n})^T V(\mathcal{T}_{C,n})$
 6:     end for
 7:     $G^R_{N-1} = H(\mathcal{T}_{X,N}) H(\mathcal{T}_{X,N})^T$
 8:     for n = N − 1 down to 1 do
 9:         $V(\mathcal{T}_{C,n}) = V(\mathcal{T}_{X,n}) G^R_n$
10:         $G^R_{n-1} = H(\mathcal{T}_{C,n}) H(\mathcal{T}_{X,n})^T$
11:     end for
12:     Compute $\|\mathcal{X}\| = (G^R_0)^{1/2}$ and $\varepsilon_0 = \frac{\|\mathcal{X}\|}{\sqrt{N-1}}\,\varepsilon$
13:     for n = 1 to N − 1 do
14:         $[V_L, \Lambda_L]$ = Eig($G^L_n$)
15:         $[V_R, \Lambda_R]$ = Eig($G^R_n$)
16:         $[\hat{U}, \hat{\Sigma}, \hat{V}]$ = tSVD($\Lambda_L^{1/2} V_L^T V_R \Lambda_R^{1/2}$, $\varepsilon_0$)
17:         $V(\mathcal{T}_{Y,n}) = V(\mathcal{T}_{X,n}) \cdot (V_L \Lambda_L^{-1/2} \hat{U} \hat{\Sigma}^{1/2})$
18:         $H(\mathcal{T}_{Y,n+1}) = (\hat{\Sigma}^{1/2} \hat{V}^T \Lambda_R^{-1/2} V_R^T) \cdot H(\mathcal{T}_{Y,n+1})$
19:     end for
20: end function

Alternatively, we can truncate the TT ranks in sequence to save some arithmetic
by exploiting orthogonality. Following the original approach of TT-Rounding
via orthogonalization (algorithm 7), if we truncate the ranks from left to right and
pass all singular values to the right, then we maintain orthogonality of the left-most
cores. That is, when truncating the nth rank and considering eq. (4.6), we have that
$\mathcal{Q}_{(1:n-1)}$ has orthonormal columns. Thus, truncating $\mathcal{X}_{(1:n)}$ is equivalent to truncating
$V(\mathcal{T}_{X,n}) H(\mathcal{T}_{X,n+1}) (I_{I_{n+1}} \otimes \mathcal{Z}_{(1)})$. In the standard approach, we also have that
$H(\mathcal{T}_{X,n+1})(I_{I_{n+1}} \otimes \mathcal{Z}_{(1)})$ has orthonormal rows, but that does not apply here. Instead,
we use the analogue of $A = V(\mathcal{T}_{X,n})$ and $B^T = H(\mathcal{T}_{X,n+1})(I_{I_{n+1}} \otimes \mathcal{Z}_{(1)})$. We note that
$B^T$ is identical to the simultaneous case, so $B^TB$ is exactly $G^R_n$. The A matrix is
different, but because it corresponds to a single core, the Gram matrix
is much cheaper to compute: $G^L_n = V(\mathcal{T}_{X,n})^T V(\mathcal{T}_{X,n})$.

Thus, we can make a single right-to-left pass to pre-compute all Gram matrices
corresponding to $B^TB$, and then we can make a left-to-right truncation pass where we
maintain orthogonality of the left-most cores and compute Gram matrices for $A^TA$
in sequence. The other added benefit of this approach is that the nth core already
has one dimension truncated (from the previous mode) when its Gram matrix is
computed. This sequence variant is presented in algorithm 11.

Algorithm 11 TT-Rounding via Gram SVD (Sequence RLR)
 1: function $\mathcal{Y}$ = TT-Round-Gram-Seq($\mathcal{X}$, $\varepsilon$)
 2:     $G^R_{N-1} = H(\mathcal{T}_{X,N}) H(\mathcal{T}_{X,N})^T$
 3:     for n = N − 1 down to 1 do
 4:         $V(\mathcal{T}_{C,n}) = V(\mathcal{T}_{X,n}) G^R_n$
 5:         $G^R_{n-1} = H(\mathcal{T}_{C,n}) H(\mathcal{T}_{X,n})^T$
 6:     end for
 7:     Compute $\|\mathcal{X}\| = (G^R_0)^{1/2}$ and $\varepsilon_0 = \frac{\|\mathcal{X}\|}{\sqrt{N-1}}\,\varepsilon$
 8:     for n = 1 to N − 1 do
 9:         $G^L_n = V(\mathcal{T}_{X,n})^T V(\mathcal{T}_{X,n})$
10:         $[V_L, \Lambda_L]$ = Eig($G^L_n$)
11:         $[V_R, \Lambda_R]$ = Eig($G^R_n$)
12:         $[\hat{U}, \hat{\Sigma}, \hat{V}]$ = tSVD($\Lambda_L^{1/2} V_L^T V_R \Lambda_R^{1/2}$, $\varepsilon_0$)
13:         $V(\mathcal{T}_{Y,n}) = V(\mathcal{T}_{X,n}) \cdot (V_L \Lambda_L^{-1/2} \hat{U})$
14:         $H(\mathcal{T}_{Y,n+1}) = (\hat{\Sigma} \hat{V}^T \Lambda_R^{-1/2} V_R^T) \cdot H(\mathcal{T}_{Y,n+1})$
15:     end for
16: end function
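A sequential NumPy sketch of the RLR sequence variant follows. It is an illustration under stated assumptions rather than the thesis code: cores are stored as 3-way arrays of shape R_{n-1} × I_n × R_n with boundary ranks 1, the truncation works in place on a copy of the cores, negligible eigenvalues are discarded with a relative cutoff `rel_tol`, and the helper names (`tt_full`, `tt_add`, `eig_sqrt`, `tt_round_gram_seq`) are hypothetical. The test tensor is the formal TT sum of a rank-3 tensor with itself, so its formal ranks of 6 are redundant.

```python
import numpy as np

def tt_full(cores):
    """Contract TT cores (each r_prev x I x r_next) into a dense tensor."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=1)      # contract the shared rank mode
    return out[0, ..., 0]                          # drop boundary ranks of size 1

def tt_add(a, b):
    """Formal TT sum: ranks add, so the result is redundant when a and b coincide."""
    out = [np.concatenate([a[0], b[0]], axis=2)]
    for p, q in zip(a[1:-1], b[1:-1]):
        core = np.zeros((p.shape[0] + q.shape[0], p.shape[1], p.shape[2] + q.shape[2]))
        core[:p.shape[0], :, :p.shape[2]] = p      # block-diagonal interior cores
        core[p.shape[0]:, :, p.shape[2]:] = q
        out.append(core)
    out.append(np.concatenate([a[-1], b[-1]], axis=0))
    return out

def eig_sqrt(G, rel_tol=1e-14):
    """Square roots of the significant eigenvalues of G (cutoff is an assumption)."""
    lam, V = np.linalg.eigh(G)
    keep = lam > rel_tol * lam.max()
    return np.sqrt(lam[keep]), V[:, keep]

def tt_round_gram_seq(cores, eps):
    """Right-to-left Gram sweep followed by left-to-right truncation (RLR)."""
    N = len(cores)
    Y = [c.copy() for c in cores]
    GR, G = [None] * N, np.ones((1, 1))
    for n in range(N - 1, -1, -1):                 # GR[n]: Gram of cores n..N-1 (0-based)
        r0, I, r1 = Y[n].shape
        C = Y[n].reshape(r0 * I, r1) @ G           # V(T_C) = V(T_X) G^R
        G = C.reshape(r0, I * r1) @ Y[n].reshape(r0, I * r1).T
        GR[n] = G
    eps0 = eps * np.sqrt(GR[0][0, 0]) / np.sqrt(N - 1)   # GR[0] holds ||X||^2
    for n in range(N - 1):
        r0, I, r1 = Y[n].shape
        V = Y[n].reshape(r0 * I, r1)
        sL, VL = eig_sqrt(V.T @ V)                 # Gram matrix of the single core
        sR, VR = eig_sqrt(GR[n + 1])
        U, s, Vt = np.linalg.svd((sL[:, None] * VL.T) @ (VR * sR), full_matrices=False)
        tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]    # tail[k] = norm of s[k:]
        L = max(1, int(np.sum(tail > eps0)))
        Y[n] = (V @ ((VL / sL) @ U[:, :L])).reshape(r0, I, L)
        _, I2, r2 = Y[n + 1].shape
        W = (s[:L, None] * Vt[:L]) @ (VR / sR).T   # singular values passed to the right
        Y[n + 1] = (W @ Y[n + 1].reshape(r1, I2 * r2)).reshape(L, I2, r2)
    return Y

rng = np.random.default_rng(0)
dims, ranks = (4, 5, 6, 3), (1, 3, 3, 3, 1)
base = [rng.standard_normal((ranks[k], dims[k], ranks[k + 1])) for k in range(4)]
X = tt_add(base, base)                             # formal ranks 6, true ranks 3
Y = tt_round_gram_seq(X, 1e-6)
dense = tt_full(X)
rel_err = np.linalg.norm(tt_full(Y) - dense) / np.linalg.norm(dense)
new_ranks = [c.shape[2] for c in Y[:-1]]
```

Forming the dense tensor is done here only to check the result on a tiny example; the rounding routine itself never leaves the TT representation.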

We note that the sequence order is arbitrary. Algorithm 11 truncates ranks in

left-to-right order, but it can also truncate right-to-left if the Gram matrix sweep

is done left-to-right. Following prior work [1], we use the acronym RLR to signify

a right-to-left Gram sweep followed by a left-to-right truncation sweep, and we use

LRL to signify left-to-right Gram sweep followed by a right-to-left truncation sweep.

4.5.4 Parallelization

Algorithms 10 and 11 are presented as sequential algorithms. We describe the parallel

version of the algorithm in words here, as we have chosen the algorithm for its ease

of parallelization. We follow the same parallel distribution as prior work on TT-

Rounding via orthogonalization [1] described in section 4.2.5, with each TT core

distributed across all processors and each processor owning a subset of the slices in

1D-distribution fashion.

There are two main parallel operations to consider in these algorithms: (1) a
TT core times a small matrix in one mode (e.g., line 4 in algorithm 10), and (2)
the contraction of two TT cores across two modes (e.g., line 5 in algorithm 10).
Given the parallel distribution, a TT core times a small matrix in one mode (which is
expressed as pre-multiplication of the horizontal unfolding or post-multiplication of
the vertical unfolding by a small matrix) can be performed independently, with no
communication, if all processors have access to the small matrix. Also, the contraction
of two TT cores (expressed as the transpose of a vertical unfolding times another
vertical unfolding or a horizontal unfolding times the transpose of another horizontal
unfolding) can be performed via parallel reduction with a small matrix as output:
after local contraction, a single all-reduce computes and stores the result across all
processors.
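The communication pattern described above can be mimicked in a single process: each simulated processor owns a subset of the slices of a core, performs both contractions locally, and the local Gram contributions are summed. In this sketch the Python `sum` stands in for the all-reduce, no MPI library is used, and the function names and test sizes are assumptions of the example.

```python
import numpy as np

def gram_step(G_prev, core):
    """One nonsymmetric step: H(T_C) = G_prev H(T_X), then G = V(T_X)^T V(T_C)."""
    r0, I, r1 = core.shape
    C = (G_prev @ core.reshape(r0, I * r1)).reshape(r0 * I, r1)
    return core.reshape(r0 * I, r1).T @ C

def gram_step_simulated_parallel(G_prev, core, P):
    """Split the slices of the core over P simulated processors (1D distribution)."""
    slices = np.array_split(np.arange(core.shape[1]), P)
    # Each processor needs only G_prev and its own slices: no communication here.
    local = [gram_step(G_prev, core[:, idx, :]) for idx in slices]
    # Summing the local results plays the role of the all-reduce.
    return sum(local)

rng = np.random.default_rng(0)
S = rng.standard_normal((5, 5))
G_prev = S.T @ S                                   # a symmetric positive semi-definite input
core = rng.standard_normal((5, 16, 4))
G_serial = gram_step(G_prev, core)
G_parallel = gram_step_simulated_parallel(G_prev, core, P=4)
```

The two results agree up to roundoff because the contraction is a sum over the slice index, which is exactly the index that the distribution partitions.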

In the simultaneous variant (algorithm 10), computing the left and right Gram

matrices consists of alternating these two operations. Consider lines 2 to 6: if each
$G^L_n$ contraction operation uses an all-reduce, then the subsequent core-times-matrix
operation requires only local computation and no communication. The same pattern
applies to computing the $G^R_n$ matrices. Given that the Gram matrices are all available
on all processors, the EVD and SVD operations can be performed redundantly

so that the update operations in lines 17 and 18 also require no communication. We

note that in the simultaneous variant, the EVD and SVD operations are independent

across modes. It is thus possible to distribute these computations across processors,

allowing N different processors to work simultaneously on all modes. In this case, the

processors need to broadcast their results in order to perform the update operations.

This optimization improves scalability at the expense of slightly higher communica-

tion costs. We have not implemented this approach because the sequence variant of

the algorithm outperforms the simultaneous variant in our experiments.

In the sequence variant (algorithm 11), we pre-compute only one set of Gram

matrices. Computing these Gram matrices is parallelized the same as in the simul-

taneous variant. The unique operation for the sequence variant is line 9, which is a

contraction of a TT core with itself, which is performed via local computation and

an all-reduce. As before, the EVD and SVD operations are performed redundantly

and the updates require no communication.

4.5.5 Complexity Analysis

We perform complexity analysis using the simplifying assumptions that all tensor
dimensions are equal, all ranks are equal, and all reduced ranks are equal.
That is, we assume that $I_n = I$ for $1 \le n \le N$ and that the original and reduced
ranks are $R_n = R$ and $L_n = L$ for $1 \le n \le N-1$. For comparison, the parallel cost of
TT-Rounding via orthogonalization (algorithm 7) is given by

$$\gamma \cdot \left( NIR\,\frac{3R^2 + 6RL + 4L^2}{P} + O(NR^3 \log P) \right) + \beta \cdot O(NR^2 \log P) + \alpha \cdot O(N \log P),$$

where γ, β, and α are the costs per flop, word, and message, respectively [1, Eq.
(3.6)].

Algorithm 10 (the simultaneous variant) performs two passes to compute Gram
matrices. For each mode, the local computation involves the multiplication between
a local tensor core of dimension $R \times (I/P) \times R$ with an $R \times R$ matrix, for a cost
of $2IR^3/P$ flops, and a contraction between two cores, which requires $2IR^3/P$ flops.
Thus, the total arithmetic cost of the Gram matrix computations is $8NIR^3/P$. As
described in section 4.5.2, by exploiting symmetry we can reduce the constant factor
from 8 to 4. The EVD and SVD operations are performed on $R \times R$ matrices for
a total cost of $O(NR^3)$ flops (note there is no parallelism in these operations). The
updates of the cores are multiplications of the cores with two $R \times L$ matrices. The
first multiplication costs $2IR^2L/P$ flops, while the second costs $2IRL^2/P$ because
it involves a core with one mode of already reduced dimension. Thus, the total
arithmetic cost for the updates is $2NIR^2L/P + 2NIRL^2/P$.

The communication cost of algorithm 10 is that of two all-reduces for each mode
(one for each direction of Gram matrix computation). Thus, the communication
costs across all modes are $\beta \cdot O(NR^2) + \alpha \cdot O(N \log P)$, and the total parallel cost for
algorithm 10 (assuming symmetry is exploited) is

$$\gamma \cdot \left( NIR\,\frac{4R^2 + 2RL + 2L^2}{P} + O(NR^3) \right) + \beta \cdot O(NR^2) + \alpha \cdot O(N \log P).$$

Algorithm 11 (the sequence variant) performs only one pass to compute Gram
matrices, for an arithmetic cost of $4NIR^3/P$ flops across all modes, or $2NIR^3/P$
flops if we use the symmetric approach. Computing the Gram matrix for the nth TT
core in line 9 costs $IR^2L/P$ flops, because its first mode has already been reduced
in dimension from R to L. The EVD and SVD operations and the updates of the
cores are the same as in the simultaneous variant. The communication costs are
identical to the simultaneous variant as well: there is one all-reduce for each mode in
the Gram pass and one all-reduce in each mode for line 9. Thus, the total parallel
cost for algorithm 11 (assuming symmetry is exploited) is

$$\gamma \cdot \left( NIR\,\frac{2R^2 + 3RL + 2L^2}{P} + O(NR^3) \right) + \beta \cdot O(NR^2) + \alpha \cdot O(N \log P).$$
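To make the leading constants concrete, the bracketed polynomials of the three cost expressions can be evaluated for R = 20 and L = 10, the formal and reduced ranks of the synthetic models in table 4.1. This is only a comparison of leading arithmetic terms with symmetry exploited; it says nothing about achieved run time.

```latex
\begin{align*}
\text{orthogonalization:}\quad & 3R^2 + 6RL + 4L^2 = 1200 + 1200 + 400 = 2800,\\
\text{Gram SVD, simultaneous:}\quad & 4R^2 + 2RL + 2L^2 = 1600 + 400 + 200 = 2200,\\
\text{Gram SVD, sequence:}\quad & 2R^2 + 3RL + 2L^2 = 800 + 600 + 200 = 1600.
\end{align*}
```

Under these assumptions the sequence variant needs 1600/2800, or about 57%, of the leading-order flops of the orthogonalization approach, and the simultaneous variant about 79%.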

We note that, compared to the orthogonalization approach, the Gram SVD approaches
have reduced constants on the leading arithmetic terms and smaller bandwidth
terms (by a factor of $O(\log P)$). We will see in the numerical results that
the reduced arithmetic provides significant speedup in practice, in part because the
performance of the operations (which are all based on gemm for Gram SVD) also improves.
At higher processor counts, the simplified communication structure (using
a single well-optimized collective) also provides speedup over the more complicated
communication of Tall-Skinny QR of the orthogonalization approach.

Model   Modes   Dimensions                            Memory
1       50      2K × ... × 2K                         77 MB
2       16      100M × 50K × ... × 50K × 1M           8 GB
3       30      2M × ... × 2M                         45 GB
4       10      10K × 20 × ... × 20                   930 KB

Table 4.1: Synthetic TT models used for performance experiments. All formal ranks are 20 and are cut in half to 10 by the TT rounding procedure.

4.6 Numerical Results

4.6.1 Experimental Setup

All parallel scaling experiments are performed on the Andes supercomputer at Oak

Ridge Leadership Computing Facility. Andes is a 704-node Linux cluster. Each node

contains 256 GB of RAM and 2 AMD EPYC 7302 16-Core processors for a total of

32 cores per node. We build our Gram rounding subroutines on top of the library

MPI_ATTAC [43], and we use the OpenBLAS implementation of the BLAS and LAPACK
routines [46] together with OpenMPI [24].

As described in table 4.1, we use 4 synthetic TT models for scaling experiments.

Models 1-3 are analogous to the synthetic models used in prior work [1]. Model

4 is identical in shape to the problem we solve via TT-GMRES in the MATLAB

implementation of TT Rounding (see section 4.6.4). For each model, we scale using

the three Gram SVD algorithms described in section 4.5.3 and the original QR-based

TT Rounding algorithm given by algorithm 7. All reported numbers are the minimum

of 5 trials on 5 different allocations. The sequential experiments using MATLAB were

performed on a machine with an Intel Xeon Gold 6226R CPU and 256 GB of RAM.

[Plot omitted: time (s) versus cores, from $2^5$ to $2^{11}$, for QR, SIM, LRL, and RLR.]

Figure 4.3: Strong Scaling for Model 2

[Plots omitted: (a) Strong Scaling and (b) Timing Breakdown, each showing time (s) versus cores, from 64 to 2048, for QR, SIM, LRL, and RLR.]

Figure 4.4: Performance results for Model 3. Dark signifies computation, and light signifies communication.

4.6.2 Parallel Scaling of TT Rounding

Figures 4.3 and 4.4a present strong scalability comparisons using models 2 and 3,

respectively, among different rounding procedures. In fig. 4.3, we see that Gram-

SVD-based rounding methods scale well to 32 nodes, with parallel speedups of 26×,

21×, and 21×. The LRL variant is fastest, reaching a speedup of up to
21× compared to the QR-based rounding. We note that since the mode sizes of the

boundary modes are different, the computation complexity costs for the LRL and

RLR variants become different, with LRL performing approximately half the flops

of RLR. As expected, we see a performance difference between LRL and RLR of

nearly 2× when the performance is computation bound, and the run times converge

as communication costs begin to dominate. The scalability limit is caused by the

machine and is not inherent to the algorithm, as we explain in section 4.6.3.

In the case of model 3, the mode sizes are all equal, and the complexity analysis in

section 4.5.5 tells us that the LRL and RLR approaches are about twice as fast as

the Gram-Sim approach. This analysis is confirmed by the experiment when the time

is computation bound, as we see in fig. 4.4a. Speedups of Gram SVD over QR range

from 6× to 8×, and the parallel speedups for the Gram SVD algorithms on 64 nodes

are 42×, 27×, and 15×.

4.6.3 Time Breakdown of TT Rounding

Figure 4.4b presents the relative communication/computation runtime of the strong
scalability test using model 3, matching the data of fig. 4.4a. We remark that the communication
time is more significant when using the QR-based TT rounding. In theory, the communication
costs of the QR-based rounding are a factor $O(\log P)$ larger than those of the Gram rounding
procedures. Further, the Gram SVD variants use the MPI_Allreduce
routine, which seems to be more efficient than the TSQR implementation used in the
QR-based rounding.

Figure 4.5 presents the communication/computation runtime breakdown of a weak
scalability test using model 1 and different variants of TT rounding procedures. We
remark that the computation time for each method stays the same when increasing
the number of processors, and the relative computation time affirms the theoretical
analysis of the constant factors in the leading terms. The communication time of
Gram rounding procedures shows a logarithmic increase up to 32 nodes (1024 cores)
and increases significantly on 64 nodes. This behavior appears even earlier, at 256
processors, when using the QR-based TT rounding. In order to understand this
behavior, we performed a scalability test on the MPI_Allreduce routine on Andes
using a single scalar and observed behavior similar to that in fig. 4.5: the time
increases like $\log P$ until 32 nodes and then begins to increase more quickly than
theory suggests. Thus, we believe the scalability limit is reached due to an artifact
of the machine rather than a limitation of the algorithm, whose latency costs should
grow with $O(\log P)$.

[Plot omitted: time (s) versus cores, from 32 to 2048, for QR, SIM, LRL, and RLR.]

Figure 4.5: Weak scaling time breakdowns for Model 1. Dark signifies computation, and light signifies communication.

4.6.4 TT-GMRES Performance

Here we consider a parameter-dependent PDE model where we seek an increasingly
accurate solution by refining the mesh in space. This mesh refinement increases
the size of mode 1 and leaves the sizes of the parameter modes unchanged.

Figure 4.6: TT-GMRES timing for MATLAB implementation (time in seconds versus tensor dimension, for the QR, SIM, and LRL variants). Dark signifies TT rounding, and light signifies other computation.

MATLAB Performance for Small Problem

In this experiment, we use TT-GMRES to solve the Cookies problem described in
section 4.2.3 using p = 4 parameters. The values of each parameter are distributed
logarithmically in the interval [0.1, 10]. The discretization of the PDE is obtained
using FreeFem++ [29]. For each variant of TT rounding, we perform 10 iterations.
The variation in the relative residual norm across the different methods is
negligible. For all methods we obtain an accuracy of approximately $10^{-3}$.
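As a small illustration of the parameter sampling described above, the following Python sketch generates logarithmically distributed values in [0.1, 10] for p = 4 parameters. The number of values per parameter is not stated here, so `num_values = 5` is a placeholder assumption, and the helper name is hypothetical.

```python
import numpy as np


def log_spaced_parameters(num_values, low=0.1, high=10.0):
    """Return num_values points spaced evenly on a log scale in [low, high]."""
    return np.logspace(np.log10(low), np.log10(high), num_values)


p = 4           # number of parameters, as in the experiment
num_values = 5  # assumed size of each parameter mode (placeholder)
parameter_grids = [log_spaced_parameters(num_values) for _ in range(p)]
print(parameter_grids[0])
```

Consecutive values differ by a constant ratio, so each decade of the interval receives the same number of samples.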

Figure 4.6 compares the performance of the original QR-based TT rounding in a
MATLAB implementation of TT-GMRES with the Gram-Sim and Gram-Seq (LRL)
implementations of TT rounding in MATLAB. We note that TT rounding accounts for
at least half of the runtime of TT-GMRES when using QR, and that Gram-Sim gives
at least a 2× speedup over the QR implementation of TT rounding, yielding an overall
faster TT-GMRES algorithm.

Figure 4.7: TT-GMRES Weak Scaling (time in seconds versus cores, from $2^5$ to $2^{11}$, for the QR, SIM, and LRL variants).

Weak Scaling of TT Rounding for Larger Problems

Using a TT tensor of the same dimensions as the one used in section 4.6.4, we weakly
scale the spatial dimension on Andes, keeping all other modes fixed, and report the
results in Figure 4.7. We remark that the LRL variant does less computation than
RLR, so we report only LRL performance, which we see weakly scales well until $2^{10}$
cores.

4.7 Conclusion

We present in this work a parallel rounding procedure for low-rank TT tensors based

on Gram SVD. In contrast with the orthogonalization-based rounding procedure that

relies heavily on QR decomposition of tall and skinny matrices, this method relies on

matrix multiplication. Not only does the Gram SVD approach reduce the computa-

tional complexity, but existing on-node implementations of matrix multiplication are

typically more efficient than those computing and multiplying by orthogonal matrices.

Our scalability experiments show that the proposed method scales as well as or


better than the state of the art, in large part because all the communication is cast

in terms of all-reduce collectives. We observe a maximum speedup over the previous

work of 21× for a 16-mode tensor on 16 nodes (512 cores). Our numerical experiments

also show that the loss of accuracy inherent in the Gram SVD does not affect the final
accuracy of the solution when used in iterative low-rank solvers such as TT-GMRES,
where aggressive truncation, and hence low accuracy, can be used.
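The loss of accuracy mentioned above can be seen in a small experiment. Forming the Gram matrix squares the singular values, so in double precision any singular value well below roughly 1e-8 times the largest one falls under the rounding error of the Gram matrix. The Python sketch below builds a matrix with prescribed singular values and compares a direct SVD with the Gram route. The matrix sizes and singular values are arbitrary choices for illustration, not data from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3
# Random orthonormal factors and prescribed singular values (illustrative only).
u, _ = np.linalg.qr(rng.standard_normal((m, n)))
v, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = np.array([1.0, 1e-4, 1e-10])
a = (u * sigma) @ v.T

# Direct SVD of the matrix.
s_direct = np.linalg.svd(a, compute_uv=False)

# Gram route: eigenvalues of a^T a are the squared singular values.
eigvals = np.linalg.eigvalsh(a.T @ a)[::-1]  # descending order
s_gram = np.sqrt(np.clip(eigvals, 0.0, None))

rel_err_direct = np.abs(s_direct - sigma) / sigma
rel_err_gram = np.abs(s_gram - sigma) / sigma
print("direct SVD relative errors:", rel_err_direct)
print("Gram SVD relative errors:  ", rel_err_gram)
```

The two largest singular values are recovered well by both routes, while the smallest one is lost by the Gram route. When the truncation tolerance is as loose as in TT-GMRES, singular values that small would be discarded anyway.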

We consider simultaneous and sequence variants of the Gram SVD approach. The

theoretical analysis and experimental results show that the reduced arithmetic of the

sequence variants leads to shorter run times in almost all cases. Within the sequence

variant, we observe that the LRL and RLR orderings are both possible and typically

have comparable run times. We note that for some applications where the first mode

size is much larger than the last mode size (which is common for parametrized PDE

problems), the LRL approach should be used as it has lower computational complexity.

In the light of the numerical experiments, we plan in the future to study ran-

domized methods to perform rounding procedures. Using randomized methods could

outperform the proposed procedures as they reduce arithmetic further and also rely

on matrix multiplication. Encouraged by the results of the MATLAB implementation

of TT-GMRES, we also plan to develop a scalable implementation of the TT-based

linear solver that can use our parallel TT rounding algorithms.


Chapter 5: Conclusion

Low-rank approximations of matrices and tensors are applicable for compressing

and interpreting data. By designing and implementing distributed-memory parallel

algorithms for low-rank approximations, we can feasibly compute with larger datasets

in a reasonable amount of time and without exceeding memory constraints.

Chapter 3 shows a distributed-memory implementation for Hierarchical NMF. We

showed that this algorithm can scale well as long as the local matrix multiplication

problem dominates in time. This holds true when the data matrices have many more

features than samples, and so have an aspect ratio that is “tall-and-skinny”. However,

applications like hyperspectral imaging [45] have many more samples (pixels) than

features (spectral bands) and so have an aspect ratio of “short-and-fat”. This is due

to the fact that the AVIRIS hyperspectral camera only captures 224 spectral bands,

while it can be used in high altitude imaging that covers thousands of miles for a

total of billions of pixels. This aspect ratio means that it is difficult to scale with

AVIRIS data using our current 1-D row distribution. In order to scale for this type

of data, future work should involve a more general 2-D row and column distribution.

This adds its own scaling difficulties since such distributions require processors to

redistribute data as the hierarchical tree is built.
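A short calculation makes the scaling limit of a 1-D row distribution concrete. The Python sketch below assumes a block row distribution in which rows correspond to features (spectral bands) and counts how many rows each process owns. The helper name `block_row_counts` and the process count of 1024 are hypothetical choices for illustration.

```python
def block_row_counts(num_rows, num_procs):
    """Number of rows owned by each process in a balanced block row distribution."""
    base, extra = divmod(num_rows, num_procs)
    # The first `extra` processes receive one additional row.
    return [base + (1 if rank < extra else 0) for rank in range(num_procs)]


counts = block_row_counts(224, 1024)  # 224 spectral bands, 1024 processes (assumed)
idle = sum(1 for c in counts if c == 0)
print("processes with work:", len(counts) - idle)
print("idle processes:", idle)
```

With only 224 rows, at most 224 processes can own data under this distribution, no matter how many pixels (columns) the matrix has. A 2-D distribution would also split the columns.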

Chapter 4 describes an improvement on the distributed-memory Tensor Train
Rounding algorithm using Gram matrices. By using Gram matrices instead of QR to
compute truncated SVDs, this algorithm gives at least a 2× speedup over the
state-of-the-art approach. As in chapter 3, these truncated SVDs work well when the
matrices are “tall-and-skinny”. This means that the algorithm works well when the
dimensions of a tensor are much larger than its TT-ranks, as is the case for many
problems arising from parametrized PDEs. However, in applications like the TT-GMRES
cookie example described in section 4.2.3, the dimensions of the TT tensor are either
of equal size or smaller than the TT-ranks for some modes. In some quantum physics
applications, the tensor dimensions are very small (even less than 10) and the ranks
are very large (greater than 1000) [57]. With this type of problem, the computational
bottleneck shifts from computing the Gram matrices (done sequentially) to computing
the SVDs for truncation (which can be done in parallel). Future work should therefore
implement the simultaneous Gram variant for this application, since it can be
advantageous in parallelizing the SVD computations.


Bibliography

[1] Hussam Al Daas, Grey Ballard, and Peter Benner. Parallel algorithms for tensor

train arithmetic. Technical Report 2011.06532, arXiv, 2020. URL: https://

arxiv.org/abs/2011.06532.

[2] E. Apra, E. J. Bylaska, et al. NWChem: Past, present, and future. The Journal

of Chemical Physics, 152(18):184102, 2020.

[3] J. Ballani and L. Grasedyck. A projection method to solve linear systems

in tensor format. Numerical Linear Algebra with Applications, 20(1):27–43,

2013.

[4] G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz.

Communication lower bounds and optimal algorithms for numerical linear alge-

bra. Acta Numerica, 23:1–155, May 2014. doi:10.1017/S0962492914000038.

[5] Grey Ballard, Alicia Klinvex, and Tamara G. Kolda. TuckerMPI: A parallel

C++/MPI software package for large-scale data compression via the Tucker

tensor decomposition. ACM Transactions on Mathematical Software, 46(2),

June 2020. URL: https://dl.acm.org/doi/10.1145/3378445, doi:10.1145/

3378445.

[6] E. Battenberg and D. Wessel. Accelerating non-negative matrix factorization for

audio source separation on multi-core and many-core architectures. In ISMIR,

pages 501–506, 2009. URL: https://archives.ismir.net/ismir2009/paper/

000089.pdf.


[7] P. Benner, S. Dolgov, A. Onwunta, and M. Stoll. Low-rank solvers for unsteady

Stokes–Brinkman optimal control problem with random data. Computer Methods

in Applied Mechanics and Engineering, 304:26–54, 2016.

[8] P. Benner, S. Dolgov, A. Onwunta, and M. Stoll. Low-rank solution of an optimal

control problem constrained by random Navier–Stokes equations. International
Journal for Numerical Methods in Fluids, 92(11):1653–1678, 2020.

[9] Peter Benner, Serkan Gugercin, and Karen Willcox. A survey of projection-

based model reduction methods for parametric dynamical systems. SIAM

Review, 57(4):483–531, 2015. doi:10.1137/130932715.

[10] V. T. Chakaravarthy, J. W. Choi, D. J. Joseph, X. Liu, P. Murali, Y. Sabharwal,

and D. Sreedhar. On optimizing distributed Tucker decomposition for dense ten-

sors. In 2017 IEEE International Parallel and Distributed Processing Symposium

(IPDPS), pages 1038–1047, May 2017. doi:10.1109/IPDPS.2017.86.

[11] E. Chan, M. Heimlich, A. Purkayastha, and R. van de Geijn. Collective com-

munication: theory, practice, and experience. Concurrency and Computation:

Practice and Experience, 19(13):1749–1783, 2007. doi:10.1002/cpe.1206.

[12] Jee Choi, Xing Liu, and Venkatesan Chakaravarthy. High-performance dense

Tucker decomposition on GPU clusters. In Proceedings of the International Conference
for High Performance Computing, Networking, Storage, and Analysis,

SC ’18, pages 42:1–42:11, Piscataway, NJ, USA, 2018. IEEE Press. URL:

http://dl.acm.org/citation.cfm?id=3291656.3291712.

[13] Wolfgang Dahmen, Ronald DeVore, Lars Grasedyck, and Endre Suli. Tensor-

sparsity of solutions to high-dimensional elliptic partial differential equations.


Foundations of Computational Mathematics, 16(4):813–874, 2016.

[14] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-optimal

parallel and sequential QR and LU factorizations. SIAM Journal on Scientific

Computing, 34(1):A206–A239, 2012. URL: http://epubs.siam.org/doi/abs/

10.1137/080731992, doi:10.1137/080731992.

[15] C. Ding, X. He, and H.D. Simon. On the equivalence of nonnegative matrix

factorization and spectral clustering. In SDM ’05, pages 606–610. SIAM, 2005.

doi:10.1137/1.9781611972757.70.

[16] S. V. Dolgov. TT-GMRES: solution to a linear system in the structured tensor

format. Russian Journal of Numerical Analysis and Mathematical Modelling,

28(2):149–172, 01 Apr. 2013. doi:10.1515/rnam-2013-0009.

[17] B. Drake, S. Lee-Urban, and H. Park. Smallk v1.6.2. http://smallk.github.

io/, June 2017.

[18] Bruce A. Draper, Kyungim Baek, Marian Stewart Bartlett, and J. Ross Beveridge.
Recognizing faces with PCA and ICA. Computer Vision and Image Understanding,
91(1):115–137, 2003. Special Issue on Face Recognition. URL:
https://www.sciencedirect.com/science/article/pii/S1077314203000778,
doi:10.1016/S1077-3142(03)00077-8.

[19] R. Du, D. Kuang, B. Drake, and H. Park. DC-NMF: nonnegative matrix

factorization based on divide-and-conquer for fast clustering and topic mod-

eling. Journal of Global Optimization, 68(4):777–798, 2017. doi:10.1007/

s10898-017-0515-z.

[20] Srinivas Eswar, Koby Hayashi, Grey Ballard, Ramakrishnan Kannan, Michael A.

Matheson, and Haesun Park. PLANC: Parallel low rank approximation with


non-negativity constraints. Technical Report 1909.01149, arXiv, 2019. URL:

https://arxiv.org/abs/1909.01149.

[21] J.P. Fairbanks, R. Kannan, H. Park, and D.A. Bader. Behavioral clusters in

dynamic graphs. Parallel Computing, 47:38–50, 2015. doi:10.1016/j.parco.

2015.03.002.

[22] Takeshi Fukaya, Ramaseshan Kannan, Yuji Nakatsukasa, Yusaku Yamamoto,

and Yuka Yanagisawa. Shifted Cholesky QR for computing the QR factorization

of ill-conditioned matrices. SIAM Journal on Scientific Computing, 42(1):A477–

A503, 2020. doi:10.1137/18M1218212.

[23] Takeshi Fukaya, Yuji Nakatsukasa, Yuka Yanagisawa, and Yusaku Yamamoto.

CholeskyQR2: A simple and communication-avoiding algorithm for computing

a tall-skinny QR factorization on a large-scale parallel system. In Proceedings

of the 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale

Systems, ScalA ’14, pages 31–38, Piscataway, NJ, USA, 2014. IEEE Press. URL:

http://dx.doi.org/10.1109/ScalA.2014.11, doi:10.1109/ScalA.2014.11.

[24] Edgar Gabriel, Graham E. Fagg, et al. Open MPI: Goals, concept, and design of a

next generation MPI implementation. In Proceedings, 11th European PVM/MPI

Users’ Group Meeting, pages 97–104, Budapest, Hungary, September 2004.

[25] N. Gillis, D. Kuang, and H. Park. Hierarchical clustering of hyperspectral im-

ages using rank-two nonnegative matrix factorization. IEEE Transactions on

Geoscience and Remote Sensing, 53(4):2066–2078, April 2015. doi:10.1109/

TGRS.2014.2352857.

[26] Lars Grasedyck. Existence and computation of low Kronecker-rank approxima-

tions for large linear systems of tensor product structure. Computing, 72(3-


4):247–265, 2004.

[27] L. Grigori and S. Kumar. Parallel Tensor Train through Hierarchical Decompo-

sition. working paper or preprint, February 2021. URL: https://hal.inria.

fr/hal-03081555.

[28] W. Gropp, L.N. Olson, and P. Samfass. Modeling MPI communication per-

formance on SMP nodes: Is it time to retire the ping pong test. In EuroMPI

’16, pages 41–50, New York, NY, USA, 2016. ACM. doi:10.1145/2966884.

2966919.

[29] F. Hecht. New development in FreeFem++. J. Numer. Math., 20(3-4):251–265,

2012. URL: https://freefem.org/.

[30] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, Philadel-

phia, PA, 2nd edition, 2002.

[31] R. Kannan, G. Ballard, and H. Park. A high-performance parallel algorithm for

nonnegative matrix factorization. In PPoPP ’16, pages 9:1–9:11, New York, NY,

USA, February 2016. ACM. doi:10.1145/2851141.2851152.

[32] R. Kannan, G. Ballard, and H. Park. MPI-FAUN: An MPI-based framework for

alternating-updating nonnegative matrix factorization. IEEE Transactions on

Knowledge and Data Engineering, 30(3):544–558, March 2018. doi:10.1109/

TKDE.2017.2767592.

[33] Oguz Kaya and Bora Ucar. Scalable sparse tensor decompositions in distributed

memory systems. In Proceedings of the International Conference for High Per-

formance Computing, Networking, Storage and Analysis, SC ’15, pages 77:1–

77:11, New York, NY, USA, 2015. ACM. URL: http://doi.acm.org/10.1145/

2807591.2807624, doi:10.1145/2807591.2807624.


[34] J. Kim, Y. He, and H. Park. Algorithms for nonnegative matrix and ten-

sor factorizations: a unified view based on block coordinate descent frame-

work. Journal of Global Optimization, 58(2):285–319, 2014. doi:10.1007/

s10898-013-0035-4.

[35] J. Kim and H. Park. Fast nonnegative matrix factorization: An active-set-like

method and comparisons. SIAM Journal on Scientific Computing, 33(6):3261–

3281, 2011. doi:10.1137/110821172.

[36] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications.

SIAM Review, 51(3):455–500, September 2009. doi:10.1137/07070111X.

[37] Daniel Kressner and Christine Tobler. Krylov subspace methods for linear sys-

tems with tensor product structure. SIAM J. Matrix Anal. Appl., 31(4):1688–

1714, 2009/10. doi:10.1137/090756843.

[38] D. Kuang and H. Park. Fast rank-2 nonnegative matrix factorization for hierar-

chical document clustering. In KDD ’13, pages 739–747, New York, NY, USA,

2013. ACM. doi:10.1145/2487575.2487606.

[39] Oak Ridge National Laboratory. Summit: America’s newest and smartest su-

percomputer. https://www.olcf.ornl.gov/summit/.

[40] D. Landgrebe and L. Biehl. Multispec - hyperspectral images. https:

//engineering.purdue.edu/~biehl/MultiSpec/hyperspectral.html,

February 2020.

[41] G.E. Moon, J.A. Ellis, A. Sukumaran-Rajam, S. Parthasarathy, and P. Sadayap-

pan. ALO-NMF: Accelerated locality-optimized non-negative matrix factoriza-

tion. In KDD ’20, 2020. doi:10.1145/3394486.3403227.


[42] Gordon E. Moore. Cramming more components onto integrated circuits,

reprinted from Electronics, volume 38, number 8, April 19, 1965, pp. 114 ff. IEEE

Solid-State Circuits Society Newsletter, 11(3):33–35, 2006. doi:10.1109/N-SSC.

2006.4785860.

[43] MPI ATTAC. URL: https://gitlab.com/aldaas/mpi_attac.

[44] Alexander Novikov, Pavel Izmailov, Valentin Khrulkov, Michael Figurnov, and

Ivan V Oseledets. Tensor Train decomposition on TensorFlow (T3F). Journal

of Machine Learning Research, 21(30):1–7, 2020.

[45] Jet Propulsion Laboratory, California Institute of Technology. AVIRIS data portal

2006-2020. https://aviris.jpl.nasa.gov/dataportal.

[46] OpenBLAS. URL: https://github.com/xianyi/OpenBLAS.

[47] I. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Comput-

ing, 33(5):2295–2317, 2011. doi:10.1137/090752286.

[48] Roger Penrose. Applications of negative dimensional tensors. Combinatorial

mathematics and its applications, 1:221–244, 1971.

[49] Anh-Huy Phan, Petr Tichavsky, and Andrzej Cichocki. Fast alternating LS al-

gorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE

Transactions on Signal Processing, 61(19):4834–4846, Oct 2013. doi:10.1109/

TSP.2013.2269903.

[50] Melven Rohrig-Zollner, Jonas Thies, and Achim Basermann. Performance of

low-rank approximations in tensor train format (TT-SVD) for large dense tensors,

2021. arXiv:2102.00104.


[51] F. Shahnaz, M.W. Berry, V.P. Pauca, and R.J. Plemmons. Document clustering

using nonnegative matrix factorization. Information Processing & Management,

42(2):373–386, 2006. doi:10.1016/j.ipm.2004.11.005.

[52] The ISIC 2020 challenge dataset, 2020. doi:10.34970/2020-ds01.

[53] Shaden Smith, Niranjay Ravindran, Nicholas D. Sidiropoulos, and George

Karypis. SPLATT: Efficient and parallel sparse tensor-matrix multiplication.

In Proceedings of the 2015 IEEE International Parallel and Distributed Pro-

cessing Symposium, IPDPS ’15, pages 61–70, Washington, DC, USA, 2015.

IEEE Computer Society. URL: http://dx.doi.org/10.1109/IPDPS.2015.27,

doi:10.1109/IPDPS.2015.27.

[54] Edgar Solomonik, Devin Matthews, Jeff R Hammond, John F Stanton,

and James Demmel. A massively parallel tensor contraction framework for

coupled-cluster computations. Journal of Parallel and Distributed Computing,

74(12):3176–3190, 2014.

[55] Qingquan Song, Hancheng Ge, James Caverlee, and Xia Hu. Tensor comple-

tion algorithms in big data analytics. ACM Trans. Knowl. Discov. Data, 13(1),

January 2019. doi:10.1145/3278607.

[56] E. Stoudenmire and S. R. White. ITensor: A C++ library for creating efficient

and flexible physics simulations based on tensor product wavefunctions, 2016.

Available online. URL: http://itensor.org/.

[57] E. M. Stoudenmire and Steven R. White. Real-space parallel density matrix

renormalization group. Phys. Rev. B, 87:155137, Apr 2013. URL: https://

link.aps.org/doi/10.1103/PhysRevB.87.155137, doi:10.1103/PhysRevB.

87.155137.


[58] R. Thakur, R. Rabenseifner, and W. Gropp. Optimization of collective com-

munication operations in MPICH. International Journal of High Performance

Computing Applications, 19(1):49–66, 2005. doi:10.1177/1094342005051521.

[59] Christine Tobler. Low-rank Tensor Methods for Linear Systems and Eigen-

value Problems. PhD thesis, ETH Zurich, 2012. URL: http://sma.epfl.ch/

~anchpcommon/students/tobler.pdf.

[60] L.N. Trefethen and D. Bau. Numerical Linear Algebra. Society for Industrial

and Applied Mathematics, 1997.

[61] TT-Toolbox. URL: https://github.com/oseledets/TT-Toolbox.

[62] Laurens Van Der Maaten, Eric Postma, and Jaap Van den Herik. Dimensionality

reduction: a comparative. J Mach Learn Res, 10(66-71):13, 2009.

[63] R. Weinhandl, P. Benner, and T. Richter. Linear low-rank parameter-dependent

fluid-structure interaction discretization in 2D. PAMM, 18(1):e201800178,

2018.

[64] R. Weinhandl, P. Benner, and T. Richter. Low-rank linear fluid-structure inter-

action discretizations. ZAMM - Journal of Applied Mathematics and Mechanics

/ Zeitschrift fur Angewandte Mathematik und Mechanik, 100(11):e201900205,

2020.

[65] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative ma-

trix factorization. In SIGIR ’03, pages 267–273, 2003. doi:10.1145/860435.

860485.

[66] Yassine Zniyed, Remy Boyer, Andre L. F. de Almeida, and Gerard Favier. A TT-based
hierarchical framework for decomposing high-order tensors. SIAM Journal
on Scientific Computing, 42(2):A822–A848, 2020.


Curriculum Vitae

Lawton Manning

Employment

• Graduate Research Assistant, Wake Forest University

August 2019 - May 2021

• Security Intern, Logikcull

June 2019 - August 2019

• MATLAB Software Developer, Wake Forest University

June 2018 - May 2019

Education

• Wake Forest University, Winston-Salem NC

M.S. in Computer Science, May 2021

• Wake Forest University, Winston-Salem NC

B.S. in Computer Science, May 2019

Publications

• L. Manning, G. Ballard, R. Kannan, H. Park, Parallel Hierarchical Clustering

using Rank-Two Nonnegative Matrix Factorization, in 2020 IEEE 27th International
Conference on High Performance Computing, Data, and Analytics

(HiPC), Pune, India, 2020, pp. 141-150
