Low Rank Approximation
Lecture 1

Daniel Kressner
Chair for Numerical Algorithms and HPC
Institute of Mathematics, EPFL
daniel.kressner@epfl.ch

Organizational aspects

- Lecture dates: 16.4., 23.4., 30.4., 14.5., 28.5., 4.6., 11.6., 18.6., 25.6., 2.7. (tentative)
- Exam: To be discussed next week (most likely oral exam).
- Webpage: https://www5.in.tum.de/wiki/index.php/Low_Rank_Approximation
  Slides on http://anchp.epfl.ch.
- EFY = Exercise For You.

From http://www.niemanlab.org

... his [Aleksandr Kogan’s] message went on to confirm that his approach was indeed similar to SVD or other matrix factorization methods, like in the Netflix Prize competition, and the Kosinski-Stillwell-Graepel Facebook model. Dimensionality reduction of Facebook data was the core of his model.

Rank and matrix factorizations

For a field F, let A ∈ F^{m×n}. Then

    rank(A) := dim(range(A)).

For simplicity, F = R throughout the lecture and often m ≥ n.
Let B = {b1, . . . , br} ⊂ R^m with r = rank(A) be a basis of range(A). Then each of the columns of A = (a1, a2, . . . , an) can be expressed as a linear combination of B:

    aj = ∑_{i=1}^{r} cij bi   for some coefficients cij ∈ R, i = 1, . . . , r, j = 1, . . . , n.

Defining B = (b1, b2, . . . , br) ∈ R^{m×r}:

    aj = B (c1j, . . . , crj)^T,

    A = B [ c11 · · · c1n
             ⋮         ⋮
            cr1 · · · crn ].

Rank and matrix factorizations

Lemma. A matrix A ∈ R^{m×n} of rank r admits a factorization of the form

    A = B C^T,   B ∈ R^{m×r},  C ∈ R^{n×r}.

We say that A has low rank if rank(A) ≪ m, n.
Illustration of low-rank factorization: A (m × n) versus B C^T (m × r times r × n).

    #entries:  mn  versus  mr + nr

- Generically (and in most applications), A has full rank, that is, rank(A) = min{m, n}.
- Aim instead at approximating A by a low-rank matrix.
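A minimal MATLAB sketch (ours, not from the slides; variable names are illustrative) of the entry count above, comparing the storage of an explicit rank-r factorization with that of the full matrix:

    % Storage of a rank-r factorization versus the full matrix.
    m = 1000; n = 800; r = 10;
    B = randn(m, r); C = randn(n, r);
    A = B * C';                        % m-by-n matrix of rank (at most) r

    rank(A)                            % returns r (up to roundoff)
    entries_full     = m * n;          % mn entries for A
    entries_factored = m*r + n*r;      % mr + nr entries for B and C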

Questions addressed in lecture series

What?  Theoretical foundations of low-rank approximation.
When?  A priori and a posteriori estimates for low-rank approximation. Situations that allow for low-rank approximation techniques.
Why?   Applications in engineering, scientific computing, data analysis, ... where low-rank approximation plays a central role.
How?   State-of-the-art algorithms for performing and working with low-rank approximations.

Will cover both matrices and tensors.

Contents of Lecture 1

1. Fundamental tools (SVD, relation to eigenvalues, norms, best low-rank approximation)
2. Overview of applications
3. Fundamental tools (stability, QR)
4. Extensions (weighted approximation, bivariate functions)
5. Subspace iteration

Literature for Lecture 1

Golub/Van Loan’2013: Golub, Gene H.; Van Loan, Charles F. Matrix Computations. Fourth edition. Johns Hopkins University Press, Baltimore, MD, 2013.

Horn/Johnson’2013: Horn, Roger A.; Johnson, Charles R. Matrix Analysis. Second edition. Cambridge University Press, 2013.

+ References on slides.

1. Fundamental tools
- SVD
- Relation to eigenvalues
- Norms
- Best low-rank approximation

The singular value decomposition

Theorem (SVD). Let A ∈ R^{m×n} with m ≥ n. Then there are orthogonal matrices U ∈ R^{m×m} and V ∈ R^{n×n} such that

    A = U Σ V^T,   with  Σ = [ diag(σ1, . . . , σn) ; 0 ] ∈ R^{m×n}

and σ1 ≥ σ2 ≥ · · · ≥ σn ≥ 0.

- σ1, . . . , σn are called singular values.
- u1, . . . , um are called left singular vectors.
- v1, . . . , vn are called right singular vectors.
- A vi = σi ui,  A^T ui = σi vi  for i = 1, . . . , n.
- Singular values are always uniquely defined by A.
- Singular vectors are never unique. If σ1 > σ2 > · · · > σn > 0, then unique up to ui ← ±ui, vi ← ±vi.

SVD: Sketch of proof

Induction over n. n = 1 trivial.
For general n, let v1 solve max{‖Av‖2 : ‖v‖2 = 1} =: ‖A‖2. Set σ1 := ‖A‖2 and u1 := Av1/σ1.¹ By definition,

    Av1 = σ1 u1.

After completion to orthogonal matrices U1 = (u1, U⊥) ∈ R^{m×m} and V1 = (v1, V⊥) ∈ R^{n×n}:

    U1^T A V1 = [ u1^T A v1   u1^T A V⊥ ;  U⊥^T A v1   U⊥^T A V⊥ ] = [ σ1  w^T ;  0  A1 ],

with w := V⊥^T A^T u1 and A1 := U⊥^T A V⊥. Since ‖·‖2 is invariant under orthogonal transformations,

    σ1 = ‖A‖2 = ‖U1^T A V1‖2 = ‖ [ σ1  w^T ;  0  A1 ] ‖2 ≥ √(σ1^2 + ‖w‖2^2),

where the last inequality follows from applying the matrix to the unit vector (σ1; w)/√(σ1^2 + ‖w‖2^2). Hence, w = 0. Proof completed by applying induction to A1.

¹ If σ1 = 0, choose arbitrary u1.

Very basic properties of the SVD

- r = rank(A) is the number of nonzero singular values of A.
- kernel(A) = span{vr+1, . . . , vn}
- range(A) = span{u1, . . . , ur}

SVD: Computation (for small dense matrices)

Computation of the SVD proceeds in two steps:

1. Reduction to bidiagonal form: By applying n Householder reflectors from the left and n − 1 Householder reflectors from the right, compute orthogonal matrices U1, V1 such that

       U1^T A V1 = B = [ B1 ; 0 ],

   that is, B1 ∈ R^{n×n} is an upper bidiagonal matrix.

2. Reduction to diagonal form: Use Divide & Conquer to compute orthogonal matrices U2, V2 such that Σ = U2^T B1 V2 is diagonal.

Set U = U1 U2 and V = V1 V2.

Step 1 is usually the most expensive. Remarks on Step 1:
- If m is significantly larger than n, say m ≥ 3n/2, first computing a QR decomposition of A reduces the cost.
- Most modern implementations reduce A successively via banded form to bidiagonal form.²

² Bischof, C. H.; Lang, B.; Sun, X. A framework for symmetric band reduction. ACM Trans. Math. Software 26 (2000), no. 4, 581–601.

SVD: Computation (for small dense matrices)

In most applications, the vectors un+1, . . . , um are not of interest. By omitting these vectors one obtains the following variant of the SVD.

Theorem (Economy size SVD). Let A ∈ R^{m×n} with m ≥ n. Then there is a matrix U ∈ R^{m×n} with orthonormal columns and an orthogonal matrix V ∈ R^{n×n} such that

    A = U Σ V^T,   with  Σ = diag(σ1, . . . , σn) ∈ R^{n×n}

and σ1 ≥ σ2 ≥ · · · ≥ σn ≥ 0.

Computed by MATLAB's [U,S,V] = svd(A,'econ').

Complexity:
                        memory          operations
 singular values only   O(mn)           O(mn^2)
 economy size SVD       O(mn)           O(mn^2)
 (full) SVD             O(m^2 + mn)     O(m^2 n + mn^2)
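A small MATLAB sketch (ours, not from the slides) contrasting the sizes of the factors returned by the full and the economy size SVD:

    % Full versus economy size SVD of a tall matrix.
    m = 1000; n = 50;
    A = randn(m, n);

    [U, S, V]    = svd(A);             % full SVD: U is m-by-m
    [Ue, Se, Ve] = svd(A, 'econ');     % economy size SVD: Ue is m-by-n

    size(U)                            % 1000 x 1000
    size(Ue)                           % 1000 x 50
    norm(A - U*S*V', 'fro')            % both reproduce A up to roundoff
    norm(A - Ue*Se*Ve', 'fro')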

SVD: Computation (for small dense matrices)

Beware of roundoff error when interpreting singular value plots.
Example: semilogy(svd(hilb(100)))

[Plot: computed singular values of the 100 × 100 Hilbert matrix on a semilogarithmic scale; they decay rapidly down to about 10^{-20} and then flatten out.]

- The kink is caused by roundoff error and does not reflect the true behavior of the singular values.
- The exact singular values are known to decay exponentially.³
- Sometimes more accuracy is possible.⁴

³ Beckermann, B. The condition number of real Vandermonde, Krylov and positive definite Hankel matrices. Numer. Math. 85 (2000), no. 4, 553–577.
⁴ Drmac, Z.; Veselic, K. New fast and accurate Jacobi SVD algorithm. I. SIAM J. Matrix Anal. Appl. 29 (2007), no. 4, 1322–1342.

Singular/eigenvalue relations: symmetric matrices

A symmetric matrix A = A^T ∈ R^{n×n} admits a spectral decomposition

    A = U diag(λ1, λ2, . . . , λn) U^T

with an orthogonal matrix U. After reordering we may assume |λ1| ≥ |λ2| ≥ · · · ≥ |λn|. The spectral decomposition can be turned into an SVD A = U Σ V^T by defining

    Σ = diag(|λ1|, . . . , |λn|),   V = U diag(sign(λ1), . . . , sign(λn)).

Remark: This extends to the more general case of normal matrices (e.g., orthogonal or symmetric) via complex spectral or real Schur decompositions.
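A brief MATLAB check (ours; variable names are illustrative) of the construction above, assuming all eigenvalues are nonzero so that diag(sign(λi)) is orthogonal:

    % From spectral decomposition to SVD for a symmetric matrix.
    n = 6;
    A = randn(n); A = (A + A')/2;            % random symmetric matrix

    [U, L] = eig(A);                          % A = U*L*U' with orthogonal U
    lambda = diag(L);
    [~, p] = sort(abs(lambda), 'descend');    % reorder: |lambda_1| >= ... >= |lambda_n|
    U = U(:, p); lambda = lambda(p);

    Sigma = diag(abs(lambda));
    V = U * diag(sign(lambda));               % assumes no zero eigenvalues

    norm(A - U*Sigma*V')                              % ~ machine precision
    norm(sort(abs(lambda), 'descend') - svd(A))       % singular values = |eigenvalues|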

Singular/eigenvalue relations: general matrices

Consider the SVD A = U Σ V^T of A ∈ R^{m×n} with m ≥ n. We then have:

1. Spectral decomposition of the Gramian

       A^T A = V Σ^T Σ V^T = V diag(σ1^2, . . . , σn^2) V^T

   A^T A has eigenvalues σ1^2, . . . , σn^2; the right singular vectors of A are eigenvectors of A^T A.

2. Spectral decomposition of the Gramian

       A A^T = U Σ Σ^T U^T = U diag(σ1^2, . . . , σn^2, 0, . . . , 0) U^T

   A A^T has eigenvalues σ1^2, . . . , σn^2 and, additionally, m − n zero eigenvalues; the first n left singular vectors of A are eigenvectors of A A^T.

3. Decomposition of the Golub-Kahan matrix

       𝒜 = [ 0  A ;  A^T  0 ] = [ U  0 ;  0  V ] [ 0  Σ ;  Σ^T  0 ] [ U  0 ;  0  V ]^T.

EFY. Prove that 𝒜 has eigenvalues ±σj with eigenvectors (1/√2) (±uj ; vj).
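A quick MATLAB sketch (ours) verifying the eigenvalue/singular value relation for the Golub-Kahan matrix on a random example:

    % Eigenvalues of [0 A; A' 0] are +/- the singular values of A (plus m-n zeros).
    m = 7; n = 4;
    A = randn(m, n);

    GK = [zeros(m) A; A' zeros(n)];
    ev = sort(eig(GK), 'descend');
    sv = svd(A);

    norm(ev(1:n) - sv)                      % largest n eigenvalues = singular values
    norm(ev(end:-1:end-n+1) + sv)           % smallest n eigenvalues = their negatives
    % The remaining m - n eigenvalues are zero.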

Norms: Spectral and Frobenius norm

Given the SVD A = U Σ V^T, one defines:

- Spectral norm: ‖A‖2 = σ1.
- Frobenius norm: ‖A‖F = √(σ1^2 + · · · + σn^2).

Basic properties:
- ‖A‖2 = max{‖Av‖2 : ‖v‖2 = 1} (see proof of SVD).
- ‖·‖2 and ‖·‖F are both (submultiplicative) matrix norms.
- ‖·‖2 and ‖·‖F are both unitarily invariant, that is,

      ‖QAZ‖2 = ‖A‖2,   ‖QAZ‖F = ‖A‖F

  for any orthogonal matrices Q, Z.
- ‖A‖2 ≤ ‖A‖F ≤ √r · ‖A‖2 with r = rank(A).
- ‖AB‖F ≤ min{‖A‖2 ‖B‖F, ‖A‖F ‖B‖2}

EFY. Prove these two inequalities. Hint for the second inequality: Use the relations on the next slide to first show that ‖B‖F = ‖(‖b1‖2, . . . , ‖bn‖2)‖F.

EFY. Find a matrix A ∈ R^{m1×n} and a nonzero matrix B ∈ R^{m2×n} such that ‖A‖2 = ‖[A ; B]‖2. Classify the set of matrices A ∈ R^{m1×n} such that ‖A‖2 < ‖[A ; B]‖2 for every nonzero matrix B ∈ R^{m2×n}. Investigate analogous questions for the Frobenius norm.

Euclidean geometry on matrices

Let B ∈ R^{n×n} have eigenvalues λ1, . . . , λn ∈ C. Then

    trace(B) := b11 + · · · + bnn = λ1 + · · · + λn.

In turn,

    ‖A‖F^2 = trace(A^T A) = trace(A A^T) = ∑_{i,j} aij^2.

Two simple consequences:
- ‖·‖F is the norm induced by the matrix inner product

      ⟨A, B⟩ := trace(A B^T),   A, B ∈ R^{m×n}.

- Partition A = (a1, a2, . . . , an) and define the vectorization

      vec(A) = [a1 ; . . . ; an] ∈ R^{mn}.

  Then ⟨A, B⟩ = ⟨vec(A), vec(B)⟩ and ‖A‖F = ‖vec(A)‖2.
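A tiny MATLAB check (ours) of the identities ⟨A, B⟩ = ⟨vec(A), vec(B)⟩ and ‖A‖F = ‖vec(A)‖2:

    % Frobenius inner product = Euclidean inner product of vectorizations.
    m = 5; n = 3;
    A = randn(m, n); B = randn(m, n);

    ip1 = trace(A*B');            % <A,B> = trace(A B^T)
    ip2 = dot(A(:), B(:));        % <vec(A), vec(B)>
    abs(ip1 - ip2)                % ~ machine precision

    norm(A, 'fro') - norm(A(:))   % Frobenius norm = 2-norm of vec(A)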

Von Neumann’s trace inequality

Theorem. For m ≥ n, let A, B ∈ R^{m×n} have singular values σ1(A) ≥ · · · ≥ σn(A) and σ1(B) ≥ · · · ≥ σn(B), respectively. Then

    |⟨A, B⟩| ≤ σ1(A) σ1(B) + · · · + σn(A) σn(B).

Consequence:

    ‖A − B‖F^2 = ⟨A − B, A − B⟩ = ‖A‖F^2 − 2⟨A, B⟩ + ‖B‖F^2
               ≥ ‖A‖F^2 − 2 ∑_{i=1}^{n} σi(A) σi(B) + ‖B‖F^2 = ∑_{i=1}^{n} (σi(A) − σi(B))^2.

EFY. Use Von Neumann's trace inequality and the SVD to show for 1 ≤ k ≤ n that

    max{ |⟨A, P Q^T⟩| : P ∈ R^{m×k}, Q ∈ R^{n×k}, P^T P = Q^T Q = Ik } = σ1(A) + · · · + σk(A).

Proof of Von Neumann's trace inequality⁵

The singular value vector σ(A) can be written as a convex combination

    σ(A) = σn(A) fn + (σn−1(A) − σn(A)) fn−1 + · · ·

with fj = e1 + · · · + ej. Decompose A analogously via its SVD A = UA ΣA VA^T:

    A = σn(A) An + (σn−1(A) − σn(A)) An−1 + · · · ,   Aj := UA diag(fj) VA^T.

Insert into the left-hand side of the trace inequality:

    |⟨A, B⟩| ≤ σn(A) |⟨An, B⟩| + (σn−1(A) − σn(A)) |⟨An−1, B⟩| + · · · .

The right-hand side is linear with respect to σ(A) ⇒ may assume A = Ak for some k = 1, . . . , n. Analogously for B.

⁵ This proof follows [Grigorieff, R. D. Note on von Neumann's trace inequality. Math. Nachr. 151 (1991), 327–328]. For Mirsky's ingenious proof based on doubly stochastic matrices, see Theorem 8.7.6 in [Horn/Johnson’2013].

Proof of Von Neumann's trace inequality

Let A = UA diag(fk) VA^T, B = UB diag(fℓ) VB^T, and k ≤ ℓ. Then

    ⟨A, B⟩ = trace( (∑_{i=1}^{k} vA,i uA,i^T) (∑_{j=1}^{ℓ} uB,j vB,j^T) )
           = ∑_{i=1}^{k} ∑_{j=1}^{ℓ} trace( vA,i uA,i^T uB,j vB,j^T )
           = ∑_{i=1}^{k} ∑_{j=1}^{ℓ} (uA,i^T uB,j)(vB,j^T vA,i).

By Cauchy-Schwarz,

    |⟨A, B⟩| ≤ ∑_{i=1}^{k} ‖UB^T uA,i‖2 ‖VB^T vA,i‖2 = k,

which completes the proof.

Schatten norms

There are other unitarily invariant matrix norms.⁶
Let s(A) = (σ1, . . . , σn). The p-Schatten norm defined by

    ‖A‖(p) := ‖s(A)‖p

is a matrix norm for any 1 ≤ p ≤ ∞.
p = ∞: spectral norm, p = 2: Frobenius norm, p = 1: nuclear norm.

EFY. What is lim_{p→0+} ‖A‖(p)?

Definition. The dual of a matrix norm ‖·‖ on R^{m×n} is defined by

    ‖A‖^D = max{⟨A, B⟩ : ‖B‖ = 1}.

Lemma. Let p, q ∈ [1, ∞] such that p^{-1} + q^{-1} = 1. Then

    ‖A‖(p)^D = ‖A‖(q).

EFY. Prove this lemma for p = ∞. Hint: Von Neumann's trace inequality.

⁶ Complete characterization via symmetric gauge functions in [Horn/Johnson’2013].

Best low-rank approximation

Consider k < n and let

    Uk := (u1 · · · uk),   Σk := diag(σ1, . . . , σk),   Vk := (v1 · · · vk).

Then

    Tk(A) := Uk Σk Vk^T

has rank at most k. For any unitarily invariant norm ‖·‖:

    ‖Tk(A) − A‖ = ‖diag(0, . . . , 0, σk+1, . . . , σn)‖.

In particular, for the spectral norm and the Frobenius norm:

    ‖A − Tk(A)‖2 = σk+1,   ‖A − Tk(A)‖F = √(σk+1^2 + · · · + σn^2).

Nearly equal if and only if the singular values decay sufficiently quickly.
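A possible MATLAB realization (ours; the function name truncate_svd is not from the slides) of the truncation operator Tk via the economy size SVD:

    % truncate_svd.m: best rank-k approximation T_k(A).
    function Ak = truncate_svd(A, k)
        [U, S, V] = svd(A, 'econ');
        Ak = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';
    end

    % Example (the spectral error equals sigma_{k+1}):
    %   A = hilb(10); k = 3; s = svd(A);
    %   norm(A - truncate_svd(A, k)) - s(k+1)   % ~ 0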

Best low-rank approximation

Theorem (Schmidt-Mirsky). Let A ∈ R^{m×n}. Then

    ‖A − Tk(A)‖ = min{‖A − B‖ : B ∈ R^{m×n} has rank at most k}

holds for any unitarily invariant norm ‖·‖.

Proof⁷ for ‖·‖F: Follows directly from the consequence of Von Neumann's trace inequality.
Proof for ‖·‖2: For any B ∈ R^{m×n} of rank ≤ k, kernel(B) has dimension ≥ n − k. Hence, there exists w ∈ kernel(B) ∩ range(Vk+1) with ‖w‖2 = 1. Then

    ‖A − B‖2^2 ≥ ‖(A − B)w‖2^2 = ‖Aw‖2^2 = ‖A Vk+1 Vk+1^T w‖2^2 = ‖Uk+1 Σk+1 Vk+1^T w‖2^2
               = ∑_{j=1}^{k+1} σj^2 |vj^T w|^2 ≥ σk+1^2 ∑_{j=1}^{k+1} |vj^T w|^2 = σk+1^2.

⁷ See Section 7.4.9 in [Horn/Johnson’2013] for the general case.

Best low-rank approximation

Uniqueness:
- If σk > σk+1, the best rank-k approximation with respect to the Frobenius norm is unique.
- If σk = σk+1, the best rank-k approximation is never unique. For example, I3 has several best rank-two approximations:

      diag(1, 1, 0),   diag(1, 0, 1),   diag(0, 1, 1).

- With respect to the spectral norm, the best rank-k approximation is only unique if σk+1 = 0. For example, diag(2, 1, ε) with 0 < ε < 1 has infinitely many best rank-two approximations:

      diag(2, 1, 0),   diag(2 − ε/2, 1 − ε/2, 0),   diag(2 − ε/3, 1 − ε/3, 0),   . . . .

EFY. Given a symmetric matrix A ∈ R^{n×n} and 1 ≤ k < n, show that there is always a best rank-k approximation that is symmetric. Is every best rank-k approximation (with respect to the Frobenius norm) symmetric? What about the spectral norm?

Approximating the range of a matrix

Aim at finding a matrix Q ∈ R^{m×k} with orthonormal columns such that

    range(Q) ≈ range(A).

I − QQ^T is the orthogonal projector onto range(Q)^⊥ ⇒ aim at minimizing

    ‖(I − QQ^T)A‖ = ‖A − QQ^T A‖

for a unitarily invariant norm ‖·‖. Because rank(QQ^T A) ≤ k,

    ‖A − QQ^T A‖ ≥ ‖A − Tk(A)‖.

Setting Q = Uk one obtains

    Uk Uk^T A = Uk Uk^T U Σ V^T = Uk Σk Vk^T = Tk(A).

⇒ Q = Uk is optimal.

Approximating the range of a matrix

Variation:

    max{‖Q^T A‖F : Q^T Q = Ik}.

Equivalent to

    max{|⟨A A^T, Q Q^T⟩| : Q^T Q = Ik}.

By Von Neumann's trace inequality and the equivalence between eigenvectors of A A^T and left singular vectors of A, the optimal Q is given by Uk.

EFY. When replacing the Frobenius norm by the spectral norm in this formulation, does one obtain the same result?

2. Applications
- Principal Component Analysis
- Matrix Completion
- Some other applications

Principal Component Analysis (PCA)

- Most popular method for dimensionality reduction in statistics, data science, . . .

Consider N independently drawn observations for K random variables X1, . . . , XK. Illustration of N = 100 observations for K = 2:

[Scatter plot of the N = 100 observations in the plane.]

Principal Component Analysis (PCA)

Each of the observations is arranged in a vector xj ∈ R^K with j = 1, . . . , N.
Subtract the sample mean

    x̄ := (1/N)(x1 + · · · + xN).

Data with the mean subtracted:

[Scatter plot of the centered observations.]

Principal Component Analysis (PCA)

Covariance matrix

    C := 1/(N − 1) · ∑_{j=1}^{N} (xj − x̄)(xj − x̄)^T.

The diagonal entry cii estimates the variance of Xi, while the off-diagonal entry cik estimates the covariance between Xi and Xk.
Defining A := [x1 − x̄, . . . , xN − x̄] ∈ R^{K×N}, we can equivalently write

    C = 1/(N − 1) · A A^T.

Principal Component Analysis (PCA)

Reduce data to dimension 1:
Find a linear combination Y1 = w1 X1 + · · · + wK XK with w1, . . . , wK ∈ R and w1^2 + · · · + wK^2 = 1 that captures most of the observed variation.
⇒ Maximize the variance of the new variable Y1.
The corresponding observations of Y1 are given by w^T x1, . . . , w^T xN with sample mean w^T x̄ ⇒ maximization of the variance corresponds to

    max_{w ∈ R^K, ‖w‖2 = 1} ∑_{j=1}^{N} (w^T xj − w^T x̄)^2 = max_{w ∈ R^K, ‖w‖2 = 1} ‖w^T A‖2^2.

Optimal vector w given by the dominant left singular vector of A!
(Corresponds to the eigenvector for the largest eigenvalue of A A^T.)
This is the first principal vector.
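A minimal MATLAB sketch (ours; the toy data and variable names are illustrative) of computing the first principal vector via the SVD of the centered data matrix:

    % First principal vector via the SVD.
    K = 2; N = 100;
    X = randn(K, N);
    X(2,:) = 2*X(1,:) + 0.3*randn(1, N);   % correlated toy data

    xbar = mean(X, 2);                     % sample mean
    A = X - xbar * ones(1, N);             % K-by-N matrix of centered observations

    [U, S, ~] = svd(A, 'econ');
    w = U(:, 1);                           % first principal vector

    var_w = S(1,1)^2 / (N - 1);            % variance captured along w
    proj = w * (w' * A);                   % projection of the data onto span(w)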

Principal Component Analysis (PCA)

Data with first principal vector:

[Scatter plot of the centered data together with the line spanned by the first principal vector.]

Projection of data onto first principal vector:

[Scatter plot of the data projected onto that line.]

Principal Component Analysis (PCA)

- Analogously, first k principal vectors given by the dominant k left singular vectors u1, . . . , uk. Equivalent to best rank-k approximation of the data matrix:

      min_{U^T U = Ik} ‖A − U C^T‖.

- PCA not robust wrt outliers in data. Robust PCA⁸ uses the model

      A ≈ low rank + sparse,

  obtained via the solution of

      min{‖L‖(1) + λ‖S‖1 : A = L + S},

  for a multiplier λ > 0 and ‖S‖1 = 1-norm of vec(S).

⁸ Candès, Emmanuel J.; Li, Xiaodong; Ma, Yi; Wright, John. Robust Principal Component Analysis?

Matrix Completion

Assume that the data matrix is modeled by (low) rank k. Two popular approaches to deal with missing entries:

1. Impute data (insert 0 or row/column means in missing entries). Apply SVD to get the best low-rank approximation B C^T of the imputed data matrix.
2. Find the rank-k matrix B C^T that fits the known entries best; measured in a (weighted) Euclidean norm.

Predict unknown entries from B C^T.
Netflix prize won by a combination of matrix completion with other techniques.
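A rough MATLAB sketch (ours) of approach 1, zero imputation followed by a truncated SVD, on synthetic data:

    % Imputation + truncated SVD for missing entries (illustrative only).
    A = randn(100, 8) * randn(8, 60);        % synthetic rank-8 "data" matrix
    mask = rand(size(A)) < 0.5;              % pattern of observed entries
    k = 8;

    Aimp = A .* mask;                        % impute zeros in the missing entries
    [U, S, V] = svd(Aimp, 'econ');
    Ak = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';  % rank-k model of the imputed matrix

    % Predict the unobserved entries from the rank-k model:
    rel_err = norm((A - Ak) .* ~mask, 'fro') / norm(A .* ~mask, 'fro');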

Applications in Scientific Computing and Engineering

- POD, reduced basis method, reduced-order modelling.
- High-dimensional integration.
- Solution of large-scale matrix equations. Optimal control.
- Solution of high-dimensional PDEs.
- Uncertainty quantification.
- . . .

Several of these will be covered in later parts of the course.

3. Fundamental Tools
- Stability of SVD
- Canonical angles
- Stability of low-rank approximation
- QR decomposition

Stability of SVD

What happens to the SVD if A is perturbed by noise?

Lemma. Let A, E ∈ R^{m×n}. Then

    |σi(A + E) − σi(A)| ≤ ‖E‖2.

Proof. Using the characterization

    σi(A + E) = min{‖B‖2 : rank(A + E − B) ≤ i − 1}

and setting B = A − Ti−1(A) + E, we obtain

    σi(A + E) ≤ ‖B‖2 ≤ ‖A − Ti−1(A)‖2 + ‖E‖2 = σi(A) + ‖E‖2,

which implies the result (the reverse bound follows by exchanging the roles of A and A + E).
Result is also a special case of Weyl's famous inequality.

EFY. Show that the matrix rank is a lower semi-continuous function.

Stability of SVD

Singular values are perfectly well conditioned.
Singular vectors tend to be less stable! Example:

    A = [ 1  0 ;  0  1+ε ],   E = [ 0  ε ;  ε  −ε ].

- A has right singular vectors (1; 0), (0; 1).
- A + E has right singular vectors (1/√2)(1; 1), (1/√2)(1; −1).

To formulate a perturbation bound, need to measure distances between subspaces.

Canonical angles

Let the columns of X, Y ∈ C^{n×k} contain orthonormal bases of two k-dimensional subspaces 𝒳, 𝒴 ⊂ C^n, respectively. Denote the singular values (in reverse order) of X^T Y by

    0 ≤ σ1 ≤ · · · ≤ σk ≤ 1.

We call

    θi(𝒳, 𝒴) := arccos σi,   i = 1, . . . , k,

the canonical angles between 𝒳 and 𝒴. Note: For k = 1, θ1 is the usual angle θ(x, y) between vectors.
Geometric characterization:

    θ1(𝒳, 𝒴) = max_{x ∈ 𝒳, x ≠ 0}  min_{y ∈ 𝒴, y ≠ 0}  θ(x, y).

It follows that θ1(𝒳, 𝒴) = π/2 if and only if 𝒳 ∩ 𝒴^⊥ ≠ {0}.
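A short MATLAB sketch (ours) computing the canonical angles from the singular values of X^T Y and cross-checking the largest one against MATLAB's subspace:

    % Canonical angles between two random k-dimensional subspaces of R^n.
    n = 10; k = 3;
    [X, ~] = qr(randn(n, k), 0);       % orthonormal basis of a random subspace
    [Y, ~] = qr(randn(n, k), 0);

    s = svd(X' * Y);                   % cosines of the canonical angles
    theta = acos(min(max(s, 0), 1));   % clamp against roundoff before arccos
    theta1 = max(theta);               % largest canonical angle

    abs(theta1 - subspace(X, Y))       % subspace() also returns the largest angle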

Canonical angles

Note that X X^T and Y Y^T are the orthogonal projectors onto 𝒳 and 𝒴, respectively.

Lemma (Projector characterization). Define sin Θ(𝒳, 𝒴) = diag(sin θ1(𝒳, 𝒴), . . . , sin θk(𝒳, 𝒴)). Then

    sin θ1(𝒳, 𝒴) = ‖sin Θ(𝒳, 𝒴)‖2 = ‖X X^T − Y Y^T‖2.

Proof. See Theorem I.5.5 in [Stewart/Sun’1990].

Lemma. Let Q ∈ R^{(n−k)×k}, and let 𝒳 = range([Ik ; 0]), 𝒴 = range([Ik ; Q]). Then θ1(𝒳, 𝒴) = arctan ‖Q‖2.

Proof. The columns of [Ik ; Q](I + Q^T Q)^{−1/2} form an orthonormal basis of 𝒴. By definition, this implies that cos θ1(𝒳, 𝒴) is the smallest singular value of (I + Q^T Q)^{−1/2}. By the SVD of Q, it follows that

    cos θ1(𝒳, 𝒴) = 1 / √(1 + ‖Q‖2^2).

Stability of SVD

Theorem (Wedin). Let k < n and assume

    δ := σk(A + E) − σk+1(A) > 0.

Let Uk / Ũk / Vk / Ṽk denote the subspaces spanned by the first k left/right singular vectors of A / A + E. Then

    √( ‖sin Θ(Uk, Ũk)‖F^2 + ‖sin Θ(Vk, Ṽk)‖F^2 ) ≤ √2 ‖E‖F / δ.        (1)

Θ: diagonal matrix containing the canonical angles between two subspaces.

- Perturbation on input multiplied by δ^{-1} ≈ [σk(A) − σk+1(A)]^{-1}.
- Bad news for stability of low-rank approximations?

Stability of low-rank approximation

Lemma. Let A ∈ R^{m×n} have rank ≤ k. Then

    ‖Tk(A + E) − A‖ ≤ C ‖E‖

holds with C = 2 for any unitarily invariant norm ‖·‖. For the Frobenius norm, the constant can be improved to C = (1 + √5)/2.

Proof. Schmidt-Mirsky gives ‖Tk(A + E) − (A + E)‖ ≤ ‖E‖. The triangle inequality implies

    ‖Tk(A + E) − (A + E) + (A + E) − A‖ ≤ 2‖E‖.

The second part is a result by Hackbusch.⁹

Implication for a general matrix A:

    ‖Tk(A + E) − Tk(A)‖ = ‖Tk( Tk(A) + (A − Tk(A)) + E ) − Tk(A)‖
                        ≤ C ‖(A − Tk(A)) + E‖ ≤ C (‖A − Tk(A)‖ + ‖E‖).

⇒ Perturbations on the level of the truncation error pose no danger.

⁹ Hackbusch, W. New estimates for the recursive low-rank truncation of block-structured matrices. Numer. Math. 132 (2016), no. 2, 303–328.

Stability of low-rank approximation: Application

Consider a partitioned matrix

    A = [ A11  A12 ;  A21  A22 ],   Aij ∈ R^{mi×nj},

and a desired rank k ≤ mi, nj. Let ε := ‖Tk(A) − A‖. Then

    Eij := Tk(Aij) − Aij   ⇒   ‖Eij‖ ≤ ε.

By stability of low-rank approximation,

    ‖ Tk( [ Tk(A11)  Tk(A12) ;  Tk(A21)  Tk(A22) ] ) − A ‖F
        = ‖ Tk( A + [ E11  E12 ;  E21  E22 ] ) − A ‖F ≤ C ε,

with C = (3/2)(1 + √5).

This allows, e.g., to perform truncations in parallel.

The QR decomposition

Theorem. Let X ∈ R^{m×n} with m ≥ n. Then there is an orthogonal matrix Q ∈ R^{m×m} such that

    X = QR,   with  R = [ R1 ; 0 ],

that is, R1 ∈ R^{n×n} is an upper triangular matrix.
MATLAB: [Q,R] = qr(X).

Will use the economy size QR decomposition instead: Letting Q1 ∈ R^{m×n} contain the first n columns of Q, one obtains

    X = Q1 R1.

MATLAB: [Q,R] = qr(X,0).

EFY. Let A = (a1, a2, . . . , an) with ai ∈ R^m. Using the QR decomposition, show Hadamard's inequality:

    |det(A)| ≤ ‖a1‖2 · ‖a2‖2 · · · ‖an‖2.

Characterize the set of all m × n matrices A for which equality holds.

QR for recompression

Suppose that

    A = B C^T,   with B ∈ R^{m×K}, C ∈ R^{n×K}.                      (2)

Goal: Compute the best rank-k approximation of A for k < K.
Typical example: Sum of J matrices of rank k:

    A = ∑_{j=1}^{J} Bj Cj^T = (B1 · · · BJ)(C1 · · · CJ)^T,           (3)

with Bj ∈ R^{m×k}, Cj ∈ R^{n×k}, so that (B1 · · · BJ) ∈ R^{m×Jk} and (C1 · · · CJ) ∈ R^{n×Jk}.

Algorithm to recompress A:
1. Compute (economy size) QR decompositions B = QB RB and C = QC RC.
2. Compute the truncated SVD Tk(RB RC^T) = Ûk Σk V̂k^T.
3. Set Uk = QB Ûk, Vk = QC V̂k and return Tk(A) := Uk Σk Vk^T.

Returns the best rank-k approximation of A with O((m + n)K^2) ops.
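A possible MATLAB implementation (ours; the function name recompress is not from the slides) of the three recompression steps:

    % recompress.m: best rank-k approximation of A = B*C' via QR + small SVD.
    function [Uk, Sk, Vk] = recompress(B, C, k)
        [QB, RB] = qr(B, 0);          % economy size QR, O(m K^2)
        [QC, RC] = qr(C, 0);          % economy size QR, O(n K^2)
        [U, S, V] = svd(RB * RC');    % small K-by-K SVD
        Uk = QB * U(:, 1:k);
        Sk = S(1:k, 1:k);
        Vk = QC * V(:, 1:k);
    end

    % Usage: for B, C with K columns and k < K,
    %   [Uk, Sk, Vk] = recompress(B, C, k);
    %   Ak = Uk*Sk*Vk';               % best rank-k approximation of B*C'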

4. Extensions
- Weighted and structured low-rank approximation
- Semi-separable approximation of bivariate functions

Weighted low-rank approximation

If some columns or rows are more important than others (e.g., they are known to be less corrupted by noise), replace the low-rank approximation problem by

    min{‖DR (A − B) DC‖ : B ∈ R^{m×n} has rank at most k}

with suitably chosen positive definite diagonal matrices DR, DC. More general: Given invertible matrices WR ∈ R^{m×m}, WC ∈ R^{n×n}, the weighted low-rank approximation problem consists of

    min{‖WR (A − B) WC‖ : B ∈ R^{m×n} has rank at most k}.

Solution given by

    B = WR^{−1} · Tk(WR A WC) · WC^{−1}.

Proof: EFY.
Remark: Numerically more stable approach via the generalized SVD [Golub/Van Loan’2013].
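A small MATLAB sketch (ours) of the stated solution formula, here with diagonal weights:

    % Weighted best rank-k approximation: B = WR^{-1} * T_k(WR*A*WC) * WC^{-1}.
    m = 8; n = 6; k = 2;
    A = randn(m, n);
    WR = diag(1 + rand(m, 1));         % positive definite diagonal row weights
    WC = diag(1 + rand(n, 1));         % positive definite diagonal column weights

    [U, S, V] = svd(WR * A * WC);
    Tk = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';
    B = WR \ Tk / WC;                  % solution of the weighted problem, rank(B) <= k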

Limit case: Infinite weights

Choosing diagonal weights that converge to ∞ ⇒ rows/columns remain unperturbed.
Case of fixed columns: Consider the block column partition

    A = (A1  A2).

Consider

    min{‖A2 − B2‖ : (A1  B2) has rank at most k}.

No/trivial solution if rank(A1) ≥ k. Assume ℓ := rank(A1) < k and let X1 ∈ R^{m×ℓ} contain an orthonormal basis of range(A1). Then¹⁰

    B2 = X1 X1^T A2 + Tk−ℓ((I − X1 X1^T) A2).

¹⁰ Golub, G. H.; Hoffman, A.; Stewart, G. W. A generalization of the Eckart-Young-Mirsky matrix approximation theorem. Linear Algebra Appl. 88/89 (1987), 317–327.

General weights

Given an mn × mn symmetric positive definite matrix W, define

    ‖A‖W := √( vec(A)^T W vec(A) ).

Equals the Frobenius norm for W = I. General weighted low-rank approximation problem:

    min{‖A − B‖W : B ∈ R^{m×n} has rank at most k}.

EFY. Show that this problem can be rephrased as the previously considered (standard) weighted low-rank approximation problem for the case of a Kronecker product W = W2 ⊗ W1. Hint: Cholesky decomposition.

- For general W no expression in terms of the SVD available ⇒ need to use a general optimization method.
- Similarly, imposing general structures on A (such as nonnegativity, fixing individual entries, ...) usually does not admit solutions in terms of the SVD. Often one ends up with NP-hard problems.

Separable approximation of bivariate functions

Given Ωx ⊂ R^{dx} and Ωy ⊂ R^{dy}, aim at finding a semi-separable approximation of f ∈ L2(Ωx × Ωy) ≅ L2(Ωx) ⊗ L2(Ωy):

    f(x, y) ≈ g1(x) h1(y) + · · · + gr(x) hr(y)

for g1, . . . , gr ∈ L2(Ωx), h1, . . . , hr ∈ L2(Ωy).

Application to higher-dimensional integrals:

    ∫_{Ωx} ∫_{Ωy} f(x, y) dµy(y) dµx(x) ≈ ∑_{i=1}^{r} ∫_{Ωx} ∫_{Ωy} gi(x) hi(y) dµy(y) dµx(x)
                                        = ∑_{i=1}^{r} [ ∫_{Ωx} gi(x) dµx(x) ] [ ∫_{Ωy} hi(y) dµy(y) ]

⇒ semi-separable approximation breaks down the dimensionality of the integrals (for separable measures).

Separable approximation of bivariate functions

Given f ∈ L2(Ωx × Ωy), consider the linear operator

    Lf : L2(Ωx) → L2(Ωy),   w ↦ ∫_{Ωx} w(x) f(x, y) dx.

It admits an SVD

    Lf(·) = ∑_{i=1}^{∞} σi ui ⟨vi, ·⟩

with L2 orthonormal bases {u1, u2, . . .} and {v1, v2, . . .}.
Best semi-separable approximation of f (in L2(Ωx × Ωy)) given by

    fr(x, y) = ∑_{i=1}^{r} σi ui(x) vi(y),

provided that ∑_{i=1}^{∞} σi^2 < ∞ (Hilbert-Schmidt). Moreover,

    ‖f − fr‖_{L2}^2 = σr+1^2 + σr+2^2 + · · · .

Separable and low-rank approximation

Choose discretization points x1, . . . , xm ∈ Ωx, y1, . . . , yn ∈ Ωy. Define

    F = [ f(x1, y1)  f(x1, y2)  · · ·  f(x1, yn)
          f(x2, y1)  f(x2, y2)  · · ·  f(x2, yn)
              ⋮           ⋮                ⋮
          f(xm, y1)  f(xm, y2)  · · ·  f(xm, yn) ]

and Fr analogously from fr, so that

    Fr = ∑_{i=1}^{r} [ gi(x1); . . . ; gi(xm) ] [ hi(y1); . . . ; hi(yn) ]^T.

Fr has rank at most r.

EFY. Prove ‖F − Fr‖F^2 ≤ σr+1^2 + σr+2^2 + · · · .
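A MATLAB sketch (ours; the kernel f(x, y) = 1/(1 + x + y) is just an example) that samples a smooth bivariate function and inspects the singular value decay of F:

    % Sample a smooth bivariate function on a grid and truncate its SVD.
    m = 200; n = 200;
    x = linspace(0, 1, m); y = linspace(0, 1, n);
    [Xg, Yg] = ndgrid(x, y);
    F = 1 ./ (1 + Xg + Yg);                % samples f(x_i, y_j) of a smooth kernel

    s = svd(F);
    semilogy(s, 'o');                      % singular values decay very rapidly

    r = 10;
    [U, S, V] = svd(F, 'econ');
    Fr = U(:,1:r) * S(1:r,1:r) * V(:,1:r)';
    norm(F - Fr, 'fro') / norm(F, 'fro')   % small relative error already for modest r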

5. Subspace Iteration


Subspace iteration and low-rank approximation

Subspace iteration = extension of the power method.

Input: Matrix A ∈ R^{m×n}.
1: Choose a starting matrix X^(0) ∈ R^{m×k} with (X^(0))^T X^(0) = Ik.
2: j = 0.
3: repeat
4:   Set j := j + 1.
5:   Compute Y^(j) := A A^T X^(j−1).
6:   Compute the economy size QR factorization Y^(j) = QR.
7:   Set X^(j) := Q.
8: until convergence is detected

As will soon be seen, converges to a basis of the dominant subspace 𝒰k. Low-rank approximation obtained from

    Tk(A) ≈ X^(j) (X^(j))^T A.
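A possible MATLAB implementation (ours; the function name and the fixed iteration count are not from the slides) of the subspace iteration above:

    % subspace_iteration.m: orthonormal basis approximating range(U_k).
    function X = subspace_iteration(A, k, maxit)
        [m, ~] = size(A);
        [X, ~] = qr(randn(m, k), 0);   % random orthonormal start X^(0)
        for j = 1:maxit
            Y = A * (A' * X);          % Y^(j) = A A^T X^(j-1)
            [X, ~] = qr(Y, 0);         % economy size QR, X^(j) = Q
        end
    end

    % Usage:
    %   X = subspace_iteration(A, k, 10);
    %   Ak = X * (X' * A);             % rank-k approximation T_k(A) ~ X X^T A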

Convergence of subspace iteration

Theorem. Consider the SVD A = U Σ V^T and, for k < n, let 𝒰k = span{u1, . . . , uk}. Assume that σk > σk+1 and θ1(𝒰k, 𝒳^(0)) < π/2. Then the iterates 𝒳^(j) = range(X^(j)) of the subspace iteration satisfy

    tan θ1(𝒰k, 𝒳^(j)) ≤ (σk+1/σk)^{2j} tan θ1(𝒰k, 𝒳^(0)).

Sketch of proof. As angles do not depend on the choice of bases, may omit the QR decompositions ⇒ X^(j) = (A A^T)^j X^(0). By the SVD of A, may set A = Σ and hence 𝒰k = span{e1, . . . , ek}. Partition

    Σ = [ Σ1  0 ;  0  Σ2 ],   X^(0) = [ X1^(0) ;  X2^(0) ],   Σ1, X1^(0) ∈ R^{k×k}.

Result follows from applying the expression for the tangent of θ1.

EFY. Complete the details of the proof.

Numerical experiments

Convergence of subspace iteration for the 100 × 100 Hilbert matrix.
k = 5 ⇒ σk+1/σk = 0.188. Random starting guess.

[Convergence plot over 20 iterations, semilogarithmic scale.
 Black curve: tan θ1(𝒰k, 𝒳^(j)).
 Blue curve: ‖Tk(A) − X^(j)(X^(j))^T A‖2.
 Red curve: ‖A − X^(j)(X^(j))^T A‖2.]

Numerical experiments

Convergence of subspace iteration for a matrix with singular values

    1, 0.99, 0.98, 1/10, 0.99/10, 0.98/10, 1/100, 0.99/100, 0.98/100, . . .

k = 7 ⇒ σk+1/σk = 0.99. Random starting guess.

[Convergence plot over 20 iterations, semilogarithmic scale.
 Black curve: tan θ1(𝒰k, 𝒳^(j)).
 Blue curve: ‖T7(A) − X^(j)(X^(j))^T A‖2.
 Red curve: ‖A − X^(j)(X^(j))^T A‖2.]

Numerical experiments

Observations:
- Low-rank approximation sufficiently good (for most purposes) already after 1 iteration.
- Convergence to the dominant subspace arbitrarily slow, but not relevant.
- Classical, asymptotic convergence analysis insufficient.
- Pre-asymptotic analysis needs to take randomization of the starting guess into account.