Randomized Numerical Linear Algebra

http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html

Acknowledgement: these slides are based on the lecture notes of Prof. Petros Drineas, Prof. Michael W. Mahoney, and Prof. Gunnar Martinsson.

Outline

1 Randomized Numerical Linear Algebra (RandNLA)

2 Approximating Matrix Multiplication

3 Approximate SVD

4 Random Sampling for SVD

5 Single View Algorithm For Matrix Approximation


Why RandNLA?

Randomization and sampling allow us to design provably accurate algorithms for problems that are:

Massive (matrices so large that they cannot be stored at all, or can only be stored in slow memory devices)

Computationally expensive or NP-hard (combinatorial optimization problems such as the Column Subset Selection Problem)


RandNLA: sampling rows/columns

Randomized algorithms
By (carefully) sampling rows/columns of a matrix, we can construct new, smaller matrices that are close to the original matrix (w.r.t. matrix norms) with high probability. Schematically: A B ≈ C R.

By preprocessing the matrix using random projections, we can sample rows/columns much less carefully (uniformly at random) and still get nice bounds with high probability.


RandNLA: sampling rows/columns

Matrix perturbation theory
The resulting smaller matrices behave similarly (in terms of singular values and singular vectors) to the original matrices, thanks to the norm bounds.

Structural results that "decouple" the "randomized" part from the "matrix perturbation" part are important in the analyses of such algorithms.

Interplay
Applications in BIG DATA: data mining, information retrieval, machine learning, bioinformatics, etc.
Numerical Linear Algebra: matrix computations and linear algebra (i.e., perturbation theory)
Theoretical Computer Science: randomized and approximation algorithms


Issues

Computing large SVDs: computational time
On commodity hardware (e.g., a 4GB RAM, dual-core laptop), using MatLab 7.0 (R14), the computation of the SVD of the dense 2,240-by-447,143 matrix A takes about 12 minutes.
Computing this SVD is not a one-liner, since we cannot load the whole matrix in RAM (it runs out of memory in MatLab).
We compute the eigendecomposition of AA^T.

Obviously, running time is a concern.

Machine-precision accuracy is NOT necessary!
Data are noisy.
Approximate singular vectors work well in our settings.


Issues

Selecting good columns that "capture the structure" of the top principal components

A combinatorial optimization problem; hard even for small matrices.
Often called the Column Subset Selection Problem (CSSP).
It is not clear that such columns even exist.

The two issues:
Fast approximation to the top k singular vectors of a matrix, and
Selecting columns that capture the structure of the top k singular vectors
are connected and can be tackled using the same framework.


SVD decomposes a matrix as

A (m × n) ≈ U_k (m × k) · X (k × n)

It is easy to see that X = U_k^T A.

The SVD has strong optimality properties.

U_k: top k left singular vectors. The columns of U_k are linear combinations of up to all columns of A.


The CX decomposition

Mahoney & Drineas (2009) PNAS

A (m × n) ≈ C (m × c) · X (c × n)

Goal: make (some norm of) A − CX small.

C: c columns of A, with c being as close to k as possible.

Moore-Penrose pseudoinverse of A:

AA†A = A,  A†AA† = A†,  (AA†)* = AA†,  (A†A)* = A†A


The CX decomposition

A (m × n) ≈ C (m × c) · X (c × n)

It is easy to prove that the optimal X is X = C†A (with respect to unitarily invariant norms; C† is the Moore-Penrose pseudoinverse of C). Thus, the challenging part is to find good columns of A to include in C.

From a mathematical perspective, this is a combinatorial optimization problem, closely related to the so-called Column Subset Selection Problem (CSSP); the CSSP has been heavily studied in Numerical Linear Algebra.


Objective for the CX decomposition

We would like to get theorems of the following form:

Given an m-by-n matrix A, there exists an efficient algorithm that picks a small number of columns of A such that with reasonable probability:

‖A − CX‖_F = ‖A − CC†A‖_F ≤ (1 + ε) ‖A − A_k‖_F

Best rank-k approximation to A: A_k = U_k U_k^T A


Outline

1 Randomized Numerical Linear Algebra (RandNLA)

2 Approximating Matrix Multiplication

3 Approximate SVD

4 Random Sampling for SVD

5 Single View Algorithm For Matrix Approximation


Approximating Matrix Multiplication

Problem statement

Given an m-by-n matrix A and an n-by-p matrix B, approximate the product AB, or, equivalently, approximate the sum of n rank-one matrices:

AB = Σ_{i=1}^n A^{(i)} B_{(i)}  ∈ R^{m×p}

A^{(i)}: the i-th column of A

B_{(i)}: the i-th row of B

Each term in the summation is a rank-one matrix.


A sampling approach

AB = Σ_{i=1}^n A^{(i)} B_{(i)}  ∈ R^{m×p}

Algorithm
Fix a set of probabilities p_i, i = 1, . . . , n, summing up to 1.
For t = 1, . . . , c, set j_t = i with probability P(j_t = i) = p_i.
(Pick c terms of the sum, with replacement, with respect to the p_i.)
Approximate the product AB by summing the c terms, after scaling.
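A minimal MATLAB sketch of this sampling scheme (the probabilities p are assumed given; the function name approx_matmul is ours):

function P = approx_matmul(A, B, p, c)
% Approximate A*B by sampling c of the n rank-one terms A(:,j)*B(j,:)
% with replacement, using the given probabilities p (p(i) >= 0, sum(p) = 1).
P = zeros(size(A,1), size(B,2));
q = cumsum(p);                             % cumulative probabilities
for t = 1:c
    j = find(rand < q, 1);                 % sample j with P(j = i) = p(i)
    P = P + A(:,j) * B(j,:) / (c * p(j));  % rescaling keeps the estimator unbiased
end
end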


Generate Discrete Distributions

Consider a discrete random variable with possible values c_1 < . . . < c_n. The probability attached to c_i is p_i. Let

q_0 = 0,  q_i = Σ_{j=1}^i p_j.

These are the cumulative probabilities associated with the c_i, i.e., q_i = F(c_i).

To sample from this distribution:
generate a uniform random number U
find K ∈ {1, . . . , n} such that q_{K−1} < U ≤ q_K
set X = c_K
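In MATLAB this inverse-CDF sampling takes a few lines (a sketch; the array vals holding c_1 < . . . < c_n is our name):

q = cumsum(p);         % cumulative probabilities q_i = p_1 + ... + p_i
U = rand;              % uniform draw on (0,1)
K = find(U <= q, 1);   % first index with q_{K-1} < U <= q_K
X = vals(K);           % the sampled value c_K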


With/without replacement

Sampling with replacement:
Each data unit in the population is allowed to appear in the sample more than once.
It is easy to analyze mathematically.

Sampling without replacement:
Each data unit in the population is allowed to appear in the sample no more than once.


A sampling approach

AB = Σ_{i=1}^n A^{(i)} B_{(i)}  ≈  (1/c) Σ_{t=1}^c (1/p_{j_t}) A^{(j_t)} B_{(j_t)}  ∈ R^{m×p}

keeping the terms j_1, j_2, . . . , j_c.


The algorithm (matrix notation)

A (m × n) · B (n × p) ≈ C (m × c) · R (c × p)

Algorithm:
Pick c columns of A to form an m-by-c matrix C and the corresponding c rows of B to form a c-by-p matrix R.
Approximate AB by CR.

Note
We pick the columns and rows with non-uniform probabilities.
We scale the columns (rows) prior to including them in C (R).


The algorithm (matrix notation)

A (m × n) · B (n × p) ≈ C (m × c) · R (c × p)

Algorithm:

Create C and R by performing c i.i.d. trials, with replacement.

For t = 1, . . . , c, pick a column A^{(j_t)} and a row B_{(j_t)} with probability

P(j_t = i) = ‖A^{(i)}‖_2 ‖B_{(i)}‖_2 / Σ_{j=1}^n ‖A^{(j)}‖_2 ‖B_{(j)}‖_2

Include A^{(j_t)}/(c p_{j_t})^{1/2} as a column of C, and B_{(j_t)}/(c p_{j_t})^{1/2} as a row of R.


The algorithm (matrix notation)

We can also use the sampling-matrix notation:
Let S be an n-by-c matrix whose t-th column (for t = 1, . . . , c) has a single non-zero entry, namely

S_{j_t t} = 1/√(c p_{j_t})

Clearly: AB ≈ CR = (AS)(S^T B)

Note: S is sparse (it has exactly c non-zero elements, one per column).
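A sketch of this construction in MATLAB (indices are drawn as before; sparse stores the single non-zero per column):

q = cumsum(p);
j = zeros(c, 1);  w = zeros(c, 1);
for t = 1:c
    j(t) = find(rand < q, 1);        % c i.i.d. column indices
    w(t) = 1 / sqrt(c * p(j(t)));    % rescaling weight for column t
end
S = sparse(j, (1:c)', w, n, c);      % S(j_t, t) = 1/sqrt(c*p_{j_t})
C = A * S;  R = S' * B;              % AB is approximated by C*R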


Simple Lemmas

It is easy to implement this particular sampling in two passes.

The expectation of CR (element-wise) is AB (an unbiased estimator), regardless of the sampling probabilities.

Our particular choice of sampling probabilities minimizes the variance of the estimator (w.r.t. the Frobenius norm of the error AB − CR).


A bound for the Frobenius norm

For the above algorithm,

E[‖AB − CR‖_F] = E[‖AB − ASS^TB‖_F] ≤ (1/√c) ‖A‖_F ‖B‖_F

This is proved using elementary manipulations of expectations.

Measure concentration follows from a martingale argument.

The above bound also implies an upper bound for the spectral norm of the error AB − CR.


Proofs

Let A ∈ R^{m×n} and B ∈ R^{n×p}, 1 ≤ c ≤ n, and p_i ≥ 0 with Σ_i p_i = 1. Then

E[(CR)_{ij}] = (AB)_{ij},   Var[(CR)_{ij}] = (1/c) Σ_{k=1}^n A_{ik}² B_{kj}² / p_k − (1/c) (AB)_{ij}²

Define X_t = (A^{(i_t)} B_{(i_t)} / (c p_{i_t}))_{ij} = A_{i i_t} B_{i_t j} / (c p_{i_t}). Then

E[X_t] = Σ_{k=1}^n p_k A_{ik} B_{kj} / (c p_k) = (1/c) (AB)_{ij}   and   E[X_t²] = Σ_{k=1}^n A_{ik}² B_{kj}² / (c² p_k)

E[(CR)_{ij}] = Σ_{t=1}^c E[X_t] = (AB)_{ij}

Var[X_t] = E[X_t²] − E[X_t]² = Σ_{k=1}^n A_{ik}² B_{kj}² / (c² p_k) − (1/c²) (AB)_{ij}²


Proofs

Lemma:

E[‖AB − CR‖_F²] = Σ_{k=1}^n |A^{(k)}|² |B_{(k)}|² / (c p_k) − (1/c) ‖AB‖_F²

Proof:

E[‖AB − CR‖_F²] = Σ_{i=1}^m Σ_{j=1}^p E[(AB − CR)_{ij}²] = Σ_{i=1}^m Σ_{j=1}^p Var[(CR)_{ij}]
= (1/c) Σ_{k=1}^n (1/p_k) (Σ_i A_{ik}²)(Σ_j B_{kj}²) − (1/c) ‖AB‖_F²
= (1/c) Σ_{k=1}^n (1/p_k) |A^{(k)}|² |B_{(k)}|² − (1/c) ‖AB‖_F²


Proofs

Find the p_k that minimize E[‖AB − CR‖_F²]:

min_{Σ_{k=1}^n p_k = 1}  f(p_1, . . . , p_n) = Σ_{k=1}^n |A^{(k)}|² |B_{(k)}|² / p_k

Introduce L = f(p_1, . . . , p_n) + λ (Σ_{k=1}^n p_k − 1) and solve ∂L/∂p_i = 0.

This gives p_k = |A^{(k)}| |B_{(k)}| / Σ_{k'=1}^n |A^{(k')}| |B_{(k')}|. Then

E[‖AB − CR‖_F²] = (1/c) (Σ_{k=1}^n |A^{(k)}| |B_{(k)}|)² − (1/c) ‖AB‖_F² ≤ (1/c) ‖A‖_F² ‖B‖_F²


Special case: B = A^T

If B = A^T, then the sampling probabilities are

P(j_t = i) = ‖A^{(i)}‖_2² / ‖A‖_F²

Also, R = C^T, and the error bound is:

E[‖AA^T − CC^T‖_F] = E[‖AA^T − ASS^TA^T‖_F] ≤ (1/√c) ‖A‖_F²


Special case: B = A^T

A better spectral norm bound via matrix Chernoff/Bernstein inequalities.

Assumptions:

The spectral norm of A is one (not important, just a normalization).

The Frobenius norm of A is at least 0.2 (not important, simplifies the bounds).

Important: Set

c = Ω( (‖A‖_F² / ε²) ln( ‖A‖_F² / (ε² √δ) ) )

Then: for any 0 < ε < 1, with probability at least 1 − δ,

‖AA^T − CC^T‖_2 = ‖AA^T − ASS^TA^T‖_2 ≤ ε


Special case: B = A^T

Notes:
The constants hidden in the big-Omega notation are small.
Proof: an immediate application of an inequality derived by Oliveira (2010) for sums of random Hermitian matrices.
We need a sufficiently large value of c, otherwise the theorem does not work.


Using a dense S

We approximated the product AB as follows:

AB ≈ CR = (AS)(S^TB)

Recall that S is an n-by-c sparse matrix (one non-zero entry per column). Let's replace S by a dense matrix, the random sign matrix:

S_{ij} = +1/√c  w.p. 1/2,   −1/√c  w.p. 1/2

Let st(A) = ‖A‖_F² / ‖A‖_2² denote the stable rank of A. If

c = Ω( max{st(A), st(B)} ln(m + p) / ε² )

then, with high probability (see Theorem 3.1 in Magen & Zouzias, SODA 2012),

‖AB − CR‖_2 = ‖AB − ASS^TB‖_2 ≤ ε ‖A‖_2 ‖B‖_2
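Drawing this oblivious sketch in MATLAB is immediate (a sketch; sign(randn) yields ±1 with equal probability):

S = sign(randn(n, c)) / sqrt(c);   % dense random sign matrix
C = A * S;  R = S' * B;            % no knowledge of A or B is needed to draw S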


Using a dense S for B = A^T

Approximate the product AA^T (assuming that the spectral norm of A is one):

AA^T ≈ CC^T = (AS)(S^TA^T)

Let S be a dense matrix, the random sign matrix:

S_{ij} = +1/√c  w.p. 1/2,   −1/√c  w.p. 1/2

If

c = Ω( (‖A‖_F² / ε²) ln m )

then, with high probability,

‖AA^T − CC^T‖_2 = ‖AA^T − ASS^TA^T‖_2 ≤ ε


Using a dense S

This matrix-multiplication approximation is oblivious to the input matrices A and B.

Reminiscent of random projections and the Johnson-Lindenstrauss (JL) transform.

Bounds for the Frobenius norm are easier to prove and are very similar to the case where S is just a sampling matrix.

We need a sufficiently large value for c, otherwise the (spectral norm) theorem does not hold.

It holds for arbitrary A and B (not just B = A^T); the sampling-based approach should also be generalizable to arbitrary A and B.


Recap: approximating matrix multiplication

We approximated the product AB as follows:

AB ≈ CR = (AS)(S^TB)

Let S be a sampling matrix (actual columns from A and rows from B are selected): we need to carefully sample columns of A (rows of B) with probabilities that depend on their norms in order to get "good" bounds of the following form:

E[‖AB − CR‖_F] = E[‖AB − ASS^TB‖_F] ≤ (1/√c) ‖A‖_F ‖B‖_F

‖AA^T − CC^T‖_2 = ‖AA^T − ASS^TA^T‖_2 ≤ ε


Outline

1 Randomized Numerical Linear Algebra (RandNLA)

2 Approximating Matrix Multiplication

3 Approximate SVD

4 Random Sampling for SVD

5 Single View Algorithm For Matrix Approximation


Low-Rank Matrix Approximation

Problem statement:
Given: an m × n matrix A and 0 < k < min(m, n).
Goal: compute a rank-k approximation to A.

Fast low-rank matrix approximation is key to the efficiency of superfast direct solvers for integral equations and many large sparse linear systems.

An indispensable tool in mining large data sets.

Randomized algorithms compute an accurate truncated SVD.

Minimum work and communication; exceptionally high success rate.


Low-rank Approximation

We seek to compute a rank-k approximation, with k ≪ n:

A (m × n) ≈ U_k (m × k) · X (k × n)

Eigenvectors corresponding to leading eigenvalues.

Singular Value Decomposition (SVD) / Principal Component Analysis (PCA).

Spanning columns or rows.

The problem being addressed is ubiquitous in applications.


SVD - Properties

Theorem: SVD
If A is a real m-by-n matrix, then there exist

U = [u_1, . . . , u_m] ∈ R^{m×m} and V = [v_1, . . . , v_n] ∈ R^{n×n}

such that U^TU = I, V^TV = I and

U^TAV = diag(σ_1, . . . , σ_p) ∈ R^{m×n},  p = min(m, n),

where σ_1 ≥ σ_2 ≥ . . . ≥ σ_p ≥ 0.

Eckart & Young, 1936
Let the SVD of A ∈ R^{m×n} be given as in Theorem: SVD. If k < r = rank(A) and A_k = Σ_{i=1}^k σ_i u_i v_i^T, then

min_{rank(B)=k} ‖A − B‖_2 = ‖A − A_k‖_2 = σ_{k+1}.


Applications:

Semidefinite programming

Low-rank matrix completion

Robust principal component analysis

Sparse principal component analysis

Sparse inverse covariance matrix estimation

Nearest correlation matrix estimation

High dimensional data reduction

Density functional theory for electronic structure calculation

Nonlinear eigenvalue problems

Fast algorithms for elliptic PDEs: more efficient Fast Multipole Methods, fast direct solvers, construction of special quadratures for corners and edges, etc.


Review of existing methods: dense matrix

For a dense n × n matrix that fits in RAM, excellent algorithms are already part of LAPACK (and incorporated into Matlab, Mathematica, etc.).

Double precision accuracy.

Very stable.

O(n³) asymptotic complexity. Reasonably small constants.

Require extensive random access to the matrix.

When the target rank k is much smaller than n, there also exist O(n²k) methods with similar characteristics (the well-known Golub-Businger method, RRQR by Gu and Eisenstat, etc.).

For small matrices, the state of the art is quite satisfactory. (By "small" we mean something like n ≤ 10000 on today's computers.)


Review of existing methods: structured matrix

If the matrix is large, but can rapidly be applied to a vector (if it is sparse, or sparse in Fourier space, or amenable to the FMM, etc.), so-called Krylov subspace methods often yield excellent accuracy and speed.

Lanczos-based methods:
1 From v ∈ R^n, compute an orthonormal basis V for
K(A, v) = span{v, Av, A²v, · · · , A^{k−1}v}
2 Rayleigh-Ritz: eig(V^TAV) ⇒ Ritz pairs ≈ eigenpairs
3 If "not converged", update v and go to Step 1.

Strengths and weaknesses:
Most efficient in terms of the number of Av (or SpMV) products
Fast and reliable for computing "not too many" eigenpairs
Lower concurrency; cannot be warm-started


“New” challenges in algorithmic design

The existing state-of-the-art methods of numerical linear algebra that we have very briefly outlined were designed for an environment where the matrix fits in RAM and the key to performance was to minimize the number of floating point operations required. Currently, communication is becoming the real bottleneck:

While clock speed is hardly improving at all anymore, the cost of a flop keeps going down rapidly. (Multi-core processors, GPUs, cloud computing, etc.)
The cost of slow storage (hard drives, flash memory, etc.) is also going down rapidly.
Communication costs are decreasing, but not rapidly: moving data from a hard drive; moving data between nodes of a parallel machine (or a cloud computer . . . ). The amount of fast cache memory close to a processor is not improving much. (In fact, it could be said to be shrinking: GPUs, multi-core, etc.)
"Deluge of data", driven by ever cheaper storage and acquisition techniques: web search, data mining in archives of documents or photos, hyper-spectral imagery, social networks, gene arrays, proteomics data, sensor networks, financial transactions, . . .


Review of existing randomized methods

Random column/row selection
Draw at random some columns and suppose that they span the entire column space. If rows are drawn as well, then spectral properties can be estimated. Crude sampling leads to less-than-O(mn) complexity, but is very dangerous.

Sparsification
Zero out the vast majority of the entries of the matrix. Keep a random subset of entries, and boost their magnitude to preserve "something".

Quantization and sparsification
Restrict the entries of the matrix to a small set of values (−1/0/1, for instance).

Randomized subspace iteration
Random sampling + Rayleigh-Ritz procedure


Linear Time SVD Algorithm

Input: m-by-n matrix A, 1 ≤ k ≤ c ≤ n, probabilities {p_i}_{i=1}^n such that p_i ≥ 0 and Σ_i p_i = 1

Sampling:
For t = 1 to c,
pick i_t ∈ {1, . . . , n} with P(i_t = α) = p_α.
Set C^{(t)} = A^{(i_t)} / √(c p_{i_t})

Compute C^TC and its SVD, say C^TC = Σ_{t=1}^c σ_t(C)² y^t (y^t)^T

Compute h^t = C y^t / σ_t(C) for t = 1, . . . , k.
(Note: if A = UΣV^T, then C = HΣ_C Y^T)

Return H_k, where H_k^{(t)} = h^t, and σ_t(C) for t = 1, . . . , k

The left singular vectors of C are, with high probability, approximations to the left singular vectors of A.
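A direct MATLAB transcription of this algorithm might look as follows (a sketch; the function name is ours, and we assume the top k singular values of C are positive):

function [Hk, sigma] = linear_time_svd(A, k, c, p)
% Sample and rescale c columns of A, then obtain approximate left
% singular vectors of A from the small c-by-c eigenproblem for C'*C.
q = cumsum(p);
C = zeros(size(A,1), c);
for t = 1:c
    it = find(rand < q, 1);
    C(:,t) = A(:,it) / sqrt(c * p(it));
end
[Y, D] = eig(C' * C);                 % C'C = Y * diag(sigma(C).^2) * Y'
[d, idx] = sort(diag(D), 'descend');
sigma = sqrt(max(d(1:k), 0));         % top-k singular values of C
Hk = C * Y(:, idx(1:k)) ./ sigma';    % h^t = C*y^t / sigma_t(C)  (implicit expansion)
end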


Extract approximate SVD

Given A, let X be an approximation of the left singular vectors of A corresponding to the k largest singular values.

method = 2;        % = 1 or 2
Y = (X'*A)';       % Y = A'*X;
switch method
  case 1
    [V,S,W] = svd(Y,0);
    U = X*W;
  case 2
    [V,R] = qr(Y,0);
    [W,S,Z] = svd(R');
    U = X*W; V = V*Z;
end

The triple (U, S, V) is an approximation of the k-dominant SVD.


Main theoretical results

Let H_k be constructed by the Linear Time SVD algorithm. Then (for c ≥ 4k/ε², as shown below)

E[‖A − H_kH_k^TA‖_F²] ≤ ‖A − A_k‖_F² + ε ‖A‖_F²

Exact SVD of A = UΣV^T, with A_k = U_kΣ_kV_k^T = U_kU_k^TA = AV_kV_k^T.

min_{rank(B)≤k} ‖A − B‖_2 = ‖A − A_k‖_2 = σ_{k+1}(A)

min_{rank(B)≤k} ‖A − B‖_F² = ‖A − A_k‖_F² = Σ_{t=k+1}^r σ_t²(A)

Perturbation theory of matrices:

max_{1≤t≤n} |σ_t(A + E) − σ_t(A)| ≤ ‖E‖_2,   Σ_{k=1}^n (σ_k(A + E) − σ_k(A))² ≤ ‖E‖_F²

(the latter is known as the Hoffman-Wielandt inequality).

Exact SVD of C = HΣ_C Y^T.


Proofs

Lemma:

‖A − H_kH_k^TA‖_F² ≤ ‖A − A_k‖_F² + 2√k ‖AA^T − CC^T‖_F

‖A − H_kH_k^TA‖_2² ≤ ‖A − A_k‖_2² + 2 ‖AA^T − CC^T‖_2

Proof of the first inequality. Using ‖X‖_F² = Tr(X^TX) and Tr(X + Y) = Tr(X) + Tr(Y):

‖A − H_kH_k^TA‖_F² = Tr((A − H_kH_k^TA)^T(A − H_kH_k^TA))
= Tr(A^TA) − Tr(A^TH_kH_k^TA) = ‖A‖_F² − ‖A^TH_k‖_F²

Using the Cauchy-Schwarz inequality:

|‖A^TH_k‖_F² − Σ_{t=1}^k σ_t²(C)| ≤ √k ( Σ_{t=1}^k (|A^Th^t|² − σ_t²(C))² )^{1/2}
= √k ( Σ_{t=1}^k (|A^Th^t|² − |C^Th^t|²)² )^{1/2}
= √k ( Σ_{t=1}^k ((h^t)^T(AA^T − CC^T)h^t)² )^{1/2}
≤ √k ‖AA^T − CC^T‖_F


Proofs

By the Hoffman-Wielandt inequality:

|Σ_{t=1}^k σ_t²(C) − Σ_{t=1}^k σ_t²(A)| ≤ √k ( Σ_{t=1}^k (σ_t²(C) − σ_t²(A))² )^{1/2}
= √k ( Σ_{t=1}^k (σ_t(CC^T) − σ_t(AA^T))² )^{1/2}
≤ √k ( Σ_{t=1}^m (σ_t(CC^T) − σ_t(AA^T))² )^{1/2}
≤ √k ‖CC^T − AA^T‖_F

Therefore

|‖A^TH_k‖_F² − Σ_{t=1}^k σ_t²(A)| ≤ 2√k ‖AA^T − CC^T‖_F


Proofs

The matrix-multiplication approximation gives

E[‖AB − CR‖_F²] ≤ (1/c) ‖A‖_F² ‖B‖_F²

which yields

2√k E[‖AA^T − CC^T‖_F] ≤ (4k/c)^{1/2} ‖A‖_F²

‖A^TH_k‖_F² ≥ Σ_{t=1}^k σ_t²(A) − 2√k ‖AA^T − CC^T‖_F

If c ≥ 4k/ε², then

E[‖A − H_kH_k^TA‖_F²] ≤ ‖A‖_F² − Σ_{t=1}^k σ_t²(A) + 2√k E[‖AA^T − CC^T‖_F]
≤ ‖A − A_k‖_F² + ε ‖A‖_F²


Back to the CX decomposition

We would like to get theorems of the following form

Given an m-by-n matrix A, there exists an efficient algorithm that picks a small number of columns of A such that with reasonable probability:

‖A − CX‖_F = ‖A − CC†A‖_F ≤ (1 + ε) ‖A − A_k‖_F

Let's start with a simpler, weaker result, connecting the spectral norm of A − CX to matrix multiplication.


Approximating singular vectors

Sample c (= 140) columns of the original matrix A and rescale them appropriately to form a 512-by-c matrix C. Show that A − CX is "small".

(C† is the pseudoinverse of C and X = C†A)



Approximating singular vectors

The fact that AA^T − CC^T is small will imply that A − CX is small as well.


Proof (spectral norm)

Using the triangle inequality and properties of norms,

‖A − CC†A‖_2² = ‖(I − CC†)A‖_2²
= ‖(I − CC†)AA^T(I − CC†)^T‖_2
= ‖(I − CC†)(AA^T − CC^T)(I − CC†)^T‖_2
≤ ‖AA^T − CC^T‖_2

Here we used that I − CC† is a projection matrix and that (I − CC†)C = 0.


Proof (spectral norm)

Assume that our sampling is done in c i.i.d. trials and that the sampling probabilities are:

P(j_t = i) = ‖A^{(i)}‖_2² / ‖A‖_F²

We can use our matrix-multiplication result. (We upper-bound the spectral norm by the Frobenius norm to avoid concerns about c, namely whether c exceeds the threshold necessitated by the theory.)

E[‖A − CC†A‖_2] ≤ (E[‖AA^T − CC^T‖_F])^{1/2} ≤ (1/c^{1/4}) ‖A‖_F


Is this a good bound?

E[‖A − CC†A‖_2] ≤ (E[‖AA^T − CC^T‖_F])^{1/2} ≤ (1/c^{1/4}) ‖A‖_F

Problem 1: If c = n we do not get zero error. That's because of sampling with replacement. (We know how to analyze uniform sampling without replacement, but we have no bounds on non-uniform sampling without replacement.)

Problem 2: If A had rank exactly k, we would like a column-selection procedure that drives the error down to zero when c = k. This can be done deterministically simply by selecting k linearly independent columns.

Problem 3: If A had numerical rank k, we would like a bound that depends on the norm of A − A_k and not on the norm of A. Such deterministic bounds exist when c = k and depend on

(k(n − k))^{1/2} ‖A − A_k‖_2


Relative-error Frobenius norm bounds

Given an m-by-n matrix A, there exists an O(mn²) algorithm that picks

O((k/ε²) ln(k/ε²)) columns of A

such that with probability at least 0.9,

‖A − CX‖_F = ‖A − CC†A‖_F ≤ (1 + ε) ‖A − A_k‖_F


The algorithm

Input: m-by-n matrix A; 0 < ε < 0.5, the desired accuracy.
Output: C, the matrix consisting of the selected columns.

Sampling algorithm
Compute probabilities p_j summing to 1.
Let c = O((k/ε²) ln(k/ε²)).
In c i.i.d. trials pick columns of A, where in each trial the j-th column of A is picked with probability p_j.
Let C be the matrix consisting of the chosen columns.

Note: there is no rescaling of the columns of C in this algorithm; however, since our error matrix is A − CX = A − CC†A, rescaling the columns of C (as we did in our matrix-multiplication algorithms) does not change A − CX = A − CC†A.
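A sketch of this sampling step in MATLAB (assuming the probabilities p have already been computed, e.g. from the leverage scores discussed on the following slides):

q = cumsum(p);
idx = zeros(1, c);
for t = 1:c, idx(t) = find(rand < q, 1); end
C = A(:, idx);       % selected columns; no rescaling needed
X = pinv(C) * A;     % optimal X = C^+ * A; the error is A - C*X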


Towards a relative error bound

Structural result (deterministic):

‖A − CX‖_F = ‖A − CC†A‖_F ≤ ‖A − A_k‖_F + ‖(A − A_k)S(V_k^TS)†‖_F

This holds for any n-by-c matrix S such that C = AS, as long as the k-by-c matrix V_k^TS has full rank (equal to k).

The proof of the structural result critically uses the fact that X = C†A is the argmin of the error A − CX for any unitarily invariant norm.

Variants of this structural result have appeared in various papers.


The rank of V_k^TS

Structural result (deterministic):

‖A − CX‖_F = ‖A − CC†A‖_F ≤ ‖A − A_k‖_F + ‖(A − A_k)S(V_k^TS)†‖_F

This holds for any n-by-c matrix S such that C = AS, as long as the k-by-c matrix V_k^TS has full rank (equal to k).

Let S be a sampling and rescaling matrix, where the sampling probabilities are the leverage scores: our matrix-multiplication results (and the fact that the squared Frobenius norm of V_k is equal to k) guarantee that, for our choice of c (with constant probability):

‖V_k^TV_k − V_k^TSS^TV_k‖_2 = ‖I_k − V_k^TSS^TV_k‖_2 ≤ ε


The rank of V_k^TS

From matrix perturbation theory, if

‖V_k^TV_k − V_k^TSS^TV_k‖_2 = ‖I_k − V_k^TSS^TV_k‖_2 ≤ ε

it follows that all singular values σ_i of V_k^TS satisfy

√(1 − ε) ≤ σ_i(V_k^TS) ≤ √(1 + ε)

By choosing ε small enough, we can guarantee that V_k^TS has full rank (with constant probability).

‖(A − A_k)S(V_k^TS)†‖_F ≤ ‖(A − A_k)S‖_F ‖(V_k^TS)†‖_2 = σ_min^{−1}(V_k^TS) ‖(A − A_k)S‖_F


Bounding the second term

To conclude:

‖(A − A_k)S(V_k^TS)†‖_F ≤ ‖(A − A_k)S‖_F ‖(V_k^TS)†‖_2 = σ_min^{−1}(V_k^TS) ‖(A − A_k)S‖_F

We already have a bound for all singular values of V_k^TS (go back two slides).

Prove, using our sampling and rescaling, that

E[‖(A − A_k)S‖_F²] = ‖A − A_k‖_F²

Collecting, we get a (2 + ε) constant-factor approximation. A more careful (albeit longer) analysis can improve the result to a (1 + ε) relative-error approximation.


Using a dense S

Our proof would also work if, instead of the sampling matrix S, we used, for example, the dense random sign matrix S:

S_{ij} = +1/√c  w.p. 1/2,   −1/√c  w.p. 1/2

The intuition is clear: the most critical part of the proof is based on approximate matrix multiplication to bound the singular values of V_k^TS.

This also works when S is a dense matrix.


Selecting fewer columns

Problem:
How many columns do we need to include in the matrix C in order to get relative-error approximations?

Recall: with O((k/ε²) log(k/ε²)) columns, we get (subject to a failure probability)

‖A − CC†A‖_F ≤ (1 + ε) ‖A − A_k‖_F

Deshpande & Rademacher (FOCS '10): with exactly k columns, we get

‖A − CC†A‖_F ≤ √k ‖A − A_k‖_F

What about the range between k and O(k log k)?


Selecting fewer columns

(Boutsidis, Drineas, & Magdon-Ismail, FOCS 2011)

Question:
What about the range between k and O(k log k)?

Answer:
A relative-error bound is possible by selecting c = 3k/ε columns.

Technical breakthrough:
A combination of sampling strategies with a novel approach to column selection, inspired by the work of Batson, Spielman, & Srivastava (STOC '09) on graph sparsifiers.

The running time is O((mnk + nk³)ε^{−1}). Simplicity is gone . . .


Outline

1 Randomized Numerical Linear Algebra (RandNLA)

2 Approximating Matrix Multiplication

3 Approximate SVD

4 Random Sampling for SVD

5 Single View Algorithm For Matrix Approximation


Range finding problem

Given an m × n matrix A and an integer k < min(m, n), find an orthonormal m × k matrix Q such that

A ≈ QQ^TA

Solving the primitive problem via randomized sampling: intuition
Draw random vectors r_1, r_2, . . . , r_k ∈ R^n.
Form "sample" vectors y_1 = Ar_1, y_2 = Ar_2, . . . , y_k = Ar_k ∈ R^m.
Form orthonormal vectors q_1, q_2, . . . , q_k ∈ R^m such that

span{q_1, q_2, . . . , q_k} = span{y_1, y_2, . . . , y_k}

Almost always correct if A has exact rank k


Low-Rank Approximation: Randomized Sampling

Algorithm RandSam0
Input: m × n matrix A, integers k, p.
Draw a random n × (k + p) matrix Ω.
Compute the QR factorization QR = AΩ.
Compute the SVD: Q^TA = UΣV^T.
Truncate the SVD: U_kΣ_kV_k^T.
Output: B = (QU_k)Σ_kV_k^T

Easy to implement.

Very efficient computation.

Minimum communication.
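In MATLAB, RandSam0 is only a few lines (a sketch; the function name is ours, and randn provides the random test matrix):

function [U, S, V] = rand_sam0(A, k, p)
% Basic randomized SVD: returns a rank-k approximation A ~ U*S*V'.
Omega = randn(size(A,2), k + p);   % random n-by-(k+p) test matrix
[Q, ~] = qr(A * Omega, 0);         % orthonormal basis for the sampled range
[W, S, V] = svd(Q' * A, 'econ');   % SVD of the small (k+p)-by-n matrix
U = Q * W;                         % lift the left factor back to m dimensions
U = U(:, 1:k); S = S(1:k, 1:k); V = V(:, 1:k);   % truncate to rank k
end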


Error for Gaussian test matrices

Ref: Halko, Martinsson, Tropp, 2009 & 2011; Martinsson, Rokhlin, Tygert (2006)

Let A denote an m × n matrix with singular values {σ_j}_{j=1}^{min(m,n)}.

Let k denote a target rank and let p denote an over-sampling parameter.

Let Ω denote an n × (k + p) Gaussian matrix.

Let Q denote the m × (k + p) matrix Q = orth(AΩ). If p ≥ 4, then

‖A − QQ*A‖_2 ≤ (1 + 6 √((k + p) p log p)) σ_{k+1} + 3 √(k + p) (Σ_{j>k} σ_j²)^{1/2}

except with probability at most 3p^{−p}.


Improved Randomized Sampling

Algorithm RandSam1
Input: m × n matrix A, integers k, p, c.
Draw a random n × (k + p + c) matrix Ω.
Compute the QR factorization QR = AΩ.
Compute the SVD: Q^TA = UΣV^T.
Truncate the SVD: U_kΣ_kV_k^T.
Output: B = (QU_k)Σ_kV_k^T

Only change from RandSam0: p becomes p + c.

Smallest modification of any algorithm.

c allows a drastically different error bound; it controls the accuracy.

p remains in control of the failure chance.


Randomized Power Method

Algorithm RandSam2
Input: m × n matrix A, integers k, p, c, q.
Draw a random n × (k + p + c) matrix Ω.
Compute the QR factorization QR = (AA^T)^q AΩ.
Compute the SVD: Q^TA = UΣV^T.
Truncate the SVD: U_kΣ_kV_k^T.
Output: B = (QU_k)Σ_kV_k^T

The QR factorization needs to be done carefully for numerical accuracy (see the sketch below).

With q = 0 this is the previous algorithm, but q = 1 is far more accurate.

Should converge faster when the singular values do not decay very fast.
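In finite precision, forming (AA^T)^q AΩ explicitly washes out the small singular values; one common remedy is to re-orthonormalize between the multiplications by A and A^T. A MATLAB sketch of that variant (our interpretation of "done carefully"):

Y = A * Omega;                       % Omega: n-by-(k+p+c) random matrix
[Q, ~] = qr(Y, 0);
for i = 1:q
    [W, ~] = qr(A' * Q, 0);          % orthonormalize after each product so that
    [Q, ~] = qr(A * W, 0);           % rounding does not destroy small singular values
end
[Uhat, S, V] = svd(Q' * A, 'econ');  % then truncate to rank k as before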


Example 1

We consider a 1000 × 1000 matrix A whose singular values are shown in the accompanying figure (not reproduced here).

A is a discrete approximation of a certain compact integral operator, normalized so that ‖A‖ = 1. Curiously, the nature of A is in a strong sense irrelevant: the error distribution depends only on {σ_j}_{j=1}^{min(m,n)}.


Example 2

We consider a 1000 × 1000 matrix A whose singular values are shown in the accompanying figure (not reproduced here).

A is a discrete approximation of a certain compact integral operator, normalized so that ‖A‖ = 1. Curiously, the nature of A is in a strong sense irrelevant: the error distribution depends only on {σ_j}_{j=1}^{min(m,n)}.


Example 3

The matrix A being analyzed is a 9025 × 9025 matrix arising in a diffusion geometry approach to image processing.

To be precise, A is a graph Laplacian on the manifold of 3×3 patches.


The pink lines illustrate the performance of the basic random sampling scheme (figure not reproduced here). The errors for q = 0 are huge, and the estimated eigenvalues are much too small.


Outline

1 Randomized Numerical Linear Algebra (RandNLA)

2 Approximating Matrix Multiplication

3 Approximate SVD

4 Random Sampling for SVD

5 Single View Algorithm For Matrix Approximation


Low-rank reconstruction

Given A ∈ R^{m×n} and a target rank r, select sketch sizes k and ℓ, and draw random matrices Ω ∈ R^{n×k} and Ψ ∈ R^{ℓ×m}. Compute

Y = AΩ,  W = ΨA.

Then an approximation Â of A is computed:

Form an orthogonal-triangular factorization Y = QR, where Q ∈ R^{m×k}.

Solve a least-squares problem to obtain X = (ΨQ)†W ∈ R^{k×n}.

Construct the rank-k approximation Â = QX.

Suppose k = 2r + 1 and ℓ = 4r + 2; then

E ‖A − Â‖_F ≤ 2 min_{rank(Z)≤r} ‖A − Z‖_F
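A minimal MATLAB sketch of this single-view reconstruction (Gaussian test matrices are one valid choice; k = 2*r+1 and ell = 4*r+2 follow the bound above):

Omega = randn(n, k);  Psi = randn(ell, m);  % random test matrices
Y = A * Omega;        W = Psi * A;          % the only accesses to A itself
[Q, ~] = qr(Y, 0);                          % Q is m-by-k with orthonormal columns
X = (Psi * Q) \ W;                          % least squares: X = (Psi*Q)^+ * W
Ahat = Q * X;                               % rank-k approximation of A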


Linear update of A

Suppose that A is sent as a sequence of additive updates:

A = H_1 + H_2 + H_3 + · · ·

Then one computes

Y ← Y + HΩ,  W ← W + ΨH

Suppose instead that A is sent as a sequence of linear updates:

A ← θA + ηH

Then one computes

Y ← θY + ηHΩ,  W ← θW + ηΨH
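In code, the sketches absorb each update without ever forming or storing A (a sketch, continuing the notation above):

% additive update A <- A + H:
Y = Y + H * Omega;   W = W + Psi * H;
% linear update A <- theta*A + eta*H:
Y = theta * Y + eta * (H * Omega);
W = theta * W + eta * (Psi * H);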


Intuition

Suppose

A ≈ QQ*A.

We want to form the rank-k approximation Q(Q*A), but we cannot compute the factor Q*A without revisiting the target matrix A.

Note that

W = Ψ(QQ*A) + Ψ(A − QQ*A) ≈ (ΨQ)(Q*A)

The construction of X:

X = (ΨQ)†W ≈ Q*A

Hence

Â = QX ≈ QQ*A ≈ A


Projection onto a Convex Set.

Let C be a closed and convex set. Define the projection:

Π_C(M) = argmin_X ‖X − M‖_F²  s.t. X ∈ C

Suppose A ∈ C. Let A_in be an initial approximation of A; then

‖A − Π_C(A_in)‖_F ≤ ‖A − A_in‖_F

Conjugate symmetric approximation

H_n = {X ∈ C^{n×n} | X = X*}

The projection is

Π_{H_n}(M) = (1/2)(M + M*)


Conjugate Symmetric Approximation.

Let A ∈ H_n and let Â = QX. Then

Π_{H_n}(Â) = (1/2)(QX + X*Q*) = (1/2) [Q, X*] [0 I; I 0] [Q, X*]*

Let [Q, X*] = U[T_1, T_2] (an orthogonal factorization of the concatenation). Then, with

S = (1/2)(T_1T_2* + T_2T_1*),

construct

Â_sym = USU*


PSD Approximation

Let A be positive semidefinite (PSD) and let Â = QX. With S and U as above, form the eigenvalue decomposition

S = VDV*

and compute

Â_sym = (UV)D(UV)*

Construct the PSD approximation

Â_+ = Π_{H_n^+}(Â) = (UV)D_+(UV)*

where D_+ retains only the nonnegative eigenvalues in D.