Randomized Numerical Linear Algebrabicmr.pku.edu.cn/~wenzw/bigdata/lect-rand.pdfRandomized...

Randomized Numerical Linear Algebra

http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html

Acknowledgement: this slides is based on Prof. Petros Drineas and Prof. Michael W.Mahoney’s, Prof. Gunnar Martinsson lecture notes

1/80

http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html

2/80

Outline

1 Randomized Numerical Linear Algebra (RandNLA)

2 Approximating Matrix Multiplication

3 Approximate SVD

4 Random Sampling for SVD

5 Single View Algorithm For Matrix Approximation

3/80

Why RandNLA?

Randomization and sampling allow us to design provably accuratealgorithms for problems that are:

Massive(matrices so large that can not be stored at all, or can only bestored in slow memory devices)

Computationally expensive or NP-hard(combinatorial optimization problems such as the ColumnSubset Selection Problem)

4/80

RandNLA: sampling rows/columns

Randomized algorithmsBy (carefully) sampling rows/columns of a matrix, we canconstruct new, smaller matrices that are close to the originalmatrix (w.r.t. matrix norms) with high probability. A

B

≈ C

R

By preprocessing the matrix using random projections, we cansample rows/columns much less carefully (uniformly at random)and still get nice bounds with high probability.

5/80

RandNLA: sampling rows/columns

Matrix perturbation theoryThe resulting smaller matrices behave similarly (in terms ofsingular values and singular vectors) to the original matricesthanks to the norm bounds.

Structural results that “decouple” the “randomized” part fromthe “matrix perturbation” part are important in the analyses ofsuch algorithms.

InterplayApplications in BIG DATA: (Data Mining, Information Retrieval,Machine Learning, Bioinformatics, etc.)Numerical Linear Algebra: Matrix computations and linearalgebra (ie., perturbation theory)Theoretical Computer Science: Randomized and approximationalgorithms

6/80

Issues

Computing large SVDs: computational timeIn commodity hardware (e.g., a 4GB RAM, dual-core laptop),using MatLab 7.0 (R14), the computation of the SVD of the dense2,240-by-447,143 matrix A takes about 12 minutes.Computing this SVD is not a one-liner, since we can not load thewhole matrix in RAM (runs out-of-memory in MatLab).We compute the eigendecomposition of AAT .

Obviously, running time is a concern.

Machine-precision accuracy is NOT necessary!Data are noisy.Approximate singular vectors work well in our settings.

7/80

Issues

Selecting good columns that “capture the structure” of the topprincipal components

Combinatorial optimization problem; hard even for small matrices.Often called the Column Subset Selection Problem (CSSP).Not clear that such columns even exist.

The two issues:Fast approximation to the top k singular vectors of a matrix, and

Selecting columns that capture the structure of the top k singularvectors

are connected and can be tackled using the same framework

8/80

SVD decomposes a matrix as

m× n

A

≈

m× k

Uk

k × n

X

It is easy to see that X = UkA

The SVD has strong optimality properties

Uk: Top k left singular vectors. The columns of Uk are linearcombinations of up to all columns of A.

9/80

The CX decomposition

Mahoney & Drineas (2009) PNAS

m× n

A

≈

m× c

C

c× n

X

Goal: make (some norm) of A− CX small.

C: c columns of A, with c being as close to k as possible

Moore-Penrose pseudoinverse of A:

AA†A = A,A†AA† = A†, (AA†)∗ = AA†, (A†A)∗ = A†A

10/80

The CX decomposition

m× n

A

≈

m× c

C

c× n

X

Easy to prove that optimal X = C†A. (with respect to unitarilyinvariant norms; C† is the Moore-Penrose pseudoinverse of C).Thus, the challenging part is to find good columns of A to includein C.

From a mathematical perspective, this is a combinatorialoptimization problem, closely related to the so-called ColumnSubset Selection Problem (CSSP); the CSSP has been heavilystudied in Numerical Linear Algebra.

11/80

objective for the CX decomposition

We would like to get theorems of the following form

Given an m-by-n matrix A, there exists an efficient algorithm thatpicks a small number of columns of A such that with reasonableprobability:

‖A− CX‖F = ‖A− CC†A‖F ≤ (1 + ε)‖A− Ak‖F

Best rank-k approximation to A: Ak = UkU>k A

12/80

Outline



3 Approximate SVD



13/80

Approximating Matrix Multiplication

Problem Statement

Given an m-by-n matrix A and an n-by-p matrix B,approximate the product AB, Or equvialently,Approximate the sum of n rank-one matrices

AB =

n∑i=1

A(i)

( B(i))

︸︷︷︸∈Rm×p

A(i) the i-th column of A

B(i) the i-th row of B

Each term in the summation is a rank-one matrix

14/80

A sampling approach

AB =

n∑i=1

A(i)

( B(i))

︸︷︷︸∈Rm×p

AlgorithmFix a set of probabilities pi, i = 1, . . . , n, summing up to 1.

For t = 1, . . . , c,set jt = i, where P(jt = i) = pi.

(Pick c terms of the sum, with replacement, with respect to thepi.)

Approximate the product AB by summing the c terms, afterscaling.

15/80

Generate Discrete Distributions

Consider a discrete random variable with possible valuesc1 < . . . < cn. The probability attached to ci is pi. Let

q0 = 0, qi =

i∑j=1

pj.

They are the cumulative probabilities associated with ci, i.e.,qi = F(ci).

To sample this distributiongenerate a uniform U

find K ∈ 1, . . . , n such that qK−1 < U < qK

set X = cK

16/80

With/without replacement

Sampling with replacement:Each data unit in the population is allowed to appear in thesample more than once.It is easy to analyze mathematically.

Sampling without replacement:Each data unit in the population is allowed to appear in thesample no more than once.

17/80

A sampling approach

AB =

n∑i=1

A(i)

( B(i))

︸︷︷︸∈Rm×p

≈ 1c

c∑t=1

1pjt

A(jt)

( B(jt))

︸︷︷︸∈Rm×p

Keeping the terms j1, j2, . . . , jc

18/80

The algorithm (matrix notation)

m× n

A

n× p

B

≈

m× c

C

c× p

R

Algorithm:Pick c columns of A to form an m-by-c matrix C and thecorresponding c rows of B to form a c-by-p matrix R.

Approximate AB by CR.Note

We pick the columns and rows with non-uniform probabilities.We scale the columns (rows) prior to including them in C(R).

19/80


m× n

A

n× p

B

≈

m× c

C

c× p

R

Algorithm:

Create C and R by performing c i.i.d. trials, with replacement.

For t = 1, . . . , c, pick a column A(jt) and a row B(jt) with probability

P(jt = i) =‖A(i)‖2‖B(i)‖2∑nj=1 ‖A(j)‖2‖B(j)‖2

Include A(jt)/(cpjt)1/2 as a column of C, and B(jt)/(cpjt)

1/2 as arow of R

20/80


We can also use the sampling matrix notation:Let S be an n-by-c matrix whose t-th column (for t = 1, . . . , c) has asingle non-zero entry, namely

Sjtt =1√cpjt

Clearly:AB ≈ CR = (AS)(STB)

Note: S is sparse (has exactly c non-zero elements, one per column).

21/80

Simple Lemmas

It is easy to implement this particular sampling in two passes.

The expectation of CR (element-wise) is AB (unbiasedestimator), regardless of the sampling probabilities.

Our particular choice of sampling probabilities minimizes thevariance of the estimator (w.r.t. the Frobenius norm of the errorAB-CR).

22/80

A bound for the Frobenius norm

For the above algorithm,

E[‖AB− CR‖F] = E[‖AB− ASSTB‖F] ≤ 1c‖A‖F‖B‖F

prove using elementary manipulations of expectation

Measure concentration follows from a martingale argument.

The above bound also implies an upper bound for the spectralnorm of the error AB− CR.

23/80

Proofs

Let A ∈ Rm×n and B ∈ Rp×p, 1 ≤ c ≤ n, and pi ≥ 0,∑

i pi = 1. Then

E[(CR)ij] = (AB)ij, Var[(CR)ij] =1c

n∑i=1

A2ikB2

kj

pk− 1

c(AB)2

ij

Define Xt =

(A(it)B(it)

cpit

)ij

=Aiit Bit j

cpit. Then

E[Xt] =

n∑k=1

pkAikBkj

cpk=

1c

(AB)ij and E[X2t ] =

n∑k=1

A2ikB2

kj

c2pk

E[(CR)ij] =∑c

t=1 E[Xt] = (AB)ij

Var[Xt] = E[X2t ]− E[Xt]

2 =

n∑k=1

A2ikB2

kj

c2pk− 1

c2 (AB)2ij

24/80

Proofs

Lemma:

E[‖AB− CR‖2F] =

n∑k=1

|A(k)|2|B(k)|2

cpk− 1

c‖AB‖2

F

Proof:


n∑i=1

p∑j=1

E[(AB− CR)2ij] =

n∑i=1

p∑j=1

Var[(CR)ij]

=1c

n∑k=1

1pk

(∑i

A2ik

)(∑i

B2kj

)− 1

c‖AB‖2

F

=1c

n∑k=1

1pk|A(k)|2|B(k)|2 −

1c‖AB‖2

F

25/80

Proofs

Find pk to minimize E[‖AB− CR‖2F]:

min∑nk=1 pk=1

f (p1, . . . , pn) =

n∑k=1

1pk|A(k)|2|B(k)|2

Introduce L = f (p1, . . . , pn) + λ(∑n

k=1 pk − 1) and solve ∂L∂pi

= 0

It gives pk =|A(k)||B(k)|∑n

k′=1 |A(k′)||B(k′)|. Then


1c

(n∑

k=1

|A(k)||B(k)|

)2

− 1c‖AB‖2

F

≤ 1c‖A‖2

F‖B‖2F

26/80

Special case: B = AT

If B = AT , then the sampling probabilities are

P(jt = i) =‖A(i)‖2

2

‖A‖2F

Also, R = CT , and the error bounds are:

E[‖AAT − CCT‖F] = E[‖AAT − ASSTAT‖F] ≤ 1c‖A‖2

F

27/80


A better spectral norm bound via matrix Chernoff/Bernsteininequalities:

Assumptions:

Spectral norm of A is one (not important, just normalization)

Frobenius norm of A is at least 0.2 (not important, simplifiesbounds).

Important: Set

c = Ω

(‖A‖2

F

ε2 ln(‖A‖2

F

ε2√δ

))Then: for any 0 < ε < 1 with probability at least 1− δ

E[‖AAT − CCT‖F] = E[‖AAT − ASSTAT‖F] ≤ ε

28/80


Notes:The constants hidden in the big-Omega notation are small.

proof: an immediate application of an inequality derived byOliveira (2010) for sums of random Hermitian matrices.

We need a sufficiently large value of c, otherwise the theoremdoes not work.

29/80

Using a dense S

We approximated the product AB as follows:

AB ≈ CR = (AS)(STB)

Recall that S is an n-by-c sparse matrix (one non-zero entry percolumn).Let’s replace S by a dense matrix, the random sign matrix:

Sij =

+1/√

c, .w.p. 1/2−1/√

c, .w.p. 1/2

The stable rank of A: st(A) = ‖A‖2F/‖A‖2

2:

c = Ω(max(st(A), st(B)) ln(m + p)/ε2)

then, with high probability (see Theorem 3.1 in Magen & ZouziasSODA 2012)

‖AB− CR‖2 = ‖AB− ASSTB‖2 ≤ ε‖A‖2‖B‖2

30/80

Using a dense S for B = AT

Approximate the product AAT (assuming that the spectral norm of A isone):

AAT ≈ CCT = (AS)(STAT)

Let S be a dense matrix, the random sign matrix:

Sij =

+1/√

c, .w.p. 1/2−1/√

c, .w.p. 1/2

If

c = Ω

(‖A‖2

F

ε2 ln m)

then, with high probability

‖AAT − CCT‖2 = ‖AAT − ASSTAT‖2 ≤ ε

31/80

Using a dense S

This matrix multiplication approximation is oblivious to the inputmatrices A and B.

Reminiscent of random projections and theJohnson-Lindenstrauss (JL) transform.

Bounds for the Frobenius norm are easier to prove and are verysimilar to the case where S is just a sampling matrix.

We need a sufficiently large value for c, otherwise the (spectralnorm) theorem does not hold.

It holds for arbitrary A and B (not just B = AT ); thesampling-based approach should also be generalizable toarbitrary A and B.

32/80

Recap: approximating matrix multiplication

We approximated the product AB as follows:

AB ≈ CR = (AS)(STB)

Let S be a sampling matrix (actual columns from A and rows from Bare selected): We need to carefully sample columns of A (rows of B)with probabilities that depend on their norms in order to get “good”bounds of the following form:

E[‖AB− CR‖F] = E[‖AB− ASSTB‖F] ≤ 1√c‖A‖2‖B‖2

‖AAT − CCT‖2 = ‖AAT − ASSTAT‖2 ≤ ε

33/80

Outline



3 Approximate SVD



34/80

Low-Rank Matrix Approximation

Problem Statement:Given: mxn matrix A, and 0 < k < min(m, n) = n.Goal: Compute a rank-k approximation to A.

Fast low-rank matrix approximation is key to efficiency ofsuperfast direct solvers for integral equations and many largesparse linear systems.

Indispensable tool in mining large data sets.

Randomized algorithms compute accurate truncated SVD.

Minimum work and communication/Exceptionally high successrate.

35/80

Low-rank Approximation

seek to compute a rank-k approximation with k n

m× n

A

≈

m× k

Uk

k × n

X

Eigenvectors corresponding to leading eigenvalues.

Singular Value Decomposition (SVD) / Principal ComponentAnalysis (PCA).

Spanning columns or rows.The problem being addressed is ubiquitous in applications.

36/80

SVD - Properties

Theorem: SVDIf A is a real m-by-n matrix, then there exits

U = [u1, . . . , um] ∈ Rm×m and V = [v1, . . . , vn] ∈ Rn×n

such that UTU = I, VTV = I and

UTAV = diag(σ1, . . . , σp) ∈ Rm×n, p = min(m, n),

where σ1 ≥ σ2 ≥ . . . ≥ σp ≥ 0.

Eckart & Young, 1936Let the SVD of A ∈ Rm×n be given in Theorem: SVD. Ifk < r = rank(A) and Ak =

∑ki=1 σiuivT

i , then

minrank(B)=k

‖A− B‖2 = ‖A− Ak‖2 = σk+1.

37/80

Applications:

semidefinite programming

Low-rank matrix completion

Robust principal component analysis

Sparse principal component analysis

Sparse inverse covariance matrix estimation

Nearest correlation matrix estimation

High dimensional data reduction

Density functional theory for electronic structure calculation

Nonlinear eigenvalue problems

Fast algorithms for elliptic PDEs: more efficient Fast MultipoleMethods, fast direct solvers, construction of special quadraturesfor corners and edges, etc.

38/80

Review of existing methods: dense matrix

For a dense n× n matrix that fits in RAM, excellent algorithms arealready part of LAPACK (and incorporated into Matlab, Mathematica,etc).

Double precision accuracy.

Very stable.

O(n3) asymptotic complexity. Reasonably small constants.

Require extensive random access to the matrix.

When the target rank k is much smaller than n, there also existO(n2k) methods with similar characteristics (the well-knownGolub-Businger method, RRQR by Gu and Eisentstat, etc).

For small matrices, the state-of-the-art is quite satisfactory. (By“small” we mean something like n ≤ 10000 on today’scomputers.)

39/80

Review of existing methods: structured matrix

If the matrix is large, but can rapidly be applied to a vector (if it issparse, or sparse in Fourier space, or amenable to the FMM, etc.), socalled Krylov subspace methods often yield excellent accuracy andspeed.

Lanczos-based methods:1 From v ∈ Rn, computes orthonormal basis V for

K(A, v) = span

v,Av,A2v, · · · ,Ak−1v

2 Rayleigh-Ritz: eig(VTAV)⇒ Ritz pairs ≈ eigenpairs3 If “not converged”, update v and go to Step 1.

Strength and weakness:Most efficient in terms of the number of Av (or SpMv)Fast and reliable for computing “not too many” eigenpairsLower concurrency and unable to be warm-started

40/80

“New” challenges in algorithmic design

The existing state-of-the-art methods of numerical linear algebra thatwe have very briefly outlined were designed for an environmentwhere the matrix fits in RAM and the key to performance was tominimize the number of floating point operations required. Currently,communication is becoming the real bottleneck:

While clock speed is hardly improving at all anymore, the cost ofa flop keeps going down rapidly. (Multi-core processors, GPUs,cloud computing, etc.)The cost of slow storage (hard drives, flash memory, etc.) is alsogoing down rapidly.Communication costs are decreasing, but not rapidly. Movingdata from a hard-drive. Moving data between nodes of a parallelmachine. (Or cloud computer ... ) The amount of fast cachememory close to a processor is not improving much. (In fact, itcould be said to be shrinking — GPUs, multi-core, etc.)“Deluge of data”. Driven by ever cheaper storage and acquisitiontechniques. Web search, data mining in archives of documentsor photos, hyper-spectral imagery, social networks, gene arrays,proteomics data, sensor networks, financial transactions, . . .

41/80

Review of existing randomized methods

Random column/row selectionDraw at random some columns and suppose that they span theentire column space. If rows are drawn as well, then spectralproperties can be estimated. Crude sampling leads to less thanO(mn) complexity, but is very dangerous.

SparsificationZero out the vast majority of the entries of the matrix. Keep arandom subset of entries, and boost their magnitude to preserve“something”.

Quantization and sparsificationRestrict the entries of the matrix to a small set of values (-1/0/1for instance).

Randomized Subspace iterationRandom sampling + Rayleigh Rtiz procedure

42/80

Linear Time SVD Algorithm

Input: m-by-n matrix A, 1 ≤ k ≤ c ≤ n, pini=1 such that pi ≥ 0

and∑

i pi = 1

Sampling:For t = 1 to c

pick it ∈ 1, . . . , n with P(it = α) = pα.Set C(t) = A(it)

√cpit

Compute CTC and its SVD, say CTC =∑c

t=1 σt(C)2yt(yt)T

Compute ht = Cyt

σt(C) for t = 1, . . . , k.(Note: A = UΣVT and C = HΣCYT )

Return Hk where H(t)k = ht and σt(C) for t = 1, . . . , k

The left singular vectors of C are with high probability approximationsto the left singular vectors of A

43/80

Extract approximate SVD

Given A. Let X be an approximation of the left singular vectors ofA corresponding to k largest singular values

method = 2; % = 1 or 2Y = (X’*A)’; % Y = A’*X;switch method

case 1;[V,S,W] = svd(Y,0);U = X*W;

case 2;[V,R] = qr(Y,0);[W,S,Z] = svd(R’);U = X*W; V = V*Z;

end

The pair (U, S, V) is an approximation of the k-dominant SVD

44/80

Main theoretical results

Let Hk be constructed the linear Time SVD

E[‖A− HkHTk A‖2

F] ≤ ‖A− Ak‖2F + ε‖A‖2

F

Exact SVD of A = UΣVT , Ak = UkΣkVTk = UkUT

k A = AVkVTk .

minrank(B)≤k ‖A− B‖2 = ‖A− Ak‖2 = σk+1(A)

minrank(B)≤k ‖A− B‖2F = ‖A− Ak‖2

F =∑r

t=k+1 σ2t (A)

perturbation theory of matrices

max1≤t≤n

|σt(A + E)−σt(A)| ≤ ‖E‖2,n∑

k=1

(σk(A + E)−σk(A))2 ≤ ‖E‖2F

the latter is known as Hoffman-Wielandt inequality

Exact SVD of C = HΣCYT

45/80

Proofs

Lemma:

‖A− HkHTk A‖2

F ≤ ‖A− Ak‖2F + 2

√k‖AAT − CCT‖F

‖A− HkHTk A‖2

2 ≤ ‖A− Ak‖22 + 2‖AAT − CCT‖2

Proof of the first inequality‖X‖2

F = Tr(XTX) and Tr(X + Y) = Tr(X) + Tr(Y)

‖A− HkHTk A‖2

F = Tr((A− HkHTk A)T(A− HkHT

k A))

= Tr(ATA)− Tr(ATHkHTk A) = ‖A‖2

F − ‖ATHk‖2F

Using Cauchy-Schwartz inequality:∣∣∣∣∣‖ATHk‖2F −

k∑t=1

σ2t (C)

∣∣∣∣∣ ≤ √k

(k∑

t=1

(|ATht|2 − σ2t (C))2

)1/2

=√

k

(k∑

t=1

(|ATht|2 − |CTht|2)2

)1/2

=√

k

(k∑

t=1

((ht)T(AAT − CCT)ht)2

)1/2

≤√

k‖AAT − CCT‖F

46/80

Proofs

by Hoffman-Wielandt inequality∣∣∣∣∣k∑

t=1

σ2t (C)−

k∑t=1

σ2t (A)

∣∣∣∣∣ ≤ √k

(k∑

t=1

(σ2t (C)− σ2

t (A))2

)1/2

=√

k

(k∑

t=1

(σt(CCT)− σt(AAT))2

)2

≤√

k

(m∑

t=1

(σt(CCT)− σt(AAT))2

)2

≤√

k‖CCT − AAT‖F

Therefore ∣∣∣∣∣‖ATHk‖2F −

k∑t=1

σ2t (A)

∣∣∣∣∣ ≤ 2√

k‖AAT − CCT‖F

47/80

Proofs

matrix approximation gives

E[‖AB− CR‖2F] ≤ 1

c‖A‖2

F‖B‖2F

which yields

2√

kE[‖AAT − CCT‖F] ≤(

4kc

)1/2

‖A‖2F

‖ATHk‖2F ≥

k∑t=1

σ2t (A)− 2

√k‖AAT − CCT‖F

If c ≥ 4k/ε2, then

E[‖A− HkHTk A‖2

F] ≤ ‖A‖2F −

k∑t=1

σ2t (A) + 2

√kE[‖AAT − CCT‖F]

≤ ‖A− Ak‖2F + ε‖A‖2

F

48/80

Back to the CX decomposition

We would like to get theorems of the following form

Given an m-by-n matrix A, there exists an efficient algorithm thatpicks a small number of columns of A such that with reasonableprobability:

‖A− CX‖F = ‖A− CC†A‖F ≤ (1 + ε)‖A− Ak‖F

Let’s start with a simpler, weaker result, connecting the spectralnorm of A-CX to matrix multiplication.

49/80

Approximating singular vectors

Sample c (=140) columns of the original matrix A and rescalethem appropriately to form a 512-by-c matrix C.Show that A− CX is “small”.

(C† is the pseudoinverse of C and X = C†A)

50/80


Sample c (=140) columns of the original matrix A and rescalethem appropriately to form a 512-by-c matrix C.Show that A− CX is “small”.

(C† is the pseudoinverse of C and X = C†A)

51/80


The fact that AAT −CCT is small will imply that A−CX is small as well.

52/80

Proof (spectral norm)

Using the triangle inequality and properties of norms,

‖A− CC†A‖22 = ‖(I − CC†)A‖2

2

= ‖(I − CC†)AAT(I − CC†)T‖2

= ‖(I − CC†)(AAT − CC†)(I − CC†)T‖2

≤ ‖AAT − CC†‖2

I − CC† is a projection matrices(I − CC†)CC† = 0

53/80

Proof (spectral norm)

Assume that our sampling is done in c i.i.d. trials and the samplingprobabilities are:

P(jt = i) =‖A(i)‖2

2

‖A‖2F

We can use our matrix multiplication result: (We will upper bound thespectral norm by the Frobenius norm to avoid concerns about c,namely whether c exceeds the threshold necessitated by the theory.)

E[‖A− CC†A‖2] ≤ E[‖AAT − CCT‖2]

≤ 1c1/4 ‖A‖F

54/80

Is this a good bound?

E[‖A− CC†A‖2] ≤ E[‖AAT − CCT‖2] ≤ 1c1/4 ‖A‖F

Problem 1: If c = n we do not get zero error. That’s because ofsampling with replacement. (We know how to analyze uniformsampling without replacement, but we have no bounds onnon-uniform sampling without replacement.)

Problem 2: If A had rank exactly k, we would like a columnselection procedure that drives the error down to zero when c =k. This can be done deterministically simply by selecting klinearly independent columns.

Problem 3: If A had numerical rank k, we would like a bound thatdepends on the norm of A− Ak and not on the norm of A.

Such deterministic bounds exist when c = k and depend on(k(n− k))1/2‖A− Ak‖2

55/80

Relative-error Frobenius norm bounds

Given an m-by-n matrix A, there exists an O(mn2) algorithm that picks

O((k/ε2) ln(k/ε2)) columns of A

such that with probability at least 0.9

‖A− CX‖F = ‖A− CC†A‖F ≤ (1 + ε)‖A− Ak‖F

56/80

The algorithm

Input: m-by-n matrix A, 0 < ε < 0.5, the desired accuracyC, the matrix consisting of the selected columns

Sampling algorithmCompute probabilities pj summing to 1

Let c = O((k/ε2) ln(k/ε2)).

In c i.i.d. trials pick columns of A, where in each trial the j-thcolumn of A is picked with probability pj.

Let C be the matrix consisting of the chosen columns.

Note: there is no rescaling of the columns of C in this algorithm;however, since our error matrix is A− CX = A− CC†A, rescaling thecolumns of C (as we did in our matrix multiplication algorithms), doesnot change A− CX = A− CC†A.

57/80

Towards a relative error bound

Structural result (deterministic):

‖A− CX‖F = ‖A− CC†A‖F ≤ ‖A− Ak‖F + ‖(A− Ak)S(VTk S)†‖F

This holds for any n-by-c matrix S such that C = AS as long asthe k-by-c matrix VT

k S has full rank (equal to k ).

The proof of the structural result critically uses the fact that withX = C†A is the argmin for any unitarily invariant norm of the errorA− CX.

Variants of this structural result have appeared in various papers.

58/80

The rank of VTk S

Structural result (deterministic):

‖A− CX‖F = ‖A− CC†A‖F ≤ ‖A− Ak‖F + ‖(A− Ak)S(VTk S)†‖F

This holds for any n-by-c matrix S such that C = AS as long asthe k-by-c matrix VT

k S has full rank (equal to k ).

S be a sampling and rescaling matrix, where the samplingprobabilities are the leverage scores: our matrix multiplicationresults (and the fact that the square of the Frobenius norm of Vk

if equal to k) guarantee that, for our choice of c (with constantprobability):

‖VTk Vk − VT

k SSTVk‖2 = ‖Ik − VTk SSTVk‖2 ≤ ε

59/80

The rank of VTk S

From matrix perturbation theory, if

‖VTk Vk − VT

k SSTVk‖2 = ‖Ik − VTk SSTVk‖2 ≤ ε

it follows that all singular values (σi) of VTk S satisfy

√1− ε ≤ σi(VT

k S) ≤√

1 + ε

By choosing ε small enough, we can guarantee that VTk S has full rank

(with constant probability).

‖(A− Ak)S(VTk S)†‖F ≤ ‖(A− Ak)S‖F‖(VT

k S)†‖2

= σ−1min(VT

k S)‖(A− Ak)S‖F

60/80

Bounding the second term

To conclude:

‖(A− Ak)S(VTk S)†‖F ≤ ‖(A− Ak)S‖F‖(VT

k S)†‖2

= σ−1min(VT

k S)‖(A− Ak)S‖F

We already have a bound for all singular values of VTk S (go back

two slides).

Prove, using our sampling and rescaling,

E[‖(A− Ak)S‖2F] = ‖A− Ak‖2

F

Collecting, we get a (2 + ε) constant-factor approximation.A more careful (albeit, longer) analysis can improve the result to a(1 + ε) relative error approximation.

61/80

Using a dense S

Our proof would also work if instead of the sampling matrix S, weused, for example, the dense random sign matrix S:

Sij =

+1/√

c, .w.p. 1/2−1/√

c, .w.p. 1/2

The intuition is clear: the most critical part of the proof is based onapproximate matrix multiplication to bound the singular values of VT

k S.

This also works when S is a dense matrix.

62/80

Selecting fewer columns

Problem:How many columns do we need to include in the matrix C in order toget relative-error approximations?

Recall: with O((k/ε2)log(k/ε2)) columns, we get (subject to a failureprobability)

‖A− CC†A‖F ≤ (1 + ε)‖A− Ak‖F

Deshpande & Rademacher (FOCS’10): with exactly k columns, weget

‖A− CC†A‖F ≤√

k‖A− Ak‖F

What about the range between k and O(k log k)?

63/80

Selecting fewer columns

(Boutsidis, Drineas, & Magdon-Ismail, FOCS 2011)Question:What about the range between k and O(k log k)?

Answer:A relative-error bound is possible by selecting c = 3k/ε columns

Technical breakthrough :A combination of sampling strategies with a novel approach oncolumn selection, inspired by the work of Batson, Spielman, &Srivastava (STOC ’09) on graph sparsifiers.

The running time is O((mnk + nk3)ε−1).Simplicity is gone . . .

64/80

Outline



3 Approximate SVD



65/80

Range finding problem

Given an m× n matrix A and an integer k < min(m, n), find anorthonormal m× k matrix Q such that

A ≈ QQTA

Solving the primitive problem via randomized sampling — intuitionDraw random vectors r1, r2, . . . , rk ∈ Rn.

Form “sample” vectors y1 = Ar1, y2 = Ar2, . . . , yk = Ark ∈ Rm.

Form orthonormal vectors q1, q2, . . . , qk ∈ Rm such that

spanq1, q2, . . . , qk = spany1, y2, . . . , yk

Almost always correct if A has exact rank k

66/80

Low-Rank Approximation: Randomized Sampling

Algorithm RandSam0Input: mxn matrix A, int k, p.

I Draw a random n× (k + p) matrix Ω

I Compute QR = AΩ

I and SVD: QTA = UΣVT

I Truncate SVD: UkΣkVTk

Output: B = (QUk)ΣkVTk

Easy to implement.

Very efficient computation.

Minimum communication.

67/80

error for Gaussian test matrices

Ref: Halko, Martinsson, Tropp, 2009 & 2011; Martinsson, Rokhlin,Tygert (2006)

Let A denote an m× n matrix with singular values σjmin(m,n)j=1

Let k denote a target rank and let p denote an over-samplingparameter.

Let Ω denote an n× (k + p) Gaussian matrix.

Let Q denote the m× (k + p) matrix Q = orth(AΩ).If p ≥ 4, then

‖A− QQ∗A‖2 ≤(

1 + 6√

(k + p)p log p)σk+1 + 3

√k + p

∑j>k

σ2j

1/2

except with probability at most 3p−p.

68/80

Improved Randomized Sampling

Algorithm RandSam1Input: mxn matrix A, int k, p, c.

I Draw a random n× (k + p + c) matrix Ω

I Compute QR = AΩ




Only change from RandSam0: p becomes p + c

Smallest modification of any algorithm.

c allows a drastically different error bound, controls accuracy.

p remains in control of failure chance.

69/80

Randomized Power Method

Algorithm RandSam2Input: mxn matrix A, int k, p, c, q

I Draw a random n× (k + p + c) matrix Ω

I Compute QR = (AAT)qAΩ




QR needs done carefully for numerical accuracy.

Algorithm is old one when q = 0; but q = 1 far more accurate.

Should converge faster when singular values do not decay veryfast.

70/80

Example 1

We consider a 1000× 1000 matrix A whose singular values are shownbelow:

A is a discrete approximation of a certain compact integral operatornormalized so that ‖A‖ = 1. Curiously, the nature of A is in a strongsense irrelevant: the error distribution depends only on σjmin(m,n)

j=1 .

71/80

Example 2

We consider a 1000× 1000 matrix A whose singular values are shownbelow:

A is a discrete approximation of a certain compact integral operatornormalized so that ‖A‖ = 1. Curiously, the nature of A is in a strongsense irrelevant: the error distribution depends only on σjmin(m,n)

j=1 .

72/80

Example 3

The matrix A being analyzed is a 9025× 9025 matrix arising in adiffusion geometry approach to image processing.

To be precise, A is a graph Laplacian on the manifold of 3×3 patches.

73/80

The pink lines illustrates the performance of the basic randomsampling scheme. The errors for q = 0 are huge, and the estimatedeigenvalues are much too small.

74/80

Outline



3 Approximate SVD



75/80

Low-rank reconstruction

Given A ∈ Rm×n and a target rank r. Select k and `. Given randommatrices Ω ∈ Rn×k and Ψ ∈ R`×m. Compute

Y = AΩ, W = ΨA,

Then an approximation A is computed:

Form an orthogonal-triangular factorization Y = QR whereQ ∈ Rm×k.

Solve a least-squares problem to obtain X = (ΨQ)†W ∈ Rk×n

Construct the rank-k approximation A = QX

Suppose k = 2r + 1 and ` = 4r + 2, then

E‖A− A‖F ≤ 2 minrank(Z)≤r

‖A− Z‖F

76/80

Linear update of A

Suppose that A is sent as a sequence of additive updates:

A = H1 + H2 + H3 + · · ·

Then one compute

Y ← Y + HΩ, W = W + ΨH

Suppose that A is sent as a sequence of additive updates:

A = θA + ηH

Then one compute

Y ← θY + ηHΩ, W = θW + ηΨH

77/80

Intuition

SupposeA ≈ QQ∗A.

We want to form the rank-k approximation Q(Q∗A), but we cannotcompute the factor Q∗A without revisiting the target matrix A.

NoteW = Ψ(QQ∗A) + Ψ(A− QQ∗A) ≈ (ΨQ)(Q∗A)

The construction of X:

X = (ΨQ)†W ≈ Q∗A

HenceA = QX ≈ QQ∗A ≈ A

78/80

Projection onto a Convex Set.

Let C be a closed and convex set. Define the projection:

ΠC(M) = arg minX

‖X −M‖2F, s.t. X ∈ C

Suppose A ∈ C. Let Ain be an initial approximation of A,

‖A−ΠC(Ain)‖F ≤ ‖A− Ain‖F

Conjugate Symmetric Approximation

Hn = X ∈ Cn×n|X = X∗

The projection

ΠHn(M) =12

(M + M∗)

79/80

Conjugate Symmetric Approximation.

Let A ∈ Hn. Let A = QX.

ΠHn(A) =12

(QX + X∗Q∗) =12

[Q,X∗](

0 II 0

)[Q,X∗]∗

Let [Q,X∗] = U[T1,T2]. Then

S =12

(T1T∗2 + T2T∗1 )

ConstructAsym = USU∗

80/80

PSD Approximation

Let A be positive semidefinite (PSD). Let A = QX.Form eigenvalue decomposition

S = VDV∗

ComputeAsym = (UV)D(UV)∗

ConstructA+ = ΠHn

+(A) = (UV)D+(UV)∗

Randomized Numerical Linear Algebrabicmr.pku.edu.cn/~wenzw/bigdata/lect-rand.pdfRandomized...

Documents

Transcript of Randomized Numerical Linear Algebrabicmr.pku.edu.cn/~wenzw/bigdata/lect-rand.pdfRandomized...