Randomized Numerical Linear Algebrabicmr.pku.edu.cn/~wenzw/bigdata/lect-rand.pdfRandomized...
Transcript of Randomized Numerical Linear Algebrabicmr.pku.edu.cn/~wenzw/bigdata/lect-rand.pdfRandomized...
Randomized Numerical Linear Algebra
http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html
Acknowledgement: this slides is based on Prof. Petros Drineas and Prof. Michael W.Mahoney’s, Prof. Gunnar Martinsson lecture notes
1/80
2/80
Outline
1 Randomized Numerical Linear Algebra (RandNLA)
2 Approximating Matrix Multiplication
3 Approximate SVD
4 Random Sampling for SVD
5 Single View Algorithm For Matrix Approximation
3/80
Why RandNLA?
Randomization and sampling allow us to design provably accuratealgorithms for problems that are:
Massive(matrices so large that can not be stored at all, or can only bestored in slow memory devices)
Computationally expensive or NP-hard(combinatorial optimization problems such as the ColumnSubset Selection Problem)
4/80
RandNLA: sampling rows/columns
Randomized algorithmsBy (carefully) sampling rows/columns of a matrix, we canconstruct new, smaller matrices that are close to the originalmatrix (w.r.t. matrix norms) with high probability. A
B
≈ C
R
By preprocessing the matrix using random projections, we cansample rows/columns much less carefully (uniformly at random)and still get nice bounds with high probability.
5/80
RandNLA: sampling rows/columns
Matrix perturbation theoryThe resulting smaller matrices behave similarly (in terms ofsingular values and singular vectors) to the original matricesthanks to the norm bounds.
Structural results that “decouple” the “randomized” part fromthe “matrix perturbation” part are important in the analyses ofsuch algorithms.
InterplayApplications in BIG DATA: (Data Mining, Information Retrieval,Machine Learning, Bioinformatics, etc.)Numerical Linear Algebra: Matrix computations and linearalgebra (ie., perturbation theory)Theoretical Computer Science: Randomized and approximationalgorithms
6/80
Issues
Computing large SVDs: computational timeIn commodity hardware (e.g., a 4GB RAM, dual-core laptop),using MatLab 7.0 (R14), the computation of the SVD of the dense2,240-by-447,143 matrix A takes about 12 minutes.Computing this SVD is not a one-liner, since we can not load thewhole matrix in RAM (runs out-of-memory in MatLab).We compute the eigendecomposition of AAT .
Obviously, running time is a concern.
Machine-precision accuracy is NOT necessary!Data are noisy.Approximate singular vectors work well in our settings.
7/80
Issues
Selecting good columns that “capture the structure” of the topprincipal components
Combinatorial optimization problem; hard even for small matrices.Often called the Column Subset Selection Problem (CSSP).Not clear that such columns even exist.
The two issues:Fast approximation to the top k singular vectors of a matrix, and
Selecting columns that capture the structure of the top k singularvectors
are connected and can be tackled using the same framework
8/80
SVD decomposes a matrix as
m× n
A
≈
m× k
Uk
k × n
X
It is easy to see that X = UkA
The SVD has strong optimality properties
Uk: Top k left singular vectors. The columns of Uk are linearcombinations of up to all columns of A.
9/80
The CX decomposition
Mahoney & Drineas (2009) PNAS
m× n
A
≈
m× c
C
c× n
X
Goal: make (some norm) of A− CX small.
C: c columns of A, with c being as close to k as possible
Moore-Penrose pseudoinverse of A:
AA†A = A,A†AA† = A†, (AA†)∗ = AA†, (A†A)∗ = A†A
10/80
The CX decomposition
m× n
A
≈
m× c
C
c× n
X
Easy to prove that optimal X = C†A. (with respect to unitarilyinvariant norms; C† is the Moore-Penrose pseudoinverse of C).Thus, the challenging part is to find good columns of A to includein C.
From a mathematical perspective, this is a combinatorialoptimization problem, closely related to the so-called ColumnSubset Selection Problem (CSSP); the CSSP has been heavilystudied in Numerical Linear Algebra.
11/80
objective for the CX decomposition
We would like to get theorems of the following form
Given an m-by-n matrix A, there exists an efficient algorithm thatpicks a small number of columns of A such that with reasonableprobability:
‖A− CX‖F = ‖A− CC†A‖F ≤ (1 + ε)‖A− Ak‖F
Best rank-k approximation to A: Ak = UkU>k A
12/80
Outline
1 Randomized Numerical Linear Algebra (RandNLA)
2 Approximating Matrix Multiplication
3 Approximate SVD
4 Random Sampling for SVD
5 Single View Algorithm For Matrix Approximation
13/80
Approximating Matrix Multiplication
Problem Statement
Given an m-by-n matrix A and an n-by-p matrix B,approximate the product AB, Or equvialently,Approximate the sum of n rank-one matrices
AB =
n∑i=1
A(i)
( B(i))
︸ ︷︷ ︸∈Rm×p
A(i) the i-th column of A
B(i) the i-th row of B
Each term in the summation is a rank-one matrix
14/80
A sampling approach
AB =
n∑i=1
A(i)
( B(i))
︸ ︷︷ ︸∈Rm×p
AlgorithmFix a set of probabilities pi, i = 1, . . . , n, summing up to 1.
For t = 1, . . . , c,set jt = i, where P(jt = i) = pi.
(Pick c terms of the sum, with replacement, with respect to thepi.)
Approximate the product AB by summing the c terms, afterscaling.
15/80
Generate Discrete Distributions
Consider a discrete random variable with possible valuesc1 < . . . < cn. The probability attached to ci is pi. Let
q0 = 0, qi =
i∑j=1
pj.
They are the cumulative probabilities associated with ci, i.e.,qi = F(ci).
To sample this distributiongenerate a uniform U
find K ∈ 1, . . . , n such that qK−1 < U < qK
set X = cK
16/80
With/without replacement
Sampling with replacement:Each data unit in the population is allowed to appear in thesample more than once.It is easy to analyze mathematically.
Sampling without replacement:Each data unit in the population is allowed to appear in thesample no more than once.
17/80
A sampling approach
AB =
n∑i=1
A(i)
( B(i))
︸ ︷︷ ︸∈Rm×p
≈ 1c
c∑t=1
1pjt
A(jt)
( B(jt))
︸ ︷︷ ︸∈Rm×p
Keeping the terms j1, j2, . . . , jc
18/80
The algorithm (matrix notation)
m× n
A
n× p
B
≈
m× c
C
c× p
R
Algorithm:Pick c columns of A to form an m-by-c matrix C and thecorresponding c rows of B to form a c-by-p matrix R.
Approximate AB by CR.Note
We pick the columns and rows with non-uniform probabilities.We scale the columns (rows) prior to including them in C(R).
19/80
The algorithm (matrix notation)
m× n
A
n× p
B
≈
m× c
C
c× p
R
Algorithm:
Create C and R by performing c i.i.d. trials, with replacement.
For t = 1, . . . , c, pick a column A(jt) and a row B(jt) with probability
P(jt = i) =‖A(i)‖2‖B(i)‖2∑nj=1 ‖A(j)‖2‖B(j)‖2
Include A(jt)/(cpjt)1/2 as a column of C, and B(jt)/(cpjt)
1/2 as arow of R
20/80
The algorithm (matrix notation)
We can also use the sampling matrix notation:Let S be an n-by-c matrix whose t-th column (for t = 1, . . . , c) has asingle non-zero entry, namely
Sjtt =1√cpjt
Clearly:AB ≈ CR = (AS)(STB)
Note: S is sparse (has exactly c non-zero elements, one per column).
21/80
Simple Lemmas
It is easy to implement this particular sampling in two passes.
The expectation of CR (element-wise) is AB (unbiasedestimator), regardless of the sampling probabilities.
Our particular choice of sampling probabilities minimizes thevariance of the estimator (w.r.t. the Frobenius norm of the errorAB-CR).
22/80
A bound for the Frobenius norm
For the above algorithm,
E[‖AB− CR‖F] = E[‖AB− ASSTB‖F] ≤ 1c‖A‖F‖B‖F
prove using elementary manipulations of expectation
Measure concentration follows from a martingale argument.
The above bound also implies an upper bound for the spectralnorm of the error AB− CR.
23/80
Proofs
Let A ∈ Rm×n and B ∈ Rp×p, 1 ≤ c ≤ n, and pi ≥ 0,∑
i pi = 1. Then
E[(CR)ij] = (AB)ij, Var[(CR)ij] =1c
n∑i=1
A2ikB2
kj
pk− 1
c(AB)2
ij
Define Xt =
(A(it)B(it)
cpit
)ij
=Aiit Bit j
cpit. Then
E[Xt] =
n∑k=1
pkAikBkj
cpk=
1c
(AB)ij and E[X2t ] =
n∑k=1
A2ikB2
kj
c2pk
E[(CR)ij] =∑c
t=1 E[Xt] = (AB)ij
Var[Xt] = E[X2t ]− E[Xt]
2 =
n∑k=1
A2ikB2
kj
c2pk− 1
c2 (AB)2ij
24/80
Proofs
Lemma:
E[‖AB− CR‖2F] =
n∑k=1
|A(k)|2|B(k)|2
cpk− 1
c‖AB‖2
F
Proof:
E[‖AB− CR‖2F] =
n∑i=1
p∑j=1
E[(AB− CR)2ij] =
n∑i=1
p∑j=1
Var[(CR)ij]
=1c
n∑k=1
1pk
(∑i
A2ik
)(∑i
B2kj
)− 1
c‖AB‖2
F
=1c
n∑k=1
1pk|A(k)|2|B(k)|2 −
1c‖AB‖2
F
25/80
Proofs
Find pk to minimize E[‖AB− CR‖2F]:
min∑nk=1 pk=1
f (p1, . . . , pn) =
n∑k=1
1pk|A(k)|2|B(k)|2
Introduce L = f (p1, . . . , pn) + λ(∑n
k=1 pk − 1) and solve ∂L∂pi
= 0
It gives pk =|A(k)||B(k)|∑n
k′=1 |A(k′)||B(k′)|. Then
E[‖AB− CR‖2F] =
1c
(n∑
k=1
|A(k)||B(k)|
)2
− 1c‖AB‖2
F
≤ 1c‖A‖2
F‖B‖2F
26/80
Special case: B = AT
If B = AT , then the sampling probabilities are
P(jt = i) =‖A(i)‖2
2
‖A‖2F
Also, R = CT , and the error bounds are:
E[‖AAT − CCT‖F] = E[‖AAT − ASSTAT‖F] ≤ 1c‖A‖2
F
27/80
Special case: B = AT
A better spectral norm bound via matrix Chernoff/Bernsteininequalities:
Assumptions:
Spectral norm of A is one (not important, just normalization)
Frobenius norm of A is at least 0.2 (not important, simplifiesbounds).
Important: Set
c = Ω
(‖A‖2
F
ε2 ln(‖A‖2
F
ε2√δ
))Then: for any 0 < ε < 1 with probability at least 1− δ
E[‖AAT − CCT‖F] = E[‖AAT − ASSTAT‖F] ≤ ε
28/80
Special case: B = AT
Notes:The constants hidden in the big-Omega notation are small.
proof: an immediate application of an inequality derived byOliveira (2010) for sums of random Hermitian matrices.
We need a sufficiently large value of c, otherwise the theoremdoes not work.
29/80
Using a dense S
We approximated the product AB as follows:
AB ≈ CR = (AS)(STB)
Recall that S is an n-by-c sparse matrix (one non-zero entry percolumn).Let’s replace S by a dense matrix, the random sign matrix:
Sij =
+1/√
c, .w.p. 1/2−1/√
c, .w.p. 1/2
The stable rank of A: st(A) = ‖A‖2F/‖A‖2
2:
c = Ω(max(st(A), st(B)) ln(m + p)/ε2)
then, with high probability (see Theorem 3.1 in Magen & ZouziasSODA 2012)
‖AB− CR‖2 = ‖AB− ASSTB‖2 ≤ ε‖A‖2‖B‖2
30/80
Using a dense S for B = AT
Approximate the product AAT (assuming that the spectral norm of A isone):
AAT ≈ CCT = (AS)(STAT)
Let S be a dense matrix, the random sign matrix:
Sij =
+1/√
c, .w.p. 1/2−1/√
c, .w.p. 1/2
If
c = Ω
(‖A‖2
F
ε2 ln m)
then, with high probability
‖AAT − CCT‖2 = ‖AAT − ASSTAT‖2 ≤ ε
31/80
Using a dense S
This matrix multiplication approximation is oblivious to the inputmatrices A and B.
Reminiscent of random projections and theJohnson-Lindenstrauss (JL) transform.
Bounds for the Frobenius norm are easier to prove and are verysimilar to the case where S is just a sampling matrix.
We need a sufficiently large value for c, otherwise the (spectralnorm) theorem does not hold.
It holds for arbitrary A and B (not just B = AT ); thesampling-based approach should also be generalizable toarbitrary A and B.
32/80
Recap: approximating matrix multiplication
We approximated the product AB as follows:
AB ≈ CR = (AS)(STB)
Let S be a sampling matrix (actual columns from A and rows from Bare selected): We need to carefully sample columns of A (rows of B)with probabilities that depend on their norms in order to get “good”bounds of the following form:
E[‖AB− CR‖F] = E[‖AB− ASSTB‖F] ≤ 1√c‖A‖2‖B‖2
‖AAT − CCT‖2 = ‖AAT − ASSTAT‖2 ≤ ε
33/80
Outline
1 Randomized Numerical Linear Algebra (RandNLA)
2 Approximating Matrix Multiplication
3 Approximate SVD
4 Random Sampling for SVD
5 Single View Algorithm For Matrix Approximation
34/80
Low-Rank Matrix Approximation
Problem Statement:Given: mxn matrix A, and 0 < k < min(m, n) = n.Goal: Compute a rank-k approximation to A.
Fast low-rank matrix approximation is key to efficiency ofsuperfast direct solvers for integral equations and many largesparse linear systems.
Indispensable tool in mining large data sets.
Randomized algorithms compute accurate truncated SVD.
Minimum work and communication/Exceptionally high successrate.
35/80
Low-rank Approximation
seek to compute a rank-k approximation with k n
m× n
A
≈
m× k
Uk
k × n
X
Eigenvectors corresponding to leading eigenvalues.
Singular Value Decomposition (SVD) / Principal ComponentAnalysis (PCA).
Spanning columns or rows.The problem being addressed is ubiquitous in applications.
36/80
SVD - Properties
Theorem: SVDIf A is a real m-by-n matrix, then there exits
U = [u1, . . . , um] ∈ Rm×m and V = [v1, . . . , vn] ∈ Rn×n
such that UTU = I, VTV = I and
UTAV = diag(σ1, . . . , σp) ∈ Rm×n, p = min(m, n),
where σ1 ≥ σ2 ≥ . . . ≥ σp ≥ 0.
Eckart & Young, 1936Let the SVD of A ∈ Rm×n be given in Theorem: SVD. Ifk < r = rank(A) and Ak =
∑ki=1 σiuivT
i , then
minrank(B)=k
‖A− B‖2 = ‖A− Ak‖2 = σk+1.
37/80
Applications:
semidefinite programming
Low-rank matrix completion
Robust principal component analysis
Sparse principal component analysis
Sparse inverse covariance matrix estimation
Nearest correlation matrix estimation
High dimensional data reduction
Density functional theory for electronic structure calculation
Nonlinear eigenvalue problems
Fast algorithms for elliptic PDEs: more efficient Fast MultipoleMethods, fast direct solvers, construction of special quadraturesfor corners and edges, etc.
38/80
Review of existing methods: dense matrix
For a dense n× n matrix that fits in RAM, excellent algorithms arealready part of LAPACK (and incorporated into Matlab, Mathematica,etc).
Double precision accuracy.
Very stable.
O(n3) asymptotic complexity. Reasonably small constants.
Require extensive random access to the matrix.
When the target rank k is much smaller than n, there also existO(n2k) methods with similar characteristics (the well-knownGolub-Businger method, RRQR by Gu and Eisentstat, etc).
For small matrices, the state-of-the-art is quite satisfactory. (By“small” we mean something like n ≤ 10000 on today’scomputers.)
39/80
Review of existing methods: structured matrix
If the matrix is large, but can rapidly be applied to a vector (if it issparse, or sparse in Fourier space, or amenable to the FMM, etc.), socalled Krylov subspace methods often yield excellent accuracy andspeed.
Lanczos-based methods:1 From v ∈ Rn, computes orthonormal basis V for
K(A, v) = span
v,Av,A2v, · · · ,Ak−1v
2 Rayleigh-Ritz: eig(VTAV)⇒ Ritz pairs ≈ eigenpairs3 If “not converged”, update v and go to Step 1.
Strength and weakness:Most efficient in terms of the number of Av (or SpMv)Fast and reliable for computing “not too many” eigenpairsLower concurrency and unable to be warm-started
40/80
“New” challenges in algorithmic design
The existing state-of-the-art methods of numerical linear algebra thatwe have very briefly outlined were designed for an environmentwhere the matrix fits in RAM and the key to performance was tominimize the number of floating point operations required. Currently,communication is becoming the real bottleneck:
While clock speed is hardly improving at all anymore, the cost ofa flop keeps going down rapidly. (Multi-core processors, GPUs,cloud computing, etc.)The cost of slow storage (hard drives, flash memory, etc.) is alsogoing down rapidly.Communication costs are decreasing, but not rapidly. Movingdata from a hard-drive. Moving data between nodes of a parallelmachine. (Or cloud computer ... ) The amount of fast cachememory close to a processor is not improving much. (In fact, itcould be said to be shrinking — GPUs, multi-core, etc.)“Deluge of data”. Driven by ever cheaper storage and acquisitiontechniques. Web search, data mining in archives of documentsor photos, hyper-spectral imagery, social networks, gene arrays,proteomics data, sensor networks, financial transactions, . . .
41/80
Review of existing randomized methods
Random column/row selectionDraw at random some columns and suppose that they span theentire column space. If rows are drawn as well, then spectralproperties can be estimated. Crude sampling leads to less thanO(mn) complexity, but is very dangerous.
SparsificationZero out the vast majority of the entries of the matrix. Keep arandom subset of entries, and boost their magnitude to preserve“something”.
Quantization and sparsificationRestrict the entries of the matrix to a small set of values (-1/0/1for instance).
Randomized Subspace iterationRandom sampling + Rayleigh Rtiz procedure
42/80
Linear Time SVD Algorithm
Input: m-by-n matrix A, 1 ≤ k ≤ c ≤ n, pini=1 such that pi ≥ 0
and∑
i pi = 1
Sampling:For t = 1 to c
pick it ∈ 1, . . . , n with P(it = α) = pα.Set C(t) = A(it)
√cpit
Compute CTC and its SVD, say CTC =∑c
t=1 σt(C)2yt(yt)T
Compute ht = Cyt
σt(C) for t = 1, . . . , k.(Note: A = UΣVT and C = HΣCYT )
Return Hk where H(t)k = ht and σt(C) for t = 1, . . . , k
The left singular vectors of C are with high probability approximationsto the left singular vectors of A
43/80
Extract approximate SVD
Given A. Let X be an approximation of the left singular vectors ofA corresponding to k largest singular values
method = 2; % = 1 or 2Y = (X’*A)’; % Y = A’*X;switch method
case 1;[V,S,W] = svd(Y,0);U = X*W;
case 2;[V,R] = qr(Y,0);[W,S,Z] = svd(R’);U = X*W; V = V*Z;
end
The pair (U, S, V) is an approximation of the k-dominant SVD
44/80
Main theoretical results
Let Hk be constructed the linear Time SVD
E[‖A− HkHTk A‖2
F] ≤ ‖A− Ak‖2F + ε‖A‖2
F
Exact SVD of A = UΣVT , Ak = UkΣkVTk = UkUT
k A = AVkVTk .
minrank(B)≤k ‖A− B‖2 = ‖A− Ak‖2 = σk+1(A)
minrank(B)≤k ‖A− B‖2F = ‖A− Ak‖2
F =∑r
t=k+1 σ2t (A)
perturbation theory of matrices
max1≤t≤n
|σt(A + E)−σt(A)| ≤ ‖E‖2,n∑
k=1
(σk(A + E)−σk(A))2 ≤ ‖E‖2F
the latter is known as Hoffman-Wielandt inequality
Exact SVD of C = HΣCYT
45/80
Proofs
Lemma:
‖A− HkHTk A‖2
F ≤ ‖A− Ak‖2F + 2
√k‖AAT − CCT‖F
‖A− HkHTk A‖2
2 ≤ ‖A− Ak‖22 + 2‖AAT − CCT‖2
Proof of the first inequality‖X‖2
F = Tr(XTX) and Tr(X + Y) = Tr(X) + Tr(Y)
‖A− HkHTk A‖2
F = Tr((A− HkHTk A)T(A− HkHT
k A))
= Tr(ATA)− Tr(ATHkHTk A) = ‖A‖2
F − ‖ATHk‖2F
Using Cauchy-Schwartz inequality:∣∣∣∣∣‖ATHk‖2F −
k∑t=1
σ2t (C)
∣∣∣∣∣ ≤ √k
(k∑
t=1
(|ATht|2 − σ2t (C))2
)1/2
=√
k
(k∑
t=1
(|ATht|2 − |CTht|2)2
)1/2
=√
k
(k∑
t=1
((ht)T(AAT − CCT)ht)2
)1/2
≤√
k‖AAT − CCT‖F
46/80
Proofs
by Hoffman-Wielandt inequality∣∣∣∣∣k∑
t=1
σ2t (C)−
k∑t=1
σ2t (A)
∣∣∣∣∣ ≤ √k
(k∑
t=1
(σ2t (C)− σ2
t (A))2
)1/2
=√
k
(k∑
t=1
(σt(CCT)− σt(AAT))2
)2
≤√
k
(m∑
t=1
(σt(CCT)− σt(AAT))2
)2
≤√
k‖CCT − AAT‖F
Therefore ∣∣∣∣∣‖ATHk‖2F −
k∑t=1
σ2t (A)
∣∣∣∣∣ ≤ 2√
k‖AAT − CCT‖F
47/80
Proofs
matrix approximation gives
E[‖AB− CR‖2F] ≤ 1
c‖A‖2
F‖B‖2F
which yields
2√
kE[‖AAT − CCT‖F] ≤(
4kc
)1/2
‖A‖2F
‖ATHk‖2F ≥
k∑t=1
σ2t (A)− 2
√k‖AAT − CCT‖F
If c ≥ 4k/ε2, then
E[‖A− HkHTk A‖2
F] ≤ ‖A‖2F −
k∑t=1
σ2t (A) + 2
√kE[‖AAT − CCT‖F]
≤ ‖A− Ak‖2F + ε‖A‖2
F
48/80
Back to the CX decomposition
We would like to get theorems of the following form
Given an m-by-n matrix A, there exists an efficient algorithm thatpicks a small number of columns of A such that with reasonableprobability:
‖A− CX‖F = ‖A− CC†A‖F ≤ (1 + ε)‖A− Ak‖F
Let’s start with a simpler, weaker result, connecting the spectralnorm of A-CX to matrix multiplication.
49/80
Approximating singular vectors
Sample c (=140) columns of the original matrix A and rescalethem appropriately to form a 512-by-c matrix C.Show that A− CX is “small”.
(C† is the pseudoinverse of C and X = C†A)
50/80
Approximating singular vectors
Sample c (=140) columns of the original matrix A and rescalethem appropriately to form a 512-by-c matrix C.Show that A− CX is “small”.
(C† is the pseudoinverse of C and X = C†A)
51/80
Approximating singular vectors
The fact that AAT −CCT is small will imply that A−CX is small as well.
52/80
Proof (spectral norm)
Using the triangle inequality and properties of norms,
‖A− CC†A‖22 = ‖(I − CC†)A‖2
2
= ‖(I − CC†)AAT(I − CC†)T‖2
= ‖(I − CC†)(AAT − CC†)(I − CC†)T‖2
≤ ‖AAT − CC†‖2
I − CC† is a projection matrices(I − CC†)CC† = 0
53/80
Proof (spectral norm)
Assume that our sampling is done in c i.i.d. trials and the samplingprobabilities are:
P(jt = i) =‖A(i)‖2
2
‖A‖2F
We can use our matrix multiplication result: (We will upper bound thespectral norm by the Frobenius norm to avoid concerns about c,namely whether c exceeds the threshold necessitated by the theory.)
E[‖A− CC†A‖2] ≤ E[‖AAT − CCT‖2]
≤ 1c1/4 ‖A‖F
54/80
Is this a good bound?
E[‖A− CC†A‖2] ≤ E[‖AAT − CCT‖2] ≤ 1c1/4 ‖A‖F
Problem 1: If c = n we do not get zero error. That’s because ofsampling with replacement. (We know how to analyze uniformsampling without replacement, but we have no bounds onnon-uniform sampling without replacement.)
Problem 2: If A had rank exactly k, we would like a columnselection procedure that drives the error down to zero when c =k. This can be done deterministically simply by selecting klinearly independent columns.
Problem 3: If A had numerical rank k, we would like a bound thatdepends on the norm of A− Ak and not on the norm of A.
Such deterministic bounds exist when c = k and depend on(k(n− k))1/2‖A− Ak‖2
55/80
Relative-error Frobenius norm bounds
Given an m-by-n matrix A, there exists an O(mn2) algorithm that picks
O((k/ε2) ln(k/ε2)) columns of A
such that with probability at least 0.9
‖A− CX‖F = ‖A− CC†A‖F ≤ (1 + ε)‖A− Ak‖F
56/80
The algorithm
Input: m-by-n matrix A, 0 < ε < 0.5, the desired accuracyC, the matrix consisting of the selected columns
Sampling algorithmCompute probabilities pj summing to 1
Let c = O((k/ε2) ln(k/ε2)).
In c i.i.d. trials pick columns of A, where in each trial the j-thcolumn of A is picked with probability pj.
Let C be the matrix consisting of the chosen columns.
Note: there is no rescaling of the columns of C in this algorithm;however, since our error matrix is A− CX = A− CC†A, rescaling thecolumns of C (as we did in our matrix multiplication algorithms), doesnot change A− CX = A− CC†A.
57/80
Towards a relative error bound
Structural result (deterministic):
‖A− CX‖F = ‖A− CC†A‖F ≤ ‖A− Ak‖F + ‖(A− Ak)S(VTk S)†‖F
This holds for any n-by-c matrix S such that C = AS as long asthe k-by-c matrix VT
k S has full rank (equal to k ).
The proof of the structural result critically uses the fact that withX = C†A is the argmin for any unitarily invariant norm of the errorA− CX.
Variants of this structural result have appeared in various papers.
58/80
The rank of VTk S
Structural result (deterministic):
‖A− CX‖F = ‖A− CC†A‖F ≤ ‖A− Ak‖F + ‖(A− Ak)S(VTk S)†‖F
This holds for any n-by-c matrix S such that C = AS as long asthe k-by-c matrix VT
k S has full rank (equal to k ).
S be a sampling and rescaling matrix, where the samplingprobabilities are the leverage scores: our matrix multiplicationresults (and the fact that the square of the Frobenius norm of Vk
if equal to k) guarantee that, for our choice of c (with constantprobability):
‖VTk Vk − VT
k SSTVk‖2 = ‖Ik − VTk SSTVk‖2 ≤ ε
59/80
The rank of VTk S
From matrix perturbation theory, if
‖VTk Vk − VT
k SSTVk‖2 = ‖Ik − VTk SSTVk‖2 ≤ ε
it follows that all singular values (σi) of VTk S satisfy
√1− ε ≤ σi(VT
k S) ≤√
1 + ε
By choosing ε small enough, we can guarantee that VTk S has full rank
(with constant probability).
‖(A− Ak)S(VTk S)†‖F ≤ ‖(A− Ak)S‖F‖(VT
k S)†‖2
= σ−1min(VT
k S)‖(A− Ak)S‖F
60/80
Bounding the second term
To conclude:
‖(A− Ak)S(VTk S)†‖F ≤ ‖(A− Ak)S‖F‖(VT
k S)†‖2
= σ−1min(VT
k S)‖(A− Ak)S‖F
We already have a bound for all singular values of VTk S (go back
two slides).
Prove, using our sampling and rescaling,
E[‖(A− Ak)S‖2F] = ‖A− Ak‖2
F
Collecting, we get a (2 + ε) constant-factor approximation.A more careful (albeit, longer) analysis can improve the result to a(1 + ε) relative error approximation.
61/80
Using a dense S
Our proof would also work if instead of the sampling matrix S, weused, for example, the dense random sign matrix S:
Sij =
+1/√
c, .w.p. 1/2−1/√
c, .w.p. 1/2
The intuition is clear: the most critical part of the proof is based onapproximate matrix multiplication to bound the singular values of VT
k S.
This also works when S is a dense matrix.
62/80
Selecting fewer columns
Problem:How many columns do we need to include in the matrix C in order toget relative-error approximations?
Recall: with O((k/ε2)log(k/ε2)) columns, we get (subject to a failureprobability)
‖A− CC†A‖F ≤ (1 + ε)‖A− Ak‖F
Deshpande & Rademacher (FOCS’10): with exactly k columns, weget
‖A− CC†A‖F ≤√
k‖A− Ak‖F
What about the range between k and O(k log k)?
63/80
Selecting fewer columns
(Boutsidis, Drineas, & Magdon-Ismail, FOCS 2011)Question:What about the range between k and O(k log k)?
Answer:A relative-error bound is possible by selecting c = 3k/ε columns
Technical breakthrough :A combination of sampling strategies with a novel approach oncolumn selection, inspired by the work of Batson, Spielman, &Srivastava (STOC ’09) on graph sparsifiers.
The running time is O((mnk + nk3)ε−1).Simplicity is gone . . .
64/80
Outline
1 Randomized Numerical Linear Algebra (RandNLA)
2 Approximating Matrix Multiplication
3 Approximate SVD
4 Random Sampling for SVD
5 Single View Algorithm For Matrix Approximation
65/80
Range finding problem
Given an m× n matrix A and an integer k < min(m, n), find anorthonormal m× k matrix Q such that
A ≈ QQTA
Solving the primitive problem via randomized sampling — intuitionDraw random vectors r1, r2, . . . , rk ∈ Rn.
Form “sample” vectors y1 = Ar1, y2 = Ar2, . . . , yk = Ark ∈ Rm.
Form orthonormal vectors q1, q2, . . . , qk ∈ Rm such that
spanq1, q2, . . . , qk = spany1, y2, . . . , yk
Almost always correct if A has exact rank k
66/80
Low-Rank Approximation: Randomized Sampling
Algorithm RandSam0Input: mxn matrix A, int k, p.
I Draw a random n× (k + p) matrix Ω
I Compute QR = AΩ
I and SVD: QTA = UΣVT
I Truncate SVD: UkΣkVTk
Output: B = (QUk)ΣkVTk
Easy to implement.
Very efficient computation.
Minimum communication.
67/80
error for Gaussian test matrices
Ref: Halko, Martinsson, Tropp, 2009 & 2011; Martinsson, Rokhlin,Tygert (2006)
Let A denote an m× n matrix with singular values σjmin(m,n)j=1
Let k denote a target rank and let p denote an over-samplingparameter.
Let Ω denote an n× (k + p) Gaussian matrix.
Let Q denote the m× (k + p) matrix Q = orth(AΩ).If p ≥ 4, then
‖A− QQ∗A‖2 ≤(
1 + 6√
(k + p)p log p)σk+1 + 3
√k + p
∑j>k
σ2j
1/2
except with probability at most 3p−p.
68/80
Improved Randomized Sampling
Algorithm RandSam1Input: mxn matrix A, int k, p, c.
I Draw a random n× (k + p + c) matrix Ω
I Compute QR = AΩ
I and SVD: QTA = UΣVT
I Truncate SVD: UkΣkVTk
Output: B = (QUk)ΣkVTk
Only change from RandSam0: p becomes p + c
Smallest modification of any algorithm.
c allows a drastically different error bound, controls accuracy.
p remains in control of failure chance.
69/80
Randomized Power Method
Algorithm RandSam2Input: mxn matrix A, int k, p, c, q
I Draw a random n× (k + p + c) matrix Ω
I Compute QR = (AAT)qAΩ
I and SVD: QTA = UΣVT
I Truncate SVD: UkΣkVTk
Output: B = (QUk)ΣkVTk
QR needs done carefully for numerical accuracy.
Algorithm is old one when q = 0; but q = 1 far more accurate.
Should converge faster when singular values do not decay veryfast.
70/80
Example 1
We consider a 1000× 1000 matrix A whose singular values are shownbelow:
A is a discrete approximation of a certain compact integral operatornormalized so that ‖A‖ = 1. Curiously, the nature of A is in a strongsense irrelevant: the error distribution depends only on σjmin(m,n)
j=1 .
71/80
Example 2
We consider a 1000× 1000 matrix A whose singular values are shownbelow:
A is a discrete approximation of a certain compact integral operatornormalized so that ‖A‖ = 1. Curiously, the nature of A is in a strongsense irrelevant: the error distribution depends only on σjmin(m,n)
j=1 .
72/80
Example 3
The matrix A being analyzed is a 9025× 9025 matrix arising in adiffusion geometry approach to image processing.
To be precise, A is a graph Laplacian on the manifold of 3×3 patches.
73/80
The pink lines illustrates the performance of the basic randomsampling scheme. The errors for q = 0 are huge, and the estimatedeigenvalues are much too small.
74/80
Outline
1 Randomized Numerical Linear Algebra (RandNLA)
2 Approximating Matrix Multiplication
3 Approximate SVD
4 Random Sampling for SVD
5 Single View Algorithm For Matrix Approximation
75/80
Low-rank reconstruction
Given A ∈ Rm×n and a target rank r. Select k and `. Given randommatrices Ω ∈ Rn×k and Ψ ∈ R`×m. Compute
Y = AΩ, W = ΨA,
Then an approximation A is computed:
Form an orthogonal-triangular factorization Y = QR whereQ ∈ Rm×k.
Solve a least-squares problem to obtain X = (ΨQ)†W ∈ Rk×n
Construct the rank-k approximation A = QX
Suppose k = 2r + 1 and ` = 4r + 2, then
E‖A− A‖F ≤ 2 minrank(Z)≤r
‖A− Z‖F
76/80
Linear update of A
Suppose that A is sent as a sequence of additive updates:
A = H1 + H2 + H3 + · · ·
Then one compute
Y ← Y + HΩ, W = W + ΨH
Suppose that A is sent as a sequence of additive updates:
A = θA + ηH
Then one compute
Y ← θY + ηHΩ, W = θW + ηΨH
77/80
Intuition
SupposeA ≈ QQ∗A.
We want to form the rank-k approximation Q(Q∗A), but we cannotcompute the factor Q∗A without revisiting the target matrix A.
NoteW = Ψ(QQ∗A) + Ψ(A− QQ∗A) ≈ (ΨQ)(Q∗A)
The construction of X:
X = (ΨQ)†W ≈ Q∗A
HenceA = QX ≈ QQ∗A ≈ A
78/80
Projection onto a Convex Set.
Let C be a closed and convex set. Define the projection:
ΠC(M) = arg minX
‖X −M‖2F, s.t. X ∈ C
Suppose A ∈ C. Let Ain be an initial approximation of A,
‖A−ΠC(Ain)‖F ≤ ‖A− Ain‖F
Conjugate Symmetric Approximation
Hn = X ∈ Cn×n|X = X∗
The projection
ΠHn(M) =12
(M + M∗)
79/80
Conjugate Symmetric Approximation.
Let A ∈ Hn. Let A = QX.
ΠHn(A) =12
(QX + X∗Q∗) =12
[Q,X∗](
0 II 0
)[Q,X∗]∗
Let [Q,X∗] = U[T1,T2]. Then
S =12
(T1T∗2 + T2T∗1 )
ConstructAsym = USU∗
80/80
PSD Approximation
Let A be positive semidefinite (PSD). Let A = QX.Form eigenvalue decomposition
S = VDV∗
ComputeAsym = (UV)D(UV)∗
ConstructA+ = ΠHn
+(A) = (UV)D+(UV)∗