Transcript of "Randomized Algorithms for Low-Rank Matrix Decomposition," Ben Sapp, WPE II, May 6, 2011.
- Slide 1
- Randomized Algorithms for Low-Rank Matrix Decomposition. Ben Sapp, WPE II, May 6, 2011.
- Slide 2
- Low-Rank Decomposition: The Goal. Approximate A (m x n) by a product of two thin factors, A ≈ B C, where B is m x k, C is k x n, and k << min(m, n).
- Slide 3
- Advantages. Requires mk + nk numbers to represent the matrix, instead of mn; i.e., compression. Fewer numbers = less storage space and faster matrix multiplication. In many applications, exposes the structure of the data.
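As a quick sanity check of the compression claim, here is a small Python count with illustrative (made-up) dimensions:

```python
m, n, k = 10000, 5000, 20

full = m * n              # numbers needed to store A directly
low_rank = m * k + n * k  # numbers needed to store the two rank-k factors

print(full, low_rank)     # 50,000,000 vs. 300,000: roughly 166x fewer numbers
```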
- Slide 4
- Exposing structure: an example. Let's say your matrix is full of points in n-dimensional space which have some underlying linear structure. This data can be almost exactly described in 2 dimensions, i.e., by a rank-2 matrix: one factor describes the 2 axes, and the other holds the coefficients (embedding) of the points in that basis.
- Slide 5
- Formulation: The Fixed-Rank Problem. Minimize ||A - B||_F subject to rank(B) ≤ k. The constraint set is non-convex, but a global optimum exists, and there is a known solution.
- Slide 6
- Classical Solution: The Fixed-Rank Problem. Singular Value Decomposition of A: A = U Σ V^T. Optimal solution: the truncated SVD, A_k = U_k Σ_k V_k^T (proof to come later).
- Slide 7
- Truncated SVD Properties. Properties (via the power method): O(mnk) time, which is acceptable, but O(k) passes through the data; expects cheap random access; iterative. Issues: datasets are huge (Netflix dataset 2 GB, FERET Face DB 5 GB, Wiki English 14 GB, Watson KB 1 TB); architectures are parallel and decentralized; data access is expensive.
- Slide 8
- Low-Rank Algorithm Desiderata. 1-2 passes over the matrix as a pre-processing/examination step. The remainder of the work should be sub-O(mn) and depend on the desired rank k rather than the ambient dimensions m x n. Trade off accuracy against computation time. Decentralized, simple, and numerically stable.
- Slide 9
- Randomization to the Rescue! 1-2 passes over the matrix in a pre-processing/examination step. The remainder of the work is sub-O(mn), depending on the underlying rank rather than the ambient dimensions. Trade off accuracy against computation time. Decentralized, simple, and numerically stable. Randomized meta-algorithm: 1. Given A (m x n), randomly obtain Y (m x s or s x s) in 1 pass. 2. Compute an exact SVD of Y in O(ms^2) or O(s^3). 3. Use Y's decomposition to approximate A's decomposition.
- Slide 10
- Outline. Introduction. Linear algebra preliminaries and intuition. The algorithms: SampleCols, AdaSample, RandProj. A comparison.
- Slide 11
- Singular Value Decomposition. Any real rectangular matrix A can be factored into the form A = U Σ V^T. Tall & skinny: m > n. Short & fat: m < n.
- Slide 12
- SVD Properties. U and V are unitary matrices with mutually orthonormal columns. These columns are called the left and right singular vectors, u_i and v_i.
- Slide 13
- SVD Properties. Σ contains the singular values on the diagonal, in order σ_1 ≥ σ_2 ≥ … ≥ 0, which correspond to the left and right singular vector pairs u_1, v_1; u_2, v_2; and so on.
- Slide 14
- SVD Properties. For a vector x, the geometric interpretation of Ax = U Σ V^T x: 1. Rotate x by the rotation matrix V^T. 2. Scale the result along the coordinate axes by the singular values. 3. Either discard n - m dimensions (m < n) or append zeros (m > n) to map from R^n to R^m. 4. Rotate the result by U.
- Slide 15
- Fundamental Subspaces. Define range(A) as the set of all vectors which A maps to, range(A) = { Ax : x in R^n }, and null(A) as the set of vectors that A maps to zero, null(A) = { x : Ax = 0 }. If A has rank k, then { u_1, …, u_k } is an orthonormal basis for range(A). Synonymous: range(A) is the linear subspace spanned by the columns of A.
- Slide 16
- Frobenius norm. Equivalent definitions: ||X||_F^2 = Σ_ij X_ij^2 = tr(X^T X) = Σ_i σ_i^2 (in Matlab: sum(X(:).^2)). The norm is unitarily invariant, ||U X V||_F = ||X||_F for unitary U and V, since tr(XY) = tr(YX).
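A small NumPy check of these equivalences, on an arbitrary random example matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))

# three equivalent ways to compute ||X||_F^2
f_entries = np.sum(X**2)                                  # sum of squared entries
f_trace = np.trace(X.T @ X)                               # tr(X^T X)
f_sigmas = np.sum(np.linalg.svd(X, compute_uv=False)**2)  # sum of squared singular values

# unitary invariance: an orthogonal rotation leaves the norm unchanged
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
f_rotated = np.sum((Q @ X)**2)
```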
- Slide 17
- The Optimal Low-Rank Solution. Theorem (Eckart-Young, 1936). Let A = U Σ V^T. Then ||A - B||_F over rank(B) ≤ k is minimized by B = A_k = Σ_{i=1}^k σ_i u_i v_i^T, with optimal value ||A - A_k||_F^2 = Σ_{i>k} σ_i^2.
- Slide 18
- The Optimal Low-Rank Solution. Proof. First, the Frobenius norm is unitarily invariant. Thus, ||A - B||_F = ||U^T (A - B) V||_F = ||Σ - U^T B V||_F. Since Σ is diagonal, the minimizing U^T B V should be diagonal too.
- Slide 19
- The Optimal Low-Rank Solution. Proof (continued). At this point, we have ||A - B||_F^2 = Σ_i (σ_i - d_i)^2, where d = diag(U^T B V). Since rank(B) is at most k, at most k of the d_i can be non-zero. Conclude: set d_i = σ_i for the k largest singular values, so B = A_k. In summary: min_{rank(B) ≤ k} ||A - B||_F^2 = Σ_{i>k} σ_i^2.
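The theorem is easy to check numerically; a NumPy sketch on a random example matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 6))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # truncated SVD

# Eckart-Young: the optimal squared error equals the sum of squared tail singular values
err_opt = np.sum((A - A_k)**2)
tail = np.sum(s[k:]**2)

# any other rank-k choice does no better, e.g. keeping the "wrong" k singular directions
B = U[:, 1:k+1] @ np.diag(s[1:k+1]) @ Vt[1:k+1, :]
err_other = np.sum((A - B)**2)
```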
- Slide 20
- Orthogonal Projections. Projection of a vector x onto a unit vector q: (q^T x) q. Projection of a matrix A onto an orthonormal basis Q: Q Q^T A.
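In NumPy, with a random basis chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))

# orthonormal basis Q for a random 2-dimensional subspace of R^6
Q, _ = np.linalg.qr(rng.standard_normal((6, 2)))

proj_A = Q @ (Q.T @ A)          # projection of A onto span(Q)

# projecting a second time changes nothing (projections are idempotent)
proj_twice = Q @ (Q.T @ proj_A)
```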
- Slide 21
- Randomized Meta-algorithm. Input: matrix A (m x n), target rank k, number of samples s. Output: Ã, which approximately solves the fixed-rank problem. 1. Form a lower-dimensional Y by sampling s rows and/or columns from A, or by applying s random projections; Y is m x s or s x s. 2. Compute the top k left singular vectors of Y to form the best rank-k basis Q for range(Y). 3. Project A onto the subspace spanned by Q: Ã = Q Q^T A.
- Slide 22
- The Main Idea. Meta-algorithm: 1. Form Y from A randomly. 2. Get the optimal rank-k basis Q for the span of Y via SVD. 3. Project A onto Q. Bounds: how far is Q Q^T A from the optimal A_k?
- Slide 23
- Outline. Introduction. Linear algebra preliminaries and intuition. The algorithms: SampleCols, AdaSample, RandProj. A comparison.
- Slide 24
- Comparing the algorithms.
Method            | Running time | # Passes | Error w.h.p.
SampleCols        | ?            | ?        | ?
SampleRowsCols    | ?            | ?        | ?
AdaSample         | ?            | ?        | ?
RandProj          | ?            | ?        | ?
Exact partial SVD | O(mnk)       | O(k)     | (optimal)
- Slide 25
- Sampling Rows & Columns. Simple idea: too much data to deal with? Subsample! But sample proportional to squared magnitude: uniform sampling is not so useful, while magnitude-weighted sampling is more informative.
- Slide 26
- First pass: SampleCols. Input: matrix A (m x n), target rank k, number of samples s. Output: Ã, which approximately solves the fixed-rank problem. 1. Sample s columns from A, A(:,i_1), …, A(:,i_s), proportional to their squared magnitude: p_i = ||A(:,i)||^2 / ||A||_F^2. 2. Form Y (m x s) from the scaled sampled columns. 3. Compute Q = [q_1 … q_k], the top k left singular vectors of Y. 4. Project A onto the subspace spanned by Q: Ã = Q Q^T A.
[Frieze, Kannan & Vempala, 1998]
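A NumPy sketch of SampleCols (the function name is mine; the 1/sqrt(s p_i) scaling makes E[Y Y^T] = A A^T). It is tested on an exactly rank-2 matrix, whose sampled columns span range(A) almost surely:

```python
import numpy as np

def sample_cols(A, k, s, rng):
    # sampling probabilities proportional to squared column magnitudes
    p = np.sum(A**2, axis=0)
    p = p / p.sum()
    idx = rng.choice(A.shape[1], size=s, p=p)
    # scale the sampled columns so that E[Y Y^T] = A A^T
    Y = A[:, idx] / np.sqrt(s * p[idx])
    # top-k left singular vectors of Y
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    Q = U[:, :k]
    return Q @ (Q.T @ A)

rng = np.random.default_rng(4)
A = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 100))  # rank-2 data
A_hat = sample_cols(A, k=2, s=20, rng=rng)
```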
- Slide 27
- Running time: SampleCols. Input and output as above. 1. Sample s columns from A proportional to their squared magnitude: O(mn). 2. Form Y. 3. Compute Q = [q_1 … q_k], the top k left singular vectors of Y: O(ms^2). 4. Project A onto the subspace spanned by Q: O(ms).
- Slide 28
- Analysis: SampleCols. How different is A from Y on average? Exact: A A^T. Randomly using s columns: Y Y^T. Let's start by analyzing a randomized matrix-vector multiplication algorithm:
- Slide 29
- Analysis: SampleCols. Random matrix-vector multiplication. Exact: A^T x = Σ_i A(i,:)^T x_i. Randomly use s rows: let the random variable X take on the value A(i,:)^T x_i / p_i with probability p_i. Then E[X] = A^T x, and the variance depends on the choice of the p_i. How do we easily bound this variance? Then set p_i proportional to ||A(i,:)||^2.
- Slide 30
- Analysis: SampleCols. Handle on the variance of matrix-vector multiplication: with p_i proportional to ||A(i,:)||^2, the variance is bounded by ||A||_F^2 ||x||^2. With s samples, the variance gets s times better. Let's extend the idea to matrix-matrix multiplication AB, randomly choosing column i from A and the corresponding row i from matrix B: define the random variable Z = A(:,i) B(i,:) / p_i. Then E[Z] = AB.
- Slide 31
- Analysis: SampleCols. Handle on the variance of matrix-matrix multiplication: with s samples, E||AB - Z||_F^2 ≤ ||A||_F^2 ||B||_F^2 / s. Now let's look at the variance when B = A^T. Let Z = Y Y^T. Then, plugging into the above, we have E||A A^T - Y Y^T||_F^2 ≤ ||A||_F^4 / s.
- Slide 32
- Analysis: SampleCols. When Y is a sampled version of A as in SampleCols, then E||A A^T - Y Y^T||_F^2 ≤ ||A||_F^4 / s. We need one more lemma, which quantifies the distortion when projecting one matrix onto another's range.
- Slide 33
- Analysis: SampleCols. Given any two matrices A and B, let Q be a top-k basis of range(B). Then ||A - Q Q^T A||_F^2 ≤ ||A - A_k||_F^2 + 2 sqrt(k) ||A A^T - B B^T||_F. That is, the projection of A onto the best k-basis of B differs from the projection of A onto its own best k-basis (range(A)) by at most this much.
- Slide 34
- Analysis: SampleCols. Lemma (Distortion from sampling). When Y is a sampled version of A as in SampleCols, then E||A A^T - Y Y^T||_F^2 ≤ ||A||_F^4 / s. Lemma (Distortion from projection). Given any two matrices A and B, let Q be a basis of the top k left singular vectors of B. Then ||A - Q Q^T A||_F^2 ≤ ||A - A_k||_F^2 + 2 sqrt(k) ||A A^T - B B^T||_F. Taking the expectation of the second lemma w.r.t. the sampled columns, we obtain a bound for SampleCols: E||A - Q Q^T A||_F^2 ≤ ||A - A_k||_F^2 + 2 sqrt(k/s) ||A||_F^2.
- Slide 35
- Comparing the algorithms.
Method            | Running time | # Passes | Error w.h.p.
SampleCols        | O(mn + ms^2) | 2        | ||A - A_k||_F^2 + 2 sqrt(k/s) ||A||_F^2
SampleRowsCols    | ?            | ?        | ?
AdaSample         | ?            | ?        | ?
RandProj          | ?            | ?        | ?
Exact partial SVD | O(mnk)       | O(k)     | (optimal)
- Slide 36
- One step further: SampleRowsCols. Input: matrix A (m x n), target rank k, number of samples s. Output: Ã, which approximately solves the fixed-rank problem. 1. Sample s columns from A, A(:,i_1), …, A(:,i_s), proportional to their squared magnitude; scale and form Y (m x s). 2. Sample s rows from Y, Y(i_1,:), …, Y(i_s,:), proportional to their squared magnitude; scale and form W (s x s). 3. Compute V = [v_1 … v_s], the top s right singular vectors of W. 4. Compute Q = [q_1 … q_k], where q_i = Y v_i / ||Y v_i||. 5. Project A onto the subspace spanned by Q: Ã = Q Q^T A.
[Frieze, Kannan & Vempala, 2004]
- Slide 37
- Running time: SampleRowsCols. Input and output as above. 1. Sample s columns from A; scale and form Y (m x s): O(mn). 2. Sample s rows from Y; scale and form W (s x s): O(ms). 3. Compute V = [v_1 … v_s], the top s right singular vectors of W: O(s^3). 4. Compute Q = [q_1 … q_k], where q_i = Y v_i / ||Y v_i||: O(ms^2). 5. Project A onto the subspace spanned by Q. Total running time: O(mn + s^3).
- Slide 38
- Analysis: SampleRowsCols. (Diagram: sample columns of A to form Y, sample rows of Y to form W, compute a basis for the rows of W, convert from the row basis of W to a column basis of Y, and project A onto Q; shown alongside the exact decomposition and SampleCols.)
- Slide 39
- Analysis: SampleRowsCols. It turns out this additive error dominates the errors incurred in the other steps of the algorithm. Lemma (Good left projections from good right projections). Let q_i = Y v_i / ||Y v_i|| as in SampleRowsCols; then good right projections of W yield good left projections of Y. Thus, we can bound the algorithm as follows. Theorem (SampleRowsCols average Frobenius error). SampleRowsCols finds a rank-k matrix Ã whose expected squared error exceeds the optimal ||A - A_k||_F^2 by an additive term of order sqrt(k/s) ||A||_F^2.
- Slide 40
- Comparing the algorithms.
Method            | Running time | # Passes | Error w.h.p.
SampleCols        | O(mn + ms^2) | 2        | (see above)
SampleRowsCols    | O(mn + s^3)  | 2        | (see above)
AdaSample         | ?            | ?        | ?
RandProj          | ?            | ?        | ?
Exact partial SVD | O(mnk)       | O(k)     | (optimal)
- Slide 41
- SampleCols is easy to break. The error is additive, and we have no control over the ||A||_F^2 term. Consider: a few important outlier points among many redundant ones. This data has a near-perfect rank-2 decomposition, but SampleCols will almost surely miss the outliers!
- Slide 42
- Improvement: AdaSample. Sample some rows and form a basis. The next round of sampling should be proportional to the residual part of A not captured by the current sample.
- Slide 43
- AdaSample. Input: matrix A (m x n), target rank k, number of samples s. Output: Ã, which approximately solves the fixed-rank problem. 1. Start with an empty sample set S = { }, E := A. 2. For t = 1 to T: a. Pick a subset S_t of s rows of A, with row i chosen with probability proportional to ||E(i,:)||^2. b. Update S := S ∪ S_t. c. Update the residual E := A minus its projection onto span(S). 3. Return the projection of A onto the best rank-k basis of span(S).
[Deshpande, Rademacher, Vempala & Wang, 2006]
- Slide 44
- Analysis: AdaSample. Let's look at one iteration of AdaSample. Lemma: Let A be m x n and L a linear subspace. Let E = A - π_L(A), and let S be a collection of s rows of A sampled proportional to ||E(i,:)||^2. Then E[||A - π_{span(L ∪ S), k}(A)||_F^2] ≤ ||A - A_k||_F^2 + (k/s) ||E||_F^2, where π_{·,k}(A) is the optimal rank-k projection of A onto the subspace. The proof is similar to the derivation of the bound on SampleCols.
- Slide 45
- Analysis: AdaSample. Proof sketch. (As in the lemma: A is m x n, L a linear subspace, E = A - π_L(A), and S a collection of s rows of A sampled proportional to ||E(i,:)||^2.) Applying this lemma T times: E[||A - Ã||_F^2] ≤ (1 / (1 - k/s)) ||A - A_k||_F^2 + (k/s)^T ||A||_F^2.
- Slide 46
- Running Time: AdaSample. At iteration t, we need to: 1. Extend the orthonormal basis for S to an orthonormal basis for S ∪ S_t. Orthogonalizing s new vectors against the st existing orthogonal n x 1 vectors: O(nts^2). 2. Project A onto the new portion of the basis spanned by S_t: O(mns). So the iterative process takes O(mnsT + ns^2T^2). Finally, we need to compute singular vectors to obtain our final rank-k basis, taking O(ms^2T^2). Total running time: O(mnsT + s^2T^2(m + n)).
- Slide 47
- Comparing the algorithms.
Method            | Running time             | # Passes | Error w.h.p.
SampleCols        | O(mn + ms^2)             | 2        | (see above)
SampleRowsCols    | O(mn + s^3)              | 2        | (see above)
AdaSample         | O(mnsT + s^2T^2(m + n))  | 2T       | (see above)
RandProj          | ?                        | ?        | ?
Exact partial SVD | O(mnk)                   | O(k)     | (optimal)
- Slide 48
- Recap so far. The previous algorithms attempted to capture range(A) by (intelligently) sampling rows and/or columns. What if, instead, we probed A with a variety of vectors to get a feel for range(A)?
- Slide 49
- RandProj: Geometric intuition.
- Slide 50
- RandProj. Input: matrix A (m x n), target rank k, number of samples s. Output: Ã, which approximately solves the fixed-rank problem. 1. Draw a random Gaussian test matrix Ω (n x s). 2. Form Y = A Ω. 3. Compute an orthonormal basis Q for the range of Y via SVD. 4. Return Ã = Q Q^T A.
[Halko, Martinsson & Tropp, 2010]
- Slide 51
- Running time: RandProj. 1. Draw a random test matrix Ω. 2. Form Y = A Ω: O(mns). 3. Compute an orthonormal basis Q for the range of Y via SVD: O(ms^2). 4. Return Ã = Q Q^T A. Total running time: O(mns + ms^2).
- Slide 52
- Analysis: RandProj. Partition the SVD of A into the top-k part and the tail: A = U_1 Σ_1 V_1^T + U_2 Σ_2 V_2^T. Let Ω_1 = V_1^T Ω and Ω_2 = V_2^T Ω. Then ||A - Q Q^T A||_F^2 ≤ ||Σ_2||_F^2 + ||Σ_2 Ω_2 Ω_1^+||_F^2: the optimal error, plus an extra cost proportional to the tail singular values (wasted sampling).
- Slide 53
- Analysis: RandProj. Take expectations w.r.t. Ω. Lemma (Gaussian matrix properties). Let G be a zero-mean, unit-variance Gaussian matrix of size k x s, and B, C fixed: E||B G C||_F^2 = ||B||_F^2 ||C||_F^2, and E||G^+||_F^2 = k / (s - k - 1). Lemma (Random projection error bound). E||A - Q Q^T A||_F^2 ≤ (1 + k / (s - k - 1)) ||A - A_k||_F^2.
- Slide 54
- Comparing the algorithms.
Method            | Running time             | # Passes | Error w.h.p.
SampleCols        | O(mn + ms^2)             | 2        | (see above)
SampleRowsCols    | O(mn + s^3)              | 2        | (see above)
AdaSample         | O(mnsT + s^2T^2(m + n))  | 2T       | (see above)
RandProj          | O(mns + ms^2)            | 2        | (see above)
Exact partial SVD | O(mnk)                   | O(k)     | (optimal)
- Slide 55
- RandProj refinement. Can combine RandProj with power iterations: Y = (A A^T)^q A Ω. This drives the noise from the tail n - k singular values down exponentially fast: the effective singular values become σ_i^(2q+1). But we pay in running time and number of passes: O((q+1)mns + ms^2) and 2q passes through the data.
- Slide 56
- RandProj refinements (2). Can combine RandProj with structured random matrices: subsampled random Fourier transform (SRFT) matrices. Compute Y = A Ω using the FFT in O(mn log s) instead of the standard O(mns). Bounds are difficult to prove, but SRFTs work as well as Gaussian matrices in practice.
- Slide 57
- Comparing the algorithms.
Method            | Running time             | # Passes | Error w.h.p.
SampleCols        | O(mn + ms^2)             | 2        | (see above)
SampleRowsCols    | O(mn + s^3)              | 2        | (see above)
AdaSample         | O(mnsT + s^2T^2(m + n))  | 2T       | (see above)
RandProj          | O(mns + ms^2)            | 2        | (see above)
RandProj + power  | O((q+1)mns + ms^2)       | 2q       | (see above)
Exact partial SVD | O(mnk)                   | O(k)     | (optimal)
- Slide 58
- Outline. Introduction. Linear algebra preliminaries and intuition. The algorithms: SampleCols, AdaSample, RandProj. A comparison.
- Slide 59
- Which one is best? Fix k and s, and assume the bounds are tight. (Plot: each method placed by error vs. time, with the optimal error as the baseline. SampleRowsCols: O(mn + s^3). SampleCols: O(mn + ms^2). AdaSample: O(mnsT + s^2T^2(m + n)), with T rounds of adaptive sampling. RandProj: O((q+1)mns + ms^2), with q power iterations.)
- Slide 60
- Caveats. The bounds are not tight. The relative scaling of the time-vs-error axes depends on m, n, s, k, and the singular values of A. Which is best in practice?
- Slide 61
- Experiments. Matlab implementations (each function appeared twice verbatim on the slide; one copy of each is kept, with two apparent bugs fixed and noted):

function [U,S,V] = adaSample(A,s,T)
S = []; E = A;
for t=1:T
  % sample columns proportional to squared magnitude
  p = sum(E.^2,1);   % squared column norms (the slide's code used sqrt, i.e. plain magnitudes)
  p = p / sum(p);
  idx = sample_from_weights(p,s);
  St = A(:,idx);
  S = [S St];
  % orthogonalize & project
  [Q,R] = qr(S,0);
  proj_A_on_Qt = Q*(Q'*A);
  E = A - proj_A_on_Qt;
end
% wrap up
B = Q'*A;
[U,S,V] = svds(B,s*T);
U = Q*U;

function [U,S,V] = randProj(A,s,q)
Omega = randn(size(A,2),s);
Y = A*Omega;
for i=1:q
  Y = A'*Y;
  Y = A*Y;
end
[Q,R] = qr(Y,0);
B = Q'*A;
[U0,S,V] = svds(B,s);
U = Q*U0;

function [U,S,V] = sampleColsRows(A,s)
% subsample
Y = subsample(A,s);
W = subsample(Y,s);
% SVD of W'*W (as in the backup-slide version; sqrt(S) then recovers the singular values)
[Uw,S,Vw] = svd(W'*W);
S = sqrt(S);
U = Y*Vw;
[U,R] = qr(U,0);
V = (U'*A)';
V = bsxfun(@times,V,1./sqrt(sum(V.^2)));

function [Xsub] = subsample(X,s)
% sample columns proportional to squared magnitude
p_col = sum(X.^2);   % the slide's code read sum(A.^2), referencing the wrong matrix
p_col = p_col / sum(p_col);
colidx = sample_from_weights(p_col,s);
Xsub = X(:,colidx);
Xsub = bsxfun(@times,Xsub,1./sqrt(p_col(colidx)))/sqrt(s);
- Slide 62
- Experiments. Eigenfaces from Labeled Faces in the Wild: 13,233 images, each 96x96 pixels, collected online in 2007. A is a 13233 x 9216 matrix, 975.6 MB in double precision.
- Slide 63
- Eigenfaces examples. (Figure: the image matrix factors into a k-dimensional face basis over pixels and the loadings of each image.) We want the top k = 25 eigenfaces.
- Slide 64
- Time (seconds): the exact decomposition takes ~4.5 minutes via Matlab's svds(A,25).
- Slide 65
- Time (seconds). (Plot: running times of RandProj, SampleRowsCols, and AdaSample, against the ~4 minutes of the exact method.)
- Slide 66
- log(Time). (Same plot on a log time scale: RandProj, SampleRowsCols, AdaSample, against the ~4 minutes of the exact method.)
- Slide 67
- log(Time). (Same plot, adding RandProj + power iterations for q = 1, 2, 3: 4.6 secs!)
- Slide 68
- Summary. Exact eigenfaces in 4+ minutes. Approximate eigenfaces in 4 seconds (75 random projections + 1 power iteration).
- Slide 69
- Conclusion. The classical truncated SVD is ill-suited for large datasets. Randomized algorithms allow an error-vs-computation tradeoff, require only a few passes through the data, and are simple and robust.
- Slide 70
- Thanks!
- Slide 71
- (Backup: the same randProj, sampleColsRows, and adaSample Matlab functions as on slide 61, each repeated twice verbatim on the original slide.) LFW: svds takes ~8 GB of memory, uses up to all 8 cores, and takes 263 secs for k=25 on X = 13233 x 9216 (96x96 images).
- Slide 72
- Classical Algorithm: Truncated SVD. A full SVD of an m x n matrix takes O(mn min(m,n)) time. If we only care about the top k singular values and vectors, it takes O(mnk) via the power method: 1. Multiply k random vectors by A or A^T. 2. Orthogonalize. 3. Repeat until convergence. The random vectors converge to the singular vectors.
- Slide 73
- Matrix Norms. Frobenius norm: ||X||_F^2 = Σ_ij X_ij^2 (in Matlab: sum(X(:).^2)). L2 operator norm, a.k.a. spectral norm: ||X||_2 = σ_1(X).
- Slide 74
- Formulation: the fixed-rank problem. Minimize ||A - B||_F subject to rank(B) ≤ k. The feasible set is non-convex, but a global optimum exists.
- Slide 75
- SVD Properties. The eigenvectors of A^T A are the v_i, with eigenvalues σ_i^2. Proof: A^T A = V Σ^T U^T U Σ V^T = V (Σ^T Σ) V^T. Also, the eigenvectors of A A^T are the u_i, with eigenvalues σ_i^2.
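A NumPy check of both claims, on a random example matrix:

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((7, 4))

s = np.linalg.svd(A, compute_uv=False)

# eigenvalues of A^T A are the squared singular values
evals_AtA = np.linalg.eigvalsh(A.T @ A)  # ascending order
# eigenvalues of A A^T are the same, padded with 7 - 4 = 3 zeros
evals_AAt = np.linalg.eigvalsh(A @ A.T)
```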
- Slide 76
- Analysis: SampleCols. Given any two matrices A and B, let Q be a basis of the top k left singular vectors of B. Then ||A - Q Q^T A||_F^2 ≤ ||A - A_k||_F^2 + 2 sqrt(k) ||A A^T - B B^T||_F. Proof. Easy to show by manipulating the trace definition of the Frobenius norm.
- Slide 77
- Analysis: SampleRowsCols. Apply the distortion lemma we've already seen twice: once for rows, once for columns. When Y is a sampled version of A as in SampleCols, then E||A A^T - Y Y^T||_F^2 ≤ ||A||_F^4 / s. We need one more piece of glue, relating right projections (of W) to left projections (of Y). Let q_i = Y v_i / ||Y v_i|| as in SampleRowsCols; then the q_i serve as the needed left projections.
- Slide 78
- Analysis: RandProj. Let G be a zero-mean, unit-variance Gaussian matrix of size k x s, and B, C fixed matrices of agreeable dimensions. Then E||B G C||_F^2 = ||B||_F^2 ||C||_F^2, and E||G^+||_F^2 = k / (s - k - 1).
- Slide 79
- Analysis: RandProj. Lemma (Random projection error bound). Taking expectations w.r.t. Ω and using the properties of expectations of Gaussian random matrices above, we can bound the algorithm: RandProj finds a rank-k matrix Ã such that E||A - Ã||_F^2 ≤ (1 + k / (s - k - 1)) ||A - A_k||_F^2.
- Slide 80
- References. [Frieze, Kannan & Vempala, 1998]. [Frieze, Kannan & Vempala, 2004]. [Deshpande, Rademacher, Vempala & Wang, 2006]. [Halko, Martinsson & Tropp, 2010].