Faster least squares approximation


Page 1: Faster least squares approximation

Faster least squares approximation

Tamas Sarlos

Yahoo! Research

IPAM 23 Oct 2007

http://www.ilab.sztaki.hu/~stamas

Joint work with P. Drineas, M. Mahoney, and S. Muthukrishnan.

Page 2: Faster least squares approximation

Outline

Randomized methods for

• Least squares approximation
• Lp regression
• Regularized regression
• Low rank matrix approximation

Page 3: Faster least squares approximation

Models, curve fitting, and data analysis

In MANY applications (in statistical data analysis and scientific computation), one has n observations (values of a dependent variable y measured at values of an independent variable t): (t_1, y_1), …, (t_n, y_n).

Model y(t) by a linear combination of d basis functions: y(t) ≈ x_1·φ_1(t) + x_2·φ_2(t) + … + x_d·φ_d(t).

A is an n x d “design matrix” with elements: A_ij = φ_j(t_i).

In matrix-vector notation: y ≈ A·x.

Page 4: Faster least squares approximation

Least Squares (LS) Approximation

We are interested in over-constrained L2 regression problems, n >> d.

Typically, there is no x such that Ax = b.

Want to find the “best” x such that Ax ≈ b.

Ubiquitous in applications & central to theory:

Statistical interpretation: best linear unbiased estimator.

Geometric interpretation: orthogonally project b onto span(A).

Page 5: Faster least squares approximation

Many applications of this!

• Astronomy: Predicting the orbit of the asteroid Ceres (in 1801!).

Gauss (1809) -- see also Legendre (1805) and Adrain (1808).

First application of “least squares optimization,” and it runs in O(nd^2) time!

• Bioinformatics: Dimension reduction for classification of gene expression microarray data.

• Medicine: Inverse treatment planning and fast intensity-modulated radiation therapy.

• Engineering: Finite element methods for solving the Poisson equation, etc.

• Control theory: Optimal design and control theory problems.

• Economics: Restricted maximum-likelihood estimation in econometrics.

• Image Analysis: Array signal and image processing.

• Computer Science: Computer vision, document and information retrieval.

• Internet Analysis: Filtering and de-noising of noisy internet data.

• Data analysis: Fit parameters of a biological, chemical, economic, social, internet, etc. model to experimental data.

Page 6: Faster least squares approximation

Exact solution to LS Approximation

Cholesky Decomposition: If A is full rank and well-conditioned,

decompose A^T A = R^T R, where R is upper triangular, and

solve the normal equations: R^T R x = A^T b.

QR Decomposition: Slower but numerically stable, esp. if A is rank-deficient.

Write A = QR, and solve R x = Q^T b.

Singular Value Decomposition: Most expensive, but best if A is very ill-conditioned.

Write A = U·Σ·V^T, in which case: x_OPT = A^+ b = V·Σ^+·U^T b.

Complexity is O(nd^2) for all of these, but the constant factors differ.

(Here A·x_OPT is the projection of b onto the subspace spanned by the columns of A, and A^+ is the pseudoinverse of A.)
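
As an illustration of the three solvers above, here is a minimal NumPy sketch (the random test problem and sizes are illustrative, not from the talk) checking that the Cholesky-based normal equations, QR, and the SVD pseudoinverse agree on a full-rank problem:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10                      # illustrative sizes, n >> d
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Cholesky on the normal equations: A^T A = R^T R, solve R^T R x = A^T b
R = np.linalg.cholesky(A.T @ A).T    # upper-triangular factor
x_chol = np.linalg.solve(R, np.linalg.solve(R.T, A.T @ b))

# QR: A = QR, solve R x = Q^T b
Q, R_qr = np.linalg.qr(A)
x_qr = np.linalg.solve(R_qr, Q.T @ b)

# SVD: A = U Sigma V^T, x_OPT = A^+ b = V Sigma^+ U^T b
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ b) / s)

assert np.allclose(x_chol, x_qr) and np.allclose(x_qr, x_svd)
```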

Page 7: Faster least squares approximation

Reduction of LS Approximation to fast matrix multiplication

Theorem: For any n-by-d matrix A and n-dimensional vector b, where n > d, we can compute x* = A^+ b in O((n/d)·M(d)) time. Here M(d) denotes the time needed to multiply two d-by-d matrices.

• Current best: M(d) = d^ω, ω = 2.376… Coppersmith–Winograd (1990)

Proof: Ibarra et al. (1982) generalize the LUP decomposition to rectangular matrices.

Direct proof:

• Recall the normal equations A^T A x = A^T b

• Group the rows of A into blocks of d and compute A^T A as the sum of n/d square products, each d-by-d

• Solve the normal equations

• LS in O(nd^1.376) time, but impractical; in what follows we assume the exact solve costs O(nd^2).
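
A minimal sketch (not from the slides) of the blocking idea in the direct proof: A^T A is accumulated from roughly n/d products of d-by-d blocks, each of which could in principle be handled by a fast d-by-d matrix-multiplication routine:

```python
import numpy as np

def gram_blocked(A):
    """Compute A^T A as a sum of roughly n/d products of d-by-d blocks.
    Each `block.T @ block` is a d-by-d multiply; with a fast matrix-multiply
    routine these products cost M(d) each, giving O((n/d) * M(d)) overall."""
    n, d = A.shape
    G = np.zeros((d, d))
    for start in range(0, n, d):
        block = A[start:start + d]   # at most d rows of A
        G += block.T @ block
    return G

A = np.random.default_rng(1).standard_normal((1003, 8))
assert np.allclose(gram_blocked(A), A.T @ A)
```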

Page 8: Faster least squares approximation

Original (expensive) sampling algorithm

Algorithm

1. Fix a set of probabilities p_i, i = 1, …, n.

2. Pick the i-th row of A and the i-th element of b with probability

min {1, r·p_i},

and rescale by (1/min{1, r·p_i})^{1/2}.

3. Solve the induced problem.

Drineas, Mahoney, and Muthukrishnan (SODA, 2006)

• These probabilities p_i are statistical leverage scores!

• These p_i are “expensive” to compute: O(nd^2) time!

Theorem:

Let x̃_OPT be the solution of the induced (sampled) problem, and let U_(i) denote the i-th row of the matrix of left singular vectors of A.

If the p_i satisfy p_i ≥ β·||U_(i)||_2^2 / d for some β ∈ (0,1], then w.p. ≥ 1-δ,

||A·x̃_OPT - b||_2 ≤ (1+ε)·||A·x_OPT - b||_2.
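
A small sketch of this (expensive) sampling scheme, with illustrative sizes and the sample-size parameter r left free; note that computing the exact leverage scores via the SVD is itself the O(nd^2) step the slide calls expensive:

```python
import numpy as np

def leverage_sample_ls(A, b, r, rng):
    """Sample rows with probability min(1, r * p_i), where p_i = ||U_(i)||^2 / d
    are the (normalized) leverage scores, rescale by 1/sqrt(min(1, r * p_i)),
    and solve the induced least-squares problem."""
    n, d = A.shape
    U, _, _ = np.linalg.svd(A, full_matrices=False)   # O(nd^2): the expensive part
    p = (U ** 2).sum(axis=1) / d                      # leverage scores / d, sums to 1
    q = np.minimum(1.0, r * p)
    keep = rng.random(n) < q
    w = 1.0 / np.sqrt(q[keep])
    x_tilde, *_ = np.linalg.lstsq(w[:, None] * A[keep], w * b[keep], rcond=None)
    return x_tilde

rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 20))
b = A @ rng.standard_normal(20) + 0.1 * rng.standard_normal(5000)
x_tilde = leverage_sample_ls(A, b, r=500, rng=rng)
```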

Page 9: Faster least squares approximation

Fast LS via uniform sampling

Algorithm

1. Pre-process A and b with a “randomized Hadamard transform”.

2. Uniformly sample

r = O(d·log(n)·log(d·log(n))/ε)

constraints.

3. Solve the induced problem: min_x ||S·H_n D_n (A·x - b)||_2, where S samples r of the n rows.

Drineas, Mahoney, Muthukrishnan, and Sarlos (2007)

• Non-uniform sampling will work, but it is “slow.”

• Uniform sampling is “fast,” but will it work?
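
A small sketch of the algorithm above, assuming n is a power of two and forming the Hadamard matrix explicitly for clarity (a real implementation would apply it via the O(n log n) fast Walsh-Hadamard transform); the sizes and r are illustrative:

```python
import numpy as np
from scipy.linalg import hadamard

def fast_ls_uniform(A, b, r, rng):
    """Randomized Hadamard preprocessing (H_n D_n) followed by uniform sampling
    of r constraints and an exact solve on the small sampled problem."""
    n, d = A.shape                                 # n must be a power of 2 here
    D = rng.choice([-1.0, 1.0], size=n)            # random +-1 diagonal
    H = hadamard(n) / np.sqrt(n)                   # orthonormal Hadamard matrix
    HDA, HDb = H @ (D[:, None] * A), H @ (D * b)
    rows = rng.choice(n, size=r, replace=False)    # uniform sampling
    x_tilde, *_ = np.linalg.lstsq(HDA[rows], HDb[rows], rcond=None)
    return x_tilde

rng = np.random.default_rng(0)
n, d = 1024, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
x_tilde = fast_ls_uniform(A, b, r=200, rng=rng)
```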

Page 10: Faster least squares approximation

A structural lemma

Approximate min_x ||A·x - b||_2 by solving min_x ||X(A·x - b)||_2, where X is any matrix; call the solution x̃_OPT.

Let U_A be the matrix of left singular vectors of A, write b = U_A U_A^T b + b⊥, and assume that a γ-fraction of the mass of b lies in span(A).

Lemma: Assume that:

(1) σ_min^2(X·U_A) ≥ 1/√2, and

(2) ||(X·U_A)^T X·b⊥||_2^2 ≤ (ε/2)·||A·x_OPT - b||_2^2.

Then, we get relative-error approximation:

||A·x̃_OPT - b||_2 ≤ (1+ε)·||A·x_OPT - b||_2.

Page 11: Faster least squares approximation

Randomized Hadamard preprocessing

Let H_n be an n-by-n deterministic (normalized) Hadamard matrix, and let D_n be an n-by-n random diagonal matrix with ±1 entries chosen u.a.r. on the diagonal.

Fact 1: Multiplication by H_n D_n doesn’t change the solution: ||H_n D_n (A·x - b)||_2 = ||A·x - b||_2 (since H_n and D_n are orthogonal matrices).

Fact 2: Multiplication by H_n D_n is fast - only O(n log(r)) time, where r is the number of elements of the output vector we need to “touch”.

Fact 3: Multiplication by H_n D_n approximately uniformizes all leverage scores: with high probability, every row of H_n D_n U_A has squared norm O(d·log(n)/n).

Facts implicit or explicit in: Ailon & Chazelle (2006), or Ailon and Liberty (2008).
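
A small numerical illustration (not from the talk) of Facts 1 and 3: starting from a matrix whose leverage scores are maximally non-uniform, multiplying by H_n D_n preserves norms and flattens the leverage scores to about d/n:

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
n, d = 1024, 5
A = np.zeros((n, d))
A[:d] = np.eye(d)                      # first d rows carry all the leverage (scores = 1)

def leverage_scores(M):
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return (U ** 2).sum(axis=1)

H = hadamard(n) / np.sqrt(n)           # orthonormal Hadamard matrix
D = rng.choice([-1.0, 1.0], size=n)    # random +-1 diagonal
HDA = H @ (D[:, None] * A)

x = rng.standard_normal(d)
assert np.isclose(np.linalg.norm(HDA @ x), np.linalg.norm(A @ x))   # Fact 1
print(leverage_scores(A).max(), leverage_scores(HDA).max())          # 1.0 vs. ~d/n
```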

Page 12: Faster least squares approximation

A matrix perturbation theorem

Page 13: Faster least squares approximation

Establishing structural conditions

Lemma: |1 - σ_i^2(S·H·U_A)| ≤ 1/10 for all i, with constant probability,

i.e., the singular values of S·H·U_A are close to 1.

Proof: Apply a spectral norm matrix perturbation bound.

Lemma: ||(S·H·U_A)^T S·H·b⊥||_2^2 ≤ (ε/2)·OPT^2, with constant probability,

i.e., S·H·U_A is nearly orthogonal to b⊥.

Proof: Apply a Frobenius norm matrix perturbation bound.

Bottom line:
• The fast algorithm approximately uniformizes the leverage of each data point.
• Uniform sampling is then optimal up to an O(log(n)) factor, so over-sample a bit.
• Overall running time: O(nd·log(…) + d^3·log(…)/ε).

Page 14: Faster least squares approximation

Fast LS via sparse projection

Algorithm

1. Pre-process A and b with a randomized Hadamard transform.

2. Multiply the preprocessed input by a sparse random k x n projection matrix T (each entry non-zero with probability q), where

k = O(d/ε) and q = O(d·log^2(n)/n + d^2·log(n)/n).

3. Solve the induced problem.

Drineas, Mahoney, Muthukrishnan, and Sarlos (2007) - sparse projection matrix from Matousek’s version of Ailon-Chazelle 2006

• Dense projections will work, but they are “slow.”

• Sparse projection is “fast,” but will it work?

-> YES! The sparsity parameter q of T is related to the non-uniformity of the leverage scores!
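
A hedged sketch of a Matousek-style sparse projection matrix; the exact scaling used in the talk may differ, but the parameterization below (entries ±1/sqrt(k·q) with probability q/2 each, zero otherwise) at least preserves squared norms in expectation:

```python
import numpy as np
from scipy import sparse

def sparse_projection(k, n, q, rng):
    """Sparse k x n projection: each entry is +-1/sqrt(k*q) with probability q/2
    each and 0 with probability 1-q, so E[||T y||^2] = ||y||^2 for any fixed y."""
    mask = rng.random((k, n)) < q
    signs = rng.choice([-1.0, 1.0], size=(k, n))
    return sparse.csr_matrix(mask * signs / np.sqrt(k * q))

rng = np.random.default_rng(0)
T = sparse_projection(k=200, n=4096, q=0.05, rng=rng)
y = rng.standard_normal(4096)
print(np.linalg.norm(T @ y) / np.linalg.norm(y))   # close to 1
```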

Page 15: Faster least squares approximation

Lp regression problems

We are interested in over-constrained Lp regression problems, n >> d.

Lp regression problems are convex programs (or better!).

There exist poly-time algorithms.

We want to solve them faster!

Dasgupta, Drineas, Harb, Kumar, and Mahoney (2008)

Page 16: Faster least squares approximation

Lp norms and their unit balls

Recall the Lp norm for p ∈ [1, ∞): ||z||_p = (Σ_i |z_i|^p)^{1/p}.

Some inequality relationships include (for z in R^n): ||z||_∞ ≤ ||z||_2 ≤ ||z||_1 ≤ √n·||z||_2 ≤ n·||z||_∞.

Page 17: Faster least squares approximation

Solution to Lp regression

For p=2, Least-squares approximation (minimize ||Ax-b||2):

For p=∞, Chebyshev or mini-max approximation (minimize ||Ax-b||∞):

For p=1, Sum of absolute residuals approximation (minimize ||Ax-b||1):

Lp regression can be cast as a convex program for all p ∈ [1, ∞].

Page 18: Faster least squares approximation

What made the L2 result work?

The L2 sampling algorithm worked because:

• For p=2, an orthogonal basis (from SVD, QR, etc.) is a “good” or “well-conditioned” basis.

(This came for free, since orthogonal bases are the obvious choice.)

• Sampling w.r.t. the “good” basis allowed us to perform “subspace-preserving sampling.”

(This allowed us to preserve the rank of the matrix and distances in its span.)

Can we generalize these two ideas to p ≠ 2?

Page 19: Faster least squares approximation

Sufficient condition

Approximate min_x ||b - A·x||_p by solving min_x ||S(b - A·x)||_p, giving x’.

Assume that the matrix S is a (1 ± ε) isometry over the span of A and b: for any vector y in span([A b]) we have (1-ε)·||y||_p ≤ ||S·y||_p ≤ (1+ε)·||y||_p.

Then, we easily get a relative-error approximation:

||b - A·x’||_p ≤ ((1+ε)/(1-ε))·min_x ||b - A·x||_p.

(For L2 regression, a sketch with O(d/ε^2) rows suffices.)

How to construct an Lp isometry?

Page 20: Faster least squares approximation

p-well-conditioned basis (definition)

Let A be an n x m matrix of rank d << n, let p ∈ [1, ∞), and let q be its dual (1/p + 1/q = 1).

Definition: An n x d matrix U is an (α, β, p)-well-conditioned basis for span(A) if:

(1) |||U|||_p ≤ α, (where |||U|||_p = (Σ_ij |U_ij|^p)^{1/p} ), and

(2) for all z in R^d, ||z||_q ≤ β·||U·z||_p.

U is a p-well-conditioned basis if α, β = d^{O(1)}, independent of m and n.

Page 21: Faster least squares approximation

p-well-conditioned basis (existence)

Let A be an n x m matrix of rank d << n, let p ∈ [1, ∞), and let q be its dual.

Theorem: There exists an (α, β, p)-well-conditioned basis U for span(A) such that:

if p < 2, then α = d^{1/p + 1/2} and β = 1;
if p = 2, then α = d^{1/2} and β = 1;
if p > 2, then α = d^{1/p + 1/2} and β = d^{1/q - 1/2}.

U can be computed in O(nmd + nd^5·log n) time (or just O(nmd) if p = 2).

Page 22: Faster least squares approximation

p-well-conditioned basis (construction)

Algorithm:

• Let A=QR be any QR decomposition of A.

(Stop if p=2.)

• Define the norm on R^d by ||z||_{Q,p} ≡ ||Q·z||_p.

• Let C be the unit ball of the norm ||•||_{Q,p}.

• Let the d x d matrix F define the Löwner-John ellipsoid of C.

• Decompose F = G^T G,

where G is full rank and upper triangular.

• Return U = Q·G^{-1}

as the p-well-conditioned basis.

Page 23: Faster least squares approximation

Subspace-preserving sampling

Let A be an n x m matrix of rank d << n, and let p ∈ [1, ∞).

Let U be an (α, β, p)-well-conditioned basis for span(A).

Theorem: Randomly sample rows of A according to the probability distribution

p_i = min{ 1, r·||U_(i)||_p^p / |||U|||_p^p },

where r (the expected number of sampled rows) is polynomial in d, 1/ε, and log(1/δ).

Then, with probability 1-δ, the following holds for all x in R^m:

(1-ε)·||A·x||_p ≤ ||S·A·x||_p ≤ (1+ε)·||A·x||_p,

where S denotes the sampling-and-rescaling operator.
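
A sketch of the sampling step only (illustrative; it assumes a p-well-conditioned basis U for span(A) has already been computed, e.g. via QR when p = 2, and takes the target sample size r as a parameter):

```python
import numpy as np

def lp_subspace_sample(A, b, U, p, r, rng):
    """Keep row i independently with probability
    p_i = min(1, r * ||U_(i)||_p^p / |||U|||_p^p) and rescale by 1/p_i^(1/p),
    so that ||S(Ax - b)||_p estimates ||Ax - b||_p."""
    row_mass = (np.abs(U) ** p).sum(axis=1)          # ||U_(i)||_p^p
    probs = np.minimum(1.0, r * row_mass / row_mass.sum())
    keep = rng.random(A.shape[0]) < probs
    scale = (1.0 / probs[keep]) ** (1.0 / p)
    return scale[:, None] * A[keep], scale * b[keep]

rng = np.random.default_rng(0)
A = rng.standard_normal((20000, 10))
b = rng.standard_normal(20000)
U, _ = np.linalg.qr(A)                               # well-conditioned basis for p = 2
SA, Sb = lp_subspace_sample(A, b, U, p=2, r=800, rng=rng)
x_tilde, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
```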

Page 24: Faster least squares approximation

Regularized regression

• If you don’t trust your data (entirely).

• Minimize ||b - A·x||_p + λ·||x||_r, where r is arbitrary.

• The Lasso is especially popular: min_x ||b - A·x||_2 + λ·||x||_1.

• Sampled Lp regression works for the regularized version too.

• We want more: imagine

Page 25: Faster least squares approximation

Ridge regression

• Tikhonov regularization, back to L2.

• Minimize ||b - A·x||_2^2 + λ^2·||x||_2^2.

• Equivalent to augmenting the n-by-d matrix A with λ·I_d and b with d zeros.

• SVD of A: A = U·Diag(sqrt(λ_i))·V^T, where the λ_i are the eigenvalues of A^T A.

• Let D = Σ_i sqrt(λ_i) << d.

• For (1+ε) relative error:
• precondition with a randomized FFT,
• sample r = O(D·log(n)·log(D·log(n))/ε) rows uniformly.
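
A minimal sketch of the augmentation equivalence noted above (the sizes and λ are arbitrary): ridge regression is solved as ordinary least squares on A stacked with λ·I_d and b padded with d zeros, and the result matches the closed-form solution:

```python
import numpy as np

def ridge_via_augmentation(A, b, lam):
    """Solve min ||b - Ax||_2^2 + lam^2 ||x||_2^2 by appending lam*I_d to A
    and d zeros to b, then calling an ordinary least-squares solver."""
    n, d = A.shape
    A_aug = np.vstack([A, lam * np.eye(d)])
    b_aug = np.concatenate([b, np.zeros(d)])
    x, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 10))
b = rng.standard_normal(500)
x = ridge_via_augmentation(A, b, lam=0.5)
# agrees with the closed form (A^T A + lam^2 I)^(-1) A^T b
assert np.allclose(x, np.linalg.solve(A.T @ A + 0.25 * np.eye(10), A.T @ b))
```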

Page 26: Faster least squares approximation

SVD of a matrix

U (V): orthogonal matrix containing the left (right) singular vectors of A.

Σ: diagonal matrix containing the singular values of A, ordered non-increasingly.

ρ: the rank of A, the number of non-zero singular values.

Exact computation of the SVD takes O(min{mn^2, m^2·n}) time.

Any m x n matrix A can be decomposed as: A = U·Σ·V^T.

Page 27: Faster least squares approximation

SVD and low-rank approximations

U_k (V_k): orthogonal matrix containing the top k left (right) singular vectors of A.

Σ_k: diagonal matrix containing the top k singular values of A.

• A_k is the “best” matrix among all rank-k matrices with respect to the spectral and Frobenius norms.

• This optimality property is very useful in, e.g., Principal Component Analysis (PCA), LSI, etc.

Truncate the SVD by keeping the top k ≤ ρ terms: A_k = U_k·Σ_k·V_k^T.

Page 28: Faster least squares approximation

Approximate SVD

• Starting with the seminal work of Frieze et al. (1998) and Papadimitriou et al. (1998), until 2005 several results of the form

||A - A_k’||_F ≤ ||A - A_k||_F + ε·||A||_F

• The additive term ε·||A||_F might be significantly larger than ||A - A_k||_F.

• Four independent results [H-P, DV, MRT, S 06] on

||A - A_k’||_F ≤ (1+ε)·||A - A_k||_F

• This year: Shyamalkumar and Deshpande …

Page 29: Faster least squares approximation

Relative error SVD via random projections

Theorem: Let S be an r x m random matrix with ±1 entries, where r = O(k/ε). Project A onto the subspace spanned by the rows of S·A and then compute A_k’ as the rank-k SVD of the projection. Then ||A - A_k’||_F ≤ (1+ε)·||A - A_k||_F with probability at least 1/3.

• By keeping an orthonormal basis of S·A, the total time is O(M·r + (n+m)·r^2), where M is the number of non-zeros in A.

• We need to make only 2 passes over the data.

Can use fast random projections, avoid projecting onto S·A, and get spectral norm bounds (talks by R and T).
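
A two-pass NumPy sketch of the theorem above (r = k/ε and the test matrix are illustrative choices): project the rows of A onto the row space of S·A, then take the rank-k SVD of the projection:

```python
import numpy as np

def randomized_low_rank(A, k, eps, rng):
    """Pass 1: form SA for a random +-1 matrix S with r = O(k/eps) rows.
    Pass 2: project the rows of A onto the row space of SA (via an orthonormal
    basis Q of that space) and return the best rank-k part of the projection."""
    m, n = A.shape
    r = int(np.ceil(k / eps))
    S = rng.choice([-1.0, 1.0], size=(r, m))
    Q, _ = np.linalg.qr((S @ A).T)              # orthonormal basis of rowspace(SA)
    P = (A @ Q) @ Q.T                           # rows of A projected onto that space
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]          # rank-k SVD of the projection

rng = np.random.default_rng(0)
A = rng.standard_normal((300, 40)) @ rng.standard_normal((40, 200))   # nearly low rank
A_k_approx = randomized_low_rank(A, k=10, eps=0.5, rng=rng)
```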

Page 30: Faster least squares approximation

SVD proof sketch

Can show:

||A - A_k’||_F^2 ≤ ||A - A_k||_F^2 + ||A_k - A_k·(S·A_k)^+·(S·A)||_F^2

For each column j of A, consider the regression A^(j) ≈ A_k·x:

Exact solution: A_k^(j), with error ||A^(j) - A_k^(j)||_2.

Approximate solution: A_k·(S·A_k)^+·S·A^(j).

Page 31: Faster least squares approximation

Conclusion

Fast methods based on randomized preprocessing and sampling or sparse projections for:

• Least squares approximation
• Regularized regression
• Low rank matrix approximation

Page 32: Faster least squares approximation

Thank you!