Fast Algorithms for Analyzing Massive Data


Transcript of Fast Algorithms for Analyzing Massive Data

Page 1: Fast Algorithms  for Analyzing Massive Data

Fast Algorithms for Analyzing Massive Data

Alexander Gray, Georgia Institute of Technology

www.fast-lab.org

Page 2: Fast Algorithms  for Analyzing Massive Data

The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory

www.fast-lab.org

1. Alexander Gray: Assoc Prof, Applied Math + CS; PhD CS

2. Arkadas Ozakin: Research Scientist, Math + Physics; PhD Physics

3. Dongryeol Lee: PhD student, CS + Math
4. Ryan Riegel: PhD student, CS + Math
5. Sooraj Bhat: PhD student, CS
6. Nishant Mehta: PhD student, CS
7. Parikshit Ram: PhD student, CS + Math
8. William March: PhD student, Math + CS
9. Hua Ouyang: PhD student, CS
10. Ravi Sastry: PhD student, CS
11. Long Tran: PhD student, CS
12. Ryan Curtin: PhD student, EE
13. Ailar Javadi: PhD student, EE
14. Anita Zakrzewska: PhD student, CS

+ 5-10 MS students and undergraduates

Page 3: Fast Algorithms  for Analyzing Massive Data

7 tasks of machine learning / data mining

1. Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N^2)

2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)

3. Classification: decision tree, nearest-neighbor classifier O(N^2), kernel discriminant analysis O(N^2), support vector machine O(N^3), Lp SVM

4. Regression: linear regression, LASSO, kernel regression O(N^2), Gaussian process regression O(N^3)

5. Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3); Gaussian graphical models, discrete graphical models

6. Clustering: k-means, mean-shift O(N^2), hierarchical (FoF) clustering O(N^3)

7. Testing and matching: MST O(N^3), bipartite cross-matching O(N^3), n-point correlation 2-sample testing O(N^n), kernel embedding

Page 4: Fast Algorithms  for Analyzing Massive Data

7 tasks of machine learning / data mining

1. Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N^2)

2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)

3. Classification: decision tree, nearest-neighbor classifier O(N^2), kernel discriminant analysis O(N^2), support vector machine O(N^3), Lp SVM

4. Regression: linear regression, LASSO, kernel regression O(N^2), Gaussian process regression O(N^3)

5. Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3); Gaussian graphical models, discrete graphical models

6. Clustering: k-means, mean-shift O(N^2), hierarchical (FoF) clustering O(N^3)

7. Testing and matching: MST O(N^3), bipartite cross-matching O(N^3), n-point correlation 2-sample testing O(N^n), kernel embedding

Page 5: Fast Algorithms  for Analyzing Massive Data

7 tasks of machine learning / data mining

1. Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N^2)

2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3), submanifold density estimation [Ozakin & Gray, NIPS 2010] O(N^3), convex adaptive kernel estimation [Sastry & Gray, AISTATS 2011] O(N^4)

3. Classification: decision tree, nearest-neighbor classifier O(N^2), kernel discriminant analysis O(N^2), support vector machine O(N^3), Lp SVM, non-negative SVM [Guan et al, 2011]

4. Regression: linear regression, LASSO, kernel regression O(N^2), Gaussian process regression O(N^3)

5. Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3); Gaussian graphical models, discrete graphical models, rank-preserving maps [Ouyang and Gray, ICML 2008] O(N^3); isometric separation maps [Vasiloglou, Gray, and Anderson, MLSP 2009] O(N^3); isometric NMF [Vasiloglou, Gray, and Anderson, MLSP 2009] O(N^3); functional ICA [Mehta and Gray, 2009], density preserving maps [Ozakin and Gray, in prep] O(N^3)

6. Clustering: k-means, mean-shift O(N^2), hierarchical (FoF) clustering O(N^3)

7. Testing and matching: MST O(N^3), bipartite cross-matching O(N^3), n-point correlation 2-sample testing O(N^n), kernel embedding

Page 6: Fast Algorithms  for Analyzing Massive Data

7 tasks of machine learning / data mining

1. Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N^2)

2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)

3. Classification: decision tree, nearest-neighbor classifier O(N^2), kernel discriminant analysis O(N^2), support vector machine O(N^3), Lp SVM

4. Regression: linear regression, kernel regression O(N^2), Gaussian process regression O(N^3), LASSO

5. Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3), Gaussian graphical models, discrete graphical models

6. Clustering: k-means, mean-shift O(N^2), hierarchical (FoF) clustering O(N^3)

7. Testing and matching: MST O(N^3), bipartite cross-matching O(N^3), n-point correlation 2-sample testing O(N^n), kernel embedding

Computational Problem!

Page 7: Fast Algorithms  for Analyzing Massive Data

The “7 Giants” of Data (computational problem types)

[Gray, Indyk, Mahoney, Szalay, in National Acad of Sci Report on Analysis of Massive Data, in prep]

1. Basic statistics: means, covariances, etc.

2. Generalized N-body problems: distances, geometry

3. Graph-theoretic problems: discrete graphs

4. Linear-algebraic problems: matrix operations

5. Optimizations: unconstrained, convex

6. Integrations: general dimension

7. Alignment problems: dynamic prog, matching

Page 8: Fast Algorithms  for Analyzing Massive Data

7 general strategies

1. Divide and conquer / indexing (trees)

2. Function transforms (series)

3. Sampling (Monte Carlo, active learning)

4. Locality (caching)

5. Streaming (online)

6. Parallelism (clusters, GPUs)

7. Problem transformation (reformulations)

Page 9: Fast Algorithms  for Analyzing Massive Data

1. Divide and conquer

• Fastest approach for:
– nearest neighbor, range search (exact) ~O(log N) [Bentley 1970], all-nearest-neighbors (exact) O(N) [Gray & Moore, NIPS 2000], [Ram, Lee, March, Gray, NIPS 2010], anytime nearest neighbor (exact) [Ram & Gray, SDM 2012], max inner product [Ram & Gray, under review]
– mixture of Gaussians [Moore, NIPS 1999], k-means [Pelleg and Moore, KDD 1999], mean-shift clustering O(N) [Lee & Gray, AISTATS 2009], hierarchical clustering (single linkage, friends-of-friends) O(N log N) [March & Gray, KDD 2010]
– nearest neighbor classification [Liu, Moore, Gray, NIPS 2004], kernel discriminant analysis O(N) [Riegel & Gray, SDM 2008]
– n-point correlation functions ~O(N^(log n)) [Gray & Moore, NIPS 2000], [Moore et al., Mining the Sky 2000], multi-matcher jackknifed npcf [March & Gray, under review]
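A minimal sketch of the tree-based divide-and-conquer idea behind these results: build a kd-tree, then answer a nearest-neighbor query by descending the tree and pruning any node whose bounding box provably cannot contain a closer point than the best found so far. This is a generic illustration (NumPy only, names chosen for the sketch), not the FASTlab dual-tree code.

```python
import numpy as np

class Node:
    """kd-tree node: a leaf holding points, or a split cycling through dimensions."""
    def __init__(self, points, leaf_size=16, depth=0):
        self.points = points
        self.lo, self.hi = points.min(axis=0), points.max(axis=0)  # bounding box
        self.left = self.right = None
        if len(points) > leaf_size:
            d = depth % points.shape[1]
            order = points[:, d].argsort()
            mid = len(points) // 2
            self.left = Node(points[order[:mid]], leaf_size, depth + 1)
            self.right = Node(points[order[mid:]], leaf_size, depth + 1)

def min_dist_to_box(q, node):
    """Lower bound on the distance from query q to any point in node's bounding box."""
    return np.linalg.norm(np.maximum(node.lo - q, 0) + np.maximum(q - node.hi, 0))

def nn(q, node, best=(np.inf, None)):
    """Recursive nearest-neighbor search with branch-and-bound pruning."""
    if min_dist_to_box(q, node) >= best[0]:
        return best                                   # prune: this subtree cannot improve the answer
    if node.left is None:                             # leaf: scan its points
        dists = np.linalg.norm(node.points - q, axis=1)
        i = dists.argmin()
        return (dists[i], node.points[i]) if dists[i] < best[0] else best
    # visit the closer child first so the bound tightens quickly
    for child in sorted((node.left, node.right), key=lambda c: min_dist_to_box(q, c)):
        best = nn(q, child, best)
    return best

# usage
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3))
tree = Node(X)
dist, point = nn(rng.normal(size=3), tree)
```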

Page 10: Fast Algorithms  for Analyzing Massive Data

3-point correlation

(biggest previous: 20K)

VIRGO simulation data, N = 75,000,000

naïve: 5×10^9 sec. (~150 years); multi-tree: 55 sec. (exact)

n = 2: O(N)

n = 3: O(N^(log 3))

n = 4: O(N^2)
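To make the naive O(N^n) baseline concrete: the brute-force n-point computation tests every n-tuple of points against a matcher (a distance bin per pair). Below is a brute-force 3-point counter as a hedged sketch; using one [r_lo, r_hi) bin for all three pair distances is an assumption made only for illustration.

```python
import numpy as np
from itertools import combinations

def naive_3pcf_count(points, r_lo, r_hi):
    """Count triples whose three pairwise distances all lie in [r_lo, r_hi).
    Brute force over all C(N,3) triples -- the O(N^3) cost the multi-tree
    algorithm avoids; only feasible for tiny N."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))   # full pairwise distance matrix
    ok = (dist >= r_lo) & (dist < r_hi)         # which pairs satisfy the matcher
    count = 0
    for i, j, k in combinations(range(len(points)), 3):
        if ok[i, j] and ok[j, k] and ok[i, k]:
            count += 1
    return count

# usage: even a few hundred points already means millions of triples
pts = np.random.default_rng(1).uniform(size=(200, 3))
print(naive_3pcf_count(pts, 0.1, 0.2))
```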

Page 11: Fast Algorithms  for Analyzing Massive Data

3-point correlation

Timings on 10^6 points, galaxy simulation data (runtime, and speedup relative to the preceding column):

2-point correlation, 100 matchers:
  naive O(N^n) (estimated): 2.0×10^7 s
  single bandwidth [Gray & Moore 2000, Moore et al. 2000]: 352.8 s (56,000×)
  multi-bandwidth, new [March & Gray, in prep 2010]: 4.96 s (71.1×)

3-point correlation, 243 matchers:
  naive O(N^n) (estimated): 1.1×10^11 s
  single bandwidth: 891.6 s (1.23×10^8×)
  multi-bandwidth, new: 13.58 s (65.6×)

4-point correlation, 216 matchers:
  naive O(N^n) (estimated): 2.3×10^14 s
  single bandwidth: 14,530 s (1.58×10^10×)
  multi-bandwidth, new: 503.6 s (28.8×)

Page 12: Fast Algorithms  for Analyzing Massive Data

2. Function transforms

• Fastest approach for:
– Kernel estimation (low-ish dimension): dual-tree fast Gauss transforms (multipole/Hermite expansions) [Lee, Gray, Moore, NIPS 2005], [Lee and Gray, UAI 2006]
– KDE and GP (kernel density estimation, Gaussian process regression) (high-D): random Fourier functions [Lee and Gray, in prep] (see the sketch below)
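The random Fourier functions idea, in its standard Rahimi–Recht form: approximate a shift-invariant kernel with an explicit D-dimensional feature map, after which kernel sums cost O(ND) instead of O(N^2). A minimal sketch for the Gaussian kernel; the feature dimension and bandwidth below are arbitrary choices, and this is not the authors' implementation.

```python
import numpy as np

def random_fourier_features(X, D=500, bandwidth=1.0, seed=0):
    """Map X (N x d) to Z (N x D) so that Z @ Z.T approximates the Gaussian
    kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(d, D))  # frequencies from the kernel's spectral density
    b = rng.uniform(0, 2 * np.pi, size=D)               # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# usage: compare approximate and exact kernel entries on a small sample
X = np.random.default_rng(2).normal(size=(200, 5))
Z = random_fourier_features(X, D=2000)
K_approx = Z @ Z.T
K_exact = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1) / 2.0)
print(np.abs(K_approx - K_exact).max())   # shrinks as D grows
```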

Page 13: Fast Algorithms  for Analyzing Massive Data

3. Sampling

• Fastest approach for (approximate):
– PCA: cosine trees [Holmes, Gray, Isbell, NIPS 2008]
– Kernel estimation: bandwidth learning [Holmes, Gray, Isbell, NIPS 2006], [Holmes, Gray, Isbell, UAI 2007], Monte Carlo multipole method (with SVD trees) [Lee & Gray, NIPS 2009]
– Nearest-neighbor: distance-approximate: spill trees with random projections [Liu, Moore, Gray, Yang, NIPS 2004]; rank-approximate: [Ram, Ouyang, Gray, NIPS 2009]

Rank-approximate NN:
• Best meaning-retaining approximation criterion in the face of high-dimensional distances
• More accurate than LSH
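The sampling strategy in its simplest form: estimate a kernel density sum from a uniform subsample of the reference points, with a standard error that indicates when enough samples have been drawn. This is a generic Monte Carlo sketch (uniform sampling, fixed sample size), not the cited cosine-tree or Monte Carlo multipole methods.

```python
import numpy as np

def mc_kernel_density(query, refs, bandwidth=0.5, n_samples=1000, seed=0):
    """Monte Carlo estimate of (1/N) * sum_j exp(-||q - x_j||^2 / (2 h^2))
    from a uniform subsample of the reference set; returns (estimate, std_error)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(refs), size=n_samples, replace=True)
    vals = np.exp(-np.sum((refs[idx] - query) ** 2, axis=1) / (2 * bandwidth ** 2))
    return vals.mean(), vals.std(ddof=1) / np.sqrt(n_samples)

# usage: a million reference points, but only 1000 kernel evaluations
refs = np.random.default_rng(3).normal(size=(1_000_000, 2))
est, se = mc_kernel_density(np.zeros(2), refs)
print(f"density ~ {est:.4f} +/- {se:.4f}")
```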

Page 14: Fast Algorithms  for Analyzing Massive Data

3. Sampling

• Active learning: the sampling can depend on previous samples
– Linear classifiers: rigorous framework for pool-based active learning [Sastry and Gray, AISTATS 2012]

• Empirically allows reduction in the number of objects that require labeling

• Theoretical rigor: unbiasedness
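For flavor, a pool-based active learning loop with plain uncertainty sampling using scikit-learn: train on the labeled set, query the pool point the current model is least certain about, repeat. The cited AISTATS 2012 framework adds the importance weighting that yields unbiasedness; this sketch (on synthetic data) shows only the label-efficient loop.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
y = (X[:, 0] + 0.3 * rng.normal(size=5000) > 0).astype(int)   # synthetic labels

labeled = list(rng.choice(5000, size=20, replace=False))       # small seed set
pool = [i for i in range(5000) if i not in set(labeled)]

for _ in range(10):
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[pool])[:, 1]
    pick = pool[int(np.argmin(np.abs(probs - 0.5)))]           # least-certain pool point
    labeled.append(pick)
    pool.remove(pick)

print("labels used:", len(labeled), "accuracy:", clf.score(X, y))
```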

Page 15: Fast Algorithms  for Analyzing Massive Data

4. Caching

• Fastest approach for (using disk):
– Nearest-neighbor, 2-point: disk-based tree algorithms in Microsoft SQL Server [Riegel, Aditya, Budavari, Gray, in prep]
• Builds kd-tree on top of built-in B-trees
• Fixed-pass algorithm to build kd-tree

No. of points   MLDB (dual-tree)   Naive
40,000          8 seconds          159 seconds
200,000         43 seconds         3,480 seconds
2,000,000       297 seconds        80 hours
10,000,000      29 min 27 sec      74 days
20,000,000      58 min 48 sec      280 days
40,000,000      112 min 32 sec     2 years

Page 16: Fast Algorithms  for Analyzing Massive Data

5. Streaming / online

• Fastest approach for (approximate, or streaming):
– Online learning / stochastic optimization: just use the current sample to update the gradient (a plain SGD baseline is sketched after this list)
• SVM (squared hinge loss): stochastic Frank-Wolfe [Ouyang and Gray, SDM 2010]
• SVM, LASSO, et al.: noise-adaptive stochastic approximation [Ouyang and Gray, in prep, on arxiv], accelerated non-smooth SGD [Ouyang and Gray, under review]

– faster than SGD

– solves step size problem

– beats all existing convergence rates
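The online-learning idea in its most basic form: a Pegasos-style SGD for a linear SVM that touches one example per update, so each step's cost is independent of N. This uses the plain hinge loss and the standard 1/(lam*t) step size as a baseline sketch; it is not the stochastic Frank-Wolfe or noise-adaptive methods cited above.

```python
import numpy as np

def pegasos_svm(X, y, lam=1e-4, epochs=5, seed=0):
    """Pegasos-style SGD for a linear SVM (hinge loss + (lam/2)||w||^2),
    one randomly drawn example per update. Labels y must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(N):
            t += 1
            eta = 1.0 / (lam * t)
            if y[i] * (X[i] @ w) < 1:                 # margin violated: move toward this example
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:
                w = (1 - eta * lam) * w
            norm = np.linalg.norm(w)                  # optional projection onto a ball
            if norm > 1.0 / np.sqrt(lam):             # known to contain the optimum
                w *= 1.0 / (np.sqrt(lam) * norm)
    return w

# usage on synthetic data
rng = np.random.default_rng(4)
X = rng.normal(size=(20000, 20))
y = np.sign(X @ rng.normal(size=20) + 0.1 * rng.normal(size=20000))
w = pegasos_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```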

Page 17: Fast Algorithms  for Analyzing Massive Data

6. Parallelism

• Fastest approach for (using many machines):
– KDE, GP, n-point: distributed trees [Lee and Gray, SDM 2012], 6000+ cores; [March et al., in prep for Gordon Bell Prize 2012], 100K cores?
• Each process owns the global tree and its local tree
• First log p levels built in parallel; each process determines where to send data
• Asynchronous averaging; provable convergence

– SVM, LASSO, et al.: distributed online optimization [Ouyang and Gray, in prep, on arxiv]

• Provable theoretical speedup for the first time
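A rough single-machine analogue of the distributed-tree pattern: shard the reference points, build one spatial tree per shard (on p machines in the real systems), and answer a query by querying every shard's tree and keeping the best hit. This sketch uses scipy's cKDTree and omits the parallel construction of the top log p levels and the data routing described above.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(5)
X = rng.normal(size=(1_000_000, 3))

# "Each process owns ... its local tree": here, p shards with one kd-tree each.
p = 8
shards = np.array_split(X, p)
trees = [cKDTree(s) for s in shards]          # in a real system, built on p machines

def sharded_nn(q):
    """Query every shard's tree and merge: the global NN is the best local NN."""
    best_d, best_pt = np.inf, None
    for shard, tree in zip(shards, trees):
        d, i = tree.query(q)                   # local nearest neighbor in this shard
        if d < best_d:
            best_d, best_pt = d, shard[i]
    return best_d, best_pt

print(sharded_nn(np.zeros(3)))
```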

Page 18: Fast Algorithms  for Analyzing Massive Data

7. Transformations between problems

• Change the problem type:
– Linear algebra on kernel matrices → N-body inside conjugate gradient [Gray, TR 2004] (see the sketch at the end of this list)
– Euclidean graphs → N-body problems [March & Gray, KDD 2010]
– HMM as graph matrix factorization [Tran & Gray, in prep]

• Optimizations: reformulate the objective and constraints:
– Maximum variance unfolding: SDP via Burer-Monteiro convex relaxation [Vasiloglou, Gray, Anderson, MLSP 2009]
– Lq SVM, 0<q<1: DC programming [Guan & Gray, CSDA 2011]
– L0 SVM: mixed integer nonlinear program via perspective cuts [Guan & Gray, under review]
– Do reformulations automatically [Agarwal et al, PADL 2010], [Bhat et al, POPL 2012]

• Create new ML methods with desired computational properties:
– Density estimation trees: nonparametric density estimation, O(N log N) [Ram & Gray, KDD 2011]
– Local linear SVMs: nonlinear classification, O(N log N) [Sastry & Gray, under review]
– Discriminative local coding: nonlinear classification, O(N log N) [Mehta & Gray, under review]
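The first transformation in miniature: a kernel-matrix linear solve (as in Gaussian process regression) needs only matrix-vector products K·v, which conjugate gradient consumes one at a time, and each such product is a generalized N-body sum that a fast summation method can accelerate. The sketch below forms K densely just to show the structure; scipy's LinearOperator and cg are the assumed tools, and the data and hyperparameters are arbitrary.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(6)
X = rng.normal(size=(1500, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=1500)
h, lam = 1.0, 1e-2

# Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 h^2)), materialized
# here only for illustration; a fast N-body summation would compute K @ v
# directly from the points without ever forming K.
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / (2 * h * h))

# CG only ever sees the matvec, so the O(N^2) product is the part to replace.
A = LinearOperator(K.shape, matvec=lambda v: K @ v + lam * v, dtype=np.float64)
alpha, info = cg(A, y)                 # solve (K + lam*I) alpha = y, GP-regression style
print("CG converged:", info == 0)
```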

Page 19: Fast Algorithms  for Analyzing Massive Data

Software

• For academic use only: MLPACK

– Open source, C++, written by students

– Data must fit in RAM: distributed in progress

• For institutions: Skytree Server
– First commercial-grade high-performance machine learning server

– Fastest, biggest ML available: up to 10,000x faster than existing solutions (on one machine)

– V.12, April 2012-ish: distributed, streaming

– Connects to stats packages, Matlab, DBMS, Python, etc

– www.skytreecorp.com

– Colleagues: Email me to try it out: [email protected]