Data-Intensive Statistical Challenges in Astrophysics

Alex Szalay
The Johns Hopkins University

Collaborators: T. Budavari, C-W Yip (JHU), M. Mahoney (Stanford), I. Csabai, L. Dobos (Hungary)

The Age of Surveys

CMB Surveys (pixels)
• 1990 COBE 1000
• 2000 Boomerang 10,000
• 2002 CBI 50,000
• 2003 WMAP 1 Million
• 2008 Planck 10 Million

Galaxy Redshift Surveys (obj)
• 1986 CfA 3500
• 1996 LCRS 23000
• 2003 2dF 250000
• 2008 SDSS 1000000
• 2012 BOSS 2000000
• 2012 LAMOST 2500000

Angular Galaxy Surveys (obj)
• 1970 Lick 1M
• 1990 APM 2M
• 2005 SDSS 200M
• 2011 PS1 1000M
• 2020 LSST 30000M

Time Domain
• QUEST
• SDSS Extension survey
• Dark Energy Camera
• Pan-STARRS
• LSST…

Petabytes/year …

Sloan Digital Sky Survey

• “The Cosmic Genome Project”
• Two surveys in one
– Photometric survey in 5 bands
– Spectroscopic redshift survey
• Data is public
– 2.5 Terapixels of images => 5 Tpx
– 10 TB of raw data => 120TB processed
– 0.5 TB catalogs => 35TB in the end
• Started in 1992, finished in 2008
• Extra data volume enabled by
– Moore’s Law
– Kryder’s Law

Analysis of Galaxy Spectra

• Sparse signal in large dimensions
• Much noise, and very rare events
• 4Kx1M SVD problem, perfect for randomized algorithms
• Motivated our work on robust incremental PCA
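To illustrate why a tall-and-wide SVD problem suits randomized algorithms, here is a minimal sketch of a randomized range finder followed by a small exact SVD. It is a generic textbook scheme, not the code used for the SDSS spectra; the function name `randomized_svd`, the oversampling and power-iteration parameters, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_power_iter=2, seed=0):
    """Approximate the top-k SVD of A using a random range finder (illustrative)."""
    rng = np.random.default_rng(seed)
    n_cols = A.shape[1]
    # Random test matrix: sample the column space of A with k + oversample probes.
    omega = rng.standard_normal((n_cols, k + oversample))
    Y = A @ omega
    # Power iterations sharpen the captured subspace when singular values decay slowly.
    for _ in range(n_power_iter):
        Y, _ = np.linalg.qr(Y)
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)        # orthonormal basis for the approximate range of A
    B = Q.T @ A                   # small (k + oversample) x n_cols matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

# Toy stand-in for the spectra matrix: a noisy rank-5 matrix (assumed data).
rng = np.random.default_rng(1)
A = rng.standard_normal((400, 5)) @ rng.standard_normal((5, 2000))
A += 0.01 * rng.standard_normal(A.shape)

U, s, Vt = randomized_svd(A, k=5)
s_exact = np.linalg.svd(A, compute_uv=False)[:5]
rel_err = np.max(np.abs(s - s_exact) / s_exact)
```

The expensive steps are a handful of matrix products with A, which is what makes this kind of scheme attractive when A is too large for a full decomposition.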

Galaxy Properties from Galaxy Spectra

[Figure labels: Continuum Emissions; Spectral Lines]

Galaxy Diversity from PCA

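[Figure: the first five principal components (PCs) of galaxy spectra, labeled in order]
1st PC: [Average Spectrum]
2nd PC: [Stellar Continuum]
3rd PC: [Finer Continuum Features + Age]
4th PC: [Age] Balmer series hydrogen lines
5th PC: [Metallicity] Mg b, Na D, Ca II Triplet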

Streaming PCA

• Initialization
– Eigensystem of a small, random subset
– Truncate at p largest eigenvalues
• Incremental updates
– Mean and the low-rank A matrix
– SVD of A yields new eigensystem
• Randomized algorithm!

T. Budavari, D. Mishin 2011
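The initialization and update steps can be sketched as follows. This is a plain (non-robust) version under stated assumptions: the class name `StreamingPCA`, the forgetting factor `gamma`, and the synthetic data are hypothetical and not taken from the slides. The low-rank matrix A is built so that A Aᵀ equals the down-weighted old covariance plus the contribution of the new centered vector.

```python
import numpy as np

class StreamingPCA:
    """Hypothetical helper: rank-p eigensystem updated one observation at a time."""

    def __init__(self, X_init, p, gamma=0.995):
        # Initialization: eigensystem of a small subset, truncated at the p largest eigenvalues.
        self.p, self.gamma = p, gamma
        self.mean = X_init.mean(axis=0)
        _, s, Vt = np.linalg.svd(X_init - self.mean, full_matrices=False)
        self.eigvecs = Vt[:p].T                   # shape (d, p)
        self.eigvals = s[:p] ** 2 / len(X_init)   # eigenvalues of the sample covariance

    def update(self, x):
        g = self.gamma
        # Incremental update of the mean (gamma is an assumed forgetting factor).
        self.mean = g * self.mean + (1.0 - g) * x
        y = x - self.mean
        # Low-rank matrix A with A @ A.T = g * E diag(eigvals) E.T + (1 - g) * y y.T
        A = np.column_stack([self.eigvecs * np.sqrt(g * self.eigvals),
                             np.sqrt(1.0 - g) * y])
        # The SVD of the d x (p + 1) matrix A yields the new eigensystem.
        U, w, _ = np.linalg.svd(A, full_matrices=False)
        self.eigvecs = U[:, :self.p]
        self.eigvals = w[:self.p] ** 2

# Synthetic stream (assumed data): two dominant directions in 20 dimensions plus noise.
rng = np.random.default_rng(0)
d = 20
basis, _ = np.linalg.qr(rng.standard_normal((d, 2)))
coeffs = rng.standard_normal((3000, 2)) * np.array([5.0, 2.0])
X = coeffs @ basis.T + 0.1 * rng.standard_normal((3000, d))

pca = StreamingPCA(X[:50], p=2)
for x in X[50:]:
    pca.update(x)
alignment = abs(pca.eigvecs[:, 0] @ basis[:, 0])   # near 1 if the leading direction is found
```

Each update costs one SVD of a d × (p + 1) matrix, independent of how many observations have already been seen.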

Robust PCA

• PCA minimizes σ_RMS of the residuals r = y – Py
– Quadratic formula: r² extremely sensitive to outliers
• We optimize a robust M-scale σ² (Maronna 2005)
– Implicitly given by $\frac{1}{N} \sum_{n=1}^{N} \rho\left(r_n^2 / \sigma^2\right) = \delta$
• Fits in with the iterative method!
• Outliers can be processed separately
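An M-scale is usually defined implicitly: σ² is the value for which the average of a bounded loss ρ(r²/σ²) equals a constant δ. The sketch below solves that equation by fixed-point iteration. The particular loss ρ(t) = t/(1+t), the value δ = 0.5, and the function names are assumptions chosen for illustration, not the choices made in the talk, and the result is not rescaled to match a Gaussian standard deviation.

```python
import numpy as np

def rho(t):
    """Bounded loss used for illustration only: rises from 0 to 1 as t grows."""
    return t / (1.0 + t)

def m_scale_sq(residuals, delta=0.5, tol=1e-10, max_iter=1000):
    """Solve mean(rho(r^2 / sigma^2)) = delta for sigma^2 by fixed-point iteration.

    Assumes more than a fraction delta of the residuals are nonzero.
    """
    r2 = np.asarray(residuals, dtype=float) ** 2
    sigma2 = np.median(r2)                      # robust starting guess
    if sigma2 <= 0.0:
        raise ValueError("too many zero residuals for this sketch")
    for _ in range(max_iter):
        new = sigma2 * np.mean(rho(r2 / sigma2)) / delta
        converged = abs(new - sigma2) <= tol * sigma2
        sigma2 = new
        if converged:
            break
    return sigma2

# Assumed toy data: unit-variance residuals with 5% gross outliers.
rng = np.random.default_rng(0)
r = rng.standard_normal(1000)
r[:50] = 100.0
rms_sq = np.mean(r ** 2)          # classic quadratic scale, dominated by the outliers
robust_sq = m_scale_sq(r)         # M-scale, barely affected by them
```

Because ρ is bounded, each outlier contributes at most 1 to the average no matter how large it is, which is what limits its influence compared with the quadratic r².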

Eigenvalues in Streaming PCA

[Figure panels: Classic; Robust]

Examples with SDSS Spectra

Built on top of the Incremental Robust PCA

• Principal Component Pursuit (I. Csabai et al)
• Importance sampling (C-W Yip et al)

Principal component pursuit

• Low rank approximation of data matrix: X
• Standard PCA: $\min \|X - E\|_2$ subject to $\mathrm{rank}(E) \le k$
– works well if the noise distribution is Gaussian
– outliers can cause bias
• Principal component pursuit: $\min \|A\|_0$ subject to $X = N + A,\ \mathrm{rank}(N) \le k$ (N: low-rank part, A: sparse part)
– “sparse” spiky noise/outliers: try to minimize the number of outliers while keeping the rank low
– NP-hard problem
• The L1 trick: $\min_{N,A} \|N\|_* + \lambda \|A\|_1$ subject to $X = N + A$
– with noise: $\min_{N,A} \|N\|_* + \lambda \|A\|_1$ subject to $\|X - (N + A)\|_2 \le \varepsilon$
– numerically feasible convex problem (Augmented Lagrange Multiplier)

* E. Candes, et al. “Robust Principal Component Analysis”. preprint, 2009. Abdelkefi et al. ACM CoNEXT Workshop (traffic anomaly detection)
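A minimal sketch of the convex relaxation, solved with an inexact augmented Lagrange multiplier loop: the data matrix X is split into a low-rank part N (nuclear norm) and a sparse part A (L1 norm) with X = N + A. The function names, the default λ = 1/sqrt(max(m, n)), the step schedule, and the toy data are assumptions made for illustration; the noise-tolerant variant with ε is not implemented here.

```python
import numpy as np

def soft_threshold(M, tau):
    """Entrywise shrinkage: proximal operator of tau * L1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def singular_value_threshold(M, tau):
    """Shrink the singular values: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

def pcp_alm(X, lam=None, tol=1e-7, max_iter=500):
    """Illustrative inexact ALM for: min ||N||_* + lam * ||A||_1  s.t.  X = N + A."""
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))      # common default, assumed here
    norm_two = np.linalg.norm(X, 2)
    Y = X / max(norm_two, np.abs(X).max() / lam)   # Lagrange multiplier
    mu = 1.25 / norm_two                           # penalty weight, grown each step
    A = np.zeros_like(X)
    x_norm = np.linalg.norm(X)
    for _ in range(max_iter):
        N = singular_value_threshold(X - A + Y / mu, 1.0 / mu)   # low-rank update
        A = soft_threshold(X - N + Y / mu, lam / mu)             # sparse update
        R = X - N - A                                            # constraint violation
        Y = Y + mu * R
        mu *= 1.5
        if np.linalg.norm(R) <= tol * x_norm:
            break
    return N, A

# Assumed toy data: rank-2 matrix plus 5% large spikes.
rng = np.random.default_rng(0)
n = 100
N_true = rng.standard_normal((n, 2)) @ rng.standard_normal((2, n))
mask = rng.random((n, n)) < 0.05
A_true = mask * rng.choice([-10.0, 10.0], size=(n, n))
X = N_true + A_true

N_hat, A_hat = pcp_alm(X)
fit_err = np.linalg.norm(X - N_hat - A_hat) / np.linalg.norm(X)
rec_err = np.linalg.norm(N_hat - N_true) / np.linalg.norm(N_true)
```

Both updates have closed forms (one SVD and one entrywise shrinkage per iteration), which is why the relaxed problem is numerically feasible where the rank and L0 formulation is not.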

• Slowly varying continuum + absorption lines
• Highly variable “sparse” emission lines
• This is the simple version of PCP: the positions of the lines are known
• but there are many of them, automatic detection can be useful
• spiky noise can bias standard PCA

DATA: Streaming robust PCA implementation for galaxy spectrum catalog (L. Dobos et al.)
• SDSS 1M galaxy spectra
• Morphological subclasses
• Robust averages + first few PCA directions

Testing on Galaxy Spectra

[Figure panels]
PCA: PCA reconstruction; Residual
Principal component pursuit: Low rank; Sparse; Residual

λ=0.6/sqrt(n), ε=0.03

Not Every Data Direction is Equal

A = C X

[Figure: A (Galaxy ID × Wavelength) = C (Galaxy ID × Selected Wavelengths) · X (Selected Wavelengths × Wavelength)]

Procedure:
1. Perform SVD of $A = U \Sigma V^T$
2. Pick number of eigenvectors = K
3. Calculate Leverage Score = $\sum_{i=1}^{K} \|V^T_{ij}\|^2 / K$

Mahoney and Drineas 2009
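The three-step procedure can be sketched directly with NumPy. The function names are hypothetical, and the column sampling shown here (c distinct columns drawn with probability proportional to the leverage score) is a simplification assumed for illustration rather than the exact sampling rule of the cited work.

```python
import numpy as np

def leverage_scores(A, k):
    """Leverage score of each column j: sum over the top-k right singular vectors of V^T[i, j]^2, divided by k."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return np.sum(Vt[:k] ** 2, axis=0) / k

def sample_columns(A, k, c, seed=0):
    """Pick c distinct columns with probability proportional to leverage (assumed scheme)."""
    p = leverage_scores(A, k)
    p = p / p.sum()                      # guard against rounding in the normalization
    rng = np.random.default_rng(seed)
    cols = np.sort(rng.choice(A.shape[1], size=c, replace=False, p=p))
    return cols, A[:, cols]

# Rank-1 check: for A = u v^T the scores are v_j^2 / ||v||^2.
A1 = np.outer([1.0, 2.0, 3.0], [3.0, 0.0, 4.0])
scores1 = leverage_scores(A1, k=1)       # [0.36, 0.0, 0.64]

# Assumed toy "galaxy x wavelength" matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 30))
scores = leverage_scores(A, k=4)
cols, C = sample_columns(A, k=4, c=7)
```

Because the top-K right singular vectors are orthonormal, the scores sum to one and can be read as a sampling probability over wavelengths.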

Wavelength Sampling Probability

[Figure panels: k = 2, c = 7; k = 4, c = 16; k = 6, c = 25; k = 8, c = 29]

Ranking Astronomical Line Indices

(Yip et al. 2012 in prep.) (Worthey et al. 94; Trager et al. 98)

Subspace Analysis of Spectra Cutouts:
- Orthogonality
- Divergence
- Commonality

Identify Informative Regions

“NewMethod”
1. Pick the λ with largest Pλ
2. Define its region of influence using Σλ Pλ convergence. Mask λ’s from future selection.
3. Go back to Step 1, or quit.

“MahoneySecond”
4. Over-select λ’s from the targeted number.
5. Merge selected λ if two pixels lie within a certain distance
6. Quit.
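The “MahoneySecond” steps (over-select, then merge nearby pixels) can be sketched as below. The function name, the chain-merging rule (a pixel joins a region if it lies within `merge_dist` of the region's last pixel), and the toy numbers are assumptions for illustration; the defaults of 10× over-selection and 30 Å follow the values quoted for the new spectral regions.

```python
import numpy as np

def mahoney_second(wavelengths, scores, n_target, overselect=10, merge_dist=30.0):
    """Over-select the highest-scoring wavelengths, then merge close ones into regions."""
    wavelengths = np.asarray(wavelengths, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Over-select: keep more pixels than the targeted number of regions.
    n_pick = min(len(scores), n_target * overselect)
    top = np.argsort(scores)[::-1][:n_pick]
    picked = np.sort(wavelengths[top])
    # Merge: consecutive picked pixels closer than merge_dist share a region.
    regions = []
    start = prev = picked[0]
    for w in picked[1:]:
        if w - prev >= merge_dist:
            regions.append((float(start), float(prev)))
            start = w
        prev = w
    regions.append((float(start), float(prev)))
    return regions

# Assumed toy input: wavelengths in Å with made-up sampling probabilities.
wl = [4000.0, 4010.0, 4020.0, 5000.0, 5500.0, 5510.0]
p = [0.30, 0.20, 0.10, 0.05, 0.20, 0.15]
regions = mahoney_second(wl, p, n_target=1, overselect=4)
# regions == [(4000.0, 4010.0), (5500.0, 5510.0)]
```

Each returned tuple is the (first, last) wavelength of a merged region; a region containing a single pixel has equal endpoints.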

Identifying New Line Indices, Objectively

(Yip et al. 2012 in prep.)

New Spectral Regions (MahoneySecond; k = 5; Overselecting 10 X; Combining if < 30 Å)

NewMethod vs MahoneySecond

[Figure panels: NM (NewMethod); M2 (MahoneySecond)]

Angle between Subspaces (Gunawan & Neswan 2000)

[Figure: JHU vs. Lick line indices; label: Σλ Pλ]
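One standard way to compare two sets of spectral regions is through the principal angles between the subspaces they span. The sketch below computes them with QR and SVD. It is offered as an assumption about how such a comparison could be set up: the cited reference may define a single combined angle differently, and the function name is hypothetical.

```python
import numpy as np

def principal_angles(B1, B2):
    """Principal angles (radians, ascending) between the column spaces of B1 and B2."""
    Q1, _ = np.linalg.qr(B1)     # orthonormal basis of span(B1)
    Q2, _ = np.linalg.qr(B2)     # orthonormal basis of span(B2)
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

# Two planes in 3D sharing one direction: angles are 0 and pi/2.
plane_xy = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
plane_xz = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
angles = principal_angles(plane_xy, plane_xz)
```

Angles near π/2 indicate regions that carry independent information, while angles near 0 indicate redundancy between the two sets.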

Line Indices for Galaxy Parameter Estimations

Importance Sampling and Galaxies

• Lick indices are ad hoc
• The new indices are objective
– Recover atomic lines
– Recover molecular bands
– Recover Lick indices
– Informative regions are orthogonal to each other, in contrast to Lick
• Future
– Emission line indices
– More accurate parameter estimation of galaxies

Summary

Non-Incremental changes on the way
• Science is moving increasingly from hypothesis-driven to data-driven discoveries
• Need randomized, incremental algorithms
– Best result in 1 min, 1 hour, 1 day, 1 week
• New computational tools and strategies

… not just statistics, not just computer science, not just astronomy, not just genomics…

Astronomy has always been data-driven… now becoming more generally accepted