Computational AstroStatistics

Bob Nichol (Carnegie Mellon)

Motivation & Goals

Multi-Resolutional KD-trees (examples)

Npt functions (application)

Mixture models (applications)

Bayes network anomaly detection (application)

Very high dimensional data

NVO Problems

Collaborators

Chris Miller, Percy Gomez, Kathy Romer, Andy Connolly, Andrew Hopkins, Mariangela Bernardi, Tomo Goto (Astro)

Larry Wasserman, Chris Genovese, Wong Jang, Pierpaolo Brutti (Statistics)

Andrew Moore, Jeff Schneider, Brigham Anderson, Alex Gray, Dan Pelleg (CS)

Alex Szalay, Gordon Richards, Istvan Szapudi & others (SDSS)

Pittsburgh Computational AstroStatistics (PiCA) Group

(See http://www.picagroup.org)

First Motivation

Cosmology is moving from a “discovery” science into a “statistical” science

Drive for “high precision” measurements:

Cosmological parameters to a few percent;

Accurate description of the complex structure in the universe;

Control of observational and sampling biases

New statistical tools – e.g. non-parametric analyses – are often computationally intensive.

Also, often want to re-sample or Monte Carlo data.

Second Motivation

Last decade was dedicated to building more telescopes and instruments; more coming this decade as well (SDSS, Planck, LSST, 2MASS, DPOSS, MAP). Also, larger simulations.

We have a “Data Flood”; SDSS is terabytes of data a night, while LSST is an SDSS every 5 nights! Petabytes by end of 00’s

Highly correlated datasets and high dimensionality

Existing statistics and algorithms do not scale into these regimes

New Paradigm where we must build new tools before we can analyze & visualize data

SDSS

SDSS Data

Quantity      SDSS value       Factor
Area          10000 sq deg     3
Objects       2.5 billion      200
Spectra       1.5 million      200
Depth         R=23             10
Attributes    144 presently    10

FACTOR OF 12,000,000 (3 x 200 x 200 x 10 x 10)

SDSS Science

Most Distant Object!

100,000 spectra!

Start with tree data structures: Multi-resolutional kd-trees

Scale to n-dimensions (although for very high dimensions use new tree structures)

Use Cached Representation (store summary sufficient statistics at each node). Compute counts from these statistics

Prune the tree, which is stored in memory!

See Moore et al. 2001 (astro-ph/0012333)

Many applications; suite of algorithms!

Goal to build new, fast & efficient statistical algorithms

Range Searches

Fast range searches and catalog matching

Prune cells outside range

Also prune cells that lie entirely inside the range (their cached counts can be used directly)! Greater saving in time
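
The two pruning rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of Moore et al.: the names `KDNode` and `range_count`, the median split on the widest dimension and the `leaf_size` parameter are all hypothetical choices, and the only cached statistics are the bounding box and the point count.

```python
import math


class KDNode:
    """kd-tree node that caches its bounding box and point count."""

    def __init__(self, points, leaf_size=8):
        dims = range(len(points[0]))
        self.lo = [min(p[d] for p in points) for d in dims]
        self.hi = [max(p[d] for p in points) for d in dims]
        self.count = len(points)      # cached statistic for this subtree
        self.points = None
        self.left = self.right = None
        if len(points) <= leaf_size:
            self.points = points      # leaf: keep the raw points
        else:
            # Split at the median along the widest dimension.
            d = max(dims, key=lambda k: self.hi[k] - self.lo[k])
            ordered = sorted(points, key=lambda p: p[d])
            mid = len(ordered) // 2
            self.left = KDNode(ordered[:mid], leaf_size)
            self.right = KDNode(ordered[mid:], leaf_size)


def range_count(node, centre, radius):
    """Count points within `radius` of `centre`, pruning whole nodes."""
    # Smallest and largest possible distance from centre to the node's box.
    dmin = math.sqrt(sum(max(lo - c, 0.0, c - hi) ** 2
                         for c, lo, hi in zip(centre, node.lo, node.hi)))
    dmax = math.sqrt(sum(max(abs(c - lo), abs(c - hi)) ** 2
                         for c, lo, hi in zip(centre, node.lo, node.hi)))
    if dmin > radius:
        return 0                      # box entirely outside: prune
    if dmax <= radius:
        return node.count             # box entirely inside: use cached count
    if node.points is not None:       # leaf cutting the boundary: check points
        return sum(math.dist(p, centre) <= radius for p in node.points)
    return (range_count(node.left, centre, radius)
            + range_count(node.right, centre, radius))
```

Only nodes whose boxes cut the search boundary are opened, so the work is concentrated in a thin shell around the range rather than spread over every point.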

N-point correlation functions

The 2-point function has a long history in cosmology (Peebles 1980). It is the excess joint probability of a pair of points over that expected from a Poisson process. Also long history (as point processes) in Statistics:

Similarly, the three-point is defined as (so on!)
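
The equations from the slides did not survive extraction. For reference, the standard textbook forms (as in Peebles 1980) are given below; the notation is an assumption, with \bar{n} the mean number density, \xi the 2-point function, \zeta the connected 3-point function and dV_i the volume elements.

```latex
dP_{12} = \bar{n}^{2} \left[ 1 + \xi(r_{12}) \right] dV_1 \, dV_2

dP_{123} = \bar{n}^{3} \left[ 1 + \xi(r_{12}) + \xi(r_{23}) + \xi(r_{31})
           + \zeta(r_{12}, r_{23}, r_{31}) \right] dV_1 \, dV_2 \, dV_3
```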

Same 2pt, very different 3pt

Naively, an N-point function over n points is an n^N process, but all it is, is a set of range searches.

Dual Tree Approach

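Usually binned into annuli rmin < r < rmax. Thus, for each r, traverse both trees and prune pairs of nodes with either dmax < rmin or dmin > rmax, where dmin and dmax are the smallest and largest possible separations between the two nodes (no pair of points can then lie in the annulus).

Also, if dmin > rmin & dmax < rmax, all pairs in these nodes are within the annulus. Therefore, only need to calculate pairs cutting the boundaries.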
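
A minimal dual-tree pair counter along these lines might look as follows. It is a sketch under stated assumptions rather than the group's implementation: the `Node` class, `box_distances` and `pair_count` are hypothetical names, pairs are counted as ordered pairs, and no multi-bin or approximate speed-ups are attempted.

```python
import math


class Node:
    """kd-tree node caching its bounding box and point count."""

    def __init__(self, points, leaf_size=8):
        dims = range(len(points[0]))
        self.lo = [min(p[d] for p in points) for d in dims]
        self.hi = [max(p[d] for p in points) for d in dims]
        self.count = len(points)
        self.points = None
        self.children = []
        if len(points) <= leaf_size:
            self.points = points
        else:
            d = max(dims, key=lambda k: self.hi[k] - self.lo[k])
            ordered = sorted(points, key=lambda p: p[d])
            mid = len(ordered) // 2
            self.children = [Node(ordered[:mid], leaf_size),
                             Node(ordered[mid:], leaf_size)]


def box_distances(a, b):
    """Smallest and largest possible separation between boxes a and b."""
    dmin2 = dmax2 = 0.0
    for alo, ahi, blo, bhi in zip(a.lo, a.hi, b.lo, b.hi):
        gap = max(alo - bhi, blo - ahi, 0.0)   # zero if the boxes overlap
        span = max(ahi - blo, bhi - alo)
        dmin2 += gap * gap
        dmax2 += span * span
    return math.sqrt(dmin2), math.sqrt(dmax2)


def pair_count(a, b, rmin, rmax):
    """Count ordered pairs (p in a, q in b) with rmin <= |p - q| < rmax."""
    dmin, dmax = box_distances(a, b)
    if dmin >= rmax or dmax < rmin:
        return 0                      # no pair can fall in the annulus
    if dmin >= rmin and dmax < rmax:
        return a.count * b.count      # every pair falls in the annulus
    if not a.children and not b.children:
        # Two leaves cutting the boundary: test the pairs directly.
        return sum(rmin <= math.dist(p, q) < rmax
                   for p in a.points for q in b.points)
    # Otherwise open the larger node that still has children.
    if a.children and (not b.children or a.count >= b.count):
        return sum(pair_count(c, b, rmin, rmax) for c in a.children)
    return sum(pair_count(a, c, rmin, rmax) for c in b.children)
```

Calling `pair_count(tree, tree, rmin, rmax)` with rmin > 0 gives the data-data pair count for one bin; each unordered pair is counted twice.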

Extra speed-ups are possible doing multiple r’s together and controlled approximations

Time depends on density of points and binsize & scale

(Plot labels: N*N, NlogN, N*N*N)

Fast Mixture Models

Describe the data in N-dimensions as a mixture of, say, Gaussians (kernel shape less important than bandwidth!)

The parameters of the model are then a set of Gaussians, each with a mean and covariance

Iterate, testing using BIC and AIC at each iteration. Fast because of kd-trees (20 mins for 100,000 points on a PC!)

Employ heuristic splitting algorithm as well

Details in Connolly et al. 2000 (astro-ph/0008187)
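
The fit-and-test loop can be illustrated with a deliberately simplified sketch. It assumes 1-D data, plain EM without any kd-tree acceleration or splitting heuristic, and model selection by BIC only; the function names and the quantile-based initialisation are hypothetical and are not taken from Connolly et al.

```python
import math


def component_densities(v, weights, means, variances):
    """Weighted Gaussian density of each component at the value v."""
    return [w * math.exp(-0.5 * (v - m) ** 2 / s) / math.sqrt(2 * math.pi * s)
            for w, m, s in zip(weights, means, variances)]


def log_likelihood(x, weights, means, variances):
    return sum(math.log(sum(component_densities(v, weights, means, variances))
                        + 1e-300) for v in x)


def fit_mixture(x, k, iters=50):
    """Plain EM for a k-component 1-D Gaussian mixture."""
    n = len(x)
    xs = sorted(x)
    mean_all = sum(x) / n
    var_all = sum((v - mean_all) ** 2 for v in x) / n
    weights = [1.0 / k] * k
    means = [xs[(2 * j + 1) * n // (2 * k)] for j in range(k)]  # quantiles
    variances = [var_all] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for v in x:
            dens = component_densities(v, weights, means, variances)
            total = sum(dens) + 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * v for r, v in zip(resp, x)) / nj
            variances[j] = max(
                sum(r[j] * (v - means[j]) ** 2 for r, v in zip(resp, x)) / nj,
                1e-6)                 # floor to avoid collapsing components
    return weights, means, variances


def bic(x, k, params):
    """BIC = -2 ln L + p ln n, with p = 3k - 1 free parameters in 1-D."""
    return -2.0 * log_likelihood(x, *params) + (3 * k - 1) * math.log(len(x))


def select_mixture(x, max_k=4):
    """Fit k = 1..max_k components and keep the fit with the lowest BIC."""
    fits = {k: fit_mixture(x, k) for k in range(1, max_k + 1)}
    best = min(fits, key=lambda k: bic(x, k, fits[k]))
    return best, fits[best]
```

The E-step loop over every point is the expensive part; that is the step a kd-tree with cached statistics is meant to speed up.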

EM-Based Gaussian Mixture Clustering: 1

Applications

Used in SDSS quasar selection (used to map the multi-color stellar locus)

Gordon Richards @ PSU

Anomaly detector (look for low probability points in N-dimensions)

Optimal smoothing of large-scale structure

SDSS QSO target selection in 4D color-space

Cluster 9999 spectroscopically confirmed stars

Cluster 8833 spectroscopically confirmed QSOs (33 gaussians)

99% for stars, 96% for QSOs

Bayes Net Anomaly Detector

Instead of using a single joint probability function (fitted to data), factorize into a smaller set of conditional probabilities

Directed and acyclic

If we know the graph and conditional probabilities, we have a valid probability function for the whole model

Use 1.5 million SDSS sources to learn model (25 variables each)

Then evaluate the likelihood of each data point being drawn from the model

Lowest 1000 are anomalous; look at ’em and follow ’em up at Keck

Unfortunately, a lot of errors

Advantage of Bayes Net is that it tells you why it was anomalous; the most unusual conditional probabilities

Therefore, iterate loop and get scientist to highlight obvious errors; then suppress those errors so they do not return again

Issue of productivity!
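
As a toy illustration of the idea, the sketch below scores records against a small discrete Bayes net and reports which conditional probability made each record unlikely. Everything here is an assumption for illustration: the graph is given by hand rather than learned, the variables are discrete, the tables use simple Laplace-smoothed counts, and the names `learn_cpts` and `score` are hypothetical.

```python
import math
from collections import Counter


def learn_cpts(records, parents, alpha=1.0):
    """Estimate P(variable | parents) by smoothed counting.

    `records` is a list of dicts; `parents` maps each variable to a tuple
    of its parent variables (a hand-specified directed acyclic graph).
    """
    values = {v: {r[v] for r in records} for v in parents}
    joint = {v: Counter() for v in parents}
    context = {v: Counter() for v in parents}
    for r in records:
        for v, pa in parents.items():
            ctx = tuple(r[p] for p in pa)
            joint[v][(ctx, r[v])] += 1
            context[v][ctx] += 1

    def prob(v, r):
        ctx = tuple(r[p] for p in parents[v])
        return ((joint[v][(ctx, r[v])] + alpha)
                / (context[v][ctx] + alpha * len(values[v])))

    return prob


def score(record, parents, prob):
    """Return (log-likelihood, variable with the least likely conditional)."""
    terms = {v: math.log(prob(v, record)) for v in parents}
    return sum(terms.values()), min(terms, key=terms.get)
```

Sorting records by the first element of `score` gives the anomaly ranking, and the second element is the "why": the conditional term that contributed most to the low likelihood.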

Will Only Get Worse

LSST will do an SDSS every 5 nights looking for transient objects producing petabytes of data (2007)

VISTA will collect 300 Terabytes of data (2005)

Archival Science is upon us! HST database has 20 GBytes per day downloaded (10 times more than goes in!)

Will Only Get Worse II

Surveys spanning electromagnetic spectrum

Combining these surveys is hard: different sensitivities, resolutions and physics

Mixture of imaging, catalogs and spectra

Difference between continuum and point processes

Thousands of attributes per source

What is VO?

The “Virtual Observatory” must:

Federate multi-wavelength data sources (interoperability)

Empower everyone (democratise)

Be fast, distributed and easy

Allow input and output

Computer Science + Statistics!

Scientists will need help through autonomous scientific discovery of large, multi-dimensional, correlated datasets

Scientists will need fast databases

Scientists will need distributed computing and fast networks

Scientists will need new visualization tools

CS and Statistics looking for new challenges: Also no data-rights & privacy issues

New breed of students needed with IT skills

Symbiotic Relationship

VO Prototype

Ideally we would like all parts of the VO to be web services

(Diagram labels: DB, C#, dym, EM, .NET, http)

Lessons We Learnt

Tough to marry research C code developed under Linux to MS (pointers to memory)

.NET has “unsafe” memory

.NET server is hard to set up!

Migrate to using VOTables to perform all I/O.

Have server running at CMU so we can control code

Very High Dimensions

Using LLE and Isomap; looking for lower dimensional manifolds in higher dimensional spaces

500x2000 space from SDSS spectra

Summary

Era of New Cosmology: Massive data sources and search for subtle features & high precision measurements

Need new methods that scale into these new regimes; “a virtual universe” (students will need different skills). Perfect synergy with Stats, CS, Physics

Good algorithms are as good as faster and more computers!

The “glue” to make a “virtual observatory” is hard and complex. Don’t under-estimate the job

Are the Features Real? (FDR)!

This is an example of multiple hypothesis testing, e.g. is every point consistent with a smooth p(k)?

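Let us first look at a simulated example: consider a 1000x1000 image with 40000 sources.

Method        True         False        Missed     Correct
              detections   detections   sources    rejections
FDR           30389        1505         9611       958495
2sigma        31497        22728        8503       937272
Bonferroni    27137        0            12863      960000

(In each row, true detections + missed sources = 40000 and false detections + correct rejections = 960000.)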

FDR makes 15 times fewer mistakes (1505 false detections versus 22728) for nearly the same power as traditional 2-sigma

Why? Controls a scientifically meaningful quantity: FDR = No. of false discoveries/Total no. of discoveries

And it is adaptive to the size of the dataset
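
The slides do not say which FDR-controlling procedure was used. A common choice is the Benjamini-Hochberg step-up rule, sketched below as an assumption; the function name `benjamini_hochberg` is hypothetical, and the usual guarantee holds for independent (or positively dependent) tests.

```python
def benjamini_hochberg(p_values, alpha):
    """Return the indices of hypotheses rejected at false discovery rate alpha.

    Sort the p-values, find the largest rank k with p_(k) <= alpha * k / n,
    and reject the k hypotheses with the smallest p-values.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / n:
            cutoff = rank             # keep the largest rank that passes
    return sorted(order[:cutoff])
```

The threshold alpha * k / n grows with the number of rejections, which is what makes the rule adapt to the data: with few real signals it behaves like Bonferroni (alpha / n), and with many it relaxes towards a fixed per-test threshold.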

We used an FDR of 0.25, i.e. on average no more than 25% of the circled points are in error

Therefore, we can say with statistical rigor that most of these points are rejected and are thus “features”

No single point is a 3sigma deviation

New statistics has enabled an astronomical discovery