Computational AstroStatistics
Bob Nichol (Carnegie Mellon)
Motivation & Goals
Multi-Resolutional KD-trees (examples)
Npt functions (application)
Mixture models (applications)
Bayes network anomaly detection (application)
Very high dimensional data
NVO Problems
Collaborators
Chris Miller, Percy Gomez, Kathy Romer, Andy Connolly, Andrew Hopkins, Mariangela Bernardi, Tomo Goto (Astro)
Larry Wasserman, Chris Genovese, Wong Jang, Pierpaolo Brutti (Statistics)
Andrew Moore, Jeff Schneider, Brigham Anderson, Alex Gray, Dan Pelleg (CS)
Alex Szalay, Gordon Richards, Istvan Szapudi & others (SDSS)
Pittsburgh Computational AstroStatistics (PiCA) Group
(See http://www.picagroup.org)
First Motivation
Cosmology is moving from a “discovery” science into a “statistical” science
Drive for “high precision” measurements:
Cosmological parameters to a few percent;
Accurate description of the complex structure in the universe;
Control of observational and sampling biases
New statistical tools – e.g. non-parametric analyses – are often computationally intensive.
Also, often want to re-sample or Monte Carlo data.
Second Motivation
Last decade was dedicated to building more telescopes and instruments; more coming this decade as well (SDSS, Planck, LSST, 2MASS, DPOSS, MAP). Also, larger simulations.
We have a “Data Flood”; SDSS is terabytes of data a night, while LSST is an SDSS every 5
nights! Petabytes by end of 00’s
Highly correlated datasets and high dimensionality
Existing statistics and algorithms do not scale into these regimes
New Paradigm where we must build new tools before we can analyze &
visualize data
SDSS
SDSS Data
FACTOR OF 12,000,000 (product of the factors below)
Quantity      Value            Factor
Area          10000 sq deg     3
Objects       2.5 billion      200
Spectra       1.5 million      200
Depth         R=23             10
Attributes    144 presently    10
SDSS Science
Most Distant Object!
100,000 spectra!
Start with tree data structures: Multi-resolutional kd-trees
Scale to n-dimensions (although for very high dimensions use new tree structures)
Use Cached Representation (store summary sufficient statistics at each node). Compute counts from these statistics
Prune the tree, which is stored in memory!
See Moore et al. 2001 (astro-ph/0012333)
Many applications; suite of algorithms!
Goal to build new, fast & efficient statistical algorithms
Range Searches
Fast range searches and catalog matching
Prune cells outside range
Also prune cells inside! Greater saving in time
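The two kinds of pruning can be sketched in a few lines of Python. This is a minimal illustration, not the code from Moore et al. 2001: the `Node` class, the `leaf_size` parameter and the `range_count` helper are hypothetical names, and the only cached statistics are each node's bounding box and point count.

```python
import numpy as np


class Node:
    """kd-tree node that caches its bounding box and point count."""

    def __init__(self, pts, leaf_size=16):
        self.lo, self.hi = pts.min(axis=0), pts.max(axis=0)  # bounding box
        self.count = len(pts)                                 # cached statistic
        self.pts = self.left = self.right = None
        if self.count <= leaf_size:
            self.pts = pts                                    # leaf keeps raw points
        else:
            dim = int(np.argmax(self.hi - self.lo))           # split the widest axis
            order = np.argsort(pts[:, dim])
            mid = self.count // 2
            self.left = Node(pts[order[:mid]], leaf_size)
            self.right = Node(pts[order[mid:]], leaf_size)


def range_count(node, centre, radius):
    """Count the points within `radius` of `centre`."""
    # Smallest and largest possible distance from centre to the node's box.
    dmin = np.linalg.norm(centre - np.clip(centre, node.lo, node.hi))
    dmax = np.linalg.norm(np.maximum(np.abs(centre - node.lo),
                                     np.abs(centre - node.hi)))
    if dmin > radius:
        return 0                  # cell entirely outside the range: prune
    if dmax <= radius:
        return node.count         # cell entirely inside: use the cached count
    if node.pts is not None:      # leaf cutting the boundary: test its points
        return int(np.sum(np.linalg.norm(node.pts - centre, axis=1) <= radius))
    return (range_count(node.left, centre, radius)
            + range_count(node.right, centre, radius))


# Illustrative data: uniform random points in the unit cube.
rng = np.random.default_rng(0)
points = rng.random((5000, 3))
root = Node(points)
centre = np.array([0.5, 0.5, 0.5])
print(range_count(root, centre, 0.2))
```

Only cells that straddle the boundary of the search sphere are ever opened; everything else is answered from the cached counts.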
N-point correlation functions
The 2-point function has a long history in cosmology (Peebles 1980). It is the excess joint probability of a pair of points over that expected from a Poisson process. It also has a long history (as point processes) in Statistics.
Similarly, the three-point function is defined for triplets of points (and so on!)
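In the standard notation of Peebles (1980), these definitions can be written as follows. The symbols are assumptions, since the slide's own notation is not given here: \bar{n} is the mean number density, dV_i are volume elements, \xi is the 2-point function and \zeta is the 3-point function.

```latex
% Joint probability of finding a pair of points in dV_1 and dV_2
dP_{12} = \bar{n}^{2} \left[ 1 + \xi(r_{12}) \right] dV_1 \, dV_2

% Joint probability of finding a triplet of points in dV_1, dV_2 and dV_3
dP_{123} = \bar{n}^{3} \left[ 1 + \xi(r_{12}) + \xi(r_{23}) + \xi(r_{31})
           + \zeta(r_{12}, r_{23}, r_{31}) \right] dV_1 \, dV_2 \, dV_3
```

For a Poisson process \xi = \zeta = 0, so both functions measure the excess over random.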
Same 2pt, very different 3pt
Naively, an N-point function over n points is an n^N process, but it is really just a set of range searches.
Dual Tree Approach
Usually binned into annuli rmin < r < rmax. Thus, for each r, traverse both trees and prune pairs of nodes with either dmax < rmin or dmin > rmax, where dmin and dmax are the minimum and maximum distances between the two nodes (no pair of points can then lie in the annulus).
Also, if dmin > rmin & dmax < rmax, all pairs in these nodes are within the annulus and can be counted at once. Therefore, only need to calculate pairs in nodes cutting the boundaries.
Extra speed-ups are possible by doing multiple r’s together and by using controlled approximations
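A minimal Python sketch of the dual-tree recursion for a single annulus is given below. It counts cross pairs between two point sets (for example data and randoms). The dictionary-based nodes and the helper names `build`, `box_dist` and `pair_count` are assumptions made for illustration, and none of the extra speed-ups are included.

```python
import numpy as np


def build(pts, leaf_size=32):
    """Build a kd-tree whose nodes cache a bounding box and a point count."""
    node = {"lo": pts.min(axis=0), "hi": pts.max(axis=0),
            "n": len(pts), "pts": None, "kids": ()}
    if len(pts) <= leaf_size:
        node["pts"] = pts
    else:
        dim = int(np.argmax(node["hi"] - node["lo"]))   # split the widest axis
        order = np.argsort(pts[:, dim])
        mid = len(pts) // 2
        node["kids"] = (build(pts[order[:mid]], leaf_size),
                        build(pts[order[mid:]], leaf_size))
    return node


def box_dist(a, b):
    """Minimum and maximum distance between points of two bounding boxes."""
    gap = np.maximum(0.0, np.maximum(a["lo"] - b["hi"], b["lo"] - a["hi"]))
    span = np.maximum(a["hi"] - b["lo"], b["hi"] - a["lo"])
    return np.linalg.norm(gap), np.linalg.norm(span)


def pair_count(a, b, rmin, rmax):
    """Count pairs (one point from a, one from b) with rmin <= d < rmax."""
    dmin, dmax = box_dist(a, b)
    if dmin >= rmax or dmax < rmin:
        return 0                          # no pair can fall in the annulus
    if dmin >= rmin and dmax < rmax:
        return a["n"] * b["n"]            # every pair falls in the annulus
    if a["pts"] is not None and b["pts"] is not None:
        # Two leaves cutting the boundary: compute the distances directly.
        d = np.linalg.norm(a["pts"][:, None, :] - b["pts"][None, :, :], axis=2)
        return int(np.sum((d >= rmin) & (d < rmax)))
    # Otherwise open one node: a if b is a leaf or a is the larger non-leaf.
    if b["pts"] is not None or (a["pts"] is None and a["n"] >= b["n"]):
        return sum(pair_count(k, b, rmin, rmax) for k in a["kids"])
    return sum(pair_count(a, k, rmin, rmax) for k in b["kids"])


# Illustrative data: two uniform random catalogues in the unit square.
rng = np.random.default_rng(0)
data = rng.random((1500, 2))
randoms = rng.random((1200, 2))
tree_d, tree_r = build(data), build(randoms)
print(pair_count(tree_d, tree_r, 0.1, 0.2))
```

The same recursion extends to triplets of nodes for the three-point function, with one distance test per side of the triangle.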
Time depends on density of points, bin size & scale
[Figure: timing plot with curves labelled N*N, NlogN and N*N*N]
Fast Mixture Models
Describe the data in N-dimensions as a mixture of, say, Gaussians (kernel shape less important than bandwidth!)
The parameters of the model are then a set of Gaussians, each with a mean and covariance
Iterate, testing using BIC and AIC at each iteration. Fast because of kd-trees (20 mins for 100,000 points on a PC!)
Employ heuristic splitting algorithm as well
Details in Connolly et al. 2000 (astro-ph/0008187)
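The fit-and-test loop can be illustrated with a plain EM fit scored by BIC. This sketch is an assumption-laden stand-in, not the method of Connolly et al. 2000: it has no kd-tree acceleration and no splitting heuristic, it uses a simple farthest-point initialisation, and the names `fit_gmm`, `component_logp` and the `reg` regularisation term are hypothetical.

```python
import numpy as np


def component_logp(X, weights, means, covs):
    """Matrix of log(weight_j * N(x_i | mean_j, cov_j)), one column per Gaussian."""
    d = X.shape[1]
    cols = []
    for w, mean, cov in zip(weights, means, covs):
        diff = X - mean
        _, logdet = np.linalg.slogdet(cov)
        maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
        cols.append(np.log(w) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha))
    return np.stack(cols, axis=1)


def point_loglik(logp):
    """Log-likelihood of each point under the mixture (stable log-sum-exp)."""
    top = logp.max(axis=1, keepdims=True)
    return top[:, 0] + np.log(np.exp(logp - top).sum(axis=1))


def fit_gmm(X, k, n_iter=100, reg=1e-3):
    """Fit k full-covariance Gaussians by EM; return weights, means, covs, BIC."""
    n, d = X.shape
    means = [X[0]]                         # farthest-point initialisation
    for _ in range(k - 1):
        dist = np.min([np.linalg.norm(X - m, axis=1) for m in means], axis=0)
        means.append(X[np.argmax(dist)])
    means = np.array(means)
    covs = np.array([np.cov(X.T) + reg * np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each Gaussian for each point.
        logp = component_logp(X, weights, means, covs)
        resp = np.exp(logp - point_loglik(logp)[:, None])
        # M-step: re-estimate weights, means and covariances.
        nk = resp.sum(axis=0) + 1e-10      # small floor avoids division by zero
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)
    loglik = point_loglik(component_logp(X, weights, means, covs)).sum()
    n_params = (k - 1) + k * d + k * d * (d + 1) // 2
    return weights, means, covs, -2.0 * loglik + n_params * np.log(n)


# Illustrative data: two well-separated clusters in 2 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(10.0, 1.0, (200, 2))])
bics = {k: fit_gmm(X, k)[3] for k in (1, 2, 3)}
best_k = min(bics, key=bics.get)           # lowest BIC is preferred
print(best_k, bics)
```

Lower BIC is better, so the number of Gaussians is chosen by the point where adding components stops paying for its extra parameters.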
EM-Based Gaussian Mixture Clustering: 1
EM-Based Gaussian Mixture Clustering: 2
EM-Based Gaussian Mixture Clustering: 4
EM-Based Gaussian Mixture Clustering: 20
Applications
Used in SDSS quasar selection (used to map the multi-color stellar locus)
Gordon Richards @ PSU
Anomaly detector (look for low probability points in N-dimensions)
Optimal smoothing of large-scale structure
SDSS QSO target selection in 4D color-space
Cluster 9999 spectroscopically confirmed stars
Cluster 8833 spectroscopically
confirmed QSOs (33 gaussians)
99% for stars, 96% for QSOs
Bayes Net Anomaly Detector
Instead of using a single joint probability function (fitted to data), factorize into a smaller set of conditional probabilities
Directed and acyclic graph
If we know the graph and the conditional probabilities, we have a valid probability function for the whole model
Use 1.5 million SDSS sources to learn model (25 variables each)
Then evaluate the likelihood of each source being drawn from the model
Lowest 1000 are anomalous; look at ’em and follow ’em up at Keck
Unfortunately, a lot of errors
Advantage of Bayes Net is that it tells you why it was anomalous: the most unusual conditional probabilities
Therefore, iterate loop and get scientist to highlight obvious errors; then suppress those errors so they do not return again
Issue of productivity!
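As a toy illustration of this loop, the sketch below learns the conditional probability tables of a small discrete network by counting, scores every record, and reports which conditional made a record unusual. Everything specific here is an assumption: the three binary variables, the chain graph A -> B -> C, the Laplace smoothing, and the helper names `learn_cpts`, `score` and `explain`. The real model used 25 variables learned from SDSS sources.

```python
import numpy as np
from collections import Counter

# Hypothetical toy network A -> B -> C over binary variables (an assumption).
NAMES = ["A", "B", "C"]
PARENTS = {0: [], 1: [0], 2: [1]}   # variable index -> parent indices
N_VALUES = 2


def learn_cpts(data, alpha=1.0):
    """Estimate each P(var | parents) by counting, with Laplace smoothing."""
    cpts = {}
    for var, pa in PARENTS.items():
        joint, marg = Counter(), Counter()
        for row in data:
            key = tuple(int(row[p]) for p in pa)
            joint[(key, int(row[var]))] += 1
            marg[key] += 1
        cpts[var] = (joint, marg, alpha)
    return cpts


def conditional_logps(row, cpts):
    """Log P(var | parents) for every variable of one record."""
    out = []
    for var, pa in PARENTS.items():
        joint, marg, alpha = cpts[var]
        key = tuple(int(row[p]) for p in pa)
        prob = (joint[(key, int(row[var]))] + alpha) / (marg[key] + alpha * N_VALUES)
        out.append(np.log(prob))
    return np.array(out)


def score(row, cpts):
    """Log-likelihood of a record: the sum of its conditional log-probabilities."""
    return float(conditional_logps(row, cpts).sum())


def explain(row, cpts):
    """Name of the variable whose conditional probability is most unusual."""
    return NAMES[int(np.argmin(conditional_logps(row, cpts)))]


# Simulated records: B copies A 95% of the time, C copies B 95% of the time.
rng = np.random.default_rng(1)
a = rng.integers(0, 2, 5000)
b = np.where(rng.random(5000) < 0.95, a, 1 - a)
c = np.where(rng.random(5000) < 0.95, b, 1 - b)
data = np.column_stack([a, b, c])

cpts = learn_cpts(data)
scores = np.array([score(row, cpts) for row in data])
lowest = np.argsort(scores)[:10]    # the ten least likely records
for i in lowest[:3]:
    print(data[i], round(scores[i], 2), "most unusual:", explain(data[i], cpts))
```

Because the score is a sum over conditionals, the smallest term points directly at the attribute to inspect, which is what makes flagging and suppressing known error modes practical.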
Will Only Get Worse
LSST will do an SDSS every 5 nights looking for transient objects producing petabytes of data (2007)
VISTA will collect 300 Terabytes of data (2005)
Archival Science is upon us! HST database has 20GBytes per day
downloaded (10 times more than goes in!)
Will Only Get Worse II
Surveys spanning electromagnetic spectrum
Combining these surveys is hard: different sensitivities, resolutions and physics
Mixture of imaging, catalogs and spectra
Difference between continuum and point processes
Thousands of attributes per source
What is VO?
The “Virtual Observatory” must:
Federate multi-wavelength data sources (interoperability)
Empower everyone (democratise)
Be fast, distributed and easy
Allow input and output
Computer Science + Statistics!
Scientists will need help through autonomous scientific discovery in large, multi-dimensional, correlated datasets
Scientists will need fast databases
Scientists will need distributed computing and fast networks
Scientists will need new visualization tools
CS and Statistics looking for new challenges: Also no data-rights & privacy issues
New breed of students needed with IT skills
Symbiotic Relationship
VO Prototype
Ideally we would like all parts of the VO to be web-services
[Diagram labels: DB, C#, dym, EM, dym, http, .NET, http]
Lessons We Learnt
Tough to marry research C code developed under Linux to MS (pointers to memory)
.NET has “unsafe” memory
.NET server is hard to set up!
Migrate to using VOTables to perform all I/O.
Have server running at CMU so we can control code
Very High Dimensions
Using LLE and Isomap; looking for lower
dimensional manifolds in higher dimensional spaces
500x2000 space from
SDSS spectra
Summary
Era of New Cosmology: Massive data sources and search for subtle features & high precision measurements
Need new methods that scale into these new regimes; “a virtual universe” (students will need different skills). Perfect synergy with Stats, CS, Physics
Good algorithms are as good as faster and more computers!
The “glue” to make a “virtual observatory” is hard and complex. Don’t under-estimate the job
Are the Features Real? (FDR)!
This is an example of multiple hypothesis testing, e.g. is every point consistent with a smooth p(k)?
Let us first look at a simulated example: consider a 1000x1000 image
with 40000 sources.
Method       Sources found   False detections   Sources missed   Background correctly rejected
FDR          30389           1505               9611             958495
2sigma       31497           22728              8503             937272
Bonferroni   27137           0                  12863            960000
FDR makes 15 times fewer mistakes for the same power as traditional 2-sigma
Why? Controls a scientifically meaningful quantity: FDR = No. of false discoveries/Total no. of discoveries
And it is adaptive to the size of the dataset
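The adaptivity is easiest to see in code. The sketch below assumes the Benjamini-Hochberg step-up procedure, the usual way of controlling the FDR; the slides do not spell out which procedure was used, and the function name `benjamini_hochberg` and the example p-values are illustrative only.

```python
import numpy as np


def benjamini_hochberg(pvalues, alpha=0.25):
    """Boolean mask of rejected hypotheses at false discovery rate `alpha`."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)                          # indices of sorted p-values
    thresholds = alpha * np.arange(1, n + 1) / n   # line of slope alpha / n
    below = np.nonzero(p[order] <= thresholds)[0]
    reject = np.zeros(n, dtype=bool)
    if below.size:
        # Reject everything up to the LAST p-value under the line (step-up).
        reject[order[: below[-1] + 1]] = True
    return reject


# Made-up p-values, purely for illustration.
pvals = [0.001, 0.011, 0.02, 0.5, 0.9]
print(benjamini_hochberg(pvals, alpha=0.05))
```

The cut is set by the data: the threshold line scales with the number of tests, and the p-value at which the sorted values last cross it becomes the rejection threshold, so it sits between a fixed 2-sigma cut and the much stricter Bonferroni cut.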
We used an FDR of 0.25, i.e. on average up to 25% of the circled points are in error
Therefore, we can say with statistical rigor that most of these points are rejected and are thus “features”
No single point is a 3sigma deviation
New statistics has enabled an astronomical discovery