![Page 1: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/1.jpg)
Computational AstroStatistics
Bob Nichol (Carnegie Mellon)
Motivation & Goals
Multi-Resolutional KD-trees (examples)
Npt functions (application)
Mixture models (applications)
Bayes network anomaly detection (application)
Very high dimensional data
NVO Problems
![Page 2: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/2.jpg)
Collaborators
Chris Miller, Percy Gomez, Kathy Romer, Andy Connolly, Andrew Hopkins, Mariangela Bernardi, Tomo Goto (Astro)
Larry Wasserman, Chris Genovese, Wong Jang, Pierpaolo Brutti (Statistics)
Andrew Moore, Jeff Schneider, Brigham Anderson, Alex Gray, Dan Pelleg (CS)
Alex Szalay, Gordon Richards, Istvan Szapudi & others (SDSS)
Pittsburgh Computational AstroStatistics (PiCA) Group
(See http://www.picagroup.org)
![Page 3: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/3.jpg)
First Motivation
Cosmology is moving from a “discovery” science into a “statistical” science
Drive for “high precision” measurements:
Cosmological parameters to a few percent;
Accurate description of the complex structure in the universe;
Control of observational and sampling biases
New statistical tools (e.g. non-parametric analyses) are often computationally intensive.
Also, we often want to re-sample or Monte Carlo the data.
![Page 4: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/4.jpg)
Second Motivation
Last decade was dedicated to building more telescopes and instruments; more coming this decade as well (SDSS, Planck, LSST, 2MASS, DPOSS, MAP). Also, larger simulations.
We have a “Data Flood”; SDSS is terabytes of data a night, while LSST is an SDSS every 5 nights! Petabytes by end of 00’s
Highly correlated datasets and high dimensionality
Existing statistics and algorithms do not scale into these regimes
New Paradigm where we must build new tools before we can analyze & visualize data
![Page 5: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/5.jpg)
SDSS
![Page 6: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/6.jpg)
SDSS
![Page 7: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/7.jpg)
SDSS Data
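FACTOR OF 12,000,000 (3 x 200 x 200 x 10 x 10)

| Quantity | SDSS | Factor |
|---|---|---|
| Area | 10000 sq deg | 3 |
| Objects | 2.5 billion | 200 |
| Spectra | 1.5 million | 200 |
| Depth | R=23 | 10 |
| Attributes | 144 presently | 10 |

SDSS Science
Most Distant Object!
100,000 spectra!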
![Page 8: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/8.jpg)
Start with tree data structures: Multi-resolutional kd-trees
Scale to n-dimensions (although for very high dimensions use new tree structures)
Use Cached Representation (store summary sufficient statistics at each node). Compute counts from these statistics
Prune the tree, which is stored in memory!
See Moore et al. 2001 (astro-ph/0012333)
Many applications; suite of algorithms!
Goal to build new, fast & efficient statistical algorithms
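The cached representation described on this slide can be sketched as follows. This is a minimal illustration, not the PiCA group's code: the `Node` fields, the `build` function, the median split on the widest dimension and the `leaf_size` parameter are all assumptions made for the example. Each node stores its bounding box, its point count and the per-dimension sum of its points, so counts and centroids can be read off without visiting the leaves.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    lo: Point                               # lower corner of the bounding box
    hi: Point                               # upper corner of the bounding box
    count: int                              # cached statistic: number of points
    total: Point                            # cached statistic: per-dimension sum
    points: Optional[List[Point]] = None    # raw points, kept only in leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(points: List[Point], leaf_size: int = 8) -> Node:
    """Build a kd-tree over a non-empty point list (hypothetical helper)."""
    dim = len(points[0])
    lo = tuple(min(p[d] for p in points) for d in range(dim))
    hi = tuple(max(p[d] for p in points) for d in range(dim))
    total = tuple(sum(p[d] for p in points) for d in range(dim))
    node = Node(lo, hi, len(points), total)

    widths = [hi[d] - lo[d] for d in range(dim)]
    axis = widths.index(max(widths))        # split along the widest dimension
    if len(points) <= leaf_size or widths[axis] == 0.0:
        node.points = list(points)          # small or degenerate cell: make a leaf
        return node

    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2                 # median split keeps the tree balanced
    node.left = build(ordered[:mid], leaf_size)
    node.right = build(ordered[mid:], leaf_size)
    return node


def centroid(node: Node) -> Point:
    """Centroid from cached statistics only; no leaf is visited."""
    return tuple(s / node.count for s in node.total)
```

With this layout, a query that can decide a whole cell at once needs only `node.count` (or `node.total`), which is what makes pruning pay off.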
![Page 9: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/9.jpg)
![Page 10: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/10.jpg)
Range Searches
Fast range searches and catalog matching
Prune cells outside range
Also prune cells inside! Greater saving in time
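The two pruning rules above can be sketched for a simple "how many points lie within radius r of q" query. The class and function names (`Node`, `min_max_dist`, `range_count`) and the median-split tree are assumptions made for this illustration. A cell entirely outside the radius contributes zero, a cell entirely inside contributes its cached count, and only cells that straddle the boundary are opened.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


class Node:
    """Minimal kd-tree node with a cached point count (illustrative only)."""

    def __init__(self, points: List[Point], leaf_size: int = 8):
        dim = len(points[0])
        self.lo = [min(p[d] for p in points) for d in range(dim)]
        self.hi = [max(p[d] for p in points) for d in range(dim)]
        self.count = len(points)                      # cached statistic
        self.points: Optional[List[Point]] = None     # set only for leaves
        self.children: List["Node"] = []
        widths = [h - l for l, h in zip(self.lo, self.hi)]
        axis = widths.index(max(widths))
        if self.count <= leaf_size or widths[axis] == 0.0:
            self.points = points
        else:
            ordered = sorted(points, key=lambda p: p[axis])
            mid = self.count // 2
            self.children = [Node(ordered[:mid], leaf_size),
                             Node(ordered[mid:], leaf_size)]


def min_max_dist(node: Node, q: Sequence[float]) -> Tuple[float, float]:
    """Smallest and largest possible distance from q to the node's box."""
    dmin2 = dmax2 = 0.0
    for x, l, h in zip(q, node.lo, node.hi):
        near = max(l - x, 0.0, x - h)     # 0 if x lies inside [l, h]
        far = max(x - l, h - x)           # distance to the farther face
        dmin2 += near * near
        dmax2 += far * far
    return math.sqrt(dmin2), math.sqrt(dmax2)


def range_count(node: Node, q: Sequence[float], r: float) -> int:
    dmin, dmax = min_max_dist(node, q)
    if dmin > r:
        return 0                          # prune: cell entirely outside
    if dmax <= r:
        return node.count                 # prune: cell entirely inside
    if node.points is not None:           # boundary leaf: test points directly
        return sum(1 for p in node.points if math.dist(p, q) <= r)
    return sum(range_count(c, q, r) for c in node.children)
```

The "inside" rule is where most of the saving comes from for large radii: whole subtrees are counted in a single step.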
![Page 11: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/11.jpg)
N-point correlation functions
The 2-point function has a long history in cosmology (Peebles 1980). It is the excess joint probability of a pair of points over that expected from a Poisson process. It also has a long history (as point processes) in Statistics.
Similarly, the three-point function is defined for triplets of points (and so on!)
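The equations shown on this slide did not survive in the transcript. For reference, the standard definitions (as in Peebles 1980) are given below; treat them as an assumption about what the slide displayed. Here $\bar{n}$ is the mean number density, $dV_i$ are volume elements, $\xi$ is the 2-point function and $\zeta$ the (connected) 3-point function.

$$
dP_{12} = \bar{n}^{2}\,\bigl[\,1 + \xi(r_{12})\,\bigr]\,dV_1\,dV_2
$$

$$
dP_{123} = \bar{n}^{3}\,\bigl[\,1 + \xi(r_{12}) + \xi(r_{23}) + \xi(r_{31}) + \zeta(r_{12}, r_{23}, r_{31})\,\bigr]\,dV_1\,dV_2\,dV_3
$$

For a Poisson process $\xi = \zeta = 0$, so both expressions reduce to products of independent probabilities.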
![Page 12: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/12.jpg)
Same 2pt, very different 3pt
Naively, this is an n^N process, but it is really just a set of range searches.
![Page 13: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/13.jpg)
Dual Tree Approach
Usually binned into annuli rmin < r < rmax. Thus, for each r, traverse both trees and prune pairs of nodes with either dmin > rmax or dmax < rmin, where dmin and dmax are the smallest and largest possible separations between the two nodes.
Also, if dmin > rmin & dmax < rmax, all pairs in these nodes are within the annulus. Therefore, we only need to calculate pairs cutting the boundaries.
Extra speed-ups are possible by doing multiple r’s together and by using controlled approximations
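A minimal sketch of the dual-tree recursion for a single annulus, counting cross pairs between two point sets (for example data-random pairs). The names `Node`, `node_dist_bounds` and `pair_count`, the half-open bin `rmin <= d < rmax`, and the rule for choosing which node to split are assumptions made for this illustration; the multi-bin and approximate variants mentioned above are not shown.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


class Node:
    """Minimal kd-tree node with a cached point count (illustrative only)."""

    def __init__(self, points: List[Point], leaf_size: int = 8):
        dim = len(points[0])
        self.lo = [min(p[d] for p in points) for d in range(dim)]
        self.hi = [max(p[d] for p in points) for d in range(dim)]
        self.count = len(points)
        self.points: Optional[List[Point]] = None
        self.children: List["Node"] = []
        widths = [h - l for l, h in zip(self.lo, self.hi)]
        axis = widths.index(max(widths))
        if self.count <= leaf_size or widths[axis] == 0.0:
            self.points = points
        else:
            ordered = sorted(points, key=lambda p: p[axis])
            mid = self.count // 2
            self.children = [Node(ordered[:mid], leaf_size),
                             Node(ordered[mid:], leaf_size)]


def node_dist_bounds(a: Node, b: Node) -> Tuple[float, float]:
    """Smallest and largest possible separation between the two boxes."""
    dmin2 = dmax2 = 0.0
    for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi):
        gap = max(al - bh, bl - ah, 0.0)    # 0 if the intervals overlap
        span = max(ah - bl, bh - al)        # farthest pair of coordinates
        dmin2 += gap * gap
        dmax2 += span * span
    return math.sqrt(dmin2), math.sqrt(dmax2)


def pair_count(a: Node, b: Node, rmin: float, rmax: float) -> int:
    """Number of pairs (p in a, q in b) with rmin <= |p - q| < rmax."""
    dmin, dmax = node_dist_bounds(a, b)
    if dmin >= rmax or dmax < rmin:
        return 0                            # no pair can fall in the annulus
    if dmin >= rmin and dmax < rmax:
        return a.count * b.count            # every pair falls in the annulus
    if a.points is not None and b.points is not None:
        return sum(1 for p in a.points for q in b.points
                   if rmin <= math.dist(p, q) < rmax)
    # Otherwise open one node: the larger one, or whichever is not a leaf.
    if b.points is not None or (a.points is None and a.count >= b.count):
        return sum(pair_count(c, b, rmin, rmax) for c in a.children)
    return sum(pair_count(a, c, rmin, rmax) for c in b.children)
```

Only node pairs that cut the inner or outer boundary of the annulus reach the brute-force leaf loop, which matches the statement above that only boundary pairs need explicit calculation.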
![Page 14: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/14.jpg)
Time depends on density of points and binsize & scale
[Plot labels: N*N, NlogN, N*N*N]
![Page 15: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/15.jpg)
Fast Mixture Models
Describe the data in N-dimensions as a mixture of, say, Gaussians (kernel shape less important than bandwidth!)
The parameters of the model are then N Gaussians, each with a mean and covariance
Iterate, testing using BIC and AIC at each iteration. Fast because of kd-trees (20 mins for 100,000 points on a PC!)
Employ heuristic splitting algorithm as well
Details in Connolly et al. 2000 (astro-ph/0008187)
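To make the "fit, then test with BIC" loop concrete, here is a deliberately small sketch: plain EM for a one-dimensional Gaussian mixture, scored with BIC = p ln n - 2 ln L. It is not the kd-tree-accelerated algorithm of Connolly et al. and does not include the heuristic splitting step; the function names, the quantile initialisation, the variance floor and the synthetic two-component data are all assumptions made for the example.

```python
import math
import random
from typing import List, Tuple


def _e_step(xs, weights, means, variances):
    """Responsibilities of each component for each point, plus log-likelihood."""
    resp, loglik = [], 0.0
    for x in xs:
        dens = [w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)]
        total = sum(dens) + 1e-300          # guard against log(0) on underflow
        loglik += math.log(total)
        resp.append([d / total for d in dens])
    return resp, loglik


def fit_gmm_1d(xs: List[float], k: int, iters: int = 60
               ) -> Tuple[List[float], List[float], List[float], float]:
    """Plain EM for k Gaussians in 1-D; returns (weights, means, variances, loglik)."""
    n = len(xs)
    ordered = sorted(xs)
    means = [ordered[int((j + 0.5) / k * n)] for j in range(k)]  # quantile init
    mean_all = sum(xs) / n
    var_all = sum((x - mean_all) ** 2 for x in xs) / n
    variances = [var_all] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        resp, _ = _e_step(xs, weights, means, variances)
        for j in range(k):                  # M-step: weighted moments
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)   # floor avoids collapsed components
    _, loglik = _e_step(xs, weights, means, variances)
    return weights, means, variances, loglik


def bic(loglik: float, k: int, n: int) -> float:
    n_params = 3 * k - 1                    # k means, k variances, k-1 free weights
    return n_params * math.log(n) - 2.0 * loglik


# Usage on synthetic data drawn from two well-separated Gaussians.
rng = random.Random(42)
xs = ([rng.gauss(0.0, 1.0) for _ in range(300)]
      + [rng.gauss(10.0, 1.0) for _ in range(300)])
scores = {k: bic(fit_gmm_1d(xs, k)[3], k, len(xs)) for k in (1, 2, 3)}
best_k = min(scores, key=scores.get)        # lowest BIC is preferred
```

Each EM iteration here touches every point for every component; the kd-tree speed-up quoted on the slide comes from replacing that per-point work with cached node statistics.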
![Page 16: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/16.jpg)
EM-Based Gaussian Mixture Clustering: 1
![Page 17: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/17.jpg)
EM-Based Gaussian Mixture Clustering: 2
![Page 18: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/18.jpg)
EM-Based Gaussian Mixture Clustering: 4
![Page 19: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/19.jpg)
EM-Based Gaussian Mixture Clustering: 20
![Page 20: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/20.jpg)
![Page 21: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/21.jpg)
Applications
Used in SDSS quasar selection (used to map the multi-color stellar locus)
Gordon Richards @ PSU
Anomaly detector (look for low probability points in N-dimensions)
Optimal smoothing of large-scale structure
![Page 22: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/22.jpg)
SDSS QSO target selection in 4D color-space
Cluster 9999 spectroscopically confirmed stars
Cluster 8833 spectroscopically confirmed QSOs (33 Gaussians)
99% for stars, 96% for QSOs
![Page 23: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/23.jpg)
![Page 24: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/24.jpg)
Bayes Net Anomaly Detector
Instead of using a single joint probability function (fitted to data), factorize it into a smaller set of conditional probabilities
Directional and acyclical
If we know the graph and the conditional probabilities, we have a valid probability function for the whole model
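In standard notation, the factorization described above reads as follows, where $\mathrm{pa}(x_i)$ denotes the parents of variable $x_i$ in the directed acyclic graph (the symbols are chosen here for illustration and do not come from the slide):

$$
p(x_1, \dots, x_n) \;=\; \prod_{i=1}^{n} p\bigl(x_i \mid \mathrm{pa}(x_i)\bigr)
$$

Because the graph is acyclic, the product of these conditional terms is automatically a normalised joint distribution.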
![Page 25: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/25.jpg)
Use 1.5 million SDSS sources to learn model (25 variables each)
Then evaluate the likelihood of each source being drawn from the model
Lowest 1000 are anomalous; look at ’em and follow ’em up at Keck
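The learn, score and rank loop above can be sketched for a toy discrete Bayes net. Everything here is an assumption made for illustration: the graph is supplied by hand rather than learned, all variables share the same number of discrete values (`n_values`), the conditional tables use add-one smoothing, and the function names are hypothetical. The `explain` helper returns the variable with the least probable conditional term, which is the "why" a Bayes net can report for an anomaly.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

Record = Tuple[int, ...]
Graph = Dict[int, Tuple[int, ...]]      # variable index -> indices of its parents


def learn_cpts(data: List[Record], parents: Graph, n_values: int
               ) -> Callable[[int, Tuple[int, ...], int], float]:
    """Count-based conditional probability tables with add-one smoothing."""
    joint: Counter = Counter()
    marginal: Counter = Counter()
    for rec in data:
        for var, pa in parents.items():
            ctx = tuple(rec[p] for p in pa)
            joint[(var, ctx, rec[var])] += 1
            marginal[(var, ctx)] += 1

    def prob(var: int, ctx: Tuple[int, ...], value: int) -> float:
        return (joint[(var, ctx, value)] + 1.0) / (marginal[(var, ctx)] + n_values)

    return prob


def log_likelihood(rec: Record, parents: Graph, prob) -> float:
    """Sum of log conditional probabilities, one term per variable."""
    return sum(math.log(prob(var, tuple(rec[p] for p in pa), rec[var]))
               for var, pa in parents.items())


def most_anomalous(data: List[Record], parents: Graph, n_values: int,
                   top: int) -> List[Record]:
    """The `top` records with the lowest likelihood under the learned model."""
    prob = learn_cpts(data, parents, n_values)
    return sorted(data, key=lambda rec: log_likelihood(rec, parents, prob))[:top]


def explain(rec: Record, parents: Graph, prob) -> int:
    """Variable whose conditional probability is lowest for this record."""
    return min(parents,
               key=lambda var: prob(var, tuple(rec[p] for p in parents[var]), rec[var]))


# Toy usage: a chain 0 -> 1 -> 2 of binary variables with one odd record.
chain: Graph = {0: (), 1: (0,), 2: (1,)}
records = [(0, 0, 0)] * 50 + [(1, 1, 1)] * 50 + [(0, 1, 1)]
outliers = most_anomalous(records, chain, n_values=2, top=1)
culprit = explain(outliers[0], chain, learn_cpts(records, chain, 2))
```

In the toy data the odd record breaks the dependence of variable 1 on variable 0, so that conditional term is the one flagged.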
![Page 26: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/26.jpg)
Unfortunately, a lot of errors
Advantage of Bayes Net is that it tells you why it was anomalous: the most unusual conditional probabilities
Therefore, iterate the loop and get a scientist to highlight obvious errors; then suppress those errors so they do not return again
Issue of productivity!
![Page 27: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/27.jpg)
Will Only Get Worse
LSST will do an SDSS every 5 nights looking for transient objects producing petabytes of data (2007)
VISTA will collect 300 Terabytes of data (2005)
Archival Science is upon us! The HST database has 20 GBytes per day downloaded (10 times more than goes in!)
![Page 28: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/28.jpg)
Will Only Get Worse II
Surveys spanning electromagnetic spectrum
Combining these surveys is hard: different sensitivities, resolutions and physics
Mixture of imaging, catalogs and spectra
Difference between continuum and point processes
Thousands of attributes per source
![Page 29: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/29.jpg)
What is VO?
The “Virtual Observatory” must:
Federate multi-wavelength data sources (interoperability)
Empower everyone (democratise)
Be fast, distributed and easy
Allow input and output
![Page 30: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/30.jpg)
Computer Science + Statistics!
Scientists will need help through autonomous scientific discovery of large, multi-dimensional, correlated datasets
Scientists will need fast databases
Scientists will need distributed computing and fast networks
Scientists will need new visualization tools
CS and Statistics looking for new challenges: also no data-rights & privacy issues
New breed of students needed with IT skills
Symbiotic Relationship
![Page 31: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/31.jpg)
VO Prototype
Ideally we would like all parts of the VO to be web services
[Diagram labels: DB, C#, dym, EM, http, .NET]
![Page 32: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/32.jpg)
Lessons We Learnt
Tough to marry research C code developed under Linux to MS (pointers to memory)
.NET has “unsafe” memory
.NET server is hard to set up!
Migrate to using VOTables to perform all I/O.
Have server running at CMU so we can control code
![Page 33: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/33.jpg)
Very High Dimensions
Using LLE and Isomap; looking for lower dimensional manifolds in higher dimensional spaces
500x2000 space from SDSS spectra
![Page 34: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/34.jpg)
Summary
Era of New Cosmology: Massive data sources and search for subtle features & high precision measurements
Need new methods that scale into these new regimes; “a virtual universe” (students will need different skills). Perfect synergy with Stats, CS, Physics
Good algorithms are as good as faster and more computers!
The “glue” to make a “virtual observatory” is hard and complex. Don’t under-estimate the job
![Page 35: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/35.jpg)
Are the Features Real? (FDR)!
This is an example of multiple hypothesis testing, e.g. is every point consistent with a smooth p(k)?
![Page 36: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/36.jpg)
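Let us first look at a simulated example: consider a 1000x1000 image with 40000 sources.

| Method | Sources detected | False detections | Sources missed | Correct non-detections |
|---|---|---|---|---|
| FDR | 30389 | 1505 | 9611 | 958495 |
| 2sigma | 31497 | 22728 | 8503 | 937272 |
| Bonferroni | 27137 | 0 | 12863 | 960000 |

(Column headings are inferred from the row sums: detected plus missed is 40000 in every row, and the other two columns sum to 960000.)
FDR makes 15 times fewer mistakes for the same power as traditional 2-sigma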
Why? Controls a scientifically meaningful quantity: FDR = No. of false discoveries/Total no. of discoveries
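One common way to control this quantity is the Benjamini-Hochberg step-up procedure, sketched below. Whether the slides used exactly this variant is an assumption; the function name and the p-values in the usage lines are made up for illustration. Given m p-values and a target rate alpha, the procedure finds the largest rank k with p_(k) <= alpha * k / m and rejects the k smallest p-values.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float], alpha: float) -> List[bool]:
    """Return a reject/keep flag per hypothesis, controlling the FDR at alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices by ascending p
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff = rank            # step-up: keep the LARGEST rank that passes
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected


# Hypothetical p-values, one per tested pixel or data point.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(pvals, alpha=0.05)
n_discoveries = sum(flags)
```

Unlike a fixed 2-sigma or Bonferroni cut, the threshold depends on the ranked p-values themselves, so it adjusts to how much signal the dataset contains.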
![Page 37: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/37.jpg)
And it is adaptive to the size of the dataset
![Page 38: Computational AstroStatistics Bob Nichol (Carnegie Mellon) Motivation & Goals Multi-Resolutional KD-trees (examples) Npt functions (application) Mixture.](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f395503460f94c57039/html5/thumbnails/38.jpg)
We used an FDR of 0.25, i.e. on average at most 25% of the circled points are in error
Therefore, we can say with statistical rigor that most of these points are rejected and are thus “features”
No single point is a 3sigma deviation
New statistics has enabled an astronomical discovery