Analyzing Large Datasets in Astrophysics Alexander Szalay The Johns Hopkins University Towards an International Virtual Observatory, Garching, 2002 (Living in an exponential world….)


Analyzing Large Datasets in Astrophysics

Alexander Szalay
The Johns Hopkins University

Towards an International Virtual Observatory, Garching, 2002

(Living in an exponential world...)


Outline

Collecting Data
  Exponential Growth
Making Discoveries
Publishing Data
VO: How will it work?
  Web Services
  Atomic vs Composite services
Distributed queries with SkyQuery
  Cross-Matching Algorithm
  SkyNode Web Services + Portal
Statistical Analysis of large data sets


The World is Exponential

Astrophysical data is growing exponentially
  Doubling every year (Moore’s Law+): both data sizes and number of data sets
Computational resources scale the same way
  Constant $$$ will keep up with the data
Main problem is the software component
  Currently components are not reused
  Software costs are an increasingly large fraction
  Aggregate costs are growing exponentially


Making Discoveries

When and where are discoveries made?
  Always at the edges and boundaries
  Going deeper, using more colors...
Metcalfe’s law
  Utility of computer networks grows as the number of possible connections: O(N^2)
VO: Federation of N archives
  Possibilities for new discoveries grow as O(N^2)
Current sky surveys have proven this
  Very early discoveries from SDSS, 2MASS, DPOSS
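The O(N^2) growth claimed for a federation of N archives can be made concrete by counting the unordered pairs of archives that could be cross-matched. This is only a counting sketch; the archive list reuses survey names that appear in these slides, and the function name is hypothetical.

```python
from itertools import combinations


def pairwise_federations(archives):
    """Return every unordered pair of archives that could be cross-matched."""
    return list(combinations(archives, 2))


# Survey names taken from the slides; the list itself is only an illustration.
archives = ["SDSS", "2MASS", "DPOSS", "FIRST"]
pairs = pairwise_federations(archives)

# N archives give N*(N-1)/2 pairs, which grows as O(N^2).
print(len(archives), "archives ->", len(pairs), "possible pairings")
```

Doubling the number of archives roughly quadruples the number of pairings, which is the sense in which possibilities for discovery grow faster than the archives themselves.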


Publishing Data

Roles        Traditional   Emerging
Authors      Scientists    Collaborations
Publishers   Journals      Project www site
Curators     Libraries     Bigger Archives
Consumers    Scientists    Scientists


Changing Roles

Exponential growth:
  Projects last at least 3-5 years
  Data sent upwards only at the end of the project
  Data will never be centralized
More responsibility on projects
  Becoming Publishers and Curators
  Larger fraction of budget spent on software
  Lot of development duplicated, wasted
More standards are needed
  Easier data interchange, fewer tools
More templates are needed
  Develop less software on your own


Emerging New Concepts

Standardizing distributed data
  Web Services, supported on all platforms
  Custom configure remote data dynamically
  XML: Extensible Markup Language
  SOAP: Simple Object Access Protocol
  WSDL: Web Services Description Language
Standardizing distributed computing
  Grid Services
  Custom configure remote computing dynamically
  Build your own remote computer, and discard
  Virtual Data: new data sets on demand


NVO: How Will It Work?

Define commonly used 'atomic' services
Build higher level toolboxes/portals on top
We do not build 'everything for everybody'
Use the 90-10 rule:
  Define the standards and interfaces
  Build the framework
  Build the 10% of services that are used by 90%
  Let the users build the rest from the components

[Figure: # of users (0 to 1) plotted against # of services (0 to 0.5)]


Atomic Services

Metadata information about resources
  Waveband
  Sky coverage
  Translation of names to universal dictionary (UCD)
Simple search patterns on the resources
  Cone Search
  Image mosaic
  Unit conversions
Simple filtering, counting, histogramming
On-the-fly recalibrations
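As an illustration of an atomic service, a cone search returns every catalog object within a given angular radius of a sky position. The sketch below is a minimal in-memory version: the function names, the row layout and the sample coordinates are all assumptions made for the example. A production service would use a spatial index rather than a linear scan.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def cone_search(catalog, ra, dec, radius_deg):
    """Linear scan: keep rows whose position lies within radius_deg of (ra, dec)."""
    return [
        row
        for row in catalog
        if angular_separation_deg(ra, dec, row["ra"], row["dec"]) <= radius_deg
    ]


# Made-up rows, for illustration only.
catalog = [
    {"id": 1, "ra": 181.30, "dec": -0.76},
    {"id": 2, "ra": 181.35, "dec": -0.70},
    {"id": 3, "ra": 190.00, "dec": 5.00},
]
hits = cone_search(catalog, ra=181.3, dec=-0.76, radius_deg=0.1)
print([row["id"] for row in hits])
```

Filtering, counting and histogramming services can then be layered on the rows such a call returns.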


Higher Level Services

Built on Atomic Services
Perform more complex tasks
Examples
  Automated resource discovery
  Cross-identifications
  Photometric redshifts
  Outlier detections
  Visualization facilities
Expectation:
  Build custom portals in a matter of days from existing building blocks (like today in IRAF or IDL)


SkyQuery

Distributed Query tool using a set of services
Feasibility study, built in 6 weeks from scratch
  Tanu Malik (JHU CS grad student)
  Tamas Budavari (JHU astro postdoc)
Implemented in C# and .NET
Won 2nd prize of Microsoft XML Contest
Allows queries like:

    SELECT o.objId, o.r, o.type, t.objId
    FROM SDSS:PhotoPrimary o,
         TWOMASS:PhotoPrimary t
    WHERE XMATCH(o,t)<3.5
      AND AREA(181.3,-0.76,6.5)
      AND o.type=3 AND (o.i - t.m_j)>2


Architecture

[Diagram components: Web Page, SkyQuery, Image cutout, SkyNode SDSS, SkyNode 2MASS, SkyNode FIRST]


Cross-id Steps

Parse query
Get counts
Sort by counts
Make plan
Cross-match
  Recursively, from small to large
Select necessary attributes only
Return output
Insert cutout image

    SELECT o.objId, o.r, o.type, t.objId
    FROM SDSS:PhotoPrimary o,
         TWOMASS:PhotoPrimary t
    WHERE XMATCH(o,t)<3.5
      AND AREA(181.3,-0.76,6.5)
      AND (o.i - t.m_j) > 2 AND o.type=3
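The "get counts, sort by counts, cross-match from small to large" steps can be sketched with in-memory catalogs. This is not SkyQuery's implementation. It assumes the match tolerance is a plain angular distance in arcseconds, uses a small-angle separation, and compares each candidate against the row from the smallest archive. The function names and sample rows are hypothetical.

```python
import math


def sep_arcsec(a, b):
    """Small-angle separation in arcseconds between two rows with ra/dec in degrees."""
    mean_dec = math.radians(0.5 * (a["dec"] + b["dec"]))
    dra = (a["ra"] - b["ra"]) * math.cos(mean_dec)
    ddec = a["dec"] - b["dec"]
    return math.hypot(dra, ddec) * 3600.0


def cross_match(archives, tolerance_arcsec):
    """Chain the match from the smallest archive to the largest.

    archives maps an archive name to a list of rows ({"id", "ra", "dec"}).
    Returns a list of dicts mapping archive name -> matched row.
    """
    # "Get counts" and "sort by counts": visit the smallest archive first.
    plan = sorted(archives, key=lambda name: len(archives[name]))
    # Seed the partial matches with every row of the smallest archive.
    partial = [{plan[0]: row} for row in archives[plan[0]]]
    for name in plan[1:]:
        extended = []
        for match in partial:
            anchor = match[plan[0]]  # simplification: compare to the first row only
            for row in archives[name]:
                if sep_arcsec(anchor, row) < tolerance_arcsec:
                    extended.append({**match, name: row})
        partial = extended  # only surviving tuples travel to the next archive
    return partial


# Made-up rows, for illustration only.
archives = {
    "SDSS": [
        {"id": "s1", "ra": 181.3000, "dec": -0.7600},
        {"id": "s2", "ra": 181.3100, "dec": -0.7500},
        {"id": "s3", "ra": 181.4000, "dec": -0.7000},
    ],
    "TWOMASS": [
        {"id": "t1", "ra": 181.3002, "dec": -0.7601},
        {"id": "t2", "ra": 181.5000, "dec": -0.5000},
    ],
}
matches = cross_match(archives, tolerance_arcsec=3.5)
print([(m["SDSS"]["id"], m["TWOMASS"]["id"]) for m in matches])
```

Starting from the smallest archive keeps the set of partial matches, and therefore the data passed from node to node, as small as possible at each step.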


Monte-Carlo Simulation

Comparing different algorithms for 3-way xid
  Transmit all the data
  Transmit after filtering
  Recursive cross-match
Surveys
  SDSS
  2MASS
  FIRST
Random variables:
  Sky Area (0..10 sqdeg)
  Selectivity of each subselect (0..1)
  Efficiency of join (0.5..2)
  Selectivity of common select (0..1)

[Figure: two histograms of log cost (horizontal axis -4 to 4, vertical axis 0 to 2000)]


SkyNode

Metadata functions (SOAP)
  Info, Tables, Columns, Schema, Functions, Keysearch
Query functions (SOAP)
  Dataset Query(String sqlCmd)
  Dataset Xmatch(Dataset input, String sqlCmd, float eps)
Database: MS SQL Server
  Upload dataset
  Very fast spatial search engine (HTM-based)
  Crossmatch takes <3 ms/object over 15M in SDSS
  User defined functions and stored procedures


Data Flow

[Diagram components: SkyQuery, SkyNode 1, SkyNode 2, SkyNode 3, query]

http://www.skyquery.net


Optimal Statistics

The examples for optimal statistics have poor scaling
  Correlation functions N^2, likelihood techniques N^3
As data sizes grow at Moore’s law, computers can only keep up with at most N log N algorithms
What goes?
  Notion of optimal is in the sense of statistical errors
  Assumes infinite computational resources
  Assumes that only source of error is statistical
  'Cosmic Variance': we can only observe the Universe from one location (finite sample size)
Solutions require combination of Statistics and CS
New algorithms: not worse than N log N
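The statement that computers keep up only with N log N algorithms can be checked with a small calculation. The sketch assumes, as stated earlier in the talk, that both data size and computing power double every year; the starting size of 10^6 objects, the ten-year horizon and the function name are arbitrary choices made for the example.

```python
import math


def relative_runtime(n0, years, cost):
    """Runtime after `years`, relative to today, if N and compute both double yearly."""
    growth = 2.0 ** years
    return cost(n0 * growth) / (cost(n0) * growth)


# Operation counts for the three scalings named on the slide.
costs = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

for name, cost in costs.items():
    factor = relative_runtime(1e6, 10, cost)
    print(f"{name:8s} runtime grows by a factor of {factor:,.1f} in 10 years")
```

Under these assumptions the N log N runtime grows by well under a factor of two over the decade, while the N^2 and N^3 runtimes grow by roughly a thousand and a million respectively.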


Clever Data Structures

Heavy use of tree structures:
  Up-front cost, but only N log N
  Large speedup later
  Tree-codes for correlations (A. Moore et al 2001)
Fast, approximate heuristic algorithms
  No need to be more accurate than cosmic variance
  Fast CMB analysis by Szapudi et al (2001)
    • N log N instead of N^3 => 1 day instead of 10 million years
Take cost of computation into account
  Controlled level of accuracy
  Best result in a given time, given our computing resources
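The trade of an up-front N log N cost for fast later queries can be shown in one dimension, where a sorted array with binary search stands in for a tree. This is a deliberately simplified stand-in and not the tree codes cited above. It counts pairs of points closer than a distance r, first by brute force and then with the sorted index; both function names are hypothetical.

```python
import random
from bisect import bisect_right


def count_pairs_brute(xs, r):
    """O(N^2) reference: count unordered pairs with |xi - xj| <= r."""
    n = len(xs)
    return sum(
        1 for i in range(n) for j in range(i + 1, n) if abs(xs[i] - xs[j]) <= r
    )


def count_pairs_sorted(xs, r):
    """Sort once (N log N), then do one binary search per point."""
    s = sorted(xs)
    total = 0
    for i, x in enumerate(s):
        # Count partners to the right of position i that lie within distance r.
        total += bisect_right(s, x + r) - (i + 1)
    return total


# Integer positions keep the two methods exactly comparable.
random.seed(0)
points = [random.randint(0, 1000) for _ in range(200)]
print(count_pairs_brute(points, 5), count_pairs_sorted(points, 5))
```

Both functions return the same count; only the second avoids visiting every pair, which is the property the tree-based correlation codes exploit in higher dimensions.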


Angular Clustering with Photo-z

w(θ) by Peebles and Groth:
  The first example of publishing and analyzing large data
Samples based on rest-frame quantities
Strictly volume limited samples
Largest angular correlation study to date
Very clear detection of
  Luminosity and color dependence
Results consistent with 3D clustering

T. Budavari, A. Connolly, I. Csabai, I. Szapudi, A. Szalay, S. Dodelson, J. Frieman, R. Scranton, D. Johnston and the SDSS Collaboration


The Samples

343k 254k 185k 316k 280k 326k 185k 127k

-20 > Mr > -21: 1182k
-21 > Mr > -23: 931k
0.1 < z < 0.3, -20 > Mr: 2.2M
-21 > Mr > -22: 662k
-22 > Mr > -23: 269k
0.1 < z < 0.5, -21.4 > Mr: 3.1M
10 stripes: 10M
mr < 21: 15M
All: 50M

2800 square degrees in 10 stripes, data in custom DB


The Stripes

10 stripes over the SDSS area, covering about 2800 square degrees
About 20% lost due to bad seeing
Masks: seeing, bright stars, etc.
Images generated from query by web service


The Masks

Stripe 11 + masks
Masks are derived from the database
  Search and intersect extended objects with boundaries


The Analysis

eSpICE: I. Szapudi, S. Colombi and S. Prunet
Integrated with the database by T. Budavari
Extremely fast processing (N log N)
  1 stripe with about 1 million galaxies is processed in 3 mins
  Usual figure was 10 min for 10,000 galaxies => 70 days
Each stripe processed separately for each cut
2D angular correlation function computed
w(θ): average with rejection of pixels along the scan
  flat field vector causes mock correlations
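The 70-day figure is consistent with scaling the older timing quadratically in the number of galaxies. The slide does not state the exponent, so the quadratic scaling below is an assumption, and the function name is hypothetical.

```python
def scaled_minutes(ref_minutes, ref_n, n, exponent=2):
    """Extrapolate a runtime from ref_n objects to n objects for an N^exponent method."""
    return ref_minutes * (n / ref_n) ** exponent


# 10 minutes for 10,000 galaxies, extrapolated to 1 million galaxies.
minutes = scaled_minutes(10, 10_000, 1_000_000)
days = minutes / (60 * 24)
speedup = minutes / 3  # against the 3 minutes quoted for the N log N code

print(f"{minutes:,.0f} minutes, about {days:.1f} days, speedup about {speedup:,.0f}x")
```

With these inputs the extrapolation gives 100,000 minutes, just under 70 days, against 3 minutes for the N log N processing.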


Angular Correlations I.

Luminosity dependence: 3 cuts
  -20 > M > -21
  -21 > M > -22
  -22 > M > -23


Angular Correlations II.

Color Dependence
  4 bins by rest-frame SED type


Summary

Exponential data growth – distributed data
Web Services – hierarchical architecture
Use the 90-10 rule (maybe 80-20)
There are clever ways to federate datasets!
Statistical analyses do not follow Moore’s law
Need to revisit optimal statistics
Give interesting new tools into the hands of smart young people...
They will quickly turn them into cutting edge science


Virtual Observatory

Astronomy with an attitude…