Page 1

Towards a Virtual Observatory

Alex Szalay

Department of Physics and Astronomy

The Johns Hopkins University

ADASS 2000

Page 2

The Virtual Observatory

• National/Global
  – distributed in scope across institutions, agencies and countries
  – available to all astronomers and the public

• Virtual
  – not tied to a single “brick-and-mortar” location
  – supports astronomical “observations” and discoveries via remote access to digital representations of the sky

• Observatory
  – general purpose
  – access to large areas of the sky at multiple wavelengths
  – supports a wide range of astronomical explorations
  – enables discovery via new computational tools

Page 3

Why Now?

The past decade has witnessed
• a thousand-fold increase in computer speed
• a dramatic decrease in the cost of computing & storage
• a dramatic increase in access to broadly distributed data
  – large archives at multiple sites and high speed networks
• significant increases in detector size and performance

These form the basis for science of a qualitatively different nature.

Page 4

Nature of Astronomical Data

• Imaging
  – 2D map of the sky at multiple wavelengths

• Derived catalogs
  – subsequent processing of images
  – extracting object parameters (400+ per object)

• Spectroscopic follow-up
  – spectra: more detailed object properties
  – clues to physical state and formation history
  – lead to distances: 3D maps

• Numerical simulations

• All inter-related!

Page 5

Trends

Future dominated by detector improvements

[Figure: total area of 3m+ telescopes in the world in m², and total number of CCD pixels in megapixels, as a function of time. Growth over 25 years is a factor of 30 in glass, 3000 in pixels.]

• Moore’s Law growth in CCD capabilities
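The figure's numbers bear this out: a factor of 3000 growth in pixels over 25 years corresponds to a doubling time of

$$ t_{\times 2} = \frac{25\ \text{yr}}{\log_2 3000} \approx \frac{25}{11.6}\ \text{yr} \approx 2.2\ \text{yr}, $$

right at the canonical Moore's-law pace of roughly two years.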

• Gigapixel arrays on the horizon

• Improvements in computing and storage will track growth in data volume

• Investment in software is critical, and growing

Page 6

The Age of Mega-Surveys

• The next-generation mega-surveys and archives will change astronomy, due to
  – top-down design
  – large sky coverage
  – sound statistical plans
  – well-controlled systematics

• The technology to store and access the data is here: we are riding Moore's law

• Data mining will lead to stunning new discoveries
• Integrating these archives is a task for the whole community

=> Virtual Observatory

Page 7

Ongoing surveys

• Large number of new surveys
  – multi-TB in size, 100 million objects or more
  – individual archives planned, or under way

• Multi-wavelength view of the sky
  – coverage in more than 13 wavelength bands within 5 years

• Impressive early discoveries
  – finding exotic objects by unusual colors
    • L, T dwarfs, high-z quasars
  – finding objects by time variability
    • gravitational microlensing

Surveys: MACHO, 2MASS, DENIS, SDSS, GALEX, FIRST, DPOSS, GSC-II, COBE, MAP, NVSS, ROSAT, OGLE, ...

Page 8

The Discovery Process

Past: observations of small, carefully selected samples of objects in a narrow wavelength band

Future: high quality, homogeneous multi-wavelength data on millions of objects, allowing us to
• discover significant patterns from the analysis of statistically rich and unbiased image/catalog databases
• understand complex astrophysical systems via confrontation between data and large numerical simulations

The discovery process will rely heavily on advanced visualization and statistical analysis tools

Page 9

The Necessity of the VO

• Enormous scientific interest in the survey data
• The environment to exploit these huge sky surveys does not exist today!
  – 1 Terabyte at 10 Mbyte/s takes 1 day (checked below)
  – hundreds of intensive queries and thousands of casual queries per day
  – data will reside at multiple locations, in many different formats
  – existing analysis tools do not scale to Terabyte data sets

• Acute need in a few years; the solution will not just happen
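Checking the transfer-time sub-bullet's arithmetic:

$$ \frac{1\ \text{TB}}{10\ \text{MB/s}} = 10^{5}\ \text{s} \approx 28\ \text{hours} \approx 1.2\ \text{days}. $$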

Page 10

VO: The Challenges

• Size of the archived data: 40,000 square degrees is 2 trillion pixels (see the estimate below)
  – one band: 4 Terabytes
  – multi-wavelength: 10-100 Terabytes
  – time dimension: 10 Petabytes
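The pixel count follows from the survey area. Assuming roughly 0.5 arcsec pixels and 2 bytes per pixel (assumptions chosen to be consistent with the slide's totals):

$$ 40{,}000\ \text{deg}^2 \times \left( \frac{3600''/\text{deg}}{0.5''/\text{pix}} \right)^{2} \approx 2\times 10^{12}\ \text{pix}, \qquad 2\ \text{B/pix} \Rightarrow \approx 4\ \text{TB per band}. $$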

• Current techniques are inadequate; we need
  – new archival methods
  – new analysis tools
  – new standards

• Hardware/networking requirements
  – scalable solutions required

• Transition to the new astronomy

Page 11

VO: A New Initiative

• Priority in the Astronomy and Astrophysics Survey
• Enable new science not previously possible
• Maximize impact of large current and future efforts
• Create the necessary new standards
• Develop the software tools needed
• Ensure that the community has the network and hardware resources to carry out the science
• Keep up with evolving technology

Page 12

New Astronomy: Different!

• Data “Avalanche”
  – the flood of Terabytes of data is already happening, whether we like it or not
  – our present techniques of handling these data do not scale well with data volume

• Systematic data exploration
  – will have a central role
  – statistical analysis of the “typical” objects
  – automated search for the “rare” events

• Digital archives of the sky
  – will be the main access to data
  – hundreds to thousands of queries per day

Page 13

Examples: Data Pipelines

Page 14

Examples: Rare Events

Discovery of several new objects by SDSS & 2MASS

SDSS T-dwarf (June 1999)

Page 15

Examples: Reprocessing

Gravitational lensing

28,000 foreground galaxies over 2,045,000 background galaxies in test data (McKay et al. 1999)

Page 16

Examples: Galaxy Clustering

• Shape of fluctuation spectrum
  – cosmological parameters and initial conditions

• The new surveys (SDSS) are the first with log N ~ 30
• Starts with a query
• Compute correlation function
  – all pairwise distances: N², but N log N is possible (sketched below)

• Power spectrum
  – optimal: the Karhunen-Loève transform
  – signal-to-noise eigenmodes
  – N³ in the number of pixels

• Likelihood analysis in 30 dimensions: needs to be done many times over
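To make the N² → N log N remark concrete, here is a minimal sketch of a tree-based two-point correlation estimate. The Landy-Szalay estimator, the toy catalogs, and the radial bins are illustrative assumptions, not the SDSS pipeline:

```python
# Tree-based two-point correlation function: pair counts via a k-d
# tree instead of the naive N^2 loop over all pairwise distances.
import numpy as np
from scipy.spatial import cKDTree

def two_point_xi(data, randoms, r_bins):
    """Landy-Szalay estimate of xi(r) from 3-D positions."""
    dt, rt = cKDTree(data), cKDTree(randoms)
    nd, nr = len(data), len(randoms)
    # cumulative ordered-pair counts within each radius
    dd = dt.count_neighbors(dt, r_bins).astype(float)
    dr = dt.count_neighbors(rt, r_bins).astype(float)
    rr = rt.count_neighbors(rt, r_bins).astype(float)
    # differential counts per radial bin, normalized per pair
    dd = np.diff(dd) / (nd * (nd - 1))
    dr = np.diff(dr) / (nd * nr)
    rr = np.diff(rr) / (nr * (nr - 1))
    return (dd - 2 * dr + rr) / rr   # Landy-Szalay estimator

# toy usage: uniform random points, so xi(r) should scatter around 0
rng = np.random.default_rng(0)
xi = two_point_xi(rng.random((20000, 3)), rng.random((20000, 3)),
                  np.linspace(0.01, 0.2, 11))
```

The tree prunes whole regions of space at once, so the pair counts cost roughly N log N rather than N²; that difference is what makes clustering statistics feasible on catalogs of 10⁸ objects.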

Page 17

Relation to the HEP Problem

• Similarities
  – need to handle large amounts of data
  – data is located at multiple sites
  – data should be highly clustered
  – substantial amounts of custom reprocessing
  – need for a hierarchical organization of resources
  – scalable solutions required

• Differences of Astro from HEP
  – data migration is in the opposite direction
  – the role of small queries is more important
  – relations between separate data sets (same sky)
  – data size currently smaller; we can keep it all on disk

Page 18

Data Migration Path

[Diagram: data migration paths in HEP and Astro, through Tier 0, Tier 1, Tier 2, and Tier 3 down to the user portal; as noted on the previous slide, the two fields' data migrate in opposite directions.]

Page 19

Queries are I/O limited

• In our applications, only a few fixed access patterns
  – one cannot build indices for all possible queries
  – the worst case scenario is a linear scan of the whole table

• Increasingly large differences between
  – random access
    • controlled by seek time (5-10 ms), <1000 random I/Os per second
  – sequential I/O
    • dramatic improvements: 100 MB/sec per SCSI channel is easy
    • reached 215 MB/sec on a single 2-way Dell server

• Often much faster to scan than to seek (see the estimate below)
• Good layout => more sequential I/O
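A back-of-the-envelope version of the scan-versus-seek trade-off, using the slide's own rates (the 1 TB table size is an illustrative assumption):

```python
# When does a full sequential scan beat random (indexed) access?
iops   = 1000        # random I/Os per second (slide: "<1000")
seq_bw = 100e6       # sequential bandwidth in B/s (one SCSI channel)
table  = 1e12        # a 1 TB table

scan_time  = table / seq_bw       # ~10,000 s to scan everything
break_even = scan_time * iops     # random reads possible in that time
print(f"scan: {scan_time:.0f} s; random access wins only if the "
      f"query touches fewer than {break_even:.0e} pages")
```

The break-even point is about 10⁷ random I/Os, i.e. under a tenth of a 1 TB table's 8 KB pages, so even moderately selective queries are better served by a well-laid-out scan.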

Page 20

Fast Query Execution

• Query plan
  – given the layout of the database, how can one get the result in the fastest possible way
  – evaluate a cost function, using a heuristic algorithm

• Indexing
  – clustered and unclustered
    • database layout is crucial
  – one-dimensional indices:
    • B-tree, R-tree
  – higher dimensional indices:
    • quad-tree, oct-tree, KD-tree, R+-tree
  – typically log N access instead of N

Page 21

Distributed Archives

• Network speeds are much slower
  – minimize data transfer
  – run queries locally (sketched below)

• I/O will scale almost linearly with nodes
  – a 1 GB/sec aggregate I/O engine can be built for <$100K

• Non-trivial problems in
  – load balancing
  – query parallelization
  – queries across inhomogeneous data sources

• These problems are not specific to astronomy
  – commercial solutions are around the corner
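A minimal sketch of the "run queries locally" pattern: ship the predicate to every node, evaluate it against that node's partition, and move only the small result sets. The in-memory partitions stand in for remote archives and are invented purely for illustration:

```python
# Scatter-gather query over partitioned data: "move queries, not data".
from concurrent.futures import ThreadPoolExecutor

# each node holds its own partition of the catalog: (obj_id, magnitude)
PARTITIONS = {
    "node1": [(1, 17.2), (2, 21.5), (3, 19.9)],
    "node2": [(4, 22.1), (5, 16.8)],
}

def run_local(node, predicate):
    """Evaluate the query where the data lives; return only matches."""
    return [row for row in PARTITIONS[node] if predicate(row)]

def scatter_gather(predicate):
    # fan the query out to all nodes in parallel, then merge results
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        parts = pool.map(lambda n: run_local(n, predicate), PARTITIONS)
    return [row for part in parts for row in part]

# all objects brighter than 20th magnitude, gathered across both nodes
print(scatter_gather(lambda row: row[1] < 20.0))
```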

Page 22

SDSS: Distributed Layout

[Diagram: the SX Engine, an Objectivity Federation. A User Interface and an Analysis Engine connect to a Master node, which is linked to four Slave nodes, each running Objectivity over RAID storage.]

Page 23

Astronomy Issues

• What are the user profiles?
• How will the data be analyzed?
• How to organize searches?
• Geometric indexing
• Color searches and localization
• SDSS architecture
• Virtual data

Page 24

User Profiles

• Power users
  – sophisticated, with lots of resources
  – research is centered around the archival data
  – moderate number of very large queries, large output sizes

• Astronomers
  – frequent, but casual lookup of objects/regions
  – archives help their research, but are not central to it
  – large number of small queries, cross-identification requests

• Wide public
  – browsing a `Virtual Telescope'
  – can have large public appeal
  – needs special packaging
  – very large number of simple requests

Page 25

Geometric Approach

• Main problem
  – fast, indexed searches of Terabytes in N-dimensional space
  – searches are not axis-parallel
    • simple B-tree indexing does not work

• Geometric approach
  – use the geometric nature of the data
  – quantize data into containers of `friends'
    • objects of similar colors
    • close on the sky
    • clustered together on disk
  – containers represent a coarse-grained map of the data
    • multidimensional index-tree (e.g. KD-tree)

Page 26

Geometric Indexing

“Divide and Conquer” Partitioning

• Sky Position (3 attributes): Hierarchical Triangular Mesh (sketched below)
• Multiband Fluxes (N = 5+ attributes): split as a k-d tree, stored as an r-tree of bounding boxes
• Other (M = 100+ attributes): regular indexing techniques
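For the sky-position attributes, here is a minimal sketch of Hierarchical Triangular Mesh indexing: the eight faces of an octahedron are recursively subdivided, and a point's trixel name records the path down the tree. The face layout and child-numbering convention are illustrative assumptions; production HTM code (e.g. the SDSS implementation) differs in detail:

```python
# HTM-style trixel ID: walk down the triangle hierarchy containing p.
import numpy as np

def _inside(p, a, b, c):
    # p lies within the spherical triangle a,b,c (counterclockwise)
    return (np.dot(np.cross(a, b), p) >= 0 and
            np.dot(np.cross(b, c), p) >= 0 and
            np.dot(np.cross(c, a), p) >= 0)

def htm_id(p, depth=5):
    """Return a trixel name such as 'N31202' for a unit vector p."""
    p = np.asarray(p, float); p /= np.linalg.norm(p)
    v = [np.array(x, float) for x in
         [(0,0,1), (1,0,0), (0,1,0), (-1,0,0), (0,-1,0), (0,0,-1)]]
    faces = {'S0': (v[1],v[5],v[2]), 'S1': (v[2],v[5],v[3]),
             'S2': (v[3],v[5],v[4]), 'S3': (v[4],v[5],v[1]),
             'N0': (v[1],v[0],v[4]), 'N1': (v[4],v[0],v[3]),
             'N2': (v[3],v[0],v[2]), 'N3': (v[2],v[0],v[1])}
    name, tri = next((n, t) for n, t in faces.items()
                     if _inside(p, *t))
    for _ in range(depth):
        a, b, c = tri
        # normalized edge midpoints split the triangle into 4 children
        w0, w1, w2 = (m / np.linalg.norm(m)
                      for m in (b + c, a + c, a + b))
        for digit, child in enumerate([(a, w2, w1), (b, w0, w2),
                                       (c, w1, w0), (w0, w1, w2)]):
            if _inside(p, *child):
                name += str(digit)
                tri = child
                break
    return name
```

Nearby points share long name prefixes, so sorting or range-partitioning by trixel ID clusters neighbours together on disk: exactly the `containers of friends' idea of the previous slide.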

Page 27

Computing Virtual Data

• Analyze large output volumes next to the database
  – send results only (`Virtual Data'): the system `knows' how to compute the result (Analysis Engine)

• Analysis: different CPU to I/O ratio than the database
  – multilayered approach

• Highly scalable architecture required
  – distributed configuration
  – scalable to data grids

• Multiply redundant network paths between data-nodes and compute-nodes
  – `Data-wolf' cluster => Virtual Data Grid

Page 28

A Data Grid Node

Hardware requirements
• Large distributed database engines
  – with a few GByte/s aggregate I/O speed
• High speed (>10 Gbit/s) backbones
  – cross-connecting the major archives
• Scalable computing environment
  – with hundreds of CPUs for analysis

[Diagram: a data grid node. Database layer: Objectivity servers over RAID with 2 GBytes/sec aggregate I/O; compute layer: ~200 CPUs in compute nodes; interconnect layer: 1 Gbit/sec per node; links to other nodes at 10 Gbit/s.]

Page 29

Clustering of Galaxies

Generic features of galaxy clustering:
• self-organized clustering driven by long range forces
• these lead to clustering on all scales
• clustering hierarchy: the distribution of galaxy counts is approximately lognormal
• scenarios: `top-down' vs `bottom-up'

Page 30

Clustering of Computers

• Problem sizes have a lognormal distribution
  – multiplicative process

• Optimal queuing strategy (toy simulation below)
  – run the smallest job in the queue
  – median scale set by local resources: the largest jobs never finish

• Always need more computing
  – `infall' to larger clusters nearby
  – asymptotically long-tailed distribution of compute power

• Short range forces: supercomputers
• Long range forces: onset of high speed networking
• Self-organized clustering of computing resources
  – the Computational Grid
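A toy illustration of the queuing claim (all parameters invented): with lognormally distributed job sizes, running the smallest job first slashes the mean waiting time, because the rare huge jobs no longer block everything behind them:

```python
# Shortest-job-first vs FIFO on lognormal job sizes.
import numpy as np

rng = np.random.default_rng(1)
jobs = rng.lognormal(mean=0.0, sigma=2.0, size=10_000)  # job runtimes

def mean_wait(order):
    # jobs run back-to-back: each one waits for all jobs before it
    return (np.cumsum(order) - order).mean()

print("FIFO mean wait:", mean_wait(jobs))
print("SJF  mean wait:", mean_wait(np.sort(jobs)))  # smallest first
```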

Page 31

VO: Conceptual Architecture

[Diagram: conceptual architecture connecting the User, via a Gateway, to Data Archives, Analysis tools, and Discovery tools.]

Page 32

The Flavor/Role of the NVO

• Highly distributed and decentralized
• Multiple phases, each built on top of one another:
  – establish standards, meta-data formats
  – integrate main catalogs
  – develop initial querying tools
  – develop collaboration requirements, establish a procedure to import new catalogs
  – develop a distributed analysis environment
  – develop advanced visualization tools
  – develop advanced querying tools

Page 33

NVO Development Functions

• Software development
  – query generation/optimization, software agents, user interfaces, discovery tools, visualization tools

• Standards development
  – meta-data, meta-services, streaming formats, object relationships, object attributes, ...

• Infrastructure development
  – archival storage systems, query engines, compute servers, high speed connections of main centers

• Train the next generation
  – train scientists equally at home in astronomy and modern computer science, statistics, visualization

Page 34

The Mission of the VO

The Virtual Observatory should
• provide seamless integration of the digitally represented multi-wavelength sky
• enable efficient simultaneous access to multi-Terabyte to Petabyte databases
• develop and maintain tools to find patterns and discoveries contained within the large databases
• develop and maintain tools to confront data with sophisticated numerical simulations

Page 35

Sociological Impact

• The VO is an entirely new way of doing astronomy
• Will have a serious sociological impact
• One can expect a love-hate response from the community
• Needs to be driven by science goals
• Technology will constantly evolve
• Educational impact is two-fold
  – need new skills to use it efficiently
  – provides enormous new opportunities

• Astronomy is rather unique in its wide appeal
  – the VO can reach out to an even wider audience

Page 36

Conclusions

• Databases have become an essential part of astronomy: most data access will soon be via digital archives

• Data at separate locations, distributed worldwide, evolving in time: move queries, not data (if you can)!

• Computations in both processing and analysis will be substantial: need to create a `Virtual Data Grid’

• Problems similar to HEP, with many commonalities, but the data flow is more complex

• Interoperability of archives is essential: the Virtual Observatory is inevitable

www.voforum.org